Way back, like 8 or 9 years ago, I analyzed Seattle housing data and got a few job offers out of it.
What I did was scrape housing data from Craigslist and fit a regression model to figure out which neighborhoods were most expensive. Then everyone on Hacker News and Reddit used it as a way to complain about housing.
But part of the reason I couldn’t practically use the model to find good deals was that some very important features of an apartment were difficult to extract.
For example, my goal was to build an apartment alerter that would email / text me whenever a new listing that met my base criteria was posted on Craigslist* and was “undervalued” according to my regression model. But what would happen is that I would only get emails for rather shitty-looking old apartments, because I couldn’t featurize the nuances that make a good deal on an apartment an actual good deal!
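To make the “undervalued” idea concrete, here is a minimal sketch, not the original model. It assumes a one-feature regression (price against square footage) and a made-up 10% discount threshold; the function names and toy numbers are hypothetical, and a real version would include neighborhood and more features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def is_undervalued(price, sqft, slope, intercept, discount=0.10):
    """True if the asking price is at least `discount` below the model's prediction."""
    predicted = slope * sqft + intercept
    return price <= predicted * (1.0 - discount)


# Toy training data (made up): square footage and monthly rent.
sqft = [500, 700, 900, 1100]
rent = [2000, 2600, 3200, 3800]
slope, intercept = fit_line(sqft, rent)

# A new 800 sq ft listing asking $2,500 would trigger an alert under this sketch.
alert = is_undervalued(2500, 800, slope, intercept)
```

The catch described above still applies: a listing can sit far below the predicted price simply because the model has no feature for how run-down the place looks.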
So how have things changed now with AI?
Well, analyzing images is still hard. I didn’t put any effort into it, even if it’s doable now in 2023?* But what is completely possible now is abstracting away some of the advanced natural language processing.
For example, before generative AI, if I wanted to see if an apartment listing had natural lighting - I would likely do a lot of research on advanced NLP, trying to find contextual or semantic meaning within the text, and at the end of a long day I would likely just be manually combing through a long list of keywords related to “natural lighting” in hopes it would catch everything.
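The keyword approach looks roughly like the sketch below. The keyword list and function name are hypothetical, made up for illustration; the point is that the list never feels complete.

```python
import re

# Hypothetical keyword list. Every phrasing you forget is a listing you miss.
NATURAL_LIGHT_KEYWORDS = [
    "natural light",
    "natural lighting",
    "sunny",
    "bright",
    "south facing",
    "south-facing",
    "skylight",
]


def natural_light_keywords(listing_text):
    """Return the keywords found in the listing (an empty list means no match)."""
    text = listing_text.lower()
    found = []
    for keyword in NATURAL_LIGHT_KEYWORDS:
        # Word boundaries so "bright" does not match inside "Brighton".
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            found.append(keyword)
    return found


hits = natural_light_keywords("Bright, south-facing unit with tons of natural light!")
misses = natural_light_keywords("Sun pours in through the windows all afternoon")
```

The second listing clearly describes natural light, but none of the keywords appear in it, so it slips through.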
However, now with ChatGPT - I can just give it a prompt to classify the entire text for me into pre-defined features that I care about.
And here were the results:
It’s incredible how much context and power ChatGPT has when you give it parameters for what you want. It is effectively giving me the nuance that isn’t exactly built into Craigslist yet (or never will be). While a lot of this exists in the filters, I still can’t just search “natural lighting” on Craigslist and get all the apartments that meet this criterion. But if I scrape Craigslist and apply a long list of feature parameters to each listing, I can get the results I want.
So naturally the next step was to get the feature set into a JSON format. I set up my prompt to be more specific this time.
You are given raw text from an apartment listing online. Please extract features from the apartment listings text and return a JSON. In the feature list below to the right of the arrow (->) is the description of the field and example output values.
qualities = ['price', 'available_date', 'number_of_bedrooms', 'number_of_bathrooms', 'apartment_address', 'house_type', 'natural_lighting', 'natural_lighting_evidence', 'modern', 'modern_evidence', 'laundry', 'laundry_evidence', 'hardwood_floors', 'square_footage', 'large_space', 'large_space_evidence', 'parking', 'parking_cost', 'parking_evidence', 'utilities_included', 'sublease_ability', 'initial_lease_term', 'patio']
Additionally I asked ChatGPT to provide “evidence” of the features that I wanted. For example, for something like natural lighting I specify the prompt below:
Natural Lighting -> (Yes or No) boolean. A unit has natural lighting if it's mentioned in the listing. Additional features would be if the unit is described as bright or has south facing windows which means that it gets more sunlight.
Natural Lighting Evidence -> String field. Add in any language, words, or phrases in the ad that help indicate if the unit has natural lighting or not. Example phrases like "Front of a window" or "dual paned windows" or "large living room" are not evidence of natural lighting.
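Here is a minimal sketch of how a prompt like this could be assembled and how the reply could be parsed. The dictionary layout, the shortened descriptions, and the function names are assumptions for illustration, and the actual ChatGPT call is left out on purpose; `parse_features` just takes whatever text comes back.

```python
import json

PROMPT_HEADER = (
    "You are given raw text from an apartment listing online. "
    "Please extract features from the apartment listings text and return a JSON. "
    "In the feature list below to the right of the arrow (->) is the description "
    "of the field and example output values."
)

# Feature name -> description (shortened here; only two of the features shown).
FEATURES = {
    "natural_lighting": "(Yes or No) boolean. A unit has natural lighting "
                        "if it's mentioned in the listing.",
    "natural_lighting_evidence": "String field. Any words or phrases in the ad "
                                 "that indicate natural lighting.",
}


def build_prompt(listing_text, features=FEATURES):
    """Join the header, one 'name -> description' line per feature, and the listing."""
    lines = [PROMPT_HEADER, ""]
    for name, description in features.items():
        lines.append(f"{name} -> {description}")
    lines += ["", "Listing:", listing_text]
    return "\n".join(lines)


def parse_features(response_text, features=FEATURES):
    """Pull the JSON object out of the reply and keep only the expected keys.

    Keys missing from the reply become None; unexpected keys are dropped.
    """
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    data = json.loads(response_text[start:end + 1])
    return {name: data.get(name) for name in features}


prompt = build_prompt("Sunny 1BR in the Mission")
reply = 'Sure! {"natural_lighting": "Yes", "natural_lighting_evidence": "Sunny", "extra": 1}'
parsed = parse_features(reply)
```

Slicing from the first `{` to the last `}` is a cheap guard against the model wrapping the JSON in extra chatter; it is not a full validator.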
I made a script to scrape Craigslist and then classify each listing within the area of San Francisco that I wanted to live in.
But it turns out - if you run a prompt against every single listing... it’s very slow. Not sure if this is solvable. I haven’t seen anything on the internet yet about improving the speed here, especially if it’s not parallelized. Moreover, OpenAI’s servers would almost always crash if there were more than 100 listings to analyze.
Welp - there goes 53 cents. I guess I should have stored the outputs more often.
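For anyone picking this up, one possible way to soften both problems is sketched below: run the calls in a small thread pool, retry failures with backoff, and append every result to disk the moment it finishes. This is a sketch under assumptions, not what my script did. `classify` stands in for whatever function makes the ChatGPT call, the file format (one JSON object per line) is my choice here, and there is no promise it gets around the failures described above.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def with_retries(fn, arg, attempts=3, base_delay=1.0):
    """Call fn(arg), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn(arg)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


def load_done(path):
    """Read previously saved results (one JSON object per line), keyed by listing id."""
    done = {}
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["id"]] = record["features"]
    except FileNotFoundError:
        pass
    return done


def classify_all(listings, classify, path, workers=4, base_delay=1.0):
    """listings: dict of id -> text. classify: text -> dict of features (the slow call).

    Skips ids already saved in `path` and appends each new result as it completes.
    """
    done = load_done(path)
    todo = {lid: text for lid, text in listings.items() if lid not in done}
    with open(path, "a", encoding="utf-8") as out:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {
                pool.submit(with_retries, classify, text, 3, base_delay): lid
                for lid, text in todo.items()
            }
            for future in as_completed(futures):
                lid = futures[future]
                try:
                    features = future.result()
                except Exception:
                    continue  # Gave up on this listing; a rerun will try it again.
                out.write(json.dumps({"id": lid, "features": features}) + "\n")
                out.flush()  # Save right away so a crash does not lose paid-for results.
                done[lid] = features
    return done
```

Rerunning `classify_all` with the same `path` only pays for the listings that are not in the file yet, which is the part that would have saved the 53 cents.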
I ended up quitting the project after this. I think it’s still interesting but the effort involved in building it wasn’t exactly less than the effort of just checking Craigslist or Zillow for 15 minutes a day for one or two weeks.
But if anyone else wants to continue the project and build something for their portfolio - I’d be happy to give you my source code + prompts.
Concluding Thoughts
Lastly - while I think this is more of a fun project than a monetizable one - I do think there is value in building projects that use ChatGPT not in the traditional way, for AI responses, but for feature engineering.
Most AI projects that I hear about are built around paying for AI-generated responses. This framework instead takes the idea that maybe AI can be used to enrich existing datasets, classify new ones, or create new features that can then be turned into proprietary models that can be sold.
Would love to hear any opinions here.
FYI - if this post gets any traction - I’ll do part 2 on “How to find the most optimal neighborhood in SF”.
Remember the housing crisis circa 2014 / 2015, when you had to show up to an apartment with an entire printed-out credit report, a pre-signed blank check for the security deposit, a letter of recommendation from your Ivy League professor, and wholesale dedication towards paying more than any other schmuck there?
What’s the state of computer vision now in 2023? I’m guessing that if I had 1000+ apartments and I manually labeled them shitty vs not shitty, maybe it could do a good job detecting this in the future? But who has time for that.