As Craigslist is among the largest portals of its kind, and a ‘free-market’ platform spanning many categories, it is an attractive target for web scraping. Having struggled to buy a car on Craigslist myself, I was curious whether I could use my data science skills to identify the best car listings.
Elements (see red box) Scraped from Craigslist Pages
This also coincides with another topic that has intrigued me for a while now, namely ‘The Market for Lemons’: when selling or buying cars, there is often an “information asymmetry” between buyers and sellers. When a car is offered below its market price, it is often assumed that the car must be a lemon (dud). However, the seller could simply be pricing it low to sell it quickly. Similarly, a lemon-seller could price his car above the market price for any number of reasons. Just as prices end up being a signaling mechanism, brands are yet another type of signal in this market. Thus, a Lamborghini Gallardo priced at $5,000 sends out a different signal from a Honda Civic priced at $5,000. The problem, however, is best summed up by this Economist article:
“… Assume that used cars come in two types: those that are in good repair, and duds (or “lemons” as Americans and most economists call them). Suppose further that used-car shoppers would be prepared to pay $20,000 for a good one and $10,000 for a lemon. As for the sellers, lemon-owners require $8,000 to part with their old banger, while the one-owner, careful-driver old lady with the well-maintained estate won’t part with hers for less than $17,000. If buyers had the information to tell wheat from chaff, they could strike fair trades with the sellers, the old lady getting a high price and the lemon-owner rather less … If buyers cannot spot the quality difference, though, as is often the case in the real world, there will be only one market for all used cars, and buyers will be ready to pay only the average price of a good car and a lemon, or $15,000 …”
If one could determine the ideal price with regression analysis, it would be possible to identify models that are priced above or below their market value. In this post, I discuss results from scraping Craigslist to assess the used car market. The motivation for this project lies in the potential applications below:
Predict a car’s value from its specifications
Identify trends in the market for select features
Derive insights into price differences by title type, fuel type, transmission type, etc.
Identify cars that are being offered above/below their market value
Models that lose value quickly
Vintage car market listings
As the Kelley Blue Book (KBB) reports market values for new and used automobiles of all types, one can think of this project as a mini version of KBB. With modules such as Beautiful Soup and Selenium, I was able to select web elements by XPath, class name, link text and tag name. I scraped ~5,000 pages within a few hours and extracted ~25,000 data points from those pages. (Interesting fact: the oldest car being offered was a 1928 Ford Model T!)
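A minimal sketch of that scraping step is shown below. The URL and the class names (`result-title`, `price`, `attrgroup`) are assumptions about Craigslist's markup at the time of writing, not the exact selectors used in this project, and they may have changed since.

```python
# Hedged sketch: collect listing links from a search page, then pull the
# price and attribute strings from each listing. Selectors are assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://seattle.craigslist.org/search/cta")  # cars+trucks search

# Selenium can select elements by XPath, class name, link text or tag name;
# here, class name is enough to grab the listing links.
links = [a.get_attribute("href")
         for a in driver.find_elements(By.CLASS_NAME, "result-title")]

rows = []
for url in links:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    price = soup.find("span", class_="price")  # e.g. "$5,000"
    attrs = [p.get_text(" ", strip=True)
             for p in soup.find_all("p", class_="attrgroup")]
    rows.append({"url": url,
                 "price": price.get_text() if price else None,
                 "attributes": attrs})
driver.quit()
```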
Schematic for Entries in Scraped Data-Set
However, the raw data looked like the figure above and had to be cleaned for uniformity. Entries without a price were removed, and features such as size, drive and condition were dropped for the initial analysis. The data was further cleaned for Make and Model names until the entries were highly consistent (>3,000 rows).
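A sketch of that cleaning pass with pandas follows; the column names and the spelling map are illustrative assumptions, not the project's exact schema.

```python
# Hedged sketch of the cleaning step: drop priceless rows, parse price
# strings, drop sparse features, and normalize make names.
import pandas as pd

df = pd.read_csv("craigslist_raw.csv")  # assumed dump of the scraped rows

# Remove entries without a price, then parse "$5,000" -> 5000
df = df.dropna(subset=["price"])
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(int)

# Drop sparsely populated features for the initial analysis
df = df.drop(columns=["size", "drive", "condition"])

# Standardize make spellings until entries are consistent, e.g. "chevy"
df["make"] = (df["make"].str.lower().str.strip()
                        .replace({"chevy": "chevrolet", "vw": "volkswagen"}))
```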
Prices Corresponding to Top 15 Brands on Craigslist Seattle: Note the Relatively Small Bands for Toyota, Honda, Volkswagen and Ford
Price versus Year for the Entire Data-Set
Correlation maps are a quick way to assess correlations among the features of a large data-set. With Price as my dependent variable, and Year, Odometer, Title-type, Fuel-type and Transmission-type as my features, I explored several regression models, including OLS, Ridge, Lasso and Elastic Net (Code). The data was expanded to include higher-order polynomial terms (up to degree 15) for the quantitative variables, and the standardized data (standardized with Scikit-learn’s StandardScaler) was then divided into training and testing sets. As the figure below shows, while the R² for the training set keeps rising with increasing polynomial complexity, the R² for the test data drops instead. This is a classic case of over-fitting. As the improvement in R² is not significant for third-degree terms and higher, it is safe to limit the model to first- and second-degree terms alone.
Heat-Map for Pearson's Coefficients Across Major Features
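One way to produce a heat-map like the one above is shown below. It assumes the cleaned DataFrame `df` from the earlier sketch, with the categorical features already dummy-encoded; the encoded column names (`title_clean`, `fuel_gas`, `trans_auto`) are illustrative assumptions.

```python
# Hedged sketch: Pearson correlations across the numeric/encoded features,
# rendered as a heat-map with seaborn.
import seaborn as sns
import matplotlib.pyplot as plt

features = ["price", "year", "odometer",
            "title_clean", "fuel_gas", "trans_auto"]  # assumed columns
corr = df[features].corr(method="pearson")

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson's coefficients across major features")
plt.show()
```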
Drop in R-Squared with Increasing Polynomial Complexity
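The over-fitting experiment itself can be sketched as below: expand the quantitative features with polynomial terms up to degree 15, standardize, fit OLS, and compare train/test R² at each degree. Variable names are illustrative, and the sketch splits before scaling to avoid leakage.

```python
# Hedged sketch: train R^2 keeps rising with polynomial degree while
# test R^2 drops -- the over-fitting pattern described above.
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df[["year", "odometer"]].to_numpy()  # quantitative variables
y = df["price"].to_numpy()

for degree in range(1, 16):
    X_poly = PolynomialFeatures(degree=degree,
                                include_bias=False).fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_poly, y, test_size=0.3, random_state=42)

    scaler = StandardScaler().fit(X_train)  # fit on training data only
    model = LinearRegression().fit(scaler.transform(X_train), y_train)

    print(degree,
          model.score(scaler.transform(X_train), y_train),  # rises
          model.score(scaler.transform(X_test), y_test))    # drops
```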
When OLS with first- and second-degree terms is applied to this data-set, an R² of 0.545 is observed. Note that there is an inherent limitation in applying regression analysis to a data-set that mixes different Make and Model combinations. On average, a car loses $6,175 in value if its title is ‘rebuilt’, and $6,629 if the title is ‘salvage’ or ‘missing’. For listings on Craigslist, non-gasoline cars were offered at prices $4,883 higher than gasoline cars. Also, the type of transmission, manual or automatic, had no significant impact on the asking price.
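Dollar effects like these can be read directly off the fitted coefficients when the categorical features enter as dummies. The sketch below uses statsmodels’ formula API as one possible way to do this; the column names (`title_status`, `fuel`, `transmission`) are assumptions about the cleaned data-set.

```python
# Hedged sketch: OLS with first- and second-degree quantitative terms plus
# categorical dummies; coefficients give the average dollar impact of each
# level relative to the baseline category.
import statsmodels.formula.api as smf

model = smf.ols(
    "price ~ year + I(year**2) + odometer + I(odometer**2)"
    " + C(title_status) + C(fuel) + C(transmission)",
    data=df).fit()

# e.g. the coefficient on C(title_status)[T.rebuilt] is the average
# price change for a rebuilt title relative to a clean one
print(model.summary())
```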
3-D Plot for Actual and Predicted Values for Suburban 1500
Similar analyses were carried out on specific Make-Model combinations; in the case of the Chevrolet Suburban 1500, an R² of 0.91 was obtained. Year and (Year)² had the lowest p-values in the final regression model. The diagnostic plots were satisfactory, and the residual plots showed an even distribution of residuals across different price bands. While a 3-D plot can be drawn for Price (Z) versus the quantitative variables (X-Y), individual 2-D visualizations tell a more compelling story of my regression results for the Suburban 1500.
3-D Plot Decomposed into Two 2-D Plots
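A sketch of that decomposition follows: refit the model on the single Make-Model subset, then plot actual versus predicted price against each quantitative variable in turn. The subset filter and formula are illustrative assumptions.

```python
# Hedged sketch: two 2-D views (price vs. year, price vs. odometer) of the
# actual and fitted values for one Make-Model combination.
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

sub = df[(df["make"] == "chevrolet") & (df["model"] == "suburban 1500")]
sub_model = smf.ols("price ~ year + I(year**2) + odometer + I(odometer**2)",
                    data=sub).fit()
y_hat = sub_model.predict(sub)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, ["year", "odometer"]):
    ax.scatter(sub[col], sub["price"], alpha=0.6, label="actual")
    ax.scatter(sub[col], y_hat, alpha=0.6, label="predicted")
    ax.set_xlabel(col)
    ax.set_ylabel("price ($)")
    ax.legend()
plt.tight_layout()
plt.show()
```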
While a discussion of every Make-Model combination is beyond the scope of this project, I conclude that prices can be predicted with a high level of accuracy for a given model. Given enough data, one could easily create a competitor to KBB. Nonetheless, to compete with them, one would need to explore numerous other features (including location!) from different sources. During this analysis, I came across 81 Makes (brands) and 1,084 Models. To build a database with extensive information on every Model, one would need to analyze at least ~5 million listings.