

The project I worked on during my time at Insight Data Science in New York City is called InspectCare: Using Online Activity to Identify Illegal Childcare Operations. It was done as a three-week consultation with a city agency.

Many illegal childcare centers operate in New York City, and when something goes wrong, they can make big headlines and attract a lot of attention. Who wouldn’t be outraged at kids dying in poorly run childcare centers? Oftentimes the parents don’t even know these centers are illegal; they operate so openly that they give the impression that they’re legitimate. Naturally, people expect the city government to either enforce regulations, or close these places down.

To that end, the city does what it can to monitor for illegal activity, and it will make further investigations if anything is found. This includes monitoring the internet, which is no small task! It takes a lot of time and effort for workers to keep tabs on the many websites out there.

Scraping Craigslist

A particularly difficult website to monitor is Craigslist. The posts are often poorly formatted, and there’s a high volume of low-quality content. Roughly 2,000 posts appear each week, and after a week they expire and are removed from the site. For my project, I wanted to see what could be done to improve the activity monitoring here.

There is a childcare subforum, but it still has a few problems. For one, there are a lot of duplicates. For another, there are a few distinct subcategories that aren’t labeled.

As you can see in the above screenshot, there are posts for childcare centers, nannies and babysitters, as well as various other things like tutors, transportation, and classic spam. This is a good place to apply a classification algorithm.

To start, we first scrape the data from Craigslist using the Scrapy Python package. The main information will be taken from the body of the post. It’s only possible to get one week at a time this way, since older posts expire.

Classifying Posts

The data from Craigslist isn’t already labeled into the three categories: Childcare Center, Babysitter/Nannies, Other. There aren’t any unsupervised machine learning algorithms appropriate for this problem, so I had to go and do the dirty work of labeling many (619) of the posts myself in order to establish some ground truth data.

Once the corpus of Craigslist posts is collected and a portion is labeled, some NLP (natural language processing) algorithms are used. First, a tf-idf vectorizer is applied to the whole corpus. This weights each term by how often it appears in a post, while reducing the weight of terms that are so common across the corpus that they carry little information. Similarly, stopwords are removed. In this way the text features are automatically selected.
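To make the tf-idf weighting concrete, here is a minimal sketch in plain Python. The three posts, the tiny stopword list, and the function names are all made up for illustration, and it uses the textbook formula (term frequency times log of inverse document frequency). Library vectorizers typically apply smoothed variants, so this is not necessarily the exact weighting used in the project.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for scraped post bodies.
posts = [
    "licensed daycare center open enrollment for toddlers",
    "experienced nanny available for the weekends",
    "daycare center with the best rates for infants",
]
STOPWORDS = {"for", "the", "with"}  # tiny illustrative stopword list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of posts that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tfidf(posts)
# "licensed" appears in one post and "daycare" in two, so within the
# first post "licensed" receives the larger weight.
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
```

A term that appeared in every post would get a weight of zero under this formula, which is the "overly common terms are reduced" behavior described above.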

The features are then piped into a multinomial logistic regression model. This model works well on wide datasets with many features, which are common in NLP problems.
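As a rough illustration of what a multinomial logistic regression does with such features, here is a dependency-free sketch trained with plain stochastic gradient descent. The feature vectors, learning rate, and epoch count are invented for the example; the real model was fit on tf-idf features from the labeled posts, presumably with a library rather than hand-written code like this.

```python
import math

CATEGORIES = ["Childcare Center", "Babysitter/Nannies", "Other"]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, b, x):
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + bk for w, bk in zip(W, b)]
    return softmax(scores)


def predict(W, b, x):
    probs = predict_proba(W, b, x)
    return max(range(len(probs)), key=probs.__getitem__)


def train(X, y, n_classes, lr=0.5, epochs=200):
    """Fit weights by stochastic gradient descent on the cross-entropy loss."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = predict_proba(W, b, x)
            for k in range(n_classes):
                # Gradient of the loss with respect to the class-k score.
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    W[k][j] -= lr * err * x[j]
                b[k] -= lr * err
    return W, b


# Hypothetical 3-feature vectors (think: weights for "center", "nanny", "tutor").
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
     [0.0, 1.0, 0.0], [0.1, 0.8, 0.0],
     [0.0, 0.0, 1.0], [0.0, 0.2, 0.9]]
y = [0, 0, 1, 1, 2, 2]

W, b = train(X, y, n_classes=len(CATEGORIES))
print([CATEGORIES[predict(W, b, x)] for x in X])
```

The model keeps one weight vector per category, which is why it copes reasonably with many sparse text features.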

Matching the Database

The list of scraped Craigslist posts has now been narrowed down to childcare center posts without duplication, which is about 20% of the original amount; already quite an improvement on readability. The next step is to see if any of these posts can be matched against the databases provided by the city. They have records on licensed operations, and leads on suspected illegal operations.

The search is done over the names and addresses, which can appear differently in the database compared to the Craigslist post. For example, many organizations have an Inc. or LLC in the database, but don’t include that in their postings. There are also address variations, such as Ave versus Avenue. These differences make exact matching unreliable, so the search calls for a fuzzy full text search, which can be done in PostgreSQL. Even so, matches are hard to make; so far about 10% of the posts have been matched. Further improvements are in the works as more is learned about the data, and as more Craigslist data is scraped.
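The project did this matching inside PostgreSQL. As a stand-in that shows the same idea in a few lines, the sketch below normalizes names and addresses and scores them with Python’s standard-library `difflib`. The suffix list, abbreviation table, threshold, and sample names are all hypothetical, and a real abbreviation table would need more care (for example, "st" is not always "street").

```python
import re
from difflib import SequenceMatcher

SUFFIXES = {"inc", "llc", "corp"}  # illustrative legal suffixes to ignore
ABBREVIATIONS = {"ave": "avenue", "st": "street", "blvd": "boulevard"}


def normalize(text):
    """Lowercase, strip punctuation, drop suffixes, expand abbreviations."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    words = [ABBREVIATIONS.get(w, w) for w in words if w not in SUFFIXES]
    return " ".join(words)


def similarity(a, b):
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query, candidates, threshold=0.8):
    """Return the closest candidate, or None if nothing is similar enough."""
    if not candidates:
        return None
    score, candidate = max((similarity(query, c), c) for c in candidates)
    return candidate if score >= threshold else None


# Hypothetical database records and post text.
records = ["Sunny Days Daycare, Inc.", "Little Stars LLC"]
print(best_match("Sunny Days Daycare", records))
print(similarity("123 Fifth Ave", "123 Fifth Avenue"))
```

The threshold trades missed matches against false ones, so it would need tuning against records whose match status is already known.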

Neural Network Applied to Q&A Column

A little while back I came across an interesting post on the use of recurrent neural networks to generate text character by character after having been fed some sample text to train on. For example, the entire works of Shakespeare were used as training data, and it produced sample output such as:

Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

I thought I would give the neural network a try myself. The question became, what text should I try to use? Something with a distinctive pattern would produce the most recognizable output.

It occurred to me one day to use the Deadspin Funbag Q&A column. It follows the pattern of having someone write in a question, enclosed in a blockquote, followed by an answer. The columnist also has a distinctive style, frequently using certain phrases, and peppering the text with all caps sentence fragments, like “WHAT A COINCIDENCE”.

The neural network uses a single text file for training. So the first step is to scrape the website for every column, extract the relevant data from the HTML, and glue it all together into a single file.

I used the scrapy package in Python to create a web spider to crawl the Deadspin site and collect all of the URLs for the Funbag columns. Then for each URL, I extracted the data that lived in the paragraph and blockquote HTML tags, ignoring everything else on the page. I used Beautiful Soup to pull out the HTML since I was already comfortable with the package, although I think I could’ve stuck with scrapy to do it too.
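The extraction step can be sketched as follows. I used Beautiful Soup for the real thing; this version uses the standard library’s `html.parser` so that it runs without extra packages, and the class name and sample HTML are invented for the example. It also keeps only the text and drops the tags themselves, which is a simplification.

```python
from html.parser import HTMLParser


class ColumnTextExtractor(HTMLParser):
    """Collect text inside <p> and <blockquote> tags, ignoring the rest."""

    KEEP = {"p", "blockquote"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of kept tags currently open
        self.chunks = []    # finished blocks of text
        self._buffer = []   # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.chunks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def extract_text(html):
    parser = ColumnTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page fragment.
sample = """
<html><body>
<div class="nav">Home | Sports</div>
<blockquote><p>Is a hot dog a sandwich?</p></blockquote>
<p>It is <b>NOT</b>. NEXT QUESTION.</p>
<script>var tracking = 1;</script>
</body></html>
"""
print(extract_text(sample))
```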

With all of the HTML data sanitized into 431 separate files, I concatenated them all together in the shell, and the final 10 MB file was then ready for processing. The neural network is available as Torch and Lua code. There are also helpful installation instructions for OS X. Unfortunately I ran into a run-time error after installation, but luckily the fix mentioned by the commenter “tbornt” worked.

The model took about 14 hours to train on my MacBook using the CPU. It would probably take a lot less time on a desktop with a decent graphics card, since the Torch code supports CUDA with NVIDIA GPUs.

The output includes some HTML tags. Besides the blockquote and paragraph tags, there are also link, bold, and italics tags. These polluted the output a bit since they weren’t always closed promptly. Here’s an example of the output:


Aftime like a parky. Now you’re goy to the only of the gymnasts and right: permission to not he pretty high drinking that hours), you deserve a person drumble and forfea to kill again. Why they all own, and she’s trink to the president

I gas as you’re doctors. And then they will ever never start have a Prime Drinks died well and informing cogy dropping on a correct thing with until we have, look Pastan dilemmate only-neighboly asked the helping out how to give back trroal will use at it and all even the stance of Peyton War York dark of their own -“20s) begas up). I don’t get a roommates, which you’re will beding ten to a dinab.” Cai, and there will sugared the nor-own, and BY. Louisses sucked when Friendamishmousy at that. And became that two mentials think about me to you like, MOYA of a det-football three next week on TV and being cleaned the haguat. You’re still into a fall.

The model generally picked up the right structure, and even formed conventional names for the emailers most of the time. It also picked up on all caps phrases, and frequently appearing characters like Peyton Manning. The spelling and grammar could stand some improvement, however. Nonetheless, the model knew nothing about English before training, so it’s still impressive it can get this close.

A more powerful model could be built by tweaking some parameters, such as the number of hidden units and the number of layers in the RNN, although I think I would need a more powerful computer for this. It’s also possible to adjust the “temperature” of the output when sampling from the model, which controls how loose the text structure is. It might also help to sanitize the data a bit more and remove extraneous Unicode characters. Here’s another sample:


Kent of Mami San weekend:

Where is the shower Francing, you’re probably start ground of curse before you could problem? Yeah his own a young shit gift and masking away with the head for his corpse, “I’m. GAHHHHHHHHHHHHH! You choose out of the number of legits, get with Tragland and friends of me and bread, and I would have happened. You survive law capabot.

I was in for a clark of the momnivies jers and the games, I could engrith the Soup on it?

I got for a shitface station daughters of the days. I love to trip on the better state, the other money the Candy Boel Russia, Football tab for a woman. I haven’t watch you. It has to know the worst spaction champages were universe with grosses and you have to see it really you just slide of the APGIT, don’t have me out during that procles? I mether if I was not recent them and starting when I plug sweaty screen in the fact that have water how long (of dictase with a professional on Solator. They get upstairs like a bag. We also fell, “found in the Earth Favrel players back” him. A following, like advantage of people like them were masturbated going on one of the songs are all of the games when they got her to pick swaterbiod. We are factically, for next to forget and use the extra day. You gotta be spidets night, which is better driver can’t run the bunch ofFuck Texasson . Given I gotta choice and even more than pointed like a compound meeting things up life yourself as an offer outbeem as can fuck it people to dead on. No.
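The sampling temperature mentioned above can be illustrated with a short sketch. This is not the Torch code, just the usual idea behind the knob: the model’s scores for the next character are divided by the temperature before being turned into probabilities. The characters and scores below are invented.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_char(chars, logits, temperature=1.0, rng=random):
    """Draw the next character according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(chars, weights=probs, k=1)[0]


# Hypothetical scores for three candidate next characters.
chars = ["e", "t", "q"]
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
print(cold)  # concentrated on the top-scoring character
print(hot)   # flatter, so unlikely characters are drawn more often
print(sample_char(chars, logits, temperature=0.5, rng=random.Random(0)))
```

Low temperatures give safer, more repetitive text, while high temperatures give more variety along with more spelling mistakes.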


Mazda RX-8 Mileage Data Analysis

For a few years now I’ve been collecting data about gas fill ups for my RX-8. I knew before I got the car it was going to get poor mileage due to its rotary engine. Luckily I drive a relatively low number of miles per year, and the car offers some cheap thrills with the 9.4K rpm fuel cut-off. I’m also quite partial to the engine because of its uniqueness and quirkiness. It’ll probably never be put in a production car again (for good reasons). Nonetheless I was curious to see what exact mileage I was getting, and how it would change with different usage patterns.

This dataset is a nice place to start showing some of the tools used in data science. A link to the Python code in a Jupyter Notebook can be found on my github page. I’ll present a few of the results here sans the code.

The data was initially saved in a Google Docs spreadsheet: the date, miles driven since the last fill up, amount of gas pumped into the tank, the price per gallon, and the gas station. The tank was filled completely each time, which allows the average miles per gallon over the previous tank to be computed. Some noise is to be expected, since the pumps may not all be calibrated the same way and may not fill the tank to exactly the same level each time. At the time of writing, there are 114 observations.
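As a small sketch of how each spreadsheet row turns into a mileage figure, the snippet below models one row and derives mpg and cost from it. The field names and the two rows are made up for illustration and are not taken from the actual spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class FillUp:
    date: str
    miles: float             # miles driven since the previous fill up
    gallons: float           # gas pumped to refill the tank
    price_per_gallon: float
    station: str

    @property
    def mpg(self):
        # Valid because the tank is filled completely every time, so the
        # gallons pumped equal the gallons burned over those miles.
        return self.miles / self.gallons

    @property
    def cost(self):
        return self.gallons * self.price_per_gallon


# Hypothetical rows.
fill_ups = [
    FillUp("2016-06-01", 180.0, 12.5, 2.10, "Costco"),
    FillUp("2016-06-11", 171.0, 12.0, 2.75, "Other"),
]

per_tank = [f.mpg for f in fill_ups]
mean_of_tanks = sum(per_tank) / len(per_tank)
overall = sum(f.miles for f in fill_ups) / sum(f.gallons for f in fill_ups)
print(per_tank, mean_of_tanks, overall)
```

Averaging the per-tank values and dividing total miles by total gallons give slightly different numbers, so it is worth being explicit about which one a quoted mean refers to.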

Modern cars have trip computers that can record a lot more data and give an instantaneous mileage calculation. It would be interesting to have that much data and take different factors into account, such as air temperature, tire pressure, etc. The biggest source of variation in mileage for my driving pattern is likely the number of cold starts in a day, which I did not record. The car dumps a lot of excess fuel when it’s warming up to operating temperature. The data isn’t ideal, but it’s still interesting to see what can be learned from it.

Gas Prices

The easiest thing to analyze is the data for the gas prices. The first graph on the left is a time plot of the prices. This plot alone would be misleading. Prices around July 2016 suddenly get very volatile, and one might conclude that the market was itself volatile. However, the underlying reason is that I broke my discipline of only filling up at Costco, which is significantly cheaper than other stations.

As an aside, I’ve wondered whether Costco had poorer quality gas to go along with the cut rate pricing. Supposedly gas stations use gas that comes out of the same refineries/pipelines, with the difference being the detergents or other additives they use. There’s a licensing body called Top Tier that Costco belongs to, which claims they use respectable additives. I’ve also heard that what might really be a problem is the condition of the tanks that a particular gas station uses, e.g., they might have leaks, or contain a lot of water or sediment contaminating the gas. This Costco station seems pretty well maintained, and they churn through a lot of gas, so I don’t expect that to be an issue.

Getting back to the data, it looks like Costco was consistently about $0.64 per gallon cheaper than the other stations. Another thing to keep in mind is that other stations have a credit card surcharge, which I was getting hit with too. On average I’d fill up 12.7 gallons, which means Costco would save me about $8.22 per fill up. The Costco membership is $55 per year, and on average I’d fill up every 10 days, so that means I’d recover the cost of membership in 70 days. After that, if I filled up exclusively at Costco, I’d save about $238 per year. The station is on my regular route home, so I don’t waste gas getting to and from there. Almost makes it worth dealing with the lines and pushy people.
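The break-even arithmetic can be written out as below. This is one reading of the calculation that happens to reproduce the quoted figures: it takes the $8.22 saving per fill up as given, counts whole fill ups only, and treats the $238 as the saving over the rest of the first year once the membership is paid off. The variable names are mine.

```python
import math

saving_per_fill_up = 8.22   # dollars, as quoted above
membership_fee = 55.0       # dollars per year
days_between_fill_ups = 10

# Whole fill ups needed before the savings cover the membership fee.
fill_ups_to_break_even = math.ceil(membership_fee / saving_per_fill_up)
days_to_break_even = fill_ups_to_break_even * days_between_fill_ups

# Whole fill ups that fit in the remainder of the year after breaking even.
remaining_fill_ups = (365 - days_to_break_even) // days_between_fill_ups
saving_rest_of_year = remaining_fill_ups * saving_per_fill_up

print(fill_ups_to_break_even, days_to_break_even)          # 7 fill ups, 70 days
print(remaining_fill_ups, round(saving_rest_of_year, 2))   # 29 fill ups, 238.38
```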

Gas Mileage

The mean mileage is 14.2 mpg; pretty embarrassing. On the plus side I drive only about 7,000 miles a year. Still, sorry Earth.

The histogram appears to be bimodal (two humps), rather than a nice unimodal, normal-looking distribution. Here’s the mileage history.

The mileage appears to drop off around late 2014. I’ll get back to that shortly. We would expect the spikes in the graph to correspond to highway driving.

To determine whether a tank of gas corresponded to highway driving, I figured that the number of miles driven per day (mpd) would be a predictor. For example, if I drove 150 miles per day over a tank, I was probably hitting the highway. Unfortunately I never drove on the highway much with this car; knowing the mileage was bad, I used other cars for those trips. I used an arbitrary cutoff of 35 mpd to distinguish between city driving and mixed highway/city driving.
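The labeling rule above amounts to a couple of lines of code. In this sketch the function names are hypothetical, and two details are assumptions: a tank at exactly 35 mpd counts as mixed driving, and two fill ups on the same day count as one day to avoid dividing by zero.

```python
from datetime import date

CUTOFF_MPD = 35.0  # the arbitrary cutoff described above


def miles_per_day(miles, previous_fill, current_fill):
    """Average miles per day over one tank."""
    days = (current_fill - previous_fill).days
    return miles / max(days, 1)  # assumption: same-day fill ups count as 1 day


def driving_type(mpd, cutoff=CUTOFF_MPD):
    """Label a tank as city or mixed highway/city driving."""
    return "mixed" if mpd >= cutoff else "city"


# Hypothetical tanks.
commute = miles_per_day(180, date(2016, 6, 1), date(2016, 6, 11))
road_trip = miles_per_day(200, date(2016, 7, 1), date(2016, 7, 4))
print(commute, driving_type(commute))
print(road_trip, driving_type(road_trip))
```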

Generally we see that a higher mpd corresponds to a higher mpg, as we’d expect. Here’s a least squares linear regression. The data is overly clustered, so we don’t get a great model.
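For reference, a least squares line of mpg against mpd can be fit with a few sums. The function below is a plain-Python sketch of the standard formulas, and the data points are invented, so the printed numbers say nothing about my car.

```python
def least_squares(xs, ys):
    """Fit y = intercept + slope * x and return (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2: share of the variance in y explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical (mpd, mpg) pairs, mostly clustered at low mpd.
mpd = [12.0, 15.0, 17.0, 18.0, 20.0, 22.0, 40.0, 65.0]
mpg = [13.1, 14.6, 13.8, 14.9, 13.5, 14.4, 15.2, 16.3]

slope, intercept, r_squared = least_squares(mpd, mpg)
print(slope, intercept, r_squared)
```

When most points sit in a narrow band of mpd values, a handful of high-mpd tanks largely determines the slope, which is one way to see why clustered data makes for a weak model.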

New Tires

Let’s go back to that blue line in the mileage history graph. In December 2014 I put on new tires: Bridgestone Potenza S-04 Pole Position. These fall in the max summer performance category. The original tires were also summer tires, but apparently these new ones get worse mileage. I was a little shocked at first, but I figured out the likely cause is that they have a stickier compound and higher rolling resistance. Let’s revisit some of these graphs again while separating the new tires from the old.

The new histograms explain the bimodal distribution we saw earlier. It should really be thought of as two separate distributions. There are many more observations for the new tires compared to the old tires (87 to 27).

The mean difference between the old and new tires is 2.2 mpg. Is this statistically significant? This is a sensitive question, and one could cast doubt on the independence of the observations. Nonetheless, if we slap the data into a Welch’s t-test, we get a p-value of 7.83e-13. Pretty low! However, a p-value on its own says nothing about the size of the difference, so we can also compute the 95% confidence interval for the difference of the two means: (1.76, 2.67). I think the visualizations really tell the important story though. Here are linear regressions broken across the two categories of tires.

For the 16,000 miles I put on the new tires, I calculated that I would’ve saved about $422 if I had gotten the same mileage as the old tires. Certainly an extra cost to keep in mind when buying new tires! I’m not sure I actually got any more driving satisfaction out of the extra grip provided by these performance tires. To make things worse, the treadwear is abysmal, and I’ll be lucky if I get 20,000 miles out of them before replacement.
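Going back to the Welch’s t-test on the old and new tires, the statistic itself is simple to compute. The sketch below uses only the standard library and returns the t statistic and the Welch–Satterthwaite degrees of freedom. It stops short of the p-value, which needs the t distribution; in practice a library routine such as SciPy’s `ttest_ind` with `equal_var=False` would return that as well. The two samples are invented.

```python
import math
from statistics import mean, variance


def welch_t_test(a, b):
    """Return (t, degrees_of_freedom) for two samples with unequal variances."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean.
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical mpg samples.
old_tires = [16.0, 15.5, 16.5, 15.8, 16.2]
new_tires = [13.5, 14.0, 13.8, 14.2, 13.6, 13.9]

t, df = welch_t_test(old_tires, new_tires)
print(t, df)
```

Unlike the classic Student’s t-test, this version does not assume the two groups have the same variance, which suits two samples of very different sizes.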

From the data, I can see that the type of tires has probably affected my mileage more than anything else. There’s a minor increase with mixed city/highway driving over just city, but nothing spectacular. If I ever have a significant commute where I have to put on a lot of miles, I think I’ll just keep my RX-8 as the weekend car!