A little while back I came across an interesting post on using recurrent neural networks to generate text character by character after training on some sample text. For example, when the entire works of Shakespeare were used as training data, the network produced sample output such as:
PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain'd into being never fed, And who is but a chain and subjects of his death, I should not sleep. Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.
I thought I would give the neural network a try myself. The question became, what text should I try to use? Something with a distinctive pattern would produce the most recognizable output.
It occurred to me one day to use the Deadspin Funbag Q&A column. It follows the pattern of having someone write in a question, enclosed in a blockquote, followed by an answer. The columnist also has a distinctive style, frequently using certain phrases, and peppering the text with all caps sentence fragments, like “WHAT A COINCIDENCE”.
The neural network uses a single text file for training. So the first step is to scrape the website for every column, extract the relevant data from the HTML, and glue it all together into a single file.
I used the scrapy package in Python to create a web spider to crawl the Deadspin site and collect all of the URLs for the Funbag columns. Then for each URL, I extracted the data that lived in the paragraph and blockquote HTML tags, ignoring everything else on the page. I used Beautiful Soup to pull out the HTML since I was already comfortable with the package, although I think I could’ve stuck with scrapy to do it too.
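The extraction step can be sketched roughly like this. It is not my original script: to keep it dependency-free it uses the standard library's html.parser instead of Beautiful Soup, it keeps only the text (dropping inner tags like bold and links), and the names BlockExtractor and extract_blocks are made up for illustration. It also assumes every paragraph and blockquote tag is properly closed.

```python
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect the text inside <p> and <blockquote> elements, ignoring the rest."""

    KEEP = {"p", "blockquote"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # how many kept tags we are currently inside
        self.blocks = []  # list of [outermost_tag, [text pieces]]

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            if self.depth == 0:
                # A new top-level block; nested <p> inside <blockquote> joins it.
                self.blocks.append([tag, []])
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside the kept tags (navigation, scripts, ...) is skipped.
        if self.depth > 0:
            self.blocks[-1][1].append(data)


def extract_blocks(html):
    """Return (tag, text) pairs for each top-level paragraph or blockquote."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    result = []
    for tag, pieces in parser.blocks:
        text = " ".join("".join(pieces).split())  # collapse runs of whitespace
        if text:
            result.append((tag, text))
    return result
```

Each reader question comes back tagged as a blockquote and each answer paragraph as a p, so the question and answer structure of the column survives the extraction.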
With all of the HTML data sanitized into 431 separate files, I concatenated them all together in the shell, and the final 10 MB file was then ready for processing. The neural network is available as Lua code built on the Torch framework. There are also helpful installation instructions for OS X. Unfortunately I ran into a run-time error after installation, but luckily the fix mentioned by the commenter “tbornt” worked.
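I did the concatenation in the shell, but for anyone who would rather stay in Python, a minimal equivalent might look like the following. The function name, the "*.txt" pattern, and the choice to sort by file name are all assumptions for the sake of the example, not details of my actual setup.

```python
from pathlib import Path


def concatenate_files(input_dir, output_path, pattern="*.txt"):
    """Join every file matching `pattern` in `input_dir` into one training file.

    Files are processed in sorted name order. Returns the number of files joined.
    """
    paths = sorted(Path(input_dir).glob(pattern))
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            text = path.read_text(encoding="utf-8")
            out.write(text)
            # Make sure one column never runs straight into the next.
            if not text.endswith("\n"):
                out.write("\n")
    return len(paths)
```

Writing the output file somewhere outside the input directory avoids any chance of it being picked up as one of the inputs on a later run.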
The model took about 14 hours to train on my MacBook using the CPU. It would probably take a lot less time using a desktop with a decent graphics card. The Torch code supports CUDA with NVIDIA GPUs.
The output includes some HTML tags. Besides the blockquote and paragraph tags, there are also link, bold, and italic tags. These polluted the output a bit since they weren’t always closed promptly. Here’s an example of the output:
Aftime like a parky. Now you’re goy to the only of the gymnasts and right: permission to not he pretty high drinking that hours), you deserve a person drumble and forfea to kill again. Why they all own, and she’s trink to the president
I gas as you’re doctors. And then they will ever never start have a Prime Drinks died well and informing cogy dropping on a correct thing with until we have, look Pastan dilemmate only-neighboly asked the helping out how to give back trroal will use at it and all even the stance of Peyton War York dark of their own -“20s) begas up). I don’t get a roommates, which you’re will beding ten to a dinab.” Cai, and there will sugared the nor-own, and BY. Louisses sucked when Friendamishmousy at that. And became that two mentials think about me to you like, MOYA of a det-football three next week on TV and being cleaned the haguat. You’re still into a fall.
The model generally picked up the right structure, and even formed conventional names for the emailers most of the time. It also picked up on all-caps phrases, and frequently appearing characters like Peyton Manning. The spelling and grammar could stand some improvement, however. Nonetheless, the model knew nothing about English before training, so it’s still impressive that it can get this close.
A more powerful model could be built by tweaking some parameters, such as the number of hidden units and the number of layers in the RNN, although I think I would need a more powerful computer for this. It’s also possible to adjust the “temperature” of the output when sampling from the model, which controls how loose the text structure is: lower temperatures make the model stick to its most likely next characters, while higher temperatures give more varied text with more mistakes. It might also help to sanitize the data a bit more and remove extraneous Unicode characters. Here’s another sample:
Kent of Mami San weekend:
Where is the shower Francing, you’re probably start ground of curse before you could problem? Yeah his own a young shit gift and masking away with the head for his corpse, “I’m. GAHHHHHHHHHHHHH! You choose out of the number of legits, get with Tragland and friends of me and bread, and I would have happened. You survive law capabot.
I was in for a clark of the momnivies jers and the games, I could engrith the Soup on it?
I got for a shitface station daughters of the days. I love to trip on the better state, the other money the Candy Boel Russia, Football tab for a woman. I haven’t watch you. It has to know the worst spaction champages were universe with grosses and you have to see it really you just slide of the APGIT, don’t have me out during that procles? I mether if I was not recent them and starting when I plug sweaty screen in the fact that have water how long (of dictase with a professional on Solator. They get upstairs like a bag. We also fell, “found in the Earth Favrel players back” him. A following, like advantage of people like them were masturbated going on one of the songs are all of the games when they got her to pick swaterbiod. We are factically, for next to forget and use the extra day. You gotta be spidets night, which is better driver can’t run the bunch ofFuck Texasson . Given I gotta choice and even more than pointed like a compound meeting things up life yourself as an offer outbeem as can fuck it people to dead on. No.
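For the curious, the sampling temperature mentioned earlier can be illustrated with a few lines of Python. This is a generic sketch of temperature sampling, not the Torch code I actually ran; the function names and the toy scores are invented for the example. The idea is to divide the model's raw score for each candidate character by the temperature before turning the scores into probabilities.

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_char(chars, scores, temperature=1.0, rng=random):
    """Pick one character at random, weighted by the tempered probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(chars, weights=probs, k=1)[0]


if __name__ == "__main__":
    chars = ["e", "t", "q"]
    scores = [2.0, 1.0, 0.0]  # toy scores, not real model output
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])
```

At a low temperature the most likely character takes most of the probability, so the text gets safer and more repetitive. At a high temperature the probabilities even out, so unlikely characters get picked more often and the spelling gets weirder.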