Subreddit Clustering

Summary

Reddit is a massive news and entertainment site driven purely by user submitted content. To keep posts thematically organized, the website is partioned into “subreddits”, which are essentially forums with a specific purpose or topic. For example, there’s gaming subreddits, sports subreddits, tv show subreddits, and a subreddit for almost anything else you can think of. The above visualization shows the similarity between over 4000 popular subreddits. Click the icons in the bottom left to enable panning and zooming. Hover over a bubble to see the name of the subreddit. Subreddits that are closer together are “more similar”, and larger bubbles represent more active subreddits. You’ll notice many isolated clusters of topics. The subreddits were clustered by applying Latent Semantic Indexing to a Term-Frequency-Inverse-Document-Frequency matrix composed of over 20 million reddit comments. Using this clustering, I set out to make a subreddit recommendation service. Applying a clustering algorithm such as K-means on the LSI reduced data accurately divides subreddits into a variety of sensical categories, which, given a reddit user’s comment history, can be used to recommend unique subreddits from categories a user is likely to be interested in.

Read On

As of now, there’s no fast and easy way to discover new subreddits pertaining to your own interests. People share subreddits amongst each other, and occasionally you’ll stumble across one you like, but for some reason there’s no form of subbreddit recommendation.

Thus, I was inspired to create a simple, personalized recommendation service that generates relevant subreddits from a user’s reddit data. Specifically, every user has a publically viewable comment history that contains the time, content and parent-subreddit of every comment they’ve ever posted. If a user is fairly active on the site, you can use this data to profile their interests incredibly well – and thus use this “interest profile” to suggest subreddits they might enjoy.

To map a user’s comment data to relevant subreddits, it would useful to create a measure of subreddit-to-subreddit simularity. Given a sub X, find what subs Y most resemble X. If a user posts often in some subreddit pertaining to an RPG game, for example, It’d be a good idea to suggest subreddits that focus on other similar RPG games. Or given some subreddit focused on, say, nature photos, suggest other nature related and photo related subs.

To accomplish this similarity mapping, I started by collecting over 20 million reddit comments from thousands of random subreddits. Much of this data was found in some data dump; the rest was collected using reddit’s api. The comment data was then cleaned and loaded onto an amazon mySQL database for querying. To acheive a rough measure of subreddit to subreddit simularity, you might simply compare word frequencies between subreddits. Strip away common words like “the”, “I”, “a”, etc., and compare the frequency of unique words that are left. Intuitively, Subs with comments that share many of the same words are probably similar to each other.

This quick method works, but not very well. One pitfall is the inability to distinguish “topic words” (that’s my made up name), which are words that are useful in identifying the unique topic at hand. Words like “people” or “thing” don’t help much in explaining the overall topic of a comment, but words like “Role playing game” or “Corvette” probably do. One way to weight these topic words more strongly is called “Term frequency inverse document frequency.” Given the set of words in each document (subreddit, in my case), weight each word by it’s frequency in that document and divide by the frequency it occurs across all documents. The intuition is that words which appear rarely are probably more descirptive of the topic at hand, whereas words that occur accross all subreddits are probably filler words like “people” or “things.”

Using tfidf helps the clustering, but there remains another pitfall: the inability to handle synonyms. One comment might talk about cars, another might talk about a specific car, and another might talk about automobiles in general. They’re all refering to a simlar idea, and should certainly be considered similar by an ideal algorithm, but a simple word to word comparison would return no similarity between them since they don’t share exact words.

Fortunately, data science exists (jazz hands). Enter Latent Semantic Indexing. LSI is a ~~gift from the math gods~~ dimensionality reduction technique that can reduce a set of words into a smaller set of components, which represent core concepts of the text. It handles the synonym problem by observing the context in which a word appears; since “car” and “automobile” will appear in similar contexts, for example, both words will be recognized as belonging to the same underlying concept.

Applying LSA to the reddit comment data greatly improved topic clustering. So much so, I decided to make a visualization. Even after mapping the original data all the way down 2 dimensions, it still retained a surprising amount of information.
Tfidf and LSA code Subreddit recommender code

TO BE CONTINUED

Written on May 30, 2016