NLP - Subreddit Classification
The question I wanted to examine is whether post titles alone can be used to determine whether a post comes from an original subreddit or from the 'best of' subreddit in the same category.
To do this I'm looking at two subreddits: r/legaladvice and r/bestoflegaladvice.
In short: can we build a model that will predict whether a post is from the legal advice subreddit or the curated best of legal advice subreddit?
Exploratory Data Analysis:
Once the corpus has been gathered from Reddit using the Pushshift API, Exploratory Data Analysis begins. For this example, we focus on sentiment analysis, title length, common words and common bigrams.
Sentiment analysis with the NLTK VADER sentiment library scores the positivity and negativity of each title using a predetermined lexicon of positive and negative words.
Looking at the spread across the two subreddits doesn't show much of a change in overall positivity or negativity of sentiment between the two main categories, so this might not be a useful measurement to include in our final model.
When looking at a scatterplot of compound sentiment score against word count by subreddit, we can see that the r/legaladvice posts are clustered mostly at the lower word count end, while the r/bestoflegaladvice posts are spread out. While we do not see a trend in the sentiment, we do see that the r/bestoflegaladvice titles tend to have more words. The wider spread in scores for longer titles is expected, as VADER's compound score is built from the summed valence of the words in a statement, so statement length feeds into the calculation.
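To make the link between length and compound score concrete, here is a minimal sketch of the normalization step VADER applies to the summed word valences, x / sqrt(x^2 + alpha) with alpha = 15. The per-word valence values are made up for illustration and are not taken from the VADER lexicon, and the sketch leaves out VADER's rules for negation, punctuation, and capitalization.

```python
import math

ALPHA = 15  # normalization constant in VADER's compound score


def compound(valences):
    """Squash a sum of word valences into the open interval (-1, 1)."""
    total = sum(valences)
    return total / math.sqrt(total * total + ALPHA)


# Hypothetical valences: one mildly negative word vs. four of them.
short_title = [-1.5]
long_title = [-1.5, -1.5, -1.5, -1.5]

print(round(compound(short_title), 3))
print(round(compound(long_title), 3))
```

With more sentiment-bearing words the sum grows, so the compound score moves further from zero. That is one plausible reason the longer r/bestoflegaladvice titles cover a wider range of scores.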
If we look at what words specifically are included, we start to see some patterns. Looking at single words, the top words remain mostly static across the two categories.
Once we start looking at bigrams (sets of two adjacent words), we see some common patterns begin to emerge.
We're seeing significant divergence here in the most common bigrams across the two subreddits. This is promising for training a good model. Additionally, the most common bigrams from r/bestoflegaladvice occur far more often than the most common bigrams from r/legaladvice. That being said, I'm seeing a very common tag popping up: 'LAOP'. In the context of r/bestoflegaladvice, this means 'Legal Advice Original Poster'. Additionally, the most common bigram overall is 'actual title'.
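The bigram comparison can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the tokenizing regex, the helper names `bigrams` and `top_bigrams`, and the sample titles are all assumptions invented for the example.

```python
from collections import Counter
import re


def bigrams(title):
    """Lowercase a title, split it into word tokens, and pair adjacent tokens."""
    tokens = re.findall(r"[a-z0-9']+", title.lower())
    return list(zip(tokens, tokens[1:]))


def top_bigrams(titles, n=3):
    """Count bigrams across a list of titles and return the n most common."""
    counts = Counter()
    for title in titles:
        counts.update(bigrams(title))
    return counts.most_common(n)


# Hypothetical titles, invented for illustration only.
sample_titles = [
    "Actual title: my landlord kept my deposit",
    "LAOP's landlord kept the deposit. Actual title inside",
]

print(top_bigrams(sample_titles))
```

Running `top_bigrams` separately on the titles from each subreddit gives the two ranked lists that are compared above.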
To find the best model for the job, we tested four different model types: Logistic Regression, K-Nearest Neighbors, Multinomial Naive Bayes, and Random Forest. For each of these models, we tested both a standard Count Vectorizer and a TF-IDF Vectorizer, giving 8 total model candidates to choose from.
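The difference between the two vectorizers can be sketched as follows. A count vectorizer keeps raw term counts, while TF-IDF down-weights terms that appear in many documents. This sketch assumes one common smoothed idf formula, ln((1 + n) / (1 + df)) + 1, and skips any final length normalization, so the library used in the project may produce different numbers. The tokenized titles are hypothetical.

```python
import math
from collections import Counter


def count_vector(tokens):
    """Raw term counts for one document."""
    return dict(Counter(tokens))


def tfidf_vector(tokens, docs):
    """Term counts reweighted by a smoothed inverse document frequency."""
    n_docs = len(docs)
    weights = {}
    for term, tf in Counter(tokens).items():
        # Number of documents that contain the term at least once.
        df = sum(1 for doc in docs if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = tf * idf
    return weights


# Hypothetical tokenized titles, invented for illustration.
docs = [
    ["landlord", "kept", "deposit"],
    ["landlord", "wants", "rent"],
    ["laop", "landlord", "deposit"],
]

print(count_vector(docs[2]))
print(tfidf_vector(docs[2], docs))
```

In the count vector every term in the third title has weight 1. Under TF-IDF the rare term 'laop' is weighted above 'deposit', which in turn is weighted above 'landlord', which appears in every title.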
While the K-Nearest Neighbors model dramatically outperformed all the others on the training dataset, it performed far worse on the testing data, meaning it overfit the training set and cannot handle new data well. The options that combine strong training and testing accuracy with a small gap between the two are the Naive Bayes and the Random Forest models.
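The selection rule described here (prefer high test accuracy, reject models with a large gap between training and testing accuracy) can be written down as a short sketch. The accuracies below are placeholders and NOT the project's results, and the 0.05 gap threshold is an arbitrary assumption chosen for the example.

```python
def generalization_gap(train_acc, test_acc):
    """Difference between training and testing accuracy."""
    return train_acc - test_acc


def rank_candidates(scores, max_gap=0.05):
    """Keep models whose gap is within max_gap, best test accuracy first."""
    kept = [
        (name, train, test)
        for name, (train, test) in scores.items()
        if generalization_gap(train, test) <= max_gap
    ]
    return sorted(kept, key=lambda row: row[2], reverse=True)


# Placeholder (train, test) accuracies, NOT the project's results.
placeholder_scores = {
    "model_a": (0.99, 0.70),  # large gap: likely overfit
    "model_b": (0.90, 0.87),
    "model_c": (0.88, 0.85),
}

for name, train, test in rank_candidates(placeholder_scores):
    print(name, train, test)
```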
How well can we train a classification model to correctly classify the title of a subreddit post as belonging to the r/legaladvice subreddit or the r/bestoflegaladvice subreddit? Which classification model type will be the strongest?
The best model we encountered was a Count Vectorizer paired with a Multinomial Naive Bayes model looking at bigrams. As we predicted during EDA, there was significant divergence between the most common bigrams of each subreddit. However, when we looked at the misclassified data, we found that these bigrams may be a crutch: the misclassified posts did not include the tags commonly added by the redditors who rename posts when sharing them in the 'best of' subreddit.
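To show how a bigram-based Multinomial Naive Bayes classifier ends up leaning on tags, here is a small from-scratch sketch. It is not the project's pipeline: the class name `BigramNaiveBayes`, the tokenizer, the Laplace smoothing value of 1, and the training titles are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def bigram_tokens(title):
    """Turn a title into a list of adjacent word pairs."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over bigram counts with Laplace smoothing."""

    def fit(self, titles, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for title, label in zip(titles, labels):
            self.feature_counts[label].update(bigram_tokens(title))
        self.vocab = {bg for c in self.feature_counts.values() for bg in c}
        return self

    def predict(self, title):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for bg in bigram_tokens(title):
                if bg in self.vocab:  # ignore bigrams never seen in training
                    score += math.log((counts[bg] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training titles, invented for illustration only.
titles = [
    "Landlord kept my deposit",
    "Can my boss fire me",
    "LAOP's landlord kept everything. Actual title inside",
    "Actual title: LAOP's boss is a dog",
]
labels = ["legaladvice", "legaladvice", "bestoflegaladvice", "bestoflegaladvice"]

model = BigramNaiveBayes().fit(titles, labels)
print(model.predict("Actual title: neighbor built a fence"))
print(model.predict("My boss kept my deposit"))
```

In this toy data the first prediction rests entirely on the tag bigram 'actual title', since none of the other bigrams in that title were seen in training. A 'best of' title without such a tag gives the model much less to work with, which matches the pattern seen in the misclassified posts.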
This model is heartening as it was able to correctly classify 87% of the posts it had not seen before, significantly outperforming our baseline of 50%. It also revealed some significant room for improvement.
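For reference, a baseline like this is usually the accuracy of always guessing the most common class. The sketch below assumes the 50% figure reflects two evenly balanced classes; the label list is hypothetical.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def accuracy(predictions, labels):
    """Share of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical balanced label set, invented for illustration.
labels = ["legaladvice"] * 5 + ["bestoflegaladvice"] * 5
print(baseline_accuracy(labels))
```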