karlduane

NLP Classification- Choosing the 'Best of' from 'Legal Advice'

Updated: Jan 19

Phase 1 - Problem Definition


1.1 Broad Goal(s):


My goal is two-fold:

1. Using [Pushshift's](https://github.com/pushshift/api) API, collect posts from two subreddits of my choosing.

2. Use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.


1.2 Subreddit Selection


Reddit.com has a 'best of' feature, both for Reddit as a whole and for specific subreddits. How a post or comment gets selected as a 'best of' is a fascinating rabbit hole to dig into: see the guest post by Randall Munroe of XKCD fame on the subject, or take a look at the thread in r/NoStupidQuestions (both linked in the Resources section below).


TL;DR: a statistical algorithm tracks activity, the number of comments, and the number of upvotes to determine which comments and posts get the most engagement, then flags them for a redditor's review.


The question I wanted to examine is whether post titles can be parsed to determine whether they come from an original subreddit or from the corresponding 'best of' subreddit.

To do this I'm looking at the subreddits of:


1. r/legaladvice

2. r/bestoflegaladvice


In short: can we build an NLP model that predicts whether a post comes from the legal advice subreddit or from the curated best of legal advice subreddit?


Phase 2 - Data Gathering


2.1 Define a Function to Gather Posts from Reddit Using the Pushshift API

from datetime import datetime
import time

import pandas as pd
import requests


def get_posts(subreddit, n):
    """Return a DataFrame of the n most recent posts from a subreddit."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    if n < 100:
        params = {
            'subreddit': subreddit,
            'size': n
        }
        res = requests.get(url, params=params)
        data = res.json()
        posts = data['data']
    else:
        # Note: Pushshift.io now has a hard limit of 100 posts returned per API
        # hit, so I'm capping each request at 100 and looping until I hit n posts.

        # Get midnight today as an epoch (Unix) timestamp
        today = datetime.now()
        now = today.replace(hour=0, minute=0, second=0, microsecond=0)
        epoch = int(now.timestamp())

        params = {
            'subreddit': subreddit,
            'size': 100,      # pull 100 posts at a time
            'before': epoch   # start from today and page backwards
        }
        posts = []
        # Until I have as many posts as called for
        while len(posts) < n:
            # Get the next page of posts
            res = requests.get(url, params=params)
            # Convert the response to a list of posts
            data = res.json()
            # Add them to the running list
            posts.extend(data['data'])
            # Set 'before' to the oldest post's timestamp so the next request
            # picks up where this one left off
            params['before'] = data['data'][-1]['created_utc']
            # Pause for 5 seconds so we're not hitting the API too fast
            time.sleep(5)

    return pd.DataFrame(posts)

2.2 Gather Posts from Each of the Two Subreddits


- r/bestoflegaladvice

- r/legaladvice


The first look used a total set of 2,000 titles; the final model was trained on a total set of 10,000 titles.

bola_df = get_posts('bestoflegaladvice', 5_000)
la_df = get_posts('legaladvice', 5_000)

df = pd.concat([bola_df, la_df], ignore_index=True)
df = df[['title', 'subreddit']]


2.3 Data Cleaning


In this particular case there were no missing cells, but there were significantly more columns than needed. Data cleaning was completed by simply dropping all columns except the text corpus column, 'title', and the target column, 'subreddit'. In future iterations, filtering the corpus by language may prove useful.


Phase 3 - Exploratory Data Analysis


3.1 Sentiment Analysis


Sentiment analysis using the NLTK VADER sentiment library breaks down the positivity and negativity of a corpus using a predetermined lexicon of positive and negative words.
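
As a minimal sketch of that workflow (assuming NLTK's vader_lexicon has been downloaded, and reusing the df built in Phase 2), the scores can be attached to each title like this:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon')  # one-time download if the lexicon is missing
sia = SentimentIntensityAnalyzer()

# polarity_scores returns 'neg', 'neu', 'pos', and a normalized 'compound' score;
# expanding the dicts adds one column per score
scores = df['title'].apply(sia.polarity_scores).apply(pd.Series)
df = pd.concat([df, scores], axis=1)

# Compare average sentiment across the two subreddits
print(df.groupby('subreddit')[['neg', 'pos', 'compound']].mean())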


Looking at the spread across the two subreddits doesn't show much difference in overall positivity or negativity of sentiment between the two categories, so this might not be a useful measurement to include in our final model.

3.2 Word Count / N-Gram Analysis


We can also look at the variation in title length to see whether we have a split between the two categories there.


A histogram of title length (in words) shows a flatter curve and greater overall spread for r/bestoflegaladvice titles than for r/legaladvice titles, which show a positive skew. In other words, titles in the r/legaladvice subreddit tend to be shorter, with most below the 40-word mark.
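
A quick sketch of how those counts and histograms can be produced (the word_count column name is my own choice here):

import matplotlib.pyplot as plt

# Words per title
df['word_count'] = df['title'].str.split().str.len()

# Overlaid histograms of title length for the two subreddits
for name, group in df.groupby('subreddit'):
    plt.hist(group['word_count'], bins=40, alpha=0.5, label=name)
plt.xlabel('Title length (words)')
plt.ylabel('Number of titles')
plt.legend()
plt.show()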



When looking at a scatterplot of compound sentiment score against word count by subreddit, we can see that r/legaladvice posts cluster mostly at the smaller word counts while r/bestoflegaladvice posts are more spread out. While we do not see a trend in the sentiment, we do see that r/bestoflegaladvice titles tend to have more words. This spread is fairly normal, as VADER's compound score is influenced by statement length: it sums word-level valence before normalizing, so longer titles can reach more extreme scores.


If we look at which specific words are included, we start to see some patterns. Looking at single words, the top words remain mostly the same across the two categories.


Once we start looking at bigrams (sets of two words), we see some common patterns begin to emerge.
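
A sketch of one way to surface those bigram counts, using scikit-learn's CountVectorizer restricted to two-word n-grams (the top_bigrams helper is my own, not the notebook's exact code):

from sklearn.feature_extraction.text import CountVectorizer

def top_bigrams(titles, k=10):
    """Return the k most frequent bigrams in a collection of titles."""
    cvec = CountVectorizer(ngram_range=(2, 2))
    counts = cvec.fit_transform(titles)
    totals = counts.sum(axis=0).A1  # total occurrences of each bigram
    ranked = sorted(zip(cvec.get_feature_names_out(), totals),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

for name, group in df.groupby('subreddit'):
    print(name, top_bigrams(group['title']))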

All right! We're seeing significant divergence in the most common bigrams across the two subreddits, which is promising for training a good model. Additionally, the most common bigrams in r/bestoflegaladvice occur much more frequently than those in r/legaladvice. That said, one very common tag keeps popping up: 'LAOP', which in the context of r/bestoflegaladvice means 'Legal Advice Original Post'. The most common bigram overall is 'actual title'.


While these two points are highly useful for building a classifier based on title alone, they might get in the way of the longer-term goal of predicting which r/legaladvice posts will end up in r/bestoflegaladvice.


Phase 4 - Modeling


The modeling notebook on my GitHub account contains all of the models tested, along with annotations. For the purposes of this summary we're bringing in only the best option: the Multinomial Naive Bayes.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# 75/25 split leaves 2,500 holdout titles; random_state is arbitrary
X_train, X_test, y_train, y_test = train_test_split(
    df['title'], df['subreddit'], random_state=42)

cvec_mnnb_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('mnnb', MultinomialNB())
])

cvec_mnnb_pipe_params = {
    'cvec__stop_words': [None],
    'cvec__max_features': [4000],
    'cvec__min_df': [2],
    'cvec__max_df': [.8],
    'cvec__ngram_range': [(1, 2)]  # include unigrams and bigrams
}

cvec_mnnb_gs = GridSearchCV(cvec_mnnb_pipe, cvec_mnnb_pipe_params,
                            cv=5, verbose=1, n_jobs=-2)

cvec_mnnb_gs.fit(X_train, y_train)
cvec_mnnb_gs.score(X_train, y_train), cvec_mnnb_gs.score(X_test, y_test)
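
After fitting, the winning settings and cross-validated accuracy can be read straight off the grid search object:

print(cvec_mnnb_gs.best_params_)  # hyperparameters the search settled on
print(cvec_mnnb_gs.best_score_)   # mean cross-validated training accuracy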

Phase 5 - Model Analysis


5.1 Model Selection


Prior to settling on the Naive Bayes, I also looked at a variety of combinations of word vectorizers and model types:

While the K-Nearest Neighbors model dramatically outperformed all the others on the training dataset, it dramatically underperformed on the testing data, meaning it cannot handle new data well. The best options per the chart above, in terms of training and testing accuracy with low variation between the two, are the Naive Bayes and Random Forest models.



The confusion matrix above is a method for examining which titles the model classified correctly and which it misclassified. Out of the 2,500 holdout testing titles the model examined, it missed only 302. That's not bad.
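
A minimal sketch of producing that matrix from the fitted grid search, using scikit-learn's confusion matrix utilities:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Raw counts: rows are true labels, columns are predicted labels
preds = cvec_mnnb_gs.predict(X_test)
print(confusion_matrix(y_test, preds))

# Or plot it straight from the fitted estimator
ConfusionMatrixDisplay.from_estimator(cvec_mnnb_gs, X_test, y_test)
plt.show()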


5.2 Digging into Misclassified Titles


So let's take a look at what I got wrong.
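
One way to pull those rows out, assuming the holdout split from Phase 4 (a sketch, not the exact notebook code):

# Collect the holdout titles alongside truth and prediction
results = pd.DataFrame({
    'title': X_test,
    'actual': y_test,
    'predicted': cvec_mnnb_gs.predict(X_test)
})

# Keep only the titles the model got wrong
misclassified = results[results['actual'] != results['predicted']]
print(len(misclassified))          # 302 in the run described above
misclassified['title'].sample(10)  # eyeball a random handful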





Looking at my misclassified titles, I'm noticing that none of the most common bigrams we encountered during EDA appear. One limitation of using bigrams on this dataset is the way r/bestoflegaladvice titles are written. During EDA I noted that the top bigrams in this subset included specific tags such as 'actual title' and 'LAOP ___' (Legal Advice Original Post). While these tags allow for more effective classification models when using bigrams, they are absent from the misclassified titles. In other words, when the model is trained on bigrams that include these common tags, it is a less effective classifier when those tags are missing.


Phase 6 - Conclusions

6.1 Revisit Problem Statement


How well can we train a classification model to correctly classify the title of a post as belonging to the r/legaladvice subreddit or the r/bestoflegaladvice subreddit? Which classification model type will be the strongest?


The best model we encountered was a Count Vectorizer paired with a Multinomial Naive Bayes model looking at bigrams. As predicted during EDA, we found significant divergence between the most common bigrams of each subset. However, when we examined the misclassified data, we found that relying on these bigrams may be a crutch: the misclassified titles did not include the tags that redditors commonly add when renaming posts for the 'best of' subreddit.


This result is heartening, as the model was able to correctly classify 87% of the posts it had not seen before, significantly outperforming our baseline of 50%. It also revealed some significant room for improvement.


6.2 Recommendations for Further Research


The model as currently trained is a good proof of concept and, should research continue, can serve as a basis for additional insights.


Future iterations could include:


1. Gather data from the post body text instead of the title. Changing the corpus would potentially allow the model to predict which legal advice posts will become 'best of' posts, serving as a predictor of popularity and engagement based on content.

2. Strip out the crutch of title tags such as 'LAOP' to better tune the model.

3. Incorporate a sentiment analysis transformer in the pipeline.

4. Train and incorporate a language filter to remove titles in other languages, or potentially include this step as a transformer and add its output as a feature to be examined (see the sketch after this list).
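
As a rough sketch of that fourth idea, using the langdetect package (my choice here; any language-identification library would work similarly):

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # langdetect is stochastic without a fixed seed

def title_language(title):
    """Best-effort language tag; very short titles can fail detection."""
    try:
        return detect(title)
    except Exception:  # langdetect raises on empty or undetectable text
        return 'unknown'

df['lang'] = df['title'].apply(title_language)

# Either filter the corpus down to English...
df_en = df[df['lang'] == 'en']
# ...or keep the tag as an additional feature for the model to examine
print(df['lang'].value_counts().head())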

6.3 Resources


Randall Munroe - Reddit's New Comment Sorting System:

https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/


Various redditors in r/NoStupidQuestions:

https://www.reddit.com/r/NoStupidQuestions/comments/6cmz29/how_does_reddit_determine_the_best_ranking_in_a/


Caroline Schmitt - Sentiment Analysis (the workflow used in section 3.1):

https://git.generalassemb.ly/DSI-12-Echelon/nlp_modeling_and_sentiment_analysis


Susan Li - Exploratory Data Analysis primer for text data:

https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a


Edward Ma - Named Entity Recognition roundup:

https://towardsdatascience.com/named-entity-recognition-3fad3f53c91e

