karlduane

FODMAPs for IBS: Awareness and Sentiment Analysis in Social Media

A few weeks back I had a fascinating lunch meeting with a founder in the nutrition and biotechnology industry who was interested in creating a new company that could address irritable bowel syndrome (IBS) with a FODMAP-based nutritional supplement. In the interest of securing funding, she had questions about how well known FODMAPs are in communities of people with IBS, and that got me thinking: we can use existing machine learning tools to gather and analyze the level of term usage and sentiment intensity in the communities in question. So when I sat down for a mini-hackathon recently, I chose this as my topic. In the interest of time, I narrowed my field of study to the r/ibs and r/fodmap subreddits and used NLTK's VADER sentiment intensity analyzer to look at two primary metrics: term usage in posts and the VADER sentiment scores of those posts.

Acquiring the corpus starts with using pushshift.io's Reddit API to pull samples. I wrote a simple function for this a while back during General Assembly's Data Science Intensive bootcamp and have continued to refine it for efficiency and flexibility. If you're interested in more detail, you can check it out on my GitHub:

import time
from datetime import datetime

import pandas as pd
import requests

def get_nposts(subreddit, n, search_term=None):
    url = 'https://api.pushshift.io/reddit/submission/search/'
    posts = []
    # get midnight today in epoch date time format
    today = datetime.now()
    now = today.replace(hour=0, minute=0, second=0, microsecond=0)
    epoch = int(now.timestamp())
    params = {
        'subreddit': subreddit,
        'size': min(n, 100),  # pushshift now has a hard max of 100
        'before': epoch
    }
    if search_term is not None:
        params['selftext'] = search_term
    # until I have as many posts as called for
    while len(posts) < n:
        # get the posts
        res = requests.get(url, params)
        # convert to list
        data = res.json()
        # add to list
        posts.extend(data['data'])
        # set params 'before' to oldest post's utc
        params['before'] = data['data'][-1]['created_utc']
        # pause for 5 seconds so we're not hitting the API too fast and maxing it out
        time.sleep(5)

    return pd.DataFrame(posts[:n])

First, we'll look at awareness of FODMAPs in the r/ibs subreddit, specifically in terms of usage. We'll need to pull all the posts in the subreddit, label them for term usage, and examine the number of posts that include variations on the term 'fodmap.' We can look at either total posts per day/week/month or the percentage of posts in that same period that include the desired term.

Pandas makes this wonderfully simple: we can apply either a lambda or a defined function to the dataset to check whether the text contains the search term. I tend to prefer defined functions for clarity.

def contains_fodmap(cell):
    if 'fodmap' in str(cell).lower():
        return 1
    return 0

df['fodmap'] = df['selftext'].apply(contains_fodmap)
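The same flag can also be computed without `apply`, using pandas' vectorized string methods. A minimal sketch, using a hypothetical sample of post bodies in place of the real pulled data:

```python
import pandas as pd

# hypothetical sample standing in for df['selftext']
df = pd.DataFrame({'selftext': ['Low FODMAP diet helped me',
                                'Anyone tried peppermint oil?',
                                None]})

# vectorized equivalent of contains_fodmap: case-insensitive substring check;
# na=False treats posts with missing selftext as not containing the term
df['fodmap'] = df['selftext'].str.contains('fodmap', case=False, na=False).astype(int)

print(df['fodmap'].tolist())  # → [1, 0, 0]
```

The vectorized form is faster on large frames; the defined function remains easier to extend if you later want to match several term variants.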

We can then convert the raw timestamp to a datetime object and set it as the data frame's index with a couple lines of code:

df['create_date'] = pd.to_datetime(df['created_utc'], unit = 's')
df.set_index('create_date', inplace = True)

Given that the index is now the original posting date and each post referencing FODMAPs is flagged as 1, we can resample the fodmap column and take the mean to generate the ratio of posts in a given day, week, or month that include the term FODMAP.
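A minimal sketch of that resample-and-mean step, using a hypothetical toy frame in place of the real pulled data:

```python
import pandas as pd

# hypothetical toy frame: datetime index plus the 0/1 'fodmap' flag per post
idx = pd.to_datetime(['2020-11-01', '2020-11-01', '2020-11-02', '2020-11-08'])
df = pd.DataFrame({'fodmap': [1, 0, 1, 1]}, index=idx)

# the mean of a 0/1 column per period is the share of posts mentioning the term
daily = df['fodmap'].resample('D').mean()
weekly = df['fodmap'].resample('W').mean()
monthly = df['fodmap'].resample('M').mean()

print(daily.loc['2020-11-01'])  # → 0.5  (1 of 2 posts that day)
```

Plotting each resampled series gives the daily, weekly, and monthly usage charts.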

The key takeaway from these visualizations is that, aside from a brief daily spike in early November, average daily, weekly, and monthly usage remains fairly constant. That suggests the term FODMAP holds a steady place in the subreddit's lexicon. Now we'll examine sentiment.

First, we'll label each post with its positive, negative, neutral, and compound sentiment scores using NLTK's VADER SentimentIntensityAnalyzer and append them to the main dataframe so we can examine sentiment over time.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

dicts = []

for text in df['selftext']:
    scores = sia.polarity_scores(str(text))
    dicts.append(scores)

# reuse the datetime index so concat aligns the scores row-for-row
ibs_scores = pd.DataFrame(dicts, index=df.index)
df = pd.concat([df, ibs_scores], axis=1)

Let's look at what we've got:

This shows a bit of variation over time, and it's starting to look like we have stationary data. This is good: it means attempts to model and predict the compound sentiment scores are likely to be successful.

Looking at it on a weekly basis shows a similar pattern, and it's not until we examine the trends on a monthly basis that we see a smoother line.
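The daily, weekly, and monthly sentiment views come from the same resampling pattern as the usage ratios. A minimal sketch, assuming the VADER compound column from above and hypothetical toy scores:

```python
import pandas as pd

# hypothetical compound scores indexed by post date
idx = pd.to_datetime(['2020-11-01', '2020-11-01', '2020-11-02'])
scores = pd.Series([0.6, -0.2, 0.4], index=idx, name='compound')

# average compound sentiment per day / week / month
daily = scores.resample('D').mean()
weekly = scores.resample('W').mean()
monthly = scores.resample('M').mean()

print(round(daily.loc['2020-11-01'], 2))  # → 0.2
```

The monthly series averages over many more posts per point, which is why it traces the smoother line.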

Comparing the compound score to the positive and negative sentiment score shows a similar trend on both the daily:

and weekly scales:

The overall awareness and usage of the term FODMAP in the subreddit provides a baseline of awareness of low-FODMAP diets. The compound sentiment scores of the posts surrounding the term appear stationary, which means these metrics are good candidates for modeling and prediction.

If a company wanted to run an ad campaign on Reddit, it could target this particular subreddit for as little as $5 a day, continue to apply these analytical techniques, and keep an eye on the changes. This could also support A/B testing in the future: running two ad campaigns and tracking sentiment and mentions alongside clickthroughs to test whether a campaign was working.

This project was a small-scale exploratory data analysis to see whether there was potential to continue studying the subject and the business use case. Thoughts, suggestions, critiques? Feel free to reach out; I'm always interested in learning more.
