Project Description

Text Mining in PharmacoWatch: Drug mention extraction from Twitter data

Infolytx has built a PaaS solution that predicts, and alerts users to, drug-to-drug interactions and other adverse drug reactions (ADRs) of pharmaceutical drugs approved in the USA. The solution integrates public data from the FDA (FDA AERS) with social media (Twitter) mentions of ADRs.

We combined the data to build machine learning models for predictive analytics. Our initial models were built to accurately predict previously unknown drug-to-drug interactions.

We also provided a public API to support an ecosystem of applications targeting different end-users in the pharmacovigilance space.

Several data sources needed to be cleansed, normalized and integrated before accurate predictive machine learning models could be built for the PaaS.

The Problem

We wanted to augment the FDA ADR data with public mentions of drugs and adverse reactions in social media. As a proof of concept, we chose Twitter as the social media source. We crawled over two hundred thousand tweets during the first half of 2016. Our goal was to accurately identify which tweets actually refer to one of our chosen drugs (despite abbreviations, misspellings, typos, etc.) and to differentiate between advertisements and genuine public drug mentions. The problem was made more difficult by Twitter’s 140-character limit, which pushes people to use shortcuts to save space and to write in non-grammatical constructs, both of which complicate parsing.

Extracted Twitter data was used for our Advanced Visualization application

The Solution

  • Query Twitter API to extract references to top 50 drugs in the FDA AERS database
  • Filter out advertisements, as well as tweets that mention a drug without an adverse reaction or an adverse reaction without a drug
  • Use cTAKES (clinical Text Analysis and Knowledge Extraction System), augmented and improved by Infolytx, for
    • Identifying and coding drug mentions to UMLS
    • Identifying and coding signs and symptoms to UMLS
    • Predicting which signs and symptoms are adverse reactions to the drug mention
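
The filtering step can be illustrated with a minimal Python sketch. This is not the production filter: the drug list, symptom list, and advertisement cues below are small hypothetical stand-ins, and the real pipeline relies on cTAKES rather than simple token matching. It only shows the idea of keeping a tweet when it mentions both a drug and a symptom and does not look like an advertisement.

```python
import re

# Hypothetical stand-ins for the real drug and symptom dictionaries.
DRUGS = {"ibuprofen", "metformin", "warfarin"}
SYMPTOMS = {"headache", "nausea", "rash", "dizziness"}

# Assumed advertisement cues; the actual filter rules are not described here.
AD_CUES = ("buy now", "discount", "free shipping", "order online")

TOKEN_RE = re.compile(r"[a-z]+")


def looks_like_ad(text):
    """Return True if the tweet contains any of the assumed ad phrases."""
    lowered = text.lower()
    return any(cue in lowered for cue in AD_CUES)


def keep_tweet(text):
    """Keep tweets that mention both a drug and a symptom and are not ads."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    has_drug = bool(tokens & DRUGS)
    has_symptom = bool(tokens & SYMPTOMS)
    return has_drug and has_symptom and not looks_like_ad(text)


if __name__ == "__main__":
    tweets = [
        "Took ibuprofen and now I have a rash all over",
        "Buy now! Ibuprofen discount, no more headache",
        "metformin refill day",
    ]
    for tweet in tweets:
        print(keep_tweet(tweet), "-", tweet)
```

Exact token matching like this misses misspelled or abbreviated drug names, which is why dictionary lookup and error correction are handled separately.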

Enhancing cTAKES for higher accuracy

  • Create and customize an extensive drug and symptom dictionary, and replace the cTAKES dictionary lookup with our own component, to improve drug mention identification and normalization
  • Use Natural Language Processing (NLP) to improve noun-phrase parsing to better look up and match drug mentions and symptoms in the dictionary
  • Compensate for data errors such as typos by using Lucene-based text indexing for spell checking and error correction in text
  • Use NLP and heuristics to determine whether identified symptoms are ADRs
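
To illustrate how typo-tolerant dictionary lookup can normalize drug mentions, here is a minimal Python sketch. It is only an approximation of the idea: it uses the standard library's `difflib` as a stand-in for the Lucene-based spell checking described above, and the dictionary entries, the `normalize_drug` function name, and the 0.8 similarity cutoff are all assumptions made for the example. A real dictionary would map each surface form to a UMLS code rather than to a plain name.

```python
import difflib

# Hypothetical dictionary: surface form -> canonical drug name.
DRUG_DICTIONARY = {
    "ibuprofen": "ibuprofen",
    "advil": "ibuprofen",
    "metformin": "metformin",
    "warfarin": "warfarin",
    "coumadin": "warfarin",
}


def normalize_drug(token, cutoff=0.8):
    """Map a possibly misspelled token to a canonical drug name, or None."""
    token = token.lower()
    # Exact dictionary hit: no correction needed.
    if token in DRUG_DICTIONARY:
        return DRUG_DICTIONARY[token]
    # Otherwise take the closest surface form above the similarity cutoff.
    matches = difflib.get_close_matches(
        token, list(DRUG_DICTIONARY), n=1, cutoff=cutoff
    )
    return DRUG_DICTIONARY[matches[0]] if matches else None


if __name__ == "__main__":
    for word in ["ibuprofin", "Coumadin", "metformn", "pizza"]:
        print(word, "->", normalize_drug(word))
```

The cutoff trades recall for precision: a lower value catches more misspellings but also risks matching unrelated words to a drug name.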