What if you need more labeled data? Label spreading and propagation

Although it is a commonplace that we live in the era of big data and are experiencing a data deluge, many projects fail for lack of (labeled) data. Without enough data, you can’t build a great machine learning application, even if your team has the brightest minds in the universe. In this post, we present an example of using label spreading in Python to get more labeled data. If you are not a Pythonista, scroll down for other resources on the topic.

The problem

Let’s assume that you want to assign a sentiment score to each review in the video game subset of the Amazon review dataset. Let’s also assume that the ratings are either unreliable or unavailable, and that you have only limited capacity for data annotation. After a little brainstorming, you decide to use dictionary-based sentiment analysis. NLTK has a sentiment module full of handy utilities for this. After some trial and error, you build a scoring function using nltk.corpus.opinion_lexicon and nltk.sentiment.util.mark_negation. The results look good, but on closer inspection of the data you realize that the words in nltk.corpus.opinion_lexicon are too general. It would be great if you could fine-tune the lists of positive and negative words with domain-specific expressions.
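A scoring function of this kind could look roughly like the sketch below. It is not the original function from the post: the regex tokenizer, the sum-of-polarities score, and the names sentiment_score, load_opinion_lexicon, toy_positive, and toy_negative are assumptions made for illustration. Only mark_negation and opinion_lexicon come from NLTK, and loading the lexicon requires running nltk.download("opinion_lexicon") first.

```python
import re

from nltk.sentiment.util import mark_negation

# Assumed tokenizer: lowercase words (apostrophes kept) and clause punctuation.
TOKEN_RE = re.compile(r"[a-z']+|[.:;!?]")


def load_opinion_lexicon():
    """Return NLTK's opinion lexicon as (positive, negative) sets.

    Requires the corpus to be present: nltk.download("opinion_lexicon").
    """
    from nltk.corpus import opinion_lexicon

    return set(opinion_lexicon.positive()), set(opinion_lexicon.negative())


def sentiment_score(text, positive, negative):
    """Sum word polarities, flipping the sign of words inside a negation scope."""
    tokens = TOKEN_RE.findall(text.lower())
    # mark_negation appends "_NEG" to tokens between a negation word
    # and the next clause-level punctuation mark.
    marked = mark_negation(tokens)
    score = 0
    for token in marked:
        negated = token.endswith("_NEG")
        word = token[:-4] if negated else token
        polarity = (word in positive) - (word in negative)
        score += -polarity if negated else polarity
    return score


# Tiny hand-made word lists (hypothetical) so the sketch runs without the corpus.
toy_positive = {"like", "great"}
toy_negative = {"bad", "awful"}

plain = sentiment_score("Great game", toy_positive, toy_negative)
negated = sentiment_score("not bad", toy_positive, toy_negative)
mixed = sentiment_score(
    "I didn't like this game. It was great!", toy_positive, toy_negative
)
```

With the real lexicon, the two sets returned by load_opinion_lexicon() would be passed in place of the toy sets.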

The solution

We are going to use scikit-learn’s sklearn.semi_supervised.LabelSpreading to label the words in our corpus (video game reviews) as positive or negative. The code snippets below show the logic behind the solution. Admittedly, this is not the most efficient way to tackle the problem, as our main aim was to create a comprehensible solution. We will assign a probability of each category (positive and negative) to each word, so we can either extend the opinion_lexicon automatically using a threshold value, or have a human check the words above a certain threshold.

LabelSpreading constructs a similarity matrix from the data, so first we have to vectorize our data somehow. The simplest way to do that is to train a word2vec model using gensim.

Next, we can label each word in our vocabulary as positive, negative, or unknown using the opinion lexicon’s built-in lists. With the labels in place, we are ready to fit LabelSpreading on our data.
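The labeling and fitting steps might be sketched as follows. To stay self-contained, the example uses random, well-separated clusters as a stand-in for word2vec vectors and made-up word names in place of a real vocabulary. The helper seed_labels, the RBF kernel, and the gamma and alpha values are assumptions, not the post’s original settings. scikit-learn treats the label -1 as "unlabeled".

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

POSITIVE, NEGATIVE, UNKNOWN = 1, 0, -1  # -1 marks unlabeled samples


def seed_labels(vocab, positive, negative):
    """Label each word from the lexicon lists; ambiguous or unseen words stay unknown."""
    labels = np.full(len(vocab), UNKNOWN)
    for i, word in enumerate(vocab):
        in_pos, in_neg = word in positive, word in negative
        if in_pos and not in_neg:
            labels[i] = POSITIVE
        elif in_neg and not in_pos:
            labels[i] = NEGATIVE
    return labels


# Toy stand-ins (hypothetical): 20 "positive-like" and 20 "negative-like" words
# whose vectors form two clearly separated clusters.
rng = np.random.default_rng(0)
vocab = [f"pos_like_{i}" for i in range(20)] + [f"neg_like_{i}" for i in range(20)]
vectors = np.vstack(
    [rng.normal(2.0, 0.3, size=(20, 5)), rng.normal(-2.0, 0.3, size=(20, 5))]
)

# Pretend the lexicon only knows three words from each group.
positive = set(vocab[:3])
negative = set(vocab[20:23])

labels = seed_labels(vocab, positive, negative)

# Assumed kernel and hyperparameters.
model = LabelSpreading(kernel="rbf", gamma=0.5, alpha=0.2)
model.fit(vectors, labels)

# transduction_ holds the label assigned to every training sample,
# including the ones that started out as unknown.
spread = model.transduction_
```

On real embeddings the clusters will not be this clean, so the kernel and its parameters usually need some experimentation.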

Using the standard predict_proba method of scikit-learn estimators, we can check the probability of each class (positive and negative) for each word in our vocabulary. This means that we are ready to extend our lexicon as described above.
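A minimal sketch of the thresholding idea is shown below. The hand-picked 2D vectors, the word list, the helper lexicon_candidates, and the 0.9 threshold are all illustrative assumptions. In practice the vectors would come from the word2vec model and the threshold would be tuned, or the candidates handed to a human reviewer.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

POSITIVE, NEGATIVE, UNKNOWN = 1, 0, -1


def lexicon_candidates(model, vectors, words, labels, threshold=0.9):
    """Return unlabeled words whose class probability reaches the threshold."""
    proba = model.predict_proba(vectors)
    classes = list(model.classes_)
    pos_col, neg_col = classes.index(POSITIVE), classes.index(NEGATIVE)
    new_positive, new_negative = [], []
    for word, label, row in zip(words, labels, proba):
        if label != UNKNOWN:
            continue  # already in the lexicon
        if row[pos_col] >= threshold:
            new_positive.append(word)
        elif row[neg_col] >= threshold:
            new_negative.append(word)
    return new_positive, new_negative


# Hypothetical toy data: hand-picked 2D "embeddings" for eight words.
words = ["good", "great", "addictive", "immersive", "bad", "awful", "laggy", "buggy"]
vectors = np.array(
    [
        [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.8],
        [-1.0, -1.0], [-1.1, -0.9], [-0.9, -1.1], [-1.0, -0.8],
    ]
)
labels = np.array(
    [POSITIVE, POSITIVE, UNKNOWN, UNKNOWN, NEGATIVE, NEGATIVE, UNKNOWN, UNKNOWN]
)

# Assumed kernel and hyperparameters.
model = LabelSpreading(kernel="rbf", gamma=2.0, alpha=0.2)
model.fit(vectors, labels)

new_positive, new_negative = lexicon_candidates(model, vectors, words, labels)
```

Lowering the threshold yields more candidates at the cost of more noise, which is where the human check becomes useful.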

Code and data

Further resources

  • If you have to work with large datasets, go for Neo4j, which has built-in label propagation. Read more about it here.
  • Google is using graph-powered machine learning to get labeled data on a very large scale. Read this post, or this paper to learn more about label propagation on huge datasets using Pregel-style APIs (like Spark’s Pregel API).
  • To learn more about Label Propagation/Spreading and other Semi-Supervised algorithms, read the Semi-Supervised Learning collection.
  • Do you like this method? Read Complex Network Analysis in Python by Zinoviev to get ideas on how to create networks based on co-occurrence, similarity, etc.

Sources

The header image was downloaded from this link, and it appeared in this post.

The image was downloaded from Wikimedia; its original source can be found here.
