Keyword extraction is a subtask of the Information Extraction field, responsible for gathering important words and phrases from text documents. This is helpful for assigning documents to categories, tagging, or organizing them.

Because these subtasks are gaining more and more attention every day, new methods for extracting keywords keep appearing, and older ones are improved to offer better results.


In this article we are going to compare two of the most researched methods for extracting keywords from text: TextRank and RAKE. We will take a short look at the algorithms behind them, then play with one Python implementation for each and compare the results.

Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.

There is another efficient method for extracting keywords which was previously covered on this blog: the TF-IDF algorithm. That method is different, though, because it requires a large dataset in order to show off its effectiveness. For the scope of this article I have decided to use only a single document, so comparing TF-IDF to TextRank and RAKE would be unfair. But feel free to check out that article if you want to learn more about the TF-IDF algorithm.

Table of Contents

  • TextRank algorithm explained
  • Rake algorithm explained
  • Keywords extraction project setup
  • TextRank Python keywords extraction
  • Rake Python keywords extraction
  • Comparing TextRank vs Rake

TextRank algorithm explained

The TextRank name may sound familiar because you've probably heard about PageRank, the algorithm used at Google to define the importance of a webpage in relation to other webpages. In short, the more pages link to a given page, the more important that page is.
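
To make the idea concrete, here is a minimal sketch of the PageRank iteration in plain Python. The graph, damping factor, and function name are toy choices for illustration, not the real Google implementation:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    scores = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_scores = {}
        for page in pages:
            # Sum the contributions from every page that links to this one
            incoming = sum(
                scores[other] / len(links[other])
                for other in pages
                if page in links[other]
            )
            new_scores[page] = (1 - damping) / len(pages) + damping * incoming
        scores = new_scores
    return scores

# 'a' is linked to by both 'b' and 'c', so it ends up with the highest score
links = {"a": ["b"], "b": ["a"], "c": ["a"]}
scores = pagerank(links)
assert max(scores, key=scores.get) == "a"
```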

The TextRank algorithm works in a similar fashion. It is a graph-based algorithm, meaning the primary data model used for it is a graph, structured like this:

  • Words in our input text represent nodes in the graph
  • Similarity scores between the words represent edges inside the graph

Two nodes are connected by an edge: the nodes represent two words from the text, while the edge between them represents how similar those two words are.

In plain language, the algorithm works as follows:

  1. Split our original document into words/phrases
  2. Calculate word embeddings using a vector representation algorithm
  3. Compute similarity scores between every two nodes
  4. Build the graph using the rules described above
  5. Get top-n results from the graph
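
The steps above can be sketched in plain Python. Note one simplification: instead of computing word embeddings and similarity scores, this sketch follows the original TextRank paper and connects words that co-occur within a small window, which plays the role of the edge weights. All names and parameters here are illustrative:

```python
def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_n=3):
    # Step 1: split the document into words (a real system would also
    # filter by part of speech and remove stop words)
    words = text.lower().split()

    # Steps 2-4 (simplified): connect words that appear within a small
    # window of each other, in place of embedding-based similarity edges
    graph = {word: set() for word in words}
    for i, word in enumerate(words):
        for other in words[max(0, i - window):i + window + 1]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)

    # Run the PageRank-style iteration over the word graph
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[neighbor] / len(graph[neighbor])
                for neighbor in graph[word]
            )
            for word in graph
        }

    # Step 5: return the top-n words by score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Running it on a short sentence, the most connected word rises to the top, which is exactly the PageRank intuition applied to words.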

You can find the original research paper on the TextRank algorithm here.

Rake algorithm explained

RAKE stands for Rapid Automatic Keyword Extraction and is a very powerful and fast algorithm for keyword and keyphrase extraction. The algorithm seems a little too simple to be true, but I think it's genius exactly because of that simplicity. 😀

Let's go step by step through an explanation of the RAKE algorithm:

  1. Go through the text and locate the stop words and punctuation
  2. Remove stop words and punctuation and obtain a list of the phrases which were separated by them
  3. For each word, count the number of times it appears across all the phrases. We call that the frequency of the word.
  4. For each pair of words, count how many times they appear together in the same phrase; summing these co-occurrence counts for a word (the word itself included) gives that word's degree, a measure of co-occurrence.
  5. For every word, obtain a score by dividing its degree by its frequency.
  6. Finally, calculate the score of an entire phrase by adding up the scores from the previous step for all the words that form that phrase.

All you need to do from here is order the phrases by their scores and keep your top-n phrases.
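
These steps fit in a few lines of plain Python. The stop word list below is a toy one (real implementations ship a full list), and the tokenization is deliberately crude; function and variable names are my own:

```python
import re
from collections import defaultdict

# Toy stop word list; real implementations use a much longer one
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for"}

def rake_keywords(text, top_n=3):
    # Steps 1-2: split on stop words and punctuation to get candidate phrases
    tokens = re.split(r"[^a-zA-Z]+", text.lower())
    phrases, current = [], []
    for token in tokens:
        if not token or token in STOP_WORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)

    # Steps 3-4: word frequency, and degree (co-occurrence with the words
    # of every phrase the word appears in, itself included)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # Steps 5-6: word score is degree / frequency; a phrase's score is the
    # sum of its word scores
    def phrase_score(phrase):
        return sum(degree[word] / freq[word] for word in phrase)

    ranked = sorted(phrases, key=phrase_score, reverse=True)
    return [" ".join(phrase) for phrase in ranked[:top_n]]
```

Because the degree rewards words that live in long phrases, RAKE naturally favors multi-word phrases, which is exactly what we'll observe in the results later on.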

You can find the original research paper on the RAKE algorithm here.

Keywords extraction project setup

For the purpose of this project we are going to extract the summary of a Wikipedia article, apply the two algorithms to extract keywords, and then compare the results.

First off, we should install a Python package that allows us to extract the text from a Wikipedia page.

pip3 install wikipedia

Next, let's install the two packages that contain our implementations. We'll use gensim for TextRank and rake-nltk for RAKE. Note that the keywords helper lives in gensim.summarization, which was removed in gensim 4.0, so we pin an older release.

pip3 install gensim==3.8.3
pip3 install rake-nltk

We'll also need nltk, but that will automatically be installed when we install rake-nltk.

For extracting the text I've written a small class that, given the title of a Wikipedia article, uses the package we've just installed to fetch the article, extract the summary, and store the text for when we'll need it.

import wikipedia


class TextFetcher:

    def __init__(self, title):
        self.title = title
        page = wikipedia.page(title)
        self.text = page.summary

    def getText(self):
        return self.text

TextRank Python keywords extraction

As mentioned earlier, we'll use the Gensim package to apply the TextRank algorithm on a given text.

You'll be amazed by how small this class actually is. 😀

from gensim.summarization import keywords

class TextRankImpl:

    def __init__(self, text):
        self.text = text

    def getKeywords(self):
        return (keywords(self.text).split('\n'))

And that's it. We are using the keywords method from gensim.summarization. The only extra thing we are doing here is taking the output of the method and splitting it by line breaks, so we can obtain our list of keywords.

That's pretty much it. You can try it yourself or wait to see the results in a minute.

Rake Python keywords extraction

The same approach here. We are using the rake-nltk package and encapsulating it in a small class. Here's the implementation.

from rake_nltk import Rake

class RakeImpl:

    def __init__(self, text):
        self.text = text
        self.rake = Rake()

    def getKeywords(self):
        self.rake.extract_keywords_from_text(self.text)
        return self.rake.get_ranked_phrases()

This package can extract keywords from a given text or from a list of sentences; we are going with the text-based method for the moment. Note that rake-nltk relies on NLTK data, so you may need to run nltk.download('stopwords') and nltk.download('punkt') before the first use.

We can extract the ranked phrases, and we can also get the score associated with every phrase by using rake.get_ranked_phrases_with_scores(), but we don't need that for today.

Comparing TextRank vs Rake

Now comes the main part of our project. The steps we're going to follow are:

  • Use the TextFetcher class to get the summary of a Wikipedia article
  • Use the TextRankImpl class to extract keywords using TextRank
  • Use the RakeImpl class to extract keywords using RAKE
  • Print out the results

from textrank import TextRankImpl
from rake import RakeImpl
from textfetcher import TextFetcher

textFetcher = TextFetcher("London")

textRankImpl = TextRankImpl(textFetcher.getText())
print (textRankImpl.getKeywords()[:10])

rakeImpl = RakeImpl(textFetcher.getText())
print (rakeImpl.getKeywords()[:10])

We are extracting only the first 10 keywords from each implementation, and the results look like this:

['london', 'cities', 'largest city', 'museums', 'museum', 'education', 'greenwich', 'investment', 'area', 'universities']

['hosted three modern summer olympic games', 'london contains four world heritage sites', 'inner london borough holding city status', 'sixth largest metropolitan area gdp', 'square mile − retains boundaries', 'oldest underground railway network', 'busiest city airport system', 'comprehensive university college london', 'landmarks include buckingham palace', 'ancient core −']

Let's suppose we didn't know what the Wikipedia article was about. From both lists we could have figured out that the article is about London, so that's a good sign: each implementation did its job here.

The most important thing to notice here is that TextRank gives us keywords (only one entry has two words, the rest have a single word), while RAKE gives us phrases.

What I would conclude from here is that I'd personally use TextRank for, say, a tagging system or for search engines, while RAKE would prove very useful for a text summarization task. Either way, both of them worked very well, and I'm excited to have tried them for this article.
