Natural Language Processing, or NLP, is a subfield of Artificial Intelligence research focused on developing models and points of interaction between humans and computers based on natural language. This includes both text-based and speech-based systems.
As human language is very complex by nature, building algorithms that process it might seem a daunting task, especially for a beginner. And it's true that building advanced NLP algorithms and features requires a lot of interdisciplinary knowledge, which makes NLP look like one of the most complicated subfields of Artificial Intelligence.
But there are small steps we can take so that our introduction to NLP becomes less intimidating, and one of these steps is to have a basic overview of some of the most popular Natural Language Processing algorithms there are at the moment.
In this article I've compiled a small list of some of the most popular NLP algorithms you will encounter when you first begin studying Natural Language Processing. For each item on the list we will have a basic overview and some further reading directions.
- Lemmatization and stemming
- Word clouds
- Keywords extraction
- Named Entity Recognition
- Topic Modelling
- Knowledge graphs
Lemmatization and stemming
Lemmatization and stemming are two techniques that allow us to build Natural Language Processing tasks that work well with multiple morphological variations of the same word.
These are two techniques that allow us to reduce variations of a single word to a single root. For example, we would reduce "singer", "singing", "sing", "sang", "sung" to a single form, "sing". If we do this for all words in a document or a text corpus, we are able to reduce our data space and build more stable NLP algorithms.
Lemmatization and stemming are preprocessing techniques, meaning we can apply one of these two algorithms before we actually begin an NLP project, so that we can clean up our data and prepare our dataset.
Lemmatization and stemming are two different techniques and each of them can be done in many ways, but the basic end effect is the same: a reduced search space for our problem.
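To make the idea concrete, here is a minimal sketch of suffix-stripping, the basic mechanism behind stemming. The suffix list and the `naive_stem` helper are invented for illustration; a real project would use a proper implementation such as NLTK's Porter stemmer.

```python
# Toy suffix-stripping stemmer -- a simplified sketch of the idea behind
# stemming, not a replacement for production tools like NLTK's PorterStemmer.
SUFFIXES = ("ing", "ers", "er", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["singer", "singing", "sings"]
print([naive_stem(w) for w in words])  # ['sing', 'sing', 'sing']
```

Note that rule-based stemming cannot handle irregular forms like "sang" or "sung"; reducing those to "sing" requires lemmatization, which relies on a vocabulary and morphological analysis rather than simple suffix rules.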
Word clouds

A word cloud or tag cloud represents a data visualization technique. Words from a text are displayed in a chart, with the more important words being written with bigger fonts, while less important words are displayed with smaller fonts or not displayed at all.
We can use word clouds to understand our data before applying other NLP techniques or algorithms to our dataset. I've used word clouds in an article where we analyzed the most popular HackerNews posts.
If you read that article, you can see that word clouds offered us information about the most popular topics on the site at that time.
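Under the hood, a word cloud is driven by word frequencies. The sketch below computes them with only the standard library; the stop-word list is a small invented example, not an exhaustive one.

```python
from collections import Counter
import re

# A word cloud is essentially a visualization of word frequencies, so the
# first step is counting words. Stop words carry little meaning and are
# usually filtered out before plotting.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}

def word_frequencies(text):
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

text = "The cat sat on the mat and the cat slept."
freqs = word_frequencies(text)
print(freqs.most_common(2))  # [('cat', 2), ...]
```

Dedicated libraries such as the `wordcloud` Python package can then take a frequency mapping like this and render the actual image.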
Keywords extraction

Keywords extraction is one of the most important tasks in the Natural Language Processing field and is responsible for finding ways to extract the most important words and phrases from a given text or a collection of texts. This is done in order to help us summarize, organise, store, search and retrieve content in a meaningful and efficient way.
We already have a large number of keywords extraction algorithms available, and each applies a different set of principles and theoretical approaches to this problem. We have algorithms that extract only words and algorithms that extract words and phrases. We have algorithms that focus only on one text and algorithms that extract keywords based on a whole collection of texts.
The most popular keywords extraction algorithms out there are:
- TextRank: works on the same principle behind the PageRank algorithm, by which Google assigns importance to different web pages on the Internet
- TF-IDF: Term Frequency - Inverse Document Frequency aims to better define how important a word is for a document, while also taking into account the relation to other documents from the same corpus.
- RAKE: Rapid Automatic Keywords Extraction falls into the category of algorithms that can extract keywords and keyphrases based only on the text of one document, without the need to consider other documents in the same collection.
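As a taste of how corpus-aware scoring works, here is a bare-bones TF-IDF sketch over a tiny invented corpus. It is an illustrative toy, not the smoothed, normalized formulation you would get from a library such as scikit-learn's TfidfVectorizer.

```python
import math

# Minimal TF-IDF sketch: score each word in a document by its frequency
# there (TF), weighted by how rare it is across the corpus (IDF).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are popular pets",
]
docs = [doc.split() for doc in corpus]

def tf_idf(word, doc_index):
    doc = docs[doc_index]
    tf = doc.count(word) / len(doc)              # term frequency in this doc
    df = sum(1 for d in docs if word in d)       # documents containing the word
    idf = math.log(len(docs) / df)               # rarer words get a higher weight
    return tf * idf

# "cat" appears in only one document, so it outranks "sat",
# which appears in two documents.
print(tf_idf("cat", 0) > tf_idf("sat", 0))  # True
```

The intuition carries over directly to keyword extraction: words with high TF-IDF scores are good keyword candidates for their document.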
Further reading:
- Automated Python Keywords Extraction: TextRank vs Rake
- TF-IDF Explained And Python Sklearn Implementation
- Python Keywords Extraction using TextRank
Named Entity Recognition
Named Entity Recognition is another very important technique in the Natural Language Processing space. It is responsible for identifying entities in an unstructured text and assigning them to a list of predefined categories: persons, organisations, dates, money and so on.
Named Entity Recognition actually consists of two substeps: Named Entity Identification (identifying potential candidates for the NER algorithm) and Named Entity Classification (actually assigning the candidates to one of the predefined categories).
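The two substeps can be illustrated with a deliberately simple toy: identification via capitalization, classification via a hand-made lookup table (a "gazetteer"). Both the rule and the table below are invented for illustration; real NER systems use trained statistical models, such as those shipped with spaCy.

```python
import re

# Toy illustration of the two NER substeps. Real systems replace both
# heuristics with learned models; the gazetteer here is invented.
GAZETTEER = {
    "Google": "ORGANISATION",
    "London": "LOCATION",
    "Alice": "PERSON",
}

def toy_ner(text):
    # Step 1: Named Entity Identification -- capitalized words are candidates.
    candidates = re.findall(r"\b[A-Z][a-z]+\b", text)
    # Step 2: Named Entity Classification -- look each candidate up.
    return [(c, GAZETTEER.get(c, "UNKNOWN")) for c in candidates]

print(toy_ner("Alice moved to London to work for Google."))
# [('Alice', 'PERSON'), ('London', 'LOCATION'), ('Google', 'ORGANISATION')]
```

The toy already shows why the split matters: identification decides *where* an entity might be, classification decides *what* it is, and errors in either step propagate to the final result.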
Topic Modelling

Topic Modelling is an NLP task where we try to discover "abstract topics" that can describe a collection of documents. This means we have a collection of texts and we try to find patterns of words and phrases that can help us cluster the documents and group them by "topics".
One of the most popular algorithms for Topic Modelling is Latent Dirichlet Allocation (LDA). For this algorithm to work, you need to establish a predefined number of topics to which your collection of documents can be assigned.
At first you assign every text in your dataset to a random topic, and then you go over the collection multiple times, refining your model and reassigning documents to different topics.
This is done by measuring two statistics:
- the probability that a certain document belongs to a certain topic; this is based on how many words (except the current word) from this document belong to the topic of the current word
- the proportion of documents that are assigned to the topic of the current word because of the current word.
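The bookkeeping behind these two statistics can be sketched with a hand-made random topic assignment over two tiny invented documents. This is only an illustration of what gets counted at each reassignment step, not a full Gibbs sampler.

```python
# Each document is a list of (word, assigned_topic) pairs -- a made-up
# snapshot of the random assignment LDA starts from.
documents = [
    [("space", 0), ("rocket", 0), ("league", 1)],
    [("goal", 1), ("league", 1), ("rocket", 0)],
]

def doc_topic_proportion(doc, topic, skip_index):
    """Share of the document's other words currently assigned to `topic`."""
    others = [t for i, (_, t) in enumerate(doc) if i != skip_index]
    return others.count(topic) / len(others)

def word_topic_proportion(word, topic):
    """Share of corpus-wide assignments to `topic` that come from `word`."""
    assigned = [w for doc in documents for (w, t) in doc if t == topic]
    return assigned.count(word) / len(assigned)

# Considering reassigning the word "rocket" at position 1 of document 0:
print(doc_topic_proportion(documents[0], 0, skip_index=1))  # 0.5
print(word_topic_proportion("rocket", 0))
```

Multiplying the two quantities (with smoothing priors added) gives the score LDA uses to decide which topic the current word should be reassigned to on each pass.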
Further reading:
- Latent Dirichlet Allocation For Topic Modelling Explained: Algorithm And Python Scikit-Learn Implementation
Knowledge graphs

Knowledge graphs represent a method of storing information by means of triples - a set of three items: a subject, a predicate and an object.
Knowledge graphs belong to the category of information extraction techniques - obtaining structured information from unstructured texts.
Knowledge graphs have been immensely popular lately, especially because many companies (think, for example, of the Google Knowledge Graph) use them for various products and services.
Building a knowledge graph requires a large variety of NLP techniques (possibly every technique mentioned in this article), and using more of these techniques will likely help you build a more complete and powerful knowledge graph.
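The triple structure itself is simple enough to sketch in a few lines. The facts below are invented examples, and the `query` helper is a minimal stand-in for the pattern-matching queries a real graph store would offer.

```python
# Minimal sketch of a knowledge graph as a set of
# (subject, predicate, object) triples, with a simple query helper.
triples = {
    ("London", "is_capital_of", "United Kingdom"),
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "profession", "mathematician"),
}

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the fields that are not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# All facts about Ada Lovelace:
print(query(subject="Ada Lovelace"))
```

The hard part of building a knowledge graph is not this storage layer but filling it: NER finds the candidate subjects and objects, and other extraction techniques recover the predicates linking them.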
In this article we took a quick look at some of the most beginner-friendly Natural Language Processing algorithms and techniques. I hope it helped you figure out where to start if you want to study Natural Language Processing.
Thank you so much for reading this! Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.