What Is Natural Language Processing? A Gentle Introduction to NLP

Natural Language Processing, or NLP, is a subfield of Artificial Intelligence research focused on developing models and points of interaction between humans and computers based on natural language. This includes text as well as speech-based systems.

Computer scientists and researchers have studied this topic for decades, but only recently has it become a hot topic again, thanks to breakthroughs in the research community.

With that said, here's what you'll learn by the end of this article.

A basic overview of how Natural Language Processing works
Who uses Natural Language Processing and for what kind of problems
The challenges of using Natural Language Processing for your app or your business
Some basic tools that can get you started with Natural Language Processing

How Natural Language Processing works

A generally accepted truth in computer science is that every complex problem becomes easier to solve if we break it into smaller pieces. That is especially true in the Artificial Intelligence field. For a given problem, we build several small, highly specialized components, each good at solving one and only one problem. We then line up all these components, pass our input through each of them in turn, and get our output at the end of the line. This is what we call a pipeline.
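The pipeline idea can be sketched in a few lines of Python. This is only a minimal illustration: the three stage functions and run_pipeline are made-up names for this example, and real NLP stages are far more involved.

```python
from functools import reduce
from typing import Callable, List

# Each stage is a small function with exactly one job.
# All names below are illustrative, not part of any library.
def lowercase(text: str) -> str:
    return text.lower()

def split_words(text: str) -> List[str]:
    return text.split()

def drop_short_words(words: List[str]) -> List[str]:
    return [w for w in words if len(w) > 2]

def run_pipeline(data, stages: List[Callable]):
    # Feed the output of each stage into the next one, in order.
    return reduce(lambda value, stage: stage(value), stages, data)

stages = [lowercase, split_words, drop_short_words]
result = run_pipeline("NLP is a pipeline of small steps", stages)
print(result)
```

Because each stage only depends on the output of the previous one, a stage can be swapped for a better implementation without touching the rest of the line.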

In the NLP context, a basic problem would be for the computer to understand the exact meaning of a given paragraph and then possibly act on it. For this to work, we need to go through a few steps.

Sentence boundary segmentation

For a given text, we need to correctly identify every sentence, so that each resulting sentence can have its meaning extracted in the next steps. Extracting the meaning of every sentence and then putting it all together seems to be more accurate than trying to identify the meaning of the whole text at once. After all, when we speak (or write) we don't necessarily mean only one thing. We often pack several ideas into one statement, and the beauty of natural language (and the curse for NLP) is that we actually can.

A naive approach would be to search for periods in a chunk of text and treat each one as the end of a sentence. The problem is that periods are also used for other purposes (for example, abbreviations), so in practice machine learning models are trained to correctly identify the punctuation marks that end sentences.
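To see why the naive approach breaks, here is a small sketch that compares it with a hand-written rule. Note that this is not the machine learning approach mentioned above: ABBREVIATIONS is a hypothetical, hand-picked list and split_sentences is an illustrative helper, used only to show the problem.

```python
import re

text = "Dr. Smith moved to the U.S. in 2010. He works on NLP."

# Naive approach: every period followed by whitespace ends a sentence.
naive = [s for s in re.split(r"(?<=\.)\s+", text) if s]

# Hypothetical, hand-picked abbreviation list for this example only.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "U.S.", "e.g.", "i.e."}

def split_sentences(text):
    # Walk through whitespace-separated tokens and close a sentence only
    # when a token ends with ., ! or ? and is not a known abbreviation.
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(naive)                  # splits after "Dr." and "U.S." as well
print(split_sentences(text))  # keeps the abbreviations inside the sentence
```

The rule still fails when a sentence really does end with an abbreviation such as "U.S.", which is one reason trained models are preferred in practice.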

Word tokenization

This step takes a sentence from the previous step and breaks it into a list of all the words (and punctuation marks) it contains. The list is used in the next steps to analyze each word.
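A minimal tokenizer can be sketched with a regular expression from Python's standard library. The tokenize function below is an illustrative helper, and the pattern is an assumption that works for simple English sentences; library tokenizers handle many more edge cases.

```python
import re

def tokenize(sentence):
    # \w+(?:'\w+)? matches a word, optionally with an apostrophe part ("don't");
    # [^\w\s] matches any single punctuation mark.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

tokens = tokenize("We don't mean only one thing, do we?")
print(tokens)
```

Punctuation marks come out as separate tokens, so later steps can treat them like any other item in the list.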

Part of Speech Tagging

This step takes each word from the previous step and classifies it according to the part of speech it represents. This is an essential step for identifying the meaning behind a text. Identifying the nouns allows us to figure out who or what the given text is about. The verbs and adjectives then let us understand what those entities do, how they are described, or any other meaning we can get from a text. PoS tagging is a difficult problem, but it has mostly been solved, and implementations can be found in most modern Machine Learning libraries and tools.
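As a toy illustration of what the output of this step looks like, the sketch below tags words by looking them up in a tiny hand-written lexicon. This is not how real taggers work (they learn from annotated text and use the surrounding context); LEXICON and tag are hypothetical names, and the tag labels are assumed to follow the common DET/ADJ/VERB/ADP/NOUN convention.

```python
# Hypothetical, hand-written lexicon; real taggers learn this from annotated data.
LEXICON = {
    "the": "DET", "a": "DET",
    "quick": "ADJ", "lazy": "ADJ",
    "jumps": "VERB", "sleeps": "VERB",
    "over": "ADP",
}

def tag(tokens):
    # Words missing from the lexicon default to NOUN, a simple baseline guess.
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]

tagged = tag(["The", "quick", "fox", "jumps", "over", "the", "lazy", "dog"])

# Nouns tell us who or what the text is about; verbs tell us what they do.
nouns = [tok for tok, pos in tagged if pos == "NOUN"]
verbs = [tok for tok, pos in tagged if pos == "VERB"]
print(nouns, verbs)
```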

Named Entity Recognition - NER

Named Entity Recognition refers to identifying the names in a sentence and correctly classifying them against a list of predefined categories. Such categories may include Persons, Organisations, Locations, Time, Quantities and so on. Lists of categories may be tailored to your own particular use case, but in general, almost everybody needs at least these categories to be correctly identified.

There are many implementations for this kind of problem, and recently built models have achieved near-human performance. Basically, this step is divided into two subtasks: correctly identifying the names in a sentence and then classifying each name according to your list of categories.
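The two subtasks can be made concrete with a deliberately simple sketch. It assumes that names are runs of capitalized words and classifies them with a hypothetical hand-made lookup table (GAZETTEER); the models mentioned above are trained on labelled data instead, and none of these names come from a real library.

```python
import re

# Hypothetical lookup table for this example only.
GAZETTEER = {
    "Marie Curie": "PERSON",
    "Paris": "LOCATION",
    "Sorbonne": "ORGANISATION",
}

def find_candidates(sentence):
    # Subtask 1: treat runs of capitalized words as candidate names.
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", sentence)

def classify(candidates):
    # Subtask 2: look each candidate up; anything not listed is "UNKNOWN".
    return [(name, GAZETTEER.get(name, "UNKNOWN")) for name in candidates]

sentence = "Marie Curie taught at the Sorbonne in Paris."
entities = classify(find_candidates(sentence))
print(entities)
```

The capitalization rule would also pick up ordinary words at the start of a sentence, which hints at why the first subtask is harder than it looks.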

There are of course many other tasks that NLP solves in order to be used in real-world applications, but the next steps are tailored to each use case and each business's needs. With all that said, the steps presented above are the basic steps in almost every use case we can think of.

How Natural Language Processing is used

NLP is used in a wide variety of software, and many use cases have been identified as solvable by deploying NLP models. Some examples are:

Virtual assistants: Most modern, powerful personal assistants employ a large suite of NLP techniques in order to help users accomplish their tasks. Siri, the Google Assistant and many others have become very efficient and highly skilled at helping their users by using the latest breakthroughs in the NLP field. The companies behind them invest large amounts of resources in further research in the race to develop the perfect virtual assistant. While until a few years ago these assistants were more of a novelty than a useful tool, nowadays millions of people use them and benefit from what they offer.
Machine translation: Have you noticed that in recent years Google Translate has become more accurate in translating texts for you? This is thanks to the latest advances in research in this field, with Google Translate powering machine translation in more than a hundred languages for hundreds of millions of users.
Speech To Text: In the fast-paced society we live in today, we often don't have time to record in writing everything that we discuss, be it business notes, phone calls, speeches and so on. There are quite a few startups nowadays that help with this kind of problem, taking sound as input and providing us with the text, which we can then act on.
Information extraction and knowledge graphs: Unstructured information is difficult to handle and use, so we can use intelligently built information extraction techniques, particularly knowledge graphs, to extract information and have it ready to be used.
Word embeddings: Converting natural language text into vector representations that also carry context and can help us identify words that are similar to each other.
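To make the word embeddings item more concrete, here is a small sketch of how vectors let us measure word similarity. The three-dimensional vectors are made up for illustration (real embeddings are learned from large corpora and usually have hundreds of dimensions), and most_similar is a hypothetical helper; cosine similarity itself is a standard way to compare such vectors.

```python
import math

# Made-up vectors for illustration only; real embeddings are learned.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means similar direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word):
    # Return the other word whose vector is closest to the given word's vector.
    others = [w for w in EMBEDDINGS if w != word]
    return max(others, key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]))

print(most_similar("king"))
```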

Challenges of using Natural Language Processing

There are, of course, quite a few challenges when using NLP techniques. The most important are related to extracting context from the text. Humans are very good at understanding the context of a sentence, but computers only use statistical models to represent these sentences, so it is very hard for them to understand exactly what we mean by our words.

For example, when we say "bush" we may refer to the plant or to the former US President. Telling the difference may be very easy for you, but a computer has to go through several steps of the pipeline before deciding which meaning you are using.
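One simple way to use context for this is to count which sense shares more words with the sentence. The sketch below is a simplified word-overlap heuristic, not a production technique: SENSES holds hypothetical cue words picked by hand, and disambiguate is an illustrative helper.

```python
# Hypothetical cue words for each sense of "bush", chosen by hand for this example.
SENSES = {
    "plant": {"garden", "leaves", "green", "grow", "tree", "flowers"},
    "president": {"president", "white", "house", "election", "texas", "administration"},
}

def disambiguate(sentence):
    # Normalize the sentence into a set of lowercase words without punctuation.
    words = set(sentence.lower().replace(".", "").replace(",", "").split())
    # Score each sense by how many of its cue words appear in the sentence.
    scores = {sense: len(cues & words) for sense, cues in SENSES.items()}
    # On a tie, max() returns the first sense listed in SENSES.
    return max(scores, key=scores.get)

print(disambiguate("The bush in our garden has green leaves."))
print(disambiguate("Bush won the election and moved into the White House."))
```

With no cue words at all in the sentence, the function simply falls back to the first sense, which shows how little a computer has to go on without context.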

A further challenge of NLP is related to the first one and concerns finding enough data to train our model. As with any other Machine Learning model, training an NLP model takes a lot of data and a lot of time, meaning only a handful of large companies have the resources to build truly powerful applications involving NLP. Other, smaller companies employ machine learning models that are highly specialized, meaning they solve only a subset of all the NLP problems, for which they need considerably less data.

Having considered all of this, it is important at the end of the day that we correctly identify the value that an NLP model brings to our business or our app. We need to see whether the model we've managed to build can truly help us and our customers, or whether it's just a fancy feature that lets us say we are a machine learning company.

Natural Language Processing tools

There are quite a few libraries available for developers who want to start learning about and developing NLP models. The good thing is that most of them are open source.

NLTK - Natural Language Toolkit is a Python library which is mainly used in research and education. It includes many steps in the pipeline with models which are ready to test and use. You can get started with it on their website.
Stanford CoreNLP is a Java library, also usable from Python through wrappers, which provides a wide range of tools for understanding natural language. You can see a few examples here.
spaCy is another Python library which employs machine learning and deep learning to help you with a lot of powerful features. Their website is here.

Choosing the right library for your project can be a challenging task. It all depends on your needs, and you need to take your time to explore each library's strengths and weaknesses and see which one gives the best results. Of course, there are problems for which the libraries give pretty much the same results. For example, here's a comparison between various Python libraries for stemming and lemmatization.

Summary

In this article, we've made a gentle introduction to the field of Natural Language Processing. We've discussed a few of the basic steps for building NLP pipelines, and then we saw who is using Natural Language Processing, and how, to enhance their business. We then took a look at the main challenges of using NLP and some of the best tools we can use to learn about NLP and possibly use it in our apps.

Interested in more? Follow me on Twitter at @b_dmarius, where I'll post every new article.

Marius Borcan