Word embeddings are the state-of-the-art way of representing natural human language in a form that computers can understand and process. They are the starting point for most of the more important and complex tasks in Natural Language Processing.
In this article we are going to take an in-depth look at how word embeddings, and especially Word2Vec embeddings, are created and used. There will be some code along the way, but everything will be explained in detail. As usual on this blog, we will first go through a theoretical overview and then jump into the practical part. But feel free to jump to any section you need.
Interested in more? Follow me on Twitter at @b_dmarius and I'll post there every new article.
- What are word embeddings
- Word embeddings applications
- Word2Vec explained
- Word2Vec python implementation using Gensim
- Word embeddings visualization
- Related articles
What are word embeddings
Word embeddings exist to help computers understand human language. Computers are famously good at dealing with numbers but legendarily bad at dealing with words and sentences.
If for you words like "dog" and "city" mean something and carry a context, a computer can't tell you much about these words beyond the space they take up in memory and how they sort alphabetically. But of course there are many Natural Language Processing tasks out there, and for many of them we need to make computers, well, understand natural language. Let's take a look at an example. Let's say we have three sentences:
- I have a cat and a dog.
- My dog is a lovely animal.
- I am a programmer.
Let's say these are all the words that I want my computer to ever understand. I can build a text corpus out of my three sentences, like this: "i have a cat and dog my is lovely animal am programmer". This would be the dictionary of my application. So how can I represent the words?
For a given word from the dictionary, I can create a vector with 12 elements (because my dictionary has 12 words) and assign 1 to the element whose index matches the word's position in the dictionary. For example "cat" is the 4th word in my dictionary, so my word representation would be [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]. This technique is called one-hot encoding and it gives us sparse vector representations.
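To make this concrete, here is a tiny sketch of one-hot encoding in Python, using the 12-word dictionary above (the helper name is mine):

```python
# The 12-word dictionary built from our three example sentences.
dictionary = "i have a cat and dog my is lovely animal am programmer".split()

def one_hot(word):
    # A vector of zeros with a single 1 at the word's dictionary index.
    vector = [0] * len(dictionary)
    vector[dictionary.index(word)] = 1
    return vector

print(one_hot("cat"))  # → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```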
That might work well (spoiler 1: it doesn't), but look at how much wasted space I have here. To represent one word I had to write one "1" and 11 "0" elements. Try to imagine how a real-life, big dictionary would look and what a waste of space that would be. Plus, finding a word in this representation takes longer and longer as the dictionary grows.
What's even worse is that these word representations don't help me much beyond translating text from words to numbers and vice-versa. But I would really love to be able to find out whether two words are somehow related to each other (spoiler 2: word embeddings will help me do that). With the one-hot encoding technique this is not possible, as any two one-hot vectors are totally independent.
So naturally we need to come up with better ways of building these vector representations. And this is what word embeddings techniques do: they try to build word representation models that have characteristics like:
- they are dense - meaning there is not much wasted space in a representation
- they can capture context
- they can capture word relations - meaning words that humans find similar should also have similar word embeddings (so similar values in their word vectors).
So now we're ready to provide a short, formal definition about word embeddings.
Word embeddings are dense vector representations for words that are unique for each word and can also capture meaning and how words relate to each other.
There are quite a few techniques for building word embeddings, but for the purpose of this article we are going to focus on Word2Vec.
Word embeddings applications
As previously mentioned, word embeddings are primarily used to learn how words in a text relate to each other. You might have already seen the most famous example, but I will add it here just for you to take a second to look at it and fully understand how important it is.
Mikolov et al. wrote a paper which is the foundation for what we know as Word2Vec today. They found that using this vector representation we can model relationships like:
Vec("king") - Vec("man") + Vec("woman") = Vec("queen")
So what does this mean? It means that just by analyzing large amounts of text we can figure out such relationships: by subtracting the vector representation of the word man from that of the word king and adding the vector representation of woman, we get very close to the vector representation of queen.
This is a very powerful connection because you can see that if we have plain words, there is no actual logical or numerical relationship between them. But of course, given a context, there is some semantics and some underlying relationship, because a man is to a king what a woman is to a queen. And that shows us how powerful word embeddings are, because we can provide such knowledge to our computers and that can be extracted from plain text.
Given this assumption, we can think of some pretty solid usages for word embeddings:
- Spotify famously used a technique based on word2vec to analyze music lyrics in order to provide better recommendations for users.
- Word embeddings are used as a starting point for modelling text in various deep learning and NLP tasks. They are used to build features for Deep Neural Networks which are involved in NLP tasks like machine translation.
- Word embeddings can also be used in sentiment analysis. A good vector representation for words will result in positive and negative word clusters and we can use this to sort through large amounts of reviews or discussions in order to mine more data from the text.
- Word embeddings can also be used for tasks like word prediction. Think of the autocomplete feature on your phone or the message suggestions you get when you write an email in Gmail.
- Word embeddings are also used for document retrieval tasks. By measuring the vector distance between the user query and the documents in your database, you can provide accurate search results when full text search does not do the job.
Word2Vec is just one implementation of word embeddings algorithms that uses a neural network to calculate the word vectors. The beauty of this model is that the neural network used to calculate the vector representations is just a 3-layer neural network, and eventually we will not even need the entire network 😀 - you will see more about that in a second.
Word2Vec comes with two different learning models and depending on your needs, one might work better than the other.
- CBOW - Continuous Bag Of Words is a learning model in which the neural network tries to predict one word given a context (made up of surrounding words).
- Skip-Gram model is a learning model in which the neural network tries to predict a context (surrounding words) based on a given word.
You can see in the image above that both neural architectures mirror each other.
Before diving into our learning models we first need to define an important term, the window. This is just a number that helps us define the boundaries of the context around a word.
For example, let's take this sentence: "I am studying word embeddings". Choosing a context window of 1 for the word studying would mean defining the context of our word as (am, word), but choosing a context window of 2 would mean that our context is (I, am, word, embeddings).
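Extracting such a context window can be sketched in a few lines of Python (the helper name is mine):

```python
def context(words, index, window):
    # Surrounding words within `window` positions of the target, excluding
    # the target word itself.
    return words[max(0, index - window):index] + words[index + 1:index + 1 + window]

sentence = "I am studying word embeddings".split()
print(context(sentence, 2, 1))  # → ['am', 'word']
print(context(sentence, 2, 2))  # → ['I', 'am', 'word', 'embeddings']
```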
Now let's dive into each learning model to understand the differences.
So we established the fact that Word2Vec uses neural networks to calculate vector representations. We also saw that the neural networks used for both CBOW and Skip-Gram have 1 input layer, 1 hidden layer and 1 output layer.
Specifically, for Skip-Gram:
- The input layer is the one-hot encoding representation of the target word.
- The output layer is a vector of probabilities. Every item in the output layer is the probability that one word (different from the target word) is located somewhere in a random position around the target word but inside the context window.
- The hidden layer - and now comes the fun part - will contain, at the end of the training, exactly our word embeddings. After the neural network is trained using backpropagation, we can drop the input and the output layer, because the hidden layer is all we need.
The Word2Vec CBOW learning model works basically the same, only that the neural architecture is reversed, as you could see in the image above.
So - CBOW vs Skip-Gram - who's the winner? Eventually, both learning models will give you the same kind of result: word embeddings where words that are more related to each other have more similar vector representations. So you can opt for either of them. One tip I can give you is actually one that Tomas Mikolov gave: Skip-Gram seems to work better with smaller datasets and is usually better at representing rare words.
Word2Vec python implementation using Gensim
Ok, so now that we have a small theoretical context in place, let's use Gensim to write a small Word2Vec implementation on a dummy dataset.
We will download 10 Wikipedia texts (5 related to capital cities and 5 related to famous books) and use that as a dataset in order to see how Word2Vec works.
Let's first install some dependencies.
pip3 install wikipedia
pip3 install nltk
pip3 install gensim
pip3 install scikit-learn
pip3 install matplotlib
Here's what these packages are going to be used for:
- wikipedia - download the texts from Wikipedia
- nltk - to split the text we download from Wikipedia into sentences
- gensim - for its great Word2Vec implementation
- scikit-learn - use Principal Component Analysis for visualization purposes
- matplotlib - for visualization
Let's first write the code to download the text. For this I created a text_extractor.py file and put it in my workspace.
The wikipedia package that we are using today usually requires only the title of an English page. But sometimes it gets confused between similarly titled pages, so that's why I've also included the pageId field of each article. To get the pageId of a Wikipedia article, you need to go to Wikidata and search for the article there. The page id will be found in brackets after the title of the result.
The logic is simple here, we just check if we have already downloaded the text, and if not, we get the text and write it in a file in a /text directory. To get the text, we just read that file.
Because we are getting our text from multiple sources, I created a pipeline of text extractors, so that we can first download all the text we need and then combine it into one big corpus.
For this purpose I've created a text_extractor_pipe.py with these contents.
Very simple logic, just iterate through all our extractors and append the content to the corpus.
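A sketch of that pipeline could look like this (class and method names are my assumptions):

```python
# text_extractor_pipe.py (sketch - names are assumptions)
class TextExtractorPipe:
    def __init__(self):
        self.extractors = []

    def addTextExtractor(self, extractor):
        self.extractors.append(extractor)

    def extract(self):
        # Iterate through all our extractors and append their content
        # to one big corpus.
        corpus = ""
        for extractor in self.extractors:
            corpus += extractor.getText() + " "
        return corpus
```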
Now comes the part where we use Gensim for Word2Vec implementation. It's only a few lines a code that use the Gensim API. For this I've created a word2vec.py file.
Gensim requires sentences as an input to the Word2Vec model so that's why in the main file of this project I'm building the text corpus and then I'm splitting it into sentences.
For this, let's take a look at wordembeddings.py
Now it's time to play with our results.
I should note that you might get results different from mine, and even different results each time you run this, because the training process is not deterministic.
Another note: there are lots of parameters that you can play with to improve these results. For this, you should take a look at the gensim documentation and try for yourself - I promise you, this is the most fun part. You should also try with different datasets and in general just experiment.
But let's see some examples.
See how word embeddings look under the hood
print (word2vec.getEmbedding("city"))

[-0.20562008 0.17258668 0.1832738 -0.11458754 0.08489548 -0.25855815 0.13310282 0.19988851 0.22059454 0.18415321 -0.03709813 0.22998075 0.01270844 -0.17245574 0.06129353 -0.07025457 -0.09331453 0.07606903 0.02120633 0.07070243 -0.20904864 0.10624863 0.13803634 0.09852546 0.27054724 0.09279028 0.18003193 -0.18806095 0.1332284 -0.08513955 0.05531044 0.28267217 -0.29905584 0.23347591 0.14874795 -0.08179035 0.11734431 -0.2493748 -0.09980859 -0.00310389 -0.08026568 -0.00959793 0.10784302 -0.08171367 0.0721353 -0.18769109 -0.13068072 0.04155793 0.13697234 0.00711478 0.06430514 0.05139609 0.22102095 -0.13518322 0.03994606 -0.08874794 0.32076737 0.06737606 -0.16174039 0.21226534 0.05170748 -0.04285322 0.01905769 0.1830514 -0.007 0.10958461 0.01621384 -0.23236032 0.07860104 -0.0975527 0.10622834 0.2301385 -0.19103985 0.03487451 -0.06843872 0.0242951 -0.12540959 -0.06467469 -0.07324062 0.03606528 -0.07536016 0.09618145 0.42312497 -0.01939596 -0.30808946 0.01964926 -0.12507571 -0.08399792 0.0436103 -0.24132213 -0.05179053 0.06024555 0.05517503 -0.06595844 0.22122827 0.04599078 0.04652854 -0.17003839 0.07286777 0.16775317]
Getting the most similar words
print (word2vec.mostSimilar("city"))

[('world', 0.9982208013534546), ('most', 0.9976677298545837), ('in', 0.9975283145904541), ('built', 0.9975059032440186), ('its', 0.9974948167800903), ('over', 0.9970438480377197), ('this', 0.9969756007194519), ('It', 0.996961236000061), ('where', 0.9969601035118103), ('second', 0.9967144131660461)]

print (word2vec.mostSimilar("London"))

[('Bucharest', 0.9970665574073792), ('Berlin', 0.9969276189804077), ('The', 0.9929461479187012), ('Madrid', 0.9924062490463257), ('City', 0.9923746585845947), ('Paris', 0.9910227656364441), ('is', 0.9897833466529846), ('people', 0.9881220459938049), ('history', 0.9875545501708984), ('government', 0.9870220422744751)]

print (word2vec.mostSimilar("published"))

[('This', 0.9991827011108398), ('novels', 0.9991753697395325), ('2015', 0.9987846612930298), ('stories', 0.9986273646354675), ('number', 0.9985889196395874), ('since', 0.9985468983650208), ('made', 0.998454749584198), ('Fire', 0.998334527015686), ('work', 0.9983222484588623), ('place', 0.9983134269714355)]
Get the similarity between two words
print (word2vec.getSimilarity("London", "Berlin"))

0.9963725
Word embeddings visualization
For this we will choose ten random words from our dataset, then pick other words from the dataset that are close to each of those ten. We will then plot the data and see how similar words end up displayed closer together than the others (we'll use different colors to make the clusters easier to spot).
For this let's add some new imports in our word2vec.py file.
And now let's add this method to the same word2vec.py file.
Last thing we need to do is add this line in wordembeddings.py.
As I mentioned earlier you will get different results each time you run this because of the random nature of the method.
In the image above I find the yellow, purple and green clusters very interesting, you can clearly see that the words in each of the clusters are similar. Another interesting cluster is the one in the upper-left corner, where only stopwords have been included. We could have removed them by doing some better hyperparameter tuning.
Throughout my articles I usually make references to other articles on this blog; I'll also add them here for ease of reference, in case you want to check them out.
Word embeddings are very fun to play with and very useful in many Natural Language Processing tasks. In this article we've played with Word2Vec - we first dived into some theoretical aspects and then we used a Gensim implementation to interact with Word2Vec word embeddings and visualize them.
Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). Efficient Estimation of Word Representations in Vector Space. pp. 1-12.