A word is defined by the company it keeps. That’s the premise behind Word2Vec, a method of converting words to numbers and representing them in a multi-dimensional space. Words frequently found close together in a collection of documents (corpus) will also appear close together in this space. They are said to be related contextually.
Word2Vec is a method of machine learning that requires a corpus and proper training. The quality of both affects its ability to model a topic accurately. Any shortcomings become readily apparent when examining the output for very specific and complicated topics as these are the most difficult to model precisely. Word2Vec can be used by itself, although it is frequently combined with other modeling techniques to address its limitations.
The rest of this article provides additional background on Word2Vec, how it works, how it’s used in topic modeling, and some of the challenges it presents.
What is Word2Vec?
In September 2013, Google researchers Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean published the paper ‘Efficient Estimation of Word Representations in Vector Space’ (pdf). This is what we now refer to as Word2Vec. The goal of the paper was to “introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary.”
Prior to this point, many natural language processing techniques treated words as atomic units. They did not take into account any similarity between words. While there were valid reasons for this approach, it did have its limitations. There were situations in which scaling these basic techniques could not offer significant improvement. Hence the need for more advanced techniques.
The paper showed that simple models, with their lower computational requirements, could train high-quality word vectors. As the paper concludes, it’s “possible to compute very accurate high dimensional word vectors from a much larger data set.” The authors had in mind document collections (corpora) of up to one trillion words, with a virtually unlimited vocabulary.
Word2Vec is a way of converting words to numbers, in this case vectors, so that similarities may be discovered mathematically. The idea is that vectors of similar words get grouped within the vector space.
Think of the latitudinal and longitudinal coordinates on a map. Using this two-dimensional vector, you can quickly determine whether two locations are relatively close together. For words to be appropriately represented in a vector space, two dimensions don’t suffice. So, vectors need to incorporate many dimensions.
How Does Word2Vec Work?
Word2Vec takes as its input a large text corpus and vectorizes it using a shallow neural net. The output is a list of words (vocabulary), each with a corresponding vector. Words with similar meanings sit close together in the vector space. Mathematically, this closeness is measured by cosine similarity, the cosine of the angle between two vectors. Total similarity corresponds to a 0-degree angle (a cosine of 1), while no similarity corresponds to a 90-degree angle (a cosine of 0).
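To make the cosine measure concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up values for illustration, not output from a trained model, and the helper names are our own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between two vectors, in degrees."""
    # Clamp to [-1, 1] so floating-point drift cannot break acos.
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))

# Hypothetical toy vectors.
same_direction = angle_degrees([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # about 0 degrees
orthogonal = angle_degrees([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # 90 degrees
```

Note that the measure ignores vector length: the second vector in the first pair is twice as long as the first, yet the two still count as totally similar because they point in the same direction.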
Words can be encoded as vectors using different types of models. In their paper, Mikolov et al. looked at two existing models, the feedforward neural net language model (NNLM) and the recurrent neural net language model (RNNLM). In addition, they proposed two new log-linear models, continuous bag of words (CBOW) and continuous Skip-gram.
In their comparisons, CBOW and Skip-gram performed better, so let’s examine these two models.
CBOW is similar to NNLM and relies on context to determine a target word. It determines the target word based on the words that come before and after it. Mikolov found the best performance occurred with four future and four historical words. It’s called ‘bag of words’ because the order of the words in the history does not influence the output. ‘Continuous’ in the term CBOW refers to its use of “continuous distributed representation of the context.”
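The following sketch shows how CBOW-style training examples could be carved out of a sentence. It only builds the (context, target) pairs and leaves out the neural net itself. The function name and the `window` parameter are our own, and the window is set to two here just to keep the output short.

```python
def cbow_pairs(tokens, window=4):
    """Build (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # historical words
        right = tokens[i + 1:i + 1 + window]   # future words
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

sentence = "a word is defined by the company it keeps".split()
pairs = cbow_pairs(sentence, window=2)

# The pair built around the word "by":
# (["is", "defined", "the", "company"], "by")
middle_pair = pairs[4]
```

In the model, the context vectors are combined into a single representation before the target is predicted, which is why shuffling the four context words would not change the result.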
Skip-gram is the reverse of CBOW. Given a word, it predicts surrounding words within a specific range. A greater range provides for better quality word vectors but increases the computational complexity. Less weight is given to distant terms because they are usually less related to the current word.
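Here is a rough sketch of how Skip-gram training pairs might be generated. To give less weight to distant words, it draws a random reach for each center word, so far-away neighbors are sampled less often than close ones. The function name, the `max_window` parameter, and the fixed random seed are assumptions made for this illustration.

```python
import random

def skipgram_pairs(tokens, max_window=5, rng=None):
    """Build (center_word, context_word) pairs for a Skip-gram-style model."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    pairs = []
    for i, center in enumerate(tokens):
        # A random reach between 1 and max_window: words right next to the
        # center are always included, distant words only some of the time.
        reach = rng.randint(1, max_window)
        lo = max(0, i - reach)
        hi = min(len(tokens), i + reach + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# With max_window=1 the reach is always 1, so the result is deterministic.
tiny = skipgram_pairs(["a", "b", "c"], max_window=1)
```

Raising `max_window` produces more pairs per word, which is where the extra computational cost of a greater range comes from.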
In comparing CBOW to Skip-gram, the latter has been found to offer better quality results on large data sets. Although CBOW is faster, Skip-gram handles infrequently used words better.
During training, a vector is assigned to each word. The components of that vector are adjusted so that similar words (based on their context) end up closer together. Think of this as a tug of war, where words are pushed and pulled around the multi-dimensional space every time the model processes another training example.
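The tug of war can be illustrated with a single, heavily simplified update step. This sketch uses a plain logistic update on one word pair, in the spirit of the negative-sampling variant from later Word2Vec work. It is not the exact training procedure from the paper, and the vectors, learning rate, and function names are all made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgd_step(center, context, label, lr=0.1):
    """One in-place update on a (center, context) pair.

    label=1: the words were seen together, so pull the vectors closer.
    label=0: the pair is a random mismatch, so push the vectors apart.
    """
    error = sigmoid(dot(center, context)) - label
    for k in range(len(center)):
        c, o = center[k], context[k]
        center[k] -= lr * error * o
        context[k] -= lr * error * c

# Hypothetical starting vectors for two words.
avocado = [0.1, -0.2]
guacamole = [0.3, 0.1]

before = dot(avocado, guacamole)
for _ in range(50):
    sgd_step(avocado, guacamole, label=1)  # repeatedly seen together
after = dot(avocado, guacamole)            # larger than before
```

Each step nudges both vectors a little. Over millions of examples, words that share contexts drift toward each other while unrelated words drift apart.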
Mathematical operations, in addition to cosine similarity, can be performed on word vectors. For example, vector(“King”) – vector(“Man”) + vector(“Woman”) results in a vector closest to that representing the word “Queen.”
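The arithmetic can be tried out on a toy vocabulary. The two-dimensional vectors below are hand-picked so that one axis loosely stands for royalty and the other for gender. Real Word2Vec vectors have many more dimensions and no such tidy axes, so treat this purely as an illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Hypothetical vectors: [royalty, gender].
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(positive, negative, vectors):
    """Add the positive word vectors, subtract the negative ones,
    and return the closest remaining word by cosine similarity."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy(["king", "woman"], ["man"], vectors)  # "queen" for these toy vectors
```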
Word2Vec for Topic Modeling
The vocabulary created by Word2Vec can be queried directly to detect relationships between words or fed into a deep-learning neural network. One issue with Word2Vec algorithms like CBOW and Skip-gram is that they weight each word equally. The problem that arises when working with documents is that words don’t equally represent the meaning of a sentence.
Some words are more important than others. Thus, different weighting strategies, such as TF-IDF, are often employed to deal with the situation. This also helps address the hubness problem mentioned in the next section. Searchmetrics ContentExperience uses a combination of TF-IDF and Word2Vec, which you can read about here in our comparison with MarketMuse.
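One common way to combine the two ideas is to average a document’s word vectors using TF-IDF scores as weights. The sketch below shows that general approach under our own assumptions. It is not a description of how any particular product implements it, and the documents, vectors, and function names are invented for the example.

```python
import math
from collections import Counter

def idf_table(docs):
    """Inverse document frequency: log(N / number of docs containing the word)."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n / count) for word, count in df.items()}

def weighted_doc_vector(doc, vectors, idf):
    """TF-IDF-weighted average of the word vectors in one document."""
    tf = Counter(doc)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in tf.items():
        if word not in vectors:
            continue  # skip words the embedding has never seen
        weight = (count / len(doc)) * idf.get(word, 0.0)
        weight_sum += weight
        for k in range(dim):
            total[k] += weight * vectors[word][k]
    if weight_sum == 0.0:
        return total  # nothing informative to average
    return [x / weight_sum for x in total]

docs = [["the", "avocado", "tree"], ["the", "tree", "grows"], ["the", "cat"]]
# Hypothetical 2-D word vectors.
vectors = {"the": [1.0, 0.0], "avocado": [0.0, 1.0], "tree": [0.5, 0.5]}

idf = idf_table(docs)
doc_vec = weighted_doc_vector(docs[0], vectors, idf)
```

Because “the” appears in every document, its IDF is zero and it drops out of the average. The resulting vector leans toward “avocado,” the rarest and most informative word, whereas an unweighted average would sit at [0.5, 0.5].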
While word embeddings like Word2Vec capture morphological, semantic, and syntactic information, topic modeling aims to discover latent semantic structures, or topics, in a corpus.
According to Budhkar and Rudzicz (PDF), combining latent Dirichlet allocation (LDA) with Word2Vec can produce discriminative features to “address the issue caused by the absence of contextual information embedded in these models.” Easier reading on LDA2vec can be found in this DataCamp tutorial.
Challenges of Word2Vec
There are several issues with word embeddings in general, including Word2Vec. We’ll touch on some of these. For a more detailed analysis, refer to ‘A Survey of Word Embedding Evaluation Methods’ (pdf) by Amir Bakarov. The corpus and its size, as well as the training itself, will significantly impact the output quality.
How do you evaluate the output?
As Bakarov explains in his paper, an NLP engineer will typically evaluate the performance of embeddings differently than a computational linguist, or a content marketer for that matter. Here are some additional issues cited in the paper.
- Semantics is a vague idea. A “good” word embedding reflects our notion of semantics. However, we may not be aware of whether our understanding is correct. Also, words have different types of relations like semantic relatedness and semantic similarity. Which kind of relationship should the word embedding reflect?
- Lack of proper training data. When training word embeddings, researchers frequently increase their quality by adjusting them to the data. This is what we refer to as curve fitting. Instead of making the result fit the data, researchers should try to capture the relationships between words.
- The absence of correlation between intrinsic and extrinsic methods means it’s unclear which class of method is preferred. Extrinsic evaluation determines the output quality for use further downstream in other natural language processing tasks. Intrinsic evaluation relies on human judgment of word relations.
- The hubness problem. Hubs, word vectors representing common words, are close to an excessive number of other word vectors. This noise may bias the evaluation.
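The hubness problem from the list above can be made visible by counting how often each word shows up among the nearest neighbors of other words. The sketch below uses invented two-dimensional vectors in which “thing” plays the role of a common, generic word. The function name and the choice of k are our own assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def k_occurrence(vectors, k):
    """Count how often each word appears in other words' top-k neighbor lists."""
    counts = Counter({word: 0 for word in vectors})
    for word, vec in vectors.items():
        others = [w for w in vectors if w != word]
        others.sort(key=lambda w: cosine(vec, vectors[w]), reverse=True)
        for neighbor in others[:k]:
            counts[neighbor] += 1
    return counts

# Hypothetical vectors: two animal words, two vehicle words, one generic word.
vectors = {
    "thing": [1.0, 1.0],
    "cat":   [1.0, 0.2],
    "dog":   [1.0, 0.3],
    "car":   [0.2, 1.0],
    "bus":   [0.3, 1.0],
}

counts = k_occurrence(vectors, k=2)
hub = counts.most_common(1)[0][0]  # "thing" for these toy vectors
```

Here “thing” lands in the neighbor list of every other word even though it is not especially close in meaning to any of them, which is the kind of noise that can bias an evaluation.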
Additionally, there are two significant challenges with Word2Vec in particular.
- It cannot deal with ambiguities very well. As a result, the vector of a word with multiple meanings reflects the average, which is far from ideal.
- Word2Vec can’t handle out-of-vocabulary (OOV) words, and it has no way to relate morphologically similar words to one another. When the model encounters a word it has not seen in training, it resorts to using a random vector, which is not an accurate representation.
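The OOV limitation can be pictured as a lookup with a fallback. The sketch below mimics the random-vector fallback described above using a hypothetical two-word vocabulary. The function name and the returned flag are our own additions, intended only to show why the fallback carries no meaning.

```python
import random

def lookup(word, vectors, dim, rng):
    """Return (vector, known). Unseen words get a random stand-in vector."""
    if word in vectors:
        return vectors[word], True
    # No training signal exists for this word, so the values are arbitrary.
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)], False

# Hypothetical trained vocabulary.
vectors = {"avocado": [0.2, 0.9], "tree": [0.4, 0.7]}
rng = random.Random(0)

known_vec, known = lookup("avocado", vectors, dim=2, rng=rng)
oov_vec, oov_known = lookup("avocados", vectors, dim=2, rng=rng)
```

Even though “avocados” differs from “avocado” by a single letter, the lookup treats it as a completely unknown word and returns a vector unrelated to the singular form.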
Using Word2Vec or any other word embedding is no guarantee of success. Quality output is predicated on proper training using an appropriate and sufficiently large corpus.
While evaluating the quality of output can be cumbersome, here’s a simple solution for content marketers. The next time you’re evaluating a content optimizer, try using a very specific topic. Poor quality topic models fail when it comes to testing in this manner. They’re okay for general terms but break down when the request gets too specific.
So, if you use the topic ‘how to grow avocados,’ make sure the suggestions have something to do with growing the plant and not avocados in general.
MarketMuse First Draft natural language generation helped create this article.