April 9, 2019 | Rebecca Bakken

Topic Modeling for SEO Explained

Search engines like Google have a vested interest in concealing exactly how they rank content. But there’s only so much you can hide in the information age. It’s known that search algorithms use topic models to sort and prioritize the 130 trillion pages on the web. While we may never eliminate the unknowns of SEO, we can use what we do know to our advantage.

Search algorithms are getting increasingly intelligent. The introduction of Hummingbird made that clear. Writing high-ranking content is no longer a matter of using as many keywords as possible. Instead, the algorithm employs models that measure the topical comprehensiveness of a page. It then matches it to a search query.

As a result, comprehensiveness has become a proxy by which search engines measure content quality. Moreover, Hummingbird made it easier to determine how Google ranks content. Fortunately for us, it provided a baseline for experimentation. Comparing rankings before and after the update has proven to be insightful.

How Do We Know Search Engines Use Topic Modeling?

On September 24, 2018, Google finally confirmed what many SEOs had suspected.

So we’ve taken our existing Knowledge Graph—which understands connections between people, places, things and facts about them—and added a new layer, called the Topic Layer, engineered to deeply understand a topic space and how interests can develop over time as familiarity and expertise grow. — Source: Google Blog

The article goes on to explain that the topic layer is built by analyzing all the content existing on the web for a given topic. From that, they develop hundreds, in some cases thousands, of subtopics. The most relevant content is identified for each of those subtopics, after which they look for patterns to understand how these subtopics relate to one another.

If you’ve got some time on your hands, you can read this extensive research paper by the University of Maryland. It details the many applications of topic models. These include query expansion, information retrieval, and search personalization.

It’s difficult to envision a way to efficiently produce SERPs without topic modeling. There are too many pages on the web. The ways in which queries are entered are varied and complex. And various on-page SEO factors are taken into account for each search.

So, we know that topic modeling is a requirement for providing fast, relevant results, which means content marketers should care. Here’s why.

Developing a content strategy that produces results begins with understanding search engines. But you don’t need to be a data scientist to crack the code.

Later on, though, we’ll discuss the history of topic modeling and explore the different types of algorithms for data-curious content marketers.

What SEOs Need to Know About Topic Models

Google’s algorithm aims to surface content that provides deep coverage of a given subject. So the best way to rank is to:

make your content easily readable by the algorithm
create in-depth, broad coverage of your focus topics.

Enter topic clusters. These are groups of content built around pillar pages that cover your focus topics. The pillar pages are, in turn, supported and linked to by pages that cover related topics. Topic clusters give you breadth and depth in a way that’s easily navigated by both humans and search algorithms.
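One way to picture a topic cluster is as a small link graph. The sketch below is only an illustration: the URLs and the `audit_cluster` helper are hypothetical, and it says nothing about how a search engine actually evaluates links. It simply flags supporting pages that don't link to their pillar, and supporting pages the pillar doesn't link to.

```python
# Hypothetical site: each page maps to the set of pages it links to.
links = {
    "/content-strategy": {"/topic-clusters", "/keyword-research"},  # pillar page
    "/topic-clusters": {"/content-strategy"},
    "/keyword-research": {"/content-strategy"},
    "/content-audit": set(),  # supporting page that was never linked up
}

def audit_cluster(pillar, supporting, links):
    """Return (pages missing a link to the pillar, pages the pillar doesn't link to)."""
    missing_to_pillar = sorted(
        page for page in supporting if pillar not in links.get(page, set())
    )
    missing_from_pillar = sorted(
        page for page in supporting if page not in links.get(pillar, set())
    )
    return missing_to_pillar, missing_from_pillar

supporting = ["/topic-clusters", "/keyword-research", "/content-audit"]
to_pillar, from_pillar = audit_cluster("/content-strategy", supporting, links)
print(to_pillar)    # ['/content-audit']
print(from_pillar)  # ['/content-audit']
```

Running a check like this over a real crawl export would show which pages sit outside the cluster they are supposed to support.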

HubSpot did an experiment showing how interlinked topic clusters resulted in better SERP rankings. It’s likely that the clusters made HubSpot’s content easier to crawl. That allowed the algorithm to quickly find the pages relevant to a query.

The interlinked clusters signal breadth and depth of a topic. They can lead users through a seamless journey that answers their questions. After all, that’s the whole point of search. Getting those questions answered is called searcher task accomplishment. It contributes to higher ranking by increasing the authority of your pages. Every time a user visits and doesn’t bounce, that sends a positive signal to Google.

Topic Clusters and User Intent

Searcher task accomplishment is a relatively new industry term. But the concept itself is not new. It’s what happens when you focus on satisfying user intent. You aim to provide as many answers as possible with your content in an easily navigable way. In other words, you create topic clusters.

Optimizing content around user intent involves some critical thinking. You need to determine the potential questions a person may ask. However, throwing stuff at the wall to see what sticks isn’t a great way to strategize. It’s a lesson many content marketers have learned the hard way.

Creating topic clusters is best done with a solution that thinks like a search algorithm. MarketMuse takes a keyword, or what we prefer to call a focus topic, for one page. Then it analyzes tens of thousands of related pages. In doing so, it identifies subtopics, questions to answer, and user personas to address with your content. It does all this by using artificial intelligence to generate detailed content suggestions.

The software helps produce an outline of what your content should look like. It removes much of the guesswork for your writers. We’re not the only company that provides this value, but we do it better than the competition. For that, we have an ensemble of natural language processing algorithms, information theory, neural networks, and semantic analysis to thank.

Like Google, we’re not about to give away our trade secrets. But we can break down for you how more rudimentary topic modeling algorithms work. This should illuminate the differences between simpler tools and sophisticated software platforms.

Term Frequency-Inverse Document Frequency

Introduced in 1972, TF-IDF analyzes keyword frequency in a document compared to a set of documents. It measures the number of times a word or combination of words appears in a body of text. Then it determines the degree of relevance the text has to that term by comparing it to a collection of other documents. But its greatest downfall is that it can’t account for relationships, semantics, or syntax. That’s why it’s not very useful in today’s complex world of SEO.
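To make the idea concrete, here is a minimal sketch of one common TF-IDF variant (relative term frequency multiplied by the log of inverse document frequency). The three tiny documents and the `tf_idf` function name are made up for illustration, and real tools typically add smoothing and normalization on top.

```python
import math
from collections import Counter

# Made-up mini corpus for illustration.
docs = {
    "a": "topic models help search engines sort pages",
    "b": "search engines rank pages",
    "c": "keyword stuffing hurts pages",
}

def tf_idf(docs):
    """Score each term in each document: term frequency times inverse document frequency."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

scores = tf_idf(docs)
print(scores["a"]["pages"])  # 0.0, because "pages" appears in every document
print(scores["a"]["topic"] > scores["a"]["search"])  # True, "topic" is rarer
```

Notice that the scores depend only on counts. Nothing here knows that "rank" and "sort" are related, which is the limitation described above.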

Latent Semantic Analysis

Developed in 1988, latent semantic analysis (LSA) looks at the relationship between a set of documents and the terms they contain. Specifically, it produces a set of concepts related to the documents and terms. LSA gets us closer to discovering synonyms and semantically related words. But it still can’t identify relationships between topics.
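For the data-curious: in its standard textbook form, LSA factors a term-document matrix with a truncated singular value decomposition. The notation below is the conventional one, not something taken from a specific tool. Here X holds term counts (or TF-IDF weights) for m terms across n documents, and k is the number of concepts kept.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
X \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}
```

Each row of U_k Σ_k places a term in the k-dimensional concept space, and each row of V_k Σ_k places a document there. Two terms that never appear together can still land close to each other if they show up in similar documents, which is how LSA surfaces synonyms.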

Latent Dirichlet Allocation

This topic model, created in 2003, is commonly used to identify topical probability and relationships between topics and subtopics. Latent Dirichlet Allocation (LDA) analyzes the connections between words in a corpus of documents. It’s able to cluster words with similar meaning. As a result, you have a more in-depth semantic analysis than earlier topic models. LDA also utilizes a Bayesian inference model to identify terms related to a topic within a document. It improves those assumptions each time a new document is analyzed. Using LDA, you can get a reasonably precise assessment of the topics discussed in a document.
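The Bayesian part can be sketched with a tiny collapsed Gibbs sampler, one common way to fit LDA. This is a teaching sketch under stated assumptions, not any vendor's implementation: the `lda_gibbs` name, the four toy documents, and the hyperparameter values (`alpha`, `beta`, two topics, 200 passes) are all chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns per-document topic mixtures."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # words in doc d assigned to topic k
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                           # all words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the current assignment, then resample it.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's topic mixture.
    return [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]

docs = [
    "search engine ranking search query".split(),
    "query ranking engine search".split(),
    "dog pet animal dog legs".split(),
    "animal pet legs dog".split(),
]
mixtures = lda_gibbs(docs)
for mix in mixtures:
    print([round(p, 2) for p in mix])
```

Each printed row is one document's estimated mix of the two topics, and each row sums to 1. On a corpus this small the result is only suggestive, and which topic ends up labeled 0 or 1 is arbitrary.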

How Does MarketMuse’s Topic Modeling Work?

There are two parts to creating an effective topic modeling system.

First, you need some serious technology to acquire the substantial amounts of required data.
Second, you need a robust algorithmic platform to efficiently analyze the collected data.

MarketMuse has engineered a system for analyzing millions of articles on a given topic by understanding related topics and following links until we build sets of millions of content items.

Our proprietary systems analyze all of the content on the Web. Technically, it’s a corpus of web crawl data composed of over 25 billion web pages. Then we sample it and build pre-processed models called “high-dimensional vector spaces,” which can then give us results in real time or near real time.
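The general idea of a vector space can be sketched in a few lines. This is a generic illustration, not a description of MarketMuse's proprietary models: the three-dimensional vectors and topic names below are invented, and a real model would have hundreds of dimensions learned from data. Once the vectors are built ahead of time, finding related topics is a fast similarity lookup.

```python
import math

# Invented, pre-computed topic vectors (real models have far more dimensions).
vectors = {
    "seo": [0.9, 0.1, 0.0],
    "search ranking": [0.8, 0.2, 0.1],
    "dog training": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_related(topic, vectors):
    """Rank every other topic by cosine similarity to `topic`."""
    return sorted(
        (other for other in vectors if other != topic),
        key=lambda other: cosine(vectors[topic], vectors[other]),
        reverse=True,
    )

print(most_related("seo", vectors))  # ['search ranking', 'dog training']
```

The expensive work happens when the vectors are built. The lookup itself is cheap, which is what makes real-time or near-real-time answers plausible.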

Providing results in such a timely manner is crucial. Where’s the value in optimizing content based on old information? There isn’t any.

The web is constantly evolving. Content that performs well today may not do so a month, or even a week, from now. So, it’s imperative that decisions are made using the latest data.

Our algorithmic platform is a combination of:

Bayesian statistical methods (a collection of algorithms that measure, for example, co-occurrence). Our methods are generally patterned after Latent Dirichlet Allocation, which was invented in 2003 and is substantially different from Latent Semantic Indexing from the ’80s and TF-IDF from the ’70s.
Natural language processing that measures, for example, the relationships between concepts in the English language and their specificity. For example, a “dog” can be a pet, is a type of animal, has legs, etc.
Graph analysis that looks at content as a collection of edges and vertices, in one document and across a collection of documents.
Deep learning: neural networks that look to learn and to understand documents similarly to how the human brain processes them.
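Two of these ingredients, co-occurrence and graph analysis, are easy to illustrate together. The sketch below is a generic example with made-up documents and a hypothetical `cooccurrence_graph` helper, not MarketMuse's method: terms are the vertices, and an edge's weight is the number of documents in which both terms appear.

```python
from collections import Counter
from itertools import combinations

# Made-up documents for illustration.
docs = [
    "topic model search ranking",
    "topic cluster pillar page",
    "search ranking query",
]

def cooccurrence_graph(docs):
    """Build weighted edges: (term_a, term_b) -> number of documents containing both."""
    edges = Counter()
    for text in docs:
        terms = sorted(set(text.split()))  # sorted so each pair has one canonical order
        edges.update(combinations(terms, 2))
    return edges

edges = cooccurrence_graph(docs)
print(edges[("ranking", "search")])  # 2, the pair appears in two documents
print(edges[("search", "topic")])    # 1
print(edges[("pillar", "query")])    # 0, the terms never share a document
```

Heavier edges suggest terms that belong to the same topic, and the same structure scales from a single document to a whole collection.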

The MarketMuse Difference

Some inexpensive or free tools frequently use these topic models. However, they can only provide a coarse-grained analysis that gives you vast amounts of data that you’ll need to sift through manually. There isn’t a magic bullet algorithm that gives you relevance, relationships, semantics, syntax, and keyword variants.

We know because we’ve tried to create it. We’ve ended up with a robust solution using data science to provide the most advanced content solution for SEO on the market today.

With each experiment and update we conduct, we better understand how Google operates. Consequently, our software helps content marketers improve search performance, enhance user experience, and fulfill searcher intent.

We know our clients’ websites are much more than a bunch of strategically developed keywords and phrases.

They’re platforms for companies to display transparency. They’re places for organizations to establish expertise and help people find solutions.

Using MarketMuse, you can:

plan your content strategy
optimize your site structure and linking
confidently answer your viewers’ most pressing questions.

Rebecca Bakken

Rebecca is an experienced writer with a demonstrated history of working in the online media industry. She is skilled in search engine optimization (SEO), journalism, magazine writing, AP Style, and content marketing. You can follow her on Twitter or LinkedIn.