Product
January 9th, 2020

How MarketMuse Identifies Topics That Make a Page More Comprehensive

So you want to craft better content but are confused by terms like entities, semantic analysis, TF-IDF, and latent semantic indexing. The prevalence of cheap SEO tools with their magic-bullet claims doesn’t help either.

Don’t worry. This post will clear up the confusion and help you better understand how MarketMuse can make your content more topically rich.

Have you ever read an article and thought to yourself, “Wow! That person wrote so much but said so little”? That’s the antithesis of a comprehensive post, the type of content you’re trying so hard to avoid.

The challenge is how to quantify something like this, so you can be sure your content is of sufficient quality, as judged by search engines. The premise here is that well-written, in-depth, and relevant content will rank higher, all things being equal.

In the past, people involved in search engine optimization have resorted to several methods to achieve topical richness. The two most popular are term frequency-inverse document frequency and latent semantic indexing.

They both suffer from severe limitations. By themselves, they’re not very useful for establishing semantic relevance and determining what topics should be covered when creating content about a particular subject.

Term Frequency Inverse Document Frequency (TF-IDF)

TF-IDF traces its roots back to the late 1950s and early 1970s as a way of determining the relevance of a term within a document. This simple algorithm isn’t suited to such a complex problem. TF-IDF applications typically suffer from the following:

  1. They rely heavily on Google results.
  2. They consider pages that achieve different goals and merge them together.
  3. Their list of topically relevant keywords isn’t necessarily appropriate for your business.
  4. TF-IDF tools tend to be heavily keyword-driven.
  5. Using TF-IDF alone to determine importance is a flawed approach.
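To make the keyword-driven nature of TF-IDF concrete, here is a minimal sketch in plain Python (the tiny corpus is invented for the example). Notice that a term appearing in every document, like “content” below, scores zero no matter how central it is to the subject:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Score each term in each document by
    term frequency x inverse document frequency."""
    n_docs = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]
    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

corpus = [
    "content marketing drives organic traffic",
    "content strategy guides content marketing",
    "search engines rank relevant content",
]
scores = tf_idf(corpus)
# "content" appears in all three documents, so its IDF is log(3/3) = 0,
# and the model considers it irrelevant everywhere.
```

The scores depend entirely on surface-level term counts, which is why TF-IDF alone says nothing about whether a page actually covers its subject well.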

Latent Semantic Indexing (LSI)

Latent semantic indexing was developed in the same decade that brought us the Commodore 64. LSI keywords don’t really exist. It’s a term made up by the SEO community to denote terms identified through the application of LSI.

Some drawbacks to using latent semantic analysis:

  1. The model has difficulty dealing with polysemy (multiple meanings of a word). For example, a crane could be a piece of construction equipment or a long-necked bird.
  2. It ignores word order, thus missing out on syntactic relations, logic, and morphology.
  3. It assumes a particular distribution (Gaussian) of terms in documents that may not be true in all instances.
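The polysemy problem is easy to see in a toy latent semantic analysis sketch (NumPy is assumed; the tiny term-document matrix is invented for the example). Because LSA assigns exactly one vector per term, “crane” lands midway between its two senses rather than getting a distinct vector for each:

```python
import numpy as np

# Term-document count matrix: rows = terms, columns = documents.
# doc0 is about birds, doc1 is about construction; "crane" occurs in both.
terms = ["crane", "bird", "construction"]
A = np.array([
    [1.0, 1.0],  # crane
    [1.0, 0.0],  # bird
    [0.0, 1.0],  # construction
])

# Truncated SVD projects terms into a k-dimensional "latent" space.
# (With only two documents this tiny matrix has rank 2, so k = 2
# keeps everything; real corpora would be heavily truncated.)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "crane" gets a single vector blending both senses, so it sits
# equally close to "bird" and "construction" -- LSA cannot tell
# the two meanings apart.
sim_bird = cosine(term_vectors[0], term_vectors[1])
sim_construction = cosine(term_vectors[0], term_vectors[2])
```

The single-vector-per-term representation is exactly why LSA struggles with the crane example above: the two senses are averaged away instead of separated.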

How MarketMuse Identifies Topics

MarketMuse uses its own patented systems and methods for semantic keyword analysis, the details of which you can read here: United States Patent 10,409,875. Be forewarned: the full-text patent is hefty reading, so I’ll try to simplify things while maintaining the essence.

Our systems and methods consist of an ensemble of algorithms that together create a list of semantically related topics, scored and ordered by relevance, given an input of one or more keywords (what we call a focus topic or a subject). These algorithms include ones for:

  • Phrase extraction
  • Graph analyses
  • Natural language processing

Here’s precisely how the method works.

  1. You provide our topic tool with a focus topic of one or more keywords for which you wish to generate a list of related topics for organic search. 
  2. MarketMuse then crawls the web for all the competitive content concerning that topic. We’re talking lots of content here, not just the top 10 or 20 results in Google. Think of a number and add some zeros to it. Then you’re getting close to the amount of content we analyze.
  3. Then we use “key phrase extraction algorithms to generate a set of keywords based on at least the acquired content.”
  4. Next, we employ “graph analyses algorithms to identify a set of topics semantically relevant to the set of keywords generated.”
  5. Finally, we use “natural language processing algorithms to determine a relevance score for each topic” in the set.
  6. From the set of semantically relevant topics, we generate a knowledge graph of related topics ranked by the relevance score.
  7. Based at least partially on the knowledge graph, we provide an enumerated list of topics ranked by relevance score.

A few things to keep in mind concerning the method

  • Content comes from a variety of sources, including “web sites, news articles, blog posts, and keyword data.”
  • Analyzing the content requires a crazy amount of cleaning and normalizing of the acquired content. Our data science team could create a whole blog post on this alone, but between you and me, I think it would be pretty dull!
  • Our “key phrase extraction algorithms comprise a Bayesian statistical ensemble.” If you don’t know what this means, here’s an article on the Bayesian method and another on ensemble learning. Be warned. These articles are even heavier reading than our patent application!
  • We apply an ensemble of algorithms that include many term ranking functions, including one or more of the following: a core phrase term ranking function, a tail phrase term ranking function, a hyperdictionary graph traversal algorithm, and a semantic knowledge base path traversal score.
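The idea of combining multiple term ranking functions can be sketched in a few lines. The two scoring functions below are hypothetical stand-ins invented for illustration; they are not the core phrase or tail phrase functions from the patent:

```python
def core_phrase_score(phrase):
    """Hypothetical: favor short, head-like phrases."""
    return 1.0 / len(phrase.split())

def tail_phrase_score(phrase):
    """Hypothetical: favor longer, long-tail phrases."""
    return len(phrase.split()) / 5.0

def ensemble_rank(phrases, rankers, weights):
    """Order phrases by a weighted combination of several
    term ranking functions."""
    return sorted(
        phrases,
        key=lambda p: -sum(w * r(p) for r, w in zip(rankers, weights)),
    )

phrases = ["seo", "content marketing", "long tail keyword research"]
ranked = ensemble_rank(
    phrases, [core_phrase_score, tail_phrase_score], [0.7, 0.3]
)
```

The value of an ensemble is that no single ranking function has to be right on its own; each one contributes a signal, and the weights control how much each signal matters.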

Summary

Determining the most relevant topics that add to a page’s comprehensiveness is no simple feat. Semantic keyword analysis is a complex endeavor, and no single algorithm can solve the problem. Thinking otherwise shows a lack of appreciation for the intricacies involved.

As our patented methods and systems reveal, we rely on many different algorithms working together as an ensemble to produce the desired output.