In this post we look at the challenges of using TF-IDF to create and optimize web content. While using TF-IDF may make you feel good, it’s not really solving the problem. As we investigate the issues surrounding its usage, you’ll discover that employing TF-IDF may in fact lead you astray.
What Is TF-IDF?
Term frequency inverse document frequency (TF-IDF) is a metric used to determine the relevancy of a term within a document. The formula counts the frequency of a term (TF) in a given document and applies an inverse document frequency (IDF) factor to diminish the weight of terms that occur very frequently, while increasing the weight of those that rarely occur.
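To make the definition concrete, here is a minimal Python sketch of one common textbook variant of the formula, where TF is a term's share of the words in a document and IDF is log(N / df). The function names and the three sample documents are invented for illustration, and real tools often apply different smoothing or normalization.

```python
import math

def term_frequency(term, doc_tokens):
    # Share of the document's tokens that are exactly `term`.
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus):
    # log(N / df): N = documents in the corpus, df = documents containing the term.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus, already lowercased and tokenized.
corpus = [
    "the baby slept through the night".split(),
    "the stroller folds flat".split(),
    "the crib fits the nursery".split(),
]

print(tf_idf("the", corpus[0], corpus))   # in every document, so the IDF factor is 0
print(tf_idf("baby", corpus[0], corpus))  # in only one document, so it keeps some weight
```

In this sketch a word that appears in every document is weighted down to zero, while a word found in a single document keeps a positive score.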
TF-IDF builds on the work of Hans Peter Luhn (1957) on term frequency and Karen Spärck Jones (1972) on inverse document frequency. Astute readers will notice that this pre-dates the birth of the World Wide Web by decades, which raises a question.
Does Google Even Use TF-IDF and Is It Still Relevant?
Google’s John Mueller has implied that the search engine’s use of TF-IDF is very limited. During a hangout, the only context in which he mentioned TF-IDF was for the removal of stop words.
That’s not surprising given the advancement of the Knowledge Graph, Hummingbird, RankBrain and the Topic Layer. Google’s algorithm is continuously evolving, constantly training and learning about what things mean and how to deal with the ambiguities of human language.
We’re seeing variable SERP features and better handling of results that have personalization. The search engine is improving its ability to deal with intent fracture (search queries that appeal to multiple intents). But the algorithm is far from perfect. As we’ll see, this poses a serious challenge to those using TF-IDF as a means of optimizing content.
In a world where AI, neural networks and machine learning are the norm, TF-IDF is like a kid’s bike on training wheels compared to a Ferrari.
Roger Montti, Search Marketer and Speaker
Why Does TF-IDF Feel so Good to Many SEOs?
Despite Google’s limited use of this half-century-old technology, many SEO experts believe TF-IDF is the path to search engine prominence. Why is that?
TF-IDF is a relatively obscure concept within the SEO community. Because it’s unfamiliar to them, SEOs assume the technology is cutting-edge. That gives it a certain amount of cachet.
The majority of SEOs are unaware of the history of TF-IDF. They don’t realize its true age or its true purpose. Hint: it’s not for content optimization.
SEOs believe TF-IDF plays a big part in Google’s search algorithms. Because Google has patents and a couple of posts that reference TF-IDF, there’s a false assumption about the role this technology plays.
TF-IDF appears sophisticated to the majority of SEOs. It’s rare for SEOs to have a data science background. In this context, it’s easy to assume the apparent complexity of TF-IDF equals effectiveness.
Who wouldn’t want to use sophisticated, ground-breaking search engine optimization technology? Especially when it sounds so promising!
Except it isn’t.
The Problems With TF-IDF
There are a number of SEO tools, free or inexpensive, that purport to use TF-IDF as a method for optimizing content for SEO. All of them suffer from the following issues.
TF-IDF is a Primitive Approach
I asked JR Oakes, Senior Director, Technical SEO Research at Adapt Partners, for his opinion about TF-IDF. He offers a succinct analysis of its limits.
TF-IDF is a good measure of how important a document is, compared to other documents, to an explicit term. Where it falls flat is that you may have a document that is highly relevant for “baby” according to TF-IDF, yet you were searching for “infant”. Because the document (that was most relevant for “baby”) uses this term sparsely, it is not seen as a relevant match.
Google understands that “baby” and “infant” are strongly related (often synonymous) terms, and a page with relevance for one, is more than likely relevant for the other, unless there are context clues in the rest of the query that say otherwise. This is based on the co-occurrence of usage across the internet as well as the probability that they both are used in similar contexts.
Another good example is a misspelling. If you have documents about “reebok” shoes, and you search “rebok”, with TF-IDF, you will more than likely find the page that someone made a misspelling on. Google will understand these as the same and will return appropriate results.
JR Oakes, Senior Director, Technical SEO Research at Adapt Partners
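The exact-match limitation described in the quote above can be reproduced in a few lines of Python. This is only a sketch: the scoring uses a plain textbook TF-IDF, and the page names and texts are hypothetical.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    # Exact-match TF-IDF: tf is the term's share of tokens, idf is log(N / df).
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0

def matching_pages(term, pages):
    # Names of the pages whose TF-IDF score for `term` is above zero.
    corpus = list(pages.values())
    return [name for name, tokens in pages.items() if tf_idf(term, tokens, corpus) > 0]

# Hypothetical pages, already lowercased and tokenized.
pages = {
    "baby-guide": "baby sleep tips and baby feeding basics".split(),
    "infant-guide": "infant sleep tips and infant feeding basics".split(),
    "brand-page": "reebok running shoes and reebok trainers".split(),
    "typo-page": "cheap rebok shoes on sale".split(),
}

print(matching_pages("infant", pages))  # the "baby" page never matches
print(matching_pages("rebok", pages))   # only the page with the misspelling matches
```

Nothing in the calculation knows that “baby” and “infant” are related, or that “rebok” is a typo for “reebok”. Each string is scored in isolation.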
TF-IDF Applications Rely on Google Search Results
These applications calculate TF-IDF using the documents that appear in the SERP as the corpus. They typically rely blindly on the top 10 or 20 pages in the SERP, without investigating why those pages cover the topics they do.
Using so few documents as a corpus significantly affects the quality of the results. They don’t consider the outliers with low quality content or short content items that fail to provide value for that model.
Taking the top results from Google also ignores off-page outliers: the pages that rank well despite their content. The error involved is so high that, even if you account for those things, you lack the information needed to make decisions and may end up on the wrong path.
Even with time savers like natural language processing, you’ve got to process everything that’s out there on a topic.
TF-IDF and keyword density solutions throw all that out the window. If you follow their advice, you’re as likely to be successful as if you rolled the dice.
I reached out to Bill Slawski, Director of SEO Research at Go Fish Digital. Bill has been analyzing Google’s search patents and writing about them on his blog, SEO by the Sea, since 2005.
TF-IDF is referred to in a number of Google Patents as something that the search engine may use as part of processes behind such things as generating query refinements. Since Google has access to its corpus of documents on the Web, and the words used upon those documents in its index, that is very reasonable.
The IDF part of TF-IDF can be used to identify how rare or how common words are in Google’s Corpus on the Web. Unfortunately Google doesn’t share that corpus.
When you perform a query, Google does say how many results a query term appears within, but that amount is an estimate of a percentage of documents in Google’s Web corpus (as one of Google’s patents tells us.) But anyone other than Google using TF-IDF on a document without Google’s corpus is not capable of determining how common or how rare words are in a document that doesn’t actually use Google’s Corpus.
There are some toolmakers who provide TF-IDF tools. They do things like look at what terms appear on pages that rank highly for specific query terms that you enter. Keep in mind that these aren’t necessarily semantically related to each other. Although I have seen some claim that TF-IDF used in this manner can identify words that are semantically related to each other.
Bill Slawski, Director of SEO Research at Go Fish Digital
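A short Python sketch shows how much the IDF value depends on the corpus behind it. Both corpora here are invented: a ten-page “SERP” where every page mentions the topic word, and a wider sample that pads those pages with 990 unrelated ones. Neither stands in for Google’s actual index.

```python
import math

def idf(term, corpus):
    # log(N / df), with each document reduced to its set of distinct words.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

# Hypothetical top-10 SERP: every page mentions "mortgage", half mention "refinance".
serp = [{"mortgage", "rates", "refinance"}] * 5 + [{"mortgage", "rates", "calculator"}] * 5

# Hypothetical wider sample: the same 10 pages plus 990 unrelated pages.
wider = serp + [{"recipe", "chicken"}] * 990

print(idf("mortgage", serp))   # 0.0, because the word is on all 10 pages
print(idf("mortgage", wider))  # about 4.6, because it is on only 10 of 1,000 pages
```

Measured against the ten-page corpus, the topic word itself looks worthless. Measured against the wider sample, it looks distinctive. The same page yields a different score depending on which corpus you happen to have.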
TF-IDF Looks at Pages That Achieve Different Goals and Merges That Together
Relying on the top “N” pages in the SERP creates other issues. You may be using pages that are too general, too specific or targeted to a different industry. The content may be poorly written yet have significant off-page value that’s driving its ranking. Take, for example, landing pages that have been propped up in the SERPs by link building strategies.
The List of Topically Relevant Keywords Isn’t Necessarily Appropriate for Your Business
TF-IDF provides a list of topically relevant keywords associated with those content items. But you still have to determine the relevance of those phrases to your business. If you write a blog post modeled after a low-quality landing page or content page, or one that doesn’t connect with your intent, it’s not going to be a fit.
TF-IDF Is Heavily Keyword Driven
Pages aren’t about keywords. A page that performs well for a lot of things is about a lot of things. Using TF-IDF from one keyword to create or optimize a page leaves out a lot. Specifically, the search results for each of those other keywords are different. That’s a huge miss.
Terms can also appear in many forms. Stemmed variants, synonyms and other related concepts all contribute cumulatively, which undermines the case for focusing on exact keywords. That’s the bias created by using only the top “N” pages or keywords.
Ultimately, you can never truly know if any of those pages are actually expertly written in a comprehensive fashion. Each one of those pages ranks for “N” other topics too, which results in a pool of pages you have to assess. Based on those pages and what they’re about, it can continue to branch.
A keyword focus can lead to really unnatural language: the kind of low-quality content where keywords are forced into the text at all costs. Alternatively, the content may be good, but it has no connection to anything on your site.
Andy Crestodina, Co-Founder / Chief Marketing Officer of Orbit Media Studios puts it this way.
“Nice article, but the TF-IDF could have been a bit better…” When I get that comment from a reader, I’ll start worrying about things like inverse document frequency.
Andy Crestodina, Co-Founder / Chief Marketing Officer of Orbit Media Studios
Yes, pick a primary keyphrase within reach. Yes, use that phrase in the title, header and body text. Yes, work in those semantically related phrases and subtopics. Yes, answer the relevant “people also ask” questions. But no, don’t calculate TF-IDF. Because that’s just silly.
Instead, write something original, something unexpectedly useful. Worry more about delighting your reader. Do this and you’ll send all the right search signals. You’ll win links, dwell time, word of mouth and brand searches. Forget the math and do something awesome. Your readers are hoping you’ll take this advice.
Using TF-IDF to Determine Importance Is a Flawed Metric
Calculating importance by frequency of usage in the SERP vs. relevance is an absolutely flawed metric. If some entries in the SERP focus on one intent and the other ones focus on another, the term weighting (importance) may be scored at 50%. However, if everyone uses some sort of common word, that will be judged as more important.
So, you’re trying to appeal to that one intent. But the model will discourage you from pursuing that path because only five of the results use the term. The model is going to say that it’s only five out of 10.
In other words, if you’ve got high-quality content focused on a different intent, you’ll be led astray. If you’ve got low-quality content that has high off-page factors, that’s going to lead you down the wrong path. If you’ve got mixed intent, that’s going to lead you off course. So using that as a metric is just garbage.
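The five-out-of-10 problem can be sketched in Python. The weighting below, a term’s share of the SERP pages that mention it, is a deliberately simplified stand-in for how a tool might score importance, and the “jaguar” pages are hypothetical.

```python
def serp_share(term, pages):
    # Fraction of the SERP pages that mention the term at least once.
    return sum(1 for page in pages if term in page) / len(pages)

# Hypothetical SERP with a split intent: five pages about the car, five about the animal.
car_pages = [{"jaguar", "engine", "dealer", "home"}] * 5
animal_pages = [{"jaguar", "habitat", "rainforest", "home"}] * 5
serp = car_pages + animal_pages

print(serp_share("engine", serp))   # 0.5, although it is essential to the car intent
print(serp_share("habitat", serp))  # 0.5, although it is essential to the animal intent
print(serp_share("home", serp))     # 1.0, a navigation word that every page carries
```

Under this kind of weighting, the terms that define each intent are capped at 50%, while a generic word that happens to be on every page comes out on top.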
TF-IDF Applications Only Focus at the Page Level
By restricting themselves to the page level, TF-IDF applications can’t connect the dots between the rest of the content on your site. One page on a topic typically won’t cut it. To do well, you need other content that fuels your authority and works together through appropriate interlinking and use of relevant anchor text.
A Grade Does Not Provide Insight
Grading a page based on its compliance with TF-IDF seems like a good idea. But if you can’t dive in and learn more about that site or page, that information is meaningless and not actionable.
The page with the highest grade may:
- Have a different goal than yours.
- Be much stronger or weaker than yours.
- Have two goals.
- Cover this topic well, but also cover something else.
So your goal of simplifying the research process with TF-IDF is unattainable. It gave you this grade, but you still have to go back and manually research each page to see if the TF-IDF data is valid.
What’s the use in that?
Why use TF-IDF if you get a grade and still have to work through the page manually? The technology should enable you to conduct a sophisticated analysis, including:
- Explicit topic overlap analysis of that topic and all the other words they rank for versus your page and what it ranks for.
- Competitive site structure.
- The intent that the competition is looking to service.
This is where TF-IDF falls flat. It provides no shortcut value that you can rely on.
A methodology that doesn’t let you dig in using the technology is flawed, because you still have to do that additional layer of research to get a head-to-head analysis of what it means to approach one intent versus another.
How TF-IDF Fits Into a Workflow
Tools employing TF-IDF drive bad habits for writers and SEOs. Writers try to weave in words that don’t naturally fit, or add sections that don’t sit well with the narrative.
These applications ignore the relationship between researcher and writer. Handing a list of words that may not connect with the vision of the writer is going to create conflict. They may be inspired by some of those words, but it isn’t the workflow enablement solution that it pretends to be.
What happens if you deliver a list of keywords using this methodology? Some of them are on one intent and some of them are on another. The person on the receiving end is not going to know what to do with this. It just doesn’t look right.
True content strategists know they need to assess. They need to do the work to understand what it means to be a subject matter expert, to understand user intent.
Should I try to be just like the page that gets a great grade? Because if I do that, the likelihood of success is as random as any other research methodology. Frankly, if I’ve got to do all that manual research on this metric that I’ve got, what value does it truly provide? I can’t rely on it.
Combining TF-IDF With Other Data Points
Using TF-IDF data with other flawed data points leads to false conclusions. Here are some that we see used in connection with TF-IDF.
Maybe you rely on search volume to determine what to write about. Instead of assessing the true potential of a page that achieves top rankings for this topic, you mix search volume with this type of competitive analysis.
Let’s say a keyword you’re targeting has 8,100 monthly searches. But the competitor you’re modeling against has pages that rank for dozens, hundreds or thousands of terms, supported by the network of pages they sit within.
Each one of them might receive 10,000 monthly visits while yours might get only 1,000. So you’re using search volume to calculate potential in a flawed way. You’re doing competitive analysis by grading content without diving in and doing the research. Combine those two things in a flawed manner, and the guidance those two metrics provide is as likely to result in failure as in success.
Using the SERP features and page type analysis as part of your guidance to determine the type of page you need doesn’t speak to the true intent of the query.
What SERP features are there? Do I have the opportunity to succeed?
But if you:
- Have never written anything on this.
- Don’t have any off-page authority.
- Have no collection of content or foundation or cluster of content.
Then using SERP features with search volume and competitive content just adds chaos and disorder to your chances of performing. It’s completely useless data.
AdWords Competition and AdWords CPC
AdWords Competition and AdWords CPC are metrics that are strictly for use with search engine marketing (paid ads). Neither metric correlates to difficulty. Nor do they represent any relationship to how easy or hard it will be for you to rank in organic search results.
The Value of TF-IDF
Is there any redeeming feature of TF-IDF?
- It could serve to inspire you or reveal a topic you may not have considered.
- It may aid you in determining if your on-page optimization is way out of line with what is natural.
- It could even help find competitors for which you need to conduct additional detailed research.
Kevin Indig, VP SEO and Content at G2, routinely writes about fresh digital marketing ideas and concepts on his blog. I asked if he could provide some insight into his experience with TF-IDF.
I’m a bit ambivalent about TF-IDF. Google said it doesn’t use it and even if it did, without the full Google corpus (meaning all content on the internet Google has indexed), we cannot get the accurate TF/IDF value. I have to say, though, that whenever I’ve used TF-IDF tools in the past, my content ranked better than without. So, no matter how inaccurate or inapplicable the concept seems to be, there seems to be value in using some of these tools.
Kevin Indig, VP SEO and Content, G2
This appears to be similar to the experience Joe Hall wrote about in his post TF-IDF Will Not Help Your SEO.
These types of tools can help optimize content for SEO, but not because of TF-IDF. Simply because they provide guidance and encouragement to rewrite content with more natural language that is commonly used. These same tools can be made using other metrics like “keyword density” or just “total term counts”, that can be compared against each other.
Joe Hall, SEO Consultant & Principal Analyst at Hall Analysis
But, is TF-IDF something that provides enough information to support your entire workflow? Not at all.
While it may feel good to many SEOs, the reality is that this 50-year-old metric plays a very limited part in Google’s search algorithms. Not exactly cutting-edge, is it?
Now, should your pages be comprehensive and of high quality? Yes.
By modeling it using TF-IDF? No.
You’re ideally trying to build a relevant topic model and you do need relevance as part of this calculation. Search engines may use TF-IDF, but it’s just one factor.
It’s one component of the whole picture of what’s needed for proper research and optimizing your content. So, if somebody’s selling a TF-IDF tool as an end-to-end solution, they are selling you a story that lacks the necessary information to make great decisions for your business.
You might as well trust your editor to make those business decisions. Or just roll the dice. Either way, it’s the same.
Written by Stephen Jeske