TF-IDF is frequently used in machine learning in various capacities, including stop-word removal. Stop words are common words like “a,” “the,” “an,” and “it” that occur frequently but hold little informational value. TF-IDF consists of two components: term frequency (TF) and inverse document frequency (IDF).
Term frequency can be determined by counting the number of occurrences of a term in a document.
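As a minimal sketch of that counting step, the snippet below tallies raw term counts for one document. The function name `term_frequency` and the lowercase, whitespace-based tokenization are assumptions made for illustration; real pipelines usually tokenize more carefully.

```python
from collections import Counter


def term_frequency(document: str) -> Counter:
    """Count raw occurrences of each term in a single document."""
    # Assumption: lowercase the text and split on whitespace.
    tokens = document.lower().split()
    return Counter(tokens)


tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # 2
print(tf["cat"])  # 1
```

Some variants divide each count by the document length so that long documents are not favored; the raw count shown here matches the definition above.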
IDF is calculated by dividing the total number of documents in the collection by the number of documents containing the term, then taking the logarithm of that ratio. The logarithm dampens the ratio so that very rare terms do not dominate. IDF is useful for reducing the weight of terms that are common across a collection of documents.
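The calculation described above might be sketched as follows. The function name `inverse_document_frequency`, the natural-log base, and the choice to return 0.0 for a term that appears in no document are assumptions for illustration; libraries often use smoothed variants instead.

```python
import math


def inverse_document_frequency(term: str, documents: list[list[str]]) -> float:
    """Return log(N / df), where N is the number of documents and
    df is the number of documents that contain the term."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        # Assumption: avoid division by zero for terms never seen.
        return 0.0
    return math.log(len(documents) / containing)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "bird", "sang"],
]
print(inverse_document_frequency("the", docs))  # 0.0, since log(3/3) = 0
print(inverse_document_frequency("cat", docs))  # log(3), about 1.0986
```

A word that appears in every document, such as “the” here, gets an IDF of zero, while a word found in only one document gets the highest value.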
Multiply TF by IDF to arrive at the final TF-IDF score for a term in a document.
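Putting the two parts together, a small self-contained sketch could look like this. The function name `tf_idf`, the toy documents, and the use of raw counts with an unsmoothed natural-log IDF are illustrative assumptions, not the only way to define the score.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: tf * idf} mapping per document."""
    n_docs = len(documents)
    # Document frequency: how many documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)  # raw term frequency within this document
        scores.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy collection, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird sang".split(),
]
weights = tf_idf(docs)
print(weights[0]["the"])  # 0.0: "the" appears in every document
print(weights[0]["cat"])  # log(3), about 1.0986
```

Even though “the” occurs twice in the first document, its score is zero because it appears in all three documents, which is how TF-IDF ends up discounting stop words.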