May 31, 2024 | Jeff Coyle

Leveraging Google’s Content Warehouse API for Structured Annotation, Semantic Analysis and Feature Tagging

Google’s Content Warehouse API leak is one of the biggest events in SEO history. Its significance will be felt for years to come — not just for content strategists and marketers, but for those who design software. Rand Fishkin at SparkToro, Mike King at iPullRank and Danny Goodwin at Search Engine Land have already produced some great analysis on the subject. Here’s my take on how you could use it to create and improve a product.

We can learn a lot from the recent “leak” of Google’s Content Warehouse API. Developers use APIs to build, propose and experiment. This information offers us the ability to theorize how things work and build more effectively. Obviously, an API share of this magnitude poses an enormous risk to Google for competitive reasons.

However, the massive value of introducing a shared vocabulary is not to be dismissed. It helps everyone speak the same language.

One thing that most people analyzing this leak will ask is, “Is this ‘ranking factor’ currently in use?”

If that’s all that matters to someone, great! I respect that perspective of not wanting to be distracted.

But no one can answer that without insider information. I respect Google enough not to expect them to share everything with everyone.

Competitors will eat it up, and developers move too quickly these days. As we have seen, big AI money wants market share more than short-term profits.

But that’s not my perspective. I’ve spent years researching patents; building software, search engine platforms, and ad servers; delivering content strategies; and improving websites and website networks, all with a focus on quality.

I’m jazzed that Google’s content analysis infrastructure is this comprehensive and this great.

This helps me, and it is already helping those around me who matter. That’s what matters.

A shared vocabulary is superior to fear, uncertainty, and doubt.

My two key findings so far relate to page and site embeddings and their potential use cases. Most interesting is the concept of processing a URL into processed text, complete with a data structure that rides along with the page (this includes semantic analysis output, features ‘of note’ and embeddings).

Disclaimer: Everything I write from here on is speculation about the ways this API could be used. I’m not saying this is how it’s used, how it’s implemented, or the way Google Search operates. As someone who has built search engine platforms, ad servers, focused crawlers, and many other products related to content analysis, this is how I would intuitively use the API to build a product.

Page Embeddings and Site Embeddings

Page embeddings and site embeddings are like digital fingerprints or summaries of web pages and entire websites, respectively. They help convey what each page and site is about.

Page embeddings capture the essence of a single web page’s content, while site embeddings represent the overall theme and focus of an entire website. These embeddings are used for various purposes, such as:

Finding similar content: Page embeddings can help identify web pages with similar topics or themes, which can be useful for content recommendations or competitor analysis.
Evaluating website quality: Site embeddings can help assess a website’s overall quality based on factors like content relevance and user engagement.
Detecting spam: Unusual patterns in embeddings can help flag spammy or low-quality websites.
Tracking changes over time: Embeddings can be used to monitor how a website’s content and quality evolve.

This summary highlights the key aspects of page embeddings and site embeddings and their roles, drawing on specific examples from the dataset contained in the Google Content Warehouse API leak.

Page Embeddings

Page embeddings represent individual web pages as dense vectors in a high-dimensional space. These embeddings capture the semantic meaning and relevance of the content on a page. They are used for various purposes, including:

Similarity Measurement: Comparing page embeddings helps in identifying similar content across different pages. For example, in the QualityAuthorityTopicEmbeddingsVersionedItem module, pageEmbedding stores embeddings that capture the content of individual pages, enabling similarity comparisons.
Content Clustering: Pages with similar embeddings can be grouped into clusters, facilitating the identification of content themes and topics.
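
To make the similarity and clustering ideas above concrete, here is a minimal Python sketch. It assumes embeddings are plain lists of floats, compares them with cosine similarity, and groups pages with a greedy threshold rule. The URLs, the vectors, the cluster_pages helper, and the 0.9 threshold are all made up for illustration; nothing in the leak says this is how pageEmbedding values are compared.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_pages(embeddings, threshold=0.9):
    """Greedy clustering (an assumption, not a documented method).

    A page joins the first cluster whose seed page (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of URLs
    for url, vec in embeddings.items():
        for cluster in clusters:
            seed = embeddings[cluster[0]]
            if cosine_similarity(vec, seed) >= threshold:
                cluster.append(url)
                break
        else:
            clusters.append([url])
    return clusters


# Made-up 3-dimensional vectors; real embeddings have many more dimensions.
pages = {
    "/espresso-guide": [0.9, 0.1, 0.0],
    "/latte-art": [0.8, 0.2, 0.1],
    "/tax-filing": [0.0, 0.1, 0.9],
}
clusters = cluster_pages(pages)
print(clusters)
```

With these toy vectors the two coffee pages land in one cluster and the tax page in another. A production system would likely use an approximate nearest-neighbor index rather than pairwise comparison.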

Site Embeddings

Site embeddings extend the concept of page embeddings to entire websites. They are generated by aggregating the embeddings of all pages within a site, providing a comprehensive representation of the site’s overall content. Site embeddings are utilized for:

Website Similarity and Relationships: Comparing site embeddings allows for the identification of relationships between different websites. This can be used to find clusters of high-quality sites, low-quality sites, or sites with similar thematic content. For instance, the QualityNsrNsrData module includes site2vecEmbedding and site2vecEmbeddingEncoded fields, which store compressed representations of site embeddings to manage data size while maintaining detailed site-level information.
Quality Assessment: Site embeddings help in evaluating the overall quality of a website. Metrics like siteScore in the QualityNsrNsrData module provide an aggregated quality score based on various factors, including content quality and user engagement.
Thematic Focus: The siteFocusScore in the QualityAuthorityTopicEmbeddingsVersionedItem module quantifies how focused a site is on a particular topic, while siteRadius measures how much the content of individual pages deviates from the site’s central theme.
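
Here is one way I could imagine deriving site-level values from page embeddings. This is a sketch under my own assumptions: the site embedding is the mean of the page vectors, the radius is the average distance of pages from that mean, and focus is a simple decreasing function of the radius. The leak names siteFocusScore and siteRadius but does not give their formulas, so site_profile and everything inside it is hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def site_profile(page_embeddings):
    """Hypothetical site-level summary built from page vectors."""
    centroid = mean_vector(page_embeddings)
    distances = [euclidean(page, centroid) for page in page_embeddings]
    radius = sum(distances) / len(distances)  # assumed stand-in for siteRadius
    focus = 1.0 / (1.0 + radius)              # assumed stand-in for siteFocusScore
    return {"site_embedding": centroid, "radius": radius, "focus": focus}


# Toy data: one site whose pages agree, one whose pages point in different directions.
focused = site_profile([[1.0, 0.0], [1.0, 0.0]])
scattered = site_profile([[1.0, 0.0], [0.0, 1.0]])
print(focused["radius"], scattered["radius"])
```

The point of the sketch is only the relationship: the more the pages spread out from the site's center, the larger the radius and the lower the focus.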

“Something you might want to note is that site vectors seem like they are only 64 dimensions…This kinda shocked me considering so many of the other embeddings models are much higher dimensionality”
Mike King

Use Cases

Content Recommendations: By analyzing site embeddings, recommendation systems can suggest relevant sites or pages to users based on their browsing history and interests. The ImageRepositoryFrameLevelStarburstEmbeddings module, for example, supports frame-level embeddings that help in recommending video content based on specific themes.
Spam Detection: Identifying and clustering spammy sites is possible by analyzing deviations in embeddings. Sites with embeddings significantly different from trusted sites can be flagged for further review.
Version Control and Temporal Analysis: Embeddings can be versioned to track changes over time. This helps in monitoring how a site’s content and quality evolve. The versionId in the QualityAuthorityTopicEmbeddingsVersionedItem module is an example of how versions are tracked for embeddings.
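
The spam-detection idea above can be sketched as a simple outlier test. This assumes we already have embeddings for a set of trusted sites, and it flags any candidate whose distance from the trusted centroid exceeds the trusted mean distance plus k standard deviations. The domains, vectors, the flag_outliers helper, and the choice of k are all illustrative assumptions, not anything described in the leak.

```python
import math


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flag_outliers(trusted, candidates, k=3.0):
    """Return candidate sites that sit unusually far from the trusted centroid."""
    center = centroid(list(trusted.values()))
    dists = [distance(vec, center) for vec in trusted.values()]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    cutoff = mean + k * std
    return [site for site, vec in candidates.items()
            if distance(vec, center) > cutoff]


# Made-up 2-dimensional embeddings on reserved example domains.
trusted = {
    "a.example": [1.0, 0.0],
    "b.example": [0.9, 0.1],
    "c.example": [1.0, 0.1],
}
candidates = {
    "d.example": [0.95, 0.05],
    "spam.example": [0.0, 1.0],
}
flagged = flag_outliers(trusted, candidates)
print(flagged)
```

A flag here would only mean "send for further review," which matches the cautious framing above. Distance from trusted sites alone cannot tell spam from a legitimate site on an unrelated topic.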

Page embeddings and site embeddings are foundational elements in the modern web ecosystem. They enable detailed content analysis, quality assessment, and personalized recommendations. By leveraging these embeddings, search engines and content platforms can enhance their services, ensuring users receive the most relevant and high-quality content. The examples from the dataset, such as QualityAuthorityTopicEmbeddingsVersionedItem, QualityNsrNsrData, and ImageRepositoryFrameLevelStarburstEmbeddings, illustrate the practical implementation and benefits of these embeddings in various applications.

Disclaimer: A reminder that this is totally speculative. I used the clues provided in the API and the notes to construct this theoretical product. The odds that Google’s actual software structure works close to how I describe it are long.

SAFT’s Role in Entity Analysis

SAFT (Structured Annotation Framework and Toolkit) is an acronym Google uses internally. It plays a critical role in entity analysis within Google’s search architecture. SAFT is designed to perform advanced semantic parsing, annotation, and extraction of entities and their relationships from textual content.

Here is a detailed explanation of SAFT’s role in entity analysis:

Entity Identification:
NlpSaftEntity:
Identifies named entities in the document, such as persons (PER), organizations (ORG), and locations (LOC).
Stores attributes like entityType, entityTypeProbability, gender, and name.

NlpSemanticParsingSaftMentionAnnotation:
Annotates sub-spans of input text that are relevant to specific entities, such as persons or locations.

Entity Annotation:
NlpSaftAnnotatedPhrase:
Annotates arbitrary spans in the document, including those not considered as entity mentions.
Provides detailed annotations through info and phrase fields.

NlpSaftLabeledSpan:
Defines labeled spans in the text, associating them with specific labels and scores.

Coreference Resolution:
NlpSemanticParsingSaftCoreference:
Resolves pronouns and nominal mentions to their corresponding entities.
Stores coreference annotations, which help in understanding the context and references within the document.

Entity Relations:
NlpSaftRelation:
Defines relations between entities in the document, such as relationships between people or connections between organizations.
Includes fields like source, target, type, and score to describe these relations.

NlpSaftRelationMention:
Captures mentions of relations within the document, linking them to specific entities.

Semantic Nodes and Graphs:
NlpSaftSemanticNode:
Represents semantic constructions in the document, forming a directed acyclic graph (DAG) that captures complex relationships and higher-level abstractions.
Connects nodes to entities, measures, and token spans, providing a rich semantic structure.

NlpSaftSemanticNodeArc:
Represents arcs in the semantic graph, indicating relationships and dependencies between semantic nodes.

Entity Profiling:
NlpSaftEntityProfile:
Contains detailed information about a single unique entity, such as canonical names, attributes, and embeddings.
Includes embedding vectors, attributes, and disambiguation information for precise entity representation.
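
To show how these modules might fit together as data, here is a small Python sketch. It borrows the field names mentioned above (entityType, entityTypeProbability, gender, name, source, target, type, score), but the class names, the Python types, and the idea that source and target are indexes into an entity list are my assumptions. The relation label "colleague_of" is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    name: str
    entity_type: str                # e.g. "PER", "ORG", "LOC"
    entity_type_probability: float
    gender: str = ""


@dataclass
class Relation:
    source: int         # assumed: index of the source entity in Document.entities
    target: int         # assumed: index of the target entity
    relation_type: str  # stands in for the "type" field
    score: float


@dataclass
class Document:
    text: str
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def describe(self, relation: Relation) -> str:
        """Render a relation using the names of the entities it connects."""
        src = self.entities[relation.source].name
        tgt = self.entities[relation.target].name
        return f"{src} -[{relation.relation_type}]-> {tgt}"


doc = Document(text="Ada Lovelace worked with Charles Babbage.")
doc.entities.append(Entity("Ada Lovelace", "PER", 0.98))
doc.entities.append(Entity("Charles Babbage", "PER", 0.97))
doc.relations.append(Relation(source=0, target=1,
                              relation_type="colleague_of", score=0.8))
print(doc.describe(doc.relations[0]))
```

The probabilities and score are placeholder numbers. What matters is the shape: annotations ride along with the document rather than living in a separate store.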

How SAFT Enhances Entity Analysis

SAFT goes beyond traditional entity analysis to provide a more nuanced and comprehensive understanding of text. Let’s explore the specific ways in which SAFT enhances entity analysis, leading to improved accuracy and relevance.

Contextual Understanding: By resolving coreferences and annotating entities within the text, SAFT provides a deeper understanding of the context and meaning of entities in the document.

Relationship Mapping: Through detailed relations and semantic nodes, SAFT maps out the relationships between entities, enabling the detection of complex interactions and connections.

Entity Disambiguation: SAFT’s profiling and annotation capabilities help in disambiguating entities, ensuring that different references to the same entity are correctly identified and linked.

Semantic Enrichment: The rich semantic annotations and structured representations provided by SAFT enhance the overall semantic understanding of the document, making it easier to extract meaningful insights and improve search relevance.

Knowledge Integration: SAFT integrates with Google’s Knowledge Vault, contributing to the fusion and linkage of entities across different data sources, thereby enriching the knowledge graph and improving information retrieval.

Example Workflow

To illustrate the practical application of SAFT, let’s look at a step-by-step example of how it processes and analyzes text to enhance search results.

Document Parsing: A document is parsed into tokens, and initial part-of-speech tagging and dependency relations are established.
Entity Extraction: SAFT identifies and annotates entities within the text, resolving coreferences and marking relevant spans.
Relation Extraction: Relationships between identified entities are mapped, forming a semantic graph that captures the interactions and dependencies.
Semantic Annotation: Additional semantic nodes and arcs are added to represent higher-level abstractions and complex constructions within the document.
Entity Profiling: Profiles are generated for each unique entity, including canonical names, attributes, and embeddings.
Integration with Knowledge Vault: The extracted entities and relationships can be integrated into the Knowledge Vault, contributing to the broader knowledge graph used for search and information retrieval.
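
The entity and relation extraction steps in this workflow can be approximated with a toy example. The sketch below assumes a tiny hand-made lookup table (GAZETTEER) for entity recognition and treats any two entities in the same sentence as related. Both are deliberately crude stand-ins that I chose for illustration; real systems would use trained models, and the "co_occurs_with" label is invented.

```python
import re

# Hypothetical lookup table mapping surface names to entity types.
GAZETTEER = {"Paris": "LOC", "Marie Curie": "PER", "Sorbonne": "ORG"}


def extract_entities(text):
    """Find gazetteer names in the text and record their character spans."""
    found = []
    for name, etype in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append({"name": name, "type": etype,
                          "start": match.start(), "end": match.end()})
    return sorted(found, key=lambda entity: entity["start"])


def extract_relations(text, entities):
    """Pair up entities that appear in the same sentence."""
    relations = []
    for sentence in re.finditer(r"[^.!?]+[.!?]?", text):
        inside = [e for e in entities
                  if sentence.start() <= e["start"] < sentence.end()]
        for i, first in enumerate(inside):
            for second in inside[i + 1:]:
                relations.append((first["name"], "co_occurs_with", second["name"]))
    return relations


text = "Marie Curie taught at the Sorbonne. She lived in Paris."
entities = extract_entities(text)
relations = extract_relations(text, entities)
print(relations)
```

Note that "She" in the second sentence is not linked to Marie Curie here, so no relation to Paris is found. That gap is exactly what the coreference step in the workflow is meant to close.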

In summary, SAFT is a sophisticated system that significantly enhances entity analysis by providing comprehensive semantic parsing, detailed entity and relation annotations, and integration with broader knowledge systems. This results in more accurate, contextually aware, and meaningful search results.

References

These references collectively illustrate the robust role SAFT plays in entity analysis, providing essential functions for identifying, annotating, resolving, and integrating entities within Google’s search infrastructure.

Note that references to “the spreadsheet” refer to a spreadsheet of the leaked Google Content Warehouse API documentation. You can find the source data on hexdocs.

The following are some references for the previous section on SAFT’s role in entity analysis.

Entity Identification
Reference: NlpSaftEntity
Source Material: The rows describing NlpSaftEntity in the spreadsheet.
Claims Supported:
Identifies named entities such as persons (PER), organizations (ORG), and locations (LOC).
Stores attributes like entityType, entityTypeProbability, gender, and name.

Reference: NlpSemanticParsingSaftMentionAnnotation
Source Material: The rows describing NlpSemanticParsingSaftMentionAnnotation in the spreadsheet.
Claims Supported:
Annotates sub-spans of input text relevant to specific entities.

Entity Annotation
Reference: NlpSaftAnnotatedPhrase
Source Material: The rows describing NlpSaftAnnotatedPhrase in the spreadsheet.
Claims Supported:
Annotates arbitrary spans in the document that are not considered mentions of entities.
Provides detailed annotations through info and phrase fields.

Reference: NlpSaftLabeledSpan
Source Material: The rows describing NlpSaftLabeledSpan in the spreadsheet.
Claims Supported:
Defines labeled spans in the text and associates them with specific labels and scores.

Coreference Resolution
Reference: NlpSemanticParsingSaftCoreference
Source Material: The rows describing NlpSemanticParsingSaftCoreference in the spreadsheet.
Claims Supported:
Resolves pronouns and nominal mentions to their corresponding entities.
Stores coreference annotations for understanding context and references.

Entity Relations
Reference: NlpSaftRelation
Source Material: The rows describing NlpSaftRelation in the spreadsheet.
Claims Supported:
Defines relations between entities in the document.
Includes fields like source, target, type, and score to describe these relations.

Reference: NlpSaftRelationMention
Source Material: The rows describing NlpSaftRelationMention in the spreadsheet.
Claims Supported:
Captures mentions of relations within the document, linking them to specific entities.

Semantic Nodes and Graphs
Reference: NlpSaftSemanticNode
Source Material: The rows describing NlpSaftSemanticNode in the spreadsheet.
Claims Supported:
Represents semantic constructions in the document, forming a directed acyclic graph (DAG).
Connects nodes to entities, measures, and token spans.

Reference: NlpSaftSemanticNodeArc
Source Material: The rows describing NlpSaftSemanticNodeArc in the spreadsheet.
Claims Supported:
Represents arcs in the semantic graph, indicating relationships and dependencies between semantic nodes.

Entity Profiling
Reference: NlpSaftEntityProfile
Source Material: The rows describing NlpSaftEntityProfile in the spreadsheet.
Claims Supported:
Contains detailed information about a single unique entity, such as canonical names, attributes, and embeddings.

Here are some references for the previous section on how SAFT supports entity analysis.

Contextual Understanding
Source Material: Descriptions of NlpSaftEntity, NlpSemanticParsingSaftMentionAnnotation, and NlpSaftAnnotatedPhrase.
Supporting Claim: By resolving coreferences and annotating entities, SAFT provides a deeper understanding of the context and meaning of entities in the document.

Relationship Mapping
Source Material: Descriptions of NlpSaftRelation, NlpSaftSemanticNode, and NlpSaftSemanticNodeArc.
Supporting Claim: Maps out relationships between entities, enabling detection of complex interactions and connections.

Entity Disambiguation
Source Material: Descriptions of NlpSaftEntityProfile and NlpSemanticParsingSaftMentionAnnotation.
Supporting Claim: Helps in disambiguating entities, ensuring correct identification and linking of references.

Semantic Enrichment
Source Material: Comprehensive annotations and structured representations in NlpSaftDocument and NlpSaftSemanticNode.
Supporting Claim: Enhances the overall semantic understanding of the document, aiding in meaningful insight extraction.

Knowledge Integration
Source Material: Integration features in VideoContentSearchSaftEntityInfo and NlpSaftEntity.
Supporting Claim: Contributes to the fusion and linkage of entities in the Knowledge Vault, enriching the knowledge graph for better information retrieval.

Inferred Architecture of SAFT

Here’s my inferred architecture for SAFT within Google’s search infrastructure. It offers a layer-by-layer view of how SAFT might process, annotate, and enrich textual content to support entity analysis and semantic understanding.

1. Input Layer: Document Parsing
Document Ingestion:
Raw text documents are ingested into the system.
Input data includes web pages, documents, queries, and other text-based content.

Tokenization:
The raw text is split into tokens (words or phrases).
Tokens are assigned identifiers and positions within the document.
Example: NlpSaftToken, which marks spans of bytes in the document text as tokens.

Basic Annotation:
Initial annotations such as part-of-speech tags and basic entity recognition.
Example: NlpSaftToken includes attributes like tag (part-of-speech) and lemma.

2. Semantic Annotation Layer
Phrase and Span Annotation:
Annotates arbitrary spans in the document that may not be direct entity mentions.
Example: NlpSaftAnnotatedPhrase and NlpSaftLabeledSpan.

Entity Recognition and Annotation:
Identifies and annotates named entities within the text.
Example: NlpSaftEntity, which includes entity types, names, and salience scores.

Coreference Resolution:
Resolves pronouns and nominal mentions to their corresponding entities.
Example: NlpSemanticParsingSaftCoreference resolves mentions to entities.

Measure and Quantities Identification:
Detects and annotates measures and quantities within the text.
Example: NlpSemanticParsingSaftMeasure identifies measures like "53 pounds".

3. Relationship and Contextual Analysis Layer
Relation Extraction:
Extracts relationships between entities and annotates these relations.
Example: NlpSaftRelation and NlpSaftRelationMention.

Semantic Node Creation:
Creates semantic nodes representing higher-level abstractions and relationships.
Forms a directed acyclic graph (DAG) to capture complex semantic structures.
Example: NlpSaftSemanticNode and NlpSaftSemanticNodeArc.

4. Profiling and Knowledge Integration Layer
Entity Profiling:
Generates detailed profiles for each unique entity, including canonical names, attributes, and embeddings.
Example: NlpSaftEntityProfile.

Knowledge Integration:
Integrates entities and relationships into the broader Knowledge Vault and knowledge graph.
Links entities to external knowledge bases like Freebase.
Example: VideoContentSearchSaftEntityInfo which links to Freebase MIDs.

5. Output Layer: Enriched Document Representation
Enriched Document:
Outputs a semantically enriched representation of the document with annotated entities, relations, and semantic structures.
Example: NlpSaftDocument contains raw text, tokens, entities, relations, and semantic annotations.

Feedback Loop:
Continuous feedback loop to refine and improve annotations based on user interactions and quality rater feedback.
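
As an illustration of the tokenization step in the input layer, here is a sketch of a tokenizer that records byte spans, since NlpSaftToken is described as marking spans of bytes in the document text. The Token class, the regular expression, and the choice of an inclusive end offset are my assumptions, not details taken from the leak.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    start: int  # offset of the token's first byte in the UTF-8 encoded text
    end: int    # offset of the token's last byte (inclusive, by assumption)


def tokenize(text):
    """Split text into word and punctuation tokens with UTF-8 byte offsets."""
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        # Byte offset = number of bytes used by everything before the match.
        start = len(text[:match.start()].encode("utf-8"))
        end = start + len(match.group().encode("utf-8")) - 1
        tokens.append(Token(match.group(), start, end))
    return tokens


text = "Caf\u00e9 opens at 9."
tokens = tokenize(text)
for token in tokens:
    print(token)
```

Byte offsets differ from character offsets as soon as the text contains non-ASCII characters. In this example the accented letter takes two bytes, so every later token starts one byte further along than its character position.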

Detailed Component Interactions

To truly grasp the inner workings of SAFT, let’s examine the intricate interactions between its various components. By tracing the flow of information and analysis through each stage, we can gain a deeper appreciation for how SAFT transforms raw text into a rich semantic representation, ultimately empowering more accurate and informative search results.

A. Tokenization and Basic Annotation
Components: NlpSaftToken, NlpSaftMorphology
Process: Document text is tokenized. Tokens are annotated with part-of-speech tags and basic morphological information.

B. Semantic Annotation
Components: NlpSaftAnnotatedPhrase, NlpSaftLabeledSpan, NlpSaftEntity
Process: Arbitrary spans and specific entities are annotated within the document.
Coreference resolution links pronouns and mentions to their respective entities.

C. Relationship Extraction
Components: NlpSaftRelation, NlpSaftRelationMention
Process: Relationships between entities are extracted and annotated.
Relation mentions link specific instances within the text.

D. Semantic Graph Construction
Components: NlpSaftSemanticNode, NlpSaftSemanticNodeArc
Process: Semantic nodes and arcs form a directed acyclic graph representing complex relationships.
Nodes and arcs provide detailed semantic context and higher-level abstractions.

E. Entity Profiling and Integration
Components: NlpSaftEntityProfile, VideoContentSearchSaftEntityInfo
Process: Entities are profiled with detailed attributes and embeddings.
Entities are integrated into the Knowledge Vault, linked to external knowledge bases.
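
The semantic graph in step D is described as a directed acyclic graph, so any builder would need to guard against cycles. Here is a minimal sketch using Python's standard graphlib module (Python 3.9+). The node labels and the build_semantic_graph and is_acyclic helpers are hypothetical; the leak does not describe how the graph is validated.

```python
from graphlib import CycleError, TopologicalSorter


def build_semantic_graph(arcs):
    """Build {node: set(children)} from (parent, child) arc pairs."""
    graph = {}
    for parent, child in arcs:
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set())
    return graph


def is_acyclic(graph):
    """True if the graph has no cycles.

    TopologicalSorter expects node -> predecessors, but edge direction
    does not matter for cycle detection, so the child map works as-is.
    """
    try:
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False


# Invented node labels for a sentence about a purchase.
arcs = [
    ("purchase_event", "buyer"),
    ("purchase_event", "item"),
    ("item", "price"),
]
graph = build_semantic_graph(arcs)
print(is_acyclic(graph))

# Adding an arc back to the root creates a cycle.
cyclic = build_semantic_graph(arcs + [("price", "purchase_event")])
print(is_acyclic(cyclic))
```

Keeping the structure acyclic means the graph can always be processed in dependency order, which is a common reason to choose a DAG for this kind of representation.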

Data Flow Example

To solidify our understanding of SAFT’s intricate processes, let’s follow a hypothetical piece of text as it journeys through the SAFT system. By tracing the data flow from document ingestion to knowledge integration, we can visualize how SAFT transforms raw text into a rich tapestry of interconnected information.

Document Ingestion: A document is ingested into the SAFT system.
Tokenization: The document text is split into tokens (NlpSaftToken).
Basic Annotation: Tokens are annotated with part-of-speech tags and lemmas.
Entity Recognition: Named entities are identified and annotated (NlpSaftEntity).
Phrase Annotation: Arbitrary spans are annotated for additional context (NlpSaftAnnotatedPhrase).
Coreference Resolution: Pronouns and mentions are resolved to entities (NlpSemanticParsingSaftCoreference).
Relation Extraction: Relationships between entities are extracted and annotated (NlpSaftRelation).
Semantic Graph Construction: Semantic nodes and arcs are created to form a DAG representing complex relationships (NlpSaftSemanticNode).
Entity Profiling: Detailed profiles are created for each unique entity (NlpSaftEntityProfile).
Knowledge Integration: Entities and relationships are integrated into the Knowledge Vault (VideoContentSearchSaftEntityInfo).
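
The coreference step in this data flow is the hardest to picture, so here is a deliberately naive sketch. It links each pronoun to the most recently mentioned entity, which is a textbook baseline rather than anything the leak describes. The PRONOUNS set, the resolve_coreferences helper, and the sample sentence are all assumptions for illustration.

```python
PRONOUNS = {"he", "she", "they"}


def resolve_coreferences(tokens, entity_mentions):
    """Link pronouns to the nearest preceding entity mention.

    tokens: list of words.
    entity_mentions: {token_index: entity_name} for known mentions.
    Returns {pronoun_token_index: entity_name}.
    """
    resolved = {}
    last_entity = None
    for index, word in enumerate(tokens):
        if index in entity_mentions:
            last_entity = entity_mentions[index]
        elif word.lower() in PRONOUNS and last_entity is not None:
            resolved[index] = last_entity
    return resolved


tokens = "Ada wrote the report . She sent it to Bob . He replied .".split()
mentions = {0: "Ada", 9: "Bob"}
links = resolve_coreferences(tokens, mentions)
print(links)
```

This baseline ignores gender, number, and syntax, so it would fail on many real sentences. It is only meant to show the input and output of the step: token positions in, pronoun-to-entity links out.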

Having explored the data flow within SAFT, let’s now examine the individual layers and components responsible for transforming raw text into a semantically rich representation. Understanding the functions of each layer will provide a clearer picture of how SAFT accomplishes its sophisticated analysis.

1. Input Layer: Document Parsing
Document Ingestion:
Input: Raw text documents.
Output: Text ready for processing.

Tokenization:
Input: Ingested document text.
Output: Tokens (words/phrases).
Component: NlpSaftToken

Basic Annotation:
Input: Tokens.
Output: Tokens with part-of-speech tags and lemmas.
Components: NlpSaftToken (with tags and lemmas)

2. Semantic Annotation Layer
Phrase and Span Annotation:
Input: Document text.
Output: Annotated phrases and spans.
Components: NlpSaftAnnotatedPhrase, NlpSaftLabeledSpan

Entity Recognition and Annotation:
Input: Document text.
Output: Annotated entities (e.g., persons, organizations).
Component: NlpSaftEntity

Coreference Resolution:
Input: Document text.
Output: Resolved pronouns and nominal mentions.
Component: NlpSemanticParsingSaftCoreference

Measure and Quantities Identification:
Input: Document text.
Output: Identified measures (e.g., quantities).
Component: NlpSemanticParsingSaftMeasure

3. Relationship and Contextual Analysis Layer
Relation Extraction:
Input: Annotated entities.
Output: Extracted relationships between entities.
Components: NlpSaftRelation, NlpSaftRelationMention

Semantic Node Creation:
Input: Annotated document.
Output: Semantic nodes forming a directed acyclic graph (DAG).
Components: NlpSaftSemanticNode, NlpSaftSemanticNodeArc

4. Profiling and Knowledge Integration Layer
Entity Profiling:
Input: Annotated entities.
Output: Detailed entity profiles.
Component: NlpSaftEntityProfile

Knowledge Integration:
Input: Entity profiles.
Output: Entities integrated into the Knowledge Vault.
Component: VideoContentSearchSaftEntityInfo

5. Output Layer: Enriched Document Representation
Enriched Document:
Input: Processed document.
Output: Semantically enriched document.
Component: NlpSaftDocument

Feedback Loop:
Input: User interactions and quality rater feedback.
Output: Continuous improvement of annotations and processing.
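
Of the layers above, measure identification is the easiest to sketch in a few lines. The example below assumes a small fixed unit list and a regular expression, and it handles the "53 pounds" case mentioned earlier. The UNITS tuple, the pattern, and the extract_measures helper are hypothetical; NlpSemanticParsingSaftMeasure is presumably far more capable.

```python
import re

# Hypothetical unit list; a real system would cover many more units and formats.
UNITS = ("pounds", "kilograms", "miles", "kilometers")
MEASURE_RE = re.compile(r"(\d+(?:\.\d+)?)\s+(" + "|".join(UNITS) + r")\b")


def extract_measures(text):
    """Return each number-plus-unit mention with its value and character span."""
    return [
        {
            "value": float(match.group(1)),
            "unit": match.group(2),
            "start": match.start(),
            "end": match.end(),
        }
        for match in MEASURE_RE.finditer(text)
    ]


text = "The package weighs 53 pounds and shipped 120.5 miles."
measures = extract_measures(text)
for measure in measures:
    print(measure["value"], measure["unit"], text[measure["start"]:measure["end"]])
```

Keeping the span alongside the parsed value mirrors the input and output framing above: the annotation points back into the document text rather than replacing it.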

Visualization in a Flowchart

This flowchart outlines a document’s journey through the system. It should help solidify your understanding of the interconnected steps involved in transforming raw text into a semantically rich and informative output.

Document Ingestion ➔ Document Text
➔ Tokenization ➔ Tokens
➔ Basic Annotation ➔ Annotated Tokens
➔ Semantic Annotation Layer:
- Phrase and Span Annotation ➔ Annotated Phrases/Spans
- Entity Recognition and Annotation ➔ Annotated Entities
- Coreference Resolution ➔ Resolved Coreferences
- Measure and Quantities Identification ➔ Identified Measures
➔ Relationship and Contextual Analysis Layer:
- Relation Extraction ➔ Extracted Relations
- Semantic Node Creation ➔ Semantic Graph
➔ Profiling and Knowledge Integration Layer:
- Entity Profiling ➔ Entity Profiles
- Knowledge Integration ➔ Integrated Knowledge
➔ Output Layer:
- Enriched Document ➔ Semantically Enriched Document
- Feedback Loop ➔ Continuous Improvement
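
The flowchart above is essentially a chain of stages that each add something to a shared document. A minimal sketch of that pattern follows. The run_pipeline helper, the stage names, and the stage logic are all invented for illustration (the capitalized-word "entity recognizer" is intentionally crude), and nothing here reflects how Google chains its components.

```python
def run_pipeline(text, stages):
    """Run stages in order; each returns new keys to merge into the document."""
    doc = {"text": text}
    for name, stage in stages:
        doc.update(stage(doc))
        doc.setdefault("stages_run", []).append(name)
    return doc


# Each stage reads what earlier stages produced and adds its own output.
stages = [
    ("tokenization",
     lambda d: {"tokens": d["text"].split()}),
    ("entity_recognition",
     lambda d: {"entities": [t for t in d["tokens"] if t[0].isupper()]}),
    ("entity_profiling",
     lambda d: {"profiles": {e: {"mentions": d["tokens"].count(e)}
                             for e in d["entities"]}}),
]

doc = run_pipeline("Google indexed the page and Google ranked it", stages)
print(doc["stages_run"])
print(doc["profiles"])
```

The design choice worth noting is that the document accumulates annotations as it moves through the stages, which is the "data structure that rides along with the page" idea in miniature.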

Conclusion

SAFT is a sophisticated system that significantly enhances entity analysis by providing comprehensive semantic parsing, detailed entity and relation annotations, and integration with broader knowledge systems. This results in more accurate, contextually aware, and meaningful search results.

By understanding the intricacies of SAFT, we can better appreciate how search engines are able to process and interpret the vast amounts of information available on the web, ultimately leading to a more informative and satisfying user experience.

The value of an API is that creative, remarkable people can think of unique ways to solve problems. I hope Google uses this opportunity to embrace the industry and collaborate with a shared vocabulary. The API shared is impressive, and if played well, the exposure of this information can be a big win for them.

Jeff Coyle

Co-founder & Chief Strategy Officer at MarketMuse

Jeff is Co-Founder and Chief Product Officer at MarketMuse. He is a cross-disciplined, data-driven inbound marketing executive with 18+ years of experience managing products and website networks; focused on helping companies grow. You can follow him on Twitter or LinkedIn.