
The Rise of Embeddings in SEO

Chris Green

Embeddings have (rightly) dropped onto the radar of more SEOs—and that's broadly a good thing. However, beyond some of the more obvious applications, there remains a gap in understanding exactly what embeddings are and how to effectively utilise them.

What Are Embeddings and Why Should You Care?

Embeddings are transforming how we analyse and match content. Traditionally, many SEOs have used them to:

  • Match keywords or URLs - Establish relationships between pieces of content.

  • Facilitate redirect mapping - Ensure proper content redirection.

  • Identify internal link opportunities - Enhance site architecture and user navigation.


While these applications work well in some cases, it's important to remember that embeddings are not a one-size-fits-all solution.

The Experimentation Process

Not a Panacea

The more you experiment with embeddings, the clearer it becomes that they aren't the unquestionable panacea they’ve been dubbed.


Their effectiveness heavily depends on:

  • The training data - Embeddings are only as robust as the data they’ve been trained on.

  • Understanding nuance -  They often struggle with subtle differences or similar words that can have distinct meanings.


Real-World Example: The Artisanal Ham Sandwich

Imagine running a recipe blog focused on your artisanal Ham Sandwich. When trying to match words:

  • Bacon, Gammon, and Ham could all potentially be seen as similar - or not; after all, how important is the cut of the pig?

  • But what about the “Pig” itself? That match might be less useful in this context.


This highlights a key challenge: determining whether similar words should be considered interchangeable - sometimes treating them that way is desirable, other times it isn't.


Diving Deeper: Similarity Scores and Thresholds

After generating embeddings from your keywords, you need to compare them using similarity scores (ranging between 0 and 1).
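As a quick illustration, here is a minimal sketch of comparing two keywords with an SBERT model and cosine similarity. The model name and keywords are just examples, not necessarily what any particular tool uses:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Example model and keywords -- purely illustrative
model = SentenceTransformer("all-MiniLM-L6-v2")

keyword_a = "artisanal ham sandwich"
keyword_b = "gammon roll recipe"

# Encode both keywords into embedding vectors
emb_a = model.encode(keyword_a, convert_to_tensor=True)
emb_b = model.encode(keyword_b, convert_to_tensor=True)

# Cosine similarity: closer to 1.0 means the model sees them as near-identical in meaning
score = util.cos_sim(emb_a, emb_b).item()
print(f"Similarity: {score:.3f}")
```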


Here’s where things get tricky:


  • Defining a good match - What score qualifies as a match can vary by subject. For instance, matching thresholds for holidays, ham sandwiches, and cosmetics will differ.

  • Impact of the model - Different embedding models can produce varying similarity scores even for the same set of keywords.


Time to Test It: A New Streamlit App

To help navigate these complexities, I developed a small Streamlit app that:


  • Compares two sets of keywords alongside a "ground truth" that specifies whether a keyword should match.

  • Tests multiple embedding models - initially OpenAI's text-embedding-ada-002 and SBERT's all-MiniLM-L6-v2.

  • Calculates similarity scores and determines the optimal threshold based on the 90th percentile of your ground truth data (a rough sketch of this step follows below).
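As a rough sketch of the core idea - not the app's actual code, and the exact percentile logic may differ - the threshold step might look something like this, assuming a small labelled set of keyword pairs:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

# Hypothetical ground truth: (keyword_a, keyword_b, should_match)
ground_truth = [
    ("ham sandwich", "gammon sandwich", True),
    ("ham sandwich", "bacon butty", True),
    ("ham sandwich", "pig", False),
]

model = SentenceTransformer("all-MiniLM-L6-v2")

scores, labels = [], []
for a, b, should_match in ground_truth:
    emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)
    scores.append(util.cos_sim(emb_a, emb_b).item())
    labels.append(should_match)

scores = np.array(scores)
labels = np.array(labels)

# One plausible way to pick a threshold from the labelled data: take a high
# percentile of the non-matching pairs' scores, so most true matches sit above it.
threshold = np.percentile(scores[~labels], 90)

# Classification error: how many pairs the chosen threshold gets wrong
predictions = scores > threshold
error_rate = (predictions != labels).mean()
print(f"Threshold: {threshold:.3f}, classification error: {error_rate:.0%}")
```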


How It Helps


For now, I found this helped with two things primarily:


  • Threshold determination - Understanding what similarity score works best for a specific topic.

  • Model comparison - Highlighting the effectiveness differences between the two models.


Let's look at some examples first; afterwards, we'll reflect on what other applications there may be in the future.


Real-World Insights


Here are three small examples to show how this works. Each bases "classification error" on whether the similarity classification falls within the recommended threshold - remember, the goal is to correctly match as much as possible.


Artisan Sandwiches

Both SBERT and OpenAI match "Pig" to Bacon, Gammon, and Ham. But without sufficient context (i.e., ensuring these matches are food-related), the usefulness of the match diminishes.



Cosmetics

This shows just how an inconsistent ground truth can lead to skewed scoring. If the goal is to match "foundation" to "makeup" but avoid matching "lip stick" to "lip liner" or "lip gloss," any inconsistency in input data throws off the results.


Garbage in, garbage out.
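If you want to sanity-check a ground truth for exactly this kind of contradiction, a simple pass with pandas can flag pairs that appear with conflicting labels. The column names here are just an assumption about how the data might be laid out:

```python
import pandas as pd

# Hypothetical ground-truth layout: one row per keyword pair plus a match label
df = pd.DataFrame([
    {"keyword_a": "foundation", "keyword_b": "makeup",    "should_match": True},
    {"keyword_a": "lip stick",  "keyword_b": "lip liner", "should_match": False},
    {"keyword_a": "lip stick",  "keyword_b": "lip liner", "should_match": True},  # contradiction
])

# Group identical pairs and flag any that carry more than one label
conflicts = (
    df.groupby(["keyword_a", "keyword_b"])["should_match"]
      .nunique()
      .loc[lambda n: n > 1]
)
print(conflicts)  # any pairs listed here need cleaning up before scoring
```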



Travel Queries

When booking a trip to Maui, SBERT struggles to differentiate between day trips, package holidays, and lodges. OpenAI tends to handle this nuance better.



Where next?


This could be taken in many different directions, but for now the tool is mainly there to play with thresholds and illustrate the main point I'm making here. But where could we go?


Expand the Model Comparison


Evaluate additional embedding models to see if they offer improved nuance and relevance. The two I've used are mostly examples, but the choice of model has a profound impact.

Test these models across diverse content types to understand their domain-specific performance, and don't assume success will be universal.
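Swapping sentence-transformers models in and out is straightforward; here's a minimal sketch (the model names are just examples, not recommendations, and OpenAI models would need their own API instead):

```python
from sentence_transformers import SentenceTransformer, util

# Example SBERT-compatible checkpoints to compare -- swap in whatever you want to test
model_names = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]

pair = ("maui day trips", "maui package holidays")

for name in model_names:
    model = SentenceTransformer(name)
    emb = model.encode(list(pair), convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"{name}: {score:.3f}")
```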


Fine-Tune Embeddings for Niche Topics or Other Languages


Experiment with fine-tuning pre-trained models on industry-specific data. Okay, this is more easily said than done, but if you need embeddings and the details matter, you will have to venture down this path. Domain-specific datasets can help the models learn subtle differences in language, which is key to scaling this effectively.
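As a very rough sketch of what that could look like with sentence-transformers - the training pairs, labels, and parameters below are placeholders, and a real fine-tune needs far more domain data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from a pre-trained checkpoint
model = SentenceTransformer("all-MiniLM-L6-v2")

# A handful of labelled pairs -- in reality you'd want thousands of domain examples
train_examples = [
    InputExample(texts=["ham sandwich", "gammon sandwich"], label=1.0),
    InputExample(texts=["ham sandwich", "pig"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# Nudge the embeddings towards the domain's notion of similarity
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("ham-sandwich-embeddings")  # hypothetical output path
```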


Improve Input Data


Any kind of training/supervision means that your training data or ground truths HAVE to be solid. You'll also need to use bigger subsets than I have in these illustrations. Build a big enough set in your niche and you can get a good threshold to run larger datasets through.
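Once the threshold looks solid on a decent-sized labelled set, applying it to a much larger unlabelled set (a redirect-mapping run, for example) is mechanical. A sketch, with placeholder keywords and a placeholder threshold value:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
threshold = 0.72  # placeholder: use the value learned from your own ground truth

old_keywords = ["ham sandwich recipe", "gammon roll ideas"]                    # e.g. old URLs' target keywords
new_keywords = ["artisanal ham sandwich", "bacon butty", "pulled pork"]        # e.g. new URLs' target keywords

old_emb = model.encode(old_keywords, convert_to_tensor=True)
new_emb = model.encode(new_keywords, convert_to_tensor=True)

# For each old keyword, take the best-scoring new keyword, but only keep it
# if the score clears the threshold
scores = util.cos_sim(old_emb, new_emb)
for i, old in enumerate(old_keywords):
    best = scores[i].argmax().item()
    best_score = scores[i][best].item()
    match = new_keywords[best] if best_score >= threshold else None
    print(f"{old} -> {match} ({best_score:.3f})")
```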


Beyond this you could experiment with providing greater context, feedback loops and dynamic thresholds - but all of this is getting into trickier data science territory.


While vectorizing content using embeddings for comparison and matching can be incredibly powerful, relying on them to solve every SEO issue is like having a new hammer and only seeing nails. Understanding both the strengths and limitations of embeddings is key to leveraging them effectively in your SEO strategy.

