I wanted to do more internal linking, but I wanted to focus on the pages that needed it most and link the pages that made sense together. The problem was that it was a big site and I didn’t know every page – so I turned to Screaming Frog’s easy-to-set-up embeddings to make this easier.
I ran a long, slow crawl and turned on three different embedding extractions.
1. OpenAI – Extract embeddings from page content (using the OpenAI API)
2. Gemini – Extract embeddings from page content (using the Google Gemini API)
3. Gemini – Extract semantic embeddings from page (using the Google Gemini API)
I ended up using the third one, “semantic embeddings”. There is a slightly different prompt/setup: “semantic embeddings” means the model is asked to optimize for meaningful relationships rather than just surface content.
The result: a nice blend of product pages and blog pages, with better ranking balance (first results are both cables and adapters plus explainer blogs).
Similarity scores also run higher (0.68–0.71), which suggests these embeddings sit more densely in their vector space, so cosine similarities cluster tighter.
So now you have that data – but how do you use it to figure out what to do next?
You’re in a great position to do data-driven internal linking because you now have:
- A full crawl of every page (with content saved)
- Semantic embeddings (so you can find contextual matches)
- GA + GSC data (so you know which pages matter most)
That means you can go way beyond “link this blog to that product” and actually prioritize links by impact.
How to identify the most impactful internal links
Think of it as matching supply (link opportunities) to demand (pages that need authority/traffic).
- Authority flow (PageRank-like)
  - Source: Crawl data (inlinks, depth, status codes)
  - Find “high authority” pages (e.g., homepage, strong blogs with backlinks).
  - Suggest links from them to high-value but underlinked targets (e.g., money pages or new products).
- Traffic opportunity (GA/GSC)
  - Source: GA sessions, GSC impressions, CTR.
  - Pages with lots of impressions but low CTR → benefit from better contextual linking and anchor text.
  - Pages with conversions but low internal visibility → should get more links pointed at them.
- Semantic relevance (embeddings)
  - Use your embeddings to cluster related pages.
  - For each “topic cluster,” make sure products ↔ blogs ↔ support docs are interlinked.
  - Prioritize links where the source has strong traffic and the target has weak authority or low inlinks.
Additional data you could add
- Backlink data (Ahrefs, Majestic, Semrush)
  - Which pages have the strongest external authority? → Great candidates to be link sources.
- Conversion or revenue attribution (GA4/CRM)
  - Pages that drive sales/lead conversions → prioritize inbound links to them.
- SERP data (from GSC)
  - Queries where products rank on page 2–3 → a few strong internal links might push them onto page 1.
- Engagement signals (GA)
  - Pages with low bounce rate / high dwell → great hubs to link out from.
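To make the SERP-data idea concrete, here’s a small pandas sketch that pulls “striking distance” queries (average position roughly 11–30, i.e., page 2–3) out of a GSC export. The column names and rows here are made-up assumptions – adjust them to match your actual export:

```python
import pandas as pd

# Hypothetical GSC export: query, ranking page, impressions, average position.
gsc = pd.DataFrame({
    "query": ["usb-c cable 2m", "hdmi adapter", "best docking station"],
    "page": ["/store/usb-c-2m", "/store/hdmi-adapter", "/blog/docking-stations"],
    "impressions": [1200, 800, 300],
    "position": [14.2, 22.5, 6.1],
})

# Page 2–3 = average position roughly 11–30: these are the queries
# a few strong internal links might push onto page 1.
striking = gsc[(gsc["position"] >= 11) & (gsc["position"] <= 30)]
striking = striking.sort_values("impressions", ascending=False)
print(striking[["query", "page"]].to_string(index=False))
```

The pages surfaced here become high-priority link *targets* in the scoring steps that follow.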
The workflow I built
- Crawl + embeddings → Identify semantically relevant linking opportunities.
- Join with GA/GSC → Overlay traffic, impressions, CTR, conversions.
- Score opportunities by:
  - Source authority × source traffic × semantic similarity × target business value.
- Output a prioritized list of:
  - Source page
  - Suggested anchor text (pulled from semantically matching text spans)
  - Target page
  - Priority score
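That scoring line can be sketched as a tiny function. The 0–1 normalized inputs and the sample values below are illustrative assumptions, not the exact numbers from my sheet:

```python
def priority_score(source_authority, source_traffic, similarity, target_value):
    """Multiply four normalized (0-1) signals; any weak factor drags the score down."""
    return source_authority * source_traffic * similarity * target_value

# A strong, relevant source pointing at a valuable target scores high...
high = priority_score(0.9, 0.8, 0.85, 0.9)
# ...while a low-similarity pairing scores near zero, even from a strong page.
low = priority_score(0.9, 0.8, 0.1, 0.9)
print(round(high, 3), round(low, 3))
```

Multiplying (rather than averaging) is a deliberate choice: a link only ranks well when every factor is at least decent.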
The Giant Google Sheet
I built a Google Sheet and pulled a lot of data into it. On the first tab I put the URLs (all the URLs from the Screaming Frog crawl) in column A, and column B was “Target Need Score”. Now I needed to build the formula to fill in that column.
In another tab (which I titled “ahrefs external”) I have URLs in column A, referring domains in C and “links to target” in E. I find raw “links” a bit flaky and wanted to use referring domains instead.
In another tab (which I titled “internal links”) I have Source URL in column A and Destination URL in column B.
In another tab (which I titled “GA”) I have the URL in column A, Sessions in B, Views in C, Engaged Sessions in D, Engagement Rate in E, Key Events in F and Event Count in G.
Here’s the formula I used to fill in that column B – Target Need Score.
=LET(
  url, A2,
  intLinks, COUNTIF('internal links'!B:B, url),
  intMax, MAX(FILTER(COUNTIF('internal links'!B:B, 'internal links'!B:B), LEN('internal links'!B:B))),
  intNeed, 1 - IF(intMax=0, 0, intLinks / intMax),
  refDom, IFNA(VLOOKUP(url, 'ahrefs external'!A:C, 3, FALSE), 0),
  refMax, MAX('ahrefs external'!C:C),
  extNeed, 1 - IF(refMax=0, 0, refDom / refMax),
  engSess, IFNA(VLOOKUP(url, GA!A:G, 4, FALSE), 0),
  engMax, MAX(GA!D:D),
  engRate, IFNA(VLOOKUP(url, GA!A:G, 5, FALSE), 0),
  demand, IF(engMax=0, 0, (engSess / engMax) * engRate),
  wInt, 0.50, wExt, 0.20, wDem, 0.30,
  wInt*intNeed + wExt*extNeed + wDem*demand
)
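If you’d rather compute the same score in Python than in Sheets, here’s a rough pandas equivalent of that LET formula, with the same weights but toy stand-in data (the tab layouts are simplified):

```python
import pandas as pd

# Toy stand-ins for the sheet tabs (columns simplified)
pages = pd.DataFrame({"url": ["/store/a", "/blog/b"]})
internal = pd.DataFrame({"dest": ["/store/a", "/store/a", "/blog/b"]})  # 'internal links' destinations
ahrefs = pd.DataFrame({"url": ["/store/a", "/blog/b"], "ref_domains": [40, 5]})
ga = pd.DataFrame({"url": ["/store/a", "/blog/b"],
                   "engaged_sessions": [100, 400], "engagement_rate": [0.5, 0.8]})

df = pages.merge(ahrefs, on="url", how="left").merge(ga, on="url", how="left")
df["inlinks"] = df["url"].map(internal["dest"].value_counts()).fillna(0)

# Deficits: 1 - (value / max), so under-linked / under-powered pages score high
df["int_need"] = 1 - df["inlinks"] / df["inlinks"].max()
df["ext_need"] = 1 - df["ref_domains"] / df["ref_domains"].max()
df["demand"] = (df["engaged_sessions"] / df["engaged_sessions"].max()) * df["engagement_rate"]

# Same weights as the LET formula: 0.5 internal / 0.2 external / 0.3 demand
df["target_need"] = 0.5 * df["int_need"] + 0.2 * df["ext_need"] + 0.3 * df["demand"]
print(df[["url", "target_need"]])
```

In this toy data the blog page scores much higher: fewer inlinks, fewer referring domains, more engagement – exactly the profile of a page that should receive links.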
What your “Target Need Score” represents
Each score in column B is a normalized composite metric you built from:
- Internal link deficit → pages with fewer internal inlinks (relative to the page with the most).
- External authority deficit → pages with fewer referring domains (Ahrefs).
- Demand signal → pages that attract traffic/engagement in GA, meaning they’d benefit from more authority.
Each factor was weighted (0.5 internal / 0.2 external / 0.3 demand).
So:
- Higher score (closer to 1.0) → Page is underlinked, underpowered externally, but has traffic/engagement potential → prime candidate to receive more internal links.
- Lower score (closer to 0.0) → Page already has strong internal/external links, or little demand, so there’s less upside in sending more links its way.
- Prioritized targets by score: The 0.8+ pages are your “link me now” targets.
Match relevance
The goal here is to pair sources (pages with authority) to targets (pages with high Target Need Score), but only when they are semantically relevant.
You already have:
- Target Need Score (column B in your main sheet).
- Screaming Frog’s embeddings export (cosine similarity scores between query and page).
Now you want to use those embeddings to build source → target matches.
Export embeddings data

You won’t need most of the crawl metadata anymore.
- Column A (Address) = the page URL.
- Column I (Extract semantic embeddings from page) = the raw embedding vector.
That’s all you need to calculate semantic similarity.
Why you can trim it down
The embedding vector in Column I is a list of ~1,500 numbers (depending on model). Each page has one vector. To measure relevance between two pages, you just calculate the cosine similarity between their vectors.
Everything else (status codes, timestamps, etc.) is useful for QA but not for link-matching.
Google Sheets can’t natively calculate cosine similarity on arrays that long – it’ll choke. You’d need to preprocess elsewhere and bring the pairwise similarities back in.
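The math the preprocessing step has to do per pair is tiny – cosine similarity is just a normalized dot product:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny 3-dimensional toy vectors (real embeddings have ~1,500 dimensions)
v1 = np.array([0.2, 0.8, 0.1])
v2 = np.array([0.25, 0.7, 0.05])  # nearly the same direction as v1
v3 = np.array([-0.6, 0.1, 0.9])   # points somewhere else entirely

print(cosine_similarity(v1, v2))  # close to 1
print(cosine_similarity(v1, v3))  # much lower
```

The scripts later in this post do the same thing in bulk by normalizing every vector once and taking one matrix product.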
Google Colab (easiest, free, runs in browser)
Google Colab is perfect if you don’t want to install anything locally.
Steps:
- Go to Google Colab.
- Click New Notebook.
- In the first cell, paste the Python script below.
- Save your simple embeddings Sheet as: embedding.csv (that’s the filename the scripts below expect).
- Upload your embeddings CSV:
  - Go to the Files panel on the left → click Upload → select your CSV export from Sheets (the tab with columns Address and Embedding).
- Run the cell (Shift + Enter).

Once complete, Colab will generate: similarity_matrix.csv.
One thing to watch out for: make sure the file names and column names exactly match what’s in the script.
Also, when I first ran this, the output was overwhelming. So I tweaked the script to only show close matches by setting a minimum cutoff.
Here’s how you can limit the results so it only gives you a handful of high-quality matches per URL:
Use a Minimum Cosine Similarity Cutoff
Cosine similarity runs from -1 to 1:
- > 0.8 = extremely strong topical match (almost duplicate or very tightly related)
- 0.7 – 0.8 = good contextual match
- 0.65 – 0.7 = loosely related, may still be relevant
- < 0.65 = weak relationship, usually noise
Recommendation
- Start with 0.70 as your cutoff.
- That means only URLs that are very topically related will show up.
Limit to Top N Matches per URL
Even with a cutoff, a busy site might still return too many matches for a single page.
You can cap it so each page only shows its top 3–5 closest matches.
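Combined, the cutoff and the cap look something like this on a made-up similarity matrix:

```python
import numpy as np

MIN_SIM, TOP_K = 0.70, 2

# Toy 4-page similarity matrix (symmetric, 1.0 on the diagonal)
sim = np.array([
    [1.00, 0.82, 0.74, 0.40],
    [0.82, 1.00, 0.66, 0.71],
    [0.74, 0.66, 1.00, 0.30],
    [0.40, 0.71, 0.30, 1.00],
])

matches = {}
for i in range(len(sim)):
    scores = sim[i].copy()
    scores[i] = -1                 # never match a page with itself
    order = np.argsort(-scores)    # best first
    # Keep only matches above the cutoff, then cap at TOP_K
    matches[i] = [j for j in order if scores[j] >= MIN_SIM][:TOP_K]

print(matches)
```

Page 2 only clears the cutoff against page 0, so the cap never kicks in there; page 0 has three candidates but keeps only its two best.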
Too Close
I also found that the embeddings detect extremely similar pages, like two products that differ only by a minor spec. Those pairs can clutter your list because they have very high cosine similarity but aren’t useful internal linking opportunities.
Why This Happens
- High similarity (0.90+) often indicates:
  - Variants of the same product (e.g., USB-C cable 1m vs. 2m).
  - Staging/duplicate content.
  - Boilerplate-heavy pages (like PLPs with near-identical templates).
- For internal linking, these usually don’t need to link to each other, because:
  - It can confuse users.
  - They already naturally link through navigation or faceted search.
Add a Maximum Similarity Threshold
You already have a minimum cutoff (e.g., 0.70).
Add a maximum cutoff, e.g., 0.95, to filter out near-duplicates.
For me, the URL structure of the site I was doing this for is a giveaway to intent: /store/ includes all the listings and product pages where visitors can add to cart; the /blog/ subdirectory is obviously the blog content; /solutions/ and /services/ are mid-intent, somewhere between commercial and just kicking the tires; the /about/ pages aren’t ones I worry about for linking; and last is the /customer-stories/ directory, which is a good linking opportunity if the similarity is close enough.
I decided to add a column that shows intent and apply a weight for how viable each link direction is. The script also filters out anything above 0.95 and below 0.70.
Obviously, your URL structure will be different, but you get the idea.
Finally, if any URLs have no pages similar enough, the script notes them so I can target them for new content.
Wait, what if they’re already linked?
Colab/Python: add “already linked” + “link count” and (optionally) filter them out
This version extends the script. It:
- loads embedding.csv (Address, Embedding)
- classifies intent + computes similarities
- applies min/max similarity and top-K
- loads internal_links.csv (Source URL, Destination URL)
- normalizes URLs consistently
- adds Already Linked (True/False) and Link Count
- optionally drops pairs that are already linked (toggle KEEP_ALREADY_LINKED)
Now back to the giant Google Sheet:
- Download the output CSV using the Files panel → right-click → Download.
- Import it back into Google Sheets.
import pandas as pd
import numpy as np
import re

# ====== knobs you can tweak ======
INPUT_EMB = "/content/embedding.csv"       # columns: Address, Embedding
INPUT_IL = "/content/internal_links.csv"   # columns: Source URL, Destination URL
OUTPUT_MATCHES = "/content/similarity_topk_with_links.csv"
OUTPUT_LONELY = "/content/no_matches.csv"
MIN_SIM = 0.70
MAX_SIM = 0.95
TOP_K = 5
KEEP_ALREADY_LINKED = False
# =================================

def normalize(url: str) -> str:
    if not isinstance(url, str):
        return ""
    u = url.strip().lower()
    u = re.sub(r"[?#].*$", "", u)       # drop query strings and fragments
    if len(u) > 1 and u.endswith("/"):  # drop trailing slash
        u = u[:-1]
    return u

def classify_intent(url: str) -> str:
    u = (url or "").lower()
    if "/store/" in u: return "product"
    if "/blog/" in u: return "blog"
    if "/solutions/" in u: return "solutions"
    if "/services/" in u: return "services"
    if "/customer-stories/" in u: return "customer_stories"
    if "/about/" in u: return "about"
    return "other"

def to_vec(s: str) -> np.ndarray:
    s = str(s).strip()
    if s.startswith("[") and s.endswith("]"):
        s = s[1:-1]
    return np.fromstring(s, sep=",")

# 1) Load embeddings
df = pd.read_csv(INPUT_EMB)
assert {"Address", "Embedding"}.issubset(df.columns), "embedding.csv must have Address and Embedding columns"
df["Address_norm"] = df["Address"].apply(normalize)
df["Intent"] = df["Address"].apply(classify_intent)

# Normalize vectors once so one matrix product gives all cosine similarities
vecs = np.stack(df["Embedding"].apply(to_vec).values)
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
norms[norms == 0] = 1.0
unit = vecs / norms
sim = unit @ unit.T

VIABILITY = {
    ("blog", "product"): 1.00, ("blog", "solutions"): 0.70, ("blog", "services"): 0.70, ("blog", "customer_stories"): 0.75,
    ("customer_stories", "product"): 0.90, ("customer_stories", "solutions"): 0.75, ("customer_stories", "services"): 0.75,
    ("solutions", "product"): 0.85, ("services", "product"): 0.85,
    ("solutions", "services"): 0.60, ("services", "solutions"): 0.60,
    ("product", "blog"): 0.60, ("product", "solutions"): 0.50, ("product", "services"): 0.50,
    ("about", "product"): 0.00, ("about", "blog"): 0.00, ("about", "solutions"): 0.00, ("about", "services"): 0.00, ("about", "customer_stories"): 0.00,
}
DEFAULT_VIABILITY = 0.40

urls = df["Address"].tolist()
urls_norm = df["Address_norm"].tolist()
intents = df["Intent"].tolist()
n = len(urls)

# 2) Load internal links (optional)
try:
    il = pd.read_csv(INPUT_IL)
    src_col = [c for c in il.columns if c.strip().lower().startswith("source")][0]
    dst_col = [c for c in il.columns if c.strip().lower().startswith("destination")][0]
    il["src_norm"] = il[src_col].apply(normalize)
    il["dst_norm"] = il[dst_col].apply(normalize)
    link_counts = il.groupby(["src_norm", "dst_norm"]).size().to_dict()
    has_link = set(link_counts.keys())
    print(f"Loaded {len(il):,} internal link rows from {INPUT_IL}")
except Exception as e:
    print(f"Warning: couldn’t read {INPUT_IL} ({e}). Proceeding without link checks.")
    link_counts = {}
    has_link = set()

rows = []
lonely = []
for i in range(n):
    kept = 0
    source_url = urls[i]
    source_norm = urls_norm[i]
    source_int = intents[i]
    idx_sorted = np.argsort(-sim[i])  # best matches first
    for j in idx_sorted:
        if j == i:
            continue
        score = float(sim[i, j])
        if score > MAX_SIM:  # near-duplicate, skip
            continue
        if score < MIN_SIM:  # sorted desc, so the rest are lower too
            break
        target_url = urls[j]
        target_norm = urls_norm[j]
        target_int = intents[j]
        weight = VIABILITY.get((source_int, target_int), DEFAULT_VIABILITY)
        if weight <= 0:
            continue
        already = (source_norm, target_norm) in has_link
        count = link_counts.get((source_norm, target_norm), 0)
        if not KEEP_ALREADY_LINKED and already:
            continue
        viability_score = weight * score
        rows.append([
            source_url, source_int, target_url, target_int,
            score, weight, viability_score,
            already, count
        ])
        kept += 1
        if kept >= TOP_K:
            break
    if kept == 0:
        lonely.append([source_url, source_int])

cols = [
    "Source URL", "Source Intent", "Target URL", "Target Intent",
    "Cosine Similarity", "Viability Weight", "Viability Score",
    "Already Linked", "Existing Link Count"
]
matches = pd.DataFrame(rows, columns=cols).sort_values(
    ["Source URL", "Viability Score"], ascending=[True, False]
)
matches.to_csv(OUTPUT_MATCHES, index=False)

lonely_df = pd.DataFrame(lonely, columns=["URL", "Intent"])
lonely_df.to_csv(OUTPUT_LONELY, index=False)

print(f"Saved {len(matches):,} link suggestions to {OUTPUT_MATCHES} (KEEP_ALREADY_LINKED={KEEP_ALREADY_LINKED})")
print(f"Saved {len(lonely_df):,} URLs with no viable matches to {OUTPUT_LONELY}")
Next Step
Now you’ll want to layer in your Target Need Score, so the final output ranks not just by similarity but also by which pages need links the most.
What you’ll export from Sheets
- From your “Main” tab, export a CSV with two columns:
- URL
- Target Need Score
Save it in Colab as /content/target_need.csv.
Updated Colab script (adds Target Need + Opportunity Score)
- Reads: embedding.csv, internal_links.csv, target_need.csv
- Computes similarities + intent weights
- Filters <0.70 and >0.95, normalizes URLs, skips already-linked pairs (toggleable)
- Joins Target Need by the Target URL
- Re-ranks by Opportunity Score = Viability Score × Target Need
- Keeps top-K per source after the join
- Writes final CSV
import pandas as pd
import numpy as np
import re

# ====== files & knobs ======
INPUT_EMB = "/content/embedding.csv"       # Address, Embedding
INPUT_IL = "/content/internal_links.csv"   # Source URL, Destination URL
INPUT_TNS = "/content/target_need.csv"     # URL, Target Need Score
OUTPUT_CSV = "/content/internal_link_opportunities.csv"
OUTPUT_LONELY = "/content/no_matches.csv"
MIN_SIM = 0.70
MAX_SIM = 0.95
TOP_K = 5
KEEP_ALREADY_LINKED = False
# ===========================

def normalize(url: str) -> str:
    if not isinstance(url, str): return ""
    u = url.strip().lower()
    u = re.sub(r"[?#].*$", "", u)
    if len(u) > 1 and u.endswith("/"): u = u[:-1]
    return u

def classify_intent(url: str) -> str:
    u = (url or "").lower()
    if "/store/" in u: return "product"
    if "/blog/" in u: return "blog"
    if "/solutions/" in u: return "solutions"
    if "/services/" in u: return "services"
    if "/customer-stories/" in u: return "customer_stories"
    if "/about/" in u: return "about"
    return "other"

def to_vec(s: str) -> np.ndarray:
    s = str(s).strip()
    if s.startswith("[") and s.endswith("]"): s = s[1:-1]
    return np.fromstring(s, sep=",")

# ---- load embeddings ----
emb = pd.read_csv(INPUT_EMB)
assert {"Address", "Embedding"}.issubset(emb.columns)
emb["Address_norm"] = emb["Address"].apply(normalize)
emb["Intent"] = emb["Address"].apply(classify_intent)

vecs = np.stack(emb["Embedding"].apply(to_vec).values)
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
norms[norms == 0] = 1.0
unit = vecs / norms
sim = unit @ unit.T

VIABILITY = {
    ("blog", "product"): 1.00, ("blog", "solutions"): 0.70, ("blog", "services"): 0.70, ("blog", "customer_stories"): 0.75,
    ("customer_stories", "product"): 0.90, ("customer_stories", "solutions"): 0.75, ("customer_stories", "services"): 0.75,
    ("solutions", "product"): 0.85, ("services", "product"): 0.85,
    ("solutions", "services"): 0.60, ("services", "solutions"): 0.60,
    ("product", "blog"): 0.60, ("product", "solutions"): 0.50, ("product", "services"): 0.50,
    ("about", "product"): 0.00, ("about", "blog"): 0.00, ("about", "solutions"): 0.00, ("about", "services"): 0.00, ("about", "customer_stories"): 0.00,
}
DEFAULT_VIABILITY = 0.40

urls = emb["Address"].tolist()
urls_norm = emb["Address_norm"].tolist()
intents = emb["Intent"].tolist()
n = len(urls)

# ---- load internal links (optional) ----
try:
    il = pd.read_csv(INPUT_IL)
    src_col = [c for c in il.columns if c.strip().lower().startswith("source")][0]
    dst_col = [c for c in il.columns if c.strip().lower().startswith("destination")][0]
    il["src_norm"] = il[src_col].apply(normalize)
    il["dst_norm"] = il[dst_col].apply(normalize)
    link_counts = il.groupby(["src_norm", "dst_norm"]).size().to_dict()
    has_link = set(link_counts.keys())
    print(f"Loaded {len(il):,} internal links")
except Exception as e:
    print(f"Warning: couldn’t read internal links ({e}); proceeding without link checks.")
    link_counts = {}
    has_link = set()

# ---- generate candidate pairs (no top-k yet) ----
rows = []
lonely = []
for i in range(n):
    kept_any = False
    idx_sorted = np.argsort(-sim[i])
    for j in idx_sorted:
        if j == i: continue
        score = float(sim[i, j])
        if score > MAX_SIM:  # near-duplicate
            continue
        if score < MIN_SIM:  # below cutoff (rest are lower)
            break
        s_url, s_norm, s_int = urls[i], urls_norm[i], intents[i]
        t_url, t_norm, t_int = urls[j], urls_norm[j], intents[j]
        weight = VIABILITY.get((s_int, t_int), DEFAULT_VIABILITY)
        if weight <= 0:  # non-viable patterns (e.g., about → anything)
            continue
        already = (s_norm, t_norm) in has_link
        if not KEEP_ALREADY_LINKED and already:
            continue
        kept_any = True
        rows.append([
            s_url, s_int, t_url, t_int,
            score, weight, weight * score,
            already,
            link_counts.get((s_norm, t_norm), 0)
        ])
    if not kept_any:
        lonely.append([urls[i], intents[i]])

candidates = pd.DataFrame(rows, columns=[
    "Source URL", "Source Intent", "Target URL", "Target Intent",
    "Cosine Similarity", "Viability Weight", "Viability Score",
    "Already Linked", "Existing Link Count"
])

# ---- join Target Need Score by Target URL ----
tn = pd.read_csv(INPUT_TNS)
# be forgiving with column names
tn_url_col = [c for c in tn.columns if c.strip().lower() in ("url", "address")][0]
tn_score_col = [c for c in tn.columns if "target need" in c.strip().lower()][0]
tn["URL_norm"] = tn[tn_url_col].apply(normalize)
candidates["Target_norm"] = candidates["Target URL"].apply(normalize)
candidates = candidates.merge(
    tn[["URL_norm", tn_score_col]].rename(columns={tn_score_col: "Target Need Score"}),
    left_on="Target_norm", right_on="URL_norm", how="left"
).drop(columns=["URL_norm", "Target_norm"])
candidates["Target Need Score"] = candidates["Target Need Score"].fillna(0.0)

# ---- final Opportunity Score + top-K per source after join ----
candidates["Opportunity Score"] = candidates["Viability Score"] * candidates["Target Need Score"]
final = (
    candidates
    .sort_values(["Source URL", "Opportunity Score"], ascending=[True, False])
    .groupby("Source URL", as_index=False)
    .head(TOP_K)
    .reset_index(drop=True)
)
final.to_csv(OUTPUT_CSV, index=False)
pd.DataFrame(lonely, columns=["URL", "Intent"]).to_csv(OUTPUT_LONELY, index=False)
print(f"Saved {len(final):,} opportunities to {OUTPUT_CSV}")
print(f"Saved {len(lonely):,} URLs with no viable matches to {OUTPUT_LONELY}")
How to use it now
- Export target_need.csv (URL, Target Need Score) from Sheets and upload it to Colab.
- Run the script.
- Open /content/internal_link_opportunities.csv:
- It already excludes near-duplicates and low similarity.
- It skips pairs that already link (unless you flip KEEP_ALREADY_LINKED to True).
- It’s ranked by Opportunity Score and capped to top-K per source after the Target Need join.
If you want to change the ranking to incorporate a Source Strength Score later, we can add a source_strength.csv join and compute:
Opportunity = Source Strength × Target Need × Cosine Similarity (or × Viability Weight).
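Here’s a hedged sketch of what that extension could look like – source_strength.csv, its columns, and the sample values are all hypothetical:

```python
import pandas as pd

# Hypothetical candidates table, shaped like the script's output,
# plus a hypothetical source_strength.csv keyed by source URL.
candidates = pd.DataFrame({
    "Source URL": ["/blog/a", "/blog/b"],
    "Cosine Similarity": [0.82, 0.75],
    "Viability Weight": [1.0, 0.7],
    "Target Need Score": [0.9, 0.4],
})
strength = pd.DataFrame({"URL": ["/blog/a", "/blog/b"],
                         "Source Strength": [0.8, 0.3]})

candidates = candidates.merge(strength, left_on="Source URL", right_on="URL", how="left")
candidates["Source Strength"] = candidates["Source Strength"].fillna(0.0)

# Opportunity = Source Strength × Target Need × Cosine Similarity × Viability Weight
candidates["Opportunity Score"] = (candidates["Source Strength"]
                                   * candidates["Target Need Score"]
                                   * candidates["Cosine Similarity"]
                                   * candidates["Viability Weight"])
print(candidates[["Source URL", "Opportunity Score"]])
```

As with the earlier scoring, multiplying means a weak source page can’t ride a high-need target to the top of the list.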
I’ll leave it here for now so you can spice to taste and make it your own.