How to use Python for NLP and Semantic SEO

Here’s a compact, practical way to use Python for NLP-driven Semantic SEO, from auditing and clustering to internal linking, pruning, and schema.

Table of Contents

1) Setup
2) Crawl/ingest pages (or load your exports)
3) Clean + split into SEO-sized chunks
4) Build embeddings (semantic vectors)
5) Cluster to reveal topics & subtopics
6) Name clusters + extract keyword sets
7) Build an internal linking map (semantic)
8) Content gap & coverage (vs. your topical map)
9) Content pruning by semantic drift
10) Generate FAQs & JSON-LD schema
11) Title & outline suggestions programmatically
12) Competitor/topic scouting (quick & lawful)
13) Store vectors for fast lookups (internal search / linking assistant)
How this helps your Semantic SEO (in practice)

1) Setup

pip install spacy sentence-transformers keybert bertopic gensim scikit-learn umap-learn hdbscan \
pandas numpy beautifulsoup4 trafilatura lxml faiss-cpu
python -m spacy download en_core_web_sm

2) Crawl/ingest pages (or load your exports)

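import requests, trafilatura, pandas as pd

def fetch_text(url):
    # return the main article text, or None if the request or extraction fails
    try:
        resp = requests.get(url, timeout=20)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    return trafilatura.extract(resp.text, include_comments=False, include_tables=False)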

urls = [
  "https://example.com/drug-rehab/detox-guide",
  "https://example.com/opioid/withdrawal-timeline",
  # add site URLs or import from your sitemap/export
]
docs = pd.DataFrame({"url": urls})
docs["text"] = docs["url"].apply(fetch_text)
docs = docs.dropna(subset=["text"])
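
The `urls` list above is typed by hand. As a minimal sketch of the "import from your sitemap" option, assuming a standard sitemaps.org XML file, the helper below pulls every `<loc>` value out of sitemap text. The function name `urls_from_sitemap` and the sample XML are illustrative, not part of any library.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org files
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text):
    """Return the <loc> values of every <url> entry in a sitemap, in order."""
    root = ET.fromstring(xml_text)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]

# Sample sitemap body (made up for illustration)
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/drug-rehab/detox-guide</loc></url>
  <url><loc>https://example.com/opioid/withdrawal-timeline</loc></url>
</urlset>"""

sitemap_urls = urls_from_sitemap(sample_sitemap)
```

In practice you would download the sitemap with `requests` and pass the response body in. Sitemap index files, which list other sitemaps instead of pages, would need a second pass.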

3) Clean + split into SEO-sized chunks

import re, numpy as np
def clean(t): return re.sub(r"\s+", " ", t).strip()
docs["text"] = docs["text"].map(clean)

# optional: chunk into ~500–800 word sections to get granular clusters
def chunk(text, max_words=600):
    words, out = text.split(), []
    for i in range(0, len(words), max_words):
        out.append(" ".join(words[i:i+max_words]))
    return out
rows = []
for r in docs.itertuples():
    for i, c in enumerate(chunk(r.text)):
        rows.append({"url": r.url, "chunk_id": i, "chunk_text": c})
chunks = pd.DataFrame(rows)

4) Build embeddings (semantic vectors)

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast general-purpose model (384-dim vectors)
emb = model.encode(chunks["chunk_text"].tolist(), normalize_embeddings=True)
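
Why bother with `normalize_embeddings=True`? Once every vector has length 1, a plain dot product equals cosine similarity, which is what the later similarity and FAISS steps rely on. Here is a small check of that identity, using random vectors as a stand-in for real embeddings (the array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 384))          # stand-in for un-normalized embeddings

# L2-normalize each row, which is what normalize_embeddings=True does
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Cosine similarity computed the long way on the raw vectors
norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / np.outer(norms, norms)

# On unit vectors the plain dot product gives the same matrix
dot = unit @ unit.T
```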

5) Cluster to reveal topics & subtopics

Use UMAP + HDBSCAN (unsupervised, finds natural topical groups) or KMeans (explicit k).

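import umap, hdbscan
X = emb
# n_neighbors=25 assumes you have well over 25 chunks; random_state makes the layout repeatable
X_umap = umap.UMAP(n_neighbors=25, min_dist=0.0, metric="cosine", random_state=42).fit_transform(X)
labels = hdbscan.HDBSCAN(min_cluster_size=8, metric="euclidean").fit_predict(X_umap)
chunks["cluster"] = labels  # HDBSCAN labels noise points -1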

Now each cluster ≈ a topical “hub” (e.g., opioid withdrawal, alcohol detox at home, treatment options).
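
If you would rather fix the number of topics yourself, the KMeans route mentioned above might look like the sketch below. It runs on synthetic unit vectors in place of `emb`, and `k = 3` is an assumption you would replace with your own estimate.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for `emb`: three tight groups of unit vectors in 16 dimensions
centers = np.eye(16)[:3]
X_demo = np.vstack([c + 0.05 * rng.normal(size=(20, 16)) for c in centers])
X_demo /= np.linalg.norm(X_demo, axis=1, keepdims=True)

k = 3  # assumed number of topics; you choose this up front
km = KMeans(n_clusters=k, n_init=10, random_state=42)
kmeans_labels = km.fit_predict(X_demo)
```

Unlike HDBSCAN, KMeans has no noise label: every chunk is forced into one of the k clusters, so off-topic chunks will not be flagged with -1.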

6) Name clusters + extract keyword sets

from keybert import KeyBERT
kw_model = KeyBERT(model=model)

def cluster_keywords(texts, topn=10):
    joined = " ".join(texts)
    return kw_model.extract_keywords(joined, keyphrase_ngram_range=(1,3), stop_words="english", top_n=topn)

cluster_topics = (
    chunks[chunks.cluster!=-1]
    .groupby("cluster").agg(texts=("chunk_text", list), urls=("url", lambda x: list(set(x))))
    .reset_index()
)
cluster_topics["keywords"] = cluster_topics["texts"].apply(lambda tx: cluster_keywords(tx, topn=15))

Use these keyphrases for section H2s/H3s, FAQs, and internal link anchors.
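
The snippet above extracts keyword sets but leaves the clusters unnamed. One simple way to get a working label, sketched here with a hypothetical helper `name_cluster`, is to take the highest-scoring keyphrase. The example pairs are made up and only mimic the (phrase, score) shape that KeyBERT returns.

```python
def name_cluster(keywords):
    """Pick the highest-scoring keyphrase as a working label for the cluster."""
    if not keywords:
        return "unnamed cluster"
    phrase, _score = max(keywords, key=lambda pair: pair[1])
    return phrase.title()

# Hypothetical KeyBERT-style output: (phrase, score) pairs
example_keywords = [
    ("withdrawal symptoms", 0.58),
    ("opioid withdrawal timeline", 0.71),
    ("detox", 0.44),
]
label = name_cluster(example_keywords)
```

Applied to the pipeline, this would be `cluster_topics["name"] = cluster_topics["keywords"].apply(name_cluster)`. Treat the result as a draft label to review by hand.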

7) Build an internal linking map (semantic)

Link pages whose vectors are similar (cosine similarity above roughly 0.35–0.45; the code below uses 0.40), prioritizing different URLs inside the same cluster.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# page-level vectors = mean of their chunk vectors
# (chunks keeps its default 0..n-1 index, so row positions match rows of emb)
groups = chunks.groupby("url").indices  # {url: positions of that page's chunks}
page_urls = list(groups)
page_vecs = np.vstack([emb[pos].mean(axis=0) for pos in groups.values()])

S = cosine_similarity(page_vecs)
links = []
for i, u in enumerate(page_urls):
    # pick the top 8 semantically related pages, excluding the page itself
    idx = [j for j in np.argsort(-S[i]) if j != i][:8]
    for j in idx:
        if S[i, j] < 0.40: continue
        links.append({"from": u, "to": page_urls[j], "sim": float(S[i, j])})
internal_links = pd.DataFrame(links, columns=["from", "to", "sim"]).sort_values(["from", "sim"], ascending=[True, False])

How to use:

Add 2–4 contextual links per page. Treat targets with a similarity of 0.5 or higher as pillar pages and targets between 0.4 and 0.5 as sibling pages.
Take anchor text from the cluster’s keywords, keeping it natural and varied.
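
Those rules of thumb can be turned into a small filter. The sketch below is one possible reading of them: the helper `plan_links` and its parameter names are hypothetical, and the rows are made up in the same shape as the `links` list built earlier.

```python
from collections import defaultdict

def plan_links(link_rows, pillar_min=0.50, sibling_min=0.40, max_per_page=4):
    """Label each candidate link as pillar or sibling and keep the strongest few per source page."""
    by_source = defaultdict(list)
    for row in link_rows:
        if row["sim"] < sibling_min:
            continue  # too weak to recommend
        role = "pillar" if row["sim"] >= pillar_min else "sibling"
        by_source[row["from"]].append({**row, "role": role})
    plan = {}
    for source, rows in by_source.items():
        rows.sort(key=lambda r: r["sim"], reverse=True)
        plan[source] = rows[:max_per_page]
    return plan

# Made-up rows in the same shape as the `links` list
demo_links = [
    {"from": "/a", "to": "/b", "sim": 0.62},
    {"from": "/a", "to": "/c", "sim": 0.44},
    {"from": "/a", "to": "/d", "sim": 0.31},
    {"from": "/b", "to": "/a", "sim": 0.62},
]
link_plan = plan_links(demo_links)
```

With real data you could pass `internal_links.to_dict("records")` as `link_rows`.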

8) Content gap & coverage (vs. your topical map)

For each cluster, define must-have subheadings from keywords.
Score each page for presence/absence of those terms to identify missing sections.

needed = {cid: [k for k,_ in kws] for cid, kws in zip(cluster_topics["cluster"], cluster_topics["keywords"])}

def coverage_score(text, terms):
    t = text.lower()
    return sum(1 for term in terms if term.lower() in t)/max(1,len(terms))

page_text = docs.set_index("url")["text"].to_dict()
coverage = []
for cid, terms in needed.items():
    urls = set(sum(cluster_topics.loc[cluster_topics.cluster==cid, "urls"].tolist(), []))
    for u in urls:
        coverage.append({"cluster": cid, "url": u, "coverage": coverage_score(page_text.get(u,""), terms)})
coverage = pd.DataFrame(coverage).sort_values("coverage")

Pages with low coverage are your first targets for expansion.
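
A score tells you which pages to open, but not what to add. A companion helper (hypothetical name `missing_terms`) can list the terms a page lacks, using the same case-insensitive substring check as `coverage_score`. The text and terms below are made up for illustration.

```python
def missing_terms(text, terms):
    """Return the required terms that never appear in the page text (case-insensitive)."""
    lowered = text.lower()
    return [term for term in terms if term.lower() not in lowered]

# Made-up page text and term list for illustration
demo_text = "Opioid withdrawal usually peaks within 72 hours. Symptoms include chills."
demo_terms = ["opioid withdrawal", "withdrawal timeline", "chills", "medication-assisted treatment"]
gaps = missing_terms(demo_text, demo_terms)
```

Substring matching misses plurals and paraphrases, so read the output as a checklist to verify, not a verdict.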

9) Content pruning by semantic drift

Find pages that don’t belong to any strong topic (noise cluster) or are low-similarity outliers to your site’s centroid.

site_centroid = np.mean(page_vecs, axis=0).reshape(1,-1)
site_sim = cosine_similarity(page_vecs, site_centroid).ravel()
prune = pd.DataFrame({"url": page_urls, "site_similarity": site_sim}).sort_values("site_similarity")
# Candidates to prune/merge/redirect: the lowest-scoring pages, e.g. similarity below about 0.25–0.30 (tune the cutoff to your site)
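
The code above covers the centroid check. For the other signal, pages stuck in the noise cluster, one option is to measure what share of each page's chunks were labelled -1. This is a sketch with a hypothetical helper `noise_share` and made-up rows.

```python
from collections import Counter

def noise_share(urls, labels):
    """Fraction of each URL's chunks that the clusterer labelled -1 (noise)."""
    total, noisy = Counter(), Counter()
    for url, label in zip(urls, labels):
        total[url] += 1
        if label == -1:
            noisy[url] += 1
    return {url: noisy[url] / total[url] for url in total}

# Made-up chunk rows: one URL and one cluster label per chunk
demo_urls = ["/a", "/a", "/b", "/b", "/b", "/c"]
demo_labels = [0, 0, -1, -1, 1, -1]
shares = noise_share(demo_urls, demo_labels)
```

On the real data this would be `noise_share(chunks["url"], chunks["cluster"])`. Pages with a high noise share and a low `site_similarity` are the strongest candidates for review.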

10) Generate FAQs & JSON-LD schema

Use cluster keyphrases to produce FAQs, then emit schema.

import json

def faq_from_keywords(keywords, topic_name):
    qs = []
    for kw,_ in keywords[:6]:
        q = f"What should I know about {kw} in {topic_name}?"
        a = f"{topic_name.capitalize()} considerations for {kw} include symptoms, risks, and evidence-based treatment options. Consult licensed professionals."
        qs.append({"@type":"Question","name":q,"acceptedAnswer":{"@type":"Answer","text":a}})
    return qs

cluster_topics["faq_schema"] = cluster_topics.apply(
    lambda r: {
      "@context":"https://schema.org",
      "@type":"FAQPage",
      # placeholder topic name; swap in a human-readable cluster label before publishing
      "mainEntity": faq_from_keywords(r["keywords"], f"topic {r['cluster']}")
    }, axis=1
)

# attach the JSON-LD of the relevant cluster to its pillar page template:
print(json.dumps(cluster_topics.iloc[0]["faq_schema"], indent=2))
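
To get that JSON-LD onto a page, it usually goes inside a `<script type="application/ld+json">` element. Here is a minimal sketch of a wrapper; the helper name `jsonld_script_tag` and the sample schema are illustrative.

```python
import json

def jsonld_script_tag(schema):
    """Serialize a schema dict into a <script> tag ready to paste into a page template."""
    payload = json.dumps(schema, indent=2, ensure_ascii=False)
    # Escape "</" so text inside the JSON cannot close the script element early
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'

# Made-up FAQ entry in the same shape as the faq_schema dicts
demo_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "How long does opioid withdrawal last?",
        "acceptedAnswer": {"@type": "Answer", "text": "Placeholder answer to be written by a qualified author."},
    }],
}
tag = jsonld_script_tag(demo_schema)
```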

11) Title & outline suggestions programmatically

def make_title(core_kw, brand=None, max_len=60):
    base = f"{core_kw}: Complete Guide"
    if brand and len(base) + len(brand) + 3 <= max_len:
        base += f" | {brand}"
    return base[:max_len]

def suggest_outline(keywords):
    h2s = [k for k,_ in keywords[:6]]
    return ["Overview"] + [h2.title() for h2 in h2s] + ["Risks & Contraindications","Treatment Options","FAQs","Sources"]
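
One thing to watch in `make_title`: the final `base[:max_len]` slice can cut a word in half. A possible refinement, sketched below with a hypothetical helper `truncate_at_word`, backs up to the last full word instead. The demo uses a limit of 55 only to force a cut.

```python
def truncate_at_word(title, max_len=60):
    """Shorten a title to max_len characters without cutting a word in half."""
    if len(title) <= max_len:
        return title
    cut = title[:max_len + 1]   # one extra char shows whether a word ends exactly at the limit
    if " " not in cut:
        return title[:max_len]  # single long word: fall back to a hard cut
    # drop the partial last word, then any dangling separator
    return cut.rsplit(" ", 1)[0].rstrip(" :|-")

long_title = "Opioid Withdrawal Timeline and Symptoms Day by Day: Complete Guide"
short_title = truncate_at_word(long_title, max_len=55)
```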

12) Competitor/topic scouting (quick & lawful)

Export competitor URLs (from Ahrefs/SEMrush/Sheets), fetch text, embed, and project into your clusters. Gaps show up where they have strong content but you don’t (or vice versa). The same pipeline works; just label each row with its source and compare coverage.
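
"Project into your clusters" can be read as: assign each competitor vector to the most similar of your cluster centroids, then count. The sketch below does that on toy 3-d vectors. The helper `nearest_cluster` is hypothetical, and in the real pipeline each centroid would be the mean of one cluster's chunk vectors.

```python
import numpy as np

def nearest_cluster(vectors, centroids):
    """For each row of `vectors`, return the index of the most similar centroid and that similarity."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = v @ c.T  # cosine similarity, shape (n_vectors, n_centroids)
    return sims.argmax(axis=1), sims.max(axis=1)

# Toy 3-d vectors standing in for real embeddings
centroids = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # your two cluster centroids
competitor_vecs = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.9, 0.2]])

assigned, best_sim = nearest_cluster(competitor_vecs, centroids)
competitor_counts = np.bincount(assigned, minlength=len(centroids))
```

Comparing `competitor_counts` with your own per-cluster chunk counts gives a rough view of where each side is stronger. Competitor chunks with a low `best_sim` may point to topics you do not cover at all.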

13) Store vectors for fast lookups (internal search / linking assistant)

import faiss
index = faiss.IndexFlatIP(emb.shape[1])  # cosine because we normalized
index.add(emb)                           # store chunk vectors
# Given a new draft paragraph, retrieve best anchor targets:
q = model.encode(["opioid withdrawal day 3 chills and cravings"], normalize_embeddings=True)
D,I = index.search(q, 10)
chunks.iloc[I[0]][["url","chunk_id"]]
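
Because the index stores chunks, several hits can come from the same page, and the draft itself may show up if it is already indexed. One way to tidy the results, sketched with a hypothetical helper `best_hit_per_url` and made-up scores:

```python
def best_hit_per_url(hits, exclude_url=None, top_n=5):
    """Collapse chunk-level hits to one entry per URL, keeping each URL's best score."""
    best = {}
    for url, score in hits:
        if url == exclude_url:
            continue  # never suggest linking a page to itself
        if url not in best or score > best[url]:
            best[url] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Made-up (url, score) pairs, as you might build from chunks.iloc[I[0]]["url"] and D[0]
demo_hits = [("/opioid/withdrawal-timeline", 0.81), ("/opioid/withdrawal-timeline", 0.77),
             ("/drug-rehab/detox-guide", 0.64), ("/draft-page", 0.90)]
targets = best_hit_per_url(demo_hits, exclude_url="/draft-page")
```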

How this helps your Semantic SEO (in practice)

Topical map from clusters → defines your hub/pillar architecture.
Internal links from similarity scores → can strengthen topical authority and crawl efficiency.
Coverage scoring from keyphrases → turns into a prioritized content roadmap.
Pruning by drift → removes/merges thin or off-topic pages hurting quality signals.
Schema & FAQs → eligibility for richer SERP features (where search engines still show them) and better disambiguation.
Vector store → on-demand related links and anchors while you write.