Why Python NLP Matters for Modern SEO
Search engines no longer simply match keywords to pages. They parse meaning, map entities, and evaluate topical depth much the way a subject-matter expert would. Google’s own NLP models (BERT, MUM) understand context, synonyms, and relationships between concepts, which means your content strategy needs to operate at the same level.
Python gives you the tools to reverse-engineer that process. With a handful of open-source libraries, you can embed your content as semantic vectors, cluster pages into topical hubs, score coverage gaps against competitors, build internal linking maps based on real similarity scores, and automate schema markup at scale. Instead of guessing which pages support which topics, you measure it.
This guide is a working pipeline. Every step includes runnable code that takes you from raw URLs to a prioritized content roadmap, internal linking strategy, pruning candidates, and programmatic schema. It is built for SEO practitioners who are comfortable with Python and want to move beyond manual keyword spreadsheets into data-driven topical architecture.
Quick NLP Glossary for SEO
Before diving in, here are the core NLP concepts that power this pipeline and how they connect to search:
- Tokenization – Splitting text into individual words or subwords. This is how search engines break down your content for analysis.
- Stop words – Common words (the, is, and) that carry little meaning. Removing them helps isolate the terms that actually define your topic.
- Lemmatization – Reducing words to their base form (running > run, treatments > treatment). Helps engines understand that variations of a word share the same intent.
- Named Entity Recognition (NER) – Identifying people, places, organizations, and concepts in text. Directly tied to how Google builds Knowledge Graph connections.
- Embeddings – Dense numerical representations of text that capture meaning. Two sentences about “opioid withdrawal symptoms” will have similar embeddings even if they use completely different words.
- Cosine similarity – A measure of how closely two embeddings point in the same direction (1 = identical meaning, values near 0 = unrelated; negative values are possible but uncommon for sentence embeddings). This is the backbone of the clustering and linking steps below.
- TF-IDF – Term Frequency-Inverse Document Frequency. Surfaces terms that are important to a specific document relative to a larger corpus. Useful for identifying what makes a page unique.
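To make cosine similarity concrete, here is a toy sketch using simple word-count vectors (the real pipeline uses the dense embeddings from Step 4; the three example documents and the five-word vocabulary are hypothetical):

```python
import numpy as np

def cosine(a, b):
    # cosine similarity = dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy bag-of-words vocabulary: [opioid, withdrawal, symptoms, insurance, coverage]
doc_a = np.array([2, 3, 1, 0, 0])  # an "opioid withdrawal symptoms" page
doc_b = np.array([1, 2, 2, 0, 0])  # another withdrawal-focused page
doc_c = np.array([0, 0, 0, 3, 2])  # an "insurance coverage" page

print(cosine(doc_a, doc_b))  # high: same topic
print(cosine(doc_a, doc_c))  # 0.0: no shared terms at all
```

Embeddings improve on this by scoring synonyms and paraphrases as similar even when no words overlap, but the similarity math is the same.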
Python Libraries: What Does What
| Library | Used For | Pipeline Step |
|---|---|---|
| trafilatura | Extracting clean text from web pages | Crawl/Ingest (Step 2) |
| BeautifulSoup | HTML parsing and extraction | Crawl/Ingest (Step 2) |
| spaCy | NER, dependency parsing, tokenization | Entity extraction, Local SEO (Steps 10, 14) |
| sentence-transformers | Generating semantic embeddings | Embeddings (Step 4) |
| KeyBERT | Extracting representative keyphrases from clusters | Cluster naming (Step 6) |
| BERTopic | Topic modeling with transformer embeddings | Alternative to manual clustering |
| UMAP | Dimensionality reduction for clustering | Clustering (Step 5) |
| HDBSCAN | Density-based unsupervised clustering | Clustering (Step 5) |
| scikit-learn | Cosine similarity, KMeans (alternative clustering) | Linking, coverage scoring (Steps 7-8) |
| Gensim | LDA topic modeling, Word2Vec | Alternative topic analysis |
| FAISS | Fast vector similarity search | Vector storage/lookup (Step 13) |
| pandas / NumPy | Data manipulation throughout | Every step |
1) Setup
```bash
pip install spacy sentence-transformers keybert bertopic gensim scikit-learn umap-learn hdbscan \
    pandas numpy beautifulsoup4 trafilatura lxml faiss-cpu
python -m spacy download en_core_web_sm
```
2) Crawl/Ingest Pages (or Load Your Exports)
```python
import requests
import pandas as pd
import trafilatura
from bs4 import BeautifulSoup  # available for custom extraction if trafilatura comes up empty

def fetch_text(url):
    html = requests.get(url, timeout=20).text
    return trafilatura.extract(html, include_comments=False, include_tables=False)

urls = [
    "https://example.com/drug-rehab/detox-guide",
    "https://example.com/opioid/withdrawal-timeline",
    # add site URLs or import from your sitemap/export
]

docs = pd.DataFrame({"url": urls})
docs["text"] = docs["url"].apply(fetch_text)
docs = docs.dropna(subset=["text"])  # drop pages where extraction failed
```
3) Clean + Split into SEO-Sized Chunks
```python
import re

def clean(t):
    return re.sub(r"\s+", " ", t).strip()

docs["text"] = docs["text"].map(clean)

# optional: chunk into ~500-800 word sections to get granular clusters
def chunk(text, max_words=600):
    words, out = text.split(), []
    for i in range(0, len(words), max_words):
        out.append(" ".join(words[i:i + max_words]))
    return out

rows = []
for r in docs.itertuples():
    for i, c in enumerate(chunk(r.text)):
        rows.append({"url": r.url, "chunk_id": i, "chunk_text": c})
chunks = pd.DataFrame(rows)
```
4) Build Embeddings (Semantic Vectors)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # fast, solid quality for SEO tasks
emb = model.encode(chunks["chunk_text"].tolist(), normalize_embeddings=True)
```
5) Cluster to Reveal Topics and Subtopics
Use UMAP + HDBSCAN (unsupervised, finds natural topical groups) or KMeans (explicit k).
```python
import umap
import hdbscan

X_umap = umap.UMAP(n_neighbors=25, min_dist=0.0, metric="cosine").fit_transform(emb)
labels = hdbscan.HDBSCAN(min_cluster_size=8, metric="euclidean").fit_predict(X_umap)
chunks["cluster"] = labels  # -1 = noise (no strong topical group)
```
Now each cluster approximates a topical “hub” (e.g., opioid withdrawal, alcohol detox at home, treatment options).
6) Name Clusters with Representative Keyphrases (KeyBERT)
```python
from keybert import KeyBERT

kw_model = KeyBERT(model=model)  # reuse the sentence-transformers model from Step 4

def cluster_keywords(texts, topn=10):
    joined = " ".join(texts)
    return kw_model.extract_keywords(
        joined, keyphrase_ngram_range=(1, 3), stop_words="english", top_n=topn
    )

cluster_topics = (
    chunks[chunks.cluster != -1]
    .groupby("cluster")
    .agg(texts=("chunk_text", list), urls=("url", lambda x: list(set(x))))
    .reset_index()
)
cluster_topics["keywords"] = cluster_topics["texts"].apply(lambda tx: cluster_keywords(tx, topn=15))
```
Use these keyphrases for section H2s/H3s, FAQs, and internal link anchors.
Sample Cluster Output
Here is an illustrative cluster output for an addiction treatment site:
| Cluster | Top Keywords | Pages | Role |
|---|---|---|---|
| 0 | opioid withdrawal timeline, withdrawal symptoms day by day, medication-assisted detox | /opioid/withdrawal-timeline, /opioid/detox-medications, /opioid/what-to-expect | Hub: Opioid Withdrawal |
| 1 | alcohol detox at home, alcohol withdrawal seizures, medical alcohol detox | /alcohol/home-detox-risks, /alcohol/detox-timeline, /alcohol/medical-detox | Hub: Alcohol Detox |
| 2 | inpatient vs outpatient rehab, treatment program length, residential treatment | /rehab/inpatient-vs-outpatient, /rehab/30-60-90-day, /rehab/what-to-expect | Hub: Treatment Options |
| 3 | insurance coverage rehab, verify insurance, cost of rehab | /insurance/verify, /insurance/coverage-guide, /cost/rehab-costs | Hub: Insurance/Cost |
| -1 (noise) | mixed/unrelated terms | /about-us, /blog/staff-spotlight, /careers | Prune/redirect candidates |
The noise cluster (-1) immediately surfaces pages with weak topical alignment. These are your first candidates for pruning, consolidation, or redirecting.
7) Build an Internal Linking Map (Semantic)
Link pages whose vectors are similar (cosine similarity above roughly 0.35-0.45), prioritizing different URLs inside the same cluster.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# page-level vectors = mean of their chunk vectors
page_urls = sorted(chunks["url"].unique())
page_vecs = np.vstack([
    emb[chunks.index[chunks["url"] == u]].mean(axis=0) for u in page_urls
])

S = cosine_similarity(page_vecs)

links = []
for i, u in enumerate(page_urls):
    # pick top semantically related pages (excluding self)
    idx = np.argsort(-S[i])[:8]
    for j in idx:
        if i == j or S[i, j] < 0.40:
            continue
        links.append({"from": u, "to": page_urls[j], "sim": float(S[i, j])})

internal_links = pd.DataFrame(links).sort_values(["from", "sim"], ascending=[True, False])
```
How to use:
- Add 2-4 contextual links per page: treat high-similarity targets (0.5+) as pillar pages and 0.4-0.5 as siblings.
- Use anchor text from the cluster’s keywords (natural, varied).
8) Content Gap and Coverage (vs. Your Topical Map)
For each cluster, define must-have subheadings from keywords. Score each page for presence/absence of those terms to identify missing sections.
```python
needed = {cid: [k for k, _ in kws] for cid, kws in zip(cluster_topics["cluster"], cluster_topics["keywords"])}

def coverage_score(text, terms):
    t = text.lower()
    return sum(1 for term in terms if term.lower() in t) / max(1, len(terms))

page_text = docs.set_index("url")["text"].to_dict()

coverage = []
for cid, terms in needed.items():
    urls = set(sum(cluster_topics.loc[cluster_topics.cluster == cid, "urls"].tolist(), []))
    for u in urls:
        coverage.append({"cluster": cid, "url": u, "coverage": coverage_score(page_text.get(u, ""), terms)})
coverage = pd.DataFrame(coverage).sort_values("coverage")
```
Pages with low coverage are your first targets for expansion.
9) Content Pruning by Semantic Drift
Find pages that don’t belong to any strong topic (noise cluster) or are low-similarity outliers to your site’s centroid.
```python
site_centroid = np.mean(page_vecs, axis=0).reshape(1, -1)
site_sim = cosine_similarity(page_vecs, site_centroid).ravel()
prune = pd.DataFrame({"url": page_urls, "site_similarity": site_sim}).sort_values("site_similarity")
# candidates to prune/merge/redirect: bottom quantile (e.g., < 0.25-0.30 depending on site size)
```
10) Generate FAQs and JSON-LD Schema
Use cluster keyphrases to produce FAQs, then emit schema.
```python
import json

def faq_from_keywords(keywords, topic_name):
    qs = []
    for kw, _ in keywords[:6]:
        q = f"What should I know about {kw} in {topic_name}?"
        # templated answers are placeholders: review and rewrite before publishing
        a = (
            f"{topic_name.capitalize()} considerations for {kw} include symptoms, risks, "
            "and evidence-based treatment options. Consult licensed professionals."
        )
        qs.append({"@type": "Question", "name": q, "acceptedAnswer": {"@type": "Answer", "text": a}})
    return qs

cluster_topics["faq_schema"] = cluster_topics.apply(
    lambda r: {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": faq_from_keywords(r["keywords"], f"cluster {r['cluster']}"),
    },
    axis=1,
)

# attach the JSON-LD of the relevant cluster to its pillar page template:
print(json.dumps(cluster_topics.iloc[0]["faq_schema"], indent=2))
```
11) Title and Outline Suggestions Programmatically
```python
def make_title(core_kw, brand=None, max_len=60):
    base = f"{core_kw}: Complete Guide"
    if brand and len(base) + len(brand) + 3 <= max_len:
        base += f" | {brand}"
    return base[:max_len]

def suggest_outline(keywords):
    h2s = [k for k, _ in keywords[:6]]
    return ["Overview"] + [h2.title() for h2 in h2s] + ["Risks & Contraindications", "Treatment Options", "FAQs", "Sources"]
```
12) Competitor/Topic Scouting (Quick and Lawful)
Export competitor URLs (from Ahrefs/SEMrush/Sheets), fetch text, embed, and project into your clusters. Gaps show up where they have strong content and you don’t (or vice versa). The same pipeline works; just label sources and compare coverage.
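A minimal sketch of that projection: assign each competitor page to the nearest of your cluster centroids and record how strongly it fits. The toy random vectors below stand in for real embeddings from Step 4’s `model.encode(...)`, and the 0.2 threshold is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(v):
    # unit-length rows so dot products equal cosine similarity
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# centroids of your own clusters (in practice: mean of chunk vectors per cluster)
your_centroids = normalize(rng.normal(size=(4, 384)))

# competitor page vectors (in practice: embeddings of fetched competitor text)
comp_vecs = normalize(rng.normal(size=(10, 384)))

sims = comp_vecs @ your_centroids.T          # cosine similarity matrix
assigned_cluster = sims.argmax(axis=1)       # nearest of your hubs per competitor page
assigned_strength = sims.max(axis=1)         # how strongly each page fits that hub

# a weak best-fit = a competitor topic your site barely covers
gap_candidates = np.where(assigned_strength < 0.2)[0]
```

Pages landing in `gap_candidates` point at topics the competitor covers that map poorly onto any of your hubs; the reverse comparison (your pages against their centroids) shows where you lead.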
13) Store Vectors for Fast Lookups (Internal Search / Linking Assistant)
```python
import faiss
import numpy as np

index = faiss.IndexFlatIP(emb.shape[1])      # inner product = cosine, because vectors are normalized
index.add(np.asarray(emb, dtype="float32"))  # FAISS expects float32 chunk vectors

# given a new draft paragraph, retrieve the best anchor targets:
q = model.encode(["opioid withdrawal day 3 chills and cravings"], normalize_embeddings=True)
D, I = index.search(np.asarray(q, dtype="float32"), 10)
print(chunks.iloc[I[0]][["url", "chunk_id"]])
```
14) Applying This Pipeline to Local and Multi-Location SEO
The same embedding and clustering pipeline adapts to local search with a few targeted additions.
Use spaCy’s NER to pull location mentions from your pages and competitor content. This reveals which geographic terms you are covering (and missing).
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_locations(text):
    doc = nlp(text)
    return list(set(ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC", "FAC")))

docs["locations"] = docs["text"].apply(extract_locations)
# compare location entity coverage across your pages vs. competitors
```
For multi-location businesses (e.g., treatment centers in multiple cities), check that every target market has dedicated pages with strong location entity presence.
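A minimal sketch of that check, assuming a hypothetical list of target markets and per-page location sets of the kind the NER step above produces (the toy data here is illustrative):

```python
# hypothetical target markets and per-page extracted location entities
target_markets = {"Phoenix", "Scottsdale", "Tucson", "Mesa"}

page_locations = {
    "/locations/phoenix": {"Phoenix", "Arizona"},
    "/locations/scottsdale": {"Scottsdale"},
    "/blog/detox-guide": set(),
}

# markets mentioned somewhere on the site vs. markets with no entity presence at all
covered = set().union(*page_locations.values()) & target_markets
missing_markets = target_markets - covered
print(sorted(missing_markets))
```

Markets in `missing_markets` have no location entity presence anywhere on the site, so they are the first candidates for dedicated landing pages.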
Automate LocalBusiness Schema
Generate JSON-LD for each location programmatically from a structured data source (spreadsheet, CRM export, Google Business Profile data).
```python
def local_schema(name, address, city, state, zipcode, phone, lat, lng):
    return {
        "@context": "https://schema.org",
        "@type": "MedicalBusiness",  # a LocalBusiness subtype; swap in the subtype that fits your vertical
        "name": name,
        "address": {
            "@type": "PostalAddress",
            "streetAddress": address,
            "addressLocality": city,
            "addressRegion": state,
            "postalCode": zipcode,
        },
        "telephone": phone,
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": lat,
            "longitude": lng,
        },
    }

# loop over your locations spreadsheet and generate schema for each
```
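That loop can be sketched as follows. The spreadsheet columns are assumptions (inlined here as a DataFrame so the sketch runs standalone), and the helper is a compact repeat of `local_schema` above:

```python
import json
import pandas as pd

def local_schema(name, address, city, state, zipcode, phone, lat, lng):
    # minimal mirror of the helper defined above
    return {
        "@context": "https://schema.org",
        "@type": "MedicalBusiness",
        "name": name,
        "address": {"@type": "PostalAddress", "streetAddress": address,
                    "addressLocality": city, "addressRegion": state, "postalCode": zipcode},
        "telephone": phone,
        "geo": {"@type": "GeoCoordinates", "latitude": lat, "longitude": lng},
    }

# hypothetical export: one row per location (column names are assumptions)
locations = pd.DataFrame([
    {"name": "Recovery Center Phoenix", "address": "100 Main St", "city": "Phoenix",
     "state": "AZ", "zipcode": "85001", "phone": "+1-602-555-0100", "lat": 33.45, "lng": -112.07},
    {"name": "Recovery Center Tucson", "address": "200 Oak Ave", "city": "Tucson",
     "state": "AZ", "zipcode": "85701", "phone": "+1-520-555-0100", "lat": 32.22, "lng": -110.97},
])

schemas = [local_schema(**row) for row in locations.to_dict("records")]
jsonld_blocks = [json.dumps(s) for s in schemas]  # one <script type="application/ld+json"> per page
```

Each serialized block drops into the matching location page template as its JSON-LD script tag.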
NAP Consistency Checking
Cross-reference your business listings data (exported from BrightLocal, Yext, or manual scraping) against your canonical NAP to flag inconsistencies:
```python
from difflib import SequenceMatcher

def nap_match_score(canonical, listing):
    scores = []
    for field in ["name", "address", "phone"]:
        scores.append(SequenceMatcher(None, canonical[field].lower(), listing[field].lower()).ratio())
    return sum(scores) / len(scores)

# flag listings where match score < 0.90 for manual review
```
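A sketch of that flagging pass with hypothetical listing data (the helper is repeated so the example runs standalone; the directory names and NAP values are made up):

```python
from difflib import SequenceMatcher

def nap_match_score(canonical, listing):
    # average fuzzy-match ratio across the three NAP fields
    scores = []
    for field in ["name", "address", "phone"]:
        scores.append(SequenceMatcher(None, canonical[field].lower(), listing[field].lower()).ratio())
    return sum(scores) / len(scores)

canonical = {"name": "Desert Recovery Center",
             "address": "100 Main St, Phoenix, AZ 85001",
             "phone": "(602) 555-0100"}

listings = [
    {"source": "directory-a", "name": "Desert Recovery Center",
     "address": "100 Main St, Phoenix, AZ 85001", "phone": "(602) 555-0100"},
    {"source": "directory-b", "name": "Desert Recovery Ctr",         # abbreviated name
     "address": "100 Main Street, Phoenix AZ", "phone": "602-555-0199"},  # stale phone
]

flagged = [l["source"] for l in listings if nap_match_score(canonical, l) < 0.90]
print(flagged)
```

Normalizing phone numbers and street abbreviations before scoring will cut down on false positives; the 0.90 cutoff is a starting point to tune, not a fixed rule.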
Location-Aware Content Gaps
Embed your city/region landing pages alongside their geo-specific competitor pages. Run the same coverage scoring (Step 8) but filtered by location cluster. This surfaces which cities have thin content relative to competitors.
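A sketch of that per-city comparison, reusing the Step 8 scoring idea with a hypothetical must-have term list and toy page text:

```python
def coverage_score(text, terms):
    # fraction of must-have terms present (same scoring as Step 8)
    t = text.lower()
    return sum(1 for term in terms if term.lower() in t) / max(1, len(terms))

# hypothetical must-have terms for a city landing page in this vertical
city_terms = ["detox", "insurance", "inpatient", "outpatient", "medication-assisted treatment"]

city_pages = {
    "Phoenix": "Our Phoenix center offers detox, inpatient and outpatient care, "
               "medication-assisted treatment, and insurance verification.",
    "Tucson": "Contact our Tucson office for more information.",
}

city_coverage = {city: coverage_score(text, city_terms) for city, text in city_pages.items()}
thin_cities = [c for c, s in sorted(city_coverage.items(), key=lambda kv: kv[1]) if s < 0.5]
print(city_coverage, thin_cities)
```

In practice you would run the same scoring against the geo-specific competitor pages for each city and compare the two numbers, rather than judging your pages in isolation.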
How This Pipeline Helps Your Semantic SEO (in Practice)
| Output | SEO Impact |
|---|---|
| Topical map from clusters | Defines your hub/pillar architecture |
| Internal links from similarity scores | Raises topical authority and crawl efficiency |
| Coverage scoring from keyphrases | Turns into a prioritized content roadmap |
| Pruning by drift | Removes/merges thin or off-topic pages hurting quality signals |
| Schema and FAQs | Richer SERP features and better disambiguation |
| Vector store | On-demand related links and anchors while you write |
| Location entity extraction | Ensures geographic coverage across target markets |
| NAP consistency checks | Protects local pack rankings from citation drift |
From Pipeline to Practice: Putting This to Work
The technical steps above are a means to an end. The end is better rankings, more efficient content production, and a site architecture that search engines can parse as authoritative on your topics.
Start with Steps 1-6 to build your topical map. This alone will reshape how you plan content. Clusters that are thin or missing entirely become your editorial calendar. Clusters where you already have strong coverage become candidates for internal linking optimization (Step 7) rather than new content.
Layer in coverage scoring (Step 8) and pruning (Step 9) once your map is built. These two steps are where the pipeline pays for itself: instead of producing more content, you are making existing content work harder and cutting pages that dilute your topical authority.
Schema, FAQ generation, and vector-assisted writing (Steps 10-13) are ongoing. Integrate them into your publishing workflow so every new page launches with structured data and contextual internal links already in place.
For local and multi-location sites (Step 14), run location entity extraction first to establish a baseline, then use the gap analysis to prioritize which markets need dedicated content.
The goal is not to automate your way out of editorial judgment. It is to replace guesswork with measurement so that every content decision, whether to write, rewrite, merge, prune, or link, is backed by data.