Keyword-level attribution for organic search: how probabilistic models work
Here's a fun fact that most SEO teams have quietly given up on: you cannot see, in your CRM, which keyword brought you a customer.
You can see that they came from "google.com / organic." You can see the landing page. You can see they eventually closed. But the actual keyword they typed? That data is gone. Google began stripping it from referrer headers in 2011 under the banner of privacy, and by 2013 it was gone for virtually all searches. It's been thirteen years and marketing teams still build reports without it.
If organic search is your top channel — or one of your top three — this is an expensive gap. You spend six figures a year on content and SEO, and the ROI attribution stops at "organic did well this quarter." The CFO wants to know which topics drove revenue, and the honest answer is: we have no idea.
This post is about the workaround: probabilistic keyword attribution. What it is, how the math works, where it fails, and why, unless you somehow have access to Google's raw query data, it's the only realistic path to recovering that data.
Why deterministic tracking doesn't exist anymore
Before 2011, analytics tools received the referring search query in the URL. Your stats dashboard knew a visitor had searched for "multi-touch attribution software" and clicked through to your page. You could tie that keyword to a pageview, then a lead, then eventually to revenue.
When Google encrypted search, the referrer stopped carrying the query. Now you get "not provided" for essentially all organic search traffic in Google Analytics. Bing and other engines do the same for most of their traffic.
What you still have:
- Google Search Console: aggregate data. "This keyword got 847 clicks last month." No user-level data.
- Google Analytics / GA4: pageview data by landing page and source, but not by keyword.
- CRM: lead source, first-touch page, UTM data. Not keyword.
The gap is the join between Search Console's aggregate clicks and Analytics' pageviews. That's what probabilistic models fill.
How probabilistic keyword attribution works
The basic idea: for each lead that came from organic search, we estimate the probability that they clicked from each of the keywords Search Console tells us are driving traffic to the landing page they hit.
Step 1: collect the signals
For each lead with an organic search first touch, gather:
- Landing page they arrived on
- Timestamp of the visit (hour-level granularity)
- Search Console keyword clicks for that landing page in a window around that timestamp (typically 48 hours, sometimes a week for low-volume pages)
- GA4 hourly keyword click data if you have a connected Search Console property (this is the underrated signal)
- Geographic signals if available
- Device category and other minor signals
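As a rough illustration of the signal-collection step, here is a minimal Python sketch that keeps only the keyword clicks for the lead's landing page inside a window around the visit. The record types, field names, and sample rows are hypothetical, and treating "48 hours" as 24 hours on either side of the visit is an assumption, not something the data sources dictate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Lead:                 # hypothetical CRM record
    lead_id: str
    landing_page: str
    arrived_at: datetime    # hour-level precision is enough

@dataclass(frozen=True)
class KeywordClicks:        # hypothetical keyword/page/hour click row
    keyword: str
    page: str
    hour: datetime
    clicks: int

def candidates_in_window(lead, rows, window_hours=48):
    """Sum clicks per keyword for the lead's page within the window (assumed centered)."""
    half = timedelta(hours=window_hours / 2)
    totals = {}
    for row in rows:
        if row.page != lead.landing_page:
            continue  # different landing page, not a candidate
        if abs(row.hour - lead.arrived_at) > half:
            continue  # too far from the visit in time
        totals[row.keyword] = totals.get(row.keyword, 0) + row.clicks
    return totals

# Made-up sample data for illustration only.
lead = Lead("4729", "/revops", datetime(2024, 3, 5, 14))
rows = [
    KeywordClicks("revops software", "/revops", datetime(2024, 3, 5, 13), 12),
    KeywordClicks("revops software", "/revops", datetime(2024, 3, 5, 15), 8),
    KeywordClicks("revenue attribution tool", "/revops", datetime(2024, 3, 4, 20), 3),
    KeywordClicks("revops software", "/pricing", datetime(2024, 3, 5, 14), 5),
    KeywordClicks("revops software", "/revops", datetime(2024, 3, 9, 14), 40),
]
window_totals = candidates_in_window(lead, rows)
print(window_totals)  # {'revops software': 20, 'revenue attribution tool': 3}
```

For low-volume pages you would widen the window, for example `window_hours=168` for a week.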
Step 2: score candidate keywords
For each candidate keyword that sent clicks to the landing page in the relevant window, compute a match probability. The inputs:
- Click volume for that keyword → page in the window (higher = more likely)
- Time alignment between the keyword's click volume spike and the lead's arrival time (closer = more likely)
- Relative ranking: if keyword A sent 80% of clicks to that page and keyword B sent 5%, A is 16× more likely absent other signals
- Geographic match where both are known
- Device match
- Intent plausibility: does the keyword match the page's content? (Usually yes if it's sending clicks, but worth checking.)
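One way the scoring inputs above could be combined is sketched below. The exponential time decay, the 12-hour decay constant, and the 0.5 mismatch discount are all illustrative assumptions; a real model would tune or learn them, and intent plausibility is left out entirely.

```python
import math

def raw_score(clicks, hours_from_arrival, geo_match=None, device_match=None,
              decay_hours=12.0, mismatch_factor=0.5):
    """Unnormalized plausibility that one keyword produced one lead's click.

    All weights here are assumptions chosen for illustration.
    """
    # Clicks closer in time to the lead's arrival count for more.
    time_weight = math.exp(-abs(hours_from_arrival) / decay_hours)
    score = clicks * time_weight
    # None means "unknown" and leaves the score alone;
    # a known mismatch discounts it.
    for match in (geo_match, device_match):
        if match is False:
            score *= mismatch_factor
    return score

# With no other signals, an 80%-share keyword is 16x as likely as a 5%-share one.
ratio = raw_score(80, 0) / raw_score(5, 0)
print(ratio)  # 16.0

# Same click volume, different time alignment: the closer spike scores higher.
near = raw_score(20, hours_from_arrival=2)
far = raw_score(20, hours_from_arrival=20)
print(near > far)  # True
```

These scores are relative, not probabilities. They only become confidence numbers once they are normalized across all candidates for the same lead.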
Step 3: apply a position penalty
Keywords with better (numerically lower) average SERP positions during the window are more likely to have produced the click, because top positions get a disproportionate share of clicks. Candidates that ranked deep in the results are penalized accordingly: a keyword ranking #1 during the window with 20 clicks is a much stronger candidate than a keyword ranking #27 with 3 clicks.
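A minimal sketch of a position penalty follows. The power-law weight is a stand-in, not a measured click-through curve, and the `steepness` parameter is hypothetical. Because observed clicks already reflect position to some degree, you would likely keep the penalty gentle to avoid counting the same effect twice.

```python
def position_weight(avg_position, steepness=1.0):
    """Illustrative weight: 1.0 at position 1, shrinking as position gets worse."""
    if avg_position < 1:
        raise ValueError("SERP positions start at 1")
    return avg_position ** -steepness

def position_adjusted(clicks, avg_position, steepness=1.0):
    """Click volume discounted by the (assumed) position weight."""
    return clicks * position_weight(avg_position, steepness)

strong = position_adjusted(20, 1.0)   # ranking #1 with 20 clicks
weak = position_adjusted(3, 27.0)     # ranking #27 with 3 clicks
print(strong, round(weak, 2))  # 20.0 0.11
```

Setting `steepness=0` turns the penalty off, which is a useful baseline when checking whether it improves matches at all.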
Step 4: emit a probability, not an assignment
The output is not "this lead came from keyword X." The output is:
"Lead #4729 most likely came from 'revops software' (confidence 87%), possibly from 'revenue attribution tool' (confidence 7%), possibly from 4 other keywords (combined confidence 6%)."
That confidence score is the key artifact. It tells you how much to trust the attribution.
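To make the output format concrete, here is a small sketch that normalizes per-keyword scores into probabilities and prints a summary in the style of the example above. The input scores and the generic keyword names are made up so that the numbers land on 87%, 7%, and 6%.

```python
def to_probabilities(scores):
    """Normalize nonnegative scores so they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        return {}
    return {kw: s / total for kw, s in scores.items()}

def summarize(lead_id, probabilities, top_n=2):
    """Name the top candidates and lump the rest into one combined figure."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    head, tail = ranked[:top_n], ranked[top_n:]
    parts = [f"'{kw}' ({p:.0%})" for kw, p in head]
    if tail:
        combined = sum(p for _, p in tail)
        parts.append(f"{len(tail)} other keywords ({combined:.0%} combined)")
    return f"Lead #{lead_id}: " + ", ".join(parts)

# Hypothetical scores for one lead.
scores = {
    "revops software": 87.0,
    "revenue attribution tool": 7.0,
    "keyword c": 2.0,
    "keyword d": 2.0,
    "keyword e": 1.0,
    "keyword f": 1.0,
}
probs = to_probabilities(scores)
summary = summarize("4729", probs)
print(summary)
```

Keeping the full distribution, rather than only the winner, is what lets downstream reports weight each lead by how sure the match is.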
What the model gets wrong
Probabilistic attribution fails gracefully, which is a polite way of saying it fails often but in ways you can catch.
Ties: Two keywords send nearly identical click volumes to the same page at the same time, and the model can't tell them apart. You get two 45% confidence matches. In practice, these ties are most common on competitive commercial keywords, which is exactly where attribution would be most useful and where you most want to understand ROI.
Low-volume pages: A landing page that gets 3 organic clicks a week has almost no signal to work with. The model will default to the highest-ranking keyword and shrug. Treat low-confidence matches (under 60%) as unknown, not as bad matches.
Branded vs non-branded: Branded search traffic ("elir", "elir attribution") is trivially attributable because nothing else competes. Non-branded is where the value lives and where the model works hardest.
Geo anomalies: A lead from Paris who should have hit French keywords but landed from an English-language query. These are usually real but look like anomalies, so don't over-filter.
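The "treat low confidence as unknown" rule from the ties and low-volume cases can be sketched as a simple threshold. The 0.60 cutoff comes from the text above; the function name and return shape are hypothetical.

```python
def resolve(probabilities, min_confidence=0.60):
    """Return (keyword, confidence), or (None, confidence) when too uncertain."""
    if not probabilities:
        return None, 0.0
    keyword, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence < min_confidence:
        return None, confidence  # unknown, not a bad match
    return keyword, confidence

clear = resolve({"revops software": 0.87, "revenue attribution tool": 0.07, "other": 0.06})
tie = resolve({"keyword a": 0.45, "keyword b": 0.45, "other": 0.10})
print(clear)  # ('revops software', 0.87)
print(tie)    # (None, 0.45)
```

Returning the confidence even for unknowns makes it easy to report how much pipeline sits below the threshold.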
Why you should bother
If organic search is 20%+ of your pipeline, keyword attribution changes the conversations you can have:
- Content ROI: which blog posts drove revenue, not just traffic. "This post on attribution models drove $420K of attributed pipeline" is a sentence that gets you a content budget.
- SEO prioritization: which keywords are worth fighting for in the SERPs. You can invest more aggressively in the ones that convert, not just the ones that get clicks.
- Paid search comparison: you can compare CAC on the same keyword across paid and organic. Sometimes the organic CAC is 1/10th the paid CAC, and you should be investing more in content than ads. Other times, paid is actually cheaper because competition for the keyword has pushed organic position too low.
This ties directly into our thinking on CAC by channel. Keyword-level attribution is what makes the organic-vs-paid comparison honest.
How this plays with multi-touch models
Keyword attribution operates at the touchpoint level — it enriches what a first touch actually was. It doesn't replace a multi-touch model. You still need a multi-touch attribution model to distribute credit across all touches in the journey.
What keyword attribution does is make the first touch a real signal instead of a generic "organic" bucket. In a U-shaped model where the first touch gets 40% credit, going from "organic (unknown keyword)" to "organic ('revops software', 87% confidence)" is a meaningful upgrade. It then lets you roll those 40%-weighted credits up to specific keywords, and finally to specific content.
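As a sketch of that roll-up, the snippet below splits each lead's first-touch credit across keywords in proportion to the match probabilities. The lead records and pipeline values are invented for illustration, and spreading credit fractionally (rather than giving it all to the top keyword) is one possible design choice, not the only one.

```python
from collections import defaultdict

def roll_up(leads, first_touch_weight=0.40):
    """Distribute each lead's first-touch credit across its candidate keywords."""
    credit = defaultdict(float)
    for lead in leads:
        first_touch_value = lead["pipeline"] * first_touch_weight
        for keyword, p in lead["keyword_probs"].items():
            credit[keyword] += first_touch_value * p
    return dict(credit)

# Hypothetical leads with pipeline value and keyword match probabilities.
leads = [
    {"pipeline": 10_000,
     "keyword_probs": {"revops software": 0.87,
                       "revenue attribution tool": 0.07,
                       "other": 0.06}},
    {"pipeline": 20_000,
     "keyword_probs": {"revenue attribution tool": 0.90, "other": 0.10}},
]
credit = roll_up(leads)
print({kw: round(value) for kw, value in credit.items()})
# {'revops software': 3480, 'revenue attribution tool': 7480, 'other': 1040}
```

The per-keyword totals add back up to 40% of total pipeline, so the keyword view stays consistent with the multi-touch model it sits on top of.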
Where to get the data
At minimum you need:
- Google Search Console connected to your domain, ideally 12+ months of history
- GA4 with GSC property linked (enables hourly keyword click breakdowns)
- CRM with timestamp-level first-touch data (hour-precision, not day)
- A way to join all three at the lead level
The join is where commercial attribution tools differ from DIY. Building this in a warehouse with dbt is a four-week project if you have strong data engineering. Buying it is a few hours to connect integrations.
Elir is built around this exact join — GSC + GA4 hourly + CRM touchpoints with probabilistic matching and a confidence score per lead. If you're tired of seeing "organic — $420K pipeline, keywords unknown" in your reports, book a 20-minute walkthrough and we'll run it on your data.
TL;DR
Google killed keyword-level referrer data in 2011. Probabilistic attribution recovers it by joining Search Console aggregate data with CRM touchpoint data, scoring candidate keywords by click volume, time alignment, position, and geography. The output is a confidence-weighted match, not a deterministic assignment — because honest attribution admits what it doesn't know. If organic is a top channel for you, this is not optional.