Google’s Search Algorithm: Hand-Tuned Signals with ML on Top
Newly released documents from the DOJ’s antitrust case reveal that Google’s core ranking system still relies on numerous hand-crafted signals, not just black-box AI. Google’s own court-filed slide deck just spilled more tea on Navboost, and the details punch a few holes in years of SEO guesswork.
The newly released trove goes further, describing how more than 100 raw signals feed into a ranking recipe that still leans on moldy old PageRank while layering in LLM toys like RankEmbed, re-ranking Twiddlers, and the mysterious quality metric Q* (pronounced “Q star”).
Slides tease a possible Chrome spin-off, show Google force-feeding Gemini inside the browser, and confirm daily query volume north of 8 billion. All this while Sergey Brin pops back into internal meetings.
Importantly, the new docs – many of which are less than a year old – show that Google hasn’t handed the keys entirely over to deep learning yet. Traditional ranking factors (keywords, links, etc.) are still combined in a linear weighted formula. One document notes that top-level ranking signals are essentially a linear combination of individual factor scores (often using log-scale values to prevent any one factor from dominating too much). Each signal’s impact remains monotonic (meaning more of a positive signal should never hurt your ranking). In recent years Google has added machine-learned signals (e.g. the BERT-based “DeepRank” model), but even those were found to decompose into familiar relevance signals, and combining ML outputs with established signals yielded the best results.
In short, Google still trusts its classic ranking ingredients, even as it integrates AI.
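To make that “linear combination of log-scaled signals” idea concrete, here is a minimal Python sketch. The signal names, weights, and log1p scaling are our own illustrative guesses – the docs describe the shape of the formula, not its contents.

```python
import math

# Illustrative weights for a handful of hypothetical raw signals.
# The real signal names, weights, and scaling are not public.
WEIGHTS = {
    "pagerank": 1.0,
    "anchor_match": 0.8,
    "body_match": 0.6,
    "click_signal": 1.2,
}

def top_level_score(raw_signals):
    """Combine raw signal values into one top-level score.

    Each value is log-scaled (log1p) so no single factor can dominate,
    and because every weight is positive the combination is monotonic:
    more of a positive signal can never lower the final score.
    """
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = max(raw_signals.get(name, 0.0), 0.0)
        score += weight * math.log1p(value)
    return score

# A page with strong click data but modest link signals.
print(top_level_score({"pagerank": 3.0, "anchor_match": 5.0,
                       "body_match": 12.0, "click_signal": 40.0}))
```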
Although actionable info is in short supply, tried-and-true SEOs will find these docs real page-turners. So let’s dive in, one at a time:
Google’s Search Stack
…or Anatomy of an Indexed Page
- PXRD010 This one goes deep into the divestiture of Chrome, and it also includes this gem of a search flowchart slide for SEOs:
This slide outlines Google’s Combined Search Infrastructure. It is a very high-level view of how search data flows from the web to the SERPs – heavily redacted of the good parts – but even so, it is one of the rare flowcharts we have ever seen that traces a page’s path from GoogleBot spidering all the way to query serving.
The process starts at the bottom with Acquiring Data, pulling in both structured feeds (edit: Google on source of structured feeds) from third parties and unstructured web data (GoogleBot) into a system called Multiverse. This component is responsible for interpreting and organizing the wide array of content types. The data is then scrubbed and transformed in the next Preparing Data for Indexing layer, where it’s cleaned, normalized (junk taken out), and prepared for the main index. (more on structured feeds in this ChatGPT chat from @WasBruba on twitter)
In the Indexing Data stage, structured information is added to systems like the Knowledge Graph, which powers rich results and semantic features (accordion stuff). From there, components further up the stack, like the Query Understanding Service (QUS) (which we know nothing about) and Superroot, interpret user queries (this is the algo) and match them with relevant indexed content. The GWS (Google Web Server) handles the actual serving of results, incorporating personalization and user context.
Finally, the Logging Stack records interactions for evaluation and tuning.
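Read as a data pipeline, the slide boils down to something like the skeleton below. The function names mirror the slide’s labels; the bodies are placeholders of our own, since only the high-level flow is shown.

```python
# Skeletal rendering of the flow on the slide: data moves up from
# acquisition through indexing, and queries come in at the top.
# Function names mirror the slide's labels; bodies are placeholders.

def acquire_data():
    """Pull structured third-party feeds and unstructured web data (GoogleBot) into Multiverse."""
    return {"feeds": [], "crawled_pages": []}

def prepare_for_indexing(raw):
    """Clean and normalize the acquired content for the main index."""
    return raw

def index_data(prepared):
    """Build the main index plus structured systems like the Knowledge Graph."""
    return {"main_index": prepared, "knowledge_graph": {}}

def log_interaction(query, results):
    """Logging Stack: record the interaction for evaluation and tuning."""
    pass

def serve_query(query, indexed):
    """QUS + Superroot interpret the query and match it against the index;
    GWS renders the SERP, layering in personalization and user context."""
    results = []  # ranked blue links would go here
    log_interaction(query, results)
    return results

serve_query("example query", index_data(prepare_for_indexing(acquire_data())))
```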
The ABCs of Ranking
- Anchor links (what the web says about the page)
- Body of document (what the document says about itself)
- Clicks (dwell time – what users say about the document)
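As a toy rendering of that ABC breakdown (echoing the weighted-sum sketch above), assume each family boils down to a single number; the weights here are invented purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class ABCSignals:
    anchors: float  # A: what the web says about the page (inbound anchor text)
    body: float     # B: what the document says about itself (on-page relevance)
    clicks: float   # C: what users say about the document (clicks / dwell time)

    def relevance(self, wa=1.0, wb=1.0, wc=1.2):
        """Hypothetical monotonic blend of the three families."""
        return wa * self.anchors + wb * self.body + wc * self.clicks

print(ABCSignals(anchors=0.4, body=0.7, clicks=0.9).relevance())
```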
The rest of the slide is an in-your-face admission that the algo uses “Hand Crafted Signals”. Google goes on in other docs to contrast itself with Microsoft: Microsoft uses ML and has no idea when something goes wrong because the problem is buried deep in the machine learning, while Google, with its hand-crafted code, knows right where an issue lies because its engineers wrote it and have eyes on it.
There is also the big tidbit mentioning “…label data from human raters”. This is on-the-record confirmation that Google uses human raters to grade webpages.
- PXR0356 Interview about NavBoost and the hand-crafting of signals with Hyung-Jin Kim (the father of NavBoost). (Patents: 1, 2, 3)
- PXR0357 Interview with Pandu Nayak
It includes some interesting stuff about scoring docs that we’ve not seen before (including Q*) and the ongoing admission that they still use PageRank (the whole doc is worth a read for SEOs):
Signal: Google uses many signals to come up with the SERP. Top-level signals are aggregates of other signals, called raw signals. Google uses over 100 raw signals. Some signals are developed using machine-learning models while others are considered traditional signals. Examples of signals discussed on the call include:
- Q* (pronounced “Q star”), Google’s measure of quality of a document
- Navboost, a traditional signal measuring how frequently users (subset by location and device type) click on a particular document for a particular query; uses the most recent 13 months of data
- RankEmbed, one of Google’s primary LLM-trained signals
- Twiddlers, which re-rank a set of already selected results (see the sketch after this list)
- PageRank, one of Google’s original signals that still feeds into page-quality signals
- [redacted] combines signals into a single score, which then determines document rank in the list of blue links on the SERP. [redacted] was an early Google use of a machine-learning model in its search algorithm. [redacted] was taught using a process of [redacted]
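The Twiddler bullet is the most mechanical of the bunch, so here is a minimal sketch of the idea: each twiddler takes an already-ranked candidate list and adjusts it before the final sort. The freshness rule and every name below are invented; real Twiddlers and their triggers are not described in these docs.

```python
# Each hypothetical twiddler takes an already-ranked list of
# (doc_id, score) pairs and returns an adjusted list; the freshness
# rule below is invented purely to show the shape of the mechanism.

def freshness_twiddler(results, doc_age_days):
    """Nudge up documents published within the last week."""
    return [(doc, score * (1.15 if doc_age_days.get(doc, 999) < 7 else 1.0))
            for doc, score in results]

def apply_twiddlers(results, twiddlers):
    """Run each twiddler in sequence over the candidate set, then re-sort."""
    for twiddler in twiddlers:
        results = twiddler(results)
    return sorted(results, key=lambda pair: pair[1], reverse=True)

candidates = [("doc_a", 2.1), ("doc_b", 1.9), ("doc_c", 1.7)]
ages = {"doc_b": 3}  # doc_b is three days old
print(apply_twiddlers(candidates, [lambda r: freshness_twiddler(r, ages)]))
# doc_b (1.9 * 1.15 = 2.185) now leapfrogs doc_a (2.1)
```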
Review of Debugging Interface
Pandu typed “james allan umass” into a Google search window and then opened the internal debugger window showing:
- [redacted] shows query expansion and decomposition process
- Example: “umass” rewritten as “University of Massachusetts,” “James” recognized as a first name, “Allan” might also be spelled “Allen,” etc.
- [redacted] containing a table with a list of 10 blue links and the score for each top-level signal, plus the total “Final IR” score across all
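The expansion/decomposition step can be illustrated with a toy lookup table built only from the rewrites mentioned in the deposition; the data structure itself is our assumption, not Google’s.

```python
# Toy illustration of query expansion/decomposition, built only from the
# rewrites mentioned in the deposition. The data structure is our own
# assumption; the real system is far richer.

EXPANSIONS = {
    "umass": ["university of massachusetts"],
    "allan": ["allen"],  # alternate spelling
}
FIRST_NAMES = {"james"}

def expand_query(query):
    terms = query.lower().split()
    return {
        "terms": {t: [t] + EXPANSIONS.get(t, []) for t in terms},
        "annotations": {t: "first_name" for t in terms if t in FIRST_NAMES},
    }

print(expand_query("james allan umass"))
# {'terms': {'james': ['james'], 'allan': ['allan', 'allen'],
#            'umass': ['umass', 'university of massachusetts']},
#  'annotations': {'james': 'first_name'}}
```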
It’s just a Big Table!
A better description of Google’s Navboost system from the DOJ trial, in this exchange from Dr. Eric Lehman’s testimony:
Q. Navboost trains on about 13 months of user data; right?
A. That is my understanding. Now, the word “trains” here might be a little misleading. Navboost is not a machine-learning system. It’s just a big table. It says for this document – sorry, for this search query, this document got two clicks. For this query, this document got three clicks, this – and so on. And it’s aggregated, and there’s a little bit of extra data. But you can think of it as just a giant table.
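Taking Lehman’s “just a big table” description literally, a minimal sketch might look like this. The per-query, per-document aggregated click counts come straight from the testimony, and the locale/device slicing from the signal list above; the field names and everything else are assumptions.

```python
from collections import defaultdict

# "Just a big table": aggregated click counts keyed by query and document,
# sliced here by locale and device type (per the signal list above).
# Field names are illustrative; the real schema is not public.
navboost = defaultdict(int)

def record_click(query, doc, locale="US", device="desktop"):
    navboost[(query, doc, locale, device)] += 1

def clicks_for(query, doc):
    """Total aggregated clicks for this query/document pair, across all slices."""
    return sum(count for (q, d, _, _), count in navboost.items()
               if q == query and d == doc)

record_click("best hiking boots", "example.com/boots")
record_click("best hiking boots", "example.com/boots", device="mobile")
record_click("best hiking boots", "other.com/review")
print(clicks_for("best hiking boots", "example.com/boots"))  # 2
```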

- PXRD003 Excellent look at how SGE works. Including this fuzzy, but very uber cool doc: (note the email title “PageRank for training data”)
- PXRD004 Goes into detail about the possible “User Choice” screen for browser selection.
- PXRD014 The possible force-feeding of Google Gemini in Chrome, with Google defaulting to Gemini instead of search:
- PXRD030 Google has said for years that “ad quality is important”. Meanwhile we have this quote slide that runs counter to that:
- PXR0038 Internal Google doc on AI Overviews. Good background doc.
- PXR0181 OpenAI approached Google about using Google Search in SearchGPT and was swatted away – just 8 months ago:
- PXR0182 Google does/did 8.3 billion queries per day, and AI Overviews (AIO) is doing 600 million a day (Dec 2024)
- It also mentions that Perplexity was doing 20 million daily “messages” in Dec 2024 (5 months ago)
- PXR0366 Random meeting notes:
What is interesting about it is that Sergey was present. So rumors of him being back in the full mix are not rumors any more…
Try the NotebookLM version of this post:
More Google on SEW
- How Google decided website owner control of AIO (and couldn’t be bothered to ask a Website Owner? hmmm)
- Half the Web Opted Out of Google AI Training! Google Trained Anyway.
- Google Faces Heat on Three Fronts: Discrimination, Diplomacy, and Dangerous Devices
- Classaction.org Begins Campaign to Organize Advertisers into Mass Arbitration Action Against Google
- Google Held a Brunch for 7 Bloggers – Umm – Called it A “Summit”?
- Pushing Navboost – A Fresh Look at One of Google’s Top Ranking Signals
- CloudFlare CEO Pulls Fire Alarm

As the CEO and founder of Pubcon Inc., Brett Tabke has been instrumental in shaping the landscape of online marketing and search engine optimization. His journey in the computer industry has spanned over three decades and has made him a pioneering force behind digital evolution. Full Bio
Visit Pubcon.com