Half the Web Opted Out of Google AI Training! Google Trained Anyway.

A court filing just confirmed what most web publishers feared: while Google offered site owners a way to opt out of having their content used to train AI, it continued training its Search-based AI products on that same content! The only way to opt out completely is to opt out of being indexed in Google Search entirely.

The numbers reveal the scale of webmaster resistance:

50% of Content Opted Out!


During ongoing federal antitrust proceedings, a Department of Justice lawyer presented an internal Google document titled “Search GenAI <> Gemini v3”. According to the document, Google removed 80 billion out of 160 billion tokens (snippets of text) from its AI training data after filtering out content from publishers who opted out.

That means half the dataset was removed due to publisher objections! The volume alone shows that a massive segment of the web did not consent to having their content used in this way.

Judge Mehta, overseeing the trial, asked for clarification.

“The 80 billion out of 160 billion tokens, 50 percent is removed by publishers opting out?” – Judge Mehta.
“That is correct,” said Eli Collins, VP at DeepMind.

Here’s the Catch: Google Still Used the Rest!

The opt-out mechanism was touted as a tool for publisher control. Now we find out that it applies only to Gemini/DeepMind’s AI models. It does not apply to Search’s use of content for Search AI and other search-specific applications. Confused yet? So are we. The distinction was about as clear as muddy glasses, but Collins himself confirmed it in testimony.

So while half the web tried to opt out, Google retained the ability to train its Search-based AI models using that very content. If you didn’t also block Googlebot from indexing your site, your content was fair game.
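
To make the split concrete, here is a minimal sketch of what a typical opt-out looks like from the crawler's side. It assumes the opt-out in question is the `Google-Extended` robots.txt token, which Google documents as its control for Gemini training; the article's source does not name the mechanism, so treat that as our assumption. The `audit` helper, the sample rules, and the example URL are all hypothetical, and Python's standard-library parser does not match user agents exactly the way Google's own parser does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opts out of Google-Extended (assumed to be the
# AI-training opt-out) while leaving every other crawler, including
# Googlebot, free to crawl and index the site.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def audit(robots_txt, agents, url="https://example.com/article"):
    """Return {agent: allowed} for each user-agent token under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    for agent, allowed in audit(ROBOTS_TXT, ["Google-Extended", "Googlebot"]).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these rules `Google-Extended` is blocked but `Googlebot` is still allowed, which is exactly the situation described above: the page stays in the Search index, and therefore stays available to Search-side AI.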

Google Is Using More Than Just Web Pages

The internal document also listed other sources of training data, including:

  • Search session data – behavioral logs of user interactions with Search
  • YouTube videos
  • Additional content signals tied to Google’s ecosystem

Obviously, this kind of proprietary behavioral data gives Google a major edge over competitors. It is part of what sustains the search monopoly that a federal court has ruled Google illegally maintained. It also raises questions about what data is truly “opt-outable” and whether publishers or users have any say at all in how it’s used.

What This Means for SEOs

If you’re in SEO, this moment deserves your attention:

  • A massive percentage of publishers said no to AI training.

  • Google moved forward anyway.

  • The only “real” way to stop Google from training AI on your content is to block indexing completely – sigh – cutting yourself off from search traffic in the process.

This isn’t just about privacy or copyright. It’s about control. Google is replacing traditional search results with AI summaries trained on the very web it destroyed, and it’s doing so with only wink-and-a-nod, selective respect for consent.

What Can You Do?

  1. Audit your robots.txt and AI opt-out headers
    Make sure you’ve applied any controls you intend to use, while keeping their limitations in mind: it is questionable whether Google will honor them.

  2. Monitor your traffic for zero-click erosion
    Look for signs of lost traffic to AI-generated summaries.

  3. Build outside the Google funnel
    Focus on email, brand, and content ecosystems that don’t depend on Google’s benevolence.
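
For step 2, one rough way to spot zero-click erosion is to compare click-through rate across two periods: if impressions hold roughly steady while CTR drops sharply, searchers are still seeing you but no longer clicking. The sketch below is only an illustration. The `Period` class, the function name, both thresholds, and the sample numbers are made up for the example, and a CTR drop can have causes other than AI summaries.

```python
from dataclasses import dataclass


@dataclass
class Period:
    label: str
    impressions: int
    clicks: int

    @property
    def ctr(self) -> float:
        """Click-through rate; 0.0 when there were no impressions."""
        return self.clicks / self.impressions if self.impressions else 0.0


def flag_zero_click_erosion(before, after,
                            max_impression_drop=0.10,  # assumed threshold
                            min_ctr_drop=0.20):        # assumed threshold
    """True when impressions roughly hold but CTR falls sharply."""
    if before.impressions == 0 or before.ctr == 0:
        return False  # no baseline to compare against
    impression_change = (after.impressions - before.impressions) / before.impressions
    ctr_change = (after.ctr - before.ctr) / before.ctr
    return impression_change >= -max_impression_drop and ctr_change <= -min_ctr_drop


if __name__ == "__main__":
    # Hypothetical numbers, e.g. exported from a search performance report.
    q1 = Period("Q1", impressions=100_000, clicks=5_000)
    q2 = Period("Q2", impressions=98_000, clicks=2_940)
    print(f"{q1.label} CTR: {q1.ctr:.1%}, {q2.label} CTR: {q2.ctr:.1%}")
    print("Possible zero-click erosion:", flag_zero_click_erosion(q1, q2))
```

Run it per page or per query group rather than site-wide, since an average can hide the pages that are actually losing clicks.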

Copy and © Copyright 1997-2024 SearchEngineWorld . all rights reserved
All trademarks and copyrights held by respective owners.
WebmasterWorld and SearchEngineWorld are owned by Pubcon