updated 1/1/2026
What Is Common Crawl?
It is one of the most influential data sources on the web, and the vast majority of site owners don't even realize their content is in it. So what is it? Common Crawl is a nonprofit organization that maintains an open repository of web crawl data - a snapshot of the internet, available to anyone. They crawl billions of pages each month and make the data available in massive downloadable archives. Their stated goal is to democratize access to web-scale data for research, AI, language modeling, SEO, and beyond.
You know all those Amazon AWS IPs from browsers and bots in your server logs? Those hits can be Common Crawl's bot (CCBot) crawling your site, or they can be someone querying the Common Crawl database stored on Amazon AWS S3 to find your site. This can lead to visibility for your site in LLMs and engines like ChatGPT, Llama, and others that use the database in various ways.
Who Owns Common Crawl?
Common Crawl was founded in 2007 by Gil Elbaz and is operated by the Common Crawl Foundation, a California-based nonprofit. Its first full-scale crawl was released in 2011, and it has published monthly crawls since 2013. Elbaz is best known as the co-creator of Applied Semantics, which Google acquired and turned into AdSense.
What Do They Do?
Each month, Common Crawl releases a new dataset containing petabytes of crawled web pages. The dataset includes raw HTML, extracted metadata, link graphs, and text-based content. They don't just collect web pages; they structure them in a format optimized for research, machine learning, and AI applications.
The data includes:
- Raw WARC files (Web ARChive format)
- Text-only extractions
- Metadata (like canonical links, HTTP status, etc.)
- Outlink graphs (link relationships between sites)
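To give a feel for the WARC container those archives use, here is a minimal sketch that parses the header block of a single WARC record. This is illustration only - the sample record is hand-made, and real tooling like warcio handles chunking, payloads, and encodings:

```python
def parse_warc_headers(record_head: str) -> dict:
    """Parse the header block of one WARC record (the lines before
    the blank line that separates headers from the payload)."""
    lines = record_head.splitlines()
    version = lines[0]                     # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break                          # blank line ends the header block
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    return {"version": version, "headers": headers}

# Hand-made sample record head, for illustration:
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "Content-Length: 1024\r\n"
)
parsed = parse_warc_headers(sample)
print(parsed["headers"]["WARC-Type"])  # response
```

The `Content-Length` header is what lets a reader skip from record to record inside a multi-gigabyte archive without parsing every payload.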
Crawler Bot
Common Crawl crawls roughly once a month. It does not aim to provide full indexing or search services; it's a foundational dataset that researchers, developers, and companies build on top of.
Common Crawl identifies itself with the following user-agent:
CCBot/2.0 (+http://commoncrawl.org/faq/)
In some logs, you may also see it as:
CCBot/2.0
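If you want to spot CCBot in your own logs, a small sketch like the following (the function name is ours, not an official API) matches that user-agent string:

```python
import re

# CCBot identifies itself as "CCBot/2.0 (+http://commoncrawl.org/faq/)",
# sometimes shortened to just "CCBot/2.0" in logs.
CCBOT_PATTERN = re.compile(r"\bCCBot/\d+\.\d+\b")

def is_ccbot(user_agent: str) -> bool:
    """Return True if the user-agent string looks like Common Crawl's CCBot."""
    return bool(CCBOT_PATTERN.search(user_agent))

print(is_ccbot("CCBot/2.0 (+http://commoncrawl.org/faq/)"))   # True
print(is_ccbot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```

Keep in mind user-agent strings are trivially spoofed, which is why the reverse DNS check described below matters.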
Nutch Crawler
Common Crawl uses a Nutch-based crawler. Apache Nutch is an open-source web crawler built on top of Apache Hadoop.
Known IP Ranges
Common Crawl uses infrastructure from Amazon Web Services (AWS) and does not publish a static list of IPs. However, its crawlers resolve to identifiable AWS IPs, often under the ccbot.commoncrawl.org domain.
Site owners can verify requests by doing a reverse DNS lookup on the IP to confirm it ends with commoncrawl.org.
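A hedged sketch of that verification in Python: reverse DNS, a suffix check, then a forward lookup to confirm the hostname maps back to the same IP (the helper names here are ours, not an official API):

```python
import socket

def hostname_is_commoncrawl(hostname: str) -> bool:
    """True only for commoncrawl.org itself or a subdomain of it
    (a plain substring check would accept evil-commoncrawl.org.attacker.com)."""
    host = hostname.rstrip(".").lower()
    return host == "commoncrawl.org" or host.endswith(".commoncrawl.org")

def verify_ccbot_ip(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-resolve the
    hostname and confirm it maps back to the same IP. Returns False on
    any lookup failure. (Performs live DNS lookups when called.)"""
    try:
        hostname = socket.gethostbyaddr(ip)[0]             # reverse DNS
        if not hostname_is_commoncrawl(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]  # forward confirm
    except OSError:
        return False
```

The forward-confirm step is the same pattern search engines recommend for verifying Googlebot and Bingbot: reverse DNS alone can be faked by whoever controls the IP's PTR record.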
Why Should Site Owners Care?
- You're Being Crawled: If your site isn't blocking them, you're probably already in Common Crawl's archives.
- LLMs Train on It: OpenAI, Google (T5), Meta, Hugging Face, and others use Common Crawl to train large language models. Your content may be part of that training data, and the dataset is updated far more rapidly than most realize. As of this writing, a post from two hours ago on this site is already in Common Crawl.
- SEO & Scraping Implications: Common Crawl data is regularly used for link graph analysis, competitive intelligence, and scraping tools.
- Data Permanence: Once crawled, your content can remain in the dataset for years - even if removed from your site.
If you care about who's copying or analyzing your site at scale, you should care about Common Crawl.
Backlink and Link Graphs
Common Crawl publishes web graphs that represent hyperlink relationships between websites. These graphs are available at both the host and domain levels, showing how different sites link to one another. This data is then used by sites such as SemRush, Majestic, and Inlinks for further analysis and usage in their services.
The most recent release, covering May-July 2025, includes a host-level graph with a staggering 481.6 million nodes and 3.4 billion edges. It also includes a domain-level graph with 209.5 million nodes and 2.6 billion edges (Common Crawl).
These graphs are constructed from the WAT files (Web Archive Transformation), which contain metadata about the crawled pages, including outlinks. The data is processed using tools like cc-pyspark and the WebGraph framework to generate the final graph structures.
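To make the WAT-to-graph step concrete, here is a small sketch that pulls outlink URLs out of one WAT record's JSON payload. The nested field names follow Common Crawl's published WAT layout, and the tiny sample record is hand-made for illustration (real payloads are far larger):

```python
import json

def extract_outlinks(wat_record_json: str) -> list[str]:
    """Return the outlink URLs recorded in one WAT record's JSON payload."""
    meta = json.loads(wat_record_json)
    links = (meta.get("Envelope", {})
                 .get("Payload-Metadata", {})
                 .get("HTTP-Response-Metadata", {})
                 .get("HTML-Metadata", {})
                 .get("Links", []))
    return [link["url"] for link in links if "url" in link]

# Hand-made sample record, for illustration only:
sample = json.dumps({
    "Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
        "HTML-Metadata": {"Links": [
            {"path": "A@/href", "url": "https://example.org/about"},
            {"path": "IMG@/src", "url": "https://example.org/logo.png"},
        ]}}}}})
print(extract_outlinks(sample))
```

At web scale, the graph builders aggregate these per-page link lists into the host- and domain-level edge lists described above.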
What Does It Cost?
Common Crawl's data is free to use. It's hosted on Amazon S3 under the Public Data Sets (PDS) program. That said, accessing the data can get very expensive depending on how you use it, so the "free" aspect can be misleading for those without significant technical resources. The files are massive, and processing them requires serious computing power and a working knowledge of big data tools like Apache Spark. For individuals or smaller teams, the costs of cloud computing, storage, and bandwidth to process petabytes of data can be - ouch - substantial. The result is that the data is effectively "paywalled by infrastructure," making it most realistically usable by large companies and academic institutions with a budget for cloud resources.
But the data itself? Free.
Who Uses Common Crawl?
- OpenAI (for pre-training LLMs like GPT-3 and GPT-4)
- Meta AI (for LLaMA model training)
- Google (reportedly for its T5 LLM and the Colossal Clean Crawled Corpus, C4)
- Allen Institute for AI
- Other LLMs: Kimi.com, Grok, and DeepSeek are known to use CC data.
- Hugging Face
- Open search engine projects like YaCy and Marginalia
- Services such as SemRush, Majestic, and Inlinks
- Academic researchers: cited in over 10,000 academic papers, Common Crawl fuels studies in web science, NLP, internet censorship, language mapping, privacy analysis, and more
Benefits of Allowing CCBot Access?
It's one of the most widely used datasets in AI, machine learning, and web analytics research. Ash Nallawalla, author of The Accidental SEO Manager, writes:

"A manager asked me why our leading brand was not mentioned by an AI platform, which mentioned obscure competitors instead. I found that we had been blocking ccBot for some years, because some sites were scraping our content indirectly. After some discussion, we felt that allowing LLM crawlers was more beneficial than the risk of being scraped, so we revised our exclusion list." - Ash Nallawalla
Google - Google - Google?
Why are Google SERPs and telltale Google IDs being found in the Common Crawl database - and subsequently in OpenAI? In a groundbreaking post (in French) on LinkedIn, Olivier de Segonzac dropped a nugget: some of Google's Shopping results are finding their way into OpenAI data because the data retained merchant-specific parameters (IDs and "srsltid"). That might suggest a direct feed from Google or OpenAI scraping, but his results imply that OpenAI may instead be getting the data from third parties such as SemRush or Common Crawl.
How Does That Data Get to OpenAI?
- Common Crawl is part of OpenAI's training data
- OpenAI confirmed it used Common Crawl data up to at least 2023 to train GPT-3 and GPT-4. (1) From Section 2.2 - Training Dataset:
"The dataset is a mixture of Common Crawl and other datasets ... After deduplication and some quality filtering, the final training dataset contains 300 billion tokens from the following sources:
- 60% filtered Common Crawl
- 22% WebText2
- 8% Books
- 8% Wikipedia
- 2% other sources"
- That data includes text extracted from millions of public pages, including those containing SERP fragments.
- OpenAI filters and deduplicates, but not perfectly.
- Pages containing quoted search snippets, top result text, or user questions phrased like searches may still end up in the training corpus.
- The net result: ChatGPT (and similar LLMs) can sometimes echo back Google-like results or language because that structure was in the training data. This includes SERP snippets and SERP results.
- Additionally, OpenAI's own crawlers have been found to stumble into shopping and image SERPs from time to time. It is well known that OpenAI also trained some of its image processors on Google's YouTube (without permission) and possibly Google Images for DALL-E 3. The fact that you can find Google SERPs in OpenAI is not a surprise. This is not a hiybbprqag situation; it is just an "oops" - silly bots will be bots - situation. Lastly, OpenAI runs DeepResearch bots that acquire content on the fly; we do not know whether some of those downloads make their way into training data, but they could easily include SERPs.
Where Is the Data Stored?
Common Crawl stores its data in Amazon S3 buckets under the commoncrawl public dataset. The data is hosted in the AWS us-east-1 region (Northern Virginia), accessible through:
s3://commoncrawl/
They also offer indexed search and filtering tools like:
- Common Crawl Index Server (CDX)
- Columnar index files for selective downloading
You can query data using Amazon Athena, or download via wget, curl, or AWS CLI.
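As a sketch of querying the CDX index over HTTP (the crawl ID below is an assumption - pick a current one from index.commoncrawl.org), you can build the query URL and parse the newline-delimited JSON the index server returns:

```python
import json
from urllib.parse import urlencode

# Hypothetical crawl ID for illustration; check index.commoncrawl.org
# for the list of real ones.
INDEX = "CC-MAIN-2025-30"

def cdx_query_url(site: str, index: str = INDEX) -> str:
    """Build a Common Crawl Index Server (CDX) query URL.
    output=json makes the server return one JSON object per capture, per line."""
    params = urlencode({"url": f"{site}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{index}-index?{params}"

def parse_cdx_lines(body: str) -> list[dict]:
    """Parse the newline-delimited JSON body of a CDX response."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]

print(cdx_query_url("example.com"))
# A response line looks roughly like this (fields abridged):
sample = ('{"urlkey": "com,example)/", "timestamp": "20250101000000", '
          '"url": "https://example.com/", "status": "200"}')
print(parse_cdx_lines(sample)[0]["status"])  # 200
```

Each result row points at the exact WARC file, offset, and length where that capture lives in S3, so you can range-request just the record you need instead of downloading a whole archive.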
Common Crawl Summary
| Aspect | Details |
|---|---|
| Org | Common Crawl Foundation (nonprofit) |
| Founded | 2007 (first crawl in 2011) |
| Founder | Gil Elbaz, co-founder of Applied Semantics (origin of AdSense) |
| Crawler Name | CCBot/2.0 |
| Crawler IPs | AWS-hosted, reverse DNS to commoncrawl.org |
| Cost | Free access; AWS usage may incur charges |
| Storage | Amazon S3 (s3://commoncrawl) |
| Used By | OpenAI, Meta, Hugging Face, academic researchers, SEO tools |
Tools and Services to Access CC Data?
Common Crawl develops open-source tools for accessing its data. The organization maintains a GitHub repository at github.com/commoncrawl, hosting a variety of projects for accessing and processing their web crawl data. These tools are primarily licensed under the Apache 2.0 License.
Some notable repositories include:
- cc-pyspark: A Python and Spark-based toolkit for processing CC data.
- cc-index-table: Java-based utilities for indexing Common Crawl archives in tabular formats.
- cc-crawl-statistics: Scripts for extracting and analyzing statistics from monthly crawl archives.
- cc-warc-examples: Examples and processing code for WARC/WET/WAT files using Java and Hadoop.
- cc-notebooks: A collection of Jupyter notebooks demonstrating various use cases of Common Crawl data.
These tools support a range of applications, from large-scale data analysis to NLP processing tasks. Additionally, the community has developed supplementary tools, such as comcrawl, a Python utility for downloading and processing Common Crawl data, and cc_net by Facebook AI Research, which provides tools for downloading and cleaning Common Crawl data for language modeling.
For those interested in leveraging these resources, Common Crawl provides a Getting Started guide, offering insights into data formats like WARC, WAT, and WET, and examples of how to work with them.
Exit Thoughts:
If your site is publicly accessible and not specifically blocking Common Crawl, you're contributing to the world's open web memory bank - whether you meant to or not. For developers and researchers, it's a goldmine. For content creators, it's worth watching.