Our dataset draws from three primary sources. The bulk of the data comes
from Common Crawl,
specifically the CC-MAIN-2026-04 archive from January 2026,
which contains 2.30 billion web pages across roughly 70 TiB of data. Due to
time and bandwidth constraints, we processed 4 TiB of that archive.
To classify pages by topic, we used
NewsAPI
as a source of known news domains. To classify whether text
was AI-generated, we used the
fakespot-ai/roberta-base-ai-text-detection-v1
model from Hugging Face.
The three sources are summarized below:

- Common Crawl, CC-MAIN-2026-04 archive (January 2026). Link: commoncrawl.org.
- NewsAPI (/v2/top-headlines/sources endpoint). Link: newsapi.org. Output: news-source URLs. Purpose: tag pages as news articles for topic-per-page classification.
- fakespot-ai/roberta-base-ai-text-detection-v1 (RoBERTa base). Link: Hugging Face. Output: AI/HUMAN labels. Purpose: score each page's likelihood of being AI-generated.

The main text of each page was extracted from raw HTML using Trafilatura, and non-English pages were filtered out using fast-langdetect.
Each page was scored for AI-topic relevance using a weighted keyword approach. AI-related stems (e.g. "ai", "llm", "gpt", "chatgpt", "openai") were assigned weights, and the sum of matching stem weights determined whether a page was classified as AI-topic or not.
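The weighted keyword scoring could look like the sketch below. The stem weights and the cutoff value are illustrative assumptions; the text does not give the actual numbers, and it is also assumed here that each matching stem contributes its weight once per page rather than per occurrence.

```python
# Illustrative stem weights and threshold -- not the study's actual values.
STEM_WEIGHTS = {"ai": 1.0, "llm": 2.0, "gpt": 2.0, "chatgpt": 3.0, "openai": 3.0}
SCORE_THRESHOLD = 3.0  # assumed cutoff for the AI-topic label

def ai_topic_score(stem_counts: dict[str, int]) -> float:
    """Sum the weights of AI-related stems present on the page."""
    return sum(w for stem, w in STEM_WEIGHTS.items() if stem_counts.get(stem, 0) > 0)

def is_ai_topic(stem_counts: dict[str, int]) -> bool:
    """Classify a page as AI-topic when its score reaches the threshold."""
    return ai_topic_score(stem_counts) >= SCORE_THRESHOLD
```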
The extracted text of each page was run through the
fakespot-ai/roberta-base-ai-text-detection-v1 model
across multiple GPUs. The model outputs a probability score per page,
which was thresholded into AI, Human, or Unknown categories.
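The thresholding step might be sketched as below. The two cutoff values are illustrative assumptions; the text does not state the thresholds actually used to separate the three categories.

```python
# Map the detector's per-page P(AI-generated) score to a coarse label.
# The cutoffs here are assumptions for illustration, not the study's values.
AI_THRESHOLD = 0.8     # score >= this -> "AI"
HUMAN_THRESHOLD = 0.2  # score <= this -> "Human"

def label_from_score(p_ai: float) -> str:
    """Return "AI", "Human", or "Unknown" for a probability score."""
    if p_ai >= AI_THRESHOLD:
        return "AI"
    if p_ai <= HUMAN_THRESHOLD:
        return "Human"
    return "Unknown"  # mid-range scores are left unresolved
```

Keeping an explicit Unknown band avoids forcing a label on pages where the detector is uncertain.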
Pages were classified into topic categories (Blog, Wiki, News, Shop)
based on URL pattern matching. News domains from the NewsAPI
/v2/top-headlines/sources endpoint were used as an
additional signal to identify news pages.
Page text was lemmatized using simplemma, producing per-page stem frequency counts used for word distribution analysis and Al-topic scoring.
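The per-page frequency counting could be sketched as below. The tokenizer regex is an assumption; the sketch falls back to an identity lemmatizer when simplemma is not installed so it remains runnable.

```python
import re
from collections import Counter

try:
    from simplemma import lemmatize  # lemmatizer named in the text
except ImportError:
    def lemmatize(token, lang="en"):  # fallback: identity, for illustration only
        return token

def stem_frequencies(text: str, lang: str = "en") -> Counter:
    """Tokenize, lemmatize, and count per-page stem frequencies."""
    tokens = re.findall(r"[a-z']+", text.lower())  # assumed tokenization
    return Counter(lemmatize(t, lang=lang) for t in tokens)
```

These counts feed both the word-distribution analysis and the weighted keyword scoring described earlier.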