Data Acquisition

Our dataset draws from three primary sources. The bulk of the data comes from Common Crawl, specifically the CC-MAIN-2026-04 archive from January 2026, which contains 2.30 billion web pages across roughly 70 TiB of data. Due to time and bandwidth constraints, we processed 4 TiB of that archive. To classify pages by topic, we used NewsAPI as a source of known news domains. To classify whether text was AI-generated, we used the fakespot-ai/roberta-base-ai-text-detection-v1 model from Hugging Face.

Common Crawl

Format: Raw WARC files (HTTP responses with headers and full page content)
Output: URL, HTTP headers, raw HTML body
Purpose: Bulk web page data from the CC-MAIN-2026-04 archive (January 2026)
Link: commoncrawl.org

NewsAPI

Format: JSON (REST API, /v2/top-headlines/sources endpoint)
Output: URL
Purpose: Tag pages as news articles for topic-per-page classification
Link: newsapi.org

Fakespot AI Detection Model

Format: Hugging Face model (fakespot-ai/roberta-base-ai-text-detection-v1, RoBERTa base)
Output: AI, HUMAN labels
Purpose: Score each page's likelihood of being AI-generated
Link: Hugging Face

Filtering

The main text of each page was extracted from raw HTML using Trafilatura, and non-English pages were filtered out using fast-langdetect.
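The filtering step can be sketched as follows. This is an assumed structure, not the pipeline's actual code: the extraction and language-detection callables stand in for Trafilatura's `extract` and fast-langdetect's detector, and the demo uses trivial stand-ins so the sketch is self-contained.

```python
def filter_page(html, extract_text, detect_lang):
    """Return the extracted English main text, or None if the page is dropped.

    extract_text stands in for trafilatura's extraction;
    detect_lang stands in for fast-langdetect (returns (lang, confidence)).
    """
    text = extract_text(html)
    if not text:
        return None            # extraction failed or page empty -> drop
    lang, _conf = detect_lang(text)
    if lang != "en":
        return None            # non-English page -> drop
    return text

# Demo with trivial stand-ins for the real libraries:
strip_tags = lambda html: html.replace("<p>", "").replace("</p>", "").strip()
always_en = lambda text: ("en", 1.0)
print(filter_page("<p>Hello world</p>", strip_tags, always_en))  # Hello world
```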

Enrichment

AI Topic Detection

Each page was scored for AI-topic relevance using a weighted keyword approach. AI-related stems (e.g. "ai", "llm", "gpt", "chatgpt", "openai") were assigned weights, and the sum of matching stem weights determined whether a page was classified as AI-topic.
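The weighted keyword scoring described above can be sketched as a small function. The weights and threshold below are illustrative assumptions, not the values used in the pipeline:

```python
# Illustrative weights; the pipeline's actual values are not reproduced here.
AI_STEM_WEIGHTS = {"ai": 1.0, "llm": 2.0, "gpt": 2.0, "chatgpt": 3.0, "openai": 3.0}
THRESHOLD = 3.0  # hypothetical cutoff for the AI-topic label

def ai_topic_score(page_stems):
    """Sum the weights of AI-related stems that appear on the page."""
    return sum(w for stem, w in AI_STEM_WEIGHTS.items() if stem in page_stems)

def is_ai_topic(page_stems):
    """Classify a page as AI-topic when its score reaches the threshold."""
    return ai_topic_score(page_stems) >= THRESHOLD

print(is_ai_topic({"chatgpt", "weather"}))  # True
print(is_ai_topic({"weather", "cat"}))      # False
```

Taking the page's stems as a set means each stem counts once regardless of how often it appears; weighting by frequency would be an alternative design.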

AI Influence Detection

The extracted text of each page was run through the fakespot-ai/roberta-base-ai-text-detection-v1 model across multiple GPUs. The model outputs a probability score per page, which was thresholded into AI, Human, or Unknown categories.
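The thresholding of the model's probability into three labels can be sketched as below. The two cutoffs are illustrative assumptions; the report does not state the actual values:

```python
# Hypothetical cutoffs for mapping P(AI) to a label.
AI_CUTOFF = 0.8
HUMAN_CUTOFF = 0.2

def label_page(p_ai):
    """Map the detector's P(AI) score into AI / Human / Unknown.

    Scores between the two cutoffs are treated as inconclusive.
    """
    if p_ai >= AI_CUTOFF:
        return "AI"
    if p_ai <= HUMAN_CUTOFF:
        return "Human"
    return "Unknown"

print(label_page(0.95))  # AI
print(label_page(0.50))  # Unknown
print(label_page(0.05))  # Human
```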

Topic Detection

Pages were classified into topic categories (Blog, Wiki, News, Shop) based on URL pattern matching. News domains from the NewsAPI /v2/top-headlines/sources endpoint were used as an additional signal to identify news pages.
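The URL-based topic classification can be sketched as a rule list plus a domain lookup. The patterns and the two example domains below are hypothetical; the actual rules and the NewsAPI domain list are not reproduced here:

```python
import re
from urllib.parse import urlparse

# Hypothetical examples of domains returned by /v2/top-headlines/sources.
NEWS_DOMAINS = {"bbc.co.uk", "reuters.com"}

# Illustrative URL patterns, checked in order; not the pipeline's actual rules.
PATTERNS = [
    ("Blog", re.compile(r"/blog/|blogspot|wordpress")),
    ("Wiki", re.compile(r"wiki")),
    ("News", re.compile(r"/news/|/article/")),
    ("Shop", re.compile(r"/shop/|/product/|/cart")),
]

def classify_url(url):
    """Return a topic category for the URL, or None if nothing matches."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in NEWS_DOMAINS:        # known news domain wins outright
        return "News"
    lowered = url.lower()
    for topic, pattern in PATTERNS:
        if pattern.search(lowered):
            return topic
    return None

print(classify_url("https://www.bbc.co.uk/sport"))        # News
print(classify_url("https://example.com/blog/post-1"))    # Blog
```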

Lemmatizing

Page text was lemmatized using simplemma, producing per-page stem frequency counts used for word distribution analysis and AI-topic scoring.
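The per-page frequency counting can be sketched as follows. The lemmatizer is passed in as a callable so the sketch stays self-contained; in the pipeline it would be simplemma's lemmatization, and the demo below uses a toy two-word mapping instead:

```python
import re
from collections import Counter

def stem_frequencies(text, lemmatize):
    """Tokenize the page text, lemmatize each token, and count frequencies.

    `lemmatize` stands in for simplemma's lemmatizer; here it is any
    str -> str callable.
    """
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return Counter(lemmatize(token) for token in tokens)

# Demo with a toy lemmatizer covering only two inflections:
toy_map = {"models": "model", "ran": "run"}
freqs = stem_frequencies("Models ran and models run",
                         lambda t: toy_map.get(t, t))
print(freqs)  # Counter({'model': 2, 'run': 2, 'and': 1})
```

These per-page counters are what the word distribution analysis and the AI-topic keyword scorer consume downstream.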