In these word clouds, we visualized the 500 overall most frequently used words and identified the words that are used more by
AI or human content respectively. It is important to note that these aren't the most frequently used words in either type of content; they are the words where the difference between the two is largest.
In the following word cloud, we can identify that AI content uses more
stop words like the, of, and that.
Furthermore, words like experience, enhance, ensure, and provide are more common in AI-written text. There is also
a notable increase in punctuation, likely due to the fact that most people
are pretty bad at using enough of it. This includes us. Except Emil, he's
perfect.
In the human word cloud, we find more conversational language. In
addition to this, there are more political words, like Trump, Europe, War, Government, and even slurs such as British. We also find socially taboo topics, with words such as kill, shoot, death, and suicide.
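The word selection described for the word clouds can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the function name `distinctive_words`, the use of relative frequency differences as the score, and the toy token lists are all assumptions made for the example.

```python
from collections import Counter


def distinctive_words(ai_tokens, human_tokens, top_n=500):
    """Score the top_n most frequent words overall by how differently
    AI and human text use them (hypothetical helper, not from the source).

    A positive score means the word is relatively more common in AI text,
    a negative score means it is relatively more common in human text.
    """
    ai_counts = Counter(ai_tokens)
    human_counts = Counter(human_tokens)
    overall = ai_counts + human_counts

    # Guard against empty inputs so the division below cannot fail.
    ai_total = sum(ai_counts.values()) or 1
    human_total = sum(human_counts.values()) or 1

    scored = []
    for word, _ in overall.most_common(top_n):
        diff = ai_counts[word] / ai_total - human_counts[word] / human_total
        scored.append((word, diff))

    # Largest absolute difference first: these are the words worth drawing.
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scored


# Toy data, invented purely for illustration.
ai_tokens = "we ensure the best experience and enhance the workflow".split()
human_tokens = "honestly the war news today was pretty grim".split()
ranking = distinctive_words(ai_tokens, human_tokens)
```

Comparing relative rather than absolute frequencies keeps the score meaningful when one corpus is much larger than the other, which is likely the case here given that most sites are human-written.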
This heatmap visualizes citation patterns between domains based on a chosen variable. Both axes can be configured to show different variables such as AI content, AI topic, or site type. Each axis represents the percentage of that variable on a domain, ranging from 0% (none of the domain's content matches the variable) to 100% (all of it does). The x-axis represents the source domains and the y-axis represents the cited domains. Each cell then shows how strongly domains at one level cite domains at another, revealing whether, for example, high-AI domains preferentially cite other high-AI domains.
As we can see, domains that are barely or not at all AI-written preferentially cite other non-AI-written domains. This seems to be primarily due to volume, though, as there are around 450,000 references to high-AI domains from low-AI domains, compared to the mere 35,000 high-AI domains we have. Notably, however, a whopping 40% of references in AI-written domains are to other AI-written domains.
Most other categories fare similarly, with approximately comparable numbers, except for blogs, where nearly 5% of low-AI blogs link to high-AI domains.
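The binning behind a heatmap like this could look roughly like the sketch below. It assumes each domain has a single share value between 0.0 and 1.0 for the chosen variable, that the axes are split into ten equal bins, and that cells hold raw citation counts. None of these details are stated above, and the names `citation_heatmap`, `domain_share`, and the example domains are hypothetical.

```python
def citation_heatmap(domain_share, citations, bins=10):
    """Count citations between percentage bins.

    domain_share: dict mapping domain -> share in [0.0, 1.0] of the variable
    citations: iterable of (source_domain, cited_domain) pairs
    Returns grid[y][x], where x is the source bin and y is the cited bin.
    """
    def bin_index(share):
        # A share of exactly 100% goes into the last bin instead of overflowing.
        return min(int(share * bins), bins - 1)

    grid = [[0] * bins for _ in range(bins)]
    for source, cited in citations:
        # Skip links to or from domains we have no share value for.
        if source not in domain_share or cited not in domain_share:
            continue
        x = bin_index(domain_share[source])
        y = bin_index(domain_share[cited])
        grid[y][x] += 1
    return grid


# Toy data, invented purely for illustration.
domain_share = {
    "a.example": 0.0,
    "b.example": 0.95,
    "c.example": 1.0,
    "d.example": 0.05,
}
citations = [
    ("a.example", "d.example"),  # low-AI source citing low-AI target
    ("b.example", "c.example"),  # high-AI source citing high-AI target
    ("b.example", "a.example"),  # high-AI source citing low-AI target
    ("x.example", "a.example"),  # unknown source, skipped
]
grid = citation_heatmap(domain_share, citations)
```

Raw counts make the volume effect mentioned above very visible; dividing each column by its total would instead show what fraction of a source bin's citations go to each target bin.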
To understand how AI influences linking behavior, we took all domains classified as news sites and split them into AI-generated and human-written articles. For each group, we then measured how often they reference different categories of domains such as blogs, shops, wikis, and other news sites. This reveals whether AI-generated articles have different referencing patterns than their human counterparts.
Human-written news content tends to cite news rather than wikis. It also has a higher tendency to cite shops.
Human-written news articles that talk about AI tend to cite shops and news more often. Non-AI-topic articles tend to cite more blogs in comparison.
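The per-group comparison of referencing patterns can be illustrated with a small sketch. It assumes every reference has already been reduced to a pair of (article group, category of the cited domain). The function name `category_shares`, the group labels, and the sample data are hypothetical and only serve the example.

```python
from collections import Counter, defaultdict


def category_shares(references):
    """Compute, per article group, the share of references going to each
    category of cited domain.

    references: iterable of (group, cited_category) pairs,
                e.g. ("ai", "wiki") or ("human", "news")
    Returns {group: {category: share of that group's references}}.
    """
    counts = defaultdict(Counter)
    for group, category in references:
        counts[group][category] += 1

    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = {cat: n / total for cat, n in counter.items()}
    return shares


# Toy data, invented purely for illustration.
references = [
    ("human", "news"), ("human", "news"), ("human", "shop"), ("human", "wiki"),
    ("ai", "wiki"), ("ai", "wiki"), ("ai", "news"), ("ai", "blog"),
]
shares = category_shares(references)
```

Normalizing within each group matters because the two groups differ in size, so raw reference counts would mostly reflect how many articles each group contains.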
These are the lemmas (words reduced to their base form, e.g. "running"
becomes "run") used differently by AI-topic and non-AI-topic sites. Some
words used by AI-topic sites are expected, like AI, business, technology, model, tool, and user.
Blogs have the most buzzwords, followed by news and wikis. Shops have the lowest usage.
We can clearly see that wikis are the most cited, followed by news, shops, and, lastly, blogs. When it comes to self-referencing, wiki sites overwhelmingly cite themselves. This differs from the other types, which send a larger share of their citations to wiki sites than to themselves.
Here we can see what this snapshot of the web is composed of. We tried to have an even distribution when crawling, so this should roughly represent a small model of the larger internet. The majority of sites are human-written, which is expected given the large backlog of older sites that were never updated or simply stuck with being human-edited. However, the 10% AI-generated block is a concerning sign. Since AI-generated content can only have appeared in the last few years, this may indicate that a large portion of new websites are AI-written. Note that "Other" refers to sites we couldn't confidently assign to a single category like blog, news, or shop.