AI Influence

Which words differ most in usage between AI and non-AI content?

In these word clouds, we visualized the 500 most frequently used words overall and identified the words that are used more by AI or by human content, respectively. Note that these are not the most frequently used words in either type of content; they are the words where the difference between the two is largest.
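
The selection described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for the word clouds: it assumes "difference" means the gap in relative frequency (word count divided by total tokens per group), and the function names and toy token lists are hypothetical.

```python
from collections import Counter


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocab word within one group of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in vocab}
    return {w: counts[w] / total for w in vocab}


def distinguishing_words(ai_tokens, human_tokens, top_n=500):
    """Split the top_n most frequent words overall by which group uses them more.

    Assumption: the difference is measured as AI relative frequency minus
    human relative frequency. The source does not state the exact metric.
    """
    overall = Counter(ai_tokens) + Counter(human_tokens)
    vocab = [w for w, _ in overall.most_common(top_n)]
    ai = relative_freqs(ai_tokens, vocab)
    human = relative_freqs(human_tokens, vocab)
    diff = {w: ai[w] - human[w] for w in vocab}
    # Largest positive gap first for AI, largest negative gap first for human.
    ai_words = sorted((w for w in vocab if diff[w] > 0), key=lambda w: -diff[w])
    human_words = sorted((w for w in vocab if diff[w] < 0), key=lambda w: diff[w])
    return ai_words, human_words


# Toy data, purely illustrative.
ai_tokens = "the the the ensure provide cat".split()
human_tokens = "the cat cat dog dog dog".split()
ai_words, human_words = distinguishing_words(ai_tokens, human_tokens)
```

Using relative rather than absolute counts keeps the comparison fair when one group contains far more text than the other.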

In the following word cloud, we can see that AI content uses more stop words like the, of, and that. Furthermore, words like experience, enhance, ensure, and provide are more common in AI-written text. There is also a notable increase in punctuation, likely because most people are pretty bad at using enough of it. This includes us. Except Emil, he's perfect.

Word cloud

How much do AI domains cite other AI domains, and how do non-AI domains compare?

This heatmap visualizes citation patterns between domains based on a chosen variable. Both axes can be configured to show different variables such as AI content, AI topic, or site type. Each axis represents the percentage of that variable on a domain, ranging from 0% (none of the domain's content matches the variable) to 100% (all of it does). The x-axis represents the source domains and the y-axis represents the cited domains. Each cell then shows how strongly domains at one level cite domains at another, revealing whether, for example, high-AI domains preferentially cite other high-AI domains.
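
As a rough sketch of how such a heatmap could be built, the snippet below bins each domain by its share of a variable and counts citations between bins. The number of bins, the data layout, and all names (`domain_share`, `citations`, the example domains) are assumptions made for illustration; the source does not describe the actual implementation.

```python
def bin_index(share, n_bins=10):
    """Map a share in [0, 1] to a bin; a share of exactly 1.0 lands in the last bin."""
    return min(int(share * n_bins), n_bins - 1)


def citation_matrix(domain_share, citations, n_bins=10):
    """Count citations between bins.

    matrix[y][x]: y is the bin of the cited domain, x the bin of the source
    domain, matching the axis layout described in the text.
    Citations involving domains without a known share are skipped.
    """
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for source, cited in citations:
        if source not in domain_share or cited not in domain_share:
            continue
        x = bin_index(domain_share[source], n_bins)
        y = bin_index(domain_share[cited], n_bins)
        matrix[y][x] += 1
    return matrix


def share_of_source_bin(matrix, x, y):
    """Fraction of all references from source bin x that go to cited bin y."""
    total = sum(row[x] for row in matrix)
    return matrix[y][x] / total if total else 0.0


# Hypothetical domains with their share of AI-written content.
domain_share = {"a.example": 0.0, "b.example": 0.05, "c.example": 0.95, "d.example": 1.0}
citations = [
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("c.example", "d.example"),
    ("d.example", "c.example"),
    ("b.example", "unknown.example"),  # skipped: no share known
]
matrix = citation_matrix(domain_share, citations)
```

Normalizing each column by its total, as `share_of_source_bin` does, is one way to compare source groups of very different sizes.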

As we can see, domains that are barely or not at all AI-written preferentially cite other non-AI-written domains. This seems to be primarily due to volume, though, as there are around 450,000 references to high-AI domains from low-AI domains, compared to the mere 35,000 high-AI domains we have. Notably, however, a whopping 40% of references in AI-written domains are to other AI-written domains.

Most other categories fare similarly, with approximately comparable numbers, except for blogs, where nearly 5% of low-AI blogs link to high-AI domains.

How much do AI-generated news articles refer to specific categories of sites compared to non-AI articles?

To understand how AI influences linking behavior, we took all domains classified as news sites and split them into AI-generated and human-written articles. For each group, we then measured how often they reference different categories of domains such as blogs, shops, wikis, and other news sites. This reveals whether AI-generated articles have different referencing patterns than their human counterparts.
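
The comparison described above could be computed along these lines. This is only a sketch: the article records, the `is_ai` flag, and the `cited_categories` field are hypothetical stand-ins for whatever the crawl data actually looks like.

```python
from collections import Counter


def reference_shares(articles):
    """Per group (AI vs. human), the share of references going to each site category."""
    counts = {"ai": Counter(), "human": Counter()}
    for article in articles:
        group = "ai" if article["is_ai"] else "human"
        counts[group].update(article["cited_categories"])
    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        # Leave the group empty if it has no references at all.
        shares[group] = {cat: n / total for cat, n in counter.items()} if total else {}
    return shares


# Toy news articles, purely illustrative.
articles = [
    {"is_ai": True, "cited_categories": ["wiki", "wiki", "news", "blog"]},
    {"is_ai": False, "cited_categories": ["news", "news", "shop", "wiki"]},
]
shares = reference_shares(articles)
```

Comparing shares rather than raw counts means the two groups can be read side by side even if one contains many more articles.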

Compared with AI-generated articles, human-written news content tends to cite news sites more and wikis less. It also has a higher tendency to cite shops.

AI Topic Terminology

What words most strongly distinguish AI topic pages from non-AI topic pages?

These are the lemmas (words reduced to their base form, e.g. "running" becomes "run") used differently by AI-topic and non-AI-topic sites. Some words used by AI-topic sites are expected, like AI, business, technology, model, tool, and user.

Which types of sites use how many AI buzzwords?

Blogs have the most buzzwords, followed by news and wikis. Shops have the lowest usage.

Metadata

How much do wikis get cited over other sources?

We can clearly see that wikis are the most cited, followed by news, shops, and lastly blogs. When it comes to self-referencing, wiki sites overwhelmingly cite themselves. This differs from the other site types, which send a bigger portion of their citations to wiki sites than to themselves.

What is the composition in terms of AI generation, AI topic, and site type?

Here we can see what this snapshot of the web is composed of. We tried to have an even distribution when crawling, so this should roughly represent a small model of the larger internet. The majority of sites are human-written, which is expected given the large backlog of older sites that were never updated or simply stuck with being human-edited. However, the 10% AI-generated block is a concerning sign. Since AI-generated content can only have appeared in the last few years, this may indicate that a large portion of new websites are AI-written. Note that "Other" refers to sites we couldn't confidently assign to a single category like blog, news, or shop.