AI scraping is inevitable. Can publishers turn it into revenue?
AI bots are bypassing rules and indexing publisher content at scale. A new report reveals just how much—and who might benefit.

We all know AI is eating the internet, with bots scraping sites for content and giving nothing in return. That, of course, is the impetus behind the many lawsuits playing out between media companies and the big AI labs. But in the here and now, the question remains: What should publishers do about those bots?
Blocking them is an option, but how effective is it? And what types of content are most at risk of being scraped and substituted by AI answers? And can you actually get AI bots to pay up?
A good place to start finding answers is the most recent State of the Bots report from AI startup TollBit. For publishers feeling the heat of AI, it attaches real numbers to the presence of AI in the media ecosystem and to how quickly that presence is growing. And while the rise of AI bots is a worrisome trend for those in the content business, it may also be an opportunity.
Bots in disguise
In the interest of maximizing that opportunity, TollBit is doing more with this report than simply offering up charts and graphs. It’s also taking a stand, arguing that AI bots that crawl the internet should at the very least identify themselves to the sites they visit and scrape. The company is openly calling for regulation to force the issue. CEO Toshit Panigrahi told me as much back in June, after TollBit’s previous State of the Bots report showed that certain bots from the likes of Perplexity, Meta, and Google were openly ignoring the Robots Exclusion Protocol, the robots.txt standard that websites use to tell crawlers what they may and may not access.
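For anyone who hasn’t peeked under the hood lately, the Robots Exclusion Protocol is nothing more than a plain-text robots.txt file that a site publishes and that crawlers are expected to honor voluntarily. Here’s a minimal Python sketch, using the standard library’s urllib.robotparser and an entirely made-up robots.txt with hypothetical crawler names, that shows how thin that protection is:

```python
# A minimal sketch of how the Robots Exclusion Protocol works.
# The robots.txt content and crawler names are illustrative, not taken
# from TollBit's report or any particular publisher.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that announces itself and checks the file is told no.
print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/1"))  # False

# A crawler presenting any other name sees no restriction at all.
print(parser.can_fetch("SomeOtherAgent", "https://example.com/articles/1"))  # True
```

Nothing in the protocol compels a crawler to run that check in the first place, which is exactly the gap the report keeps running into.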
There is some nuance to how these bots present themselves. I won’t rehash the entire thing here, but briefly: Certain AI bots perform tasks on behalf of users (as opposed to training or search crawlers), and those are designated as user agents. That affords them a certain status, at least according to the AI companies: Because these bots are essentially human proxies, the companies argue, sites should treat them as humans, not bots. So they don’t identify themselves as bots.
What this does, at the very least, is make it very hard to tell real human traffic (a person navigating to a website and looking at a screen) from a robot doing the same thing. That makes it very difficult to get accurate data about bot traffic, and TollBit predicts that the amount of “human” traffic will probably rebound once user agents become more common, but only because trackers won’t be able to tell the difference between them and actual people.
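To make the measurement problem concrete: the only thing separating a declared bot from “human” traffic is a self-reported User-Agent header. The sketch below, written in Python with the requests library and using made-up header strings, sends the same request twice; everything the site and its trackers see is identical except the label the client chose to attach:

```python
# A sketch of why trackers struggle to separate user agents from people:
# identification rests entirely on a self-reported header.
# The URL and User-Agent strings below are illustrative.
import requests

URL = "https://example.com/some-article"

# Request 1: a bot that announces itself.
declared = requests.get(URL, headers={"User-Agent": "ExampleAIBot/1.0"})

# Request 2: the same client, now presenting a mainstream browser string.
disguised = requests.get(URL, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
})

# The two requests differ only in that header, so any analytics keyed on it
# will count the second one as a person looking at a screen.
print(declared.request.headers["User-Agent"])
print(disguised.request.headers["User-Agent"])
```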
You can see the impetus to get bots to self-identify, but let’s assume that doesn’t happen, and a significant amount of traffic falls into this gray area: seemingly human, but not behaving like a human audience. Those ersatz humans won’t ever interact with advertising, and once that becomes evident, it will cheapen the value of advertising on the web overall. We may never technically reach Google Zero, but margins will be stripped so low that Google 30 might look more like Google 10.
The content AI craves
Something else the TollBit report reveals, though, is what kind of content appears to be of greatest interest to the AI crawlers, or rather, the people using AI engines for discovery. While the data isn’t definitive, it’s fair to conclude that if a particular category of content is being scraped more often, there are more people sending AI crawlers and user agents to get it. That, in turn, might help guide content strategy.
By far the No. 1 category being scraped is B2B content, followed by parenting, sports, and consumer tech. Parenting, in fact, saw a big increase this past quarter, meaning more people are turning to AI portals for answers about parenting issues. If you produce content for parents (and this applies to any category that’s highly crawled by AI), you should consider a few things:
- Your content is at high risk of substitution by AI answers.
- That means it’s valuable to AI companies.
- You can point to the data as leverage in licensing negotiations (or a lawsuit).
It sounds simple, but getting a major AI provider to license your content isn’t something just any site can pull off. OpenAI, by far the most prolific deal-maker, has signed only a few dozen agreements. And lawsuits are costly.
If you’re a parenting site, you’re not just going to stop doing parenting content, so you have a choice: block the bots, or let them crawl to ensure your presence in AI answers. While the referral traffic remains negligible (we’re effectively already at “ChatGPT Zero”), there are intangibles, mostly brand presence, that come with being in an AI answer.
You can’t build a business on intangibles, though, and that leaves the other option: blocking—or rather, redirecting bots to a paywall. TollBit’s data does show that more bots than before are being successfully redirected to “forbidden” pages or hitting the company’s own paywalls.
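For publishers going that route, the mechanics of a redirect are mundane: the server looks at the User-Agent header and sends known crawler strings somewhere other than the article. The sketch below uses Python’s built-in wsgiref server and an illustrative list of bot tokens; it isn’t TollBit’s implementation, just the general shape of a redirect-to-paywall setup:

```python
# A minimal sketch of redirecting declared bots to a paywall page.
# The bot tokens and the /paywall path are illustrative; this is not
# TollBit's system, just the general pattern.
from wsgiref.simple_server import make_server

BOT_TOKENS = ("ExampleAIBot", "GPTBot", "CCBot")  # illustrative list

def app(environ, start_response):
    user_agent = environ.get("HTTP_USER_AGENT", "")
    path = environ.get("PATH_INFO", "/")

    # Declared crawlers asking for articles get bounced to the paywall.
    if path != "/paywall" and any(token in user_agent for token in BOT_TOKENS):
        start_response("302 Found", [("Location", "/paywall")])
        return [b""]

    if path == "/paywall":
        start_response("402 Payment Required", [("Content-Type", "text/plain")])
        return [b"Licensed access required for automated crawlers.\n"]

    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Regular article content.\n"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```

The obvious catch is that anything refusing to self-identify sails right through, which brings us to the bigger problem.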
The illusion of control
The key question, though, is one the report doesn’t answer: How many of those bots are actually paying up? The absence of an answer suggests the number is quite low, and that’s because it’s simply too easy to access the content another way. As the report describes, AI companies have sophisticated ways to use relays, third-party systems, and different species of bots to scrape content. And the “gray” status of consumer browser agents makes things even murkier. The ways around a block are myriad.
That’s ultimately why TollBit has taken its stance that bots should be required to self-identify, backed by legal teeth. It’s hard to imagine AI companies self-regulating in the interest of another industry—in this case, the media—without some kind of regulatory pressure. Otherwise, we can look forward to something else: a lot more paywalls on parenting sites.