Major sites say no to Apple’s AI data scraping

In a separate analysis this week, data journalist Ben Welsh found that just over a quarter of the news sites he surveyed (294 of 1,167 primarily English-language publications based in the United States) are blocking Applebot-Extended. By comparison, 53% of the sites in his sample block OpenAI’s bot, and nearly 43% block Google-Extended, the AI-specific agent Google introduced last September. Applebot-Extended’s much lower block rate suggests it may still be flying under the radar, though as Welsh told WIRED, the number has “gradually increased” since he started looking.

Welsh has launched a project to monitor how media outlets are approaching major AI bots. “There’s been a bit of a divide among news publishers about whether they want to block these bots,” he says. “I don’t have the answer to why each outlet has made that decision. Of course, we can read that many of them are entering into licensing agreements, where they’re getting paid in exchange for permission to use the bots; that may be a factor.”
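To make concrete how a survey like Welsh’s can work: robots.txt is machine-readable, so checking a site’s policy for a given bot takes only a few lines. Below is a minimal sketch using Python’s standard library; the bot names are the documented user-agent tokens, but the site list and output format are illustrative, not Welsh’s actual methodology.

```python
# Sketch: check which AI bots a site's robots.txt blocks,
# using only the Python standard library.
from urllib.robotparser import RobotFileParser

BOTS = ["Applebot-Extended", "GPTBot", "Google-Extended"]
SITES = ["https://example.com"]  # a real survey would load ~1,000 outlet URLs

for site in SITES:
    parser = RobotFileParser(site.rstrip("/") + "/robots.txt")
    parser.read()  # fetches and parses the site's robots.txt
    for bot in BOTS:
        # can_fetch() applies the file's rules for the named user agent
        blocked = not parser.can_fetch(bot, site + "/")
        print(f"{site}: {bot} is {'blocked' if blocked else 'allowed'}")
```

Run across a large sample of publisher homepages on a schedule, a script along these lines yields exactly the kind of block-rate percentages cited above.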

Last year, The New York Times reported that Apple was trying to strike AI deals with publishers. Since then, competitors like OpenAI and Perplexity have announced partnerships with media outlets, social platforms, and other popular websites. “A lot of the biggest publishers in the world are clearly taking a strategic approach,” says Jon Gillham, founder of Originality AI. “I think in some cases it’s a business strategy, like withholding data until a partnership deal is done.”

There is some evidence to support Gillham’s theory. Condé Nast websites, for example, used to block OpenAI’s web crawlers; after the publisher announced a partnership with OpenAI last week, it unblocked them. (Condé Nast declined to comment for this article.) Meanwhile, BuzzFeed spokesperson Juliana Clifton told WIRED that the company, which also owns The Huffington Post and is currently blocking Applebot-Extended, puts every AI web crawler it can identify on its blocklist unless its owner has a partnership with the company, usually a paid one.

Because the robots.txt file must be edited by hand and new AI bots keep emerging, it can be difficult to keep a blocklist up to date. “People just don’t know what to block,” says Gavin King, founder of Dark Visitors. Dark Visitors offers a freemium service that automatically updates a client site’s robots.txt file, and King says publishers make up a large portion of his customers because of copyright concerns.
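For context on what such a blocklist actually looks like: a minimal robots.txt covering the three bots named in this article would read as follows. The user-agent tokens are the ones Apple, OpenAI, and Google document; a real publisher’s file often lists dozens more, which is exactly the maintenance burden King describes.

```
User-agent: Applebot-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Each Disallow: / rule tells the named agent it may not use any path on the site. A crawler not listed falls back to the file’s default rules, which is why an unknown new bot slips through until someone adds its token.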

The robots.txt file may seem like the obscure domain of webmasters, but given its outsized importance to digital publishers in the age of artificial intelligence, it has become the domain of media executives as well. WIRED has learned that two CEOs of major media companies directly decide which bots to block.

Some media outlets have explicitly stated that they block AI scraping tools because they don’t currently have a partnership with the tools’ owners. “We are blocking Applebot-Extended on all Vox Media properties, as we have done with many other AI scraping tools when we don’t have a commercial agreement with the other party,” said Lauren Starke, Vox Media’s senior vice president of communications. “We believe in protecting the value of our published work.”
