OpenAI quietly released a new web crawler this week, aptly named GPTBot. Its purpose is to scan websites and gather data to improve the company’s suite of large language models (LLMs), headlined by the widely recognized ChatGPT. A swift backlash followed from website owners and creators, who moved to shield their sites from GPTBot’s reach.
OpenAI’s foray into web crawling was revealed alongside a corresponding support page, equipped with guidance on how websites could effectively shield themselves from GPTBot’s data-scavenging escapades. A mere adjustment to the “robots.txt” file on a website would ostensibly render it inaccessible to the prying eyes of GPTBot. This measure seemed pivotal in empowering content creators to control access to their online content. Yet, a profound uncertainty lingered – could obstructing GPTBot truly guarantee that website content wouldn’t become fodder for the training data of LLMs?
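To make the “robots.txt” adjustment concrete, here is a minimal sketch in Python that uses the standard library’s robots.txt parser to check what a blocking rule does. The two-line rule targets the “GPTBot” user agent; the helper name is_allowed, the example.com URLs, and the “SomeOtherBot” agent are placeholders invented for illustration. Note that robots.txt is advisory: this only shows how a compliant crawler would read the rule, not that any crawler is technically prevented from fetching the page.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt rule that asks the GPTBot user agent to stay off the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url.

    Hypothetical helper for illustration; it parses the robots.txt text
    locally instead of downloading it from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

Because the rule names only GPTBot, other crawlers are unaffected by it, which is one reason blocking a single bot does not settle whether content reaches training data through other routes.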
An OpenAI spokesperson emphasized that the organization periodically collects public data from the internet, underlining its intention to refine future models for heightened capabilities, precision, and safety. GPTBot’s crawling is not indiscriminate; the harvested content undergoes a filtration process that weeds out sources hiding behind paywalls, as well as those that collect personally identifiable information (PII) or host text that runs afoul of OpenAI’s policies.
The reaction among web proprietors was rapid and impactful. Online publications like The Verge were swift to adopt protective measures, inserting the all-important “robots.txt” directive to halt GPTBot’s encroachments. Notable voices in the digital landscape, such as Casey Newton of the Platformer newsletter and Neil Clarke, editor of Clarkesworld sci-fi magazine, made their sentiments clear by choosing to block GPTBot from their domains.
Amidst the furor over GPTBot’s launch, OpenAI announced a partnership with New York University’s Arthur L. Carter Journalism Institute. The partnership, backed by a $395,000 grant, is championed by former Reuters editor-in-chief Stephen Adler. The program, christened the Ethics and Journalism Initiative, is intended to address the ethical dilemmas that may arise as AI is integrated into journalism.
Tom Rubin, OpenAI’s Chief of Intellectual Property and Content, expressed enthusiasm for the venture, while conspicuously sidestepping the contentious topic of web scraping. The robots.txt option arguably gives the public more influence over how its content is distributed on the internet, but the central question is unresolved: it remains to be seen whether blocking GPTBot can entirely insulate content from being absorbed into LLM training data.
The landscape of data utilization is intricate, particularly concerning the sprawling collections of public data used to train LLMs. Google’s Colossal Clean Crawled Corpus (C4) dataset and the nonprofit initiative Common Crawl are emblematic examples of such data repositories. Where content has already been integrated into these vast datasets, experts contend that its imprint on the training data is effectively indelible.
As demonstrated by the legal tussle involving web scraping and the Computer Fraud and Abuse Act (CFAA), scraping publicly accessible data remains a lawful activity, a notion reinforced by the U.S. Ninth Circuit Court of Appeals. However, the practice has met mounting resistance in recent times, particularly in the context of AI training data. OpenAI faced legal action last year, stemming from allegations of unauthorized copying and privacy violations.
The tides of change are evident across prominent platforms like X and Reddit, both responding to data scraping concerns with alterations to accessibility and pricing. The adversarial relationship between AI data gathering and content creators continues to evolve, fueling debates around privacy, ethics, and the boundaries of digital engagement.
In this age of dynamic digital transformation, OpenAI’s GPTBot and its implications stand at the crossroads of innovation, ethics, and legal boundaries. The path forward necessitates a harmonious equilibrium between technological progress and safeguarding the rights and intellectual property of digital creators.