Earlier this month, there was a big fuss about OpenAI and how they gather data to train their chatbot, ChatGPT. People were criticizing OpenAI for taking content without permission, so OpenAI decided to give websites a way to block GPTBot, the web crawler that does the scraping. All the websites had to do was add a couple of lines to their “robots.txt” file, and OpenAI would kindly back off.
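For the curious, OpenAI’s documentation names the crawler GPTBot, and the opt-out it describes really is just a two-line entry in a site’s robots.txt (the lone “/” in the Disallow rule means the entire site is off limits):

User-agent: GPTBot
Disallow: /

A site that only wants to wall off part of its content can swap that “/” for a specific path instead.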
Well, let me tell you, a lot of sites have taken up this option. Companies like Amazon, Airbnb, Glassdoor, and Quora have all added the directive to their robots.txt files. They don’t want OpenAI snooping around their content anymore. But here’s the thing: adding a couple of lines to a text file won’t magically fix the problem. OpenAI’s crawlers have already been out there collecting data for a while. It’s like trying to close the barn door after the horse has already escaped.
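If you’re wondering whether a particular site has opted out, you don’t have to take anyone’s word for it: robots.txt files are public, and Python’s standard urllib.robotparser module can read them for you. Here’s a rough sketch; the site list is just illustrative, and a “blocks” result only means the file currently disallows a GPTBot-style user agent, not that OpenAI never saw the content.

# check_gptbot.py: a quick-and-dirty check of which sites disallow GPTBot in robots.txt
from urllib.robotparser import RobotFileParser

SITES = ["https://www.amazon.com", "https://www.quora.com", "https://www.example.com"]  # illustrative list

for site in SITES:
    parser = RobotFileParser()
    parser.set_url(site + "/robots.txt")   # robots.txt always lives at the site root
    try:
        parser.read()                      # fetch and parse the file
    except OSError as err:
        print(f"{site}: could not fetch robots.txt ({err})")
        continue
    allowed = parser.can_fetch("GPTBot", site + "/")  # would GPTBot be allowed to crawl the homepage?
    print(f"{site}: {'open to' if allowed else 'blocks'} GPTBot")

Keep in mind this is only a sketch: some sites serve different robots.txt content to scripts than to browsers, so your mileage may vary.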
And it gets even trickier. Other online resources that hold valuable data, like Ikea, Indeed.com, Kelley Blue Book, and BAILII, have also started blocking OpenAI. Even Stack Overflow, the coding Q&A site, has said no to the bot. But interestingly enough, GitHub, the big code-hosting platform, hasn’t blocked OpenAI. Maybe that has something to do with Microsoft, which owns GitHub and is a major investor in OpenAI.
This whole situation is a mess, my friends. Media companies are getting involved, negotiating with OpenAI or even suing it over its data practices. Reports suggest that around 70 of the top 1,000 websites have added the directive to block OpenAI’s crawler. And while it’s nice of OpenAI to say, “Hey, we won’t scrape your data if you don’t want us to,” it’s just not enough. They’ve already collected so much data, and they’re not giving it back.
But here’s where it gets interesting. By allowing websites to block their bot, OpenAI is setting a precedent. They can push for a norm where everyone honors these opt-outs going forward, which locks down the data their competitors would need while OpenAI keeps the benefits of being an early mover in the AI game. They’re like pioneers staking a claim. Ben Thompson, the tech commentator, even pointed out that OpenAI isn’t returning the data they’ve already collected.
And here’s another thing: web crawlers aren’t the only way AI companies gather data. They often use large datasets from third parties, and that has already caused legal battles with content owners. OpenAI, Meta, and Google are all using these datasets, and some authors are suing them for it. OpenAI hasn’t commented on whether it will still end up with data from blocked websites through these other channels, which leaves a lot of uncertainty.
So, my friends, that’s where things stand. OpenAI is giving websites the option to say no, but the data is already out there. It’s like trying to unscramble an egg. Google, with its rival chat tool Bard, wants to start a discussion on consent in AI, but it might be too late. The data has already been thrown into the digital blender, and it’s hard to see how anyone can undo that now.