Andrew Zuo

Summary

The article discusses the unintended consequences of AI data scraping, leading to restrictive policies on platforms like Twitter and Reddit, and potentially transforming the internet into a more paywall-dominated landscape.

Abstract

The author reflects on the recent changes to Twitter and Reddit's policies, which limit data access in response to excessive AI data scraping. These platforms are grappling with the challenges posed by AI tools that consume vast amounts of data, necessitating measures like login requirements and API usage restrictions. The article suggests that the scale of data scraping is more extensive than previously thought, with potential repercussions for the accessibility of information online. As sites increasingly adopt paywalls to protect their content from scraping, the internet may evolve into a more controlled environment where access to information is predominantly paid and centralized among a few major companies.

Opinions

  • The author initially underestimated the extent of AI data scraping but now recognizes its significant impact on social media platforms and the broader internet.
  • Twitter's new policy of limiting tweet visibility is seen as a drastic measure to combat data scraping, with Elon Musk's response being somewhat dismissive.
  • Reddit's API pricing changes are also viewed as a response to data scraping, although the author believes the user backlash may be exaggerated.
  • The author criticizes Reddit CEO Steve Huffman's handling of the situation, particularly his limited engagement during an AMA.
  • The article posits that the amount of data required to train AI models, such as ChatGPT, is enormous, contributing to the widespread scraping of online content.
  • The author predicts that the proliferation of paywalls in response to data scraping will make free access to information more difficult, potentially leading to increased reliance on news aggregators.
  • The author suggests that the internet may be heading towards a 'post-AI' era characterized by paid content and control by a few large entities.
  • The author encourages readers to follow their publications.
Photo by Nicolas Picard on Unsplash

Twitter And Reddit Are Just Symptoms Of A Much Larger Problem

All the way back in April I saw this post:

And I wanted to write a post about it. About how AI is causing unintended problems. But I didn’t really see a good angle.

But now I’m kicking myself for not posting anything about it. Because this post is probably the first indicator of what was coming.

The APIocalypse

So Twitter has just announced that you have to be logged in to see Tweets and even if you are logged in there’s a limit to how many Tweets you can see.

Verified users can see 6,000 posts a day and unverified users 300 or 600, depending on account status. This is the justification: ‘To address extreme levels of data scraping & system manipulation, we’ve applied the following temporary limits’. I would link the Tweet that announces this but some people may not be able to see it. And this policy has resulted in many tweets not being indexed in Google. (Update: Twitter has since raised these limits slightly, but not by a significant amount.)

When questioned on the policy Elon Musk, the Prigozhin of the west, responded, “Touch grass again.” Which I guess isn’t that bad of an answer. But it sort of goes against the whole ‘addictive social media’ angle. That’s why companies insist on infinite scrolling lists, because they don’t want you to leave. Welcome to Social Media, where you can check in but never check out.

And at the same time something very similar is going on at Reddit. This one is much larger because, unlike Twitter, people actually care about Reddit.

Reddit does not appear to be going as far just yet, though. They are only limiting the use of the API; you can still go to the website and browse normally. Honestly, I think the revolt against these policies is a bit overblown.

I do respect subreddits’ right to protest, but some of these comments… they sound like they came out of the rioters in France.

Not to say I approve of Steve Huffman. Honestly I think a lot of his decisions made this whole thing much worse than it had to be. Like running an AMA where you only answer like — what — 6 questions? What did you think would happen? Thankfully they stopped giving interviews to The Verge. Too little too late though.

Whatever the case, it looks like Reddit and Twitter have gone down similar roads. And for similar reasons: the scraping of their content for training AI.

AI

The reason why AI needs so much information is that it’s not like a normal program where you code what the computer does. No, in AI you code how the AI learns and then the AI actually does things. And the way it learns is that you give it a lot of training data.

So suppose we want to develop an AI model that can identify handwritten digits, like those found in zip codes on envelopes. Instead of explicitly programming the AI to recognize each digit, we can utilize a machine learning approach.

To train the AI model, we gather a large dataset of handwritten digit images, where each image is labelled with the corresponding digit it represents. This dataset serves as the training data for our AI model.

Initially the AI does not know which image corresponds to which number. But then we feed it all of our data. Every time the AI responds incorrectly we give it some negative reinforcement (as depicted in this diagram): its internal parameters are nudged so that the same mistake becomes less likely. Eventually, after looking at enough training examples, it will figure out which handwritten digits correspond to which numbers.
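To make the idea concrete, here is a minimal sketch in Python of that kind of learn-from-mistakes loop. Everything in it is an assumption for illustration: the 3x5 pixel templates stand in for real scanned handwriting, the random pixel flips stand in for messy penmanship, and the learning rule is a simple multi-class perceptron rather than the neural networks real systems use. The "negative reinforcement" is the update step, which only runs when the guess is wrong.

```python
import random

# Hypothetical 3x5 pixel templates for the digits 0-9 (row by row, 1 = ink).
# These are illustrative stand-ins for real handwriting scans.
TEMPLATES = {
    0: "111101101101111",
    1: "010110010010111",
    2: "111001111100111",
    3: "111001111001111",
    4: "101101111001001",
    5: "111100111001111",
    6: "111100111101111",
    7: "111001001001001",
    8: "111101111101111",
    9: "111101111001111",
}


def noisy_sample(digit, rng, flip_prob=0.02):
    """Copy a digit's template, flipping each pixel with probability flip_prob."""
    pixels = [int(c) for c in TEMPLATES[digit]]
    return [p ^ 1 if rng.random() < flip_prob else p for p in pixels]


def make_dataset(n_per_digit, rng):
    """Build a shuffled list of (pixels, label) pairs, n_per_digit per digit."""
    data = [(noisy_sample(d, rng), d) for d in range(10) for _ in range(n_per_digit)]
    rng.shuffle(data)
    return data


def predict(weights, pixels):
    """Score every digit class and return the highest-scoring one."""
    x = pixels + [1]  # the trailing 1 acts as a bias input
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(10), key=scores.__getitem__)


def train(data, epochs=20):
    """Multi-class perceptron: weights change only when the guess is wrong."""
    weights = [[0.0] * 16 for _ in range(10)]  # 15 pixels + 1 bias per class
    for _ in range(epochs):
        for pixels, label in data:
            guess = predict(weights, pixels)
            if guess != label:
                x = pixels + [1]
                for i, xi in enumerate(x):
                    weights[label][i] += xi  # pull the right answer closer
                    weights[guess][i] -= xi  # push the wrong answer away
    return weights


def accuracy(weights, data):
    """Fraction of examples in data that are classified correctly."""
    return sum(predict(weights, px) == y for px, y in data) / len(data)


rng = random.Random(0)
train_set = make_dataset(50, rng)
test_set = make_dataset(20, rng)

untrained_accuracy = accuracy([[0.0] * 16 for _ in range(10)], test_set)
trained_accuracy = accuracy(train(train_set), test_set)
print("before training:", untrained_accuracy)
print("after training:", trained_accuracy)
```

Note that nothing in the code says what a 7 looks like. The program only describes how to learn, and the 500 labelled examples do the rest, which is why the data matters so much.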

However, this process requires a lot of data, far more than a human brain would need, because artificial neural networks are not like the brain at all. They should be called ‘artificial pachinko boards’.

And this is for a simple problem: identifying numbers. Imagine how much training data ChatGPT needs to be able to answer any question given to it. It’s a lot and it’s why AI companies are scraping all the data they can.

Now originally I thought that there wasn’t that much scraping going on. It’s why I didn’t want to post an article on the scraping problem. Like how many AI startups can there possibly be? And search engines like Google were the OG scrapers. But with Reddit and now Twitter I may have to reconsider. Maybe the amount of data scraping going on is much larger than I originally suspected.

And if that’s the case this problem may be getting worse, not better. Because Reddit and Twitter aren’t the only sources of data. There are thousands of news sites all over the internet. And now they’re all going to be scraped. Of course this is a solved problem: just put up a paywall. But many sites choose not to use a paywall, and I don’t know about you, but if I see a paywall I’m like

But with AI this is going to become the lesser of two evils.

We have become accustomed to getting information for free. That’s the entire premise of search engines like Google. They scrape every single website and convert them into a giant graph of data. Then, using intelligent algorithms, that data is ranked and sorted to give you results. Well, what happens when every site is now using a paywall?
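For a rough picture of what "a giant graph" plus "ranked and sorted" can mean, here is a small sketch in Python. It uses PageRank purely as a well-known example of link-based ranking; the article does not say which algorithms any real search engine runs today, and the site names and link structure below are made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages in a link graph with plain power-iteration PageRank.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of the total rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on, split evenly across its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-site web: the graph a crawler might build after scraping.
web = {
    "blog.example": ["news.example", "forum.example"],
    "news.example": ["forum.example"],
    "forum.example": ["news.example"],
    "wiki.example": ["news.example"],
}

ranks = pagerank(web)
ordered = sorted(ranks, key=ranks.get, reverse=True)
for site in ordered:
    print(site, round(ranks[site], 3))
```

The ranking depends entirely on being able to read every page and follow its links, which is exactly the access that login walls and paywalls take away.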

It won’t kill search engines but this wave of paywalls will make information a lot harder to find. And when you click through to the site you’ll be greeted by a paywall. Then news aggregators like Apple News and Google News will become a lot more popular because no one has time to pay for a dozen paywalls.

And as a result we may just find ourselves in a new internet. A new post-AI internet. Where everything is paid and controlled by a few companies.

If you liked this article consider following me on one of my publications: Lost But Coding (for programming content) or The Rest Of The Story (for everything else). You can do so with my RSS reader available on iOS (and Apple Silicon Macs) and Android.

AI
Reddit
Twitter
Scraping
Internet