Enrique Dans

Summary

The article discusses the legal and practical complexities of using web scraping to obtain data for training machine learning algorithms, highlighting the implications for copyright, the potential for industry monopolization, and the ethical considerations of using creative works without permission.

Abstract

The article delves into the intricate process of sourcing data for machine learning, particularly focusing on the practice of web scraping. Before the introduction of Dall·E, a pioneering image-generative algorithm, companies engaged in web scraping with little regulation, collecting vast datasets of images and text. Legal precedents on the matter are mixed, with cases like LinkedIn's failed attempts to prevent data scraping and Facebook's success against Power Ventures, illustrating the lack of clarity. The legality of web scraping is generally acknowledged, yet its uses vary from reasonable to controversial, as seen with Clearview's condemned activities. Companies such as OpenAI have used copyrighted images, like those from Getty Images, for training algorithms, raising concerns about intellectual property rights. The issue becomes more pronounced as algorithms can now generate content that closely mimics the style of specific artists or authors, leading to potential copyright infringement and ethical dilemmas regarding the need for permission and compensation from original creators. The article suggests that the creations of algorithms, while not human, are still tools managed by humans, implying that the algorithm itself does not hold copyright. The debate extends to the future of the algorithm industry, questioning whether it will be dominated by a few powerful entities or become more democratized, with the understanding that reliance on Big Tech for data could lead to an imbalance of power and innovation.

Opinions

The author implies that the current state of web scraping for algorithm training is akin to a "Far West" scenario, lacking clear legislation or boundaries.
There is an underlying opinion that web scraping, while legal, can be misused, and its ethical implications are a subject of ongoing debate.
The article suggests that the ability to train algorithms should not be limited to companies with access to large databases or those capable of striking deals with image and text repositories, to prevent industry monopolization.
The author seems to advocate for a more equitable ecosystem where companies can train algorithms using their own generated data, which could lead to better, more secure, and less centralized data practices.
The piece conveys a concern that the current trajectory could lead to the dominance of Big Tech in the algorithm training space, potentially stifling competition and innovation.

What are we going to use to train algorithms with?

IMAGE: Alexandra Koch — Pixabay

Obtaining data to train machine learning algorithms is a complex process, and explaining the factors involved is intricate but fascinating. Before Dall·E, one of the first widely known text-to-image algorithms, launched in January 2021, the companies developing this kind of algorithm basically did what they wanted, in a sort of Far West environment with no apparent legislation or frontiers.

Given that web scraping is, in principle, a legal practice (anyone can copy the content on publicly accessible pages), they harvested huge collections of tagged images and texts that they considered reasonably correct and fed them into the databases they needed to train their products. The precedents were confusing: LinkedIn had lost several cases trying to stop other companies from scraping its network data, but Facebook had won against Power Ventures, while Clearview’s activities prompted condemnation. Nevertheless, the idea, although subject to judges’ interpretation, was that web scraping was a tool, not a crime, and as with any tool, there were reasonable and unreasonable uses.
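One common way of separating reasonable from unreasonable scraping is to honor a site's robots.txt file before fetching anything. The following is a minimal sketch of that check using Python's standard library. The robots.txt content, the `example-bot` user agent, the `example.com` URLs and the `may_fetch` helper are all hypothetical, and robots.txt is a voluntary convention, not a legal test of whether scraping is permitted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical crawler name and URLs, for illustration only.
allowed = may_fetch(ROBOTS_TXT, "example-bot", "https://example.com/gallery/cat.jpg")
blocked = may_fetch(ROBOTS_TXT, "example-bot", "https://example.com/private/cat.jpg")
print(allowed, blocked)
```

A real crawler would download each site's robots.txt instead of using an inlined string, and would still have to consider terms of use and copyright, which this check does not address.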

Then, companies like OpenAI and others scraped image banks like Getty Images, getting their hands on millions of tagged images. All of them carried a “Getty Images” watermark that could only be removed by paying to use the photo, but it didn’t matter: the image was sufficiently visible, and its tags allowed the algorithm to interpret it.

The issue began to attract attention when users of Dall·E and other algorithms, such as Stable Diffusion or Midjourney, began experimenting with prompts asking for images “in the style of”. The thing seemed like magic: if your prompt asked for an image in the style of a certain artist, the algorithm would draw on the images it had of that person, and often the result was so good that it looked like the real thing. In many cases, some algorithms even reproduced Getty Images watermarks: they had been trained with so many images bearing them that they interpreted the watermark as an element that should appear in their creations. It got worse with texts: the latest algorithms, such as Claude, can ingest entire books in seconds and immediately switch to writing as the author would, potentially irritating a large number of authors who want to be asked for permission and adequately compensated.

To complicate matters further, an issue that always complicates everything arose: copyright. In principle, only human creations are protected by copyright. The case of the monkey selfie, in which the judge ruled that there was no copyright protection since the photo was taken by the animal itself, was extended to algorithms: an algorithm is not human, and therefore, its creations should be exempt from copyright.

That said… who says the algorithm created the image? Surely it is just a tool that we use. People who don’t know how to use these tools start out creating poor-quality images: handling the algorithm, writing a suitable prompt and managing all the interpretations that the algorithm makes is no easy task. Seen like this, just as the author of this article is not the computer I used to write it, the algorithm used to write an article or create an image isn’t the owner of the work it produces. It’s just a machine managed by us.

The question is complex, but goes beyond the hypothetical: it is the basis of what we will be able to do or not do with algorithms, and above all, of the industry that will be generated around them. If the only companies capable of training algorithms are those that can close deals with image or news banks to supplement the already famous LAION, we’re headed for a repeat of what has happened with social networks, with a few powerful players abusing their position. If, on the other hand, we make it easy for creations to be used to train algorithms, we will be opening the door for anyone to train them and, potentially, to a less concentrated, more diverse environment… but there will be problems with the owners of those images and texts, or with the agencies that represent them.

A better solution is for everyone to train their algorithms with what they can: each company, with the data generated by their activity and transactions. Limited, vertical, but potentially much better, and without compromising the security of that data. Companies that know how to turn their activity into a pipeline that constantly generates data will be able to train their own algorithms and depend less on Big Tech. But that will mean understanding the alternatives and the consequences of not doing so. We’ll see if we understand this in time, or if we’re doomed to the relentless dominance of Big Tech.

