What are we going to use to train algorithms?

Explaining the various factors involved in obtaining data to train machine learning algorithms is intricate, but fascinating. Before the launch in January 2021 of Dall·E, one of the first widely known image-generating algorithms, the companies developing these tools basically did what they wanted, in a sort of Wild West with no apparent laws or borders.
Given that web scraping is, in principle, a legal practice (anyone can copy the content of publicly accessible pages), they harvested huge collections of tagged images and texts that they considered reasonably accurate, and fed them into the datasets they needed to train their products. The precedents were confusing: LinkedIn had lost several cases trying to stop other companies from scraping its network data, but Facebook had won against Power Ventures, while Clearview’s activities prompted condemnation. Nevertheless, the prevailing idea, although subject to judges’ interpretation, was that web scraping was a tool, not a crime, and as with any tool, there were reasonable and unreasonable uses.
Then companies such as OpenAI turned to image banks like Getty Images, scraping millions of tagged images. All of them carried a “Getty Images” watermark that could only be removed by paying to use the photo, but it didn’t matter: the image was visible enough, and its tags allowed the algorithm to interpret it.
The issue began to attract attention when users of Dall·E and other algorithms, such as Stable Diffusion or Midjourney, began to experiment with prompts asking for images “in the style of”. It seemed like magic: if your prompt asked for an image in the style of a certain artist, the algorithm would draw on what it had learned from that artist’s images, and often the result was so good that it looked like the real thing. In addition, some algorithms often went so far as to reproduce Getty Images watermarks: they had been trained on so many images bearing them that they interpreted the watermark as an element that should appear in their creations. It got worse with texts: the latest algorithms, such as Claude, can ingest entire books in seconds and immediately switch to writing as the author would, potentially irritating a large number of authors who want to be asked for permission and adequately compensated.
To complicate matters further, an issue arose that always complicates everything: copyright. In principle, only human creations are protected by copyright. The reasoning in the monkey selfie case, in which the judge ruled that there was no copyright protection since the photo was taken by the animal itself, was extended to algorithms: an algorithm is not human, and therefore its creations should not be eligible for copyright protection.
That said… who says the algorithm created the image? Surely it is just a tool used by us. People who don’t know how to use these tools start out creating poor-quality images: handling the algorithm, writing a suitable prompt and managing all the interpretations the algorithm makes is no easy task. Seen like this, just as the author of this article is not the computer I used to write it, the algorithm used to write an article or create an image isn’t the author of the work it produces. It’s just a machine managed by us.
The question is complex, but goes beyond the hypothetical: it is the basis of what we will be able to do or not do with algorithms, and above all, of the industry that will be generated around them. If the only companies capable of training algorithms are those that can close deals with image or news banks to supplement the already famous LAION, we’re headed for a repeat of what has happened with social networks, with a few powerful players abusing their position. If, on the other hand, we make it easy for creations to be used to train algorithms, we will be opening the door for anyone to train them and, potentially, to a less concentrated, more diverse environment… but there will be problems with the owners of those images and texts, or with the agencies that represent them.
A better solution is for everyone to train their algorithms with what they have: each company, with the data generated by its own activity and transactions. The resulting algorithms would be limited and vertical, but potentially much better, and would not compromise the security of that data. Companies that know how to turn their activity into a pipeline that constantly generates data will be able to train their own algorithms and depend less on Big Tech. But that will mean understanding the alternatives, and the consequences of not pursuing them. We’ll see if we understand this in time, or if we’re doomed to the relentless dominance of Big Tech.