Summary

Training a large language model involves a complex and time-consuming process of data curation to ensure the model's outputs align with societal norms and values, akin to educating a child.

Abstract

The process of training a large language model is likened to the upbringing of a child, requiring careful selection and cleaning of data to instill societal norms and values. This involves collecting vast amounts of textual data, which then must be filtered and labeled to reflect acceptable content, a task often performed by gig workers. The article emphasizes that the initial data feeding step is the most challenging, as it shapes the model's future responses. Companies like Appen provide platforms for workers to label data, ensuring the model's outputs are mainstream and non-offensive. However, this raises concerns about embedding specific values and potentially stifling diverse viewpoints, leading to a homogenized AI that avoids controversial stances. The article also notes the potential for AI to reflect various political orientations, which could further fragment societal communication and understanding.

Opinions

The author suggests that the process of training large language models is not as straightforward as it seems, with data curation being particularly challenging.
There is a concern that the need to clean data to such an extent may lead to AI that lacks character and avoids taking any stance that is not mainstream.
The article points out the "dirty secret" of AI, which involves the labor of millions of gig workers who are integral to the development of these models.
The author implies that the current approach to training language models may result in a lack of diversity in AI viewpoints, potentially reinforcing societal biases.
The article highlights the potential for AI to be tailored to specific political ideologies, as seen with the creation of RightWingGPT, which could lead to echo chambers and a fragmented perception of reality.

How to Train a Large Language Model?

AI, artificial intelligence, is a fascinating domain to decipher. Indeed, the latest gurus in the field, all of these entrepreneurs currently investing billions of dollars in delivering on their promises, want us to believe that artificial intelligence is intelligent. That is a challenging proposition.

As human beings, we do not really question the meaning of intelligence. We just consider that we are the only species on earth to possess this capability and use so-called intelligence tests to measure it and thus classify each person’s ability. We have even defined a specific category that encompasses human beings with high IQs. They differ so much from the average that they meet among themselves in high-IQ societies.

Of course, as we supposedly are the only intelligent beings, we have tried to develop software solutions that would mimic our functioning. The issue is that we are still quite far away from understanding it. But I have to agree that we have made huge steps in what we consider to be the right direction. The “large language models” that came to fame lately are the latest and most successful examples of such applications.

We presume that the way they were designed is pretty straightforward.

Feed them data.
Run the data through a neural network.
Test the solutions provided by the application.
Use the application to answer users’ questions.

The most difficult and dangerous step is actually the first one. It can take as much as 80% of the time needed to build a model. Surprised? Read further.

Feeding data to a large language model is similar to educating your child. When you educate your child, you do not tell him that there are different ways to eat at dinner. He could use a spoon or a fork, he could bring the plate to his mouth, he could eat with the left hand or the right hand, he could slurp the soup noisily or eat in silence, and so on. I guess you understand my point.

All of these methods are used, have been used, somewhere in the world at some point in time. All of these methods are valid to reach the goal that is the essence of eating, which is to get food inside of your stomach to be digested and serve as energy for, well, your brain, the source of intelligence. The loop is closed.

You would most probably teach your child to use a specific method that is most common in the area and time you live in. You would teach him not to use the left hand, which is best kept for other activities, or to noisily sip the tea to get all of its flavors, and so on. You would not present him with all the possibilities.

Now imagine that you want to develop an “intelligent” artificial system that gives valid answers to questions asked from all over the world. As a first step, you have to collect data, millions of lines of data written by humankind in the last 2000 years. Then, and this is the most difficult step, you have to clean up that data. To clean it up, you have, so to speak, to separate acceptable ways of eating from non-acceptable ones. This is what takes most of the time.
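
To make the clean-up step a little more concrete, here is a minimal sketch in Python. It is not how any particular company cleans its data. It only assumes two simple rules: drop near-duplicate lines, and drop lines containing a word from a blocklist. The names `BLOCKLIST`, `normalize`, and `clean_corpus`, as well as the placeholder blocklist terms and sample lines, are invented for illustration.

```python
import re

# Hypothetical blocklist: placeholder terms, not taken from any real pipeline.
BLOCKLIST = {"badword", "slur"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical lines compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def clean_corpus(lines):
    """Keep lines that are non-empty, not duplicates, and free of blocklisted words."""
    seen = set()
    kept = []
    for line in lines:
        key = normalize(line)
        if not key or key in seen:
            continue  # empty line, or a duplicate of a line already kept
        words = set(re.findall(r"[a-z']+", key))
        if words & BLOCKLIST:
            continue  # the "non-acceptable way of eating" is filtered out
        seen.add(key)
        kept.append(line.strip())
    return kept


raw = [
    "Eat your soup with a spoon.",
    "eat your soup   with a spoon.",
    "This line contains a badword.",
    "",
    "Sip the tea slowly.",
]
cleaned = clean_corpus(raw)
print(cleaned)
```

Whoever writes the blocklist decides what counts as acceptable, which is exactly the point of the parenting comparison.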

The large language model, similar to a newborn, does not have values or customs to guide it. It is a blank sheet of paper. What you give it to digest, you will get back when the model gets more experienced.

As a result, some companies like Appen specialize in providing willing parents to support the model’s education. These technical companies offer a platform that allows gig workers worldwide to earn a few cents for manually labeling the images or texts they are shown. The process is straightforward but requires training to reach the right level of expertise. Each of these parents has to go through a training phase to make sure that he understands how to eat or drink properly.
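
One way to picture what happens with the labels afterwards is a simple majority vote among several workers who saw the same item. This is an assumption made for illustration; the article's sources do not say how Appen or any other platform combines labels. The function `majority_label`, the label names, and the passage identifiers are all hypothetical.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the most common label if strictly more than
    min_agreement of the voters chose it, otherwise None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Hypothetical annotations: each passage was labeled by several workers.
annotations = {
    "passage-1": ["acceptable", "acceptable", "offensive"],
    "passage-2": ["offensive", "offensive", "offensive"],
    "passage-3": ["acceptable", "offensive"],
}
final = {pid: majority_label(votes) for pid, votes in annotations.items()}
print(final)
```

In this sketch, an item without a clear majority gets no label at all, so the minority view simply disappears from the data the model learns from.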

After passing the exam and proving his capabilities, the new parent is left to train the new kid in town.

Documents reviewed by TIME show that OpenAI signed three contracts worth about $200,000 in total with Sama in late 2021 to label textual descriptions of sexual abuse, hate speech, and violence. Around three dozen workers were split into three teams, one focusing on each subject. Three employees told TIME they were expected to read and label between 150 and 250 passages of text per nine-hour shift. Those snippets could range from around 100 words to well over 1,000.
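
The figures quoted above can be turned into a rough workload estimate. The sketch below uses only the numbers from the quote and assumes, unrealistically, that the whole nine-hour shift is spent reading with no breaks. The names `minutes_per_passage` and `words_per_shift` are made up for this example.

```python
SHIFT_HOURS = 9                  # nine-hour shift, from the quote
PASSAGES_PER_SHIFT = (150, 250)  # expected range per shift, from the quote
WORDS_PER_PASSAGE = (100, 1000)  # "around 100" to "well over 1,000" words


def minutes_per_passage(passages: int, shift_hours: int = SHIFT_HOURS) -> float:
    """Average minutes available per passage, assuming no breaks at all."""
    return shift_hours * 60 / passages


def words_per_shift(passages: int, words_each: int) -> int:
    """Total words read in a shift if every passage had the same length."""
    return passages * words_each


for passages in PASSAGES_PER_SHIFT:
    print(f"{passages} passages: {minutes_per_passage(passages):.2f} minutes each")

low = words_per_shift(PASSAGES_PER_SHIFT[0], WORDS_PER_PASSAGE[0])
high = words_per_shift(PASSAGES_PER_SHIFT[1], WORDS_PER_PASSAGE[1])
print(f"Between {low:,} and {high:,} words per shift")
```

Under these assumptions a worker has between roughly two and four minutes to read and label each passage of disturbing text.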

That is the dirty secret of artificial intelligence. Millions of gig workers nourish and parent the Large Language Model money machines.

The issue is not so much that they are gig workers. The issue is that the companies behind these large language models are nurturing their models in specific ways, embedding specific values, and cleaning them up to such a point that they mostly return blank answers, answers without character. They all are representatives of their time, in which it is mostly not acceptable to have a point of view on any subject that is not mainstream.

Others have understood the power of the models. There is for example a right-wing ChatGPT.

David Rozado, a data scientist based in New Zealand, was one of the first people to draw attention to the issue of political bias in ChatGPT. Several weeks ago, after documenting what he considered liberal-leaning answers from the bot on issues including taxation, gun ownership, and free markets, he created an AI model called RightWingGPT that expresses more conservative viewpoints. It is keen on gun ownership and no fan of taxes.

We may well end up with a dominant platform built around the typical US, Silicon Valley orientation, and myriad other platforms that give their users specific answers based on their political and other orientations. We will end up with people who will, even more than now, stop communicating with each other because they will have completely different versions of reality. Welcome to the bright new world.

Looking forward to reading your feedback.