How to Train a Large Language Model?

Artificial intelligence (AI) is a fascinating domain to decipher. The latest gurus in the field, the entrepreneurs currently investing billions of dollars to deliver on their promises, want us to believe that artificial intelligence is intelligent. That is a challenging proposition.
As human beings, we do not really question the meaning of intelligence. We simply assume that we are the only species on earth to possess it, and we use so-called intelligence tests to measure it and rank each person’s ability. We have even defined a specific category for human beings with high IQs. They are considered so far from the average that they gather among themselves in high-IQ societies.
Of course, as we are supposedly the only intelligent beings, we have tried to develop software that mimics the way we function. The issue is that we are still quite far from understanding how we function ourselves. But I have to agree that we have made huge steps in what we consider to be the right direction. The “large language models” that recently rose to fame are the latest and most successful examples of such applications.
We presume that the way they are built is pretty straightforward:
- Feed them data.
- Run the data through a neural network.
- Test the answers the application provides.
- Use the application to answer users’ questions.
The most difficult and dangerous step is actually the first one. It can take as much as 80% of the time needed to build a model. Surprised? Read further.
Feeding data to a large language model is similar to educating your child. When you educate your child, you do not tell him that there are different ways to eat at dinner. He could use a spoon or a fork, he could bring the plate to his mouth, he could eat with the left hand or the right hand, he could slurp the soup noisily or eat in silence, and so on. I guess you understand my point.
All of these methods are used, or have been used, somewhere in the world at some point in time. All of them are valid ways to reach the essential goal of eating, which is to get food into your stomach to be digested and serve as energy for, well, your brain, the source of intelligence. The loop is closed.
You would most probably teach your child the specific method that is most common in the place and time you live in. You might teach him not to use the left hand, which is best kept for other activities, or to sip his tea noisily to get all of its flavors, and so on. You would not present him with all the possibilities.
Now imagine that you want to develop an “intelligent” artificial system that gives valid answers to questions asked from all over the world. As a first step, you have to collect data: millions of lines of text written by humankind over the last 2000 years. Then, and this is the most difficult step, you have to clean up that data. To clean it up, you have, so to speak, to separate acceptable ways of eating from unacceptable ones. This is what takes most of the time.
The large language model, like a newborn, does not have values or customs to guide it. It is a blank sheet of paper. What you give it to digest, you will get back once the model is more experienced.
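To make the clean-up step more concrete, here is a minimal sketch of what a first filtering pass could look like. Everything in it is an assumption made for illustration: the blocklist, the minimum length, and the function names are hypothetical, and real pipelines rely on trained classifiers and human-labeled examples rather than a few keywords.

```python
import re

# Hypothetical blocklist and threshold, chosen only for illustration.
BLOCKED_TERMS = {"badword", "slur"}
MIN_WORDS = 5


def normalize(line: str) -> str:
    """Collapse whitespace and lowercase the line so duplicates compare equal."""
    return re.sub(r"\s+", " ", line).strip().lower()


def clean_corpus(lines):
    """Keep lines that are long enough, contain no blocked term, and are not duplicates."""
    seen = set()
    kept = []
    for line in lines:
        norm = normalize(line)
        words = re.findall(r"[a-z']+", norm)
        if len(words) < MIN_WORDS:
            continue  # too short to carry useful content
        if BLOCKED_TERMS.intersection(words):
            continue  # one of the "unacceptable ways of eating"
        if norm in seen:
            continue  # duplicate of a line already kept
        seen.add(norm)
        kept.append(line.strip())
    return kept


raw = [
    "He eats his soup with a spoon.",
    "he eats  his soup with a spoon.",
    "Too short.",
    "This line contains a badword and is dropped.",
]
print(clean_corpus(raw))
```

Even in this toy version, someone has to decide what goes on the blocklist, which is exactly where values enter the model.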
As a result, some companies, like Appen, specialize in providing willing parents to support the model’s education. These companies offer platforms that allow gig workers worldwide to earn a few cents by manually labeling the images or texts they are shown. The process is straightforward but requires training to reach the right level of expertise. Each parent, in turn, has to go through a training phase to make sure that he understands how to eat or drink properly.
After passing the exam and proving his capabilities, the new parent is left to train the new kid in town.
Documents reviewed by TIME show that OpenAI signed three contracts worth about $200,000 in total with Sama in late 2021 to label textual descriptions of sexual abuse, hate speech, and violence. Around three dozen workers were split into three teams, one focusing on each subject. Three employees told TIME they were expected to read and label between 150 and 250 passages of text per nine-hour shift. Those snippets could range from around 100 words to well over 1,000.
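To put these figures in perspective, here is a quick back-of-the-envelope calculation of the time available per passage. It is only a sketch: it assumes the full nine hours are spent reading and labeling, with no breaks, which the source does not state, and the function name is my own.

```python
SHIFT_HOURS = 9                  # shift length reported by TIME
PASSAGES_PER_SHIFT = (150, 250)  # expected range reported by TIME


def minutes_per_passage(passages: int, shift_hours: float = SHIFT_HOURS) -> float:
    """Average minutes available per passage, assuming no breaks."""
    return shift_hours * 60 / passages


for passages in PASSAGES_PER_SHIFT:
    print(f"{passages} passages per shift: {minutes_per_passage(passages):.2f} minutes each")
```

That is roughly two to four minutes per passage, for snippets that can run to more than 1,000 words.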
That is the dirty secret of artificial intelligence. Millions of gig workers nourish and parent the large language model money machines.
The issue is not so much that they are gig workers. The issue is that the companies behind these large language models are nurturing their models in specific ways, embedding specific values, and cleaning them up to such a point that they mostly return bland answers, answers without character. They are all products of their time, a time in which it is mostly not acceptable to hold a point of view that is not mainstream.
Others have understood the power of these models. There is, for example, a right-wing ChatGPT.
We may well end up with one dominant platform built around the typical US, Silicon Valley orientation, and a myriad of other platforms that give their users specific answers based on their political and other orientations. We will end up with people who, even more than now, stop communicating with each other because they have completely different versions of reality. Welcome to the bright new world.
Looking forward to reading your feedback.






