Getting machines to generate coherent, meaningful text seems like an incredibly hard task, but if you have been following the latest Artificial Intelligence buzz around GPT-3, you would be forgiven for thinking that Skynet is already upon us.
If you just look at the headlines, it seems like GPT-3 can already do a lot. It can summarize user feedback neatly so that you can gain insights, and it can apparently write working code for you. It is being used to develop interactive text-based games that organically evolve with user choices instead of constraining them to permutations of predefined states. But the common theme across many of its highly impressive accomplishments is that the stakes are not that high. A few mistakes by the model in an interactive game aren't the end of the world; they can even be hilarious. What if we want to generate serious content where quality assurance is a top priority?
At Alexi, our lawyers leverage Natural Language Processing (NLP) models to answer legal questions. But can a generative model produce an entire legal memo that not only answers the question but also includes the most relevant citations, with no help from the lawyers? (Un)Fortunately, right now the answer is no.
Language Models: Taking Over the World, One Word at a Time
Most generative models in NLP today are variants of language models, or make use of the same techniques that language models do. A complete technical breakdown of how they work is out of scope for this article, but I will give a brief overview. There are many technical details and neat tricks involved, but they are not essential for the big picture.
Language models ultimately learn to predict the probability of the next word given all the words that came before it. As an imaginary example, when trained on a dataset of Yelp reviews:
Starbucks coffee tastes very ___________.
The model might predict that the probability of the next word being “bad” is 70%. I am exaggerating here; given that the vocabulary is quite large, the real probability would likely be lower. You feed the model texts from your dataset, and it learns the general pattern of how different words relate to each other. It learns to predict the probability of the next word conditioned on the input text and all the words generated up to that point. Such a model is often called an autoregressive model: it is basically predicting one word at a time.
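To make the idea concrete, here is a minimal sketch of greedy autoregressive decoding. The vocabulary, the "logits", and the stand-in scoring function are all invented for illustration; a real language model would compute those scores with a large neural network rather than return fixed numbers.

```python
import math

# Toy vocabulary for the prompt "Starbucks coffee tastes very ___".
# Both the words and the scores below are made up for illustration.
VOCAB = ["bad", "good", "bitter", "expensive", "the"]

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_next_word_distribution(context):
    """Stand-in for a trained language model: returns P(next word | context).

    A real model would compute these logits from the context with a neural
    network; here we simply return fixed, invented scores.
    """
    logits = [2.0, 1.0, 1.5, 0.5, -1.0]
    return dict(zip(VOCAB, softmax(logits)))

def generate(context, steps=3):
    """Greedy autoregressive decoding: repeatedly append the most likely word,
    feeding everything generated so far back in as the new context."""
    words = context.split()
    for _ in range(steps):
        dist = toy_next_word_distribution(" ".join(words))
        next_word = max(dist, key=dist.get)
        words.append(next_word)
    return " ".join(words)
```

Note the loop structure: each new word is chosen conditioned on the prompt plus everything generated so far, which is exactly what "autoregressive" means. Real systems usually sample from the distribution instead of always taking the argmax, to avoid repetitive output.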
Make no mistake, there is active research into alternatives to autoregressive generation. Autoregressive models themselves leave something to be desired, since intuitively that is not how humans seem to generate text. We don't produce words one by one, choosing the most likely next word based on the previous ones. We seem to internalize a concept and already know what we are going to say as a whole sentence.
GPT-3 has 175 billion parameters; GPT-2 had 1.5 billion. The amount of textual data needed to train such models reaches the order of terabytes. Studies indicate that with current approaches to text generation, larger models trained on larger datasets simply tend to be better. Working with models of this size and scaling them into your tech stack in the form of fast, usable APIs is a challenging and expensive problem. It also creates other issues: only the richest companies have the resources to train models of this size, which leads to closed APIs, commercialization, and slower research progress.
Oh No, My Model is Racist
The fact that these huge models are trained on huge datasets means the data is not very clean. It essentially comes from the internet, including social media posts and forum comments.
When asked to generate content from a single word of input (black/women/jewish/holocaust), the model very frequently resorts to generating incredibly hateful content. This is not surprising, and there are tricks and active research aimed at mitigating it, but it remains a huge problem.
Train of Thought … What’s That?
One of the biggest problems with current generative models is that they very commonly go off on a completely unrelated tangent. The model can produce highly structured, coherent long text, but this is not consistent, let alone guaranteed. Model-generated text is usually coherent locally (say, within a paragraph), but it is quite common for the model to pick up on some tangent and completely diverge from the original point. From a technical standpoint, it is easy to see why this happens: GPT-3, for example, has no mechanism for long-term memory. The current state-of-the-art generative models are all based on the Transformer architecture, which is severely limited by its input context size (typically around 500-1000 words). There is active research into mitigating these issues, but no standout solution has emerged yet.
Generative Models Are Dumb, What Now?
They will get better; they are getting better all the time. But right now, beyond the issues discussed above, they also suffer from repetition and a lack of originality in generated content. The model can produce highly creative, original text, but it is also common for it to spit out memorized paragraphs.

In the absence of a good general intelligence model, it is important to break your problem into sub-tasks and have specialized individual models solve them. Involve humans and domain experts instead of treating NLP models as a magical black box that spits out answers. Instead of asking the model to generate long, coherent answers, ask it to select relevant passages that answer your question. Solving specialized sub-problems with these techniques can lead to impressive results.
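As a toy illustration of the "select relevant passages" idea, here is a crude extractive ranker based on lexical overlap. Everything here is an invented sketch, not any production system: real pipelines would use learned relevance models, but the shape of the sub-task (rank candidates, return the top few, let a human verify) is the same.

```python
def overlap_score(question, passage):
    """Crude relevance score: fraction of the question's words that also
    appear in the candidate passage. A real system would use a learned
    model instead of raw word overlap."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)

def select_passages(question, passages, top_k=1):
    """Rank candidate passages by relevance to the question and return the
    top_k. The output is existing text selected for review, not generated
    text, so there is nothing for the model to hallucinate."""
    ranked = sorted(passages,
                    key=lambda p: overlap_score(question, p),
                    reverse=True)
    return ranked[:top_k]
```

Because the output is always a verbatim excerpt from the source material, a domain expert can quickly verify it, which is exactly the quality-assurance property a free-form generator lacks.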
I believe that models will become more intelligent, and someday we will have a general-purpose system that is actually intelligent. But we don't have to worry about appeasing our AI overlords anytime soon.