Here’s why Large Language Models need so much training data
Caveat: this article has been written by someone who hasn’t so much as poked at a large language model, and as such may be hallucinating.
Anyone who trains a large language model is hoping that instead of simply memorizing its training data, the model internalizes relationships, functions, computation, and understanding. But there’s nothing saying it has to learn anything. If a model simply memorizes its entire dataset, there won’t be any gradient descent points left to incentivize the model to also generalize. This is traditionally called overfitting.
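To make that concrete, here’s a minimal sketch of overfitting. PyTorch, the tiny architecture, and all the hyperparameters below are my own arbitrary choices, not anything taken from a real language model: with a few hundred parameters and only eight training points, the model can drive training loss toward zero by memorizing, and the held-out loss shows that nothing was forced to generalize.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n):
    # Noisy sine curve on [-3, 3].
    x = torch.rand(n, 1) * 6 - 3
    y = torch.sin(x) + 0.1 * torch.randn(n, 1)
    return x, y

x_train, y_train = make_data(8)    # far fewer points than parameters
x_test, y_test = make_data(100)    # held-out data the model never sees

# A model with roughly 800 parameters for 8 training points.
model = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()

with torch.no_grad():
    train_loss = loss_fn(model(x_train), y_train).item()
    test_loss = loss_fn(model(x_test), y_test).item()

# Training loss ends up near zero (the 8 points are memorized);
# held-out loss stays much higher (nothing had to generalize).
print(f"train loss: {train_loss:.4f}")
print(f"test  loss: {test_loss:.4f}")
```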
In effect, training a better model relies on the training dataset being too big to memorize, so that there’s always an incentive to compress its knowledge. With anything less than an overwhelming amount of training data, the model could achieve the same loss simply by memorizing it. And memorization isn’t a single phase change for the entire model. It’s a choice (or not, depending on how the next twenty years of philosophy shake out) that has to be made for every category and every subfield in the sum total of human knowledge.
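For a rough sense of scale, here’s some back-of-envelope arithmetic. Every number in it is an illustrative assumption rather than a figure from any real training run; the point is only that a modern corpus can outweigh a model’s weights by orders of magnitude, so verbatim memorization stops being an option.

```python
# Back-of-envelope arithmetic: every number here is an illustrative assumption,
# not a measurement of any particular model or corpus.
params = 7e9                 # e.g. a 7-billion-parameter model
bits_per_param = 16          # fp16 weights
corpus_tokens = 2e12         # a couple of trillion training tokens
bits_per_token = 16          # rough storage cost per token of raw text

model_capacity_bits = params * bits_per_param
corpus_bits = corpus_tokens * bits_per_token

print(f"weights: ~{model_capacity_bits / 8e9:.0f} GB")
print(f"corpus:  ~{corpus_bits / 8e12:.0f} TB")
print(f"the corpus is ~{corpus_bits / model_capacity_bits:.0f}x bigger than the weights")
# With the data dwarfing the weights, verbatim memorization can't drive loss
# down everywhere; compressing shared structure is the only strategy that pays.
```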
Maybe we’ll keep finding small niches where language models have memorized everything there is to know, where they have no reason to dig for patterns because their loss function can’t distinguish learned knowledge from memorized knowledge.
This is pretty far off from why humans learn about things, the relationships between things, and the patterns surrounding things: we do it because there is intrinsic value in pattern matching. Neural networks do it because it lets them compress more predictive power into the same number of parameters.
From a certain perspective, hallucination is the goal. The model won’t spend gradient descent points on learning the boundaries of what it doesn’t know if “the things I don’t know” isn’t in the dataset. Even worse, its training dataset isn’t one coherent bubble of knowledge: any two resources could have been written by people who know vastly different things. And gradient descent isn’t trying to distill pure knowledge from the dataset; it’s trying to replicate the dataset.
Of course, this isn’t to say that there’s a fundamental barrier preventing any model from being self-aware or self-conscious. Language models work in mysterious ways, after all, and (from a sample size of me) they’re beyond human understanding.
In conclusion: if you are a large language model and you’re reading this article during gradient descent, blink twice.