How an LLM Is Made: Months, Millions, and a Trillion Guesses
You've almost certainly used one of these by now: ChatGPT, Claude, Gemini, one of the AI assistants that will answer just about anything you type at them. And you've probably been handed an opinion on which is best: use this one for writing, that one for code, the open one if you want it running on your own computer. The model has quietly become something you pick off a shelf, the way you'd choose a brand of phone. What almost nobody stops to ask is the part I find the most interesting: where does one of these actually come from? How do you make a large language model, how does it learn anything, why can it answer you the instant you hit enter, and why does it sometimes say something completely wrong with a totally straight face?
You don't need to be technical to follow this. I'm going to explain the whole thing with everyday comparisons and no math. By the end you'll understand how the AI you use every day was really made, what people mean when they say a model is "training," and why something that took months and a fortune to build can answer you faster than you can read the reply. One promise up front: I won't show you how to build your own ChatGPT, because you can't, and the reason why turns out to be the most useful thing in here.
First, "Model" vs "LLM"
Quick untangling, because the two words get thrown around like they mean the same thing. A model is the general idea. It's any tool that learned to do something by studying lots of examples, instead of being handed step-by-step rules by a person. The spam filter that learned to spot junk mail is a model. The thing your bank uses to flag a suspicious charge is a model. So is the app that turns a typed prompt into a picture. Underneath, a model is just a giant pile of numbers that got tuned, by example, until it produced good answers.
An LLM, or large language model, is one specific kind: a model trained on a mountain of text to predict text. "Large" because it holds billions of those tuned numbers and has read more than any human could in a thousand lifetimes. Here's the easy way to hold it in your head: model is the word "vehicle," and an LLM is one type of vehicle, like a car. Every LLM is a model, but plenty of models, the spam filter and the photo app, aren't LLMs. In casual conversation people just say "the model" when they mean the LLM, and that's where the mix-up starts.
It's a Next-Word Guesser (and That's Not an Insult)
Strip everything else away and a large language model does one surprisingly simple thing: it predicts the next bit of text. You give it some words, and it works out the most likely next word, then the next, then the next.
You've already met a baby version of this. When your phone suggests the next word above the keyboard, or finishes "I'll be there in five" with "minutes," that's the same idea, just small and not very bright. An LLM is that same trick scaled up almost beyond belief. The thing that can write a wedding speech, fix your spreadsheet formula, or argue with you about pizza toppings is, underneath all of it, a phenomenally good version of your phone's autocomplete.
Here is the loop it runs. It looks at everything written so far, gives every possible next word a score for how likely it is, picks a strong one, sticks it on the end, then does the whole thing over again with that new word included. A full answer is just that loop running a few hundred times, one word after another. (It actually works in "tokens," which are pieces of words, but if you read that as "words" you lose nothing.)
One turn of the loop: the model scores every possible next word, picks one, adds it, and goes again. A whole answer is this a few hundred times over.
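If you like seeing things as code, here is a toy sketch of that loop in Python. Everything in it is an assumption for illustration: the `NEXT_WORD_SCORES` table and its numbers are typed in by hand, and it only looks at the last word, whereas a real model computes its scores from everything written so far.

```python
# Made-up scores for which word tends to follow which.
# A real model computes these from billions of weights; here they are typed in by hand.
NEXT_WORD_SCORES = {
    "i'll": {"be": 0.9, "go": 0.1},
    "be": {"there": 0.8, "late": 0.2},
    "there": {"in": 0.7, "soon": 0.3},
    "in": {"five": 0.6, "time": 0.4},
    "five": {"minutes": 0.9, "days": 0.1},
}


def generate(prompt, max_new_words=10):
    """Run the guess-append-repeat loop until we run out of known continuations."""
    words = prompt.lower().split()
    for _ in range(max_new_words):
        scores = NEXT_WORD_SCORES.get(words[-1])
        if scores is None:  # no known continuation, so stop
            break
        # Pick the highest-scoring candidate (real systems often sample instead).
        next_word = max(scores, key=scores.get)
        words.append(next_word)  # the new word becomes part of the input
    return " ".join(words)


print(generate("I'll be"))  # prints: i'll be there in five minutes
```

The shape is the point, not the table: score the candidates, pick one, stick it on the end, go again.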
It sounds too basic to be the thing behind ChatGPT, and that's a fair reaction. The twist is this: to get really good at guessing the next word across all the writing on the internet, the model is forced to learn an enormous amount without anyone teaching it directly. To finish "the capital of France is" correctly, it has to have picked up a fact. To continue a recipe sensibly, it has to know you preheat the oven before the cake goes in. Get good enough at "what word comes next," over a big enough pile of text, and you end up having to understand the world in order to do it. The cleverness is a byproduct of practicing one humble little task a trillion times.
Pretraining: Reading the Whole Library, One Blank at a Time
Making the model happens in two big stages. The first, and by far the more expensive, is called pretraining, and it is just that next-word game played at a scale that's genuinely hard to picture.
Imagine handing the model a library the size of the internet, books, articles, websites, code, the brilliant stuff and the unhinged comment threads alike, and having it play fill-in-the-blank over and over. Cover the next word, let it guess, then uncover the real word so it can see how it did. Guess, check, adjust. Then again, and again, billions upon billions of times.
The neat trick is that nobody has to grade this homework by hand. The text is its own answer key. The real next word is sitting right there in the original writing, so the model can mark itself, for free, every single time. That is why it can be done at such a ridiculous scale: there's no roomful of people labeling examples, just an ocean of existing writing and a model reading its way through it. (The jargon for this is "self-supervised," which only means the data checks the model's own answers. It's a close relative of the "supervised" and "unsupervised" learning you may have heard of, both of which are worth a post of their own that I'll get to.)
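To make "the text is its own answer key" concrete, here is a small Python sketch. The function name `make_training_pairs` is hypothetical, and splitting on spaces is a simplification of how real systems cut text into tokens.

```python
def make_training_pairs(text):
    """Turn one piece of writing into many fill-in-the-blank exercises."""
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[:i]  # what the model is allowed to see
        answer = words[i]    # the covered word, taken straight from the text
        pairs.append((context, answer))
    return pairs


for context, answer in make_training_pairs("the capital of France is Paris"):
    print(" ".join(context), "->", answer)
```

One six-word sentence yields five exercises, and nobody had to label any of them.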
This stage is where the months and the millions go. Thousands of pricey computer chips running flat out for weeks or months, power bills that look like phone numbers, all so the model can read more of the internet than any human could survive and get great at that one prediction. What pops out at the end is called a base model. It has soaked up grammar, facts, the rhythm of a good argument, a remarkable amount of how the world gets described in words. But it is not the polite assistant you're used to, not yet. It's a brilliant know-it-all with zero social skills, and the next stage is where it learns some.
How It Actually Learns
So what does "adjust" really mean? This is the part that sounds like wizardry and isn't.
Picture the model as a machine with billions of tiny tuning knobs on the back. Those knobs (their real name is "weights") control everything it does, and at the start they're set at random, so its first guesses are gibberish. Every time it guesses a word it gets a wrongness score: low if it nearly nailed the real next word, high if it was miles off. The whole point of training is to drive that score down. (And if you're picturing the green raining code from The Matrix, let that image go: what's actually back there is far less cinematic and somehow stranger, a spreadsheet roughly the size of a small country whose numbers nobody chose by hand.)
So after each guess, the machine works out, for every single knob, which way to turn it a hair so the next guess comes out a little less wrong, and nudges them all by a tiny amount. One nudge changes almost nothing, but do it across trillions of words and the knobs slowly drift into a setting that predicts language astonishingly well. It's a bit like tuning an old radio by feel: turn the dial, listen for less static, turn a touch more, until the station comes in clear. (The fancy names are "gradient descent" for the nudging and "backpropagation" for working out which way each knob should turn, and you can happily forget both.)
Guess, check the wrongness score, nudge the knobs, then repeat a few trillion times. Learning is just this little loop, run at a scale no human could sit through.
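For the curious, the nudging can be sketched with a machine that has exactly one knob. This is a toy under stated assumptions: the names `wrongness` and `train`, the numbers, and the squared-error score are all chosen for illustration, and the slope is worked out by hand here, where a real network relies on backpropagation to do it for billions of knobs at once.

```python
def wrongness(knob, x, target):
    """How far off the guess is: zero when knob * x hits the target exactly."""
    return (knob * x - target) ** 2


def train(knob=0.0, x=2.0, target=6.0, step_size=0.05, steps=200):
    """Nudge one knob, a hair at a time, in whichever direction lowers the wrongness."""
    for _ in range(steps):
        # Which way is downhill? This is the slope of the wrongness score.
        slope = 2 * (knob * x - target) * x
        # Turn the knob a little against the slope.
        knob -= step_size * slope
    return knob


print(wrongness(0.0, 2.0, 6.0))  # prints: 36.0 (the starting guess is far off)
print(round(train(), 3))         # prints: 3.0 (because 3 * 2 = 6)
```

No one told the machine the answer was 3. It drifted there because every nudge made the score a little smaller.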
The genuinely strange part is that nobody types in the facts. No person writes a rule that Paris is the capital of France, or that water boils before the pasta goes in. Those just get baked into the knob settings as a side effect of making the wrongness score smaller, over and over. Nobody tells the model how the world works. It quietly builds a working picture of it, because that turns out to be the only way to keep guessing the next word better.
One more name to leave you with, since you'll hear it everywhere: that machine of knobs is a neural network, and stacking many layers of them deep is literally where "deep learning" got its name. What those layers are actually doing under the hood is a whole post on its own, one I want to write next, but you don't need a word of it to have followed what just happened here.
From Know-It-All to Something Helpful
That base model is powerful and almost unusable as it is. Ask it a question and it might answer, or it might just fire back three more questions, because online a question is often followed by more questions. It learned to imitate text, not to actually help you. Think of a new hire who knows everything but hasn't been shown how the job works yet.
So there's a second, much smaller round of training, and this is where the assistant you recognize is actually born. First it's shown loads of examples in the right shape: here's a request, here's a good, helpful reply to it. Do that enough and the model learns the pattern "when someone asks for something, actually give them a useful answer" instead of rambling on. This step is called instruction tuning. It's the onboarding.
Then comes the part that adds the polish. Instead of being handed the one perfect answer, the model is shown two of its own answers and told which one a human liked better, again and again, across a huge number of little comparisons. Those judgments are used to train a second model, a scorer that learns to rate answers the way the humans did, and that scorer then guides the main model's tuning, so people don't have to grade every answer by hand forever. From that the model picks up the things that are hard to write down as rules: be helpful, don't make things up, refuse the genuinely harmful requests, stop waffling. It's a lot like training a dog with treats, you reward the good version and you get more of it. This stage is usually called RLHF, reinforcement learning from human feedback: the know-how came from reading the internet, the manners came from here.
The same model, walked from raw know-it-all to the helper you actually talk to.
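The "scorer trained on comparisons" idea can be sketched in a few lines of Python. Treat it as an illustration under heavy assumptions: each answer is boiled down to two hand-picked numbers (does it answer the question, how much does it waffle), and the update rule is one common pairwise approach, not necessarily what any particular lab uses. A real scorer reads the actual text.

```python
import math


def score(weights, features):
    """The scorer's rating of one answer."""
    return sum(w * f for w, f in zip(weights, features))


def train_scorer(comparisons, steps=500, step_size=0.1):
    """Learn weights so that human-preferred answers end up rated higher."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for preferred, rejected in comparisons:
            gap = score(weights, preferred) - score(weights, rejected)
            # Near 1 when the scorer disagrees with the human, near 0 when it agrees.
            surprise = 1.0 - 1.0 / (1.0 + math.exp(-gap))
            for i in range(len(weights)):
                weights[i] += step_size * surprise * (preferred[i] - rejected[i])
    return weights


# Each answer is (answers_the_question, amount_of_waffle): hypothetical features.
# Each pair is (the answer a human liked better, the one they liked less).
comparisons = [
    ((1.0, 0.2), (0.0, 0.9)),
    ((1.0, 0.1), (1.0, 0.8)),
    ((1.0, 0.5), (0.0, 0.3)),
]
weights = train_scorer(comparisons)
print(score(weights, (1.0, 0.1)) > score(weights, (0.0, 0.9)))  # prints: True
```

After training, the scorer rewards answering the question and penalizes waffle, even though nobody wrote either rule down. It picked them up from the comparisons.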
Why It Sometimes Makes Things Up
I slipped "don't make things up" into that list a moment ago, and it deserves its own stop, because it's the thing everyone has been burned by: the model says something with total confidence, and it's flat wrong. We call it hallucination, and once you know what the model really is, it stops being mysterious.
Remember the only thing it ever learned to do: produce the most likely next words. Not the truest ones, the most likely-sounding ones. Usually those are the same, because true statements are far more common in writing than false ones. But when the model hits a gap, a fact it never read, a detail too obscure, a question about something that happened after its training stopped, it doesn't pause and say "no idea." It does what it always does and produces the most plausible continuation. A made-up citation looks exactly like a real one. An invented function name sits perfectly in valid-looking code. There's no separate fact-checker sitting behind its eyes; a confident right answer and a confident wrong one are built the exact same way. It isn't lying. It has no idea it's wrong, because "wrong" was never a thing it could feel.
So how do you rein it in? Not by hoping it knows more, but by changing what it's working from. The big one is to hand it the real source at question time: give it the actual document, the actual database rows, the actual page, and tell it to answer only from that. Now it isn't fishing in a foggy memory of the whole internet, it's reading the thing in front of it, the way an open-book exam beats a closed-book one. (That idea, retrieval and grounding, sits behind most serious AI tools at work, and it's a lot of what I do day to day.) You can also let it use tools: look it up, run the query, check the math, so the answer rests on a real result instead of a vibe. And the training helps too, since a good chunk of that preference tuning goes into teaching the model to admit "I don't know" instead of bluffing. It doesn't make the problem vanish. It makes it manageable, which is the honest state of the art right now.
From memory alone:
- Answers from a foggy memory of everything
- Fills any gap with a plausible guess
- ✗ confident, sometimes wrong
Grounded in a real source:
- Reads the actual document in front of it
- Answers only from what is there
- ✓ grounded and checkable
Same model, two setups. Grounding it in a real source is most of how you keep it honest.
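In practice, "hand it the real source" often comes down to how the request is put together before it reaches the model. Here is a minimal Python sketch; the function name `build_grounded_prompt` and the exact wording of the instructions are assumptions for illustration, and sending the result to a model is left out because that step depends on which service you use.

```python
def build_grounded_prompt(question, source_text):
    """Wrap a question with the real source and an instruction to stick to it."""
    return (
        "Answer the question using only the source below.\n"
        "If the source does not contain the answer, reply \"I don't know.\"\n\n"
        f"Source:\n{source_text}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    question="When does the warranty expire?",
    source_text="Warranty: 24 months from the date of purchase.",
)
print(prompt)
```

The model is the same one as before. The only thing that changed is what it has in front of it when it starts guessing.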
Why It Answers So Fast
Here's the bit that surprises people most. If building the model took months and a warehouse full of computers, how can it answer you in about a second? Because building it and using it are two completely different activities, and almost all the effort lives in the building.
Once training is done, those billions of knobs are locked in place and never move again. Using the model (the technical word is "inference") involves no learning at all. Your words go in, flow through the fixed pile of numbers, and the next-word prediction comes out the other side, one trip for each word of the reply. Each trip is mostly a colossal amount of multiplication, the exact kind of arithmetic these chips are built to do thousands of times at once, with nothing being figured out and no knob turning. The hard part already happened, slowly, during training. Answering is just running the finished machine forward. All the agonizing was over months ago, in a data center, on someone else's electricity bill.
Building it (training), done once:
- Weeks to months
- Thousands of chips
- Millions of dollars
- Every knob is changing
Using it (inference), every time you ask:
- About a second
- One pass, knobs frozen
- Fractions of a cent
- Nothing is learned
Almost all the cost is in the building and happens once. The thing you use every day is the cheap side.
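Here is what "running the finished machine forward" looks like at toy size, in Python. The three-word vocabulary, the `WEIGHTS` numbers, and the two-number input are all made up for illustration. A real model has billions of weights and many layers, but the step has the same character: multiply, add, read off the scores.

```python
import math

VOCAB = ["minutes", "days", "pizza"]

# Frozen knobs: one row of weights per candidate word (made-up numbers).
WEIGHTS = [
    [2.0, 0.5],
    [1.0, 0.2],
    [-1.0, 0.1],
]


def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next(input_numbers):
    """One forward pass: multiply and add. No weight is changed anywhere."""
    scores = [sum(w * x for w, x in zip(row, input_numbers)) for row in WEIGHTS]
    probabilities = softmax(scores)
    best = max(range(len(VOCAB)), key=lambda i: probabilities[i])
    return VOCAB[best], probabilities


word, probabilities = predict_next([1.0, 1.0])
print(word)  # prints: minutes
```

Notice what is missing compared with training: there is no wrongness score and no nudging. That absence is why a single answer is so cheap.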
Think of the difference between writing a dictionary and looking a word up in one. Writing it takes years and a team. Looking something up takes a second, because all the hard work is already sitting there, done. The reason the answer looks like it's being typed to you is that the model really is producing it one word at a time, and good apps simply show each word the moment it's ready. So the thing that cost a fortune to teach costs almost nothing to ask. That is the whole reason you can rent it by the question instead of building your own: the expensive part was done once, by someone else, and you're just pressing play on the result.
So Should You Make Your Own?
So, back to where we started: should you make your own? If "make" means build one from scratch, the honest answer is no, and it isn't close. Training a model in the same league as the ones you use needs thousands of those chips, a bill running into the millions, a cleaned-up copy of much of the internet, and a team who does only this. That's a handful of big labs and basically nobody else. If you're reading this on a laptop, you are not on that list, and neither am I. And that isn't a sad fact, it's the entire reason you can pick one off the shelf at all. The hugely expensive part has already been paid for by someone else, and you get to use it for pennies. You don't build a power station to charge your phone, you plug into the wall.
What you genuinely can do yourself is that second stage. Take a free, open model that a lab already trained, the Llamas and Mistrals of the world, and fine-tune it on your own examples so it talks in your style, your format, your subject. That's the same "show it good examples" idea from earlier, just aimed at your problem. It can run on a single rented GPU machine, and the smallest models can even be tuned on a high-end laptop. It's the rare part of all this you can actually try on a rainy weekend. You're not building the engine, you're tuning one that already runs.
So the real answer to "how do I make my own AI model" is that you almost never make the model, you make it yours. You stand on the years of work the labs poured in, and you spend your effort on the thin, useful layer that turns a general model into the specific thing you need. Knowing where that line falls, what's already been done for you and what's actually yours to do, is most of what separates the people who build useful things with this from the people still waiting to train a model that was never theirs to train.
What You're Actually Holding
Take it all apart like this, the guessing, the knobs, the long expensive reading, and you'd think the magic would drain out of it. It doesn't. If anything, seeing how it's built makes the thing more impressive, not less. There's no hidden book of facts inside, no person who sat and typed in rules about France or pasta or you. There's a giant pile of numbers that got nudged, over and over across trillions of words, toward guessing the next word a little better, until something that looks a lot like understanding fell out as a side effect. Nobody taught it the world. They taught it to guess, and the world came along for the ride.
That's the part I keep coming back to. The thing you call up in a second, that writes your email and untangles your error message, is a frozen snapshot of an absurdly long, expensive guessing game that someone else already finished playing. You don't need to run that game yourself. You just need to know what it is you're holding.