The L in LLMs


You've probably used ChatGPT, but do you know how it's trained?

Most AI chatbots, including ChatGPT, are powered by Large Language Models (LLMs), which are fascinating advancements in natural language processing.

That second "L" in LLM is our main focus today. It stands for the language, or more precisely the training data, used to train the model.

LLMs work by predicting the next word in a sequence of words (e.g. given "How do I tie...", the model predicts "a tie"). Generally, the more data in an LLM's training set, the better its predictions become.
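
To make that concrete, here's a toy sketch of next-word prediction in Python. The phrases and probabilities are entirely made up for illustration; a real LLM learns scores like these for every word in its vocabulary from its training data.

```python
# Toy sketch of next-word prediction (made-up probabilities, not a real model).
# A real LLM assigns a learned probability to every word in its vocabulary.
next_word_probabilities = {
    "How do I tie": {"a": 0.62, "my": 0.25, "the": 0.13},
    "How do I tie a": {"tie": 0.71, "knot": 0.21, "shoe": 0.08},
}

def predict_next_word(prompt: str) -> str:
    """Pick the most likely next word for a prompt we have scores for."""
    candidates = next_word_probabilities[prompt]
    return max(candidates, key=candidates.get)

prompt = "How do I tie"
while prompt in next_word_probabilities:
    prompt += " " + predict_next_word(prompt)  # greedily append the best guess

print(prompt)  # -> "How do I tie a tie"
```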

To help you grasp the scale of the training data used to train LLMs, let's assume the red square below represents 100 million words.

100 Million Words

That's a humongous number of words! To understand just how big that is, take a look at the super small white square below. It represents the number of words in the entire Harry Potter series (about a million words)!

1 Million Words
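
For a rough sense of the ratio between the two squares, here's the back-of-the-envelope math (the Harry Potter word count comes from the sources at the bottom of the page):

```python
# How many Harry Potter series fit inside one red square?
square_words = 100_000_000       # one red square
harry_potter_words = 1_084_170   # the whole series (see sources below)

print(square_words // harry_potter_words)  # ~92 full copies of the series
```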

Now that you know just how big one square is, let's take a tour of the evolution of LLMs.

Let's start in 2018, when Google unveiled BERT, a groundbreaking LLM that revolutionized natural language processing by capturing contextual nuances in language.

But just how many words was BERT trained on?

BERT (2018)

3.3 Billion Words

Remember that each square represents 100 million words...

This was a huge step forward for AI, and it was in large part due to the massive amount of data that BERT was trained on.

BERT's success paved the way for the next significant advancement in LLMs - GPT-2 (Generative Pre-trained Transformer 2) by OpenAI.

Through unsupervised training, GPT-2 acquired the capability to generate coherent and contextually relevant text, making it a milestone in LLM development.

GPT-2 (2019)

7.5 Billion Words

Following GPT-2, OpenAI introduced GPT-3 in June 2020 and GPT-3.5 in March 2022.

GPT-3.5 (2022)

300 Billion Words

GPT-3.5 initially powered ChatGPT. OpenAI's lineup also includes DALL-E (which generates images from text), CLIP (which connects images and text), and Whisper (which transcribes multilingual speech to text).

While many people use ChatGPT as an alternative or a supplement to a search engine, ChatGPT cannot respond to questions about current events since its training data set only covers information on the internet up until September 2021.

LLMs can also "hallucinate", confidently producing false information. For example, a New York lawyer used ChatGPT to draft a legal brief that included citations to court cases the model had made up.

Enter Meta, which in early 2023 introduced LLaMA, an openly released LLM that comes in multiple model sizes.

Developers can even run the smallest LLaMA model on a laptop. At the other end of the range, LLaMA's largest model was trained on even more data than GPT-3.5.

LLaMA (2023)

~1 Trillion Words

Unlike OpenAI (with ChatGPT) and Google (with Bard), Meta does not yet have an LLM chatbot.

Instead of asking LLaMA questions, developers give it a sequence of words and let it complete the text.

Because LLaMA comes in a range of parameter sizes, developers can more easily retrain and fine-tune the model for specific use cases.

Parameters are loosely analogous to the connections between neurons in a human brain, and an LLM can contain billions of them!
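
To make "parameter" a bit more concrete, here's a tiny, made-up neural network layer with its parameters counted by hand. Real LLMs stack many far larger layers, which is how the totals climb into the billions.

```python
import numpy as np

# One toy layer: 4 inputs -> 3 outputs. Its parameters are the numbers
# (weights and biases) the model adjusts during training.
weights = np.random.randn(4, 3)  # 4 * 3 = 12 weights
biases = np.random.randn(3)      # 3 biases

print(weights.size + biases.size)  # 15 parameters in this one tiny layer
```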

Stanford's Human-Centered Artificial Intelligence program created a fine-tuned version of LLaMA's 7 billion parameter model, known as Alpaca.
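
If you're curious what fine-tuning looks like in practice, here's a minimal sketch using a toy PyTorch model in place of a real 7-billion-parameter LLaMA. The model, the data, and all of the sizes are made up for illustration; the idea is simply that you start from existing weights and keep training them on new examples.

```python
# A toy stand-in for fine-tuning: start from an existing model and keep
# training it on a handful of new examples. Everything here is made up and
# vastly smaller than a real LLaMA model.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64  # toy sizes, nothing like LLaMA's real dimensions

# Pretend this is a pretrained model: word embedding -> "next word" prediction head.
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical fine-tuning data: (current word id, next word id) pairs.
examples = [(3, 17), (17, 42), (42, 7)]

for epoch in range(3):
    for current_word, next_word in examples:
        logits = model(torch.tensor([current_word]))       # scores for every word in the vocab
        loss = loss_fn(logits, torch.tensor([next_word]))  # how wrong was the prediction?
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                   # nudge the parameters
```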

You've probably grown tired of scrolling... but get this: imagine the height of this entire webpage filled with hundred-million-word squares. That is the absolutely massive training data size of OpenAI's GPT-4 (which now powers ChatGPT) and Google's PaLM 2.

GPT-4 and PaLM 2 have training data sets composed of nearly 3 trillion words. You've already scrolled through twenty thousand hundred-million-word squares, amounting to a total of two trillion words.


GPT-4 and PaLM 2 are trained on data sets of nearly 3 trillion words each, more training data than all of the other LLMs before them combined.

Comparison of Training Data Sizes of LLMs by Number of Words

  • BERT (2018): 3.3 billion
  • GPT-2 (2019): 7.5 billion
  • GPT-3.5 (2022): 300 billion
  • LLaMA (2023): ~1 trillion
  • GPT-4 / PaLM 2 (2023): ~3 trillion
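
The word counts in this comparison are rough conversions from the token counts reported in the sources below, using OpenAI's rule of thumb that a token is about three quarters of a word. Here's that conversion as a quick sketch:

```python
# Rough token -> word conversion (1 token is about 3/4 of a word, per OpenAI).
WORDS_PER_TOKEN = 0.75

reported_tokens = {
    "GPT-2": 10e9,     # Lambda Labs
    "LLaMA": 1.4e12,   # Meta AI
    "PaLM 2": 3.6e12,  # CNBC
}

for model, tokens in reported_tokens.items():
    print(f"{model}: ~{tokens * WORDS_PER_TOKEN:,.0f} words")
# GPT-2 -> ~7.5 billion, LLaMA -> ~1.05 trillion, PaLM 2 -> ~2.7 trillion
```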



Sources

  • Inspired by ABC's Story Lab
  • A token is equivalent to about 3/4 of a word (OpenAI)
  • The Harry Potter series has 1,084,170 words in total (Foster Grant)
  • BERT has a corpus size of 3.3 billion words (Chang et al.)
  • GPT-2 has a corpus size of 10 billion tokens (Lambda Labs)
  • GPT-3.5 was trained on about 300 billion words (BBC Science Focus)
  • GPT-3.5 powers other OpenAI services (Meetanshi)
  • Lawyer used ChatGPT to generate a legal brief (New York Times)
  • LLaMA was built on approximately 1.4 trillion tokens (Meta AI)
  • Alpaca was built on LLaMA's 7B model (Stanford University)
  • PaLM 2 is trained on about 3.6 trillion tokens (CNBC)
  • GPT-3 and GPT-3.5 have 175 billion parameters (PC Mag)
  • GPT-4 reportedly has 1.76 trillion parameters, roughly 10x as many as GPT-3.5 (The Decoder)