<aside> 💡 My notes from Karpathy’s YouTube tutorial on building GPT, watched around January 20, 2024. These notes are meant to help me recall the things I want to make sure I remember.
</aside>
Google Colab notebook for the lecture!
<aside> 💡 My own note: one of the biggest differentiators between OpenAI’s system and a GPT that I build on my laptop is probably the data the model is trained on. Elon Musk talks about this on Lex Fridman’s podcast: the model’s codebase is actually quite simple, and the heavy lift is filtering through all of the noise on the internet to get good data to train the model on.
</aside>
first, pull out all unique characters that occur in the text
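A quick sketch of that step (assuming the tiny Shakespeare file from the lecture has already been downloaded as input.txt):

```python
# read in the training text (input.txt = tiny shakespeare from the lecture)
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# the sorted set of unique characters in the text is the vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)  # 65 for tiny shakespeare
```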
will then map characters to integers (encode = characters -> integers translation, vice versa for decode)
you do this using lambda functions!
<aside>
💡 side note for me: lambda functions are small, anonymous functions that are usually used for short, simple, one-off operations. their notation is: lambda parameters: expression
</aside>
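A tiny example of my own to make the notation concrete:

```python
# lambda parameters: expression -- the two definitions below are equivalent
square = lambda x: x * x
def square_def(x):
    return x * x

print(square(4))      # 16
print(square_def(4))  # 16
```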
```python
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]           # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # decoder: take a list of integers, output a string

print(encode("hii there"))          # [46, 47, 47, 1, 58, 46, 43, 56, 43]
print(decode(encode("hii there")))  # hii there
```
can do encoding in different ways; the scheme above is character-level, one of the simplest. Google, for example, uses SentencePiece, which encodes at the sub-word level instead of character by character
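A rough sketch of what that looks like with the sentencepiece Python package (the file name, model_prefix, and vocab_size below are my own illustrative choices, not from the lecture):

```python
import sentencepiece as spm

# train a small sub-word tokenizer on the same text file (parameters are illustrative)
spm.SentencePieceTrainer.train(
    input='input.txt', model_prefix='shakespeare_sp', vocab_size=500
)

# load the trained model and round-trip a string, like the character-level example above
sp = spm.SentencePieceProcessor(model_file='shakespeare_sp.model')
ids = sp.encode('hii there', out_type=int)  # string -> list of sub-word ids
print(ids)
print(sp.decode(ids))                       # list of ids -> string
```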