What is a token in an LLM?
A token is the basic unit of text a language model reads and generates. Tokens are roughly word fragments: common English words are usually a single token, while longer or rarer words are split into several pieces by the model's tokenizer. Punctuation, whitespace, and code symbols also count toward the total. As a rough rule of thumb, one token corresponds to about 4 characters of English text, so 1,000 tokens is approximately 750 words.
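That 4-characters-per-token rule can be turned into a quick estimator. This is a minimal sketch of the heuristic only; the function name and the `chars_per_token` default are illustrative, and exact counts require the model's actual tokenizer (for example, OpenAI models use the tiktoken library).

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 characters-per-token heuristic.

    This is an approximation for English prose; real tokenizers give
    exact, model-specific counts.
    """
    if not text:
        return 0
    # Divide character count by the average characters per token,
    # rounding to the nearest whole token (at least 1 for non-empty text).
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Hello, world!"))  # 13 characters -> about 3 tokens
print(estimate_tokens("a" * 4000))       # 4,000 characters -> about 1,000 tokens
```

The estimate is only a ballpark: code, non-English text, and unusual vocabulary all tokenize less efficiently than typical English prose, so real counts can be noticeably higher.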