Test Drive: tokenizers
Today I’m test driving tokenizers: the way text gets split into the tokens that language models actually process. Tokenization is full of quirks, there are several different approaches, and my test drive is inspired by “You Should Probably Pay Attention to Tokenizers”. My task for today is to tokenize the text “the quick from fox jumped over the fence” with two tokenizers. Here’s the code:
import sentence_transformers
import tiktoken
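# WordPiece tokenizer from the BERT-style all-MiniLM-L6-v2 sentence-transformers model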
model = sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2")
tokenized = model.tokenize(["the quick from fox jumped over the fence"])
tokens = model.tokenizer.convert_ids_to_tokens(tokenized["input_ids"][0])
print(tokens)
# ['[CLS]', 'the', 'quick', 'from', 'fox', 'jumped', 'over', 'the', 'fence', '[SEP]']
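# BPE tokenizer for OpenAI's gpt-4o-mini, via tiktoken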
model = tiktoken.encoding_for_model("gpt-4o-mini")
tokenized = model.encode("the quick from fox jumped over the fence")
tokens = [model.decode_single_token_bytes(number) for number in tokenized]
print(tokens)
# [b'the', b' quick', b' from', b' fox', b' jumped', b' over', b' the', b' fence']
This particular example is boring, though even here you can see the two approaches differ: the BERT-style WordPiece tokenizer wraps the sentence in [CLS] and [SEP] special tokens, while the GPT BPE tokenizer keeps each word’s leading space as part of its token. Add emoji or trailing whitespace and it gets more interesting!
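For instance, here’s a rough sketch along those lines (the 🦊 emoji and the trailing spaces are arbitrary additions to the same sentence; the exact splits depend on each tokenizer’s vocabulary, so run it and see):
import sentence_transformers
import tiktoken
text = "the quick from fox jumped over the fence 🦊   "
# WordPiece: characters missing from the vocabulary may come out as [UNK],
# and trailing whitespace is typically stripped before splitting
model = sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2")
tokenized = model.tokenize([text])
print(model.tokenizer.convert_ids_to_tokens(tokenized["input_ids"][0]))
# BPE: the emoji may be split across several byte-level tokens, and the
# trailing spaces usually get tokens of their own
encoding = tiktoken.encoding_for_model("gpt-4o-mini")
print([encoding.decode_single_token_bytes(t) for t in encoding.encode(text)])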