
Test Drive: tokenizers


Today I’m test driving tokenizers, the components that split text into the tokens a language model actually processes. Tokenization is full of quirks, there are several different approaches, and my test drive is inspired by “You Should Probably Pay Attention to Tokenizers”. My task for today is to tokenize the text “the quick brown fox jumped over the fence” with two tokenizers. Here’s the code:

import sentence_transformers
import tiktoken

# WordPiece tokenization via the BERT-based embedding model
model      = sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2")
tokenized  = model.tokenize(["the quick brown fox jumped over the fence"])
tokens     = model.tokenizer.convert_ids_to_tokens(tokenized["input_ids"][0])
print(tokens)
# ['[CLS]', 'the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'fence', '[SEP]']

# Byte-pair encoding (BPE) via tiktoken
model      = tiktoken.encoding_for_model("gpt-4o-mini")
tokenized  = model.encode("the quick brown fox jumped over the fence")
tokens     = [model.decode_single_token_bytes(number) for number in tokenized]
print(tokens)
# [b'the', b' quick', b' brown', b' fox', b' jumped', b' over', b' the', b' fence']

This particular example is boring, but even here you can see the two approaches differ: the WordPiece tokenizer brackets the sentence with [CLS] and [SEP] markers, while the byte-pair encoder folds the leading space into each token. And if you add emoji or trailing whitespace it gets more interesting!
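
Here’s a quick sketch of those “more interesting” cases, reusing the tiktoken encoding from above. The exact splits depend on the vocabulary, so the commented output is only illustrative:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")
for text in ["the quick brown fox 🦊", "the quick brown fox   "]:
    ids = enc.encode(text)
    # decode each id back to raw bytes to see exactly how the text was split
    print([enc.decode_single_token_bytes(i) for i in ids])
# The emoji typically comes back as one or more raw-byte tokens
# (its UTF-8 encoding is b'\xf0\x9f\xa6\x8a'), and a run of trailing
# spaces may collapse into a single whitespace token.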
