Google's language-independent tokenizer, which treats text as a raw Unicode stream and works without pre-tokenization. It's the tokenizer behind T5, ALBERT, and XLNet, and it supports both the BPE and Unigram algorithms. The big win is multilingual support, especially for CJK languages, where you can't simply split on spaces. It's deterministic, lightweight (about 6 MB of memory), and segments roughly 50k sentences per second. Training on 100 MB of text takes a few minutes. Hugging Face Tokenizers is about 4x faster if raw speed is the priority, but SentencePiece is the standard choice when you need reproducible tokenization across languages or want to train on raw text without language-specific rules.
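To make the "no pre-tokenization" point concrete, here is a minimal pure-Python sketch of the underlying idea: whitespace is escaped to the meta symbol `▁` (U+2581) and then treated like any other character, so the original string can be rebuilt exactly. The helpers `toy_encode` and `toy_decode` are hypothetical names for illustration only. They split into single characters, whereas a trained SentencePiece model would emit longer learned pieces and also applies normalization that this sketch skips.

```python
from typing import List

WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def toy_encode(text: str) -> List[str]:
    """Hypothetical helper: escape spaces, then split into single characters.

    Assumes the input does not already contain the "▁" symbol.
    """
    return list(text.replace(" ", WS))


def toy_decode(pieces: List[str]) -> str:
    """Hypothetical helper: concatenate pieces and restore the spaces."""
    return "".join(pieces).replace(WS, " ")


# The same code path handles space-delimited and non-delimited scripts.
samples = ["Hello world.", "こんにちは世界", "Tokyo は東京"]
for sample in samples:
    pieces = toy_encode(sample)
    assert toy_decode(pieces) == sample  # lossless round trip
    print(pieces)
```

Because spaces survive as ordinary symbols, detokenization needs no language-specific rules about where spaces belong.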
npx skills add https://github.com/orchestra-research/ai-research-skills --skill sentencepiece