Tokenization (Linguistics)

Definition:

Tokenization is the process of segmenting a continuous string of text into discrete, meaningful units called tokens. In linguistics and natural language processing (NLP), tokenization is a fundamental preprocessing step — before any analysis can be done on text (parsing, sentiment analysis, machine translation, etc.), the raw character stream must first be divided into units that the system can work with. What counts as a token depends on the task and the language: in English NLP, tokens are often words, but they can also be subword units, morphemes, or individual characters.


In-Depth Explanation

Word tokenization:

The most intuitive approach treats tokens as words — splitting text at spaces and punctuation boundaries. For English, “the cat sat on the mat” → [“the”, “cat”, “sat”, “on”, “the”, “mat”]. But even this is non-trivial:

  • What do you do with “don’t”? → [“don’t”], [“do”, “n’t”], or [“do”, “not”]?
  • What about hyphenated words like “well-known”?
  • Abbreviations (“Dr.”, “e.g.”), URLs, and emoticons all require dedicated rules or trained models
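
These ambiguities show up even in a minimal tokenizer. The sketch below is plain Python regex, not any particular library's tokenizer: it keeps word-internal apostrophes together and splits off other punctuation, but makes no attempt at abbreviations, URLs, or emoticons.

```python
import re

def simple_word_tokenize(text):
    # A naive sketch: a "word" is a run of word characters, optionally
    # continued by an apostrophe plus more word characters (so "don't"
    # stays one token); anything else that isn't whitespace becomes its
    # own punctuation token. Real tokenizers use far richer rules.
    return re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", text)

print(simple_word_tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(simple_word_tokenize("I don't know."))
# ['I', "don't", 'know', '.']
```

Note that a hyphenated word like “well-known” comes out as three tokens under this rule — exactly the kind of design decision a tokenizer has to make explicitly.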

Subword tokenization:

Modern large language models (LLMs) and NLP systems typically use subword tokenization rather than word-level tokenization. The main methods:

  • Byte Pair Encoding (BPE): Used by GPT models. Iteratively merges the most frequent adjacent pair of symbols (initially characters) into a single new token until a target vocabulary size is reached. “unhappiness” might tokenize as [“un”, “happiness”] or [“un”, “happy”, “ness”].
  • WordPiece: Used by BERT. Similar to BPE but uses a likelihood-based criterion for merging.
  • SentencePiece: Language-agnostic, treats the text as a raw sequence including spaces; useful for morphologically complex languages.

Subword tokenization largely solves the out-of-vocabulary problem: instead of failing on an unknown word, the model can break it into known subword units — in the worst case, individual characters.
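
The BPE merge loop described above can be illustrated on a toy corpus. This is a minimal sketch with an invented three-word corpus and made-up frequencies, not production tokenizer code:

```python
from collections import Counter

def most_frequent_pair(words):
    # words: dict mapping a tuple of symbols -> corpus frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Rewrite every word, replacing each occurrence of `pair`
    # with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a sequence of characters.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # three merge steps
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)
```

After three merges the shared stem “low” has become a single symbol, and the longer words are represented as [“lowe”, “r”] and [“lowe”, “s”, “t”] — frequent substrings get merged first, which is the core idea behind BPE vocabularies.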

Tokenization in morphologically complex languages:

English has relatively simple morphology, so word tokenization is often adequate. Languages with rich morphology present challenges:

  • Japanese and Chinese have no spaces between words — tokenization requires a word segmentation step (itself a non-trivial NLP problem)
  • Arabic and Hebrew attach clitics (conjunctions, prepositions, articles, pronouns) to their host words, so a single written word may correspond to multiple syntactic units
  • Turkish, Finnish, Hungarian are agglutinative — a single word can carry meaning that English expresses with many words; tokenizing at the word level loses morphological structure

Linguistic vs. computational tokenization:

In linguistic analysis (corpus linguistics, linguistic annotation), tokenizers may be designed to align with linguistic theory — splitting at morpheme boundaries, preserving distinctions that matter linguistically. In engineering contexts (search engines, language models), tokenization is pragmatic — optimized for the vocabulary size, model efficiency, and downstream task performance.

Token count and practical implications for language learners:

Modern AI language tools (ChatGPT, translation apps, grammar checkers) have context windows measured in tokens, not words or characters. For English text, one token averages roughly 0.75 words (100 tokens ≈ 75 words), though the exact ratio depends on the tokenizer and the text. Because the context window is fixed, these tools can only process a limited amount of text at once, and language learners working with long texts may run into that limit.
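
The 0.75-words-per-token rule of thumb makes a quick budget estimate possible. A back-of-the-envelope sketch (the ratio is an approximation, not a property of any specific tokenizer):

```python
def estimate_tokens(word_count, words_per_token=0.75):
    # Rule of thumb for English: ~0.75 words per token
    # (100 tokens ≈ 75 words). Real counts vary by tokenizer,
    # language, and text type.
    return round(word_count / words_per_token)

# A 1,500-word essay is roughly 2,000 tokens:
print(estimate_tokens(1500))  # 2000
```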

