Type-Token Frequency
Definition
The distinction between how many times a specific item appears in the input (token frequency) and how many different items instantiate a pattern (type frequency); the two affect acquisition in different ways.
In-Depth Explanation
Type-token frequency refers to the relationship between types (distinct word forms) and tokens (total word occurrences) in a language sample — a fundamental metric in corpus linguistics and lexical analysis. The measure is most associated with lexical diversity assessment: how wide and varied is a speaker’s or writer’s vocabulary use?
Core definitions:
| Concept | Definition | Example |
|---|---|---|
| Token | Every word occurrence in a sample | “the cat sat on the mat” → 6 tokens |
| Type | Each distinct word form | “the cat sat on the mat” → 5 types (definite article the counts once) |
| Type-Token Ratio (TTR) | Types ÷ Tokens | 5 ÷ 6 = 0.833 |
| Lemma | Inflected forms grouped under a shared base form | runs, ran, running → 1 lemma |
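The counts in the table can be reproduced in a few lines of Python. This is a minimal sketch using whitespace tokenisation and case-folding; real corpus work would also handle punctuation and lemmatisation:

```python
def lexical_counts(text: str):
    """Tokens, types, and TTR under naive whitespace tokenisation.

    Case-folds so 'The' and 'the' count as one type; punctuation
    handling and lemmatisation are out of scope for this sketch.
    """
    tokens = text.lower().split()
    types = set(tokens)
    return len(tokens), len(types), len(types) / len(tokens)

print(lexical_counts("the cat sat on the mat"))  # 6 tokens, 5 types, TTR ≈ 0.833
```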
Type-Token Ratio (TTR):
The simplest lexical diversity measure. Higher TTR = more varied vocabulary. Key limitation: TTR is sensitive to text length — as a text grows longer, repeated function words inevitably accumulate, causing TTR to fall steadily. A 50-token sample and a 1000-token sample from the same speaker are not directly comparable by raw TTR.
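The length-sensitivity problem is easy to demonstrate. In this illustrative sketch, a longer sample is built by repeating one sentence, so the type inventory is capped while tokens keep accumulating, and TTR falls:

```python
def ttr(tokens):
    """Type-token ratio of a token list."""
    return len(set(tokens)) / len(tokens)

# Repeating one sentence caps the type inventory at 5
# while the token count keeps growing.
tokens = ("the cat sat on the mat " * 10).split()

for n in (6, 12, 30, 60):
    print(n, round(ttr(tokens[:n]), 3))
# 6 0.833
# 12 0.417
# 30 0.167
# 60 0.083
```

Real texts fall less steeply than this worst case, but the downward drift with length is the same, which is why raw TTR cannot compare samples of different sizes.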
Advanced lexical diversity measures:
| Measure | Method | Advantage |
|---|---|---|
| VOCD (D) | D statistic from TTR curves | Less sensitive to text length |
| MTLD | Mean length of run maintaining .720 threshold | Relatively length-independent |
| Maas index | Logarithmic transformation of TTR | Partial length correction |
| HD-D | Hypergeometric distribution approach | Theoretically robust, computationally heavy |
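The HD-D row can be made concrete with a short sketch. It assumes the standard 42-token sampling window from McCarthy & Jarvis (2010); note that published implementations sometimes rescale the index (e.g. report a mean probability) rather than leave it as an expected type count:

```python
from collections import Counter
from math import comb

def hdd(tokens, sample_size=42):
    """Expected number of distinct types in a random draw of
    `sample_size` tokens, via the hypergeometric distribution.

    For a type occurring f times in a text of N tokens, the chance it
    appears at least once in the draw is 1 - C(N-f, s) / C(N, s);
    summing over all types gives the expected type count.
    """
    n = len(tokens)
    if n < sample_size:
        raise ValueError("text is shorter than the sampling window")
    total = comb(n, sample_size)
    return sum(1 - comb(n - f, sample_size) / total
               for f in Counter(tokens).values())
```

Because the draw size is fixed, `hdd()` values are comparable across texts of different lengths, which is exactly what raw TTR lacks.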
MTLD (Measure of Textual Lexical Diversity):
The most widely used length-robust measure in current SLA writing research. MTLD computes the mean length of sequential runs of words that sustain a TTR above a .720 threshold: the longer a text keeps running before repetition drags its TTR down, the higher the score. McCarthy & Jarvis (2010) validated MTLD against multiple criteria.
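A one-directional MTLD pass can be sketched as follows. The published measure averages a forward and a reversed pass, and implementations differ in small details such as how the final partial factor is handled:

```python
def mtld_oneway(tokens, threshold=0.72):
    """One-directional MTLD pass.

    Keep a running TTR over the current run; when it dips to the
    threshold, score one complete 'factor' and reset.  Whatever is
    left at the end counts as a partial factor, scaled by how far
    its TTR has fallen.  MTLD = total tokens / total factors.
    """
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1.0
            types, count = set(), 0
    if count:  # partial factor for the leftover run
        run_ttr = len(types) / count
        factors += (1.0 - run_ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors else float("inf")

def mtld(tokens, threshold=0.72):
    """McCarthy & Jarvis average a forward and a reversed pass."""
    return (mtld_oneway(tokens, threshold)
            + mtld_oneway(tokens[::-1], threshold)) / 2
```

A maximally repetitive text ends a factor every couple of tokens and scores low; a text with no repetition at all never closes a factor, so this sketch reports infinite diversity for it.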
Frequency effects and acquisition:
Word frequency is a critical dimension of lexical acquisition independent of TTR:
- High-frequency words (the, a, is, of, etc.) are acquired early and appear in all samples
- Acquisition is sensitive to token frequency: a word encountered many times acquires phonological, semantic, and collocational strength faster than rarely encountered words (Ellis 2002)
- Breadth of type exposure matters for lexical richness — extensive reading covers more types than narrow input
L2 writing and TTR research:
Japanese and other L2 English writing researchers use lexical diversity measures to:
- Compare novice and expert L2 writing
- Track longitudinal development in writing programmes
- Evaluate vocabulary richness in automated writing evaluation (AWE) systems
- Control for lexical sophistication in corpus studies of learner language
History
The type-token distinction dates to C.S. Peirce’s semiotic terminology (late 19th century). In corpus linguistics, TTR served as a ready-made diversity measure from the earliest quantitative text studies. Yule (1944) proposed an early statistical correction (Yule’s K). Computational corpus analysis (1980s–2000s) exposed TTR’s length-sensitivity problem systematically. Malvern et al. (2004) developed VOCD/D, and McCarthy & Jarvis (2010) validated MTLD, establishing it as the current standard. Automated writing evaluation tools (e-rater, Turnitin, etc.) incorporate lexical diversity metrics.
Common Misconceptions
- “Higher TTR always means better writing.” Extremely high TTR in a short text is often a mechanical effect of brevity rather than a mark of quality. Native-speaker academic prose shows moderate TTR, since appropriate repetition is valued for cohesion. Highly repetitive beginner writing does show meaningfully low TTR, but the relationship is not linear across proficiency levels.
- “Type-token ratio is directly comparable across text lengths.” It is not. Only length-corrected measures (MTLD, D) are valid for cross-text comparisons. Raw TTR comparisons across texts of different lengths are methodologically invalid.
- “Word frequency and lexical diversity measure the same thing.” Frequency is about how often individual words occur in language generally; lexical diversity is about how varied word use is within a specific sample. A learner can use many high-frequency words (low diversity) or many low-frequency words (high diversity) — both dimensions matter independently.
Social Media Sentiment
Type-token frequency is an academic concept rarely appearing in general language learning social media. It appears in SLA/linguistics academic Twitter, applied linguistics discussions of writing assessment, and computational linguistics/NLP communities. For Japanese learners, word frequency is discussed extensively (via resources like Frequency Analysis, Core Decks, Kaishi, etc.) — frequency-based vocabulary study is mainstream advice. Lexical diversity as a quality metric for Japanese production is not commonly discussed in learner communities.
Last updated: 2026-04
Practical Application
- Vocabulary breadth targeting: Using frequency-ranked vocabulary lists (JLPT-organised vocab, frequency dictionaries for Japanese) targets type coverage strategically — ensuring exposure to a wide range of types rather than repeated tokens from a narrow range.
- Writing self-assessment: Run a vocabulary diversity analysis on your Japanese writing samples at different points in study. Analyse whether your lexical profile is expanding (more types across samples) or stagnating.
- Frequency-based Anki decks: Systems like Optimised Kaishi or frequency-ordered decks function as type-breadth tools, prioritising the most frequently encountered types first.
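The self-assessment idea above can be sketched crudely: compare the type inventories of an earlier and a later writing sample and see what is new. Whitespace tokenisation stands in for real segmentation here; Japanese text has no spaces, so it would first need a morphological analyser such as MeCab:

```python
def lexical_profile(sample: str):
    """Type inventory under naive whitespace tokenisation.

    Japanese text must be segmented first (e.g. with MeCab)
    before types can be counted this way.
    """
    return set(sample.lower().split())

early = "i like cats i like dogs"             # earlier writing sample
later = "i adore cats yet prefer shiba dogs"  # later writing sample

new_types = lexical_profile(later) - lexical_profile(early)
print(sorted(new_types))  # ['adore', 'prefer', 'shiba', 'yet']
```

A growing set of new types across samples suggests an expanding lexical profile; a shrinking one suggests stagnation, though sample length should be held roughly constant for the comparison to mean much.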
Related Terms
See Also
Sources
- McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392. Validates MTLD as the most reliable lexical diversity measure across text lengths; now the standard in L2 writing research.
- Ellis, N. C. (2002). Frequency effects in language processing. Studies in Second Language Acquisition, 24(2), 143–188. Theoretical synthesis of frequency effects in SLA — covers both token frequency (depth of processing) and type frequency (abstraction of patterns).
- Nation, I. S. P. (2001). Learning Vocabulary in Another Language. Cambridge University Press. Comprehensive framework for vocabulary breadth, depth, and frequency in L2 acquisition; foundational reference for lexical diversity research in SLA.