Type-Token Frequency
Definition
The distinction between how many times a specific item appears in the input (token frequency) and how many different items instantiate a pattern (type frequency); the two affect acquisition in different ways.
In-Depth Explanation
Type-token frequency refers to the relationship between types (distinct word forms) and tokens (total word occurrences) in a language sample — a fundamental metric in corpus linguistics and lexical analysis. The measure is most associated with lexical diversity assessment: how wide and varied is a speaker’s or writer’s vocabulary use?
Core definitions:
| Concept | Definition | Example |
|---|---|---|
| Token | Every word occurrence in a sample | “the cat sat on the mat” → 6 tokens |
| Type | Each distinct word form | “the cat sat on the mat” → 5 types (definite article the counts once) |
| Type-Token Ratio (TTR) | Types ÷ Tokens | 5 ÷ 6 = 0.833 |
| Lemma | Inflected forms grouped under a shared base form | runs, ran, running → 1 lemma |
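The counts in the table can be reproduced in a few lines of Python. This is a minimal sketch using whitespace tokenisation and case-folding; real corpus work would also handle punctuation and lemmatisation:

```python
def lexical_counts(text: str):
    """Tokens, types, and TTR under naive whitespace tokenisation.

    Case-folds so 'The' and 'the' count as one type; punctuation
    handling and lemmatisation are out of scope for this sketch.
    """
    tokens = text.lower().split()
    types = set(tokens)
    return len(tokens), len(types), len(types) / len(tokens)

print(lexical_counts("the cat sat on the mat"))  # 6 tokens, 5 types, TTR ≈ 0.833
```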
Type-Token Ratio (TTR):
The simplest lexical diversity measure. Higher TTR = more varied vocabulary. Key limitation: TTR is sensitive to text length — as a text grows longer, repeated function words inevitably accumulate, causing TTR to fall steadily. A 50-token sample and a 1000-token sample from the same speaker are not directly comparable by raw TTR.
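The length-sensitivity problem is easy to demonstrate. In this illustrative sketch, a longer sample is built by repeating one sentence, so the type inventory is capped while tokens keep accumulating, and TTR falls:

```python
def ttr(tokens):
    """Type-token ratio of a token list."""
    return len(set(tokens)) / len(tokens)

# Repeating one sentence caps the type inventory at 5
# while the token count keeps growing.
tokens = ("the cat sat on the mat " * 10).split()

for n in (6, 12, 30, 60):
    print(n, round(ttr(tokens[:n]), 3))
# 6 0.833
# 12 0.417
# 30 0.167
# 60 0.083
```

Real texts fall less steeply than this worst case, but the downward drift with length is the same, which is why raw TTR cannot compare samples of different sizes.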
Advanced lexical diversity measures:
| Measure | Method | Advantage |
|---|---|---|
| VOCD (D) | D statistic from TTR curves | Less sensitive to text length |
| MTLD | Mean length of run maintaining .720 threshold | Relatively length-independent |
| Maas index | Logarithmic transformation of TTR | Partial length correction |
| HD-D | Hypergeometric distribution approach | Theoretically robust, computationally heavy |
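The HD-D row can be made concrete with a short sketch. It assumes the standard 42-token sampling window from McCarthy & Jarvis (2010); note that published implementations sometimes rescale the index (e.g. report a mean probability) rather than leave it as an expected type count:

```python
from collections import Counter
from math import comb

def hdd(tokens, sample_size=42):
    """Expected number of distinct types in a random draw of
    `sample_size` tokens, via the hypergeometric distribution.

    For a type occurring f times in a text of N tokens, the chance it
    appears at least once in the draw is 1 - C(N-f, s) / C(N, s);
    summing over all types gives the expected type count.
    """
    n = len(tokens)
    if n < sample_size:
        raise ValueError("text is shorter than the sampling window")
    total = comb(n, sample_size)
    return sum(1 - comb(n - f, sample_size) / total
               for f in Counter(tokens).values())
```

Because the draw size is fixed, `hdd()` values are comparable across texts of different lengths, which is exactly what raw TTR lacks.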
MTLD (Measure of Textual Lexical Diversity):
The most widely used length-robust measure in current SLA writing research. MTLD computes the mean length of sequential runs of words that sustain a TTR above a .720 threshold: the longer a text keeps running before repetition drags its TTR down, the higher the score. McCarthy & Jarvis (2010) validated MTLD against multiple criteria.
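A one-directional MTLD pass can be sketched as follows. The published measure averages a forward and a reversed pass, and implementations differ in small details such as how the final partial factor is handled:

```python
def mtld_oneway(tokens, threshold=0.72):
    """One-directional MTLD pass.

    Keep a running TTR over the current run; when it dips to the
    threshold, score one complete 'factor' and reset.  Whatever is
    left at the end counts as a partial factor, scaled by how far
    its TTR has fallen.  MTLD = total tokens / total factors.
    """
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1.0
            types, count = set(), 0
    if count:  # partial factor for the leftover run
        run_ttr = len(types) / count
        factors += (1.0 - run_ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors else float("inf")

def mtld(tokens, threshold=0.72):
    """McCarthy & Jarvis average a forward and a reversed pass."""
    return (mtld_oneway(tokens, threshold)
            + mtld_oneway(tokens[::-1], threshold)) / 2
```

A maximally repetitive text ends a factor every couple of tokens and scores low; a text with no repetition at all never closes a factor, so this sketch reports infinite diversity for it.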
Frequency effects and acquisition:
Word frequency is a critical dimension of lexical acquisition independent of TTR:
- High-frequency words (the, a, is, of, etc.) are acquired early and appear in all samples
- Acquisition is sensitive to token frequency: a word encountered many times acquires phonological, semantic, and collocational strength faster than rarely encountered words (Ellis 2002)
- Breadth of type exposure matters for lexical richness — extensive reading covers more types than narrow input
L2 writing and TTR research:
Japanese and other L2 English writing researchers use lexical diversity measures to:
- Compare novice and expert L2 writing
- Track longitudinal development in writing programmes
- Evaluate vocabulary richness in automated writing evaluation (AWE) systems
- Control for lexical sophistication in corpus studies of learner language
History
The type-token distinction dates to C.S. Peirce’s semiotic terminology (late 19th century). In corpus linguistics, TTR served as a ready-made diversity measure from the earliest quantitative text studies. Yule (1944) proposed an early statistical correction (Yule’s K). Computational corpus analysis (1980s–2000s) exposed TTR’s length-sensitivity problem systematically. Malvern et al. (2004) developed VOCD/D, and McCarthy & Jarvis (2010) validated MTLD, establishing it as the current standard. Automated writing evaluation tools (e-rater, Turnitin, etc.) incorporate lexical diversity metrics.
Common Misconceptions
- “Higher TTR always means better writing.” Extremely high TTR in a short text is often a mechanical effect of brevity rather than a mark of quality. Native-speaker academic prose shows moderate TTR, since appropriate repetition is valued for cohesion. Highly repetitive beginner writing does show meaningfully low TTR, but the relationship is not linear across proficiency levels.
- “Type-token ratio is directly comparable across text lengths.” It is not. Only length-corrected measures (MTLD, D) are valid for cross-text comparisons. Raw TTR comparisons across texts of different lengths are methodologically invalid.
- “Word frequency and lexical diversity measure the same thing.” Frequency is about how often individual words occur in language generally; lexical diversity is about how varied word use is within a specific sample. A learner can use many high-frequency words (low diversity) or many low-frequency words (high diversity) — both dimensions matter independently.
Social Media Sentiment
Type-token frequency is an academic concept rarely appearing in general language learning social media. It appears in SLA/linguistics academic Twitter, applied linguistics discussions of writing assessment, and computational linguistics/NLP communities. For Japanese learners, word frequency is discussed extensively (via resources like Frequency Analysis, Core Decks, Kaishi, etc.) — frequency-based vocabulary study is mainstream advice. Lexical diversity as a quality metric for Japanese production is not commonly discussed in learner communities.
Last updated: 2026-04
Practical Application
- Vocabulary breadth targeting: Using frequency-ranked vocabulary lists (JLPT-organised vocab, frequency dictionaries for Japanese) targets type coverage strategically — ensuring exposure to a wide range of types rather than repeated tokens from a narrow range.
- Writing self-assessment: Run a vocabulary diversity analysis on your Japanese writing samples at different points in study. Analyse whether your lexical profile is expanding (more types across samples) or stagnating.
- Frequency-based Anki decks: Systems like Optimised Kaishi or frequency-ordered decks function as type-breadth tools, prioritising the most frequently encountered types first.
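The self-assessment idea above can be sketched crudely: compare the type inventories of an earlier and a later writing sample and see what is new. Whitespace tokenisation stands in for real segmentation here; Japanese text has no spaces, so it would first need a morphological analyser such as MeCab:

```python
def lexical_profile(sample: str):
    """Type inventory under naive whitespace tokenisation.

    Japanese text must be segmented first (e.g. with MeCab)
    before types can be counted this way.
    """
    return set(sample.lower().split())

early = "i like cats i like dogs"             # earlier writing sample
later = "i adore cats yet prefer shiba dogs"  # later writing sample

new_types = lexical_profile(later) - lexical_profile(early)
print(sorted(new_types))  # ['adore', 'prefer', 'shiba', 'yet']
```

A growing set of new types across samples suggests an expanding lexical profile; a shrinking one suggests stagnation, though sample length should be held roughly constant for the comparison to mean much.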
Related Terms
See Also
Sources
- McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392. Validates MTLD as the most reliable lexical diversity measure across text lengths; now the standard in L2 writing research.
- Ellis, N. C. (2002). Frequency effects in language processing. Studies in Second Language Acquisition, 24(2), 143–188. Theoretical synthesis of frequency effects in SLA — covers both token frequency (depth of processing) and type frequency (abstraction of patterns).
- Nation, I. S. P. (2001). Learning Vocabulary in Another Language. Cambridge University Press. Comprehensive framework for vocabulary breadth, depth, and frequency in L2 acquisition; foundational reference for lexical diversity research in SLA.