Type-Token Frequency

Type-Token Frequency: the distinction between how many times a specific item appears in input (token frequency) and how many different items instantiate a pattern (type frequency). The two affect acquisition differently.

Definition

Token frequency is the number of times a specific item appears in the input. Type frequency is the number of different items that instantiate a pattern. The two affect acquisition differently.

In-Depth Explanation

In corpus linguistics and lexical analysis, type-token frequency refers to the relationship between types (distinct word forms) and tokens (total word occurrences) in a language sample. It is most closely associated with lexical diversity assessment, which asks how wide and varied a speaker’s or writer’s vocabulary use is.

Core definitions:

Concept | Definition | Example
Token | Every word occurrence in a sample | “the cat sat on the mat” → 6 tokens
Type | Each distinct word form | “the cat sat on the mat” → 5 types (definite article the counts once)
Type-Token Ratio (TTR) | Types ÷ Tokens | 5 ÷ 6 = 0.833
Lemma | Inflected forms grouped by base form | runs, ran, running → 1 lemma

Type-Token Ratio (TTR):

TTR is the simplest lexical diversity measure: a higher TTR indicates more varied vocabulary. Its key limitation is sensitivity to text length. As a text grows longer, repeated words (function words especially) inevitably accumulate, so TTR falls steadily. A 50-token sample and a 1000-token sample from the same speaker are therefore not directly comparable by raw TTR.
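The length effect can be illustrated with a minimal Python sketch. The function names `ttr` and `ttr_curve` are illustrative, not taken from any library, and the sample is an artificially repeated sentence chosen only to make the arithmetic easy to follow. Tokens are assumed to be already split and case-normalised.

```python
def ttr(tokens):
    """Type-token ratio: distinct word forms divided by total occurrences."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty sample")
    return len(set(tokens)) / len(tokens)


def ttr_curve(tokens, step):
    """TTR of growing prefixes of the sample, measured every `step` tokens."""
    return [(n, ttr(tokens[:n])) for n in range(step, len(tokens) + 1, step)]


# Assumption: whitespace splitting is adequate for this toy English sample.
sentence = "the cat sat on the mat".split()
sample = sentence * 3  # 18 tokens, still only 5 types

for n, value in ttr_curve(sample, step=6):
    print(n, round(value, 3))
```

No new types enter after the first six tokens, so the ratio drops from 5/6 to 5/12 to 5/18 as the prefix grows. Real text adds new types more slowly than it adds tokens, which produces the same downward trend less abruptly.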

Advanced lexical diversity measures:

Measure | Method | Advantage
VOCD (D) | D statistic from TTR curves | Less sensitive to text length
MTLD | Mean length of run maintaining 0.720 threshold | Relatively length-independent
Maas index | Logarithmic transformation of TTR | Partial length correction
HD-D | Hypergeometric distribution approach | Theoretically robust, computationally heavy

MTLD (Measure of Textual Lexical Diversity):

MTLD is the most widely used length-robust measure in current SLA writing research. It calculates the mean length of sequential word runs that keep TTR above a threshold of 0.720; the longer a run lasts before repetition pulls TTR down to the threshold, the higher the diversity score. McCarthy & Jarvis (2010) validated MTLD against multiple criteria.
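The run-counting idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the names `mtld` and `_mtld_pass` are hypothetical, a run ("factor") is closed when TTR drops strictly below the threshold (implementations differ on whether reaching the threshold also counts), the leftover run is credited as a partial factor, and the forward and backward passes are averaged. A sample with no repetition at all produces zero factors, which this sketch reports as infinity.

```python
def _mtld_pass(tokens, threshold):
    """One directional pass: tokens divided by the number of factors."""
    factors = 0.0
    seen = set()
    run_length = 0
    for token in tokens:
        seen.add(token)
        run_length += 1
        # Assumption: a factor closes when the running TTR falls below the threshold.
        if len(seen) / run_length < threshold:
            factors += 1.0
            seen.clear()
            run_length = 0
    if run_length:
        # Credit the unfinished run in proportion to how far its TTR has fallen.
        leftover_ttr = len(seen) / run_length
        factors += (1.0 - leftover_ttr) / (1.0 - threshold)
    if factors == 0:
        return float("inf")  # no repetition observed, so the score is unbounded
    return len(tokens) / factors


def mtld(tokens, threshold=0.72):
    """Average of a forward and a backward pass over the token list."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("MTLD is undefined for an empty sample")
    forward = _mtld_pass(tokens, threshold)
    backward = _mtld_pass(tokens[::-1], threshold)
    return (forward + backward) / 2


print(mtld(["a", "a", "a", "a"]))               # every second token closes a factor
print(mtld("the cat sat on the mat".split()))   # one partial factor only
```

Very short samples like these are shown only to make the mechanics traceable by hand; scores from texts this small should not be interpreted.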

Frequency effects and acquisition:

Word frequency shapes lexical acquisition independently of TTR:

  • High-frequency words (the, a, is, of, etc.) are acquired early and appear in virtually every sample
  • Acquisition is sensitive to token frequency: a word encountered many times gains phonological, semantic, and collocational strength faster than a rarely encountered one (Ellis 2002)
  • Breadth of type exposure matters for lexical richness: extensive reading covers more types than narrow input
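The two kinds of frequency can be counted separately, as in the Python sketch below. The function names and the sample sentence are illustrative. The pattern test (a word ending in "ed") is a deliberately crude stand-in for "instantiates the regular past tense" and would wrongly match words such as "bed"; a real analysis would use morphological annotation.

```python
from collections import Counter


def token_frequencies(tokens):
    """How many times each individual item occurs in the input."""
    return Counter(tokens)


def type_frequency(tokens, matches_pattern):
    """How many different items instantiate a pattern."""
    return len({token for token in tokens if matches_pattern(token)})


# Hypothetical input sample, already tokenised and lower-cased.
tokens = "she walked home and he walked too then they talked and jumped".split()


def ends_in_ed(word):
    # Assumption: suffix matching approximates the regular past-tense pattern.
    return word.endswith("ed")


print(token_frequencies(tokens)["walked"])   # token frequency of one item
print(type_frequency(tokens, ends_in_ed))    # type frequency of the pattern
```

Here "walked" has a token frequency of 2, while the pattern has a type frequency of 3 (walked, talked, jumped). Repeating "walked" more often would raise the first number without changing the second.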

L2 writing and TTR research:

Researchers studying L2 writing, in Japanese as well as L2 English contexts, use lexical diversity measures to:

  • Compare novice and expert L2 writing
  • Track longitudinal development in writing programmes
  • Evaluate vocabulary richness in automated writing evaluation (AWE) systems
  • Control for lexical sophistication in corpus studies of learner language

History

The type-token distinction dates to C.S. Peirce’s semiotic terminology (early 20th century). In corpus linguistics, TTR served as a ready-made diversity measure from the earliest quantitative text studies. Yule (1944) proposed an early statistical correction (Yule’s K). Computational corpus analysis (1980s–2000s) systematically exposed TTR’s length-sensitivity problem. Malvern et al. (2004) developed VOCD/D, and McCarthy & Jarvis (2010) validated MTLD, establishing it as the current standard. Automated writing evaluation tools (e-rater, Turnitin, etc.) incorporate lexical diversity metrics.

Common Misconceptions

  • “Higher TTR always means better writing.” Extremely high TTR in short texts often reflects excessive variability rather than quality. Native-speaker academic prose has a moderate TTR, because appropriate repetition supports cohesion. Highly repetitive beginner writing does show a meaningfully low TTR, but the relationship is not linear across proficiency levels.
  • “Type-token ratio is directly comparable across text lengths.” It is not. Raw TTR falls as texts get longer, so comparing it across texts of different lengths is methodologically invalid. Only length-corrected measures (MTLD, D) are valid for cross-text comparisons.
  • “Word frequency and lexical diversity measure the same thing.” Frequency is about how often individual words occur in the language generally; lexical diversity is about how varied word use is within a specific sample. A learner can write a varied text using only high-frequency words, or a repetitive one built on a few low-frequency words. The two dimensions matter independently.

Social Media Sentiment

Type-token frequency is an academic concept that rarely appears in general language-learning social media. It does appear on SLA and linguistics academic Twitter, in applied linguistics discussions of writing assessment, and in computational linguistics/NLP communities. Among Japanese learners, word frequency is discussed extensively (via resources like Frequency Analysis, Core Decks, Kaishi, etc.), and frequency-based vocabulary study is mainstream advice. Lexical diversity as a quality metric for Japanese production is not commonly discussed in learner communities.

Last updated: 2026-04

Practical Application

  • Vocabulary breadth targeting: Frequency-ranked vocabulary lists (JLPT-organised vocab, frequency dictionaries for Japanese) target type coverage strategically, ensuring exposure to a wide range of types rather than repeated tokens from a narrow range.
  • Writing self-assessment: Run a vocabulary diversity analysis on your Japanese writing samples at different points in your study, and check whether your lexical profile is expanding (more types across samples) or stagnating.
  • Frequency-based Anki decks: Systems like Optimised Kaishi or frequency-ordered decks function as type-breadth tools by prioritising the most frequently encountered types first.
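A self-assessment along these lines could be sketched in Python as below. Because raw counts depend on text length, the sketch compares the number of types within the same number of tokens from each sample. The function names, the sample labels, and the token lists are all hypothetical. Japanese text has no spaces, so the samples are assumed to have been segmented already by a morphological analyser, which is not shown here.

```python
def types_at_length(tokens, n):
    """Number of distinct types among the first n tokens of a sample."""
    if len(tokens) < n:
        raise ValueError("sample is shorter than the comparison length")
    return len(set(tokens[:n]))


def compare_samples(samples, n=None):
    """Type counts per sample, all measured at the same token length."""
    if n is None:
        # Default to the shortest sample so that every sample can be measured.
        n = min(len(tokens) for tokens in samples.values())
    return {label: types_at_length(tokens, n) for label, tokens in samples.items()}


# Hypothetical, pre-segmented writing samples from two points in study.
samples = {
    "month_1": ["私", "は", "学生", "です", "私", "は"],
    "month_6": ["昨日", "友達", "と", "映画", "を", "見", "まし", "た"],
}

print(compare_samples(samples))
```

Samples this small are only for illustration. For real writing, a length-corrected measure such as MTLD or D is a more robust choice than a fixed-length type count.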

Sources

  • McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392. Validates MTLD as the most reliable lexical diversity measure across text lengths; now the standard in L2 writing research.
  • Ellis, N. C. (2002). Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition, 24(2), 143–188. Theoretical synthesis of frequency effects in SLA; covers both token frequency (depth of processing) and type frequency (abstraction of patterns).
  • Nation, I. S. P. (2001). Learning Vocabulary in Another Language. Cambridge University Press. Comprehensive framework for vocabulary breadth, depth, and frequency in L2 acquisition; foundational reference for lexical diversity research in SLA.