Definition
A quantitative measure of how many different words (types) appear in a language sample relative to the total number of words (tokens). Higher lexical diversity indicates broader vocabulary use.
In-Depth Explanation
The simplest measure of lexical diversity is the type-token ratio (TTR): the number of unique word types divided by the total number of tokens. A text of 100 words that uses 65 distinct words has a TTR of 0.65.
TTR has a well-known problem: it is sensitive to text length. As a text grows, every word adds a token, but new types appear at a diminishing rate because the same common words (function words, recurrent nouns) keep repeating. Longer texts therefore produce lower TTRs almost mechanically, and comparing TTR across texts of different lengths is misleading.
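The ratio and its length sensitivity can be illustrated with a minimal Python sketch. The function names `type_token_ratio` and `ttr_curve` are illustrative choices, not part of any library, and the tokens are assumed to be pre-split, lowercased word forms.

```python
def type_token_ratio(tokens):
    """Return the number of unique types divided by the number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty token list")
    return len(set(tokens)) / len(tokens)


def ttr_curve(tokens, step):
    """Return (prefix_length, TTR) pairs for growing prefixes of the text."""
    return [
        (n, type_token_ratio(tokens[:n]))
        for n in range(step, len(tokens) + 1, step)
    ]


# Toy text (an assumption for illustration): one six-word sentence,
# repeated three times, so no new types appear after the first pass.
tokens = "the cat sat on the mat".split() * 3

for length, ratio in ttr_curve(tokens, step=6):
    print(length, round(ratio, 2))
```

Every prefix of the toy text contains the same five types, so the ratio falls from 0.83 to 0.42 to 0.28 as the token count grows, even though the vocabulary in use has not changed.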
Several measures have been developed to correct for length-sensitivity:
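VOCD / D-measure (Malvern & Richards, 2002): Estimates diversity by drawing repeated random samples of several sizes from the text, computing the mean TTR at each size, and fitting a theoretical curve to those values. The best-fitting curve parameter, D, serves as a single summary index that is far less sensitive to text length than raw TTR, although later work has questioned whether it is fully independent of length.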
MTLD (Measure of Textual Lexical Diversity, McCarthy & Jarvis, 2010): Calculates the mean length of sequential word strings in a text over which the TTR remains above a threshold (0.72). The longer the average run before TTR degradation, the higher the MTLD score.
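HD-D (Hypergeometric Distribution D): An analytic counterpart to VOCD. Instead of drawing random samples, it uses the hypergeometric distribution to compute, for each word type, the probability that the type appears at least once in a random sample of fixed size (42 tokens in the original formulation), and then sums those probabilities. It behaves much like VOCD but rests on a more explicit statistical justification.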
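Two of the measures above are compact enough to sketch in Python. This is a hedged illustration, not a reference implementation: the function names are hypothetical, the handling of the threshold boundary (`<=`), the forward and backward averaging in `mtld`, and the scaling of `hdd` by the sample size are assumptions based on common descriptions of the measures, and published tools may differ in these details. VOCD is omitted because its curve-fitting step needs more machinery than fits here.

```python
import math
from collections import Counter


def _mtld_one_direction(tokens, threshold):
    """Count 'factors': runs that end once the running TTR reaches the threshold."""
    factors = 0.0
    types = set()
    count = 0
    ttr = 1.0
    for token in tokens:
        count += 1
        types.add(token)
        ttr = len(types) / count
        if ttr <= threshold:  # assumption: a run ends at or below the threshold
            factors += 1.0
            types = set()
            count = 0
            ttr = 1.0
    # Partial credit for the unfinished final run.
    factors += (1.0 - ttr) / (1.0 - threshold)
    # Choice made for this sketch: a text that never reaches the threshold
    # has no finite score, so infinity is returned.
    return len(tokens) / factors if factors > 0 else math.inf


def mtld(tokens, threshold=0.72):
    """Mean run length, averaged over forward and backward passes."""
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2


def hdd(tokens, sample_size=42):
    """Expected TTR of a random sample drawn without replacement."""
    n_tokens = len(tokens)
    if n_tokens < sample_size:
        raise ValueError("text is shorter than the sample size")
    total_draws = math.comb(n_tokens, sample_size)
    expected_types = 0.0
    for freq in Counter(tokens).values():
        # Hypergeometric probability that this type is absent from the sample.
        p_absent = math.comb(n_tokens - freq, sample_size) / total_draws
        expected_types += 1.0 - p_absent
    return expected_types / sample_size
```

Under these assumptions, a text that repeats a single word scores the minimum on both measures, while a text with no repetition at all yields an `hdd` value of 1.0 and never completes an MTLD run.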
In SLA research, lexical diversity is used as a developmental index: L2 writers and speakers typically show lower lexical diversity than native speakers at equivalent text lengths, and diversity increases as proficiency develops. Frequent repetition of the same high-frequency words is a marker of early-stage writing.
History
Type-token ratio has been used since at least the 1940s in stylometrics and literary analysis. Herdan (1964) proposed log(types)/log(tokens) as a length-normalized alternative, and Brunet (1978) introduced another length-normalized measure (Brunet’s W). The later shift to the VOCD and MTLD family reflects increasing sophistication about statistical modeling of sampling effects, driven largely by computational linguistics research from the late 1990s onward.
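The two older normalizations are simple enough to sketch in Python. The function names are illustrative. The text above names Brunet’s W without giving its formula; the sketch assumes the commonly cited form W = N^(V^-a), where N is the token count and V the type count, and the default exponent constant of 0.172 is likewise an assumption, since the value varies between sources.

```python
import math


def herdan_c(tokens):
    """Herdan's C: log(types) / log(tokens)."""
    n_types, n_tokens = len(set(tokens)), len(tokens)
    if n_tokens < 2:
        raise ValueError("need at least two tokens (log(1) is zero)")
    return math.log(n_types) / math.log(n_tokens)


def brunet_w(tokens, a=0.172):
    """Brunet's W in the assumed form N ** (V ** -a); lower means more diverse."""
    n_types, n_tokens = len(set(tokens)), len(tokens)
    if n_tokens == 0:
        raise ValueError("W is undefined for an empty token list")
    return n_tokens ** (n_types ** -a)
```

The two indices run in opposite directions: Herdan's C rises toward 1.0 as fewer words repeat, whereas Brunet's W falls as the vocabulary becomes richer.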
Common Misconceptions
“High diversity is always better.” In some genres (technical writing, formal reports), deliberate lexical repetition is appropriate for clarity. Excessive diversity in academic writing can impair readability.
“TTR is still fine for short, equal-length texts.” Even among comparably short texts, TTR is sensitive to differences in content area. Two essays of equal length on very different topics can show different TTRs purely because of topic vocabulary, not because of the writers’ range.
Criticisms
- No single diversity measure is universally agreed upon; VOCD, MTLD, and HD-D can produce different rank orderings.
- Lexical diversity measures treat all word types equally, ignoring the difference between using rare, specialized vocabulary and simply switching between common synonyms.
- Measuring lemma types rather than form types requires lemmatization tools, introducing noise from automatic taggers.
Social Media Sentiment
Lexical diversity rarely appears by name in language learner forums, but the underlying concept comes up in discussions of writing quality and “sounding advanced.” Tools like Grammarly or language learning apps sometimes report vocabulary variety as a quality metric. In academic English testing (IELTS, TOEFL), range of vocabulary is part of how writing and speaking are scored (IELTS, for example, grades “Lexical Resource” as a separate criterion), and exam preparation content focuses on synonyms and collocational variety as ways to improve scores.
Related Terms
- Word Frequency Effect — frequency of vocabulary items underlies diversity measures
- Zipf’s Law — the distributional background for why some words are rare
- Academic Word List — a curated vocabulary set relevant to academic diversity goals
- Formulaic Sequences — high-frequency chunks may reduce perceived diversity if over-used
Research
- Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical Diversity and Language Development: Quantification and Assessment. Palgrave Macmillan.
- McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.
- Johansson, V. (2008). Lexical diversity and lexical density in speech and writing: A developmental perspective. Working Papers, 53, 61–79. Lund University.