Definition:
Vocabulary size (also called vocabulary breadth) is a measure of how many distinct words (typically counted in word families) a person knows in a language. It is the most commonly reported metric of overall lexical development in SLA research and pedagogical assessment. Vocabulary size is closely related to reading and listening comprehension — knowing more words means more text can be understood. Landmark research by Paul Nation and colleagues has established vocabulary size benchmarks for different text comprehension thresholds and communicative functions, providing actionable targets for language learners and curriculum designers.
Units of Counting
Vocabulary size is typically measured in word families (base word + inflections + common derivatives), not raw word forms. Because knowing “teach” gives substantial access to “teacher,” “teaching,” and “taught,” counting word families more accurately reflects usable lexical knowledge.
Alternative units used in research:
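- Lemmas (base + inflections only): a narrower unit than the word family, so the same knowledge yields a higher count in lemmas than in word families
- Types (individual word forms): the narrowest unit; yields the highest count, which overstates the number of distinct items known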
- Tokens (total occurrences including repetitions): not used for vocabulary size (measures corpus frequency, not individual knowledge)
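The difference between these units can be illustrated with a minimal sketch. The `FORM_INFO` mapping below is a hypothetical hand-built mini-lexicon, not a published family list, and `count_units` is an illustrative helper name; a real study would rely on an established word family list.

```python
# Hypothetical mini-lexicon: word form -> (lemma, word family).
# Entries are illustrative only, not taken from a published list.
FORM_INFO = {
    "teach": ("teach", "teach"),
    "teaches": ("teach", "teach"),
    "taught": ("teach", "teach"),
    "teacher": ("teacher", "teach"),
    "teachers": ("teacher", "teach"),
    "nation": ("nation", "nation"),
    "nations": ("nation", "nation"),
    "national": ("national", "nation"),
}


def count_units(tokens):
    """Count tokens, types, lemmas, and word families among recognized forms."""
    recognized = [t.lower() for t in tokens if t.lower() in FORM_INFO]
    types = set(recognized)                       # distinct word forms
    lemmas = {FORM_INFO[t][0] for t in types}     # base + inflections
    families = {FORM_INFO[t][1] for t in types}   # base + inflections + derivatives
    return {
        "tokens": len(recognized),
        "types": len(types),
        "lemmas": len(lemmas),
        "families": len(families),
    }


sample = "teach teaches taught teacher teachers nation nations national teach".split()
counts = count_units(sample)
print(counts)  # {'tokens': 9, 'types': 8, 'lemmas': 4, 'families': 2}
```

The same nine running words shrink to eight types, four lemmas, and two word families, which is why a size estimate is only interpretable once its counting unit is stated.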
Nation’s Vocabulary Coverage Research
Paul Nation’s corpus-based research established the relationship between vocabulary size and text coverage, the percentage of running words in a text that a reader knows:
| Word Families Known | Approximate Text Coverage | Implication |
|---|---|---|
| 1,000 | ~78% of typical text | Very difficult; roughly 1 in 5 words unknown |
| 2,000 | ~87% | Still quite challenging |
| 3,000 | ~90% | Better but still ~1 in 10 words unknown |
| 5,000 | ~95% | Usable reading with some dictionary support |
| 8,000–9,000 | ~98% | Comfortable independent reading |
| 14,000–15,000+ | ~99%+ | Near-native reading comfort |
The critical threshold: 98% coverage (~8,000–9,000 word families) is needed for fluent independent reading without excessive disruption from unknown words. This is the target for advanced L2 reading.
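Text coverage is a simple ratio, and a rough sketch may make the thresholds more concrete. The function names below are illustrative, and the sketch matches exact word forms rather than word families, which is a simplification of how coverage studies actually group words.

```python
import re


def text_coverage(text, known_words):
    """Percentage of running words (tokens) in `text` found in `known_words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in known_words)
    return 100.0 * known / len(tokens)


def unknown_word_spacing(coverage_pct):
    """Average number of running words per unknown word at a given coverage."""
    if coverage_pct >= 100.0:
        return float("inf")  # no unknown words at all
    return 100.0 / (100.0 - coverage_pct)


# Hypothetical learner who knows six word forms.
known = {"the", "cat", "sat", "on", "a", "mat"}
text = "The cat sat on the mat. A cat sat on a veranda."
coverage = text_coverage(text, known)  # 11 of 12 running words are known

print(round(coverage, 1))
for pct in (90.0, 95.0, 98.0):
    print(pct, unknown_word_spacing(pct))
```

Read this way, 90% coverage means one unknown word in every 10 running words, 95% means one in 20, and 98% means one in 50.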
Average Native Speaker Vocabulary Size
Research (Nation; Goulden, Nation & Read) estimates adult native English speakers know:
- Roughly 17,000–20,000 word families (conservative estimates)
- Some higher estimates, which count individual word forms rather than word families, reach 50,000–100,000
This contextualizes L2 learning: even advanced L2 learners typically top out well below native speaker levels in vocabulary breadth.
Measuring Vocabulary Size
Vocabulary Size Test (VST) — Paul Nation & David Beglar:
Samples words from frequency bands (1,000–14,000 word families); scores estimate total vocabulary size in thousands of word families. Widely used in research and available for free.
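Because the test samples a fixed number of items from each frequency band, a raw score can be scaled up to a size estimate. The sketch below assumes 10 items per 1,000-family band, so that each correct answer stands for 100 word families; these defaults and the function name are assumptions for illustration, and the scoring instructions of the specific test version should take precedence.

```python
def estimate_vocab_size(correct_by_band, items_per_band=10, families_per_band=1000):
    """Scale per-band correct answers up to an estimated number of word families.

    Assumes every band contributes the same number of sampled items.
    """
    for correct in correct_by_band:
        if not 0 <= correct <= items_per_band:
            raise ValueError("each band score must lie between 0 and items_per_band")
    families_per_item = families_per_band // items_per_band
    return sum(correct_by_band) * families_per_item


# Hypothetical learner: one score per band, from the first 1,000 to the 14th 1,000.
scores = [10, 9, 9, 7, 6, 4, 3, 2, 1, 1, 0, 0, 0, 0]
estimate = estimate_vocab_size(scores)
print(estimate)  # 5200
```

Keeping the per-band scores, rather than only the total, also shows where knowledge starts to thin out across the frequency bands.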
Vocabulary Levels Test (VLT) — Nation; revised by Schmitt, Schmitt & Clapham (2001):
Tests specific frequency levels (1,000, 2,000, 3,000, 5,000, 10,000 word families) to identify where learner vocabulary breaks down. More diagnostic than the VST.
Yes/No tests:
Learner marks whether they know each word; requires statistical correction for over-claiming.
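One way to correct for over-claiming, sketched here under stated assumptions, is to mix invented pseudowords into the list and discount "yes" answers according to how often the learner claims to know them. The formula below is the classic correction for guessing, (h - f) / (1 - f); it is one of several corrections discussed in the literature, not necessarily the one used by any particular test, and the function name is illustrative.

```python
def corrected_known_proportion(yes_real, n_real, yes_pseudo, n_pseudo):
    """Estimate the proportion of real words known, discounting over-claiming.

    h = hit rate on real words, f = false-alarm rate on pseudowords.
    Returns (h - f) / (1 - f), floored at 0.
    """
    if n_real <= 0 or n_pseudo <= 0:
        raise ValueError("item counts must be positive")
    hit_rate = yes_real / n_real
    false_alarm_rate = yes_pseudo / n_pseudo
    if false_alarm_rate >= 1.0:
        return 0.0  # the learner said "yes" to every pseudoword
    corrected = (hit_rate - false_alarm_rate) / (1.0 - false_alarm_rate)
    return max(0.0, corrected)


# Hypothetical result: "yes" to 80 of 100 real words and 5 of 20 pseudowords.
p = corrected_known_proportion(yes_real=80, n_real=100, yes_pseudo=5, n_pseudo=20)
print(round(p, 3))
```

With no false alarms the corrected score equals the raw hit rate; the more pseudowords a learner accepts, the more the estimate is pulled down.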
Vocabulary Size Benchmarks for Japanese (JLPT)
Approximate vocabulary size by JLPT level:
| JLPT Level | Approximate Vocabulary Target |
|---|---|
| N5 | ~800 words |
| N4 | ~1,500 words |
| N3 | ~3,750 words |
| N2 | ~6,000 words |
| N1 | ~10,000+ words |
Note: JLPT tests only recognition of vocabulary in multiple-choice format; vocabulary size for productive use is typically lower than recognition size.
Vocabulary Size vs. Depth
Size (breadth) = how many words you know at any level
Depth = how well you know each word (collocations, register, pragmatics)
Both are necessary for proficient language use. Research shows that at lower proficiency, breadth is more strongly predictive of comprehension; at higher levels, depth plays an increasing role in fluency and production quality.
History
Vocabulary size estimation has been a concern in applied linguistics since the early 20th century. Seashore and Eckerson (1940) provided early estimates of native English speaker vocabulary size (averaging around 58,000 basic words), though their methodology was later criticized. Nation’s (1990, 2001) framework for analyzing vocabulary size in terms of word families, and his development of frequency-based word lists (the first 1,000, second 1,000, etc.), provided the standard approach for relating vocabulary size to language proficiency benchmarks. The Vocabulary Size Test (Nation & Beglar, 2007) and the Vocabulary Levels Test (Schmitt, Schmitt, & Clapham, 2001) became the main assessment instruments. Key findings established practical benchmarks: approximately 2,000-3,000 word families for basic conversation, 5,000 for general reading comprehension, and 8,000-9,000 for comprehension of authentic text without extensive dictionary use.
Common Misconceptions
“Vocabulary size means the number of individual words you know.”
Vocabulary size research typically counts word families — a base word plus its inflected and derived forms (e.g., “develop,” “develops,” “developing,” “developed,” “development,” “developer” = one word family). A vocabulary size of 8,000 word families represents knowledge of far more than 8,000 individual words.
“You need to know 50,000+ words for fluency.”
Adult native speakers are estimated to know roughly 17,000-20,000 word families, far fewer than 50,000. The 8,000-9,000 word family threshold covers approximately 98% of typical text, the level at which most unknown words can be inferred from context. Beyond this, returns diminish sharply.
“Vocabulary size is the same as vocabulary knowledge.”
Size (how many words) is one dimension; depth (how well you know each word) is another. A learner with 5,000 shallow word associations may comprehend less than a learner with 3,000 deeply known words including collocations, connotations, and register information.
“Vocabulary sizes are comparable across languages.”
Word family definitions differ across languages. Japanese vocabulary “size” is complicated by the multiple writing systems — kanji compounds, katakana loanwords, and native vocabulary represent different learning challenges. Direct numerical comparison between languages is misleading.
Criticisms
Vocabulary size research has been criticized for the word family unit itself. Critics argue that derived forms within a family (e.g., “nation,” “national,” “nationalism,” “nationalize”) are not equally known — learners may know “nation” without knowing “nationalism” — making word family counts overestimate productive vocabulary. Alternatives like lemma-based counting (base form + inflections only) and flemma-based counting have been proposed.
The coverage-comprehension relationship (e.g., 98% coverage = adequate comprehension) has been questioned: coverage thresholds were established primarily for English text and may not apply equally to other languages with different morphological structures. For Japanese, frequency-based word lists are complicated by the multiple writing systems and by the fact that kanji knowledge provides partial access to unfamiliar compound words, a form of morphological guessing not captured by standard vocabulary size metrics.
Social Media Sentiment
Vocabulary size is a popular metric in language learning communities. Online vocabulary tests (particularly Paul Nation’s Vocabulary Size Test and various “how many words do you know?” tools) are frequently shared and discussed on r/languagelearning. Learners commonly ask “how many words do I need to know to…?” — seeking practical benchmarks.
In Japanese learning communities, discussion focuses on kanji counts and JLPT vocabulary lists as proxies for vocabulary size. The relationship between vocabulary size and real-world comprehension (“I know 5,000 words but can’t understand anime”) generates frequent discussion, usually resolved by distinguishing vocabulary size from vocabulary depth and listening fluency.
Practical Application
- Set vocabulary size benchmarks — Use frequency-based targets: 2,000-3,000 families for basic conversation, 5,000 for general reading with some dictionary support, 8,000-9,000 for authentic text comprehension in English. Japanese benchmarks differ but follow a similar progressive structure.
- Focus on high-frequency vocabulary first — The first 2,000-3,000 word families provide the most coverage per unit of study effort. Prioritize these before moving to lower-frequency words.
- Measure your vocabulary size periodically — Use validated tests (Nation’s VST for English, or kanji/vocabulary knowledge tests for Japanese) to benchmark progress and identify frequency bands that need attention.
- Build depth alongside size — Don’t just add new words — deepen knowledge of words you already partially know. Learn collocations, register restrictions, and connotations for your existing vocabulary.
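The case for prioritizing high-frequency vocabulary can be made concrete with the approximate coverage figures quoted in this entry. The sketch below computes the coverage gained per additional 1,000 word families between consecutive benchmarks; representing the 8,000-9,000 range by its midpoint of 8,500 is a simplifying assumption, and the names are illustrative.

```python
# (word families known, approximate % coverage), using the approximate figures
# quoted in this entry; 8,500 stands in for the 8,000-9,000 range (an assumption).
COVERAGE_POINTS = [(1000, 78.0), (2000, 87.0), (3000, 90.0), (5000, 95.0), (8500, 98.0)]


def marginal_gains(points):
    """Coverage points gained per extra 1,000 families between consecutive rows."""
    gains = []
    for (size_a, cov_a), (size_b, cov_b) in zip(points, points[1:]):
        per_thousand = (cov_b - cov_a) / ((size_b - size_a) / 1000)
        gains.append((size_a, size_b, round(per_thousand, 2)))
    return gains


gains = marginal_gains(COVERAGE_POINTS)
for start, end, gain in gains:
    print(f"{start:>5} -> {end:>5}: +{gain} coverage points per 1,000 families")
```

On these figures the second 1,000 families add about 9 coverage points, while each 1,000 families between 5,000 and 8,500 adds less than 1, which is the diminishing-returns pattern behind the advice above.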
Related Terms
- Word Family
- Depth of Vocabulary Knowledge
- Lexical Acquisition
- Incidental Vocabulary Learning
- Receptive Vocabulary
- Productive Vocabulary
Research
Nation (2001, 2006) established the word frequency/vocabulary size framework that dominates applied linguistics research. The Vocabulary Size Test (Nation & Beglar, 2007) is the most widely used instrument for estimating receptive vocabulary size.
Schmitt (2008) reviewed vocabulary size research, finding that the 8,000-9,000 word family threshold for adequate text coverage holds across genres and registers for English, though specialised texts require additional technical vocabulary. Laufer and Ravenhorst-Kalovski (2010) investigated the optimal vocabulary size for reading comprehension, finding that 8,000 word families corresponded to 98% coverage and a significant comprehension threshold. Matsushita (2012) developed frequency-based vocabulary lists for Japanese that facilitate vocabulary size estimation, finding that the first 6,000 lemmas provide approximately 95% coverage of general Japanese text.