Frequency Lists

Frequency lists are ranked vocabularies derived from corpus analysis: ordered compilations of words (or lemmas, compounds, or phrases) sorted by decreasing frequency of occurrence in a representative language sample. In language learning, they serve as a practical tool for prioritizing vocabulary. Because word frequencies follow an approximately Zipfian distribution (a small number of words accounts for a disproportionately large share of all words encountered in text), learning the highest-frequency words first maximizes the reading and listening coverage gained per unit of study time.


In-Depth Explanation

The Zipf-Pareto principle in vocabulary:

Language follows an approximate power law (Zipf’s Law) in word frequency: in any large corpus, the most frequent word is roughly twice as common as the second most frequent, three times as common as the third, and so on. This means:

  • The top 1,000 words of a language cover approximately 85% of most texts
  • The top 2,000 words cover approximately 90%
  • Each further band of 1,000 words adds progressively less coverage per word learned
  • Academic or domain-specific vocabulary requires separate targeted lists

This distribution justifies the pedagogical strategy of front-loading high-frequency vocabulary before branching into lower-frequency specialized vocabulary.
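
The diminishing returns described above can be illustrated with a minimal Python sketch. It assumes an idealized Zipf law (frequency proportional to 1/rank) over a hypothetical vocabulary of 100,000 word types; the function name zipf_coverage and the vocabulary size are illustrative assumptions. Real corpora deviate from the ideal curve, so the printed values will not match the empirical percentages listed above.

```python
def zipf_coverage(top_n: int, vocab_size: int, exponent: float = 1.0) -> float:
    """Share of all tokens covered by the top_n ranks under an ideal Zipf law.

    Assumption: the word at rank r has relative frequency 1 / r**exponent.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_n]) / sum(weights)


# Coverage grows quickly at first, then flattens as lower ranks are added.
for n in (1_000, 2_000, 5_000, 10_000):
    print(f"top {n:>6,} words: {zipf_coverage(n, vocab_size=100_000):.1%}")
```

Changing vocab_size or exponent shows how sensitive the absolute percentages are to modeling assumptions, while the shrinking gain per additional word remains.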

Nation’s coverage estimates (for English):

Vocabulary size | % coverage of typical text
1,000 words | ~85%
2,000 words | ~90%
3,000 words | ~92%
10,000 words | ~99%

The 95–98% comprehension threshold (the level needed for reading to be genuinely comprehensible rather than laborious) requires roughly 5,000–10,000 words in English, and similar figures apply to other languages with comparable vocabulary distributions.
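
Given actual word counts, the vocabulary size needed to reach a coverage threshold can be read off the cumulative distribution. The sketch below uses a toy English sentence in place of a real corpus; words_needed is a hypothetical helper written for illustration, not part of any published tool.

```python
from collections import Counter
from itertools import accumulate


def words_needed(counts: Counter, target: float) -> int:
    """Smallest number of top-ranked words whose cumulative share reaches target."""
    total = sum(counts.values())
    ranked = [count for _, count in counts.most_common()]
    for n, running in enumerate(accumulate(ranked), start=1):
        if running / total >= target:
            return n
    return len(ranked)  # target not reachable (e.g. empty input)


# Toy corpus standing in for a real one (assumption for illustration only).
toy = Counter("the cat sat on the mat and the dog sat on the cat".split())
print(words_needed(toy, 0.50))  # words needed for 50% token coverage
print(words_needed(toy, 0.95))  # words needed for 95% token coverage
```

On a real corpus, the same function applied at 0.95 and 0.98 gives a corpus-specific estimate of the threshold vocabulary.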

Japanese frequency lists:

Japanese presents additional complexity because:

  1. Writing system: Frequency must be considered separately at the kanji, word, and lemma levels
  2. Register divergence: Spoken and written Japanese have different vocabulary distributions. A list built from a written corpus optimizes for reading; one built from a subtitle or spoken corpus optimizes for listening comprehension
  3. Grammar particles and function words: Japanese has high-frequency grammatical items (は、が、を、に、で、と、も, etc.) that appear constantly but require grammatical rather than lexical learning
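
Points 1 and 3 can be sketched in Python as follows. The input is a hypothetical, already tokenized list of (lemma, part of speech) pairs such as a morphological analyzer might produce; the tag names, the FUNCTION_POS set, and both helper functions are assumptions made for illustration. The kanji test covers only the basic CJK Unified Ideographs block, which is a simplification.

```python
from collections import Counter

# Hypothetical pre-tokenized input; tag names are illustrative assumptions.
tokens = [
    ("私", "pronoun"), ("は", "particle"), ("本", "noun"),
    ("を", "particle"), ("読む", "verb"),
    ("彼", "pronoun"), ("は", "particle"), ("日本語", "noun"),
    ("の", "particle"), ("本", "noun"), ("を", "particle"), ("読む", "verb"),
]

FUNCTION_POS = {"particle", "auxiliary"}  # treated as grammar, not vocabulary


def split_frequency(tokens):
    """Count lemmas separately for lexical items and function words."""
    lexical, functional = Counter(), Counter()
    for lemma, pos in tokens:
        (functional if pos in FUNCTION_POS else lexical)[lemma] += 1
    return lexical, functional


def kanji_frequency(lemma_counts):
    """Derive kanji-level counts from lemma-level counts."""
    kanji = Counter()
    for lemma, count in lemma_counts.items():
        for ch in lemma:
            if "\u4e00" <= ch <= "\u9fff":  # basic CJK ideograph block only
                kanji[ch] += count
    return kanji


lexical, functional = split_frequency(tokens)
print(lexical.most_common(3))
print(functional.most_common(3))
print(kanji_frequency(lexical).most_common(3))
```

Keeping the function-word counts in a separate table prevents particles from crowding out lexical items at the top of the ranked list.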

Major Japanese frequency list sources:

  • BCCWJ (Balanced Corpus of Contemporary Written Japanese): The standard reference corpus; basis for multiple word lists; includes frequency data for print media, magazines, books, web content
  • Rikai.com / jpdb.io: Web-based tools with frequency rankings for Japanese vocabulary; jpdb.io offers frequency-ranked vocabulary lists for specific media (anime, novels, games)
  • Refold / Mass Immersion Approach frequency decks: Anki decks widely used by immersion learners (e.g., Core 2K, Core 6K, Tango N5–N1); Core 2K/6K is usually described as based on newspaper frequency data rather than an anime subtitle corpus
  • JLPT lists: The JLPT vocabulary lists are roughly frequency-ordered but are designed for the test, not for corpus frequency optimization

Popular Japanese frequency decks:

Deck | Words | Notes
Core 2K/6K | 2,000–6,000 | Classic Anki deck; usually described as based on newspaper frequency data; some outdated items
Tango N5–N1 | ~2,600 total | JLPT-structured frequency deck; sentence cards
JP1K | 1,000 high-frequency words | Optimized for recognition; sentence deck format
Kaishi 1.5K | 1,500 | Modern replacement for Core 2K; curated
BCCWJ-based | Variable | More written-language biased; good complement to spoken-focused decks

Frequency lists vs. the Frequency List entry:

Note: this entry covers frequency lists as a general tool and learning concept. A separate entry at Frequency List covers the concept of a frequency list as a defined resource type.


History

Corpus linguistics developed as a field with the digitization of text in the 1960s–1970s (Brown Corpus, 1961; LOB Corpus, 1978). The application of corpus frequency data to vocabulary learning was systematized by Paul Nation (1983 onward), building on earlier word lists such as Michael West’s General Service List (GSL, 1953), which identified high-utility words for English learners. Nation’s work established vocabulary learning principles based on corpus evidence that remain foundational in EFL/ESL teaching.

For Japanese, corpus-based frequency analysis became accessible to learners primarily through the web tools and Anki deck communities of the 2000s–2010s. The Core 2K deck (usually described as based on newspaper frequency data) was for a time a standard recommendation before more curated alternatives appeared.


Common Misconceptions

  • “Learning the frequency list means learning the language.” High-frequency vocabulary is necessary but not sufficient — grammar, pragmatics, collocations, and domain vocabulary are all required for functional competence.
  • “All frequency lists are equivalent.” A frequency list from a written news corpus will prioritize different vocabulary than one from an anime subtitle corpus. The right list depends on your target domain.
  • “Once you’ve done X frequency words, you know X words.” SRS frequency deck completion means you have recognized the words in controlled review conditions — active retrieval, collocational use, and contextual deployment are not guaranteed.
  • “Lower-frequency items don’t matter.” Once you have the high-frequency core, targeted domain-specific vocabulary learning (for your actual reading/viewing material) becomes more important than continuing through increasingly low-frequency general lists.
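
The point that frequency lists are not interchangeable can be checked by measuring how much the top of two lists overlaps. The following is a minimal Python sketch; the two short ranked lists are invented toy data standing in for a news-based and an anime-based list, and top_n_overlap is a hypothetical helper.

```python
def top_n_overlap(list_a, list_b, n):
    """Fraction of the top-n entries of list_a that also occur in the top-n of list_b."""
    top_a, top_b = set(list_a[:n]), set(list_b[:n])
    return len(top_a & top_b) / max(len(top_a), 1)


# Invented toy rankings (not real corpus data).
news_ranked = ["政府", "経済", "会社", "問題", "言う"]
anime_ranked = ["言う", "俺", "お前", "問題", "大丈夫"]

print(f"{top_n_overlap(news_ranked, anime_ranked, 5):.0%} of the top 5 are shared")
```

A low overlap at the list sizes you plan to study suggests choosing the list whose source corpus is closest to your target material.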

Practical Application

  • Use a modern, well-curated Anki frequency deck (Kaishi 1.5K or Tango N5–N2 are current recommendations) as your primary vocabulary base alongside immersion.
  • After completing 1,000–2,000 common words, transition to sentence mining from your own immersion material — the vocabulary that matters for your specific content is more important than the next chunk of a frequency list.
  • Use jpdb.io to identify the frequency-ranked vocabulary for specific media you want to consume — you can pre-learn it before engaging with the content to increase comprehension.
  • Don’t conflate JLPT vocabulary lists with frequency lists — they partially overlap but serve different purposes.
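
The pre-learning step described above can be sketched independently of any particular tool. This Python example does not use jpdb.io or its data; the word counts, the set of known words, and the helpers coverage and prelearn_list are all hypothetical, chosen only to show the idea of ranking a work's unknown vocabulary by frequency.

```python
from collections import Counter


def coverage(media_counts: Counter, known: set) -> float:
    """Share of the work's tokens that the learner already knows."""
    total = sum(media_counts.values())
    if total == 0:
        return 0.0
    return sum(c for w, c in media_counts.items() if w in known) / total


def prelearn_list(media_counts: Counter, known: set, limit: int):
    """Most frequent unknown words in the work, up to limit entries."""
    unknown = [(w, c) for w, c in media_counts.most_common() if w not in known]
    return unknown[:limit]


# Invented counts for a single work and an invented set of known words.
media_counts = Counter({"行く": 8, "見る": 6, "魔法": 5, "学校": 4, "剣": 3})
known = {"行く", "見る", "学校"}

print(f"current coverage: {coverage(media_counts, known):.0%}")
print(prelearn_list(media_counts, known, limit=2))
```

Adding the returned words to the known set and recomputing coverage gives a rough estimate of how much a given amount of pre-study would help for that work.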

Related Terms


Sources