Definition:
Criterion-referenced testing (CRT) is an assessment approach in which scores are interpreted with reference to a fixed, pre-determined performance standard (criterion) rather than relative to the performance of other test-takers. The fundamental question is: “Did the learner achieve the specified level of performance?” — not “Where does the learner rank among their peers?” Criterion-referenced interpretation tells you what an individual can do (absolute performance); norm-referenced interpretation tells you how an individual compares to a group (relative performance). In language education, CEFR-based proficiency certifications and mastery-learning systems are the most prominent examples.
Criterion-Referenced vs. Norm-Referenced
| Feature | Criterion-Referenced | Norm-Referenced |
|---|---|---|
| Score meaning | Compared to a fixed standard | Compared to group performance |
| Key question | Did the learner meet the criterion? | How does the learner rank? |
| Typical output | Pass/fail, level, mastery/non-mastery | Percentile rank, standard score |
| Score distribution | Any distribution is acceptable | Designed to spread learners (normal curve) |
| Purpose | Certify competence / measure mastery | Select, rank, differentiate |
| Example | JLPT (N5–N1), driving test, CEFR level | SAT, IQ test, class rank |
A critical implication: a criterion-referenced test is not designed to maximize score spread. If all learners meet the criterion, all should pass — this is a success, not a problem. Norm-referenced tests are specifically constructed to discriminate between learners; items that everyone gets right are often removed because they don’t differentiate.
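The interpretive contrast can be made concrete with a small sketch (all scores and the cut score here are hypothetical, chosen for illustration): the same raw scores are read once against a fixed criterion and once as percentile ranks within the cohort.

```python
# Hypothetical raw scores and cut score -- two interpretations of the same data.
scores = [62, 71, 78, 84, 84, 90, 95]
cut_score = 70  # fixed, pre-determined criterion

# Criterion-referenced: each learner is judged against the standard.
crt = {s: ("pass" if s >= cut_score else "fail") for s in set(scores)}

# Norm-referenced: each learner is judged against the group
# (percent of the cohort scoring strictly below them).
def percentile_rank(score, cohort):
    below = sum(1 for s in cohort if s < score)
    return 100.0 * below / len(cohort)

for s in sorted(set(scores)):
    print(s, crt[s], round(percentile_rank(s, scores), 1))
```

Note that if every learner scored above 70, all would pass under the criterion-referenced reading, while the percentile ranks would still spread them out: the two interpretations answer different questions.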
Setting the Criterion: Cut Scores
The central technical challenge in criterion-referenced testing is standard-setting — determining where the threshold between passing and failing lies. Common methods:
- Angoff method: Expert judges estimate, for each item, the probability that a “minimally competent” candidate would answer correctly; estimates are averaged across judges per item and summed across items to produce a cut score
- Bookmark method: Items are ordered by difficulty; judges place a “bookmark” at the last item a minimally competent candidate would be expected to answer correctly (typically with a specified probability, such as 2/3)
- Contrasting groups: Define groups of learners known to be above/below threshold; find the score that best separates them
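The Angoff procedure reduces to simple arithmetic once the judges' estimates are collected. A minimal sketch (the judge ratings below are hypothetical):

```python
# Angoff standard-setting sketch: each judge estimates, per item, the
# probability that a minimally competent candidate answers correctly.
# Averaging across judges and summing across items gives a recommended
# cut score on the raw-score scale.

# rows = judges, columns = items (hypothetical borderline-candidate probabilities)
ratings = [
    [0.90, 0.70, 0.50, 0.80, 0.60],  # judge 1
    [0.80, 0.60, 0.40, 0.90, 0.50],  # judge 2
    [0.85, 0.65, 0.45, 0.85, 0.55],  # judge 3
]

n_items = len(ratings[0])
item_means = [sum(judge[i] for judge in ratings) / len(ratings)
              for i in range(n_items)]
# Expected raw score of a minimally competent candidate on this 5-item test.
cut_score = sum(item_means)
print(round(cut_score, 2))
```

In practice the panel would iterate with discussion and impact data between rounds; the arithmetic above is only the final aggregation step.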
Standard-setting is always a values and policy judgment embedded in a technical procedure — experts operationalize conceptions of “adequate performance.”
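The contrasting-groups method listed above can likewise be sketched in a few lines. This is a simplified version (hypothetical group scores; operational implementations often use the intersection of the two score distributions or logistic regression rather than raw misclassification counts):

```python
# Contrasting-groups sketch: given scores from learners independently
# judged to be above or below the threshold, choose the cut score that
# minimizes total misclassification.

masters = [78, 82, 85, 88, 91]      # hypothetical "above threshold" group
non_masters = [55, 60, 64, 70, 74]  # hypothetical "below threshold" group

def misclassified(cut, masters, non_masters):
    # masters falling below the cut + non-masters reaching it
    return sum(s < cut for s in masters) + sum(s >= cut for s in non_masters)

candidates = sorted(set(masters + non_masters))
best_cut = min(candidates, key=lambda c: misclassified(c, masters, non_masters))
print(best_cut)
```

With overlapping real-world distributions the minimum is rarely zero, and the choice of loss function (should false passes cost more than false fails?) is itself a policy decision.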
Criterion-Referenced Assessment in Language Learning
CEFR as CRT architecture:
The CEFR’s six-level system (A1–C2) is fundamentally criterion-referenced — each level is defined by a set of can-do descriptors specifying what a learner at that level can do. Certifications that report CEFR levels (DELF A2, Goethe B1, IELTS bands, Cambridge B2 First) are criterion-referenced: the score means “achieved this level of communicative ability,” not “ranked in the top X% of test-takers.”
Mastery learning:
In mastery learning frameworks (Bloom, 1968), learners are assessed against a fixed criterion — for example, 80% mastery of the unit material — before progressing. This contrasts with conventional time-based grading, in which everyone advances after a fixed period regardless of mastery.
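A mastery gate of this kind is a pure criterion-referenced decision. A minimal sketch (the 80% threshold follows the Bloom-style example above; the function name is illustrative):

```python
# Mastery-learning gate: a learner advances only after meeting the unit
# criterion; otherwise they receive corrective instruction and retest.

MASTERY_THRESHOLD = 0.80  # e.g., 80% of unit objectives

def may_advance(correct, total, threshold=MASTERY_THRESHOLD):
    """Criterion-referenced decision: performance vs. a fixed standard."""
    return (correct / total) >= threshold

print(may_advance(17, 20))  # 17/20 = 0.85, at or above threshold
print(may_advance(15, 20))  # 15/20 = 0.75, below threshold
```

Note that the decision never consults other learners' scores: whether everyone in the cohort advances or no one does, the interpretation stays valid.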
Proficiency certification vs. norm-referenced selection:
- JLPT N2 = criterion-referenced pass/fail against the N2 level criteria
- University entrance exam that selects the top 200 applicants = norm-referenced
Advantages of Criterion-Referenced Testing in Language Education
- Meaningful score interpretation: A CEFR B2 certificate tells employers and admissions offices what the holder can actually do — a percentile rank does not
- Fairer assessment: Learners are not penalized because their cohort happened to score well; each learner’s score reflects their performance against a standard
- Curriculum alignment: Can-do based criteria align directly with instructional objectives
- Positive washback: Teaching toward meeting explicit language competency criteria encourages genuine skill development
Limitations
- Criterion-setting is subjective: The standard is always set through human judgment; different standard-setting procedures yield different cut scores
- Single number masking: A JLPT N2 pass/fail collapses a complex ability profile into a single binary — a near-fail and a strong pass receive the same designation
- Comparability across administrations: Ensuring that a B2 certificate from one year represents the same standard as one from another year requires careful equating
History
Criterion-referenced testing as a distinct psychometric approach was formalized by Robert Glaser (1963), whose foundational paper contrasted tests that measure performance against a defined standard (criterion-referenced) with tests that measure performance relative to a population distribution (norm-referenced). The shift toward criterion-referenced assessment in education was accelerated by the objectives-based curriculum movement of the 1960s and the growth of competency-based education. In language testing, criterion-referenced frameworks gained prominence with the development of the CEFR (2001) and similar frameworks that describe what learners can do at specific proficiency levels, explicitly criterion-referenced benchmarks. Bachman and Palmer’s (1996, 2010) language test design frameworks integrated criterion-referencing into the theoretical basis for performance-based language assessment.
Common Misconceptions
“Criterion-referenced tests are easier than norm-referenced tests.” The difficulty of a criterion-referenced test is determined by the standard set, not by the comparison group. A criterion with a high performance standard can be very demanding; a criterion with a low standard may be easily passed. Difficulty is a property of the criterion and the test design, not of the criterion-referencing methodology itself.
“Pass/fail on a criterion-referenced test means the learner has ‘mastered’ the skill.” A pass on a criterion-referenced test means the learner performed above a defined threshold on the tested sample of behavior — it does not guarantee complete mastery, transfer to all real-world contexts, or durable retention. The standard is set by human judgment about what constitutes “sufficient” performance, introducing both validity and reliability considerations.
Criticisms
Setting the cut score (the performance threshold that determines pass/fail) is among the most technically and ethically challenging aspects of criterion-referenced test design. Standard-setting methods (Angoff, bookmark, contrasting groups) each have known limitations and can produce different cut scores for the same test, raising concerns about the arbitrariness of any single threshold. Critics also note that describing criterion “standards” in language ability (as in CEFR levels) requires significant interpretive judgment — “can communicate in routine situations” is not a precise behavioral specification. High-stakes criterion-referenced decisions (graduation, certification, immigration) based on performance thresholds with known measurement error are a persistent concern in language testing ethics.
Social Media Sentiment
Criterion-referenced testing appears in language learner communities primarily in the context of proficiency certifications — IELTS, TOEFL, JLPT, and DELF/DALF are all criterion-referenced assessments, and their specific passing criteria (band scores, level requirements) are frequently discussed. Test preparation communities share information about what specific criteria mean in practice and how to optimize performance for specific test standards. The psychometric distinction between criterion and norm-referenced testing is less commonly discussed in learner communities than the practical “what score do I need?” question.
Last updated: 2026-04
Practical Application
Understanding criterion-referenced testing is practically important for L2 learners pursuing certification. JLPT, IELTS, DELF, and similar examinations set specific passing criteria that learners can use to calibrate their preparation. Rather than aiming to perform better than other test-takers (norm-referenced thinking), learners should focus on meeting the specific criterion — which means identifying and practicing the specific task types and language functions assessed by the test.
Research
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18(8), 519-521.
The foundational paper distinguishing criterion-referenced from norm-referenced measurement, framing criterion-referencing as a tool for measuring learner competence against defined objectives rather than relative to other learners — the conceptual starting point for criterion-referenced language testing.
Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.
The most widely used framework for language test design, integrating criterion-referencing into the theoretical basis for validity, reliability, and practicality in language assessment — required reading for anyone designing or evaluating criterion-referenced language tests.
Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. (1978). Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48(1), 1-47.
A comprehensive psychometric review of the technical challenges in criterion-referenced testing — including cut score setting, reliability estimation, and validity evidence — providing the technical foundation for understanding the measurement issues underlying criterion-referenced language assessment.