Definition:
Validity is the degree to which a test or assessment accurately measures the construct it is intended to measure. In language assessment, this means: does the test actually measure language proficiency (or the specific sub-skill it claims to assess), or is it inadvertently measuring something else — test-taking strategy, cultural background knowledge, anxiety, reading speed? Validity is considered the most fundamental quality criterion in assessment: reliability (consistency) is necessary but not sufficient — a test can be consistently measuring the wrong thing. The modern unified framework for validity was established by Samuel Messick (1989), who argued that validity is construct validity and encompasses all evidence bearing on the appropriateness, meaningfulness, and usefulness of score-based interpretations.
Classical Types of Validity
Before Messick’s unified framework, validity was typically discussed as several distinct types:
Content validity:
Does the test adequately sample the content domain it is supposed to assess?
- A grammar test that only tests past tense has poor content validity for “English grammar overall”
- A JLPT N2 vocabulary test should sample from words at the N2 frequency range, not exclusively lower or higher frequency
Construct validity:
Does the test measure the theoretical construct (e.g., communicative competence, reading comprehension, global proficiency)?
- A test that measures only speed of isolated word recognition may lack construct validity as a measure of “reading comprehension” if comprehension of meaning is the target construct
- Best investigated through factor analysis, convergent/divergent validity studies, and think-aloud protocols
Concurrent validity:
Do scores on the new test correlate highly with scores on a well-established test measuring the same construct?
- A new quick proficiency screener should correlate substantially with IELTS or TOEFL scores
- Tested statistically; commonly used to validate new instruments by comparison to established benchmarks
Predictive validity:
Does performance on the test predict future performance on a real-world criterion?
- Do TOEFL scores predict academic GPA of international students?
- Does the driving theory test predict actual safe driving behavior?
- Especially relevant for placement tests and aptitude tests
Face validity:
Does the test appear to measure what it claims — from the perspective of test-takers and stakeholders?
- Not technically a form of validity evidence in Messick’s framework, but practically important for test acceptance and motivation
- A proficiency test that consists entirely of trivia questions would lack face validity regardless of psychometric properties
Messick’s Unified Validity Framework (1989)
Messick argued that validity is not a property of a test itself but of the inferences made from test scores. He proposed a unified construct validity framework that encompasses:
- Evidence: Content, substantive, structural, generalizability, external, consequential
- Value implications: Are the values underlying the construct appropriate?
- Social consequences: What are the consequences of using the test — including washback?
This framework made consequential validity (the real-world effects of test use, including washback and equity implications) a component of validity argumentation.
Validity vs. Reliability Trade-offs
In language assessment, there is often a practical trade-off:
- High-authenticity tasks (free composition, real conversation) tend to have higher construct validity but lower inter-rater reliability (they are harder to score consistently)
- Discrete-point tests (multiple-choice grammar items) tend to have high reliability but may sacrifice construct validity (they test form recognition, not communicative ability)
Optimal test design seeks to maximize both — but perfect balance is rarely achievable. The relative weighting of validity vs. reliability in a given test depends on its purpose.
Validity in SLA Research
Beyond testing, validity is fundamental to SLA research methodology:
- Does a grammaticality judgment task actually measure implicit knowledge, or is it measuring explicit metalinguistic knowledge?
- Does an elicited imitation task measure syntactic processing or rote memory span?
- Does an eye-tracking paradigm measure what SLA researchers claim it measures about processing?
Validity arguments are as critical in research design as in assessment design.
History
Validity in language assessment evolved from a narrow concept (does the test measure what it claims to measure?) to a comprehensive framework encompassing multiple evidence types. Early validity theory distinguished three types: content validity (does test content represent the domain?), criterion-related validity (does the test correlate with external measures?), and construct validity (does the test measure the intended psychological construct?). Messick (1989) revolutionized validity theory by arguing that validity is a unitary concept — construct validity — and that all evidence (content, criterion, consequential) contributes to a single validity argument about the interpretation and use of test scores. This reconceptualization, adopted by the AERA/APA/NCME Standards for Educational and Psychological Testing (1999, 2014), redefined validity not as a property of the test but of the inferences made from test scores.
Common Misconceptions
“Validity means the test is accurate.”
Validity is not an inherent property of a test — it is a property of the interpretations and uses of test scores. A test may be valid for one purpose (placement) but invalid for another (proficiency certification) even though the test itself hasn’t changed.
“A validated test is valid for everything.”
Validity must be established for each specific use and population. A test validated for adult ESL placement is not automatically valid for child heritage speaker assessment, university admission, or immigration purposes. Each use requires its own validity argument.
“High reliability means high validity.”
Reliability (consistency) is necessary but not sufficient for validity. A test can produce highly consistent results that consistently measure the wrong thing. A vocabulary test may reliably measure vocabulary but would not be valid as a measure of overall speaking proficiency.
“Face validity is sufficient.”
Face validity (the test looks like it measures what it claims to) is the weakest form of validity evidence. A test that looks like a proficiency test may actually measure test-taking strategy, reading speed, or background knowledge rather than language proficiency.
Criticisms
Messick’s unitary validity framework, while influential, has been criticized for being so comprehensive that it becomes impractical — gathering evidence for all aspects of construct validity (content, substantive, structural, generalizability, external, consequential) is prohibitively expensive and time-consuming for most testing contexts.
The inclusion of consequential validity — the social consequences of test use — in the validity framework remains controversial. Critics argue that test consequences are policy questions, not measurement questions, and that conflating them with validity overextends the concept. For language testing specifically, the emphasis on construct validity has been critiqued for privileging academic constructs (communicative competence, proficiency) over the practical question that stakeholders care about: does this person have the language ability to do what they need to do? Additionally, validity arguments for major language tests (TOEFL, IELTS) are developed by the testing organizations themselves, creating potential conflicts of interest.
Social Media Sentiment
Validity concerns are pervasive in language learning communities, though rarely expressed using the term. Complaints that test scores “don’t reflect real ability,” debates about whether JLPT measures actual Japanese proficiency, and skepticism about placement test accuracy are all validity discussions in practical terms.
On r/LearnJapanese and r/IELTS, the most common validity-related frustration is mismatch between test performance and real-world communication ability — a concern about the ecological validity of standardized tests. Teachers on academic forums discuss validity more technically, debating the construct definitions underlying major test batteries.
Practical Application
- Interpret test scores carefully — Understand what a specific test score means and what it doesn’t mean. A JLPT N2 pass certifies reading and listening comprehension at a defined level — it does not certify speaking or writing ability.
- Use tests for their validated purposes — Choose assessment tools that are validated for your specific purpose. If you need to assess speaking ability, use a speaking test, not a multiple-choice grammar test.
- Consider multiple evidence sources — No single test provides a complete picture of language proficiency. Combine different assessment types for more valid overall evaluation.
- For teachers: align tests with instruction — The most valid classroom assessments test what was taught, in the way it was taught. If the course emphasizes conversation, assess conversation — not grammar translation.
Research
Messick (1989) established the unitary validity framework that remains the standard in educational measurement. Bachman and Palmer (1996) applied validity theory to language testing within their broader test usefulness framework, providing practical guidance for language test validation.
Kane (2006) proposed the interpretive argument approach, requiring test developers to explicitly state the inferential chain from test performance to score interpretation to use decision — and to provide evidence supporting each link. For language testing specifically, Chapelle, Enright, and Jamieson (2008) provided a comprehensive validity argument for the TOEFL iBT, illustrating how multiple evidence types contribute to a unified validity argument for a major high-stakes language test. Xi (2010) addressed the consequential validity of language tests, examining how test use affects teaching, learning, and social outcomes — an area of increasing importance for language assessment.