Definition:
Generalizability theory (G-theory), developed by Lee Cronbach and colleagues in the 1960s and early 1970s, is a statistical framework for analyzing the dependability of test scores by decomposing score variance into multiple sources (called facets), such as raters, tasks, items, and occasions. Unlike Classical Test Theory, which treats all measurement error as a single undifferentiated term, G-theory estimates how much of the score variability comes from each source.
In-Depth Explanation
The core question: “How well does a test score generalize beyond the specific conditions under which it was obtained?”
A writing test score, for example, is influenced by:
- The person’s actual writing ability (the thing we want to measure)
- Which topic they were assigned
- Which rater scored their essay
- The day they took the test
- Interactions between these factors (e.g., some raters are harsher on certain topics)
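In symbols, this decomposition is commonly written as a sum of effects. The sketch below assumes, for illustration only, a fully crossed design with persons (p), raters (r), and tasks (t), and leaves the occasion facet out; the notation is the conventional one rather than something taken from the example above.

$$
X_{prt} = \mu + \nu_p + \nu_r + \nu_t + \nu_{pr} + \nu_{pt} + \nu_{rt} + \nu_{prt,e}
$$

$$
\sigma^2(X_{prt}) = \sigma^2_p + \sigma^2_r + \sigma^2_t + \sigma^2_{pr} + \sigma^2_{pt} + \sigma^2_{rt} + \sigma^2_{prt,e}
$$

Each term in the second line is a variance component. With one score per person-rater-task cell, the three-way interaction cannot be separated from random error, which is why the last term carries both labels.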
G-theory lets test developers estimate how much each source contributes to score variation:
| Source of Variance | Example | What it tells us |
|---|---|---|
| Person (p) | Test-taker ability | The signal — what we’re trying to measure |
| Rater (r) | Differences in overall rater severity | Are some raters systematically harsher or more lenient than others? |
| Task (t) | Different essay topics | Do topics differ in difficulty? |
| Person × Rater (p×r) | Rater-person interaction | Are some raters harsher on certain people? |
| Person × Task (p×t) | Task-person interaction | Do some people do better on certain topics? |
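| Rater × Task (r×t) | Rater-task interaction | Are some raters harsher on certain topics? |
| Residual (p×r×t, e) | Three-way interaction plus unexplained variation | What is left over: the highest-order interaction, confounded with random noise |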
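As a rough illustration of how such components can be estimated, the sketch below handles the simplest case: a single facet, with every rater scoring every person once (a crossed person × rater design). It uses the standard ANOVA expected-mean-squares approach. The function name `g_study_p_x_r` and the score matrix are hypothetical, and a real analysis with tasks as a second facet would have more terms.

```python
def g_study_p_x_r(scores):
    """Estimate variance components for a crossed person x rater design.

    scores[p][r] is the single score that rater r gave to person p.
    Hypothetical helper: one facet only, no missing data.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(scores[p][r] for p in range(n_p)) / n_p
                   for r in range(n_r)]

    # Sums of squares for persons, raters, and the residual (p x r, e).
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_pr = sum(
        (scores[p][r] - person_means[p] - rater_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )

    # Mean squares: each sum of squares divided by its degrees of freedom.
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Solve the expected-mean-square equations; negative estimates are
    # set to zero here, which is one common convention.
    return {
        "p": max((ms_p - ms_pr) / n_r, 0.0),
        "r": max((ms_r - ms_pr) / n_p, 0.0),
        "pr,e": ms_pr,
    }


# Invented data: 3 persons (rows) scored by 2 raters (columns).
scores = [[2, 4], [4, 6], [6, 8]]
components = g_study_p_x_r(scores)
print(components)
```

In this invented data set the second rater scores every person exactly two points higher, so the estimates attribute variance to persons and to raters but none to the residual.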
Two types of studies:
- G-study (Generalizability study): Collects data and estimates variance components for each facet and their interactions.
- D-study (Decision study): Uses G-study estimates to design optimal measurement procedures — how many raters, tasks, or items are needed to achieve a desired level of reliability?
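For a crossed person × rater × task design with random facets, the D-study quantities are usually written as follows. The symbols $n'_r$ and $n'_t$ (the numbers of raters and tasks in the planned procedure) and the variance-component notation are standard conventions assumed here, not notation defined earlier in this entry.

$$
\sigma^2_\delta = \frac{\sigma^2_{pr}}{n'_r} + \frac{\sigma^2_{pt}}{n'_t} + \frac{\sigma^2_{prt,e}}{n'_r\, n'_t},
\qquad
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta}
$$

$$
\sigma^2_\Delta = \sigma^2_\delta + \frac{\sigma^2_r}{n'_r} + \frac{\sigma^2_t}{n'_t} + \frac{\sigma^2_{rt}}{n'_r\, n'_t},
\qquad
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}
$$

The generalizability coefficient $E\rho^2$ applies to relative (rank-ordering) decisions, while the dependability index $\Phi$ applies to absolute decisions such as pass/fail against a fixed cut score. Increasing $n'_r$ or $n'_t$ shrinks the error terms, which is how a D-study trades extra raters or tasks for reliability.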
For example, a G-study of a Japanese speaking test might reveal that rater variation contributes more error than task variation. The D-study would then recommend using more raters (or rater training) rather than more tasks.
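A small numerical sketch can make this trade-off concrete. The variance components below are invented for illustration (they do not come from any real speaking test), and the helper names are hypothetical. The code assumes a crossed person × rater × task design and computes the generalizability coefficient for relative decisions, where error comes from the person × rater, person × task, and residual components.

```python
def relative_error(vc, n_r, n_t):
    """Relative error variance when scores are averaged over n_r raters and n_t tasks."""
    return vc["pr"] / n_r + vc["pt"] / n_t + vc["res"] / (n_r * n_t)


def g_coefficient(vc, n_r, n_t):
    """Generalizability coefficient: person variance over person plus relative error variance."""
    return vc["p"] / (vc["p"] + relative_error(vc, n_r, n_t))


# Hypothetical G-study estimates in which the rater-related
# component (pr) is larger than the task-related one (pt).
vc = {"p": 0.50, "pr": 0.20, "pt": 0.05, "res": 0.25}

baseline = g_coefficient(vc, n_r=1, n_t=1)     # one rater, one task
more_raters = g_coefficient(vc, n_r=2, n_t=1)  # add a second rater
more_tasks = g_coefficient(vc, n_r=1, n_t=2)   # add a second task

print(f"1 rater, 1 task:  {baseline:.3f}")
print(f"2 raters, 1 task: {more_raters:.3f}")
print(f"1 rater, 2 tasks: {more_tasks:.3f}")
```

With these made-up numbers, adding a second rater raises the coefficient more than adding a second task does, which mirrors the recommendation described above. With different components the ordering could reverse, so the comparison has to be rerun for each test.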
Relevance to language testing:
G-theory is particularly important for performance-based assessments (speaking tests, writing tests, and oral interviews), where human raters are a major source of score variability. It is less critical for machine-scored multiple-choice tests, which have no rater facet, although items and occasions can still be modeled as facets there.
Research
- Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. Wiley. — The foundational text on G-theory.
- Brennan, R. L. (2001). Generalizability Theory. Springer. — The standard modern reference for G-theory methodology and applications.