Definition
Speech production is the cognitive and motor process by which a speaker transforms a communicative intention into a sequence of sounds, moving from concept to utterance through a series of interacting processing stages. It encompasses everything from deciding what to say to the precise muscular coordination of the tongue, lips, and vocal cords required to say it. Speech production is a central concern of psycholinguistics and has been studied through reaction-time experiments, speech errors, neuroimaging, and computational modeling.
Levelt’s Model
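The most influential account of speech production is Willem Levelt’s model (1989), which distinguishes three production components (conceptualizer, formulator, articulator) and a speech comprehension system used for self-monitoring: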
| Stage | Process | Output |
|---|---|---|
| Conceptualizer | Selecting and ordering the information that expresses the communicative intention | Preverbal message |
| Formulator | Grammatical encoding (lemma selection, syntax); phonological encoding (lexeme access) | Phonetic plan |
| Articulator | Motor execution of the phonetic plan | Overt speech |
| Speech Comprehension System | Self-monitoring via internal speech and hearing one’s own output | Error detection and repair |
A critical feature of the model is its distinction between lemma (the abstract lexical entry containing syntactic and semantic information) and lexeme (the phonological word form). This distinction explains the tip-of-the-tongue state, in which a speaker has retrieved a word’s lemma (its meaning and syntactic properties) but cannot retrieve its lexeme, and it accounts for the patterns observed in speech errors.
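The lemma/lexeme split can be pictured as a two-step lookup. The sketch below is a minimal illustration, not Levelt’s implementation: the toy lexicon, the numeric retrieval strengths, the threshold, and all function names are hypothetical, and the forms are rough ASCII stand-ins for phonological transcriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lemma:
    """Abstract lexical entry: meaning and syntax, no sound form."""
    concept: str
    category: str  # word class, e.g. "noun"


# Hypothetical toy lexicon; entries are illustrative only.
LEMMAS = {
    "DOG": Lemma(concept="DOG", category="noun"),
    "SEXTANT": Lemma(concept="SEXTANT", category="noun"),
}

# Lexeme store: (word form, made-up retrieval strength between 0 and 1).
LEXEMES = {
    "DOG": ("/dog/", 0.9),
    "SEXTANT": ("/sekstent/", 0.2),  # rare word, weakly accessible form
}


def select_lemma(concept: str) -> Optional[Lemma]:
    """Stage 1: find the lemma that matches the intended concept."""
    return LEMMAS.get(concept)


def retrieve_lexeme(lemma: Lemma, threshold: float = 0.5) -> Optional[str]:
    """Stage 2: return the word form only if it is accessible enough."""
    form, strength = LEXEMES[lemma.concept]
    return form if strength >= threshold else None


def produce(concept: str) -> str:
    """Run both stages and report where retrieval stopped."""
    lemma = select_lemma(concept)
    if lemma is None:
        return "no lexical entry"
    form = retrieve_lexeme(lemma)
    if form is None:
        # Lemma retrieved, lexeme not: the tip-of-the-tongue pattern.
        return f"tip-of-the-tongue: known to be a {lemma.category}, form unavailable"
    return form
```

In this sketch, `produce("SEXTANT")` stops after the first stage: the speaker still knows the word class but has no form to articulate, which mirrors the partial knowledge reported in tip-of-the-tongue states.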
Grammatical Encoding
Before phonological form is accessed, the speaker selects lemmas with the appropriate syntactic properties, such as word class (noun, verb) and grammatical gender, and constructs a syntactic frame. Structural priming, the tendency to repeat syntactic structures just heard or produced, operates at this stage, indicating that syntactic frames are abstract and reusable.
Phonological Encoding
Once a lemma is selected, its phonological form (lexeme) must be retrieved. This involves:
- Selecting the word’s segments and syllable structure
- Assigning stress
- Integrating the word into the prosody of the larger utterance
The syllable is a key planning unit. Evidence from speech errors and reaction-time studies shows that phonological encoding proceeds left-to-right but with considerable look-ahead — speakers have partial preparation of upcoming syllables while articulating earlier ones.
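Left-to-right encoding with look-ahead can be traced with a short sketch. This is only an illustration under simplifying assumptions: syllables are plain strings, the look-ahead window is a fixed size, and the function name `incremental_plan` is hypothetical.

```python
from typing import List, Tuple


def incremental_plan(syllables: List[str], lookahead: int = 1) -> List[Tuple[str, List[str]]]:
    """Pair each articulated syllable with the syllables already being prepared.

    Syllables are processed strictly left to right; at each step the next
    `lookahead` syllables count as partially prepared.
    """
    trace = []
    for i, current in enumerate(syllables):
        prepared = syllables[i + 1 : i + 1 + lookahead]
        trace.append((current, list(prepared)))
    return trace
```

For example, `incremental_plan(["ba", "na", "na"], lookahead=1)` pairs the first syllable with a prepared second syllable, and the final syllable with an empty buffer.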
Articulation and Motor Control
The phonetic plan specifies abstract motor targets (articulatory gestures). The articulator translates these into motor commands that coordinate breathing, laryngeal vibration, and supralaryngeal tract movements. Coarticulation, the overlap of articulatory movements across adjacent sounds, is a central feature of fluent speech: the gestures for neighboring sounds overlap in time rather than being executed strictly one after another. This overlap allows rapid speech rates while preserving intelligibility.
Self-Monitoring
Speakers monitor their own production at two points: internally (via an inner speech loop, before articulation) and externally (by hearing what they have said). When an error is detected, speakers typically interrupt the utterance and repair it. The patterns of when errors are detected vs. missed have been used to test the architecture of the monitoring system.
Speech Production in L2
L2 speech production differs from L1 production in critical ways:
- Higher cognitive load as all stages compete for limited processing resources
- Slower lexeme access, especially for less frequent or recently learned words
- Phonological encoding interference from L1 phonological representations via cross-linguistic influence
- Greater reliance on working memory due to less automatized processing
- More frequent speech errors and self-repairs
As L2 proficiency increases, individual stages become more automatized, freeing working memory for higher-level formulation.
History
Early accounts of speech production were largely behaviorist (speech as conditioned stimulus-response chains), a framework that began collapsing with Chomsky’s review of Skinner’s Verbal Behavior (1959). The speech error corpora assembled by Victoria Fromkin (1971) and the models of Merrill Garrett (1975) established that speech production involves hierarchical, symbolic planning. Dell’s (1986) spreading activation model offered a connectionist account, predicting specific error patterns through activation dynamics rather than strictly serial processing. Levelt’s Speaking (1989) synthesized the speech error evidence with reaction-time and priming data into the standard framework. Neuroimaging (PET, fMRI) from the 1990s onward has mapped production stages to cortical networks, including Broca’s area (phonological/syntactic encoding) and the supplementary motor area (initiation).
Common Misconceptions
- “Speech production is simply the reverse of comprehension.” While the same representations may be partly shared, production requires explicit planning and sequencing that comprehension does not. The systems are closely related but distinct.
- “L2 speech is just slower because of translation.” Even advanced L2 speakers do not translate word by word; slower L2 production largely reflects less automatized phonological encoding and greater phonological competition from L1.
- “Thinking about how you speak makes you speak better.” Conscious attention to articulatory planning often disrupts automatized speech processes — expert speakers are fluent precisely because they do NOT consciously manage every stage.
Criticisms
Levelt’s model has been criticized for being too modular, in that it assumes stages operate strictly in sequence with limited feedback. Dell’s (1986) model allows phonological activation to feed back to lexical selection, better accounting for phonological facilitation (naming the target is faster when a phonologically similar word is produced before it). The two-level distinction between lemma and lexeme has also been challenged by some researchers, who argue for a single-stage model in which all word properties are jointly activated. More recent work in embodied and dynamical systems approaches questions whether discrete symbolic stages adequately capture the gradient and continuous nature of motor speech.
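The contrast between strictly feedforward and interactive processing can be illustrated with a toy network. This is a deliberately simplified sketch, not Dell’s (1986) model: the three-word lexicon, the weights, the number of steps, and the function name `spread` are all assumptions made for illustration.

```python
from typing import Dict

# Toy lexicon: each word maps to its phonemes (rough ASCII labels).
PHONEMES = {
    "cat": ["k", "ae", "t"],
    "rat": ["r", "ae", "t"],
    "dog": ["d", "o", "g"],
}


def spread(
    semantic_input: Dict[str, float],
    feedback: float = 0.0,
    forward: float = 0.5,
    steps: int = 3,
) -> Dict[str, float]:
    """Return word activations after a few rounds of spreading.

    With feedback == 0.0 the network is purely feedforward and word
    activations stay equal to the semantic input. With feedback > 0.0,
    phoneme activation flows back to every word that contains the phoneme.
    """
    words = dict(semantic_input)
    for _ in range(steps):
        # Forward pass: each word activates its own phonemes.
        phon: Dict[str, float] = {}
        for word, act in words.items():
            for p in PHONEMES[word]:
                phon[p] = phon.get(p, 0.0) + forward * act
        # Feedback pass: phonemes send activation back to words.
        words = {
            word: semantic_input[word]
            + feedback * sum(phon[p] for p in PHONEMES[word])
            for word in semantic_input
        }
    return words


# Target "cat" with two equally weak semantic competitors.
inputs = {"cat": 1.0, "rat": 0.3, "dog": 0.3}
serial = spread(inputs, feedback=0.0)
interactive = spread(inputs, feedback=0.2)
```

Without feedback, "rat" and "dog" remain equally active. With feedback, "rat" ends up more active than "dog" because it shares two phonemes with the target, so a competitor related in both meaning and sound gains an advantage that a strictly serial architecture does not predict.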
Social Media Sentiment
Speech production concepts circulate primarily through discussions of fluency, accent, and L2 learning anxiety. The point that fluency reflects automatized sub-processes — not just “knowing vocabulary” — resonates widely with late-stage language learners who understand vocabulary but struggle with real-time production. Levelt’s model is regularly introduced in linguistics courses, and simplified explanations draw significant engagement on YouTube and language-learning forums.
Last updated: 2025-07
Practical Application
Understanding speech production demystifies why vocabulary knowledge does not automatically produce fluent speech. Knowing a word means having lemma access (semantic and syntactic information), but fluent production additionally requires rapid, automatized lexeme access (the phonological form). Repeated retrieval practice is what consolidates the phonological form. This is the rationale for spaced repetition tools like Sakubo: they train lexeme-level access repeatedly across spaced intervals, which passive review does not, and so target the stage of the production pipeline where L2 learners most commonly show bottlenecks.
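Spaced retrieval practice can be sketched as a simple scheduling rule. The example below assumes a doubling interval that resets after a failed retrieval; this rule and the function names are hypothetical and are not a description of how Sakubo or any other specific tool schedules reviews.

```python
from datetime import date, timedelta
from typing import List


def next_interval(current_days: int, recalled: bool) -> int:
    """Double the gap after a successful retrieval; reset to 1 day after a failure."""
    return current_days * 2 if recalled else 1


def schedule(start: date, outcomes: List[bool]) -> List[date]:
    """Return the review dates produced by a sequence of retrieval outcomes.

    The first review falls one day after `start`; each outcome then
    determines the gap before the following review.
    """
    interval = 1
    due = start + timedelta(days=interval)
    dates = [due]
    for recalled in outcomes:
        interval = next_interval(interval, recalled)
        due = due + timedelta(days=interval)
        dates.append(due)
    return dates
```

Each scheduled review forces the learner to retrieve the word form from memory, so the growing gaps trade review frequency against retrieval difficulty.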
Related Terms
- Psycholinguistics
- Speech Error
- Tip-of-the-Tongue
- Mental Lexicon
- Lexical Access
- Spreading Activation
- Cross-Linguistic Influence
- Coarticulation
- Working Memory
- Cognitive Load
- Spaced Repetition
Research
Levelt, W. J. M. (1989). Speaking: From Intention to Articulation. MIT Press.
The definitive model of speech production, synthesizing two decades of speech error, reaction-time, and priming research into a staged processing account. Introduced the conceptualizer–formulator–articulator architecture and the lemma/lexeme distinction.
Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93(3), 283–321.
Proposed a connectionist model in which phonological and lexical activation spreads interactively rather than in a strictly serial fashion. Predicts specific error type patterns through activation dynamics and remains an influential alternative to strictly modular accounts.
Indefrey, P., & Levelt, W. J. M. (2004). The spatial and temporal signatures of word production components. Cognition, 92(1–2), 101–144.
A comprehensive meta-analysis of neuroimaging studies of word production, mapping Levelt’s processing stages to cortical locations and time windows, providing the primary neuroanatomical grounding for the standard model.