Camellia sinensis Genetics

The first high-quality whole-genome sequences of Camellia sinensis, covering both the sinensis and assamica varieties, were published in 2017 and 2018. At roughly 3.0 to 3.2 gigabases (about the size of a mammalian genome), they were among the largest and most complex crop plant genomes assembled up to that point. Analysis revealed a tea plant shaped by an ancient whole-genome duplication approximately 30–40 million years ago, which left redundant copies of biosynthetic pathway genes at key junctions and allowed the plant to raise catechin, caffeine, and L-theanine production to levels rarely matched in other plant systems. Gene family expansions in flavonoid biosynthesis (a 25-fold expansion in certain CYP450 gene subfamilies compared to Arabidopsis), in terpene synthase genes (which underlie the hundreds of aroma volatiles tea produces), and in caffeine synthase paralogs provide direct genomic evidence for the metabolic complexity that makes tea such an unusual crop. Understanding the genetic basis of these traits not only explains why tea is what it is chemically, but also enables the breeding of new cultivars with targeted flavor, pest resistance, and climate adaptation traits.


In-Depth Explanation

Genome Structure and Size

Genome assembly data:

  • Camellia sinensis var. sinensis (the “small-leaf” variety common to Chinese teas): 3.02 Gb genome; 36,951 predicted protein-coding genes; ~80% of the genome consists of repetitive elements (transposable elements, chiefly retrotransposons), which largely explains the large genome size
  • Camellia sinensis var. assamica (the “large-leaf” variety from Assam and Yunnan regions): 3.13 Gb genome; 36,137 predicted genes; similar repetitive element composition
  • For context: rice genome is ~430 Mb; Arabidopsis thaliana ~135 Mb; human genome ~3.0 Gb
  • The large genome relative to other flowering plants reflects an ancient whole-genome polyploidy event (estimated 30–40 MYA, confirmed by block-level genomic synteny analysis) combined with subsequent retroelement proliferation without proportional genome compaction
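
The size comparison in the list above comes down to a few ratios. The following is a minimal Python sketch that uses only the figures quoted in this entry; the dictionary and variable names are illustrative, not part of any published tool.

```python
# Genome sizes in megabases, taken from the figures quoted in this entry.
GENOME_MB = {
    "C. sinensis var. sinensis": 3020,
    "C. sinensis var. assamica": 3130,
    "human": 3000,
    "rice": 430,
    "Arabidopsis thaliana": 135,
}

REPEAT_FRACTION = 0.80  # approximate repeat content quoted for var. sinensis

tea_mb = GENOME_MB["C. sinensis var. sinensis"]

# Fold difference between the tea genome and each non-tea comparison genome.
fold_vs = {
    name: tea_mb / size
    for name, size in GENOME_MB.items()
    if not name.startswith("C. sinensis")
}

# Split the tea genome into repetitive and non-repetitive sequence.
repeat_mb = tea_mb * REPEAT_FRACTION
non_repeat_mb = tea_mb - repeat_mb

for name, fold in fold_vs.items():
    print(f"tea is {fold:.1f}x the size of the {name} genome")
print(f"repeats: ~{repeat_mb:.0f} Mb, non-repetitive: ~{non_repeat_mb:.0f} Mb")
```

With these inputs, the non-repetitive portion alone (about 600 Mb) is still larger than the entire rice genome, which illustrates how much of the size difference is carried by repeats.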

Chromosome number: C. sinensis has 2n = 30 chromosomes (15 pairs). The diploid state and moderately high chromosome number are consistent with an ancient polyploid that has returned to near-diploid chromosome organization while retaining the expanded gene numbers left by the polyploidy event

Comparison with other Camellia species: C. sinensis is phylogenetically nested within the broader Camellia genus (~280 species). Genome sequencing of C. sinensis alongside C. lanceoleosa (a camellia not used for tea) revealed that many of the tea-specific trait genes (particularly caffeine synthase and theanine synthetase-related genes) are either absent or present in lower copy number in non-tea Camellia species, indicating that these are gains within the C. sinensis lineage rather than deeply ancestral characters shared across all camellias


Key Biosynthetic Gene Families

Catechin Biosynthesis Pathway Genes

Catechins (particularly EGCG, which comprises 50–70% of total catechins in tea) are derived from the general flavonoid pathway via the following gene cascade:

```
Phenylalanine
  ↓ PAL (phenylalanine ammonia-lyase), then C4H and 4CL
4-Coumaroyl-CoA
  ↓ CHS (chalcone synthase)
Naringenin chalcone
  ↓ CHI (chalcone isomerase)
Naringenin (flavanone)
  ↓ F3H (flavanone 3-hydroxylase)
Dihydrokaempferol
  ↓ F3’5’H (flavonoid 3′,5′-hydroxylase) ← KEY EXPANSION GENE
Dihydromyricetin
  ↓ DFR (dihydroflavonol 4-reductase)
Leucodelphinidin
  ↓ ANS (anthocyanidin synthase) + ANR (anthocyanidin reductase); LAR (leucoanthocyanidin reductase) acts at this branch point
Epigallocatechin (EGC)
  ↓ ECGT (galloyltransferase), with the galloyl group donated by galloyl-glucose
Epigallocatechin gallate (EGCG)
```

Gene family sizes in Camellia sinensis vs. Arabidopsis:

  • CYP450 genes (includes F3’H, F3’5’H, and many other hydroxylases): 711 members in C. sinensis vs. 272 in Arabidopsis, a roughly 2.6-fold expansion
  • CsF3’5’H (flavonoid 3′,5′-hydroxylase): Crucial for diverting the pathway toward the B-ring-trihydroxylated epigallocatechin series (EGC, EGCG); tea has 45 copies of CYP75-type genes (the family that includes F3’H and F3’5’H paralogs) vs. 3 in Arabidopsis, a 15-fold expansion that enables the exceptional EGC and EGCG production capacity that defines tea
  • CsLAR (leucoanthocyanidin reductase): 5 copies in C. sinensis vs. 1 in most plants; key for EC synthesis branch
  • CsANR (anthocyanidin reductase): 4 copies; key for epicatechin and EGC gallate synthesis
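
The fold differences implied by these counts can be checked directly. Here is a small Python sketch using only the copy numbers listed above; ANR is left out because no comparison count is given, and the names `FAMILY_COUNTS` and `fold_expansion` are illustrative.

```python
# Gene counts quoted in this entry: (C. sinensis, comparison species).
FAMILY_COUNTS = {
    "CYP450 (all)": (711, 272),  # vs. Arabidopsis
    "CYP75-type": (45, 3),       # vs. Arabidopsis
    "LAR": (5, 1),               # vs. "most plants"
}


def fold_expansion(tea_count: int, other_count: int) -> float:
    """Return how many times larger the tea gene family is."""
    if other_count <= 0:
        raise ValueError("comparison count must be positive")
    return tea_count / other_count


folds = {name: fold_expansion(*counts) for name, counts in FAMILY_COUNTS.items()}

for name, fold in folds.items():
    print(f"{name}: {fold:.1f}-fold")
```

The contrast is the useful part: the CYP450 superfamily as a whole grew by well under 3-fold, while the hydroxylase subgroup tied to catechin B-ring chemistry grew 15-fold, so the expansion is concentrated rather than uniform.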

The importance of CsF3’5’H and CsANR expansion: Expression QTL studies have shown that natural variation in the expression levels of these genes across cultivars (and not necessarily sequence differences) accounts for a large fraction of the observed variation in EGCG as a percentage of total catechins. High-EGCG cultivars (prized for both health and flavor reasons) are therefore primarily cultivars whose regulatory architecture drives high CsF3’5’H and CsANR expression, not necessarily cultivars carrying different structural gene versions.
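
"Accounts for a large fraction of the variation" is usually quantified as the squared correlation (R²) between expression and trait. The sketch below shows that calculation in plain Python. The six data points are invented for illustration and are not measurements from any study; `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Synthetic values for six hypothetical cultivars (NOT real data).
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # relative CsF3'5'H transcript level
egcg_share = [52.0, 55.0, 57.0, 62.0, 63.0, 68.0]  # EGCG as % of total catechins

r = pearson_r(expression, egcg_share)
r_squared = r ** 2  # fraction of trait variance explained by expression

print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

A real expression QTL analysis would also test which genomic variants drive the expression differences; this sketch covers only the expression-to-trait step.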

Caffeine Biosynthesis Genes

Caffeine biosynthesis in tea (and, through convergent evolution, in coffee and cacao) proceeds as follows:

```
Xanthosine (purine salvage pathway)
  ↓ 7-methylxanthosine synthase (CaXMT1 in coffee)
7-Methylxanthosine
  ↓ Nucleosidase
7-Methylxanthine
  ↓ Theobromine synthase (CaMXMT1 in coffee; TCS1 in tea)
Theobromine
  ↓ Caffeine synthase (CaDXMT1 in coffee; TCS1 in tea)
Caffeine
```
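
The pathway above can also be written as an ordered list of intermediates, which makes it easy to count reaction types. In the Python sketch below, the reaction-type labels are an annotation added for illustration: caffeine is 1,3,7-trimethylxanthine, so three methyl-transfer steps are expected, plus one step that removes the ribose.

```python
# Ordered intermediates, from purine precursor to caffeine.
INTERMEDIATES = [
    "xanthosine",
    "7-methylxanthosine",
    "7-methylxanthine",
    "theobromine",
    "caffeine",
]

# Reaction type for each consecutive pair of intermediates (illustrative labels).
STEP_TYPES = [
    "N-methylation",   # xanthosine -> 7-methylxanthosine
    "ribose removal",  # nucleosidase step
    "N-methylation",   # 7-methylxanthine -> theobromine
    "N-methylation",   # theobromine -> caffeine
]


def pathway_steps(intermediates, step_types):
    """Pair each substrate with its product and the reaction type linking them."""
    if len(step_types) != len(intermediates) - 1:
        raise ValueError("need exactly one step type per consecutive pair")
    return list(zip(intermediates, intermediates[1:], step_types))


steps = pathway_steps(INTERMEDIATES, STEP_TYPES)
methylations = sum(1 for _, _, kind in steps if kind == "N-methylation")

for substrate, product, kind in steps:
    print(f"{substrate} -> {product} ({kind})")
print(f"methyl transfers: {methylations}")
```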

Key findings from genome analysis:

  • The TCS1 gene (tea caffeine synthase) is present in 2 copies in both C. sinensis var. sinensis and var. assamica vs. 1 functional copy in coffee (Coffea arabica and Coffea canephora); the gene family (tea caffeine synthase/S-adenosyl-L-methionine-dependent methyltransferases) has undergone specific expansion in the Camellia lineage
  • Comparison with coffee: The caffeine biosynthesis genes in tea and coffee are not orthologous (not descended from the same ancestral caffeine synthase gene). They acquired caffeine synthesis capacity independently from different methyltransferase gene ancestors, one of the clearest examples of convergent evolution at the biochemical level confirmed by genome sequencing; tea’s caffeine synthase is more closely related to tea’s own theobromine synthase than to coffee’s caffeine synthase
  • CsCSN1 and CsCSN2 (caffeine synthase isoforms detected in genome): Expressed differentially by tissue type and growth stage; leaf buds exhibit highest TCS1 expression (consistent with the highest caffeine concentration found in young buds and first leaves)

Theanine Biosynthesis Genes

L-Theanine (γ-glutamylethylamide) biosynthesis:

  • CsTDH (theanine synthetase / γ-glutamylethylamide synthetase): Catalyzes L-glutamate + ethylamine + ATP → L-theanine + ADP + Pi; expressed primarily in roots
  • Root-to-shoot transport: L-theanine synthesized in roots is transported via xylem to young shoots; shoot accumulation is highest in young buds (explaining the umami character of first-flush material)
  • The genome-scale gene family search found that CsTDH is related to the glutamine synthetase gene family; it represents a neofunctionalization of a nitrogen metabolism gene for theanine production, found in tea and some other Camellia species
  • Shade-induced theanine accumulation: When light is reduced (shade growing), nitrogen assimilation shifts and glutamate flux into theanine increases; the molecular mechanism involves changes in CsGS1 (glutamine synthetase) and GDH (glutamate dehydrogenase) expression, whose regulation is still being characterized at the gene-network level
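
Because the theanine synthetase reaction requires ATP, ATP belongs on the substrate side when the reaction is balanced. The Python sketch below checks the atom balance using standard molecular formulas for the neutral species; the `FORMULAS` table and `total_atoms` helper are illustrative names.

```python
from collections import Counter

# Molecular formulas (neutral species) as element counts.
FORMULAS = {
    "L-glutamate": {"C": 5, "H": 9, "N": 1, "O": 4},
    "ethylamine": {"C": 2, "H": 7, "N": 1},
    "ATP": {"C": 10, "H": 16, "N": 5, "O": 13, "P": 3},
    "L-theanine": {"C": 7, "H": 14, "N": 2, "O": 3},
    "ADP": {"C": 10, "H": 15, "N": 5, "O": 10, "P": 2},
    "Pi": {"H": 3, "O": 4, "P": 1},  # orthophosphate written as H3PO4
}


def total_atoms(species):
    """Sum element counts over a list of species names."""
    total = Counter()
    for name in species:
        total.update(FORMULAS[name])
    return total


substrates = total_atoms(["L-glutamate", "ethylamine", "ATP"])
products = total_atoms(["L-theanine", "ADP", "Pi"])
balanced = substrates == products

print("substrates:", dict(substrates))
print("products:  ", dict(products))
print("balanced:", balanced)
```

Dropping ATP from the substrate list leaves the two sides unequal, which is a quick way to see that the reaction cannot be written without it.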

Terpene Synthase Genes

Tea produces hundreds of terpene and terpenoid volatile compounds that contribute to its aroma:

  • C. sinensis genome contains 40+ terpene synthase (TPS) genes, more than Arabidopsis (~32) though fewer than grapevine (roughly 70 putatively functional TPS genes)
  • Many TPS genes are expressed specifically in young leaves and buds, and their expression is upregulated by specific stresses including insect herbivory (explaining the jassid-mediated terpene production in Oriental Beauty and similar bug-bitten teas)
  • Key characterized TPS genes: CsTPS1 (linalool synthase; linalool is one of the most important floral aroma compounds in jasmine-type oolong), CsTPS2/3 (geraniol synthase family; geraniol is precursor to nerol and the rose-scent phenotype of some high-mountain oolong), CsTPS7 (nerolidol synthase; specifically expressed in insect-damaged tissue)

Sinensis vs. Assamica Divergence and Domestication Genetics

A long-standing question is whether C. sinensis var. sinensis and var. assamica represent:

  • (A) Two separate domestication events (independent domestication from different wild C. sinensis populations in different regions)
  • (B) A single domestication with subsequent morphological divergence (one origin, then human selection driving the varieties apart)

Genetic evidence now available:

  • Whole-genome SNP analysis of sinensis and assamica accessions shows genetic distance consistent with divergence approximately 22,000–38,000 years BP, far predating domestication (thought to have begun only about 3,000 years BP for Chinese tea and about 2,000 years BP in Assam)
  • However, reticulate (network-like) gene flow patterns detected in nuclear gene phylogenies suggest the varieties have not been completely isolated since their divergence, with some introgression episodes
  • Current consensus supports polyphyletic or paraphyletic domestication — the ancestral populations that were domesticated were already genetically distinct (i.e., sinensis-like populations in Yunnan/Chinese interior and assamica-like populations in Assam/upper Brahmaputra drainage), making this effectively two domestication events in genetically pre-differentiated populations
  • Yunnan’s wild tea populations (particularly around Xishuangbanna) remain the closest wild relatives of cultivated assamica based on population genomics studies; this is consistent with Yunnan as the center of origin and diversity for the species historically

Cultivar Breeding Applications

Modern cultivar development uses genomic tools:

  • Marker-assisted selection (MAS): SNP markers linked to high-EGCG content (QTL on chromosome 8 associated with CsF3’5’H expression cluster), flavor quality traits (QTL on chromosomes 2 and 11 linked to linalool and geraniol production capacity), and cold tolerance (QTL on chromosome 4 associated with cold-shock response genes)
  • Genome-wide association studies (GWAS): Scanning natural cultivar diversity for SNP association with key traits including resistance to the tea green leafhopper (Empoasca onukii), the most damaging insect pest in Chinese tea production; CsJAZ repressor genes and CsMYB transcription factors associated with jasmonate-signaled defense response have been identified
  • Important developed cultivars: Longjing 43 (fast-flush cultivar with 10–14 days earlier harvest than standard, developed by classical selection); Jinkuan (high amino acid content); Zhongcha 108 (pest resistance emphasis). All date from the pre-genomic era; the genomic era (post-2017) has so far produced pilot programs rather than widely deployed cultivars
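
The core of marker-assisted selection can be shown in a few lines: compare trait means across genotypes at one marker, then keep the seedlings that carry the favorable allele. The Python sketch below is a toy example under stated assumptions. The seedling records, the SNP alleles, and the choice of "A" as the favorable allele are all invented for illustration.

```python
from statistics import mean

# Hypothetical seedling records: (id, genotype at one SNP, EGCG as % of total catechins).
# All values are invented for illustration.
SEEDLINGS = [
    ("S01", "AA", 66.0),
    ("S02", "AG", 61.0),
    ("S03", "GG", 54.0),
    ("S04", "AA", 68.0),
    ("S05", "AG", 60.0),
    ("S06", "GG", 56.0),
]

FAVORABLE_ALLELE = "A"  # assumed to come from a prior association study


def mean_trait_by_genotype(records):
    """Average the trait value within each genotype class."""
    groups = {}
    for _seedling_id, genotype, trait in records:
        groups.setdefault(genotype, []).append(trait)
    return {genotype: mean(values) for genotype, values in groups.items()}


def select_by_marker(records, favorable_allele, min_copies=1):
    """Return IDs of seedlings carrying at least min_copies of the favorable allele."""
    return [
        seedling_id
        for seedling_id, genotype, _trait in records
        if genotype.count(favorable_allele) >= min_copies
    ]


means = mean_trait_by_genotype(SEEDLINGS)
selected = select_by_marker(SEEDLINGS, FAVORABLE_ALLELE, min_copies=2)

print(means)
print("homozygous favorable:", selected)
```

The practical appeal for a slow-growing perennial like tea is that genotyping can be done on seedlings, long before the leaf chemistry itself could be measured on a mature bush.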

Common Misconceptions

“Sinensis and assamica are so different they should be separate species.” Modern genomic evidence confirms they are varieties (subspecies at most) of a single species with a shared ancestor, not separate species; gene flow and reticulate evolution between them are well documented, and the genetic distance is far smaller than that between recognized plant species.

“Tea genetics is simpler to study than coffee or cacao genetics.” Tea has one of the largest crop plant genomes sequenced; its ~80% repeat content makes genome assembly challenging; the large TPS and CYP450 gene families with many paralogs make functional annotation difficult; tea genetics research is actually more computationally demanding than many other food crop genome projects.


See Also

  • Cultivar — the fundamentals entry covering what defines a tea cultivar versus a variety or landrace, how cultivars are selected and propagated (vegetative cloning for commercial cultivars vs. seed propagation for wild-type), the major Chinese, Japanese, Taiwanese, and Indian cultivar groups (Qingxin, Yabukita, Assam clones, etc.) and their flavor profiles, and how terroir interacts with cultivar genetics to produce the final cup character; this genetics entry provides the genomic basis for why cultivar differences exist at the molecular level (gene expression patterns, gene copy number differences) while the cultivar entry provides the practical consumer and agricultural context for how these genetic differences translate into tea market segments and brewing preferences
  • EGCG — the entry covering epigallocatechin gallate as both a flavor compound (contributing bitterness and astringency to tea, with concentration varying by cultivar, growing conditions, and processing) and a bioactive compound (the most studied catechin for health benefits, with antioxidant, anti-inflammatory, and metabolic activity at various levels of evidence); the CsF3’5’H gene expansion described in this genetics entry is the direct genomic explanation for why Camellia sinensis produces vastly more EGCG than most other plants, and cultivar-level variation in EGCG content maps to differential expression of the same gene family clusters characterized here

Research

  • Xia, E. H., Zhang, H. B., Sheng, J., Li, K., Zhang, Q. J., Kim, C., et al. (2017). The tea tree genome provides insights into tea flavor and independent evolution of caffeine biosynthesis. Molecular Plant, 10(6), 866–877. DOI: 10.1016/j.molp.2017.04.002. First high-quality chromosome-level genome assembly of Camellia sinensis var. sinensis (C. sinensis cv. Shuchazao); assembled 3.02 Gb; 36,951 predicted genes; annotation confirmed massive expansions in CYP450 family (711 members), terpene synthase family (40+ members), and SCPL (serine carboxypeptidase-like acyltransferases, including galloylation pathway); comparative analysis of caffeine synthase (TCS1) gene family with coffee (Coffea canephora) caffeine synthase gene confirmed non-orthologous origin (convergent evolution of caffeine biosynthesis in tea vs. coffee); metabolomics integration detected 198 secondary metabolites significantly enriched in young buds, with EGCG and L-theanine showing highest tissue-specificity; established the genomic foundation for all subsequent population genomic and functional genomic work in tea and remains the primary genome reference for sinensis variety research
  • Xia, E. H., Tong, W., Hou, Y., An, Y., Chen, L., Wu, Q., et al. (2020). The reference genome of tea plant and resequencing of 81 diverse accessions provide insights into its genome evolution and adaptation. Molecular Plant, 13(7), 1013–1026. DOI: 10.1016/j.molp.2020.04.010. Improved reference genome assembly combined with population resequencing of 81 C. sinensis accessions spanning sinensis, assamica, pubilimba, and taliensis varieties; phylogenomic dating estimated sinensis-assamica common ancestor approximately 22,000–38,000 years BP; admixture analysis detected four ancestral genetic clusters with evidence for complex reticulate gene flow between sinensis and assamica populations (inconsistent with a clean bifurcating divergence model); specific stress-response gene expansions characterized: drought-tolerance-associated CsDREB transcription factors (drought-responsive element binding; 23 members in C. sinensis vs. 9 in Arabidopsis); CsWRKY defense transcription factor expansion (88 members); cold acclimation gene CBF cluster expression; established the population genomic context for MAS cultivar development programs and identified high-priority QTL regions for breeding targets in tea quality traits