Skip to main content

A systematic search for positive selection in higher plants (Embryophytes)



Previously, a database characterizing examples of Embryophyte gene family lineages showing evidence of positive selection was reported. Of the gene family trees, 138 Embryophyte branches showed Ka/Ks>>1 and are candidates for functional adaptation. The database and these examples have now been studied in further detail to better understand the molecular basis for plant genome evolution.


Neutral modeling showed an excess of positive and/or negative selection in the database over a neutral expectation centered on the mean Ka/Ks ratio. Out of 673 families with assigned structures, 490 have at least one branch with Ka/Ks >>1 in a region of the protein, enabling a picture of selective pressures delineated by protein structure. Most gene families allowed reconstruction back to the last common ancestor of flowering plants (Magnoliophytes) without saturation of 4- fold degenerate codon position. Positive selection occurred in a wide variety of gene families with different functions, including in the self incompatibility locus, in defense against pathogens, in embryogenesis, in cold acclimation, and in electrontransport. Structurally, selective pressures were similar between alpha-helices and beta- sheets, but were less negative and more variant on the surface and away from the hydrophobic core.


Positive selection was detected statistically significantly in a small and nonrandom minority of gene families in a systematic analysis of embryophyte gene families. More sensitive methods increased the level of positive selection that was detected and presented a structural basis for the role of positive selection in plant genomes.


The search for lineage- specific genes and phenotypically important lineage- specific evolution has intensified with the increasing number of genome sequences, driven through population genetic (intra- specific) analysis of SNPs and through inter- specific molecular evolutionary analysis [1]. Gene duplication has classically been viewed as a source of evolutionary innovation [2]. Differing views have emerged as to the role of gene duplication in driving evolutionary novelty, including concerted evolution in tandemly repetitive families, mediating genetic robustness, subfunctionalization, and neofunctionalization de novo or involving pre-adaptation [37]. In both paralogs (emerging from lineage- specific gene duplication events) and in orthologs, phenotypically important changescan emerge along a specific lineage via changes in gene expression, in alternative splicing, in coding sequence evolution, or through a host of other mechanisms (see [8] for a discussion). This study will focus mostly on coding sequence evolutionary dynamics.

Amino acid substitutions in a protein sequence can be advantageous, neutral or deleterious to the organism. There are several links in this process. An individual substitution can effect individual protein folding or function. The individual protein then interacts with other molecules in the execution of cellular and organismal functions. These functions are selectable as they differentially effect the organism's ability to survive, mate, and reproduce in competition with other individuals in the species, as modulated by the effective population size of the species. On top of this, environments can change, changing the structure and optima in the fitness landscapes with time (giving temporal variation to the flatness/ruggedness of the fitness landscape).

Even without environmental change, the nature of this fitness landscape is open for debate. Driven by the clock- like evolution of many gene families, the view that a large number of substitutions are neutral or nearly neutral on a flat fitness landscape emerged ([9], see [10] for a review). An alternative view to account for this is presented through a selectionist interpretation of a rapidly changing fitness landscape driven by strong selective pressures on protein stability and dynamics [11, 12]. Evidence for this view comes from the metastability of proteins. However, it has been shown that the metastability of proteins can emerge as a neutral product of the selection process [13]. This has led to yet another hybrid view, where protein sequences take neutral walks through sequence space, punctuated by rarer adaptive changes. The positive selective effects can be dictated by simple compensatory co- variational changes, but can also be explained by true environmental adaptation.

In examining the evolution of individual sites, the molecular clock frequently breaks down. A gamma distribution has been used to model the distribution of rates across sites to characterize the differences in site-specific selective pressure [14]. More recently, it has been proposed that a single gamma distribution may not be adequate and sites shift co-evolutionarily through evolution in a process called heterotachy [15]. The complexity of this process has even led some to believe that biology has no underlying rules or mechanisms and no models are possible (see for example [16]). Ignoring this last view, we seek to obtain an overview of the substitutional process and characterize the positive selection that has been detected in embryophytes to shed light on the evolutionary process.

Several methods have been developed to detect positive selection in protein-encoding genes (for a review, see [17]). One way to evaluate the selective pressures on protein evolution is to compare the rate of synonymous and non-synonymous nucleotide substitutions. Ks is the estimated number of synonymous changes per synonymous site and corresponds to the rate of amino acid- neutral evolution. Ka, on the other hand, is the number of non- synonymous substitutions per non-synonymous site. Under neutral protein- level evolution Ka should be equal to Ks and hence the ratio Ka/Ks = 1. Deviations from this mark selective pressures at the protein level. It should be noted that Ks itself can be under selective pressures that do not act on Ka, like selection for optimal expression level based upon codon use and tRNA concentration [18]. However, this effect is not expected to generate a large false positive rate in the use of Ka/Ks to detect positive selection.

A Ka/Ks ratio < 1 indicates negative (purifying) selection. The criterion for positive (adaptive) selection is Ka/Ks>1. This is a quite stringent criterion of positive selection, when averaged over an entire gene and is likely to miss certain cases where positive selection is operating (even with more powerful approaches [19]). The method used to calculate the Ka/Ks ratio is averaged over time and so negative selection over the majority of time covering a long branch can mask a short period of positive diversifying selection. Similarly, the method averages over all sites, including those in the hydrophobic core that are frequently under very negative selective pressure to retain a folded scaffold. To avoid that problem, in a straight -forward approach, one can use a sliding window to detect selection (one- dimensional window analysis [2022]). This approach is designed to detect selective sweeps, where additional mutations hitchhike to fixation together with a positively selected residue. On the other hand, this method will miss strong selective pressures acting on individual regions of a protein driven structurally or functionally. Therefore one can take advantage of a known three- dimensional structure of the protein of interest and group sites according to their proximity in the structure, estimating Ka/Ks in different regions [23, 24]. This 3D windowing method works by sliding a sphere of defined radius through a three dimensional structure and calculating Ka and Ks within each sphere for each branch of a three dimensional structure. If the sequence divergence within a family is small, it can generally be assumed that the structure will have changed little over the course of evolution within the family and homology model generation is not necessary. In that case, a solved 3D structure is only needed for one member of a close gene family.

The analysis of the evolution of plant gene families has been limited to a few selected cases where adaptive evolution has been suspected: the self- incompatibility locus proteins in Solanaceae [20], S- phase kinase-associated protein 1 (SKP1) [25], trypsin inhibitors [26, 27], the anthocyanin pathway [28, 29], and chitinase genes [30]. Other studies focused on specialized genera and discovered adaptive evolution in, for example, cytochrome c oxidase of bladderworts [31]. The release of the fullgenome sequences of Arabidopsis thaliana (thale cress; Arabidopsis) [32] and Oryza sativa (rice) [33, 34] brought a whole new dimension into the field, including the study of gene and genome dynamics [35, 36]. Comparing a selection of Arabidopsis lyrata (lyrate rocketcress) ESTs with the whole genome sequences from Arabidopsis, Barrier and co- workers [37] discovered that 14 out of 304 orthologous genes revealed signs of adaptive evolution.

Analyzing the data in The Adaptive Evolution Database (TAED) [38], we studied 87 protein families with at least one branch showing a Ka/Ks>>1. Further analysis provided both a characterization of selective pressures and substitution rates, including as dictated by structure using the 3D windowing approach described above.


From 138,662 Embryophyte genes from GenBank release 136, 4,216 gene families were built in TAED. The ratios of non- synonymous to synonymous nucleotide substitution rates (Ka/Ks) cover a range from 0.00 to 7.21 per branch with an average value of 0.21 ± 0.04 (Figure 1). Modeling shows that the database contains more negative and/or positive selection than the neutral expectation about the mean Ka/Ks ratio (itself a signal of negative selection). After manual curation of the automated results, 87 families showed Ka/Ks>>1. Of these, 53 families were pairs of putative proteins or proteins of unknown function. Most of these pairs originated from the genome annotations of Arabidopsis and rice. Although most of these pairs originated from the above genome annotations, they were all confirmed by an EST in at least one species (and conservation across species without introduction of spurious stop codons) and therefore were presumed to be real rather than annotation artifacts. 41 of the 53 pairs were even confirmed by EST detection in at least two species. Future embryophyte gene and genome sequencing in these families will create a fuller picture of these evolutionary events.

Figure 1

The distribution of Ka/Ks values calculated from TAED and from a null model derived from gene family analysis parameters, as measured with PAML [62], and re- simulated in the absence of positive selection, is shown. An equal number of branches is represented for each dataset. An excess of positive and/or negative selection is observed from the comparative genomic data.

8 sequence pairs and 26 families with 3 or more sequences of 'known' function showed Ka/Ks rate ratios significantly greater than 1 (Table 1). The highest values were recorded for a ferredoxin in the sequence pair Zea mays (maize) and rice (7.21 ± 0.47) and a pair of mannose binding lectins in rice and Hordeum vulgare (barley; 5.93 ± 0.28).

Table 1 A list of Embryophyte gene families with high Ka/Ks values detected using whole gene averaging is shown. The p- value and expected rate of false positive detection given multiple tests per detected branch hypothesis (Bon*) are shown as a measure of statistical significance. Statistical significance is measured against simulated proteins evolved according to measured parameters rather than against evolved pseudogenes evolving neutrally at the amino acid level (as is sometimes used as a null model in evolutionary studies). The null distribution is therefore centered at 0.21 rather than 1.0.

We also calculated Ka/Ks rate ratios for families where a structure existed in PDB at a maximal distance of 70 PAM units. The values in the 10Å windows calculated cover a range from 0.00 to 27.46 with an average value of 0.35 ± 0.20 (Figure 2). Out of 673 families, 490 had at least one branch with Ka/Ks>>1.

Figure 2

The distribution of Ka/Ks values calculated from TAED using a 3D windowing (structurally- delineated) approach is shown overlaid on the whole gene averaged dataset from Figure 1. An equal number of branches is represented for each dataset. An excess of positive (and negative) selection is detected in the 3D windowing approach compared with the whole gene averaged approach (see the main text for an expanded discussion of this).

The gene families with high Ka/Ks values covered a wide range of cellular functions, including at the self incompatibility locus and at shoot elongation (Figure 3). In the global analysis more than 65% of the families under positive selection were of unknown function. 11% had a catalytic activity and 6% were involved in some kind of binding (DNA or protein). When the large number of proteins of unknown function were excluded from GO annotation categorization, enzyme inhibitors, and genes with nutrient reservoir activity, cell wall organization and biosynthesis, and cellular response to water deprivation functions were over- represented in the high Ka/Ks dataset. When 3D windowing was used (Figure 4), then proteins with catalytic activity, and those involved in protein biosynthesis, development, intracellular transport, RNA/DNA metabolism, and organelle organization and biosynthesis joined the list of over- represented GO terms. Some examples of defense – related proteins were detected, although surprisingly, defense- related proteins in general were not significantly over- represented in the high Ka/Ks dataset.

Figure 3

GO annotations for the gene families where positive selection was detected using whole gene averaging are shown. The text describes these GO annotations normalized by the total number of gene family branches.

Figure 4

GO annotations for the gene families where positive selection was detected using 3D windowing are shown. Compared with Figure 3, the two methods detect different distributions of functions, with an expanded number of positively selected proteins identified with 3D windowing. The text describes these GO annotations normalized by the total number of gene family branches.

Within the self incompatibility locus, several RNases seem to have been under positive selection. In Solanaceae, this family has 25 branches with values significantly above 1. In Rosaceae four high values occurred on branches leading to Pyrus pyrifolia (sha li) (1.18 ± 0.02), to Crataegus monogyna (hawthorn) (1.20 ± 0.02 and 1.10 ± 0.01), and to a clade containing Malus x domestica (apple), Crataegus monogyna and Sorbus aucuparia (rowan) (1.04 ± 0.01). A family of cysteine rich S- locus proteins [39] from Brassica species had 11 significant branches. An S- locus F- box protein in Prunus mume (Japanese apricot) showed allelic diversity [40] and we found a Ka/Ks value of 1.31 ± 0.01 on the branch to protein b.

Other families under positive selective pressure are found in plant defense mechanisms. In the chitinase III family of Poaceae a branch of 1.08 ± 0.01 leading to rice was observed. The neighboring branch led to a clade containing other rice sequences and Triticum turgidum (poulard wheat). The Triticum gene is annotated as xylanase inhibitor. The chitinase A and B genes were found in family 9484 and showed similarly elevated values on branches to maize as previously reported [30].

Two thionin families, proteins involved in resistance against bacterial pathogens, showed evidence of positive selection. In Poaceae, two values occurred between Avena sativa (oat) and barley, 1.21 ± 0.02 on the branch to oat, and between the class V thionins and the purothionins in Triticeae (wheat, oat, barley; 1.30 ± 0.03). A flower- specific gamma thionin in Solanaceae showed a value of 1.29 ± 0.03 on a branch leading to Lycopersicon esculentum (tomato) and Capsicum chinense (bonnet pepper), compared to a clade with Petunia x hybrida (garden petunia) and Nicotiana excelsior (tobacco).

CCD1 is an EF- hand Ca2+ binding protein which is involved in fast response to pathogen attacks in wheat [41]. The Ka/Ks value for the sequence pair Triticum aestivum (bread wheat) and Arabidopsis on the Magnoliophyta node was 1.08 ± 0.08.

Several Late Embryogenesis Abundant proteins with high Ka/Ks rate ratios were also observed. LEA proteins are involved in desiccation tolerance in the drying seed. The highest value was found for proteins from Glycine max (soybean) and Arabidopsis in a subfamily of the D113 proteins (1.91 ± 0.08). Another family included proteins probably involved in cold- response from bread wheat and Cicer arietinum (chick- pea) at a value of 1.78 ± 0.07. The third candidate for adaptation on LEA proteins was the PACCAD clade of grasses containing e.g. maize [42]. This clade showed a Ka/Ks rate ratio of 1.12 ± 0.01. BudCAR6 (or CAS15) was a protein associated with cold induced freezing tolerance in Medicago sativa (alfalfa) [43]. A homologous protein in Vitis vinifera (grape) is induced in ripening fruits [44]. The gene pair had a Ka/Ks value of 1.16 ± 0.06.

A trypsin inhibitor family from poplar had two high branches at 1.76 ± 0.02 and 1.06 ± 0.01 leading to Populus tremuloides (trembling aspen). Cathepsin D inhibitors (aspartic proteinase inhibitors) are proteins induced by jasmonic acid, i.e. upon wounding of the plant by herbivory. Genes from Solanum tuberosum (potato) and Solanum nigrum (black nightshade) showed a value of 1.03 ± 0.01 on the branch to potato.

Proteinase IV is a cysteine proteinase with a preference for glycyl bonds from the latex of papya [45] and had a value of 1.05 ± 0.01 in Caricaceae on the branch to Carica candamarcensis (mountain papaya).

Other families showing signs of adaptive evolution were found in plant development and intracellular transport. Cytochrome P450 proteins are involved in steroid synthesis and oxygenation reactions. In rice, a cytochrome P450- like protein was separated by a branch with a value of 1.06 ± 0.01 from a clade with Arabidopsis, rice, and Musa acuminata (banana). Dof zinc finger transcription factors are plant specific proteins and showed a Ka/Ks value of 1.14 ± 0.06 in the Magnoliophyta pair Arabidopsis and barley. The phytochrome family is involved in light perception and plant development [46]. Within Coniferales, a branch leading to Cephalotaxus fortunei (Fortune's plum yew) with a Ka/Ks value of 1.12 ± 0.01 was observed.

Two families of lipid transfer proteins in monocotyledons also showed signs of positive selection. The first family is root specific and had a Ka/Ks ratio of 1.26 ± 0.03 on a branch towards rice. The second family had two branches under positive selection. The first branch within Triticeae had a value of 1.04 ± 0.02 and led to bread wheat and poulard wheat. The second branch is in Poales and led to Poaceae (rice, barley) with a value of 1.43 ± 0.03.

TIM23 is a family of mitochondrial inner membrane proteins involved in the translocation of proteins in to the mitochondrial matrix. There are generally three isoforms in each species. The branch leading to rice isoform 3 showed a Ka/Ks value of 1.12 ± 0.06.

Ferredoxin is a soluble low molecular weight protein that mediates transfer of one electron from a donor to an acceptor. It is involved in a broad spectrum of redox metabolism in plastids and mitochondria. The extreme value of 7.21 ± 0.47 separated the sequence pair of maize and rice.

Beta expansins and xyloglucane endotransglycosylases are proteins involved in shoot elongation. Both genes seem to have been under positive diversifying selection in the monocot Schedonorus pratensis (meadow ryegrass). The beta expansins have a high Ka/Ks value of 1.21 ± 0.01 in comparison to bread wheat. And the xyloglucane endotransglycosylase Xet2 has a value of 1.02 ± 0.01 compared to barley.

In the 3D- windowing approach, 49% of the families with nucleotide substitution rate ratios above 1 had catalytic activity and only 10% were of unknown function. Binding activities occurred in 17 % of the families (Figure 4). The largest number of additional functions observed to have undergone positive selection compared to the global sequence analysis, were hydrolases and MADS- box transcription factors. These proteins have defined specific selectable binding pockets.

In Table 2 we show ratios of identical four- fold degenerate codons for the 10 most populated nodes in the embryophyte tree of life in a pairwise analysis. This shows that even without considering phylogeny, the average family can be analyzed back through the last common ancestor of flowering plants (Magnoliophytes).

Table 2 The 10 most populated nodes in the Embryophyte tree of life are shown together with the fraction of identical four- fold degenerate codons for all pairwise comparisons linked by that speciation node and also with the number of species in TAED under that node.

In examination of general structural constraints on sequence evolution as seen in Table 3, the hydrophobic core has the strongest negative selection, while residues of intermediate solvent accessibility and those on the surface showed less negative selective pressures with larger variances of selective pressure. No significant differences were seen between secondary structural elements (data not shown).

Table 3 After partitioning sites into categories of solvent accessibility for 367 TAED families with solved three dimensional structures, Ka/Ks values were calculated across each family over branches and partitions. A stronger negative selective pressure with less variance is generally observed for the hydrophobic core of proteins.


The database underlying the systematic study performed here contains 4216 gene families. In a global sequence analysis, 87 of these families showed at least one branch with a Ka/Ks ratio above 1. Most of the gene families identified to be under positive selective pressure were just sequence pairs. The majority of these pairs came from the genome annotations of the only two fully sequenced genomes of Arabidopsis and rice. This is also the reason why a lot of these pairs were of unknown function. Furthermore, the two genomes were from quite distant species: Arabidopsis is a dicotyledonous plant, while rice a monocotyledonous plant, and they shared a last common ancestor about 135 – 250 Mya [47]. This point in Table 2 did not, on average, show saturation in a pairwise comparison of sequences.

In the future, more embryophyte genomes will be fully sequenced and this will be reflected in the database that was used for this study. This will eliminate the need to include potentially lower quality EST sequences in the analysis to enrich phylogenetic coverage. The number of species with ESTs is currently much higher than the number of species with full cDNA sequences. In addition, the EST database covers a broader spectrum of cellular functions than the fully sequenced known genes. However, it is anticipated that the general picture of embryophyte coding sequence evolution will remain robust as underlying datasets become fuller.

Most of the gene families we found have molecular functions that are known to be under positive selective pressure. The self- incompatibility locus (S- locus) in flowering plants is a highly polymorphic region that inhibits self- fertilization. The genes in this region are under a high diversifying selective pressure, as Clark and Kao already showed for Solanaceae [20]. Mating and sexual selection are classic examples of an evolutionary arms race.

Defense genes were also expected to show up in our analysis, as they are involved in the cellular arms race (another classic example) against pathogens, and different species show resistance to different pathogens. Individual defense gene examples were detected, but this was not general in a statistically significant way. The functions involved among those detected include chitinases, proteinases, proteinase inhibitors, and lectins. The chitinase III family in our database showed a high Ka/Ks branch to a rice gene, other genes in this clade are annotated as xylanase inhibitors. Further analysis on the sequences and especially their biochemical function will have to show if this clade is working as a chitinase or/and as a xylanase inhibitor in vivo and what causes this functional change.

Among the defense related genes we can also find the lectins, which are a quite heterogeneous family of carbohydrate binding proteins involved in defense against herbivory and probably cellular signaling [48]. Other lectins are used as storage in leguminous seeds, where they can act as major food allergens in humans.

The results from the 3D- windowing show a quite different pattern of cellular functions under selective pressure than the global analysis (Figures 3 and 4). Some of this is due to functional sampling bias in PDB. However, the increased detection of positive selection using this approach may be linked to regions having undergone true functional change. It has been proposed that many proteins can change function under positive selection using only a small number of residues [49]. Alternatively, it might be taken as evidence for structural co- evolution in a heterotachy model (some structural co-evolution occurs allosterically- see [50] for example). Interestingly, applying a method that searches for evidence of multiple gamma distributions to characterize a gene family does not show a significant correlation between shifting gammas and Ka/Ks>>1 [[51], Pierre Pontarotti, personal communication]. This lack of a significant correlation remains unexplained, as a correlation would be expected if both methods are detecting the same signal of positive diversifying selection leading to functional shifts.

The distribution of non- synonymous to synonymous nucleotide substitution rate ratios in the database from 0.00 to 7.21, with an average of 0.21 ± 0.04 shows a strong negative selection on most branches in our gene families. The simulated sequences created with the same average Ka/Ks ratio has a much narrower distribution, with many fewer examples of Ka/Ks>1 (Figure 1). This indicates an excess of negative and positive selection acting on the embryophyte tree of life. Comparing the data from the 3D- windowing to the global sequence analysis, we see a similar distribution of values in both cases, with some additional positive selection occurring in the 10 Å windows, shifting the average value to 0.35 ± 0.20. These windows tended to correspond with regions of proteins that were more solvent accessible, consistent with the results seen in Table 3. It should be emphasized that the same codon positions can occur in multiple windows and that more windows contain surface positions than windows containing only hydrophobic core residues. In addition to the greater resolution to detect positive (and negative) selection with a greater variance of Ka/Ks ratios, the asymmetrical distribution of windows can explain the increase in the average Ka/Ks value (see [24] for a discussion of the statistical properties of this method).

In general, the 3D windowing approach offers greater power to detect regions where positive selection may have occurred, as proteins evolve new functions in specified regions of proteins (for example binding pockets) while retaining the same fold and hydrophobic core. 3D windowing is especially attuned to detecting the signal from the region undergoing neofunctionalization without averaging the signal together with that from the highly conserved core and folding- critical residues.

The analysis of the 10 most populated nodes for the fraction of identical four- fold degenerate codons in a pair wise comparison of sequences, shows, that saturation is not reached on average nodes as far back as the last common ancestor of flowering plants 135 to 250 million years ago [47]. This allows us to compare the sequences without ancestral sequence reconstruction on most Magnoliophyte families, including the split between Arabidopsis and rice. Further, using reconstructed ancestral sequences has been shown to increase the evolutionary time that can be accessed without saturation, dependent upon the tree topology and articulation, as reflected in individual branch lengths [52]. There is, of course, a large degree of variation between families and this must be evaluated on a family by family basis.

This study has characterized a snap shot of the evolutionary process acting upon higher plant gene families. We find a large degree of non-random variation in the selective process across gene families, including an excess of positive selection in some functions characterized by an evolutionary arms race. Examining evolution in the context of three dimensional structure adds power to the analysis in detecting additional positive selection and displaying the complexity of the selection process. Future work will extend this analysis to enhance our view of both the general evolutionary process and of the functions that are shifting through protein coding gene sequence as closely related species diverge.


Positive selection was detected in specific gene families along specific lineages providing a set of candidate genes for functional adaptation that warrant further experimental study. Further, this set of genes represented a small but nonrandom and statistically significant part of plant genome evolution. This global picture, in combination with studies at the population genetic level are providing a general picture of the role of selective forces in plants in shaping biodiversity through lineage- specific processes.


The TAED database was built as described in Roth et al. [38]. 138,662 embryophyte sequences longer than 10 amino acids from 4,568 species were extracted from GenBank release 136 [53]. After an all-against- all BLAST search [54] global PAM distances were calculated for each hit using Darwin [55]. Families were built by single linkage clustering from sequences annotated as complete, with pairwise PAM distances of 100 PAM units above a relative length threshold. For each family, a multiple sequence alignment was calculated using POA [56]. Phylogenetic trees were estimated by Bayesian inference using MrBayes [57]. A novel soft parsimony approach [58] was used to simultaneously root the trees and map them onto the NCBI taxonomy. Distant families were split based on alignment quality as measured in the percentage of gaps and where the oldest node in the tree was identified as a duplication event rather than a speciation event. This served to increase alignment quality, a possible source of false positive positive selection. For each branch of every phylogenetic tree, Ka/Ks ratios were calculated using a previously established ancestral sequence reconstruction and counting- based approach (see [5961]) with a Ks lower cut- off of 0.05. Branches were preliminarily considered significant if Ka- V(Ka)>Ks+V(Ks) and at least two non- synonymous substitutions occurred along the branch. This statistical assessment was refined through evaluation with a neutral null hypothesis, as described below. The procedure described above automatically generated the TAED database and hits were manually curated for this study. Manual evaluation of positively selected genes involved examination of the alignments, where in families with branches with Ka/Ks>1 and alignments where some sequences had a high percentage of gaps, the effect of these sequences on the alignment and subsequent calculation of Ka/Ks ratios was evaluated.

PDB- structures were assigned to families at a distance of 70 PAM units. For each family with a structure, a new multiple sequence alignment was calculated included the structure. From this, sequences were threaded through the structure and contact maps with a radius of 10Å were calculated. Then Ka/Ks ratios were determined within these 10Å- windows for each sphere along each branch of every phylogenetic tree [24]. Significant branches were selected as previously described [24].

To compare the results with a null (actually a combination of neutral and negative selection in the absence of positive selection) model, we used Evolver from the PAML package [62]. Families were generated with non-Ka/Ks parameters sampled proportionately from the distribution obtained by codeml (PAML) analysis of all TAED families and with the amino acid distribution derived from TAED as a whole. Trees and branch lengths were randomly selected from the distribution in the database, and Ka/Ks was set to the mean value of the database (0.21). Ka/Ks values were then calculated for these simulated families as described above. The simulated dataset was used to calculate both p- values and an expected false positive rate, given multiple tests (equivalent to the Bonferroni correction multiplied by the number of tests) to assess statistical significance. Many of these whole gene calculations will likely be unambiguous true positives when more sensitive methods that look at sequence subsets (like [24]) are employed. Employing a null model that is based upon positive selection free parameters estimated from comparative genomic data rather than a null model based upon pseudogenes (which these are not) is expected to give a more realistic (less overly conservative) picture of statistical significance, even though this has not traditionally been employed in evolutionary studies.

Saturation of synonymous sites was calculated as the fraction of identical four- fold degenerate codons in a sequence pair if at least five shared codons occur. These pairwise values were projected onto the node of the species tree that joined them from the mapping of the gene family tree to the species tree. From these pairwise values, average gene saturation at every node in a gene family tree was calculated. Then these nodes were projected onto the corresponding node in the embryophyte tree of life. This is different from the along- branch calculation used in TAED to discard saturated datapoints, but is given as an estimate of how far back (on average), pairwise analysis enables estimation of Ks.

367 families in TAED with close structural homologs in PDB were used to categorize general structural properties of sequence evolution. Using DSSP, both the secondary structural elements and solvent accessibility of each residue site were identified [63]. After partitioning residues into secondary structural categories and solvent accessibility categories, Ka/Ks calculation was performed in the 367 families over each branch for each partition and averaged values compared.

To determine which GO terms [64] were over- represented among the gene lineages having undergone positive selection, the functional annotations of the gene family dataset as a whole and the global and tertiary windowing high Ka/Ks datasets were compared to GO, and gene families assigned to a small number of selected GO terms. Over-representation was determined by examining the ratio of GO term frequency per positively selected branch to GO term frequency in the total set of test branches in the TAED database.


  1. 1.

    Haag ES, True JR: Perspective: From mutants to mechanisms? Assessing the candidate gene paradigm in evolutionary biology. Evolution. 2001, 55: 1077-1084. 10.1554/0014-3820(2001)055[1077:PFMTMA]2.0.CO;2.

    PubMed  CAS  Google Scholar 

  2. 2.

    Ohno S: Evolution by gene duplication. New York: Springer-Verlag; 1970.

    Google Scholar 

  3. 3.

    Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 1999, 151: 1531-1545.

    PubMed  CAS  PubMed Central  Google Scholar 

  4. 4.

    Ohta T: Evolution of gene families. Gene. 2000, 259: 45-52. 10.1016/S0378-1119(00)00428-5.

    PubMed  CAS  Article  Google Scholar 

  5. 5.

    Moore RC, Grant SR, Purugganan MD: Molecular population genetics of redundant floral- regulatory genes in Arabidopsis thaliana. Molecular Biology and Evolution. 2005, 22: 91-103. 10.1093/molbev/msh261.

    PubMed  CAS  Article  Google Scholar 

  6. 6.

    Francino MP: An adaptive radiation model for the origin of new gene functions. Nature Genetics. 2005, 37: 573-577. 10.1038/ng1579.

    PubMed  CAS  Article  Google Scholar 

  7. 7.

    Rastogi S, Liberles DA: Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evolutionary Biology. 2005, 5: 28-10.1186/1471-2148-5-28.

    PubMed  PubMed Central  Article  Google Scholar 

  8. 8.

    Liberles DA: Datasets for evolutionary comparative genomics. Genome Biology. 2005, 6: 117-10.1186/gb-2005-6-8-117.

    PubMed  PubMed Central  Article  Google Scholar 

  9. 9.

    Kimura M: Evolutionary rate at the molecular level. Nature. 1968, 217: 624-626. 10.1038/217624a0.

    PubMed  CAS  Article  Google Scholar 

  10. 10.

    Nei M: Selectionism and Neutralism in Molecular Evolution. Molecular Biology and Evolution. 2005, 22: 2318-2342. 10.1093/molbev/msi242.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  11. 11.

    Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA: Structure, function and evolution of multidomain proteins. Current Opinion in Structural Biology. 2004, 14: 208-216. 10.1016/

    PubMed  CAS  Article  Google Scholar 

  12. 12.

    DePristo MA, Weinreich DM, Hartl DL: Missense meanderings in sequence space: A biophysical view of protein evolution. Nature Reviews Genetics. 2005, 6: 678-687. 10.1038/nrg1672.

    PubMed  CAS  Article  Google Scholar 

  13. 13.

    Taverna DM, Goldstein RA: Why are proteins marginally stable?. Proteins. 2002, 46: 105-109. 10.1002/prot.10016.

    PubMed  CAS  Article  Google Scholar 

  14. 14.

    Yang Z: Maximum – likelihood phylogenetic estimation from DNA- sequences with variable rates over sites – Approximate methods. Journal of Molecular Evolution. 1994, 309: 306-314. 10.1007/BF00160154.

    Article  Google Scholar 

  15. 15.

    Lopez P, Casane D, Philippe H: Heterotachy, an important process of protein evolution. Molecular Biology and Evolution. 2002, 19: 1-7.

    PubMed  CAS  Article  Google Scholar 

  16. 16.

    Siddall ME, Kluge AG: Probabilism and phylogenetic inference. Cladistics. 1997, 13: 313-336. 10.1111/j.1096-0031.1997.tb00322.x.

    Article  Google Scholar 

  17. 17.

    Kreitman M: Methods to detect selection in populations with applications to the human. Annual Reviews of Genomics and Human Genetics. 2000, 1: 539-559. 10.1146/annurev.genom.1.1.539.

    CAS  Article  Google Scholar 

  18. 18.

    Duret L: tRNA gene number and codon usage in the C. elegans genome are co- adapted for optimal translation of highly expressed genes. Trends in Genetics. 2000, 16: 287-289. 10.1016/S0168-9525(00)02041-2.

    PubMed  CAS  Article  Google Scholar 

  19. 19.

    Anisimova M, Bielawski JP, Yang Z: Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Molecular Biology and Evolution. 2001, 18: 1585-1592.

    PubMed  CAS  Article  Google Scholar 

  20. 20.

    Clark AG, Kao TH: Excess nonsynonymous substitution of shared polymorphic sites among self- incompatibility alleles of Solanaceae. Procedings of the National Academy of Sciences, USA. 1991, 88: 9823-9827. 10.1073/pnas.88.21.9823.

    CAS  Article  Google Scholar 

  21. 21.

    Endo T, Ikeo K, Gojobori T: Large- scale search for genes on which positive selection may operate. Molecular Biology and Evolution. 1996, 13: 685-690.

    PubMed  CAS  Article  Google Scholar 

  22. 22.

    Fares MA, Elena SF, Ortiz J, Moya A, Barrio E: A sliding window -based method to detect selective constraints in protein- coding genes and its application to RNA viruses. Journal of Molecular Evolution. 2002, 55: 509-521. 10.1007/s00239-002-2346-9.

    PubMed  CAS  Article  Google Scholar 

  23. 23.

    Suzuki Y: Three- dimensional window analysis for detecting positive selection at structural regions of proteins. Molecular Biology and Evolution. 2004, 21: 2352-2359. 10.1093/molbev/msh249.

    PubMed  CAS  Article  Google Scholar 

  24. 24.

    Berglund AC, Wallner B, Elofsson A, Liberles DA: Tertiary windowing to detect positive diversifying selection. Journal of Molecular Evolution. 2005, 60: 499-504. 10.1007/s00239-004-0223-4.

    PubMed  CAS  Article  Google Scholar 

  25. 25.

    Kong H, Leebens-Mack J, Ni W, dePamphilis CW, Ma H: Highly heterogeneous rates of evolution in the SKP1 gene family in plants and animals: functional and evolutionary implications. Molecular Biology and Evolution. 2004, 21: 117-128. 10.1093/molbev/msh001.

    PubMed  CAS  Article  Google Scholar 

  26. 26.

    Clauss MJ, Mitchell-Olds T: Population genetics of tandem trypsin inhibitor genes in Arabidopsis species with contrasting ecology and life history. Molecular Ecology. 2003, 12: 1287-1299. 10.1046/j.1365-294X.2003.01832.x.

    PubMed  CAS  Article  Google Scholar 

  27. 27.

    Clauss MJ, Mitchell-Olds T: Functional divergence in tandemly duplicated Arabidopsis thaliana trypsin inhibitor genes. Genetics. 2004, 166: 1419-1436. 10.1534/genetics.166.3.1419.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  28. 28.

    Lu Y, Rausher MD: Evolutionary rate variation in anthocyanin pathway genes. Molecular Biology and Evolution. 2003, 20: 1844-1853. 10.1093/molbev/msg197.

    PubMed  CAS  Article  Google Scholar 

  29. 29.

    Yang J, Gu H, Yang Z: Likelihood analysis of the chalcone synthase genes suggests the role of positive selection in morning glories (Ipomoea). Journal of Molecular Evolution. 2004, 58: 54-63. 10.1007/s00239-003-2525-3.

    PubMed  CAS  Article  Google Scholar 

  30. 30.

    Tiffin P: Comparative evolutionary histories of chitinase genes in the Genus zea and Family poaceae. Genetics. 2004, 167: 1331-1340. 10.1534/genetics.104.026856.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  31. 31.

    Jobson RW, Nielsen R, Laakkonen L, Wikstrom M, Albert VA: Adaptive evolution of cytochrome c oxidase: Infrastructure for a carnivorous plant radiation. Procedings of the National Academy of Sciences, USA. 2004, 101: 18064-18068. 10.1073/pnas.0408092101.

    CAS  Article  Google Scholar 

  32. 32.

    Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.

    Article  Google Scholar 

  33. 33.

    Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, Hadley D, Hutchison D, Martin C, Katagiri F, Lange BM, Moughamer T, Xia Y, Budworth P, Zhong J, Miguel T, Paszkowski U, Zhang S, Colbert M, Sun WL, Chen L, Cooper B, Park S, Wood TC, Mao L, Quail P, Wing R, Dean R, Yu Y, Zharkikh A, Shen R, Sahasrabudhe S, Thomas A, Cannings R, Gutin A, Pruss D, Reid J, Tavtigian S, Mitchell J, Eldredge G, Scholl T, Miller RM, Bhatnagar S, Adey N, Rubano T, Tusneem N, Robinson R, Feldhaus J, Macalma T, Oliphant A, Briggs S: A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science. 2002, 296: 92-100. 10.1126/science.1068275.

    PubMed  CAS  Article  Google Scholar 

  34. 34.

    Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, Cao M, Liu J, Sun J, Tang J, Chen Y, Huang X, Lin W, Ye C, Tong W, Cong L, Geng J, Han Y, Li L, Li W, Hu G, Huang X, Li W, Li J, Liu Z, Li L, Liu J, Qi Q, Liu J, Li L, Li T, Wang X, Lu H, Wu T, Zhu M, Ni P, Han H, Dong W, Ren X, Feng X, Cui P, Li X, Wang H, Xu X, Zhai W, Xu Z, Zhang J, He S, Zhang J, Xu J, Zhang K, Zheng X, Dong J, Zeng W, Tao L, Ye J, Tan J, Ren X, Chen X, He J, Liu D, Tian W, Tian C, Xia H, Bao Q, Li G, Gao H, Cao T, Wang J, Zhao W, Li P, Chen W, Wang X, Zhang Y, Hu J, Wang J, Liu S, Yang J, Zhang G, Xiong Y, Li Z, Mao L, Zhou C, Zhu Z, Chen R, Hao B, Zheng W, Chen S, Guo W, Li G, Liu S, Tao M, Wang J, Zhu L, Yuan L, Yang H: A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science. 2002, 296: 79-92. 10.1126/science.1068037.

    PubMed  CAS  Article  Google Scholar 

  35. 35.

    Mitchell-Olds T, Clauss MJ: Plant evolutionary genomics. Current Opinion in Plant Biology. 2002, 5: 74-79. 10.1016/S1369-5266(01)00231-X.

    PubMed  CAS  Article  Google Scholar 

  36. 36.

    Wright SI, Gaut BS: Molecular population genetics and the search for adaptive evolution in plants. Molecular Biology and Evolution. 2005, 22: 506-519. 10.1093/molbev/msi035.

    PubMed  CAS  Article  Google Scholar 

  37. 37.

    Barrier M, Bustamante CD, Yu J, Purugganan MD: Selection on rapidly evolving proteins in the Arabidopsis genome. Genetics. 2003, 163: 723-733.

    PubMed  CAS  PubMed Central  Google Scholar 

  38. 38.

    Roth C, Betts MJ, Steffansson P, Saelensminde G, Liberles DA: The Adaptive Evolution Database (TAED): a phylogeny based tool for comparative genomics. Nucleic Acids Research. 2005, 33: D495-D497. 10.1093/nar/gki090.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  39. 39.

    Mishima M, Takayama S, Sasaki K, Jee JG, Kojima C, Isogai A, Shirakawa M: Structure of the male determinant factor for Brassica self- incompatibility. Journal of Biological Chemistry. 2003, 278: 36389-36395. 10.1074/jbc.M305305200.

    PubMed  CAS  Article  Google Scholar 

  40. 40.

    Entani T, Iwano M, Shiba H, Che FS, Isogai A, Takayama S: Comparative analysis of the self- incompatibility (S-) locusregion of Prunus mume: identification of a pollen- expressed F-box gene with allelic diversity. Genes to Cells. 2003, 8: 203-213. 10.1046/j.1365-2443.2003.00626.x.

    PubMed  CAS  Article  Google Scholar 

  41. 41.

    Takezawa D: A rapid induction by elicitors of the mRNA encoding CCD- 1, a 14 kDa Ca2+ – binding protein in wheat cultured cells. Plant Molecular Biology. 2000, 42: 807-817. 10.1023/A:1006431724090.

    PubMed  CAS  Article  Google Scholar 

  42. 42.

    White CN, Rivin CJ: Sequence and regulation of a late embryogenesis abundant group 3 protein of maize. Plant Physiology. 1995, 108: 1337-1338. 10.1104/pp.108.3.1337.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  43. 43.

    Monroy AF, Castonguay Y, Laberge S, Sarhan F, Vezina LP, Dhindsa RS: A new cold- induced alfalfa gene is associated with enhanced hardening at subzero temperature. Plant Physiology. 1993, 102: 873-879. 10.1104/pp.102.3.873.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  44. 44.

    Davies C, Robinson SP: Differential screening indicates a dramatic change in mRNA profiles during grape berry ripening. Cloning and characterization of cDNAs encoding putative cell wall and stress response proteins. Plant Physiology. 2000, 122: 803-812. 10.1104/pp.122.3.803.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  45. 45.

    Buttle DJ, Ritonja A, Pearl LH, Turk V, Barrett AJ: Selective cleavage of glycyl bonds by papaya proteinase IV. FEBS Letters. 1990, 260: 195-197. 10.1016/0014-5793(90)80101-N.

    PubMed  CAS  Article  Google Scholar 

  46. 46.

    Quail PH: Phytochrome photosensory signalling networks. Nature Reviews Molecular and Cellular Biology. 2002, 3: 85-93. 10.1038/nrm728.

    CAS  Article  Google Scholar 

  47. 47.

    Schneider H, Schuettpelz E, Pryer KM, Cranfill R, Magallón S, Lupia R: Ferns diversified in the shadow of angiosperms. Nature. 2004, 428: 553-557. 10.1038/nature02361.

    PubMed  CAS  Article  Google Scholar 

  48. 48.

    Van Damme EJ, Barre A, Rouge P, Peumans WJ: Cytoplasmic/nuclear plant lectins: a new story. Trends in Plant Science. 2004, 9: 484-489. 10.1016/j.tplants.2004.08.003.

    PubMed  CAS  Article  Google Scholar 

  49. 49.

    Golding GB, Dean AM: The structural basis of molecular adaptation. Molecular Biology and Evolution. 1998, 15: 355-369.

    PubMed  CAS  Article  Google Scholar 

  50. 50.

    Pollock DD, Taylor WR, Goldman N: Coevolving protein residues: Maximum likelihood identification and relationship to structure. Journal of Molecular Biology. 1999, 287: 187-198. 10.1006/jmbi.1998.2601.

    PubMed  CAS  Article  Google Scholar 

  51. 51.

    Lin LJ: Examination of functional differentiation during protein evolution. In M.Sc. Thesis Skövde University College (Sweden);2002.

    Google Scholar 

  52. 52.

    Koshi JM, Goldstein RA: Probabilistic reconstruction of ancestral protein sequences. Journal of Molecular Evolution. 1996, 42: 313-320. 10.1007/BF02198858.

    PubMed  CAS  Article  Google Scholar 

  53. 53.

    Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Research. 2005, 33: D34-38. 10.1093/nar/gki063.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  54. 54.

    Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: BLAST and PSI- BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  55. 55.

    Gonnet GH, Hallett MT, Korostensky C, Bernardin L: Darwin v. 2.0: an interpreted computer language for the biosciences. Bioinformatics. 2000, 16: 101-103. 10.1093/bioinformatics/16.2.101.

    PubMed  CAS  Article  Google Scholar 

  56. 56.

    Lee C, Grasso C, Sharlow MF: Multiple sequence alignment using partial order graphs. Bioinformatics. 2002, 18: 452-64. 10.1093/bioinformatics/18.3.452.

    PubMed  CAS  Article  Google Scholar 

  57. 57.

    Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003, 19: 1572-1574. 10.1093/bioinformatics/btg180.

    PubMed  CAS  Article  Google Scholar 

  58. 58.

    Berglund-Sonnhammer AC, Steffansson P, Betts MJ, Liberles DA: Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. Journal of Molecular Evolution. 2006, 63: 240-250. 10.1007/s00239-005-0096-1.

    PubMed  CAS  Article  Google Scholar 

  59. 59.

    Li WH, Wu CI, Luo CC: A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Molecular Biology and Evolution. 1985, 2: 150-174.

    PubMed  Google Scholar 

  60. 60.

    Liberles DA: Evaluation of methods for determination of a reconstructed history of gene sequence evolution. Molecular Biology and Evolution. 2001, 18: 2040-2047.

    PubMed  CAS  Article  Google Scholar 

  61. 61.

    Liberles DA, Schreiber DR, Govindarajan S, Chamberlin SG, Benner SA: The adaptive evolution database (TAED). Genome Biology. 2001, 2: RESEARCH0028.

    PubMed  CAS  PubMed Central  Google Scholar 

  62. 62.

    Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Computational Applications in the Biological Sciences. 1997, 13: 555-556.

    CAS  Google Scholar 

  63. 63.

    Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers. 1983, 22: 2577-2637. 10.1002/bip.360221211.

    PubMed  CAS  Article  Google Scholar 

  64. 64.

    Gene Ontology Consortium: The gene ontology (GO) project in 2006. Nucleic Acids Research. 2006, 34: D322-D326. 10.1093/nar/gkj021.

    PubMed Central  Article  Google Scholar 

Download references


We want to thank Björn Wallner for his help with the contact map code, and Ann- Charlotte Berglund for fruitful discussions about statistics and modeling. Discussions with Maria Anisimova and Rasmus Nielsen were helpful in framing our view of evolution and we thank Maria for careful reading of a paper draft. The work has been funded by FUGE, the functional genomics platform of the Norwegian research council, the Swedish Foundation for Strategic Research and a grant to C.R. from Schweizerischer Nationalfonds, the Swiss national science foundation.

Author information



Corresponding author

Correspondence to David A Liberles.

Additional information

Authors' contributions

Both CR and DAL contributed in significant ways to the design and execution of this study as well as the writing of this paper.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Roth, C., Liberles, D.A. A systematic search for positive selection in higher plants (Embryophytes). BMC Plant Biol 6, 12 (2006).

Download citation


  • Chitinase
  • Fitness Landscape
  • Gene Family Tree
  • Strong Negative Selection
  • Positive Selective Pressure