Conservation and diversity of gene families explored using the CODEHOP strategy in higher plants

Background: The availability of genome-wide information on an increasing but still limited number of plants offers the possibility of identifying orthologues, or related genes, in species with major economic impact and complex genomes. In this paper we exploit the recently described CODEHOP primer design and PCR strategy for targeted isolation of homologues in large gene families.
Results: The method was tested with two different objectives. The first was to analyze the evolution of the CYP98 family of cytochrome P450 genes, involved in 3-hydroxylation of phenolic compounds and in lignification, in a broad range of plant species. The second was to isolate an orthologue of the sorghum glucosyl transferase UGT85B1 and to determine the complexity of the UGT85 family in wheat. P450s of the CYP98 family or closely related sequences were found in all vascular plants. No related sequence was found in moss. Neither extensive duplication of the CYP98 genes nor an orthologue of UGT85B1 was found in wheat. The UGT85A subfamily was however found to be highly variable in wheat.
Conclusions: Our data are in agreement with the implication of CYP98s in lignification and with the evolution of 3-hydroxylation of lignin precursors together with vascular plants. High conservation of the CYP98 family strongly argues in favour of an essential function in plant development. Conversely, high duplication and diversification of the UGT85A gene family in wheat suggests its involvement in adaptive responses and provides a valuable pool of genes for biotechnological applications. This work demonstrates the high potential of the CODEHOP strategy for the exploration of large gene families in plants.


Background
Plants have evolved extremely diversified gene families as tools to cope with a harsh environment. Some of these families, such as the cytochromes P450 and the UDP-glycosyltransferases (UGTs), reflect the extraordinary biochemical versatility found within and across plant species, and represent a very valuable source of genes for biotechnology. Both gene families offer a huge potential for bioremediation and for the control of crop and weed pesticide tolerance [1][2][3], but also for industrial applications. P450s, considered the most versatile catalysts known [4], usually activate dioxygen and transfer one of its atoms into various substrates, but also catalyze a great diversity of other reactions, including C-C and C=N bond cleavage, phenolic coupling, dehydration, dehydrogenation, isomerization and reduction [5]. Many of these reactions are important for the biosynthesis of hormones, drugs, pigments, aromas, biopolymer building blocks and defense molecules [6,7]. Glycosyltransferases are also essential for the production of natural compounds since they control their solubility, stability, transport, storage and sometimes also their bioactivity [8,9]. Although some of this potential is becoming directly accessible through genome-wide sequencing, extensive information is restricted to model plants, usually with a small genome, or to plants of major economic interest. Exploiting this knowledge to target genes in other plants that need to be studied or engineered, or to explore gene families in plants with specific biosynthetic capacities, is an objective for the next several years.
With the growing availability of gene sequences and of information regarding their diversity and phylogeny, increasingly sophisticated PCR techniques have been developed to target gene families. Plant P450s are low-abundance, membrane-bound and unstable proteins that are usually difficult to purify. For this reason, several groups attempted early on to isolate P450 genes on the basis of the most conserved consensus regions, after generating probes by conventional PCR at low stringency [10][11][12][13]. This approach was later refined and used by several other groups for isolation of P450 genes in various plant species [e.g. [14][15][16][17][18]]. It proved successful in many cases, although it led only to a small number of highly expressed and related P450 families. A significant step forward resulted from coupling degenerate PCR using a heme-binding primer with differential display of the amplified fragments, an approach that allowed effective identification of nine P450 genes responsive to elicitor treatment of soybean cell cultures [19] and of 21 unique P450 genes in Taxus cells induced for taxol production [20]. A carefully controlled and strongly differential system is however needed for such an approach. Another interesting improvement, recently reported, involves the use of nested primers to increase PCR selectivity [21]. However, the major limitation of all the strategies reported so far is that they did not take into account the huge diversity and low conservation of P450s recently revealed by genome sequencing in higher plants. They allowed neither focused gene selection nor isolation of the most divergent P450 clades, and thus no systematic exploration of the P450 superfamily in highly divergent species.
In this paper we report on the high potential of the recently described COnsensus-DEgenerate Hybrid Oligonucleotide Primers (CODEHOP) strategy [22] of primer design, ensuring optimal match and PCR amplification focused on very short conserved sequences, for the isolation of orthologues in evolutionarily distant species and for the focused or systematic exploration of gene families in plants with a very large genome. The method was tested to analyze both the duplication and conservation of the CYP98 family of P450 genes in many plant species. This family was recently suggested to play an essential role in lignification and plant development [23][24][25]. The same approach was also used for the analysis of the UGT85 family in wheat.

Chasing the CYP98 genes in wheat
The CYP98 family of cytochrome P450 genes encodes the 3'-hydroxylases of coumaroyl esters, which catalyze an essential step in the synthesis of lignin monomers and chlorogenic acid [23][24][25]. CYP98 activity is also needed for the biosynthesis of many phenolic flavouring compounds such as eugenol, safrole or vanillin. Engineering the expression of this family of P450s has important agro-industrial applications, including enhancement of plant defense and modification of lignin composition to improve forage digestibility and wood pulping [26,27]. Access to CYP98 genes from major crop and forage plants and from the most common woody species is a necessary step towards modifying their expression. Although some partial sequences have been made available by EST sequencing of a limited number of species, they do not reflect the whole range of gene isoforms expressed in a plant or plant tissue, especially for plants with a large genome.
Our first aim was thus to test whether several genes of the CYP98 family could be detected among those expressed in seedlings of wheat, a major crop plant with a very large genome. When this work was initiated, few CYP98 sequences were available, some of them only partial sequences arising from ESTs. To optimize primer selection, with a bias in favour of monocot genes, available ESTs from rice and maize were aligned with the full-length sequences from sorghum [17], soybean [28] and Arabidopsis thaliana. In the latter case, A. thaliana genome sequencing had revealed three genes. Two of them (CYP98A8 and CYP98A9, function unknown) were closely related and clearly divergent from the third, CYP98A3, recently shown to encode a coumaroyl ester 3'-hydroxylase. CYP98A3 seemed to be the orthologue of the sequences isolated from sorghum, rice, maize and soybean. Primers were designed using the CODEHOP strategy. Three sense primers (P98a, P98d and P98c) and one reverse primer (P98cr) were selected (Figure 1) so as to avoid the strong consensus regions common to other P450 families, such as the highly conserved PERF motif. To further ensure high primer selectivity, touch-down PCR for gene selection was conducted starting with a high (70°C) annealing temperature.
BLASTp analysis, performed against a local plant P450 library with the consensus protein sequence corresponding to each primer, indicated that the sense primers were more specific than the reverse primer and were thus likely to control amplification selectivity (Table 1). As predicted by the BLAST analysis, the P98c/P98cr pair was the most specific, and led to the amplification of two different but closely related CYP98 fragments from wheat seedling cDNA libraries. A single band of the expected size was obtained on an agarose gel; it was eluted and subcloned into a 3'-T overhang vector. Out of 19 sequenced clones, 11 corresponded to CYP98A11, 7 to CYP98A10, and one to a non-CYP sequence (Figure 2). Also as predicted, the P98d/P98cr pair was the second most effective. In addition to CYP98A10 and CYP98A11, it amplified a clearly divergent CYP98 gene, CYP98A12. Out of 14 amplified fragments sequenced, 13 coded for CYP98s. The P98a primer is predicted to match a larger number of P450 families (Table 1). Used together with P98cr, it amplified more non-CYP and non-CYP98 CYP sequences than CYP98s (CYP98A11 and CYP98A10). Two of the CYP sequences, however, were closely related to CYP98s and CYP76s. Initiating the touch-down PCR at a lower temperature did not reveal additional CYP98 sequences among the amplified fragments.
The CODEHOP strategy thus appears well suited to exploring gene families in plants with a large genome, from very focused to broader searches, depending on primer selectivity. The selectivity of each primer set can be predicted by a BLAST analysis. In young wheat seedlings, focused CODEHOP screening allowed the detection of three clearly different CYP98 genes. All three genes are apparently related to A. thaliana CYP98A3.
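The selectivity figures reported above for the two most specific primer pairs can be checked with a few lines of arithmetic. The sketch below only restates the clone counts given in the text (19 clones sequenced for P98c/P98cr, of which 11 + 7 were CYP98s; 14 fragments sequenced for P98d/P98cr, of which 13 were CYP98s). The dictionary and function names are illustrative and not part of the original work.

```python
# Clone counts taken from the text: (clones sequenced, clones coding for a CYP98).
CLONE_COUNTS = {
    "P98c/P98cr": (19, 11 + 7),  # 11 CYP98A11 + 7 CYP98A10, 1 non-CYP sequence
    "P98d/P98cr": (14, 13),
}


def cyp98_fraction(sequenced: int, cyp98: int) -> float:
    """Fraction of sequenced clones that code for a CYP98."""
    if sequenced <= 0 or not 0 <= cyp98 <= sequenced:
        raise ValueError("counts must satisfy 0 <= cyp98 <= sequenced, sequenced > 0")
    return cyp98 / sequenced


if __name__ == "__main__":
    for pair, (sequenced, cyp98) in CLONE_COUNTS.items():
        share = 100 * cyp98_fraction(sequenced, cyp98)
        print(f"{pair}: {cyp98}/{sequenced} clones are CYP98s ({share:.1f}%)")
```

Both fractions are above 90%, which is the order of magnitude quoted in the Discussion for the best primer pairs.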

Isolation of CYP98 ortho/homologues in other plant species
Figure 1
Location of the P98 primers on the CYP98 alignment used for primer design. Only the region overlapping available monocot ESTs was used for primer selection in order to introduce a bias for monocot sequences. This overlapping region is shown on the full-length alignment (left) shaded in blue. All sequence alignments were performed using the BioEdit program [40].

The second step of this investigation was aimed at testing the possibility of isolating CYP98A3 orthologues in a broad range of distantly related species. The most selective primer pair P98c/P98cr, without modification for codon usage, was first assayed with the same amplification programme or after shifting the initial annealing temperature from 70 to 65°C using various cDNA libraries, prepared from Capsicum annuum fruit (Solanaceae), Ceratopteris richardii (fern), Coleus blumei cell culture (Lamiaceae), Eucalyptus globulus xylem (Myrtaceae), Helianthus annuus stem and leaf (Asteraceae), Lycopersicon esculentum shoots (Solanaceae), Picea abies cell culture (Coniferales), Pinus pinaster stem and root (Coniferales), Populus trichocarpa x deltoides stem, root and leaf (Salicaceae), and Physcomitrella patens protonemal tissue (moss). CYP98-related sequences were amplified from the libraries of C. blumei, E. globulus, H. annuus, P. abies, and Populus. After optimization of primer codon usage, a CYP98-like fragment was also amplified from the C. richardii cDNA library. No amplicon was obtained, however, for P. patens, P. pinaster or the Solanaceae, either after optimizing codon usage or after further decreasing the initial temperature of the touch-down PCR.
Out of these ten representative species in a broad range of vascular plants, including fern, conifers, monocots and dicots, the CODEHOP strategy thus provided CYP98-related DNA sequences in seven cases (Figure 3). Due to the short size of the amplicons, it is not possible to unambiguously assign them all to the CYP98 family. Such an assignment would require full-length sequences, if not the catalytic activity of the encoded proteins. To date, these have been obtained only in the case of wheat, confirming gene identity and function (M. Morant, personal communication). Homology analysis of the amplified fragments is however consistent with the evolutionary history of vascular plants (Figure 4). Combined with a BLAST analysis, it suggests that the P. abies and C. richardii amplified fragments could either be representative of ancestral forms of CYP98s or be derived from a related A-type P450 family, possibly CYP81. No CYP98-related sequence was detected in the moss P. patens, which would be in agreement with the evolution of CYP98s together with vascular plants. CYP98 and related sequences are so far also absent from P. patens ESTs, which do include a CYP73 expected to code for cinnamate 4-hydroxylase. Evolution of CYP73, involved in an upstream step of the phenylpropanoid pathway and in the biosynthesis of flavonoids, is supposed to have preceded that of lignification [29].

Limits of the method
In a few cases, no amplification was obtained even after decreasing the initial annealing temperature of the touch-down PCR and adapting codon usage. This occurred with libraries where CYP98 cDNAs were expected to be present, for example in a library of pine stem and root, and in libraries from bell pepper fruit or young tomato plants. In pine stem, CYP98 should be expressed at a high level for the synthesis of lignin monomers, while Solanaceae are described as accumulating large concentrations of hydroxycinnamic esters such as chlorogenic acid. The P98c and P98cr primers, initially designed with codon usage for wheat gene amplification, were compared to those designed with optimal codon usage for the other plants (Figure 5). P98c differed in 2 positions out of 16 in the clamp segment from the primer optimized for pine, and in 4 positions out of 16 from the primer predicted as optimal for Solanaceae. P98cr also differed in 4 positions out of 11 in the clamp for both pine and Solanaceae. Differences in codon usage thus seemed to provide an explanation for the failure of our first experiments. New primers using adapted codons were therefore tested, but did not yield any amplified fragment under any of the PCR conditions tested.
To find an explanation, EST and cDNA sequences now available for pine and for tomato, another member of the Solanaceae, were examined and aligned with our primers and consensus sequences. Comparison with the authentic sequences revealed significant divergence both from the most frequent codon usage and from the selected consensus sequence. The assumption that the clamp segment of the primers has a very minor impact on the amplification [22] should thus probably be reconsidered. The dissimilarity observed in this example is, however, likely to be local, so successful amplification should be obtained by using multiple primers distributed along the full sequence.
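The clamp comparison described above amounts to counting the positions at which two primers of equal length differ. A minimal sketch of that count follows. The two 16-nucleotide sequences are placeholders invented for illustration and are not the actual P98c clamps, which are shown in Figure 5; the function name is likewise an assumption.

```python
from typing import List


def clamp_mismatches(primer_a: str, primer_b: str) -> List[int]:
    """Return the 0-based positions at which two equal-length primers differ."""
    if len(primer_a) != len(primer_b):
        raise ValueError("primers must have the same length to be compared")
    return [i for i, (a, b) in enumerate(zip(primer_a.upper(), primer_b.upper()))
            if a != b]


if __name__ == "__main__":
    # Placeholder 16-nt clamp segments (hypothetical, for illustration only).
    clamp_for_species_1 = "GGCTACGACATCCCCA"
    clamp_for_species_2 = "GGTTACGATATCCCCA"
    positions = clamp_mismatches(clamp_for_species_1, clamp_for_species_2)
    print(f"{len(positions)} mismatches out of {len(clamp_for_species_1)} "
          f"at positions {positions}")
```

Applied to a primer and the authentic target sequence rather than to two designed primers, the same count would show how far the real gene departs from the predicted codon usage.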

Exploration of another gene family in wheat: UGT85
Glycosyl transferases (type 1) form another large gene family with important applications in agrochemistry. We therefore tested the efficiency of the CODEHOP strategy for exploring the diversity of glycosyl transferases in hexaploid wheat. In this case, we decided to focus on UGTs related to the UDP-glucose:p-hydroxymandelonitrile-O-glucosyl transferase (sbHMNGT or UGT85B1), which catalyzes the last step in the synthesis of cyanogenic glucosides and was recently isolated from S. bicolor [30]. When this work was initiated, a BLASTp search revealed only 2 sequences in the genome of A. thaliana, UGT85A2 and UGT85A3, that were significantly related to sbHMNGT (43% identity). CODEHOP primer design, based on the alignment of the 3 full-length sequences, provided five sense and five reverse primers likely to be selective for this group of UGTs (Table 2). PCR was conducted with all combinations of primers, starting touch-down PCR at 70°C. When a primer pair amplified products of the expected size, these were subcloned and analyzed. That primer pair was then set aside, and the library was screened with the remaining primer combinations after decreasing the initial temperature of the touch-down PCR by 5°C. Successive temperature decreases led to successful amplification with 8 primer pairs. Glucosyl transferases were recently shown to be induced by the herbicide safener cloquintocet-mexyl in wheat [31]. In agreement with this report, a stronger amplification was usually obtained using a cDNA library prepared from safener-treated seedlings as a template (Figure 6). Analysis of 63 subclones led to the isolation of 18 distinct UGT sequences resulting from specific amplifications. Their sequences can be obtained from GenBank under the accessions AJ438327, AJ438326, AJ438330, AJ438331, AJ438332, AJ438316, AJ438315, AJ438318, AJ438320, AJ438317, AJ438319, AJ438333, AJ438335, AJ438337, AJ438334, AJ438338, AJ438329, and AJ438328.
Not all amplified fragments overlap (Figure 7). It is thus not possible to determine whether they correspond to more than 12 different wheat UGT genes or to allelic variants. Their alignment and comparison with representative members of the different UGT families indicate that they are all phylogenetically related and derived from the same ancestor as the UGT85A subfamily from A. thaliana and sbHMNGT. None of the fragments corresponded to an obvious orthologue of sbHMNGT.
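The stepwise screen used for the UGT primers (all sense/reverse combinations tested at 70°C, successful pairs set aside, the remainder retried 5°C lower) can be summarized as a simple loop. The sketch below is only an illustration of that logic. The `amplifies` callback stands in for the wet-lab result, the lower temperature limit `min_temp` is an assumption that the text does not state, and the primer names and toy thresholds in the usage example are invented.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (sense primer, reverse primer)


def stepwise_screen(sense: Iterable[str],
                    reverse: Iterable[str],
                    amplifies: Callable[[Pair, float], bool],
                    start_temp: float = 70.0,
                    step: float = 5.0,
                    min_temp: float = 50.0) -> Dict[Pair, float]:
    """Return, for each successful pair, the highest starting temperature that worked."""
    remaining = list(product(sense, reverse))  # all primer combinations
    hits: Dict[Pair, float] = {}
    temp = start_temp
    while remaining and temp >= min_temp:
        still_failing = []
        for pair in remaining:
            if amplifies(pair, temp):
                hits[pair] = temp           # subclone and analyze, then set aside
            else:
                still_failing.append(pair)  # retry at a lower temperature
        remaining = still_failing
        temp -= step
    return hits


if __name__ == "__main__":
    # Toy data: highest starting temperature at which each pair yields a product
    # (None = never amplifies). Names and values are hypothetical.
    toy_limits = {("s1", "r1"): 70.0, ("s2", "r1"): 60.0,
                  ("s1", "r2"): None, ("s2", "r2"): 65.0}

    def toy_amplifies(pair: Pair, temp: float) -> bool:
        limit = toy_limits[pair]
        return limit is not None and temp <= limit

    print(stepwise_screen(["s1", "s2"], ["r1", "r2"], toy_amplifies))
```

Recording the temperature at which each pair first succeeds keeps track of how stringent the conditions were for every set of subclones.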

Discussion
In this paper we investigated the potential of the CODEHOP strategy for targeted isolation of genes from organisms not yet subjected to extensive sequencing, and for exploration of the complexity of selected gene families in large plant genomes. Special emphasis was given to wheat as a representative of plants with a complex genome. The CODEHOP method proved extremely useful for the characterization of gene orthologues or homologues in a broad range of plant species. The focus of the gene search can be controlled by changing the degree of specificity of the primers (easily checked by a BLAST analysis) and by the choice of touch-down PCR temperature. In our hands, the method was successful where cDNA library screening with heterologous probes (e.g. cDNAs from maize for screening wheat libraries) had failed. The main advantage of this method is that it rapidly provides a representative sample either of the allelic variants and recently evolved paralogues, or of the homologues of a given gene in complex genomes. Compared to other methods, a very high proportion of useful sequences (more than 90% with some primer pairs) is obtained. Besides the targeted search for specific expressed sequences, it allows complete exploration of the different P450 clades in a single organism, which was not possible using previously described methods. The main source of failure seems to be local divergence of the sequence from the consensus, or of the codons from the most frequent usage. This problem should be easily circumvented by using several primer pairs chosen from different regions of the gene. Using the CODEHOP strategy on subgroups of phylogenetically related genes, families or clades within large superfamilies such as UGTs or P450s is a powerful approach for exploring their complexity in various genomes. It is a very effective tool for the construction of expression libraries for agrochemical and other industrial applications. It can also be used for identifying genes from a given subgroup expressed at a specific stage of development.
Some plant P450 families result from extensive duplications, some of them forming clusters of up to 13 genes in Arabidopsis [7]. Some P450 families or subfamilies are found in only subsets of plant species. An extensive search for CYP98 genes expressed in wheat seedlings did not reveal more than 3 different sequences, all related to A. thaliana CYP98A3. The bread wheat genome is hexaploid and results from successive hybridizations and rearrangements [32]. This suggests that the three genes CYP98A10, CYP98A11 and CYP98A12 may each correspond to one of the three wheat genomes, with CYP98A12, the most divergent, possibly resulting from the most recent genome introduction. This hypothesis is currently being investigated. In agreement with the low number of CYP98s found expressed in wheat, no extensive duplication of CYP98 was detected in other plant species. This observation, together with the high conservation of the CYP98A genes across evolution, argues for essential functions of the CYP98 genes in higher plant development. Accordingly, strongly impaired growth and fertility are observed in cyp98A3 mutants [24; S. Goepfert, personal communication]. Similar conservation is observed for other P450 genes participating in the early phenylpropanoid pathway and in hormone homeostasis.
No obvious homologue of the sbHMNGT gene was found in wheat. This may be connected to the fact that large amounts of dhurrin are not reported to accumulate in wheat, since sbHMNGT is described as showing a strong preference for mandelonitrile substrates [30]. A large number of related genes, all belonging to the UGT85A subfamily, were however detected. This is not surprising considering that 6 UGT85A genes have been reported in the small Arabidopsis genome [33]. Such a duplication of genes in this family probably reflects an adaptive evolution and their implication in some type of stress/defense response rather than a developmentally essential function. Significantly, a large number of UGT85A-related sequences are found among ESTs isolated from wheat challenged with pathogens. Recombinant expression of such genes should provide a valuable library for pharmacological and toxicological investigations, and for studying the evolutionary ecology of plant-pathogen interactions.

Figure 6
Result of a decrease in the initial annealing temperature of the touch-down PCR on the amplification of UGT genes with different primer pairs. The location of amplicons of the expected size is indicated by an asterisk. The same primer pairs were tested with two cDNA libraries made from wheat seedlings treated (+) or not (-) with cloquintocet-mexyl and phenobarbital for inducing herbicide metabolism. Some primer pairs, such as PUGTb/PUGTer, did not provide any amplified fragment, even at low annealing temperatures. Some (e.g. PUGTe/PUGTer) were effective at a stringent annealing temperature.

Conclusions
The CODEHOP strategy appears to be a powerful method for exploring the complexity of gene families in plants with a large genome, and the conservation of genes across evolution. CYP98 genes evolved early and have been highly conserved during evolution, as expected for genes with an essential role in the homeostasis and development of vascular plants. Conversely, the UGT85s are more variable. No orthologue of the sorghum UGT85B1 gene was detected in wheat, while UGT85As, found in many plant species, are also present in wheat. The great variability of this subfamily in wheat strongly suggests a role in environmental adaptation and plant defense.

cDNA libraries
Wheat cDNA libraries were constructed from poly(A)+ mRNA from 3-5 mm Triticum aestivum (L. cv. Darius) seedlings, either control or pre-coated with cloquintocet-mexyl (0.1% seed dry weight) and phenobarbital, as described previously [34], in λ ZipLox (GIBCO BRL) by T.

Bioinformatics
CODEHOP primers are designed so as to ensure a very high probability of annealing to the gene of interest, through 11-12 completely degenerate core nucleotides at their 3'-end, and efficient amplification, through a consensus 18-25 nucleotide clamp sequence at their 5'-end. Design of the clamp is based on an alignment of available related sequences and on the codon usage of the target organism. Primers completely degenerate at their 3'-end and ensuring a high probability of annealing to CYP98- or UGT-specific sequences were designed using the CODEHOP strategy [22] based on the multiply-aligned sequences: AF029856, AF022458, AA86449, C74921, D47937, AI881302, AI734373, AAG52369, AAG52373 for the P450 family CYP98, and AAF17077, AAF18537, BAA34687 for UGTs. The multiple alignment was first generated using ClustalW [35], then cut into blocks using the BlockMaker server [36]. Primers were designed using the default parameters of the CODEHOP server [37]. It was assumed that the barley codon usage proposed by the server was close enough to that of wheat to obtain effective primer design.
From the primer solutions proposed by the server, five P450 primers were selected, which both provided the largest PCR fragments and avoided the most conserved consensus regions common to a large number of P450s. In the case of UGTs, a broader choice was offered by the server, thus 5 sense and 5 reverse primers could be selected on the same basis as for P450s.
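To make the notion of a "completely degenerate core" more concrete, the sketch below counts how many distinct nucleotide sequences can encode a short conserved peptide under the standard genetic code. This is an illustration only and not the algorithm of the CODEHOP server: the function names are assumptions, the PERF motif is used simply because it is a familiar P450 motif mentioned in the text, and a primer pool written with nucleotide ambiguity codes can be somewhat larger than this count.

```python
from itertools import product
from math import prod
from typing import List

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons_for(residue: str) -> List[str]:
    """All codons encoding one amino acid (one-letter code)."""
    codons = sorted(c for c, aa in CODON_TABLE.items() if aa == residue.upper())
    if not codons:
        raise ValueError(f"unknown amino acid: {residue!r}")
    return codons


def core_degeneracy(peptide: str) -> int:
    """Number of distinct coding sequences for a short peptide."""
    return prod(len(codons_for(residue)) for residue in peptide)


if __name__ == "__main__":
    motif = "PERF"  # 4 residues = 12 nucleotides, the size of a degenerate core
    per_residue = {residue: len(codons_for(residue)) for residue in motif}
    print(per_residue, "->", core_degeneracy(motif), "coding sequences")
```

Restricting full degeneracy to 3-4 codons keeps this number manageable, while the consensus clamp at the 5'-end adds length without adding further degeneracy.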

Figure 7
Alignment of protein segments deduced from amplified wheat UGT sequences with sbHMNGT (UGT85B1).

Probe amplification and analysis
In the present work, the CODEHOP approach was not used with genomic DNA but with cDNA libraries, our main objective being to identify genes expressed in specific plant tissues. Screening can be performed directly on cDNA libraries, but better results were obtained after preliminary extraction of cDNA from phages. For extraction, an aliquot of the cDNA library was heated for 10 min at 70°C, then extracted with one volume of phenol-chloroform. PCR screening was then performed on 50 ng of this template using 2.6 U of HiFi polymerase (Expand High Fidelity, Roche), 1.5 mM Mg2+, 0.3 mM dNTP and 0.5 µM primers in the polymerase manufacturer's buffer. The PCR program was designed according to the CODEHOP server's tips, and consisted successively of a touch-down (A) and a classical (B) PCR as follows: first a 3 min initial denaturation at 94°C, then (A) 20 cycles of 1 min at 94°C, 2 min at 70°C (-1°C/cycle), and 2 min at 72°C, then (B) 20 cycles of 1 min at 94°C, 2 min at 58°C, and 2 min at 72°C, and finally a 5 min extension. Elongation time was adapted to the largest expected fragment in each screening. When no amplified fragment was obtained with a primer pair, the starting annealing temperature of the touch-down phase was decreased in 5°C steps until a successful amplification was achieved.
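The two-phase program described above can be written out cycle by cycle, which makes the annealing temperature reached at the end of the touch-down phase explicit. The sketch below is a reading of the text under one stated assumption: the first touch-down cycle anneals at the starting temperature itself, and each following cycle anneals 1°C lower. Function and parameter names are illustrative. The initial denaturation and the final extension are left out.

```python
from typing import List, Tuple

Step = Tuple[str, float, int]  # (step name, temperature in °C, duration in s)


def touchdown_program(start_anneal: float = 70.0,
                      touchdown_cycles: int = 20,
                      classical_cycles: int = 20,
                      classical_anneal: float = 58.0) -> List[List[Step]]:
    """Return the cycling steps, one list of (name, °C, seconds) per cycle."""
    cycles: List[List[Step]] = []
    # Phase A (touch-down): annealing drops by 1 °C at each successive cycle.
    # Assumption: the first cycle anneals at start_anneal itself.
    for i in range(touchdown_cycles):
        cycles.append([("denature", 94.0, 60),
                       ("anneal", start_anneal - i, 120),
                       ("extend", 72.0, 120)])
    # Phase B (classical): fixed annealing temperature.
    for _ in range(classical_cycles):
        cycles.append([("denature", 94.0, 60),
                       ("anneal", classical_anneal, 120),
                       ("extend", 72.0, 120)])
    return cycles


def annealing_temperatures(program: List[List[Step]]) -> List[float]:
    """Extract the annealing temperature of every cycle, in order."""
    return [temp for cycle in program for name, temp, _ in cycle if name == "anneal"]


if __name__ == "__main__":
    for start in (70.0, 65.0):  # 65 °C is the first 5 °C step-down
        temps = annealing_temperatures(touchdown_program(start_anneal=start))
        print(f"start {start}: touch-down {temps[0]} to {temps[19]}, "
              f"then {temps[20]} for {len(temps) - 20} cycles")
```

Under this reading, a program started at 70°C finishes its touch-down phase below the 58°C used in the classical phase, so lowering the starting temperature mainly changes how stringent the first cycles are.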

Analysis and cloning of PCR products
PCR products were analyzed on 1% agarose gels, and fragments of the expected size were eluted by centrifugation on Ultrafree-DA columns (Millipore), precipitated and cloned into the pGEM-T vector (Promega). E. coli XL1-Blue (Stratagene) was electroporated with 1/10 of the ligation volume. Inserts from five white colonies were sequenced.
Sequences were analyzed by BLASTx against a local database generated from the data found in [33,38]. Non-P450 sequences were analyzed by BLASTx on the NCBI site. The CYP98A10, CYP98A11 and CYP98A12 names were assigned on the basis of the full-length coding sequences subsequently isolated from the cDNA library.