The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana.

BACKGROUND
Most genes in Arabidopsis thaliana are members of gene families. How do the members of gene families arise, and how are gene family copy numbers maintained? Some gene families may evolve primarily through tandem duplication and high rates of birth and death in clusters, and others through infrequent polyploidy or large-scale segmental duplications and subsequent losses.


RESULTS
Our approach to understanding the mechanisms of gene family evolution was to construct phylogenies for 50 large gene families in Arabidopsis thaliana, identify large internal segmental duplications in Arabidopsis, map gene duplications onto the segmental duplications, and use this information to identify which nodes in each phylogeny arose due to segmental or tandem duplication. Examples of six gene families exemplifying characteristic modes are described. Distributions of gene family sizes and patterns of duplication by genomic distance are also described in order to characterize patterns of local duplication and copy number for large gene families. Both gene family size and duplication by distance closely follow power-law distributions.


CONCLUSIONS
Combining information about genomic segmental duplications, gene family phylogenies, and gene positions provides a method to evaluate contributions of tandem duplication and segmental genome duplication in the generation and maintenance of gene families. These differences appear to correspond meaningfully to differences in functional roles of the members of the gene families.


Background
Most genes in Arabidopsis thaliana are members of gene families. Similarity searches between all predicted proteins show that 65 -85% of all Arabidopsis genes are homologous to at least one other gene in the genome, depending on similarity thresholds ( [1] and analysis in this paper). There is a wide range in gene family sizes, with more than 400 receptor kinase genes [2,3], ~270 -285 cytochrome P450 genes [1,4,5], and many small families or unique genes. The dramatic variation we observe in gene family size and distribution may be affected by many processes, including tandem duplication with high rates of birth and death and gene duplication resulting from larger scale genome events such as polyploidy or duplications of large chromosomal regions (referred to in this paper as "segmental duplications"). We provide a quantitative characterization of the gene duplication patterns evident in the evolution of 50 large gene families in A. thaliana.
The complete sequencing of the A. thaliana genome revealed numerous large-scale segmental duplications [1,[6][7][8][9][10]. Several studies have concluded that at least two rounds of duplications have probably occurred in the A. thaliana genome, with many losses and rearrangements leaving a mosaic of "segmental duplications" or "duplication blocks" [7,[10][11][12][13][14]. Most duplication blocks appear to have come from one round of polyploidy, estimated by various methods to have occurred 20 -40 Mya, before the evolution of the genus Brassica but after the separation of Brassicaceae from other close eudicot families [7,[10][11][12]. The portion of the genome that exists in duplicate regions serves as a baseline for evaluating whether genes in a given gene family have been lost or retained at a rate higher than expected for the genome as a whole. If most duplication blocks did in fact originate during one round of polyploidy, this duplication could also be used to provide an internal reference point to use in comparing the rates of amino acid substitutions in members of different gene families.
While polyploidy is one mechanism by which gene family copy numbers expand, tandem or local duplication is the most commonly evaluated mechanism for gene family expansion. Tandem duplication often results from unequal crossing-over [15] and multiple episodes of unequal crossovers might lead to increasing or decreasing copy numbers in gene families, or to simple cycling of genes without large changes in gene family size. Though not investigated in this paper, transposable elements may also have played an important role in gene duplications and genome rearrangements in Arabidopsis [16].
To determine the relative importance of segmental and local duplications in the evolution of large gene families, we developed software to identify clades in gene family phylogenies that have arisen either through segmental or local duplications. In 50 large gene families in A. thaliana, we find that contributions made by these two processes differ greatly from gene family to gene family. We discuss the possible biological significance of these differences in gene family evolution.

Strategy
Our general approach consisted of the following steps. Details, parameters, and software are described in the Methods section.
1) Choose initial gene families and preliminary sequence membership. We began with 2001 Arabidopsis PIR superfamilies, available at MIPS [17], and refined family membership in the subsequent steps.
2) Narrow the gene family selection on the basis of domain arrangements. We determined the Pfam [18] domains of all sequences in each gene family, assessed the consistency of domain arrangements in each family, and excluded families with particularly complex domain arrangements, such as those in several kinase families.
3) Iteratively construct and refine gene family alignments. We constructed T-Coffee [19] alignments using a maximum of 30 genes from each family, then generated hidden Markov models (HMMs), realigned all proteins in each family to the model, used the model to re-search the full set of predicted Arabidopsis proteins, retrieve sequences with expectation values less than 10 -10 , and realign those to the HMM. 4) Trim alignments for use in phylogenetic analyses. This involved removing indel regions, first by removing residues falling outside of the "match states" in the HMM, and then by visually inspecting and in some cases removing other poorly aligned or indel regions. 5) Calculate phylogenies. We generated parsimony and bootstrapped neighbor joining trees, and also calculated maximum likelihood branch lengths for the parsimony topologies. 6) Predict segmental duplications in the Arabidopsis genome, using DiagHunter [20,21]. In a two-dimensional dot plot of amino acid similarity "hits" between chromosomes, segmental duplications appear as diagonal features. The sets of homologous gene pairs that contribute to such features were used in the next step. Similarity is at a BLASTP bit score threshold of 500, with other parameters described in [21]. 7) Determine gene pairs in a gene family having the same coordinates as found in a pair of duplication blocks. Any such gene pair likely duplicated at the same time as the pair's duplication block. We carried out this and the next three steps using OrthoParaMap software developed for this purpose, and described in detail at [22,23]. 8) Annotate the gene phylogenies with information on duplication block membership. Infer nodes that likely originated through segmental duplications, and annotate the phylogeny with this information.
9) Use gene position information to infer which closely related genes (defined in terms of position in gene phylogenies) are located physically "close" to one another (defined in terms of the physical distance between genes, as described below). Infer nodes that likely originated through tandem duplications, and annotate the phylogeny with this information.
10) Add translated EST consensus sequences from other species to help provide additional context. This involved using each A. thaliana sequence in each gene family to query TIGR unigene sets for soybean, M. truncatula, Lotus japonicus, tomato, potato, and corn, then choosing the longest translations, aligning these to the HMM, and recalculating the phylogenies using the same procedures as for A. thaliana (step 5). Though generally not integral to this project, this information was helpful in determining evolutionary patterns for some families -and particularly for families consisting of small, highly-expressed proteins. Because of space constraints, figures 5, 6, 7, 8 include only A. thaliana, Medicago, and tomato sequences, though phylogenies for all sequences are included at [24].

Study set selected from all large A. thaliana gene families
A high-throughput phylogenetic analysis of many gene families is complicated at the start by questions of what constitutes a gene family [25][26][27][28]. Conceptually, gene families have a common ancestor, arise by gene duplication, and may share similar functions. The diversity of sequence and function in gene families often makes delimiting gene families difficult. Operationally, gene Sizes of gene families in A. thaliana Figure 1 Sizes of gene families in A. thaliana Approximate gene family sizes were calculated using single-linkage clustering of BLASTP similarities below E-value thresholds of 10 -10 (red), 10 -20 (black), and 10 -30 (blue). At the resolution of this graph, these lines follow nearly the same path. The curves follow a power law distribution. The best-fit power law equation for the 10 -10 curve is indicated on the graph. Dot plots of similarities in A. thaliana chromosomes 1 and 2 Figure 2 Dot plots of similarities in A. thaliana chromosomes 1 and 2 Chromosome 1 is shown to the top and left, chromosome 2 on the bottom and right. Dots represent BLASTP similarities at bit score thresholds of 500. Synteny blocks identified by DiagHunter [20,21] are shown in black (larger images are available at [24]). Hits of proteins to themselves have been suppressed. A large excess of local duplications is apparent in higher densities near the main diagonal. The average density at any given distance between genes can be calculated from diagonal strips through the dot plot. One such strip is highlighted in chromosome 2 × 2. families can be defined in terms of levels of sequence similarity and domain composition, but a simple similarity threshold may be misleading if the threshold inappropriately splits a divergent superfamily, or inappropriately groups together separate gene families that share a common domain [25,26].
To limit the scope of this study and to avoid some of the complexities presented by superfamilies with diverse domain arrangements, we arbitrarily chose 50 gene families with at least 20 members, functional domains in common, and consistent family membership. Consistent family membership was judged by distributions of expectation scores in HMM searches (using hmmer [29]) of the A. thaliana proteome. Preference was given to families in which there is a clean drop-off in HMM E-value, with members having scores of at worst 10 -10 and nonmembers generally having much poorer scores. Apart from the minimum family size, we chose better-studied families, though some have no members with known functions or Pfam domains [18]. Lastly, we chose families with a range of family sizes, from families of 20 members up to the approximately 225-member cytochrome P450 superfamily (though the total number of P450 genes in A. thaliana, including members of all diverse subfamilies, is estimated to be 275 -285 genes [1,4,5]). The families used in this study are shown in Table 1.
To get a sense of the distribution of gene family sizes, we also conducted a simple whole-proteome homology search and single-linkage clustering at two BLASTP [30] thresholds. In this context, single linkage clustering transitively merges sets of genes in which any gene is sufficiently similar to some other gene in the set. These results are shown in Figure 1. The distribution closely follows a power-law ( Figure 1; y = 1642.7x -0.8517 , R 2 = 0.96). In such a distribution, there are few families with large numbers of members and many families with few members. A power-law distribution is worth noting in part because it calls for a mechanism for the evolution and maintenance of family sizes. Any proposed mechanism will need to be consistent with the mechanisms of individual gene duplications and losses in various families.
Densities of homologs by genomic distance in A. thaliana chromosome 2 and genome-wide  Using a BLASTP E-value threshold of 10 -10 ( Figure 1

Tandem duplications quantified
Our goal is to distinguish between gene duplication resulting from segmental duplication of chromosomal regions, and tandem duplication generating nearby gene copies. This required operational definitions of both gene similarity and genomic proximity. Similarity should be determined in the context of the gene phylogeny because  different genes in different families evolve at different rates. We limited the search for tandem duplications to sequences with ≤75% of the average evolutionary distance from terminal nodes to an approximate midpoint root in the phylogeny, the maximum search depth. This is somewhat arbitrary cutoff, but avoids very early duplications in the phylogenies, for which mechanisms are difficult to infer. To determine whether two genes are physically close enough to conclude that they probably arose through tandem duplications, we measured the average genomic distance at which there is an excess number of duplications above the genome average.
Following the approach of Vision et al. [9], we use a dot plot to map the occurrence of two similar sequences located in different genome regions. Locations along the linear sequence of genomic regions are graphed as the Xand Y-axis with each dot at an XY coordinate marking a similarity "hit" (Figure 2). Dot plots of chromosomes compared to themselves, e.g. chrom. 1 by chrom. 1, map local tandem duplication as well as segmental events within the chromosome. Segmental events are represented as dense linear arrays of dots in the same or opposite orientation as the main diagonal but located off the main diagonal.
The density of dots in any portion of the dot plot represents the density of matches between the genome regions being compared. If a large amount of tandem duplication has occurred in a chromosomal region, this will be visible as a densely dotted region near the main diagonal of a chromosome plotted against itself. The dot plots we present do not include the main diagonal itself (showing the similarity of each protein to itself). The average density at any given distance between genes can be calculated from diagonal strips through the dot plot. One strip is presented as an example in Figure 2. Breaking down the density by distance observations into closer intervals, we next plotted density in 5 kb windows.
The graph in Figure 3b shows similar density measurements, but within 5 kb windows, from 100 kb downstream from a query to 100 kb upstream in chromosome 2. The plot obtained is very similar for all five chromosomes and Figure 3b also shows a genome average. In each, we find a dramatic excess of local duplications falling within 50 kb physical distance. The units are homologs per area, where a meaningful unit area might be (5 kb) 2 . The rationale for using (5 kb) 2 in the denominator is that in A. thaliana, the average gene density is approximately one gene per 5 kb, so if all genes were homologous, the number of homologs between any two 5 kb regions would be 1. The value of one homolog per (5 kb) 2 in A. thaliana might therefore be described as one density unit (d.u.; a term novel to this paper). As would be expected, the highest densities of local duplications are seen at 5 kb ( Figure 3b). In windows extending from 5 kb to 10 kb from any gene, the density of apparently duplicated genes (BLASTP threshold of 10 -10 ) is 0.098 d.u. genome-wide. This means that on average, there are ~0.1 homologs within any two 5 kb windows that are separated by 5 kb, or that one duplicated gene in ten is likely to have a homolog very close by. In the 100 kb window centered on any gene, the corresponding density of duplicated genes is approximately 0.020 -0.035 d.u., depending on the chromosome. In all chromosomes, a clear local duplication effect is not seen beyond 50 kb. Thus, we define tandem duplications as those closely related genes falling within 50 kb of one another.
The distribution of densities of locally duplicated genes by distance follows an exponential distribution (R 2 value of 0.98, Figure 3). Integrating under this curve, 90% of the area under the curve in the interval between 5 kb and 100 kb is found within the smaller interval of 5 kb to 50 kb and represents densities above an average background density of 0.002 d.u. Again, this supports the use of 50 kb as a reasonable threshold for identifying local duplications in A. thaliana.

Expected values for tandem and segmental duplications
To compare the relative contributions of tandem and segmental duplications when gene families differ substantially in size, we generated expected values for tandem and segmental duplication events for gene families of each size class, calculated a ratio of observed to expected values for these two mechanisms, and compared the ratios for each family.
We simulated distributions of expected numbers of tandem duplications that would occur by chance for a gene family of a given size in a genome of a given size. The simulation procedure is to randomly place N genes in a 100,000 kb genome (the approximate extent of euchromatic DNA in A. thaliana), and to count the number of genes that are within 50 kb of one another. A total of 1000 simulation runs generates a distribution for each gene family size. For small gene families, the probability that two genes will fall near one another follows a Poisson distribution. For example, a mean and variance of 0.12 Proteasome 20S subunit family: low tandem, high segmental Figure 5 Proteasome 20S subunit family: low tandem, high segmental The phylogeny on the left shows segmental duplications in the A. thaliana proteasome 20S subunit family, which lacks tandem duplications. The phylogeny on the right represents the same A. thaliana sequences but with M. truncatula and tomato EST sequences added to evaluate the degree to which these homologs are conserved. Relationships of clades represented in both phylogenies are in general agreement, with some differences due to instabilities of some deep nodes.
NBS-LRR disease resistance family: moderate tandem, low segmental duplications Figure 6 NBS-LRR disease resistance family: moderate tandem, low segmental duplications The NBS-LRR disease resistance family is divided into two subfamilies: the non-TIR subfamily (top third of the phylogeny) and the TIR subfamily (the bottom two-thirds). Tandem duplications are indicated with "t" and segmental with "S." Other duplications are not classified by our methods. For clarity in the large tree, gene names and positions have been removed. The complete phylogeny, including bootstrap values, is available at [24].
Chlorophyll a/b binding protein family: high tandem, low segmental duplications  Major latex protein family: high tandem, low segmental duplications Figure 8 Major latex protein family: high tandem, low segmental duplications The phylogeny on the left shows segmental and tandem duplications in the A. thaliana major latex protein family. The phylogeny on the right shows the same A. thaliana sequences with M. truncatula and tomato EST sequences added to provide an indication of degree of conservation of these sequences and lineages. Clades are generally represented in comparable relationships, with some differences due to instabilities of some deep nodes. Bootstrap values are indicated as follows: *** >90%; ** >=80%; * >=70%. Note the expansion of several clades in each species following separation of these taxa. Our goal for calculating expected numbers of segmental duplications was to establish an easily interpretable normalizing constant (a different objective than establishing values for a standard null hypothesis). Our assumption was that the majority of genes resulting from segmental duplications have been lost, and we wanted a way to compare extent of loss between families beyond the level expected due only to the loss of large duplicate regions. By our method of identifying segmental duplication blocks [20,21], approximately 75% of the euchromatic portion of the Arabidopsis genome exists in at least one duplication block. If all genes within those duplicated regions had been retained, then (all other factors being equal) the fraction of gene copies expected in segmentally duplicated regions would also be 75%. In fact, the proportion of retained gene copies is much lower than this, but 75% provides a baseline and normalizing constant for comparing observed counts of gene copies due to segmental duplication in gene families of different sizes ( Table 2). Table 2 shows counts of tandem and segmental duplications in each family, together with ratios of these counts to the expected genome average for tandem or segmental counts for each family size. Other types of events, including transpositions or remnants of segmental duplications, were not classified. The ratio of observed/ expected tandem duplication counts demonstrate an enormous range from 0 to 63; some families are the apparent result of no tandem duplication while one, the Germin family, demonstrates 63 times as many gene copies in tandem arrays as would be expected by chance. The ratio of observed/expected segmental duplication events range from 0 to 0.52; some families have lost all segmental duplicates predicted by the model, while some have lost only about half the duplicates predicted by the model. Table 2 presents gene families grouped by low, medium, and high ratios of observed/expected tandem duplication events, and then grouped by low, medium, and high ratios of observed/expected segmental duplication events.

Counts of tandem and segmental duplications
To generate these classes, cutoffs are set at one standard deviation above and below the median. Several families in the tandem and segmental categories also fall above two standard deviations, and these are also indicated in Table 2.
Gene families represented in Table 2 tend either to fall into high-tandem/low-segmental duplication classes or vice versa as is evident in the moderate negative correlation found in a plot of expected/observed segmental and tandem duplications (correlation coefficient = -0.47; R 2 = 0.22; p = 0.00057 for ANOVA F-statistic; Figure 4). Among the eight low-tandem duplication families, none are in the low-segmental duplication category, and two of the eight have segmental duplication counts that place them approximately two standard deviations above the segmental-duplication median. Among the eight low-segmental duplication families, none are in the low-tandem duplication category, and five fall more than one standard deviation above the median ratio of observed/expected tandem-duplication events. There are gene families such as PR1 and CatHydExch with high numbers of segmental or tandem duplications compared to that expected, but not high numbers of both. Neither the ratios of observed/ expected tandem events nor the ratios of observed/ expected segmental duplications are correlated with gene family size (the R 2 values are 0.044 and 0.066, with p-values 0.14 and 0.08, respectively).
In our set of 50 gene families, the low-tandem duplication class appears to be represented by highly conserved, housekeeping or key regulatory gene families, while the medium-and high-tandem duplication classes are represented by families involving pathogen defense or diverse enzymatic functions. Families involved in pathogen defense all fall in the medium-or high-tandem duplication classes; the NBS-LRR [31,32], Thaumatin [33], Germin [34,35], PR1 [36], and Major Latex Protein/PR10 families [37]. The low-tandem duplication class includes two of the three transcription factor families (heat shock and WRKY) and some housekeeping gene families (mitochondrial carrier proteins [38,39], proteasome 20S subunit family [40,41]).

Gene phylogenies from multiple species
Some phylogenies that include only A. thaliana sequences appear to have long internal branches -potentially indicating rapid evolution. Addition of homologous sequences from other species provides a means of testing whether genes in these families have evolved rapidly, or whether long internal branches indicate ancient differences between highly conserved protein sequences. This approach is shown in Figure 5, a phylogeny of the 20S proteasome subunit family [41,42]. The right-hand phylogeny includes representatives from three species: A. thaliana, tomato, and M. truncatula. The tight clustering of sequences at the end of long internal branches indicates that this family consists of highly conserved amino acid sequences that have been retained in these genomes for extended times -though it should be said that taxa represented here are fairly closely related dicotyledons, and sequences from basal angiosperms or gymnosperms would likely be placed much more deeply in the phylogeny. Similarly, sequences from multiple species were also used for comparisons shown in Figures 7 and 8.
The multi-species approach taken here generally provides qualitative rather than quantitative measures of evolutionary patterns. For tomato and Medicago, we used translated EST tentative consensus (TC) sequences, which are error-prone. Nevertheless, for gene families with highlyexpressed, relatively short transcripts, the information gives estimates of minimum evolutionary distances between A. thaliana and other dicot gene homologs.

Discussion
This paper describes differences in the relative importance of tandem and segmental duplication to the size and evolution among large gene families in the A. thaliana genome. Tandem duplications are clearly an important engine generating new gene copies in genomic clusters, where unequal crossovers generate new diversity. Segmental duplication events have a different effect as they may widely disperse gene copies throughout the genome where they experience few recombinational exchanges with parental copies [43]. To study the joint effects of these genome processes on multigene family evolution, we placed gene families into low, medium, and high tandem duplication classes and low, medium, and high segmental duplication classes, and investigated attributes of some better-studied families within each duplication class.

Distribution of gene family sizes
The frequency distribution of A. thaliana gene family sizes closely follows a power-law relationship (Figure 1). A plausible explanation for this distribution in other genomes was proposed by Huynen and van Nimwegen [44]. In their model, gene families are founded by a single ancestor, and through duplications and deletions, the family size fluctuates over time, with the possibility of the family going extinct from the genome. The requirements of the model are that all members in a family have the same probability of duplication or loss at any given time, different gene families may have different probabilities at any given time, and the average of all duplication probabilities is less than one (preventing gene families from growing to infinity). Under these general conditions, the model generates a power-law distribution of the sizes of surviving gene families [44][45][46] and thus, selection need not be invoked to explain gene copy number distribution per se.
Still, our observations show that since the time of segmental events (the most recent of which is estimated to have occurred 9,12], varying numbers of gene copies have been maintained in segmentally duplicated regions across different gene families. For example, gene copies generated by segmental duplication are more often retained following polyploidy in the more slowly-evolving MYB gene family ( Table 2, [24]) whereas, in the large, rapidly-evolving NBS-LRR disease resistance family, duplication in local genomic clusters is common with surprisingly low retention of segmental duplications. Below, we consider the possible biological significance of gene duplication patterns for each class of gene families.

Low tandem, low segmental duplication
Eight gene families were classified as low-tandem duplication and of these, most fell in the moderate or high segmental duplication classes. A few families demonstrating low tandem duplication levels also demonstrate relatively low segmental duplication levels compared to a genomewide average. The mitochondrial carrier protein family (MC or MCP [38,39]) and heat shock transcription factor family [47] each have retained few segmental duplications and yet demonstrate no apparent tandem duplications or a clustered organization. The MC proteins serve as antiporters, preferentially exchanging one solute for another [38,39,48]. Structurally characterized members of the MC family are dimers. Conceivably, additional gene copies might disrupt the stoichiometry of protein dimers in these transmembrane complexes, particularly once duplicated genes were lost following polyploidy. It remains to be tested whether gene duplication and loss patterns in members of protein complexes generally differ from patterns for monomeric proteins.
It is likely that a relatively large portion of the variance in the segmental losses across the 50 gene families is a result of the stochastic process of genomic loss following polyploidy and thus, will appear a more course-grained process than tandem duplication and loss. In Arabidopsis, many megabase duplication blocks have been retained, while other very large regions have been lost. In any case, the more extreme cases of very high or very low apparent segmental duplication warrant further description below; the high-segmental duplications for the proteasome 20S subunit family and the low-segmental duplications for the NBS-LRR family. These have observed/expected segmental ratios of 0.41 and 0.05, respectively.

Low tandem, high segmental duplications
A large proportion of the families in or near the low-tandem duplication class also fall in the high-segmental duplication class (Table 2). These include proteins involved in a variety of roles: transcription factors (MYB), signalling (GTP binding proteins, calmodulin, phosphoprotein phosphatase), various enzymatic functions (glycosyl transferase, plastocyanin), membrane transport (major intrinsic protein), and cellular housekeeping roles (Proteasome 20S subunits).
The proteasome 20S subunit family provides an interesting case study with which to consider possible constraints on duplication processes and gene copy numbers [22,23]. In eukaryotes, the proteasome recycles proteins by degradation of ubiquitin-tagged proteins [41,42]. It is a large protein complex, consisting of a 28-subunit catalytic cylindrical structure, called the 20S proteasome, and an ATP-dependent 19S regulatory particle consisting of an additional set of approximately 18 subunits [49]. The 20S proteasome is made up of four stacked rings. The two middle rings are each composed of seven 20S beta polypeptides, and these rings are sandwiched between two alpha rings, each composed of a ring of seven polypeptides, giving an 7α 7β 7β 7α structure [42]. In most eukaryotes described to date, each of the seven alpha and seven beta subunits is somewhat different from one another, requiring 14 types of proteasome subunits to make up the 20S proteasome [50].
In the A. thaliana 20S proteasome there are 23 genes encoding 20S proteasome subunits [40,41,51,52] -rather than 14. The phylogeny in Figure 5 suggests the origin of the additional subunits. There are two large clades of 20S proteasome sequences, each representing the alpha or beta subunits and each alpha or beta clade is composed of seven clades or lineages representing different alpha (or beta) sequences [40,42] ( Figure 5). Interestingly, it appears that there are two, nearly complete sets of alpha and beta subunits because rather than the expected 14 sequences, we find 23 sequences comprised of nine pairs (18 total) plus five as singletons. We identified seven instances of segmental duplication on the tree, and the short branch lengths for two additional pairs suggest that these also may represent remnants of the same, recent, polyploidy event ( Figure 5).
The scenario suggested by the combination of biological, phylogenetic, and genome contextual information is that following a round of genome doubling, roughly 20-40 Mya [7,[10][11][12], most members of the duplicated proteasome subunits have been maintained but that five copies have been lost. There are no tandem duplications in this gene family, thus these two, nearly complete sets of 20S subunits were apparently generated by segmental duplication alone. Maintenance of seven alpha and beta lineages suggests maintenance of the stoichiometry of the 20S components, while tolerance of duplicated subunits might provide greater regulatory or catalytic flexibility [49]. Moore and Purugganan [53] describe precedence for positive selection driving the fixation and preservation of at least some duplicate genes in Arabidopsis. Whatever the cause of higher-than-expected retention of segmental duplicates in the Proteasome subunit family, this pattern contrasts with the following example in NBS-LRR resistance genes, which shows rapid turnover of gene family members and loss of major gene lineages in some plant families.

Moderate tandem, low segmental duplication
Three families in the moderate-tandem duplication class are also in the low-segmental duplication class: a protein phosphatase family, an acyltransferase family, and the NBS-LRR disease resistance family. The NBS-LRR disease resistance gene family contains 152 members in A. thaliana by our HMM-search criteria. Members of the family have been shown to confer resistance to a wide variety of pathogens [31,32,[54][55][56] and are of tremendous economic importance so, we focus on these as an example here.
We found 54 tandem duplication events and six segmental duplication events on the tree. For one of the largest clades, the 60-member RPP5-containing clade in the TIR subfamily, we identified 25 tandem duplications within this single RPP5 clade, 24 of which occur after segmental duplications. The segmental duplications map to duplication blocks dated to the recent round of polyploidy and estimated to have occurred after the separation of Brassicaceae from other dicot families, roughly 20 -40 Mya [10,57]. Given that no close homologs outside of Brassicaceae species [31] have been found to date for the RPP5 family, a likely scenario is that a RPP5 ancestral sequence underwent tandem and perhaps another transposing duplication, was subsequently amplified via polyploidy, and then experienced multiple rounds of tandem duplication.
In the NBS-LRR resistance gene family, Baumgarten et al. [43] find that segmental duplications largely explain the genome-wide distribution of NBS-LRR homologs. Here, we extend these findings to show that tandem duplications and losses play the dominant role in affecting copy number, as in the expansion of the RPP5 homologs. In fact, net gene loss following polyploidy, coupled with dynamic expansion and loss, could lead to quite variable numbers of sequences in any given clade. Across the very large NBS-LRR gene family, we observe several clades for which we could demonstrate few or no tandem duplications [31]. Perhaps the most dramatic example of sequence loss is the complete lack of TIR NBS-LRR sequences in the Poaceae, despite abundant sequence for both the grasses and the NBS-LRR gene family [32,58]. The absence of TIR sequences in the grasses has to be inferred as a loss of this sequence type from the grasses because TIR homologs have been found in pine [31] and moss [59]. Just as certain clades have expanded rapidly, such as the RPP5 clade, other lineages such as the entire TIR subfamily in grasses appear to have been lost.

High tandem, low segmental duplication
Among the 11 high-or very-high tandem duplication families, five are in the low-segmental duplication class, and four other families demonstrate a lower than median level of segmental duplications. The high tandem, low to moderate segmental duplication families fall into several broad functional categories. Two families in which the level of tandem duplication is more than two standard deviations above the genome median are the germin and major latex protein (MLP) families. Both families are involved in pathogen defense as well as other functions. Germins have been found to play a variety of roles, including cell wall formation during germination, stressrelated signaling, production of active oxygen species, and degradation of oxalate presented as a fungal toxin [60,61].
Other families with members shown to have roles in defense against pathogens are the subtilisin-like serine proteases and the pathogen-related PR1 family [36]. In contrast to the apparent advantage of diverse defense sequences, why might the chlorophyll a-b binding (CAB) family have retained high numbers of tandem duplications? A priori, the CAB family might be expected to evolve much more conservatively than the very highly duplicated defense-related families because these proteins form the large multi-protein complexes of photosystems I and II [62]. We will describe the unusual CAB family first and then the MLP family.
High tandem, low segmental duplication, example 1: CAB The CAB proteins are components of the complex multisubunit photosystems I and II (PS I and PS II) [63,64]. Both photosystems consist of a chlorophyll-binding core, and a peripheral antenna or light harvesting complex (LHC). In addition to the light-harvesting function, the antenna is able to dissipate excess energy through a process called feedback de-excitation [65]. There are at least 10 distinct types of proteins in the LHCs for PS I and PS II [62,64], encoded by the nuclear lhc genes. The basic structure of photosystems have been conserved since the evolution of early land plants [62,64]. Four lhc genes associated with PS I are denoted lhca1-4. Genes associated with PS II are denoted lhcb1-6, and lhcb1 and lhcb2 can also associate with PS I [62][63][64]66]. Although the basic structure of photosystems have been conserved since the evolution of early land plants [62,64], our results show surprisingly dynamic copy numbers, especially for lhcb1, lhcb2, and lhcb3. Figure 7 shows a phylogeny for the A. thaliana CAB family (left side) and for comparison, a phylogeny that also includes consensus ESTs from tomato and M. truncatula (right side). In the A. thaliana gene phylogeny, four tandem duplications and the one segmental duplication were detected in the clade that contains lhcb1, lhcb2, and lhcb3.
In the three-species gene phylogeny, there are multiple lhcb1, lhcb2, and lhcb3 homologs in both tomato and Medicago. These paralogs show recent independent expansions in each species lineage to generate sets of paralogs more similar to one another than to homologs from another species. In the A. thaliana lhcb1 clade, for example, there are five paralogs that have arisen through three tandem duplications following divergence of Brassicaceae, and one segmental duplication prior to divergence of Brassicaceae. In the corresponding clade of the three species gene tree, there are at least 11 tomato sequences, with six arising before the tomato/A. thaliana split, and at least six Medicago sequences, with one arising before the Medicago/A. thaliana split. Similar phylogenetic patterns are apparent in EST data for corn, soybean, and potato (not shown). In contrast, none of the lhca genes nor the lhcb4-6 genes show recent amplification of gene copy number.
There appear to be different evolutionary modes at work in different parts of the CAB family. The phylogeny suggests high rates of turnover in the lhcb1-3 genes and low turnover with possible loss of segmental duplicates in the other lhc genes. Structural and functional studies show that in PS II, a dimer of the core complexes is flanked by two proteins each encoded by lhcb4, lhcb5, and lhcb6. These, in turn, are flanked by a total of four trimers of lhcb1, lhcb2, and lhcb3 [64,67,68]. Lhcb4 is essential to formation of a functioning PS II, but functional photosystems are still formed if expression of lhcb1 and lhcb2 is inhibited [67,68] and transcription of lhcb5 is strongly upregulated. However, Lhcb1 and Lhcb2 proteins do play important roles in low light conditions [64] and in establishing the proper formation of grana stacks, and Lhcb5 can not entirely compensate for these functions [64]. Thus, the evolutionary flexibility of lhcb1-lhcb3 genes may provide a mechanism to tune the light harvesting complex for different light conditions [65], while in contrast, the genomically dispersed and evolutionarily more stable, lhcb4, lhcb5, and lhcb6 genes maintain the photosynthetic core of PS II.
The Major Latex Protein (MLP) family encodes proteins that were originally isolated from the latex of opium poppy [69,70] but also found in a wide range of plants and tissues [71]. Functions of MLP are not known, but they do show significant similarity to a pathogenesisrelated proteins (IPR or PR10 proteins [37]) which show increased expression with pathogen or stress challenge [37,[72][73][74]. Members of the two gene families (MLP and IPR-PR10) show only about 25% identity to each other, but sequence and structural analyses indicate that they are similar enough to be considered to be part of a single superfamily [37]. Interestingly, there are no A. thaliana homologs that group with the IPR-PR10 subfamily [37], but we located 11 tandem and three segmental duplications for the MLP family (Figure 8), resulting in a tandem duplication observed/expected ratio of 54.5 and segmental duplication observed/expected ratio of 0.16.
Evolutionary distances among sequences resulting from the predicted segmental duplications are greater in the MLP than among segmentally duplicated sequences in the proteasome family. In the MLP family, pairwise distances among segmentally duplicated sequences range from about 15 to 60 PAM units [75], but in the proteasome 20S family, range from 0 to about 4 PAM. Nevertheless, the MLP duplications do appear to come from the same polyploidy event as the proteasome duplications (Blanc et al. [6,10]). Clearly, the MLP members have been evolving much more rapidly following polyploidy than have the proteasome 20S subunits or most members of the CAB family. As above, we used Medicago and tomato sequences to mark divergence times and the results support duplication due to a recent polyploidy event in Arabidopsis.

Patterns of gene duplication
We observed a moderate negative correlation between levels of predicted tandem and segmental duplications in gene families. If either sequence variation or gene copy number must be maintained within some bounds, one possible source of selection against tandem duplication is that unequal cross-over and gene loss will generate variation and high turn-over of gene copies [15,76]. Conversely, segmental duplicates may be more often retained due to subfunctionalization, without increasing the likelihood of gene rearrangment [77,78]. In families that demonstrate moderately high levels of segmental and tandem duplication, gene family members have been retained within segmental duplication blocks, while gene copy number in some clades has been expanded by tandem duplication. An example is found in flavin-containing monooxygenase family, which has six segmental duplications at various nodes in the tree, and seven tandem duplications accounting for members of one clade. Another example includes the chlorophyll a/b binding protein family. On the other end of the spectrum, we found few families in our study set with both low segmental and low tandem duplication. No families fall below one standard deviation below the median in both the segmental and tandem categories, but several are close: the mitochondrial carrier proteins and the heat shock transcription factor families have 9% and 13% of the expected segmental duplications, respectively. Conceivably, protein copy stoichiometry is critical in some families representing multi-subunit protein complexes [77,78].

Conclusions
The relative contributions of tandem and segmental duplication to the generation and maintenance of 50 large A. thaliana gene families was characterized using ratios of observed/expected tandem duplication and observed/ expected segmental duplication. Counts of tandem and segmental duplications were negatively correlated; no families exhibited both high levels of tandem and segmental duplication. Although the distribution of gene family sizes across the genome can be accounted for by a stochastic model, by comparing the relative levels of tandem and segmental duplication in large gene families, we can speculate that gene function might feedback on copy number and genome organization, and thus result in the widely varying patterns of observed tandem and segmental duplication. Predicted proteins for all gene families were aligned using T-Coffee [19] and a maximum of 30 randomly selected proteins sequences from each family. These initial alignments were used to create HMMs, which were in turn used to re-align the full protein sets. HMM parameters in hmmer [29] were: "hmmbuild --archpri .7 --fast --gapmax .3". The HMMs were calibrated using hmmcalibrate, and then were used to search the full set of predicted A. thaliana protein sequences, using the hmmsearch program in hmmer [29]. Sequences scoring at least 10 -10 were generally retained as gene family members and genes scoring worse than this threshold were excluded, although scores were evaluated in the context of all scores in the putative gene families. Families with gradually declining scores in the range of 0.1 -10 -15 were generally excluded from the study because the difficulty in unambiguously assigning family membership.

Gene family selection, alignment, and phylogeny construction
Alignments were prepared for use in phylogenetic reconstructions as follows. To remove highly variable or indel regions, sequence positions falling outside of HMM match states were removed. Genes matching fewer than 75% of the remaining positions were removed entirely. Alignments were also manually inspected, and other particularly poorly aligning regions were removed. Both fulllength and trimmed sequences for all gene families are available at [24].
Parsimony and bootstrapped neighbor joining trees were calculated for each gene family. Parsimony trees were calculated using protpars in the Phylip suite [81]. The input sequence order was jumbled five times, and a topology calculated based on each data order. One most-parsimonious tree was chosen at random to serve as the basis for branch length calculations. Maximum likelihood branch lengths were calculated on the parsimony topologies using TreePuzzle [82]. The model of substitution was of Adachi and Hasegawa [83], amino acid frequencies were calculated from the input trees, and rate heterogeneity was allowed with four Gamma rate categories. Neighbor joining trees were calculated using Clustalw, without the Kimura distance correction, and with 1000 bootstrap replicates. All trees are available at [24].

Prediction of A. thaliana duplication blocks and gene family segmental and tandem duplications
Internal genomic duplications were predicted using Diag-Hunter [20,21]. All duplication block predictions (genes, genomic coordinates, and dot plot images of genomic similarities and predicted duplications) are available at [24]. Predictions of segmental or tandem duplications in gene families were made using the OrthoParaMap suite [22,23].
The approach for identifying segmental duplicates consists of identifying pairs of sufficiently similar genes in a phylogeny that fall sufficiently close to their respective corresponding regions in a synteny block. Pairs of gene family members falling within a synteny block are annotated as such in the phylogeny, using the extended New Hampshire (NHX) format [84]. For all but the 13 largest gene families, the threshold for "sufficiently similar" was set at 10 -25 , and the threshold for "sufficiently close" was set at 50 kb. Because the number of potential false positive hits to a synteny block rises approximately proportionally to the square of the number of genes in a gene family, more stringent thresholds ("similar" = 10 -30 and "close" = 30 kb) were used for the following families (see Table 1 for full names): CytP450, MATE, MFS, Myb, NBS-LRR, WRKY, GSDLLipase, GTPBP, MajIntrins, Prot, Oxidored, Polygalns, SCDehydRed, UDPGlycTnsf.
Nodes giving rise to tandem gene duplications were inferred using the ParaMap program in the OrthoParaMap suite [22,23]. This recursively walks through the tree, identifying internal nodes that give rise to genes or other nodes that are physically near one another (<50 kb) on the chromosome.

Calculation of gene family size distributions
All predicted proteins were used in a BLASTP [30] search against one another, using three different E-value thresholds (10 -10 , 10 -20 , and 10 -30 ). BLAST results were parsed using a BioPerl [85] -based script. Approximate gene families were constructed using a single linkage clustering approach implemented in Perl. The 2003 TIGR A. thaliana 4.0 assembly and protein predictions were used for these procedures.

Calculation of gene duplication densities by distance
Predicted protein sequences in the 2003 TIGR A. thaliana 4.0 assembly were assigned genomic positions based on the nucleotide position halfway between the predicted 5' and 3' positions. Protein sequences from each chromosome were used in BLASTP [30]searches against all other sequences in that chromosome, to give lists of BLAST hits and query/target midpoint positions for each chromosome. Hits to self were excluded.

Calculation of expected tandem and segmental duplications
Tandem duplications expected to occur by chance in a gene family of a given size were simulated under the assumption of a 100,000 kb genome (approximately the size of the A. thaliana euchromatic genome). Gene families of sizes ranging from 20 to 230 genes were simulated, using size classes in increments of 10. Approximate distributions were calculated using 1000 simulation runs for each gene family size class.
Normalizing constants for segmental duplications in gene families of given sizes were calculated under the assumption that the maximum proportion of segmental duplications retained in an average gene family should be the same as the percentage of the genome that exists "In duplicate" (in synteny blocks). Arithmetic for the proportion of expected segmental duplicates (in the absence of local gene losses or duplications following polyploidy) is shown in the Results section.

Additional data
Alignments, gene phylogenies, annotations, analyses of genomic position, relationships to internal genomic duplications, and comparisons with homologous ESTs from various species are available at http:// www.tc.umn.edu/~cann0010/genefamilyevolution/