Highly syntenic regions in the genomes of soybean, Medicago truncatula, and Arabidopsis thaliana

Background Recent genome sequencing enables mega-base scale comparisons between related genomes. Comparisons between animals, plants, fungi, and bacteria demonstrate extensive synteny tempered by rearrangements. Within the legume plant family, glimpses of synteny have also been observed. Characterizing syntenic relationships in legumes is important in transferring knowledge from model legumes to crops that are important sources of protein, fixed nitrogen, and health-promoting compounds. Results We have uncovered two large soybean regions exhibiting synteny with M. truncatula and with a network of segmentally duplicated regions in Arabidopsis. In all, syntenic regions comprise over 500 predicted genes spanning 3 Mb. Up to 75% of soybean genes are colinear with M. truncatula, including one region in which 33 of 35 soybean predicted genes with database support are colinear to M. truncatula. In some regions, 60% of soybean genes share colinearity with a network of A. thaliana duplications. One region is especially interesting because this 500 kbp segment of soybean is syntenic to two paralogous regions in M. truncatula on different chromosomes. Phylogenetic analysis of individual genes within these regions demonstrates that one is orthologous to the soybean region, with which it also shows substantially denser synteny and significantly lower levels of synonymous nucleotide substitutions. The other M. truncatula region is inferred to be paralogous, presumably resulting from a duplication event preceding speciation. Conclusion The presence of well-defined M. truncatula segments showing orthologous and paralogous relationships with soybean allows us to explore the evolution of contiguous genomic regions in the context of ancient genome duplication and speciation events.


Background
The rapid increase in eukaryotic genome sequence in recent years enables genome-wide alignments, megabase (Mb)-scale comparisons between species, and fine-scale phylogenetic footprinting. Recent sequence-based studies in a variety of organisms have described high levels of synteny (conservation of gene content and order between species) within kingdoms and between families, but have also highlighted frequent synteny loss and degradation due to gene duplication, deletion, and rearrangement. In some cases, observed synteny has been extensive. In vertebrates, over 90% of the mouse and human genomes (separated by 91 million years; My) lie in syntenic blocks [1,2], some exceeding 40 Mb [2,3]. At a greater evolutionary distance (310 My), the human and chicken genomes show large synteny blocks, including at least 70 Mb of highly conserved sequence [2,4]. Regions syntenic to 1.8 Mb of human DNA were identified in twelve different species including fish, which separated from humans 450 Mya [2,5].
High levels of synteny have also been found in plant families. Molecular marker analysis has allowed chromosome-by-chromosome alignments of several genera within the Solanaceae, Fabaceae, and Poaceae [6][7][8]. Generally, syntenic relationships are complicated by micro- and macro-rearrangements as well as duplications [9]. Complete genome sequences of rice and A. thaliana, models representing the two major clades of flowering plants, allow comparisons across a greater evolutionary distance. Separated by 200 My, rice and A. thaliana nonetheless retain substantial conserved syntenic blocks, including one region spanning 119 A. thaliana genes [10].
Though genomic relationships within legumes are less well characterized, a growing number of studies have begun to reveal extensive synteny between the members of this important plant family. Based on restriction fragment length polymorphisms (RFLPs), substantial genome conservation was discovered among Phaseoloid species, including mungbean (Vigna radiata) and cowpea (V. unguiculata), extending as far as entire chromosomes [11]. Comparable levels of synteny were later demonstrated between Vigna and the common bean, Phaseolus vulgaris [12]. Synteny with the more distant soybean, Glycine max, was more limited, typically on the order of 10-20 cM. Later, Lee et al. [13] observed higher levels of conservation between bean, mungbean, and soybean; A. thaliana also showed conservation with some of the conserved legume regions and even helped to elucidate duplicated regions in soybean. Choi et al. [6] described genome-wide macrosynteny among legumes using a large set of cross-species genetic markers. Though genomic correspondence was reduced by chromosomal rearrangements that increased with phylogenetic distance, they could align chromosomes from a variety of Papilionoid species, including Medicago truncatula and soybean.
M. truncatula and Lotus japonicus are two model legumes that are now targets of large-scale genome sequencing. With more than 100 Mb of genome sequence publicly available in both, genome-scale comparisons at both the macro- and micro-syntenic level are possible. Young et al. [14] compared all finished and anchored sequence between these two genomes (111 Mb) and concluded that more than 75% of both genomes reside in conserved, syntenic segments. At a microsyntenic scale, Choi et al. [6] analyzed ten BAC/TAC clone pairs and found 80% of genes were conserved and colinear. Because of its economic importance, soybean has also been compared to M. truncatula. With few sequences 100 kbp or more in length available, however, comparisons of soybean with reference legumes have been limited to low-resolution surveys and short contiguous segments. Nevertheless, conserved synteny is widespread between M. truncatula and soybean. Yan et al. [15] analyzed three homologous BAC contig groups in detail by comparative physical mapping and cross-hybridization and found that six of eight genome regions exhibited conserved synteny, including three that were extensively conserved. In a genome-wide survey of synteny, slightly more than half of 50 RFLP-based soybean BAC contigs, each approximately 200 kbp in size, exhibited conserved synteny with M. truncatula [16], and nearly 75% of these cases were extensive.
In the course of our genome sequencing work in M. truncatula [14], two regions were observed to be significantly conserved with previously sequenced regions of soybean. These soybean regions contain two important soybean cyst nematode (SCN; Heterodera glycines) resistance loci, rhg1 and Rhg4, which have been studied extensively (reviewed in Concibido et al. [17]). In previous work, our lab and others localized the genetic positions of these genes and characterized their role in resistance [18][19][20][21][22][23][24]. We saturated the regions with genetic markers, developed high-throughput molecular markers, created physical maps, and characterized homoeologous and surrounding genome regions [23][25][26][27][28][29][30]. As a result of the extensive information available and the importance of SCN resistance, these genome regions were eventually sequenced [31,32], including the tentative cloning of rhg1 and Rhg4 (gene 29 in Figure 1 and gene 21 in Figure 2, respectively).
A preliminary examination of the soybean rhg1 region described in the present study concluded that nearly 70% of genes were conserved and colinear between soybean and M. truncatula [6]. Previously, Foster-Hartnett et al. [29] had used survey sequences along a 1 Mb stretch that included and extended beyond the region described here to examine syntenic relationships with A. thaliana. Based on survey (primarily BAC-end) sequence that included both genic and non-genic regions, 35% of soybean sequences were conserved in one or more syntenic A. thaliana regions. The longest syntenic segment in A. thaliana was more than 2 Mb in length, highlighting the existence of long stretches of conserved sequence between distantly related genomes [29].
Figure 1. Synteny block 1. A syntenic block of soybean, M. truncatula and A. thaliana genes surrounding soybean's rhg1 gene (Gm gene 21). Solid black lines connect homologs. Dotted black lines indicate the absence of a homolog in the syntenic position. Blue lines connect orthologs. Pink lines connect paralogs. M. truncatula genes are shown in red, soybean in brown, and A. thaliana in blue. Lighter colored genes represent those that had no significant similarity to Genbank's nonredundant protein database. Gray genes are repetitive elements. A thick gray vertical line connecting sequence assemblies indicates a region in which sequence is not yet available but in which linkage and approximate distance were determined. Genbank accessions are shown in gray.
In the present study, we describe the gene content in two soybean regions totaling approximately 1 Mb in size with more than 150 genes [32] that exhibit extensive synteny to M. truncatula and A. thaliana. The two soybean regions reside on different chromosomes but are functionally linked: each contains a receptor-like kinase gene (rhg1 or Rhg4) tentatively identified as a resistance gene to SCN. Up to 75% of soybean genes in this region are colinear with M. truncatula, including one 300 kbp segment with 33 of 35 soybean genes colinear to M. truncatula. Nearly 60% of the genes in this same soybean region exhibit colinearity with one or more A. thaliana regions. These highly syntenic blocks are discussed in the

Identifying and mapping homologous contigs
We used the Genbank soybean sequences surrounding SCN resistance loci as a basis for searching all available M. truncatula BAC sequences and the A. thaliana proteome for syntenic regions (Table 1). Homologous regions were then used to create corresponding sequence assemblies for all three genera (Figures 1, 2). Where possible, we also merged M. truncatula sequence assemblies by identifying end-sequenced BACs that spanned gaps between non-overlapping BACs. Most of the M. truncatula BACs could be anchored to the M. truncatula genetic map (Table 1) [33].
There is a gap in all three species between synteny blocks 1a and 1b that we were unable to span. In soybean, these two sequence assemblies are genetically linked and located 2 cM apart on LG-G. In addition, synteny block 1b contained two gaps in M. truncatula (Figure 1), one a 90 kbp gap toward the bottom M. truncatula sequence assembly. This gap and its surrounding M. truncatula sequence (totaling about 175 kbp) correspond to an insertion/deletion in soybean just 25 kbp in size and containing two gene models without hits to Genbank's nonredundant (nr) database along with one repetitive sequence. There is also a gap of unknown size between the top and bottom sequence assemblies in synteny block 1b that could not be spanned with end-sequenced BACs. In M. truncatula, the top sequence assembly maps to chromosome 4, while the bottom maps to chromosome 3 (Table 1). These two M. truncatula assemblies therefore appear to be unlinked, even though they show substantial synteny and are apparently both orthologous (see below) to a contiguous region in soybean.
M. truncatula sequences in synteny block 2 also map to chromosomes 3 and 4. One of the M. truncatula homoeologs in synteny block 2, Mt_2ii, maps 5-8 cM below the M. truncatula bottom sequence assembly in synteny block 1b (Figures 1, 2; Table 1). The other M. truncatula duplicate in synteny block 2, Mt_2i, maps to chromosome 4, more than 25 cM from the top sequence assembly in synteny block 1b (Figures 1, 2; Table 1).

Soybean/Medicago truncatula synteny within synteny block 1
The soybean and M. truncatula regions in synteny block 1b are highly syntenic, with nearly complete conservation of orientation and order of conserved genes (Table 2, Figure 1). Seventy-five percent of soybean genes in this region [...]
Synteny block 1a shows lower, yet still impressive, synteny. Nearly half of the genes in synteny block 1a are conserved in order and orientation. Indeed, 44% of M. truncatula genes are conserved in order and orientation in soybean in block 1a, increasing to 50% when only genes with database support are considered. Of soybean genes, 37% are conserved in M. truncatula, increasing to 43% of genes with database support.

Soybean/Medicago truncatula synteny in synteny block 2
As in block 1, extensive synteny is evident throughout synteny block 2 (Figure 2, Table 2). In synteny block 2, there are two duplicated regions of M. truncatula syntenic to soybean, Mt_2i and Mt_2ii, which flank the soybean segment in Figure 2. The Mt_2i homoeolog and soybean share 60% (28 of 47) of their genes. With two exceptions, orientation is conserved between Mt_2i and soybean, and remarkably, a run of 13 out of 13 confirmed soybean genes is perfectly conserved in Mt_2i in the bottom portion of block 2. The corresponding soybean region extends nearly 110 kbp (Figure 2).
The other M. truncatula homoeolog, Mt_2ii, shows synteny with soybean extending more than 300 kbp (Figure 2, Table 2). In this syntenic region, 32% (12 of 38) of soybean genes and 24% (12 of 50) of Mt_2ii genes are shared. One gene, with similarity to a rapid alkalinization factor in Solanum chacoense, shows synteny between soybean and Mt_2ii but appears to have been lost from Mt_2i (Figure 2, Gm gene 34, Mt_2ii gene 18). The middle portion of synteny block 2 exhibits multiple rearrangements and duplications between soybean and Mt_2ii (Figure 2). While much of the corresponding Mt_2i region has not yet been sequenced, BAC-end sequenced clones that span the Mt_2i region indicate that it is less than half the size of the rearranged region in Mt_2ii/soybean.
The Mt_2i and Mt_2ii homoeologs themselves share nine genes, only one of which is absent from soybean (Figure 2). The gene absent from soybean encodes a putative AMP-binding protein (Figure 2, Mt_2i gene 14, Mt_2ii genes 9-10) present in one copy in Mt_2i and two adjacent copies in Mt_2ii. By contrast, Mt_2i and soybean share three times as many homologous pairs as the two M. truncatula duplicates themselves, including 19 homologous pairs that are absent from Mt_2ii. These observations help to illuminate the orthologous and paralogous relationships of these genome regions, which are described in further detail below.
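The percentages quoted in this section are simple ratios of shared genes to the genes counted in a region. As a minimal sketch (the function name `percent_shared` is ours, not part of the original analysis), the counts given in the text can be checked as follows:

```python
def percent_shared(shared: int, total: int) -> float:
    """Percentage of genes in a region that have a syntenic homolog."""
    if total <= 0:
        raise ValueError("total must be a positive gene count")
    return 100.0 * shared / total


# (shared, total) counts as quoted in the text for synteny block 2.
counts = {
    "Mt_2i/soybean": (28, 47),
    "soybean genes shared with Mt_2ii": (12, 38),
    "Mt_2ii genes shared with soybean": (12, 50),
}

for label, (shared, total) in counts.items():
    print(f"{label}: {shared} of {total} = {percent_shared(shared, total):.0f}%")
```

Rounded to whole percentages, these ratios agree with the 60%, 32%, and 24% reported above.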

Comparisons with A. thaliana
High levels of synteny are also maintained between the two legume species and networks of duplicated A. thaliana regions, each with a unique pattern of gene loss (Table 2). For example, about 62% (28 of 45) of soybean genes and half of M. truncatula genes in synteny block 1b have a homolog within a syntenic network of four A. thaliana duplicated regions (Figure 1). With any one A. thaliana region, much lower levels of conserved synteny are observed (between 23% and 34% of soybean genes; 17% and 27% of M. truncatula genes). These results are consistent with the model of large-scale genome duplication followed by gene loss in Arabidopsis [34] and mirror the results of Foster-Hartnett et al. [29] in their low-resolution synteny analysis of the soybean rhg1 region. By contrast, we found only one region in A. thaliana syntenic to block 1a (Figure 1, Table 2), with 20% (4 of 20) of soybean genes and 29% (5 of 17) of M. truncatula genes. In synteny block 2, levels of synteny between the legume species and individual A. thaliana regions were comparable to those in synteny blocks 1a and 1b, but composite syntenies were much lower (Table 2). For instance, just 29% of soybean genes were conserved in At4_2i and 17% in At3_2ii, with only 31% conserved in the network of both A. thaliana regions.
Tandemly duplicated genes with the highest copy numbers occur in a highly rearranged region in the middle of synteny block 2 (Figure 2). The rearranged region in soybean contains 11 copies of chalcone synthase (CHS) genes in three separate groups of four, four, and three genes (genes 39-42, 50-51, 53-54, 57-59 in Figure 2). The latter group appears to have originated from a 25 kbp segmental duplication of the top CHS group and surrounding genes. While soybean has 11 copies of the CHS gene in this region, including CHS1, CHS2, CHS3, CHS4, and CHS5, the Mt_2ii region has only one CHS cluster with two genes, CHS1A and CHS1B (genes 52-53 in Figure 2).
In addition, Mt_2ii contains a group of 10 genes with similarity to A. thaliana auxin-induced proteins 6B and X10A that are absent in soybean (genes 24-33 in Figure 2). It was not possible to analyze the corresponding region in Mt_2i, as this genome segment has not yet been sequenced.
Tandem duplications occur in other regions as well (Figures 1, 2). There are examples of tandemly duplicated genes whose homolog(s) are not duplicated, as well as cases in which two or more homologs have duplicated. For example, soybean and both M. truncatula duplicates in synteny block 2 have three copies of a glucosyltransferase (Mt_2i genes 2-4, Gm genes 18-20, and Mt_2ii genes 4-6 in Figure 2). Cases of differential tandem duplication may have resulted from duplication in only one species or loss of duplicates from one species.

Phylogenetic analysis
Phylogenetic trees were successfully generated for 21 gene families with members in synteny block 1 and for 23 in synteny block 2, many of which included homologs from both M. truncatula duplicates (Figures 1, 2) [see Additional Files 1, 2]. These phylogenies were examined to determine whether soybean and M. truncatula homologs within synteny blocks were more closely related to each other than to homologs elsewhere in the genomes, as represented by expressed sequences. For all 16 phylogenies in synteny block 1b, M. truncatula/soybean homologs were more closely related to each other than to homologs in other genomic regions, strongly suggesting orthology [...]
Nucleotide substitution levels were determined to measure the evolutionary distance between soybean and M. truncatula (synonymous substitution levels) and to identify differences, if any, in selection pressure (nonsynonymous substitution levels). In synteny block 1, estimates of synonymous and nonsynonymous substitution levels were obtained for 34 sets of M. truncatula and soybean homologs (Figure 3a), six in synteny block 1a and 28 in synteny block 1b. In comparing these two blocks, we observed no difference in the number of synonymous substitutions per site (Table 3; 1a: 0.71, 1b: 0.71, p = 0.96), suggesting similar times of divergence between soybean and M. truncatula in both regions. This result is somewhat surprising given that block 1a is composed of a mixture of apparently orthologous and paralogous relationships, while block 1b exhibits exclusively orthologous relationships.
In synteny block 2, the two M. truncatula homoeologs shared the same eight genes with soybean. We therefore focused on paired comparisons using these eight genes.
The extent of synonymous substitutions between soybean and Mt_2i (0.87), an orthologous relationship based on phylogenetic tree analysis, was significantly lower than the extent between soybean and Mt_2ii (1.21), a paralogous relationship (p = 0.008) (Table 3; Figure 3b). All eight paired comparisons show higher levels of synonymous substitutions in the paralogous comparison. Not surprisingly, therefore, the paralogous region has diverged farther from soybean than the orthologous region, implying that a duplication spanning the entire synteny block 2 preceded speciation between M. truncatula and soybean. Nonsynonymous substitution levels per site were comparable between the orthologous and paralogous M. truncatula/soybean relationships, with no significant difference (p = 0.69) (Table 3).
Comparisons of the distance between the two M. truncatula homoeologs in synteny block 2 revealed levels of synonymous substitutions (0.82) comparable to those of the orthologous Mt_2i/soybean comparison in the same block (0.87) (Table 3; Figure 3b; p = 0.85), suggesting that the duplication may have occurred close in time to speciation. It is surprising that the synonymous distance between M. truncatula homoeologs (0.82) is not closer to that of the paralogous M. truncatula/soybean comparison (1.21), since the two should be comparable given a duplication event followed by speciation. However, the difference between them was not significant (p = 0.22). There were no significant differences when comparing homoeologs between synteny blocks (data not shown) for either synonymous or nonsynonymous substitution levels. No differences were observed between tandemly duplicated and single-copy genes for nonsynonymous or synonymous substitution levels (data not shown).
Estimates of synonymous substitution distance between orthologous soybean and M. truncatula regions allowed us to estimate the time of the Medicago/Glycine speciation event; likewise, the synonymous distance between the two M. truncatula duplicates in synteny block 2 allowed us to time the underlying genome duplication. Orthologous regions between soybean and M. truncatula in both synteny blocks 1b (Mt/Gm) and 2 (Mt_2i/Gm) (Figures 1, 2) give similar estimates of divergence since speciation. Synteny block 1b has a median synonymous substitution level of 0.61 per site (Table 3), suggesting 50 My since the divergence through speciation, using an estimate of 6.1 × 10⁻⁹ substitutions per synonymous site per year [35]. Synteny block 2 has a median synonymous substitution level of 0.59 per site (Table 3) when comparing all orthologs, suggesting 48 My since speciation. By contrast, the duplication event in M. truncatula evident in synteny block 2 (Figure 2) appears to have predated speciation, with a median of 0.79 synonymous substitutions per site and an inferred divergence between duplicates at 64 Mya.
[...] region of more than 500 kbp. Of 30 soybean genes and 45 M. truncatula genes confirmed by hits to the Uniref database [37] at BLASTP, e ≤ -4, just nine (20%) were syntenic.

Discussion
Why the regions that we examined in the current study should have remained so highly conserved is unknown. Several explanations for differential conservation of synteny have been proposed. Regions with disease resistance genes often evolve rapidly and show frequent rearrangements [38][39][40]. Nevertheless, the soybean regions in this study were characterized because they contain important disease resistance genes, though not members of the more widespread NBS-LRR gene family most frequently associated with rearrangements [38]. Often, regions near centromeres tend to be more conserved than telomeric regions [41]. But the soybean region in synteny block 1 is known to be located at the very end of the chromosome, while still retaining high levels of synteny with M. truncatula and A. thaliana. Of the mapped M. truncatula sequence assemblies, only the top sequence assembly in synteny block 1b is close to the centromere (R. Geurts et al., personal communication). In some organisms, housekeeping genes are clustered in regions thought to be more conserved [42][43][44]. Though housekeeping genes are certainly present in the regions of this study, the regions do not represent clusters of housekeeping genes [see Additional File 3]. Finally, the presence of transposons and other repetitive sequence may decrease stability in a region [45][46][47]. [...] [48] found that more than 10% of a soybean BAC containing multiple NBS-LRR sequences was composed of retrotransposons. Although we cannot identify hallmarks of the sequences examined that would cause them to be more conserved than usual, they do appear to be highly conserved, and care should be taken in drawing general conclusions from this comparison.

Legume and A. thaliana synteny
The relationship between legumes and A. thaliana in synteny block 1b, described in detail here, seems to follow a pattern of post-speciation duplication followed by gene loss [34]. Several syntenic regions exist that vary slightly in overall degree of synteny, while gene loss/insertion has occurred in every case. A composite of all the partially syntenic regions forms a network that recapitulates substantial genome conservation. The retention of soybean and M. truncatula genes in A. thaliana is impressive, given the roughly 90 My thought to separate A. thaliana from legumes [49]. For example, synteny block 1b has 60% of its genes occurring in at least one of four syntenic A. thaliana regions, though individually, the most conserved of these regions contains just half that level.
Though synteny between legumes and A. thaliana in this region is impressive, previous results suggest it extends beyond the region analyzed here. Foster-Hartnett et al. [29] described conserved synteny involving a genome region around rhg1 twice the size of that examined in the present study, though at low sequence resolution, primarily using BAC-end sequences. Simillion et al. [50] also found conservation among the A. thaliana regions syntenic to synteny block 1b extending up to 182 kbp in length.
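The contrast between per-region and composite (network) synteny described above comes down to taking the union of conserved genes across the duplicated A. thaliana regions. The following is a minimal sketch of that bookkeeping; the gene identifiers, region names, and the function `composite_synteny` are hypothetical toy values chosen for illustration and are not data from this study.

```python
from typing import Dict, Set, Tuple


def composite_synteny(
    query_genes: Set[str], hits_by_region: Dict[str, Set[str]]
) -> Tuple[Dict[str, float], float]:
    """Return per-region and composite percentages of query genes with a
    syntenic homolog, given the set of query genes matched in each region."""
    if not query_genes:
        raise ValueError("query_genes must not be empty")
    n = len(query_genes)
    per_region = {
        region: 100.0 * len(hits & query_genes) / n
        for region, hits in hits_by_region.items()
    }
    # A gene counts toward the composite if ANY region retains a homolog.
    in_any_region = set().union(*hits_by_region.values()) & query_genes
    return per_region, 100.0 * len(in_any_region) / n


# Hypothetical toy example: ten legume genes, four duplicated regions,
# each region having lost a different subset of genes.
legume_genes = {f"g{i}" for i in range(1, 11)}
toy_hits = {
    "region_A": {"g1", "g2", "g3"},
    "region_B": {"g3", "g4", "g5"},
    "region_C": {"g6"},
    "region_D": {"g1", "g7"},
}
per_region, composite = composite_synteny(legume_genes, toy_hits)
print(per_region, composite)
```

In this toy case no single region retains more than 30% of the genes, yet the network as a whole retains 70%, which is the qualitative pattern expected under duplication followed by differential gene loss.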

Exceptions to synteny
There were also some consistent exceptions to synteny. A total of 13 transposons, including retrotransposons, were found among the three species. Just one was found in syntenic positions, where an M. truncatula retroelement in synteny block 1b was located in a comparable position to [...]
Additionally, genes predicted by FGENESH but without database support were much less likely to have homologs than those with database confirmation (9% of soybean predicted genes without database support versus 62% of those with database support have homologs). Still, a small number (roughly 10%) of nearly 60 unconfirmed genes (no hits in nr) in either soybean or M. truncatula did show synteny. Conserved legume genes without homologs in the database may be the most interesting genes of all, since they are likely to be novel or highly diverged from known proteins and may play a role in important plant or legume-specific processes.

Gene density
Although soybean's genome size is more than double that of M. truncatula [51], gene density is comparable in these regions [...] [51]. Higher than expected gene densities in soybean and M. truncatula suggest the possibility of gene clustering. Indeed, gene clustering has been identified in M. truncatula [53] and forms the basis of targeting M. truncatula sequencing to the gene-rich regions of the genome [14].

Genome duplication and speciation
The networks of synteny we have identified reflect duplication and gene loss between species, including M. truncatula regions with both orthologous and paralogous relationships to soybean. In synteny block 2, for example, there are clear-cut examples of regions with orthologous and paralogous relationships. Mt_2i shows only orthologous relationships with soybean, while Mt_2ii shows only paralogous relationships (Figures 1, 2). M. truncatula regions in synteny block 1b also unambiguously display orthology to soybean.
M. truncatula duplications in synteny block 2 allowed us to systematically examine corresponding orthologous and paralogous relationships to soybean. The percentage of conserved genes between soybean and Mt_2i (orthologous region) was twice as high as with Mt_2ii (paralogous region) ( Figure 2, Table 2). Given that orthology indicates the most closely related regions evolutionarily (reflected in the phylogenetic trees) [see Additional File 2], it is not surprising that fewer genes have been deleted/inserted or experienced substitutions in orthologous comparisons. The fact that all the M. truncatula genes in the orthologous region (Mt_2i) are more closely related to soybean as evidenced by phylogenetic trees, synonymous substitution levels, percent identity, and extent of synteny, than either is to the M. truncatula genes in the paralogous region (Mt_2ii) suggests that the duplication seen in M. truncatula occurred before the speciation event splitting Medicago and Glycine lineages. Presumably, soybean also has (or had) a duplicate region as well, a possibility with some phylogenetic support [see Additional File 2].
We date the duplication event in M. truncatula, possibly part of a genome duplication, at 64 Mya, preceding a speciation event approximately 48-50 Mya. Indeed, the possibility of a genome duplication event predating the split between M. truncatula and soybean has been suggested previously [16,54,55]. Median synonymous substitution levels between the two M. truncatula duplicates in synteny block 2 (0.79 synonymous substitutions per site) fall within [55] or near [54] synonymous distance peaks, which were interpreted by the authors as a genome duplication event in M. truncatula. Schlueter et al. [55] estimate that this event occurred 58 million years ago, while Blanc and Wolfe [54] inferred a more recent event based on a substantially different molecular clock [56]. Likewise, we estimate the speciation event between Medicago and Glycine at 48-50 Mya, while Blanc and Wolfe [54] inferred a much more recent date of 13.3-15 million years ago, though again, the differences are primarily due to the use of differing molecular clocks [56].
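The dates above follow from the median synonymous distances reported in the Results and the rate of 6.1 × 10⁻⁹ substitutions per synonymous site per year [35]. The sketch below assumes the standard molecular-clock relation T = Ks / (2r), in which the factor of two accounts for substitutions accumulating along both diverging lineages; this relation is our assumption, chosen because it reproduces the reported 50 My figure, and the function name is illustrative.

```python
# Synonymous substitution rate from the text [35], per site per year.
RATE_PER_SITE_PER_YEAR = 6.1e-9


def divergence_time_my(ks: float, rate: float = RATE_PER_SITE_PER_YEAR) -> float:
    """Divergence time in millions of years, assuming T = Ks / (2 * rate)."""
    if ks < 0 or rate <= 0:
        raise ValueError("ks must be non-negative and rate must be positive")
    return ks / (2.0 * rate) / 1e6


# Median synonymous distances quoted in the text.
medians = {
    "block 1b orthologs (Mt/Gm)": 0.61,
    "block 2 orthologs (Mt_2i/Gm)": 0.59,
    "block 2 duplicates (Mt_2i/Mt_2ii)": 0.79,
}
for label, ks in medians.items():
    print(f"{label}: Ks = {ks:.2f} -> about {divergence_time_my(ks):.1f} My")
```

Under this assumption the three medians correspond to roughly 50, 48, and between 64 and 65 My, in line with the dates given in the text. Substituting a different rate scales every date proportionally, which is why studies using different clocks report such different ages.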
Comparatively long (~500 kbp) and contiguous sets of homologous segments from different species with known phylogenetic relationships and nucleotide substitution levels bring power to the study of molecular evolution. Though median synonymous substitution levels of duplication and speciation events correspond well to published values [54,55] (see above), the extent of synonymous substitutions varies significantly between neighboring genes despite a common genomic context (Figure 3). Estimates comparing the two M. truncatula segments created by a duplication event range from 0.62-1.12, while those comparing soybean and M. truncatula orthologs (speciation event) range from 0.42-2.68. Since the duplicates in each one of these cases presumably diverged at the same moment, one must postulate different evolutionary trajectories for the different gene lineages. Knowing that all the genes on a contiguous genomic block duplicated (and later speciated) together removes an important unknown from evolutionary analyses, in contrast to comparable EST-based studies [54,55].

Conclusion
We analyzed genome regions of soybean, M. truncatula, and A. thaliana with remarkable levels of conservation of gene content and order. Such high levels of colinearity within the legumes and with the model plant A. thaliana bode well for leveraging information from model genomes to crop plants like soybean. Further, we described substantial blocks of genes with the same evolutionary (duplication) history, allowing us to study and compare the individual evolution of genes within a common genomic context. These blocks include two duplicates in M. truncatula, one orthologous and the other paralogous with soybean. This duplication may be part of a larger genome duplication event in the common ancestor of soybean and M. truncatula. If so, the analysis described here is just the first step in understanding the evolution of legume genomes and a useful addition to our knowledge about genomic reorganization that occurs at the scale of a megabase or less.
M. truncatula BACs were sequenced as part of an international effort to sequence the genespace of this model legume [14]. Two additional M. truncatula BACs were sequenced and examined before the international genome sequencing had begun [57]. Putative homologs of soybean sequences in M. truncatula and A. thaliana were identified by searching the soybean sequences against all sequenced M. truncatula BACs and the A. thaliana proteome using BLAST [58] (The Institute for Genomic Research, Arabidopsis Proteome version 5). After identifying genes (see below), protein/protein comparisons (BLASTP) were performed in order to confirm that BACs were syntenic and to identify syntenic genes (see below). Genbank accessions for soybean and M. truncatula sequences, A. thaliana gene numbers, and mapping information are shown in Table 1.

Sequence assemblies
Sequences were aligned and merged in regions of sequence overlap on the basis of 99% identity or better. End-sequenced BAC clones that tentatively spanned gaps in the sequence were identified based on strong hits (e-value = 0, ≥99% identity) to sequenced BACs on either side. Gap sizes were estimated by subtracting the overlap from the estimated size of the end-sequenced BAC(s).
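The gap estimate described above amounts to a simple subtraction. The following is a minimal sketch, assuming the overlap with each flanking assembly is known in base pairs; the function name and the example numbers are hypothetical and are not taken from this study.

```python
def estimate_gap_size(bac_size_bp, left_overlap_bp, right_overlap_bp):
    """Estimate a sequence gap spanned by an end-sequenced BAC.

    Assumption: the gap is the estimated BAC insert size minus the portions
    of the BAC that overlap the sequenced assemblies on either side.
    """
    gap = bac_size_bp - left_overlap_bp - right_overlap_bp
    # A negative value would mean the flanking assemblies already overlap,
    # so report no gap in that case.
    return max(gap, 0)


# Hypothetical example: a 120 kbp BAC overlapping its neighbors by 30 and 25 kbp.
example_gap = estimate_gap_size(120_000, 30_000, 25_000)
```

Because BAC insert sizes are themselves estimates, a value computed this way is only as precise as the insert size measurement.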

Nomenclature
Throughout the manuscript, the following nomenclature is used. Regions surrounding and syntenic to the SCN resistance rhg1 locus are collectively referred to as synteny block 1 (Figure 1). Regions surrounding and syntenic to the SCN Rhg4 gene are collectively referred to as synteny block 2 (Figure 2). Synteny block 1 is divided into blocks 1a and 1b, which are separated by gaps in all three species (Figure 1). Within each synteny block, species are labeled as Gm (soybean), Mt (M. truncatula), or At (A. thaliana). The chromosome number follows the "At" abbreviation for A. thaliana. If more than one homoeolog is present, the species abbreviation is appended with an underscore followed by the synteny block and lowercase Roman numerals (e.g. Mt_2i, Mt_2ii, At4_2i, and At3_2ii in Figure 2) (Figures 1, 2). Sequence assemblies separated by physical gaps are labeled as sequence assemblies "top" and "bottom" in arbitrary order (Figure 1).

Gene prediction and identification of synteny
Genes were predicted in G. max and M. truncatula genomic sequences using the dicot (Arabidopsis) matrix of FGENESH [59,60] (http://www.softberry.com). BLASTP [58] was used to compare the predicted G. max and M. truncatula proteins with each other and with all A. thaliana proteins. For soybean and M. truncatula comparisons, we applied an e-value cutoff of e-8 and a percent identity cutoff of 40% to the top high-scoring segment pair; for comparisons to A. thaliana, we applied only the e-value cutoff of e-8. These cutoff values generally identified homologs in syntenic positions while rejecting related genes in nonsyntenic positions.
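The cutoffs above can be illustrated with a short filter over BLAST tabular output. This is a sketch only: it assumes the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score), takes the highest-scoring segment pair per query/subject pair as the "top" one, and uses hypothetical gene names. It is not the script used in this study.

```python
def filter_hits(lines, max_evalue=1e-8, min_identity=None):
    """Return sorted (query, subject) pairs whose top HSP passes the cutoffs.

    min_identity=40.0 mimics the soybean vs. M. truncatula setting;
    min_identity=None mimics the comparisons to A. thaliana (e-value only).
    """
    best = {}  # (query, subject) -> (percent identity, e-value, bit score)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, evalue, bitscore = float(fields[2]), float(fields[10]), float(fields[11])
        key = (query, subject)
        # Keep only the highest-scoring segment pair for each pair of proteins.
        if key not in best or bitscore > best[key][2]:
            best[key] = (pident, evalue, bitscore)

    kept = []
    for key, (pident, evalue, _) in best.items():
        if evalue > max_evalue:
            continue
        if min_identity is not None and pident < min_identity:
            continue
        kept.append(key)
    return sorted(kept)


# Hypothetical tabular rows (gene names and values are invented).
rows = [
    "Gm1\tMt1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "Gm1\tMt1\t35.0\t50\t32\t0\t210\t260\t210\t260\t1e-9\t60",
    "Gm2\tMt2\t38.0\t180\t110\t2\t1\t180\t1\t180\t1e-20\t150",
    "Gm3\tMt3\t60.0\t40\t16\t0\t1\t40\t1\t40\t1e-5\t50",
]
legume_pairs = filter_hits(rows, min_identity=40.0)
evalue_only_pairs = filter_hits(rows)
```

Applying the identity threshold only to the top segment pair means a weaker secondary alignment between the same two proteins cannot cause a pair to be rejected.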
In this study, we defined synteny to include both conservation of gene content and order between species. In estimating syntenic density (the percentage of genes conserved between two species), repetitive sequences (genes with similarity to transposable elements, including retroelements) were not included and tandemly duplicated genes were counted as one. Synteny between two species was estimated from the first to the last pair of conserved genes in the available sequence for both species.
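The syntenic density rules above can be sketched as follows. The input layout is an assumption made for illustration: genes are listed in chromosomal order, and each carries hypothetical flags `is_te`, `tandem_group` (a shared label for tandem duplicates, otherwise None), and `conserved` (True when a homolog lies in the syntenic region of the other species).

```python
from itertools import groupby


def syntenic_density(genes):
    """Percentage of conserved genes between the first and last conserved gene."""
    # Rule 1: repetitive sequences (transposable element genes) are not counted.
    non_te = [g for g in genes if not g["is_te"]]

    # Rule 2: adjacent tandem duplicates count as one gene, which is treated
    # as conserved if any member of the tandem run is conserved.
    collapsed = []
    for group, run in groupby(non_te, key=lambda g: g["tandem_group"]):
        run = list(run)
        if group is None:
            collapsed.extend(g["conserved"] for g in run)
        else:
            collapsed.append(any(g["conserved"] for g in run))

    if True not in collapsed:
        return 0.0

    # Rule 3: measure only from the first to the last conserved gene.
    first = collapsed.index(True)
    last = len(collapsed) - 1 - collapsed[::-1].index(True)
    window = collapsed[first:last + 1]
    return 100.0 * sum(window) / len(window)


def gene(conserved, is_te=False, tandem_group=None):
    """Build one hypothetical gene record for the example below."""
    return {"conserved": conserved, "is_te": is_te, "tandem_group": tandem_group}


# Hypothetical region of eight predicted genes.
region = [
    gene(False),                   # lies before the first conserved gene
    gene(True),
    gene(False, is_te=True),       # transposable element, ignored
    gene(True, tandem_group="A"),  # tandem pair, counted once
    gene(False, tandem_group="A"),
    gene(False),
    gene(True),
    gene(False),                   # lies after the last conserved gene
]
density = syntenic_density(region)
```

In this sketch the density is computed from the perspective of one species; running it on the gene list of the other species may give a different percentage.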

Phylogenetic analysis
To distinguish between orthologous and paralogous regions, we constructed phylogenetic trees as follows. BLASTP or TBLASTN, as appropriate, was used to compare all G. max genes with the following sequences: all G. max and M. truncatula proteins in the corresponding genomic regions of this analysis; the nonredundant A. thaliana proteome; and soybean and M. truncatula EST unigene sets [61] (GMGI v.11 and MTGI v.7; The Institute for Genomic Research, Rockville, MD). The top 25 hits ≥100 amino acids with e-values ≤ e-10 were included in the analysis. Tandem duplications and highly related genes in the same gene family were grouped for analysis.
Initial alignments were calculated using T-COFFEE [62] with manual evaluations and edits in Jalview [63] for poorly aligning sequences. For subsequent phylogenetic analysis, an HMM calculated for each alignment using hmmer [64] was used to realign sequences and to identify and remove indel regions and sequences with fewer than 60% matches to the model. Parameters for hmmbuild were: archpri = 0.7, gapmax = 0.3.
Parsimony trees were calculated using the protpars program of Phylip [65], with maximum likelihood branch lengths calculated using TREE-PUZZLE [66]. Parameters for protpars were: randomize input order; use ordinary parsimony; search for best tree; select one best tree for further analysis in TREE-PUZZLE. Parameters for TREE-PUZZLE were: user-defined tree (from parsimony search); approximate parameter estimates; Whelan-Goldman substitution model [67]; estimate amino acid frequencies from data set; allow rate heterogeneity with eight gamma-distributed rates.

Nucleotide substitutions
Codon-aligned nucleic acid sequences were created with TranslateAlign.pl (courtesy Dan Kortschak, University of Adelaide, Adelaide, Australia). Nucleotide substitution levels were calculated from these alignments with SNAP (Synonymous/Non-synonymous Analysis Program) [68,69]. In this program, the levels of synonymous and nonsynonymous substitutions per site are approximated using methods developed by Nei and Gojobori [70], incorporating Ota and Nei's statistic [71]. Median synonymous substitution levels were converted into estimates of time since divergence using an estimate of 6.1 × 10⁻⁹ substitutions per synonymous site per year [35].
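The conversion from substitution levels to divergence time can be sketched as below. Two points are assumptions rather than statements from this study: the Jukes-Cantor correction is shown as the standard final step of the Nei and Gojobori method, and the time conversion uses the common relation T = Ks / (2r), in which the factor of two reflects substitutions accumulating along both diverging lineages. The text above does not state which exact relation was applied, and the Ks values in the example are invented.

```python
import math
from statistics import median

# Rate cited in the text: substitutions per synonymous site per year.
SYNONYMOUS_RATE = 6.1e-9


def jukes_cantor(p):
    """Correct a proportion of observed differences p for multiple hits."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def divergence_time_my(ks, rate=SYNONYMOUS_RATE):
    """Convert a synonymous substitution level into million years (My).

    Assumption: T = Ks / (2 * rate), i.e. both lineages accumulate changes.
    """
    return ks / (2.0 * rate) / 1e6


# Hypothetical per-gene Ks values for one pair of homologous segments.
ks_values = [0.45, 0.61, 0.58, 0.90, 0.72]
median_ks = median(ks_values)
estimated_time_my = divergence_time_my(median_ks)
```

Using the median rather than the mean limits the influence of individual genes with unusually high or low substitution levels, which matters given the wide per-gene ranges reported above.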