A genomic perspective on the important genetic mechanisms of upland adaptation of rice

Background Cultivated rice consists of two important ecotypes, upland and irrigated, that have respectively adapted to either dry land or irrigated cultivation. Upland rice, widely adopted in rainfed upland areas in virtue of its little water requirement, contains abundant untapped genetic resources, such as genes for drought adaptation. With water shortage exacerbated and population expanding, the need for breeding crop varieties with drought adaptation becomes more and more urgent. However, a previous oversight in upland rice research reveals little information regarding its genetic mechanisms for upland adaption, greatly hindering progress in harnessing its genetic resources for breeding and cultivation. Results In this study, we selected 84 upland and 82 irrigated accessions from all over the world, phenotyped them under both irrigated and dry land environments, and investigated the phylogenetic relations and population structure of the upland ecotype using whole genome variation data. Further comparative analysis yields a list of differentiated genes that may account for the phenotypic and physiological differences between upland and irrigated rice. Conclusions This study represents the first genomic investigation in a large sample of upland rice, providing valuable gene list for understanding upland rice adaptation, especially drought-related adaptation, and its subsequent utilization in modern agriculture.


Background
Asian cultivated rice (Oryza sativa type indica and Oryza sativa. type japonica) has phenotypically divergent ecotypes [1]. Irrigated and upland rice, respectively adapted to growing in paddy fields and rain-fed dry lands, are two main ecotypes that are generally considered to show apparent phenotypic differentiation. The upland type is more drought tolerant, taller, of lower tillering potential, more resistant to blast, and tends to have longer and thicker roots than its irrigated counterpart [2,3]. Accordingly, upland varieties are widely adopted in upland areas to optimize crop yields because of their more efficient water usage and adaptation to dry land. Though the yield of upland rice is comparatively lower than the irrigated rice, it provides the daily staple food on which nearly one hundred million people depends [4]. Because we are now facing global population expansion and a growing shortage of water resources, upland rice and especially knowledge about its genetic mechanisms for dry land adaptation become more important as they allow the possibility of transforming or modifying high-yield irrigated varieties of rice to upland ones.
Drought resistance is an important component of rice adaption to most rainfed upland areas because in this ecosystem water could not be well preserved and frequent or sporadic drought events could occur each year [5]. Previous research proposed three different mechanisms for plant drought resistance: drought escape, dehydration avoidance, and dehydration tolerance [6]. Through drought escape, timing of plant lifecycle is modulated to escape the time of drought when drought time is predictable in each year. Dehydration avoidance refers to the case that plants constitutively avoid excess loss of water through more developed root systems or a thick layer of cutin [6]. Dehydration tolerance refers to the inducible mechanisms, like osmotic adjustment, which help plants to prevent damage and survive when excess loss of water does happen [6]. For the frequent and unpredictable drought stress in upland areas, the dehydration avoidance and dehydration tolerance should be the appropriate options for upland crops. In fact, upland rice seems to depend more on the constitutive dehydration avoidance than the inducible dehydration tolerance in its adaptation [7][8][9]. By far, some genes have been reported to be related with drought resistance [10], such as genes in abscisic acid (ABA) synthetic pathway (such as 9-cis-epoxycarotenoid dioxygenase genes) [11,12], transcription factors interacting with the cis-acting dehydration response element (DRE) [13], genes synthesizing osmoprotectants [14,15], and mitogen-activated protein kinase (MAPK) genes [16]. However, these studies focused on single gene's effect, while adaptation to dry land may involve many genes in the genome. Furthermore, the valuable genetic resources within upland rice, which were ignored by most previous studies, are also worthy of enhanced utilization.
Upland and irrigated rice evolve their respective characteristic traits under both natural and artificial selections, which lead to phenotypic adaptation to respective environment and simultaneously genetic differentiation between ecotypes. Some statistical tools for identifying selected regions during population differentiation have been well established. Fixation index (F ST ) is a commonly used statistics that measure population differentiation based on genetic polymorphism data. F ST is positively associated with allele frequency difference. Cross-population composite likelihood ratio test (XP-CLR), a newly invented approach, incorporates not only genetic differentiation but also allele frequency spectrum information [17]. By jointly modeling multilocus allele frequency differentiation and frequency spectrum under neutrality and selection, XP-CLR was demonstrated to have high power in detecting selective sweeps during population differentiation. Besides, Nielsen's CLR, a parametric test also based on composite likelihood ratio, detects selective sweeps using the spatial distribution of site frequency spectrum [18]. This approach only uses SNP data of one population and thus detects selective sweeps in single population without consideration of population differentiation.
To investigate the genetic mechanisms for upland adaptation and identify genes for its important component, 'drought resistance', we use our whole-genome resequencing data for a large panel of accessions, including 84 upland accessions and 82 irrigated accessions reported by our previous work [19] , where these data were used to test that a low-diversity region containing a key enzyme gene of abscisic acid synthesis was due to human selection. Considering the wide distribution of upland rice in the world, we obtained our samples from Asia, Africa and South America, covering most major distribution areas of upland rice. We explored phylogenetic relationship between upland and irrigated ecotypes and population structure of current upland rice population. More importantly, by detecting selective sweep regions using F ST and XP-CLR, we compiled a list of candidate genes that are likely responsible for upland adaptation or that are related to phenotypic differentiation between the upland and irrigated ecotypes, providing valuable insights about crop breeding for dry land.

Phenotypes of the upland accessions differ from those of the irrigated ones
For the 84 globally sampled and representative upland rice accessions (see Additional files 1 and 2), we want to confirm whether they possess the reported phenotypes that distinguish them from irrigated rice [2,3]. Therefore we grew the varieties to do phenotyping (see Additional files 3 and 4). Our examination focused on four non-yield phenotypes that are reported to be differentiated between upland and irrigated rice: root weight, maximal main root length, tiller number, and plant height. Moreover, we also examined four yield traits, including thousand grain (1000-grain) weight, number of productive panicles, number of filled grain per panicle and number of empty grain per panicle. Under both well-watering and dry land (rainfed upland) environments, the upland accessions consistently showed significantly taller plant architecture, fewer tillers, longer main roots and heavier root weights, which supports the previously reported phenotypic differences (Table 1, see  Additional file 5 for the phenotypic difference). Regarding the yield traits, under both environments upland accessions show larger 1000-grain weight and less productive panicles as compared to irrigated accessions ( Table 1). The numbers of filled and empty grain per panicle show no significant differences between the two ecotypes under the irrigated condition. However, under the rainfed upland condition, the upland accessions perform much better, i.e. have significantly more filled grain (Student's t-test, P = 0.016) and less empty grain (Student's t-test, P = 5.3e-05) per panicle than the irrigated accessions.
Using software SPSS statistics 17.0, we calculated the pearson correlation coefficients (ρ) between each of the four non-yield traits and each of the yield traits under upland condition (Additional file 6). The plant height is significantly associated with the 1000-grain weight (ρ = 0.263, P = 0.002), the number of empty grain per panicle (ρ = −0.226, P = 0.009) and the number of productive panicles (ρ = −0.2, P = 0.022). The tiller number is significantly associated with the number of productive panicles (ρ ≈ 1, P < 0.001) and the 1000-grain weight (ρ = −0.169, P = 0.05). The root weight is significantly associated with the number of filled grain per panicle (ρ = 0.295, P = 0.001). The maximal root length is significantly associated with the thousand grain weight (ρ = 0.443, P < 0.001) and the number of empty grain per panicle (ρ = −0.232, P = 0.008). Therefore all of the four non-yield traits are significantly associated with yield traits under rainfed upland condition. From the ρ values it could be speculated that larger root weight and greater root length of upland rice would contribute to net yield increase under upland condition probably because better developed root systems could contribute to better dehydration avoidance. However, the effects of plant height and tiller number on yield are complex. The larger plant height of upland accessions would possibly cause the reduction of productive panicles, *Phenotypic value is indicated by mean value ± standard deviation. #The difference is positive when the phenotypic value in the upland environment is larger than in the irrigated environment, otherwise the difference is negative. @ The number of productive panicles were measured at the stage of maturity, while tiller number was assessed before the earing period. Since different accessions have different heading dates, some accessions produce more tillers after we measured the tiller numbers. So the final mean numbers of productive panicles were larger than the mean tiller numbers in the irrigated condition. Δ In the upland environment, most of the accessions have less than two tillers. When we phenotyped the number of productive panicles, we found almost all the tillers develop productive panicles, i.e. the percentage of earbearing tiller is close to 100% in the rainfed upland condition. So here in the upland environment, tiller numbers and productive panicle numbers share the same values. The tillering ability is weak in our upland condition, and no more tillers were produced at the stage of maturity. We focus on four non-yield phenotypes in this study, including plant height, tillering number, root weight and maximal main root length, and four yield phenotypes, including thousand grain weight, number of productive panicles, number of filled grain per panicle and number of empty grain per panicle. These phenotypic data were collected under both irrigated and upland environments. but can increase 1000-grain weight and reduce the number of empty grain. The less tiller numbers of upland accessions can cause reduction of productive panicles, but would increase 1000-grain weight.
When comparing these phenotypes between the two culture conditions, we found that dry land environment can slightly reduce plant height and 1000-grain weight, but will dramatically decrease the tiller number, root weight, number of productive panicles and number of filled grain per panicle of both the upland and the irrigated ecotypes. However, the main root length and the number of empty grain per panicle responded differently to the different environments between the two ecotypes. We observed a minor decrease (−2.5%) of main root length of irrigated rice but a significant increase (+9.9%) of upland rice root length in dry land as compared with that in the irrigated condition (Table 1, Student's t-test, P = 0.002, Additional file 7), indicating an adaptive feature of upland rice. Specifically, 61 percent of the upland accessions showed increase of root length in the upland condition, while only about 38 percent of the irrigated accessions showed root length increase in the upland condition. Moreover, for the number of empty grain per panicle, we observed a slight and nonsignificant reduction of it for the irrigated accessions in the upland condition (Student's t-test, P = 0.29, Additional file 7) but a dramatic decrease of it for the upland accessions (Student's t-test, P = 5.9e-07, Additional file 7), suggesting another adaptive trait of upland rice. Based on the phenotypic differentiation between the upland and irrigated accessions and the correlations between the non-yield phenotypes and yield phenotypes in the upland condition, we define 'upland adaptation' in this study as the characteristic non-yield phenotypes of upland rice that are associated with the yield performance in upland conditions, such as plant height, tiller number, root weight and maximal root length which are directly associated with drought resistance.

SNPs calling and annotation for the 166 accessions
For the 82 irrigated and 84 upland accessions, we obtained 402,052,644 pair-end reads that passed the reads Pass Filter, amounting to around 8 G base pairs (NCBI accession code SRA066116). The re-sequencing data for 84 upland and 82 irrigated accessions was reported in our previous work [19]. Here we comprehensively analyze these re-sequencing data to determine the genetic components that might have caused the upland adaptation and phenotypic differentiation between the two ecotypes, which has been a long-term open question in rice study.
About 87 percent of these reads could be aligned to the reference genome (IRGSP/RAP build 5 Nipponbare genome) using SOAP-2.20 software [20]. The average mapping ratios of reads for the type indica and type japonica are 81% and 92%, respectively. On average, 29.54% of the reference genome was covered for each accession (ranging from 18.65% to 46.58%). The mean depth for each accession was about 0.50 × on average (ranging from 0.27 × to 0.96 ×) (see Additional file 8). In total, the genome coverage of the upland and irrigated population reached 41 × and 43 ×, respectively. After SNP calling and a series of filtering for quality control (see Methods), 3,571,104 high-quality SNPs were identified (Additional file 9 includes all the SNPs), of which 3,563,681 are non-singleton SNPs. By randomly choosing 56 SNPs and genotyping them via Sanger sequencing, we were able to validate our SNP calling accuracy at 96.4%. The sequencing data, though relatively low in depth, are enough to well satisfy our analyses, as we shall discuss below.
In the following analyses, only 3,029,822 SNPs with genotype data covering more than 10 individuals in both populations were used. We annotated these SNPs by classifying them into different categories according to the genomic context (see Additional file 10). The nonsynonymous-to-synonymous ratio of SNPs is 1.32, consistent with the previously reported ratio of 1.29 in rice [21]. SNPs causing radical mutations, such as premature stop codons or frame disruptions, were also identified (Additional file 10).
Phylogenetics and population genetics analysis for upland rice SNP genotypes were used in phylogenetic analysis for the 166 accessions. Additionally, 25 wild accessions from our previous data (10 O. nivara and 15 O. rufipogon, Additional file 11) were added to help address the phylogenetics of upland accessions [21]. We constructed the accessions' phylogenetic tree using a Neighbor-Joining approach with 100 bootstraps (Figure 1). Consistent with some of the previous reports [21], our results support that Indica and Japonica usually considered as two subspecies are very different groups, as they largely clustered in respective group and both have closely related wild accessions (Figure 1, also see Additional file 12). The results also show that probably the majority of current upland accessions belong to the type japonica (in our sample 76% upland accessions are japonica) (Figure 1, Additional file 12), consistent with the classification of the collection center based on traditional phenotypic characteristics and previous reports that upland rice mainly pertains to Japonica [22]. The phylogenetic tree showed that all the upland japonica accessions cluster together, suggesting that upland japonica likely have a single origin. However, for the type indica, the scattering pattern of upland indica indicates that upland indica accessions may Figure 1 Phylogenetic tree of rice accessions. Green, black and orange branches refer to upland, irrigated and wild accessions respectively. Analysis showed that differentiation between Indica and Japonica has existed within the wild population, since there are strains of wild rice close to both Indica and Japonica respectively, supporting the double domestication model. The tree shows multiple origins for upland rice, though the upland japonicas may bear a single origin. Bootstrap values are indicated in some of the major internal nodes. Some of the leaf nodes are labeled with the sample number of the rice accessions. 'ru' refers to 'rufipogon', and 'ni' refers to 'nivara'.
have been separately bred multiple times using different irrigated rice as the basic genetic materials.

Population structure in upland rice and admixture in upland indica
We use 24,890 SNPs with genotype data covering more than 100 accessions to analyze the population structure within our samples (see Methods). Principle component analysis (PCA) summarizes the population structure for the entire population (Figure 2a, b). The top four principle components explain about 65% of the genetic variation. The huge differentiation on principle component 1 distinguishes Indica from Japonica. By contrast, genetic differentiations between upland and irrigated ecotypes seem to be much lower. The F ST value for the two ecotypes calculated based on genome-wide SNPs is only 0.06. Even in the type japonica, where the upland accessions approximate monophyly and the ecotype differentiation is clearcut (Figure 1, Figure 2a), the upland and irrigated ecotypes still have a limited genetic differentiation (F ST = 0.13), much lower than the reported differentiation between type indica and type japonica (F ST = 0.55) [23].
We also investigated the population structure using FRAPPE to estimate the ancestry and population admixture of each accessions [24]. By increasing K (the number of populations) from 2 to 5, the population structure of upland and irrigated accessions are shown (Figure 2c). For K = 2, we see a differentiation between type indica and type japonica, and this differentiation also exists for Oryza rufipogon and Oryza nivara in the wild population. When K = 3, the wild rice and the cultivars become separated. When K = 4, the division between irrigated japonica and upland japonica emerge, while the irrigated indica and upland indica still share the same structure, indicating the ecotype differentiation is higher within the type japonica than in the type indica. When K = 5, a new subgroup in upland japonica appears, suggesting some accessions in upland japonica forms a unique subgroup which is differentiated from the others in it.
From the structure analysis, we found that some upland indica accessions, though primarily classified as type indica according to their phenotypes, turn out to be quite complicated in terms of their genomic composition. For example, six accessions (GS190, GS192, GS199, GS200, GS201, GS202) (enclosed in the red rectangular in Figure 2c), previously classified as upland indica according to their phenotypes (e.g., the seed shape, see Additional file 13), turn out to be more close to upland japonica (Figure 2c), indicating that for the upland ecotype, seed shape phenotype could be misleading in distinguishing type indica from type japonica. Besides the six accessions, GS215 and GS233 also have a large genomic proportion close to upland japonica(indicated with blue arrow in Figure 2c), which makes them cluster with japonica in the phylogenetic tree ( Figure 1, Additional file 12). And upland indica GS224 (Dular), indicated with a red arrow, has a large proportion of wild rice ancestry, and correspondingly it clusters with the wild rice in the phylogenetic tree ( Figure 1, Additional file 12). These results suggest that current upland indica group is a admixture, containing accessions of different origins. They are classified as belong to indica probably because some of their traits are more close to type indica than type japonica. Besides, the 19 upland indica accessions which remain in the indica group in the tree also seem to be polyphyletic, indicating they might originate multiple times and their genetic mechanisms for upland adaptation might be heterogenous. On the contrary, upland japonica accessions that form a monophyletic group are more likely to have similar mechanisms for upland adaption. Considering the well-known hybridization between indica and japonica, it's probable that upland japonica and some upland indica share similar adaptation genes, while the rest upland indica might be different. Therefore, in the following study on the ecotype differentiated genes, we first use a large sample size by comparing the merged upland and irrigated population to identify genes that show strongest signals and might be shared between upland japonica and some upland indica accessions. Then we mainly focus on the type japonica to find the japonica-specific upland adaptation genes. More data and other experiments would be needed to comprehensively address the adaptation mechanism in the complicated indica group.

Ecotype differentiated genomic regions (EDRs) in genomes
EDRs probably contain genes that account for the adaptive differentiation. Using a sliding window approach to screen regions with both the top 5‰ F ST and crosspopulation composite likelihood ratio test (XP-CLR) scores (Methods), EDRs were identified with selective signals to narrow down the genes underlying the phenotypic differentiation between upland and irrigated rice (Figure 3). After merging the neighboring EDRs, we eventually identified 74 EDRs in the genome with a median length of 22.7 kb that were differentiated between the upland and irrigated rice (Additional file 14). Within these EDR regions, there are a total of 8980 SNPs (Additional file 15). Compared with the whole genome SNPs used in our analysis, the SNPs in EDRs are significantly enriched in genic regions (Chi-square test, P < 2.2e-16) (see Additional file 16). Among the EDR SNPs, 2623 have allele frequency differences larger than 60% between ecotypes, which we subsequently labeled "ecotype differentiated SNPs" (EDSs) (see Additional file 17).

Ecotype differentiated genes (EDGs) in EDRs
There are 154 genes in the EDRs, which we called EDGs (see Additional file 18). For these EDGs, 74 genes have EDSs in the gene regions and the other 53 genes have EDSs located within 1 kb upstream or downstream regions. By gene enrichment analysis, we found that the EDGs are enriched in several classes ( Table 2, Additional   file 19), such as lipase containing GDSL domain class (P = 7.49e-05), peroxidases (P = 1.80e-03), glutathionerelated classes (P = 5.44e-08), and auxin signaling pathway (P = 3.80e-06). (c), the population structure reported by FRAPPE. Results were shown from K = 2 to K = 5. When K = 2, there is only division of Indica and Japonica. Then wild rice is separated when K = 3. When K = 4, the ecotype differentiation in Japonica emerges, i.e. upland japonica begins to differ from irrigated japonica. Our results show that upland-irrigated ecotype differentiation is genetically lower than the Indica-Japonica differentiation. The red rectangle indicates the six upland accessions, which were previously classified as upland indica accessions but are genetically more close to the upland japonica group. The blue arrows in K = 5 panel indicates the two upland indica, GS215 and GS233, which have a large genomic proportion close to upland japonica. The red arrow indicates upland indica, GS224, which is close to wild rice.
The GDSL domain-containing lipase has been reported to be involved in response to abiotic stress, such as drought [25][26][27], indicating that the genetic differentiation of these genes in upland rice may be relevant to their adaption to dry land. Peroxidases and glutathionerelated proteins bear antioxidant activity, another crucial factor in plant adaptation to a drought-like environment [6,7]. Furthermore, auxin, the famous growth-promoting hormone, plays an important role in determining plant architecture, and has been proved to control the formation of adventitious roots [8,[28][29][30]. Accordingly, differences in auxin signaling between upland and irrigated rice could potentially contribute to their different architectures and root systems. Additionally, we also found several individual EDGs of interest that might also be critical in upland adaptation or the phenotypic differentiation between the two ecotypes. Os07g0449700, a type-A response regulator, is of special interest (Figure 3c, d). Type-A response regulators belong to ARR gene family induced by cytokinin and play important roles in cytokinin signaling [31]. Quite a few pieces of evidence have shown that ARR genes impact root and shoot development and apparent functional overlap exists among type-A ARRs [31][32][33][34]. Moreover, the relevance of type-A ARR gene to the root and shoot phenotypes that are differentiated between upland and irrigated rice has also received experimental support; over-expression of a type-A response regulator in transgenic rice has been shown to cause dwarfness and poorly developed root systems [32]. Hence the ecotype differentiation on this type-A ARR gene, Os07g0449700, may be able to explain certain differences of root and plant height phenotypes between upland and irrigated rice. Further reciprocal transgenic experiments between upland and irrigated rice will eventually address the contribution of Os07g0449700 in the adaptation of upland rice.
Other EDGs were also implied by direct or indirect evidence to be relevant to upland adaptation. Os12g0597700 (Figure 3e, f) has a WRKY DNA-binding domain and WRKY genes have been well established to be important elements in drought tolerance [35][36][37][38]. Os12g0597000 encodes a SOS3-like calcium-binding protein 1 and its homologs in Arabidopsis are related to responses to water deprivation [39]. Os11g0474600 encodes a terpenoid synthase, and terpenoid plant hormones have been shown to inhibit rice tillering [40]. This gene may, therefore, explain the difference in tiller numbers between upland and irrigated rice. Besides, several pentatricopeptide repeat protein-coding genes (Os01g0646000, Os06g0184866, Os0 6g0185800) and two serine/threonine protein kinase genes  (Os04g0616400, Os04g0616600) may also be related to drought response as these classes of genes have been reported to be involved in responses to drought/osmotic stress and root development [38,41]. These results illustrate how these 154 EDGs may represent a valuable candidate gene list for investigating the phenotypic differentiation of the two ecotypes and upland rice adaptation.

EDGs selected in individual populations
The investigated EDGs indicate selection during the differentiation of the two ecotypes. To further pinpoint in which population the selection occurred, we used Nielsen's CLR test to identify possible selective sweeps in a single population (see Methods). We identified 661 selective sweep regions in the upland population with a median length of 27 kb, and 351 selective sweep regions in the irrigated population with a median length of 41 kb (Additional file 20). In the selected regions of these upland and irrigated populations, there are 2542 and 1285 genes, respectively.
Upland rice then has twice as many genomic regions/genes selected as irrigated rice, indicating that upland rice varieties probably suffered more selection pressure imposed by adaptation or breeding. By overlapping the 154 EDGs with these potentially selected genes, we found that almost all of these selected EDGs (150/154) could be reconfirmed as selected genes using Nielsen's CLR approach. Among these, 60 were potentially specifically selected in upland rice, 19 were potentially specifically selected in the irrigated population, while 71 were likely selected in both populations ( Figure 4, Additional file 18).

Japonica-specific and Indica-specific EDGs
Genetic architectures of complex traits in rice are often both rich and complicated since they are usually heterogeneous across the subspecies [42]. In light of this, we were also interested in the ecotype-differentiated genes within the type indica or type japonica. As Figure 1 depicts, upland indica have multiple origins, and thus may be genetically heterogeneous in terms of upland adaptation. When investigating the type-specific EDRs, we accordingly mainly focused on the type japonica. We carried out a similar pipeline to screen the EDRs in type japonica (Additional file 21) and found 52 EDRs with a median length of 24 kb (see Additional file 22). In these EDRs, 169 EDGs were identified (see Additional file 23), among which 47 EDGs are overlapped with the previous EDG list from the total population while the remaining 122 are japonica-specific. For these japonicaspecific EDGs, the significantly enriched gene families include mitogen-activated protein kinase (P = 1.37e-14),  WRKY genes (P = 0.03) and thioredoxin coding genes (P = 0.04) ( Table 2). Previous studies showed that mitogen-activated protein kinase plays an important role in response to biotic stress and root development [11,43]; one japonica-specific EDG of this class is very similar to blast induced kinase, which may then explain why upland rice are generally more blast tolerant than irrigated rice. Moreover, as mentioned previously, WRKY genes and thioredoxin may be critical for drought response. Intriguingly, among our list we found Os09g0410500, a homolog to the famous maize-domestication gene, teosinte-branched 1(tb1), indicating this gene may also be related to the different tillering abilities displayed by the upland and irrigated types. For the type indica, because of the admixture of upland indica only 19 upland indica accessions that remain to cluster in the indica group could be used for analysis. But the F ST between upland and irrigated accessions in indica is quite lower (0.06) than that in japonica (0.13), which means that a much larger sample would be needed to detect the ecotype differentiation in indica. However, because of the small population size of upland indica accessions in the world it is difficult to collect enough samples so far. Analysis using only the 19 upland indica accessions yield a list of 56 genes identified by intersecting F ST and XP-CLR top signals (Additional file 24, Additional file 25). Hence, it is not surprising that none of the 56 genes were overlapped with the previous EDG list from the total population, considering the small sample size and possibly heterogeneous adaptation mechanisms in indica. However, we noticed that nine and four of the genes from the indica XP-CLR top signals overlap with the previous EDGs of the merged populations and EDGs of the japonica population, respectively (Additional file 26, Additional file 27), indicating that some genes related to upland ecotype might be shared among some indica accessions and japonica but are more difficult to detect because of the limited and probable heterogenous samples. More experiments and data are needed to comprehensively address the indica-specific upland adaptation genes.

Discussion
Largely due to the popularity and fecundity of irrigated rice, there has been limited inquiry about the phenotypes that distinguish upland rice. Some work in this area has been done by Chang et al., who investigated the rate of leaf production, tillering production, plant height, root system and resistance to drought for 11 upland varieties as explanatory factors in the different phenotypes [3]. Likewise, Ono pointed out that Japanese upland varieties are tall and low tillering with broad leaves, long panicles, drought and blast resistant, and have a less efficient utilization of fertilizer [44]. Despite interesting findings, these studies were based on only a few varieties of upland rice. To our knowledge, the sample size of the phenotypic comparisons between upland and irrigated rice in this study represents a very large sample of upland varieties to date and the sample covers most distribution regions of upland rice over the world, providing solid evidence supporting Chang et. al's observations [3]. Our phenotyping also observed some new features, e.g., the phenomenon that dry land culture increases main root length of upland rice but decreases that of irrigated rice indicates a remarkable superiority of upland rice to adapt to an upland environment.
Beyond our thorough phenotypic analysis, this study, to our knowledge, also provides the first genomic perspective as to the phylogenetics and population structure of current upland rice. Consistent with our classification based on phenotypic characteristics and Glaszmann's report which indicated most upland cultivars belong to the type japonica [22], phylogenetic analysis revealed that a EDGs selected in upland rice EDGs selected in irrigated rice EDGs selected in both ecotypes unassignable EDGs Figure 4 EDGs population assignment. Using CLR, 154 EDGs were assigned to specific populations. In total, 60 genes were selected in upland rice, 19 genes in irrigated rice, and 71 genes were selected as probably being in both populations, while four were not able to be assigned to a specific population.
large proportion of upland accessions belong to the type japonica, which form a monophyly. The population structure inference reveals much less genetic differentiation between the two ecotypes than exists between the type indica-japonica, which could be due to either much more frequent gene flow between ecotypes than that between Indica and Japonica, or recent origin of upland rice.
Though the sequencing depth for each accession used in our study is relatively low, the data set itself is informative enough to address the questions we wanted to address about upland rice adaptation. Firstly, the SNPs identified have a high accuracy (96.4%), even taking into account the low coverage. Secondly, the average sequencing depth for each site in the upland or irrigated population is more than 40×, providing a sufficient sample size for allele frequency estimation. Likewise, only SNP sites with the genotype determined in more than 10 individuals within each population were used for analysis such as calculating allele frequencies, and only SNP sites with no less than 100 individuals' genotypes were used for estimating population structure. Thirdly, though the genome coverage rate for each accession is not high, for each pair-wise comparison there are enough informative sites for the distance between any two accessions to be precisely computed (see Additional file 28). Thus, as some previous studies have shown, we presented an interesting example substantiating the feasibility of genetic analysis with low coverage population genomic data [45,46].
Except for some QTL mapping efforts, very few reports have touched upon the genetic mechanisms that underlie the adaptive phenotypes of upland rice [47,48]. In this study, we identified some interesting candidate genes or gene families that likely contribute to the upland adaptive traits or phenotypic differentiation between the upland and irrigated ecotypes, not only providing exciting new evidence useful in harnessing the genetic adaptations of upland rice but also demonstrating the efficiency of our methodology. Though pending more experimental evidence to validate the precise phenotypic effects of all the identified alleles, a limited number of genes (around 150) provided a highly feasible list for functional investigation as well as for the upland breeding. Crucially, we would like to stress that our approach most likely caught most of the major genomic regions that account for ecotype differentiation and upland rice adaptation including those specific regions in type japonica, but identifying some indica accession specific adaptation genes still need further work and additional data.

Conclusions
In summary, we surveyed the phenotypes across a large sample of upland rice in both dry land and irrigated conditions, and by comparing them to irrigated rice in similar conditions we were able to demonstrate some significant phenotypic differentiations between the two ecotypes. Using high-throughput re-sequencing of a large batch of rice accessions, for the first time, to our knowledge, we clarified a single-origin pattern of upland japonica rice but a novel multiple origin pattern in upland indica accessions. With comprehensive population genetic analysis, we identified the ecotype differentiated regions and important selection genes during the differentiation between the upland and irrigated ecotypes, and more poignantly, provided great insights into upland rice adaption and the phenotypic differentiation between the two ecotypes. These findings have important implications in the spread of upland rice varieties in breeding. For example, the genes that likely account for drought resistance can be used to breed highly drought resistant varieties from the existing high-yield irrigated varieties, and concurrently, the genes likely related to tillering ability can be used to enhance the tillering ability of upland rice, a major factor limiting its yield and subsequently its widespread usage.

Investigating phenotypes of upland accessions
To assess the reported phenotypes of upland and irrigated rice, we grew both varieties. Due to the limited amount of seeds for some accessions, only 131 accessions were phenotyped (Additional file 3). Phenotyping was performed both in irrigated and dry land conditions (i.e., preventing submergence in water to simulate the rainfed upland environment). For the irrigated condition, seeds were germinated and seedlings then transplanted to a paddy field with submergence in water for the whole growth and development stage. For the rainfed upland condition, we conducted direct-seeding. To fully simulate the rainfed condition, no irrigation was conducted for the whole growth stage. When the rain came, we released the water to prevent submergence. According to our records, several moderate drought events occurred and caused a few individual plants death in the upland condition. For each accession, we planted 12 individuals in two rows (6 individuals in each row) with the row spacing of 30 centimeters and plant spacing of 20 centimeters. Before the earing period, we counted the tillering numbers. After the plants entered maturity, we surveyed the root weight, main root length, plant height and the yield traits. For each accession, about five individuals were randomly selected and phenotyped (Additional file 4). Because in the rainfed upland condition the percentage of earbearing tiller is close to 100%, in this study tiller number and productive panicle number in the upland condition share the same result.
The re-sequencing data The re-sequencing data for the 84 upland and 82 irrigated accessions was reported by our previous work [19]. In our previous study, these data were simply used to test that a selected region containing a key enzyme gene of ABA synthesis was not due to demographic reason. They have neither been used to investigate the population structure for upland rice, nor been used to study the population differentiated regions/genes between the two ecotypes. Here we use these re-sequencing data to further exploit the genetic components that underlie the upland adaptation and phenotypic differentiation between the two ecotypes. The 84 upland and 82 irrigated accessions were obtained from the Yunnan Academy of Agricultural Sciences (YAAS), Yunnan, China. The ecotype information was documented by the genetic resources center of YAAS. Both indica and japonica were selected for either the irrigated or upland groups (Additional file 2). For the irrigated ecotype, we included 42 indica accessions and 40 japonica accessions. For the upland ecotype, we selected 28 indica and 56 japonica accessions since in our genetic resources center most upland rice accessions belong to the type japonica (Additional file 2). The origins of the accessions cover a wide geographic range (Additional file 1) including most major regions of upland rice distribution (South Asia, Southeast Asia, South America, West Africa, etc.). All seeds were sown under irrigated conditions. When the seedlings turned to trefoil-stage, genomic DNA was extracted from the leaves using the CTAB method [49].

Reads mapping
Our previously obtained re-sequencing data of these upland and irrigated accessions were deposited in the NCBI Short Read Archive under the accession number SRA066116 [19]. The raw reads of the wild rice came from our previous work (NCBI Short Read Archive accession number SRA023116) [21]. The IRGSP 5.0 Nipponbare genome was downloaded from the RAP-DB database (http://rapdblegacy.dna.affrc.go. jp/download/latest/IRGSPb5.fa.masked.gz). We mapped the short reads to the Nipponbare genome using SOAP for each accession separately [20].

Calling SNPs and SNP annotation
After mapping the reads, we used SOAPsnp -v1.02 to call SNPs for the whole population of the irrigated and upland accessions [50]. A series of filtering was conducted to ensure the quality of the SNPs. For example, the SNPs with a quality value less than 15 or with a nearby copy number > 1.5 were removed. Also we eliminated SNPs sites with depths less than 6 or larger than 300 because sites with too low a depth are prone to sequencing errors while sites with too high depths are likely located within repeat or duplication regions [51]. In order to confirm the accuracy of SNP calling, we randomly selected 60 SNPs, amplified the sequences around them with PCR and subjected them to Sanger sequencing. In our experiment, 56 of the 60 pairs of primers worked. Afterward, genomic annotation was used to assign these SNPs into different categories related to transcription and translation, including genic regions (CDS, introns, and UTRs), promoter regions and intergenic regions using our in-house PERL scripts. SNPs that resulted in premature stop codons, disrupted start/stop codons or splice donor/acceptor sites were defined as radical SNPs.

Determining SNP genotypes for each accession
For each accession, we extracted the SNP genotype information from the output of SOAPsnp. Again, we conducted a series of quality control procedures to ensure the reliability of the genotype. The consensus files contained information including the best base genotype and its average quality score, the second best base genotype and its average quality score, and the consensus genotype and its quality scores for each site, which can be used as filters [50]. We used very stringent criterion to ensure that only genotypes with a quality score > =20 (with error rate no more than 1%) were taken into consideration. More explicitly, if the average quality score of the best base is > =20 while the average quality score of the second best base is < 20, then that genotype is the best base. If both the average quality score of the best and second best base are > =20, then the genotype is heterozygous. If both the average quality score of the best and second best base are < 20 but the quality score of the consensus genotype is > =20, then the consensus genotype is used. On average 67% of the SNP genotypes were missing for each accession due to the relatively low coverage of our re-sequencing data. However, this missing data will likely not undermine our analyses because the phylogenetic analysis is based on pair-wise distances averaged over many sites and the population genetics analyses are based on allele frequencies. As such, the data are sufficient to estimate the pair-wise distances and allele frequencies. Besides the 166 accessions we resequenced, the genotypes of the SNPs for the wild accessions were determined using the same method.

Constructing the phylogenetic tree
Incorporating the genotype information of wild rice accessions, we use the SNP genotypes of each accession to calculate the pair-wise distances between any two accessions using PHYLIP [52]. We used default parameters to calculate the distances, i.e. using F84 model, a transition/transversion ratio of 2.0 and no Gamma distributed rates across sites. We assumed a uniform mutation rate across the genome and thus gave same weight for different SNPs. Bootstrap with 100 replicates was conducted for the whole genome SNP data set. After the genetic distances were calculated for the 100 replicate data sets, phylogenetic trees were constructed using the Neighbor-Joining (NJ) approach based on the genetic distances [52]. The consensus tree was then built using the built-in 'consense' program of PHYLIP with the parameters, 'strict majority rule' and 'trees to be treated as unrooted'.

Population structure inference
We first conducted a PCA analysis using the procedure as reported [53]. 24890 SNPs with genotype data covering more than 100 accessions was used in the analysis. Moreover, the FRAPPE software [24], which used a frequentist approach to estimate individual admixture, was used to examine the population structure as well. The maximum iteration of EM to run is set as 10000.
Calculating F ST The allele frequencies of SNP alleles were determined based on the accessions' genotypes. SNPs with less than 10 individuals' genotype information in one population were discarded when calculating the allele frequencies for that particular population. Based on the allele frequencies, we calculated the 'F ST ' using the method described by Nei [54]. In the whole-genome scan, 20-kb sliding windows were used with 2-kb sliding step. By averaging the F ST values over SNP sites in each window, F ST values for the windows were obtained.

Scanning selective sweep using composite likelihood approaches
Cross-population composite likelihood ratio test (XP-CLR) and Nielsen's composite likelihood ratio test (CLR) are two approaches for identifying selective sweep signals using SNP data sets [17,18]. XP-CLR incorporates the allele frequency differentiation patterns between two populations and multilocus frequency spectrum to identify selective sweeps occurring in population differentiation. The Nielsen's CLR method uses the distribution pattern of allele frequency spectrum in a particular population to search for selective sweep signals in that population. In our analysis, we combined XP-CLR and Nielsen's CLR using the software XP-CLR and SweepFinder, respectively. In the XP-CLR analysis, we used the allele frequencies of both upland and irrigated populations, calculating XP-CLR scores and picking the top 5‰ regions. When calculating the XP-CLR scores, we used a window size of 0.1 cM, a 2-kb grid size and a maximum SNP number of 150 for each window. By overlapping these regions of top XP-CLR scores with those of top 5‰ F ST , we obtained a conservative set of ecotype differentiation regions (EDRs) between upland and irrigated rice and the ecotype differentiated genes (EDGs) within these regions. In Nielsen's CLR analysis, we used allele frequencies of a single population (upland or irrigated population) and found the top 5% grid points with the highest likelihood ratio (LR) values in the genome. Genes within 10 kb of these grid points were considered as the potential selection targets in either rice population [55]. Fine-scale grid points (a grid point at every 1000 nucleotides) were used when calculating the LR values.

Gene enrichment analysis for the EDGs
Gene enrichment analysis was done using chi-square test. For a specific category of genes, we calculated the gene number in the category within the whole genome using the rice gene annotation information [56]. For example, for the gene category "lipase-containing GDSL domain", we counted the number of lipase genes annotated with this domain in the rice gene annotation database [56]. The number of genes in this category was also counted in the EDGs. A contingency table was then constructed based on the number of interesting genes in the EDGs, the total gene number of EDGs, the number of interesting genes in the whole genome, and the total gene number in the whole genome. We then conducted the Chi-square test and obtained P-values.