Genome-wide association studies for agronomical traits in a world wide spring barley collection

Background Genome-wide association studies (GWAS) based on linkage disequilibrium (LD) provide a promising tool for the detection and fine mapping of quantitative trait loci (QTL) underlying complex agronomic traits. In this study we explored the genetic basis of variation for the traits heading date, plant height, thousand grain weight, starch content and crude protein content in a diverse collection of 224 spring barleys of worldwide origin. The whole panel was genotyped with a customized oligonucleotide pool assay containing 1536 SNPs using Illumina's GoldenGate technology resulting in 957 successful SNPs covering all chromosomes. The morphological trait "row type" (two-rowed spike vs. six-rowed spike) was used to confirm the high level of selectivity and sensitivity of the approach. This study describes the detection of QTL for the above mentioned agronomic traits by GWAS. Results Population structure in the panel was investigated by various methods and six subgroups that are mainly based on their spike morphology and region of origin. We explored the patterns of linkage disequilibrium (LD) among the whole panel for all seven barley chromosomes. Average LD was observed to decay below a critical level (r2-value 0.2) within a map distance of 5-10 cM. Phenotypic variation within the panel was reasonably large for all the traits. The heritabilities calculated for each trait over multi-environment experiments ranged between 0.90-0.95. Different statistical models were tested to control spurious LD caused by population structure and to calculate the P-value of marker-trait associations. Using a mixed linear model with kinship for controlling spurious LD effects, we found a total of 171 significant marker trait associations, which delineate into 107 QTL regions. Across all traits these can be grouped into 57 novel QTL and 50 QTL that are congruent with previously mapped QTL positions. Conclusions Our results demonstrate that the described diverse barley panel can be efficiently used for GWAS of various quantitative traits, provided that population structure is appropriately taken into account. The observed significant marker trait associations provide a refined insight into the genetic architecture of important agronomic traits in barley. However, individual QTL account only for a small portion of phenotypic variation, which may be due to insufficient marker coverage and/or the elimination of rare alleles prior to analysis. The fact that the combined SNP effects fall short of explaining the complete phenotypic variance may support the hypothesis that the expression of a quantitative trait is caused by a large number of very small effects that escape detection. Notwithstanding these limitations, the integration of GWAS with biparental linkage mapping and an ever increasing body of genomic sequence information will facilitate the systematic isolation of agronomically important genes and subsequent analysis of their allelic diversity.


Background
Determining the genetic basis of agronomic traits has been one of the major scientific challenges in the process of crop improvement . Most of the agronomically important traits are quantitative, resulting in greater difficulty for discerning genetic differences underlying the phenotype of interest. Currently, linkage mapping (analysis) is the most common approach in plants to detect quantitative trait loci (QTL) corresponding to complex traits. In linkage mapping, linkage disequilibrium (LD) is generated by establishing a population from a cross between two parental lines. The co-segregation of alleles of mapped marker loci and phenotypic traits allows the identification of linked markers. Due to the restricted number of meiotic events that are captured in a biparental mapping population, the genetic resolution of QTL maps often remains confined, to a range of 10-30 cM [1,2]. Moreover, linkage analysis can only sample a small fraction of all possible alleles in a population from which the parents originated.
An alternative approach, association mapping (AM) known as LD mapping relies on existing natural populations or designed populations of plants to overcome the constraints inherent to linkage mapping. LD mapping exploits ancestral recombination events that occurred in the population and takes into account all major alleles present in the population to identify significant markerphenotype associations. LD mapping was first introduced in genetic mapping studies in humans [3,4] and has been recently considered for plant research. By exploiting nonrandom associations of alleles at nearby loci (LD), it is possible to scoop out significantly associated genomic regions with a set of mapped markers. Success of mapping depends on the quality of phenotypic data, population size and the degree of LD present in a population [5,6]. In general, the power of association studies depends on the degree of LD between genotyped markers and the functional polymorphisms. The decay of LD varies greatly i) between species [7], ii) among different populations within one species and iii) also among different loci within a given genome [8,9].
LD mapping is based on two strategies: i) re-sequencing of selected candidate genes and ii) genome-wide association which exploits marker polymorphisms across all chromosomes [10]. Genome-wide association studies (GWAS) have become increasingly popular and powerful over the last few years in human and animal genetics. The emergence of more cost-effective, high-throughput genotyping platforms have rendered AM an increasingly attractive approach for QTL mapping in plants [11]. In the last few years, an increasing number of association studies based on the analysis of candidate genes have been published (reviewed in [7]). These include e.g. the Dwarf8 [12] and the phytoene synthase locus in maize [13], flowering time genes in barley [14], the PsyI-AI locus in wheat [15], the rhg-1 gene in soybean [16]; and a series of candidate genes in Arabidopsis [17,18].
Barley (Hordeum vulgare L.) was domesticated in the Fertile Crescent about 10,000 years ago [19][20][21]. Today barley is the fourth most important cereal crop after wheat, rice and maize. In addition to its agricultural importance, the barley genome is considered as a model for other crop species of the Triticeae tribe including wheat and rye [22,23]. In this regard an ever increasing repertoire of marker and sequence resources has been developed for barley which can be efficiently utilized [24][25][26]. Over the last few years candidate gene based AM studies were reported for barley [9,14,27]. GWAS with dense marker coverage are not yet conducted routinely for barley, albeit the potential of this approach has been demonstrated in some pilot studies [28][29][30].
Inbreeding crops such as barley are characterized by a high level of population structure caused by the impact of non random mating and subsequent selection. This is exemplified by two-rowed and six-rowed barley cultivars which form distinct subpopulations, because the corresponding breeding programs rely on different progenitors. The same applies to the subpopulations of spring and winter barley [31]. There are higher chances of occurrence of type I and type II errors in AM than in biparental QTL analysis due to the confounding effect of population structure in the panel [2,[32][33][34] Specific statistical approaches have been proposed to account for population structure in AM [35]. Yu et al. [36] described a mixed-linear model (MLM) approach which performs better than previous models [37]. Still these models have their individual shortcomings and care needs to be taken in controlling for population structure and balancing the rate of false positives and false negatives in the analysis.
In the present study, our main objective was to map genetic polymorphisms underlying complex agronomic traits such as heading date (HD), plant height (PHT), thousand grain weight (TGW), starch content (SC) and crude protein content (CPC) in spring barley using GWAS. We studied a diverse spring barley collection comprising 224 accessions from 52 countries previously described by Haseneyer et al. [38]. We provide a comprehensive overview on population structure and genetic diversity as well as their effects on GWAS. To study the dynamics of LD across the seven barley chromosomes we investigated the patterns of LD decay. Finally, we identify and locate a substantial number of known and novel QTL for the traits investigated.

Association mapping panel
The association mapping panel consists of 224 spring barley accessions selected from the Barley Core Collection (BCC) [39] and the barley Genebank collection maintained at the IPK Genebank Gatersleben, Germany. The panel comprises 96 two-rowed and 128 six-rowed genotypes, and among them 109 accessions originate from Europe (EU), 45 from West Asia and North Africa (WANA), 40 from East Asia (EA) and 30 from the Americas (AM). Most of the accessions are improved cultivars (149), some accessions are landraces (57) or breeder's lines (18). Further information on the germplasm can be obtained from the European Barley Database (EBDB, http://barley.ipk-gatersleben.de/ebdb.php3). This panel has been considered and described in detail by Haseneyer et al. [38]. Each accession has been singleseed descended, selfed for two generations under greenhouse conditions and subsequently propagated in the field.

Phenotypic evaluation
The accessions were planted in a 25 × 15 lattice design with three replications in the years 2004 and 2005 at the following locations: Stuttgart (Southwest Germany), Irlbach (Southeast Germany) and Wohlde (Northern Germany). Heading date (HD) and plant height (PHT) were scored in field plots. Thousand grain weight (TGW) was estimated from sampled grains per plot. Starch content (SC) and crude protein content (CPC) were estimated using a near infrared reflectance spectrometer (NIRS) from ground seed samples from all environments. In order to convert the nitrogen content to crude protein values, we considered a factor of 6.25. We followed the methods described in Naumann and Bassler [40] to estimate the starch content and nitrogen content. Phenotypic data were analyzed using REML (Residual Maximum Likelihood) implemented in GenStat 9 software [41]. Variance components were calculated by fitting a mixed linear model (MLM) to multi-environment data. Heritabilities were estimated for all traits considering the percentages of genotypic variance, over the total phenotypic variance including genotype (G) by environment (E) variance and error variance components. Phenotypic mean BLUEs (Best Linear Unbiased Estimates) were estimated taking into account the GxE variance and were used for association studies. Further information on phenotypic data can be obtained from Haseneyer et al. [38].
Genome-wide marker profiling DNA for SNP genotyping was extracted for each accession from bulked leaf samples of eight 2-weeks old seedlings. A customized oligonucleotide pool assay (IPK-OPA, unpubl) containing 1536 allele specific oligos was used to genotype the panel by Illumina's GoldenGate technology (Illumina, San Diego, CA). The IPK-OPA has been mainly built on a selection of markers from two pilot assays (pOPA1, pOPA2) that are polymorphic between the two barley cultivars 'Barke' and, 'Morex'.
More than 95% of the 1536 SNP markers of the IPK-OPA have been included in a barley consensus map [26]. The SNP genotyping was performed at University of California (Southern California Genotyping Consortium, UCLA) following the protocol of Fan et al. [42,43]. More details about the successful SNP markers considered for GWAS are available as supplemental information (Additional file 1: Table S1).
Scoring SNP data was done using the Illumina Beadstudio package (Genotyping module 3.2.32; Genome viewer 3.2.9; Illumina, San Diego, CA) that can process the raw hybridization intensity data and thereby cluster the data. The normalization procedure implemented in the Beadstudio genotyping module includes outlier removal, background correction and scaling. The algorithm included uses a Bayesian model to assign normalized intensity values to one of the three possible homozygous and heterozygous genotype clusters. Stringent threshold scores (Call Rate > 0.9 and GenTrain Score > 0.7) were used to identify ambiguous results. SNPs that failed to show twogroup clustering were strictly excluded from the analysis. From a total of 1536 SNP markers, 985 markers yielded good quality genotypic calls. Among the 985 successful SNP markers only 957 markers are genetically mapped and we used these 957 markers for our analysis (Additional file 1: Table S1). Among the 224 accessions in the panel of genotypes, 12 genotypes performed badly in the assay (Additional file 2: Table S2). For these 12 genotypes more than 90% of the SNP markers data is missing, hence were excluded from subsequent analysis.

Genotypic data analysis and population structure
Polymorphic information content (PIC) values were calculated for each SNP using Powermarker 3.25. [44]. Major allele frequency, minor allele frequency (MAF), gene diversity and Nei's genetic distance (d) [45] were calculated and a NJ (Neighbor-Joining) dendrogram (data not shown) based on d was computed. From the 957 SNPs, a final set comprising 918 SNPs with MAF larger than 0.05 was used for analysis of population structure, LD and marker trait associations.
To estimate the number of subgroups in the panel, different methodologies and different software packages were employed and compared in order to determine the appropriate population structure in collection. For the quantitative assessment of the number of groups in the panel, a Bayesian clustering analysis was performed using a model based approach implemented in the software package STRUCTUREv 2.2 [46,47]. This approach uses multi-locus genotypic data to assign individuals to clusters or groups (k) without prior knowledge of their population affinities and assumes loci in Hardy-Weinberg equilibrium. The program was run with 918 SNP markers for k-values 1 to 15 (hypothetical number of subgroups), with 100000 burnin iterations followed by 50000 MCMC (Markov Chain Monte Carlo) iterations for accurate parameter estimates. To verify the consistency of the results we performed 5 independent runs for each k. An admixture model with correlated allele frequencies was used. The most probable number of groups was determined by plotting the estimated likelihood values [LnP(D)] obtained from STRUCTURE runs against k. LnP(D) is the log likelihood of the observed genotype distribution in k clusters and is an output by STRUCTURE simulation. The k value best describes the population structure based on the criteria of maximizing the log probability of data or in other words the value at which LnP(D) reaches a plateau [46]. STRUCTURE results with the SNP marker dataset were confirmed with the results from STRUCTURE runs using a set of Diversity Array Technology (DArT) markers (Pasam et al. unpubl, Additional file 3: Figure S1). In a second approach principal coordinate analysis (PCoA) based on the dissimilarity matrix was performed using DARwin (Diversity Analysis and Representation for windows) [48]. In a third approach a NJ dendrogram based on Nei's genetic distance matrix was constructed. The substructure in the collection using different methodologies was compared and the final k value using STRUCTURE was ascertained. For this k value, the Q-matrix (population membership estimates) was extracted from STRUCTURE runs. This matrix provides the estimated membership coefficients for each accession in each of the subgroups.

Linkage disequilibrium analysis
The extent of LD effects both the number of markers required for GWAS and the resolution of mapping the trait. LD is in many cases influenced by population structure resulting from the demographic and breeding history of the accessions. Genome-wide LD analysis was performed among the panel and subgroups by pair wise comparisons among the SNP markers using HAPLOVIEW [49]. LD was estimated by using squared allele frequency correlations (r 2 ) between the pairs of loci [50]. The loci were considered to be in significant LD when P < 0.001, the rest of r 2 values were not considered as informative. The pattern and distribution of intra-chromosomal LD was visualized and studied from LD plots generated for each chromosome by HAPLOVIEW. To investigate the average LD decay in the whole genome among the panel, significant intra-chromosomal r 2 values were plotted against the genetic distance (cM) between markers. The smothering second degree LOESS curve was fitted using GENSTAT [41]. A critical value for r 2 was estimated by square root transforming of unlinked r 2 values to obtain a normally distributed random variable, and the parametric 95th percentile of that distribution was taken as a critical r 2 value [32]. Unlinked r 2 refers to the r 2 between the marker loci with a genetic distance greater than 50 cM or on independent linkage groups.

Association analysis
Different statistical models were used to calculate P-values for associating each marker with the trait of interest, along with accounting for population structure to avoid spurious associations by TASSEL v.2.1 (http://www.maizegenetics. net). We followed the formula y = Xb+M + Zu + e, where y is a response vector for phenotypic values, b is a vector of fixed effects regarding population structure, α is the vector of fixed effect for marker effects, u is the vector of random effects for co-ancestry and e is the vector of residuals. X can be either the Q-matrix or the PCs from Principal Component Analysis (PCA), M denotes the genotypes at the marker and Z is an identity matrix. Six models comprising both general linear models (GLM) and mixed linear models (MLM) were selected to test the marker-trait-associations (MTA). Results were compared to determine the best model for our analysis. PCA was conducted with TASSEL. The first ten significant PCs explained 43% of the cumulative variance of all markers. A kinship matrix (K-matrix), the pair-wise relationship matrix which is further used for population correction in the association models was calculated with 918 SNP markers using TASSEL [51]. The following models were tested: i) Naive model: GLM without any correction for population structure; ii) Q-model: GLM with Q-matrix as correction for population structure; iii) P-model: GLM with PCs as correction for population structure; iv) QKmodel: MLM with Q-matrix and K-matrix as correction for population structure; v) PK-model: MLM with PCs and K-matrix as correction for population structure and vi) K-model: MLM with K-matrix as correction for population structure [35,36,52,53]. All SNP markers were remapped by association mapping to determine the mapping resolution of the panel as suggested by [24]. The critical P-values for assessing the significance of MTAs were calculated based on a false discovery rate (FDR) separately for each trait [54], which was found to be highly stringent. Considering the stringency of the model used for accounting for population structure, most of the false positives were inherently controlled. Thus, we considered a more liberal approach as proposed by Chan et al. [55] for determining the threshold level for significant MTAs. It was suggested that the bottom 0.1 percentile distribution of the P-values can be considered as significant, which in our analysis resulted in threshold levels of 0.05 to 0.09 for individual traits. Alternatively, as a compromise between the two approaches an arbitrary threshold P-value of 0.03 was used for all traits and all models. This rather rough estimate was obtained by arranging-log10 P-values in a descending order, and the value at which the curve starts to flatten is determined as the threshold value. All association models with all traits were re-analyzed using GENSTAT [41] to check for any discrepancy.

Phenotypic data
Large phenotypic variation was observed for all traits. Outliers in the data were identified based on the residuals derived from the data of all environments and were removed from further analysis. For the trait heading date, data from year 2004 was excluded from the analysis due to differences in scoring this trait between the individual locations. Variance components were calculated by REML. The results confirmed that genotypic variance was significant for all traits (P < 0.001). GxE interactions were also significant (P < 0.001) but represented only a small fraction of the total variance. Heritabilities ranged between 0.90-0.95 indicating the robustness of the data and the low error rate. Year-wise means, ranges and heritabilities over all environments for the traits HD, PHT, TGW, SC and CPC are presented in Table 1 and their frequency distributions are illustrated in Additional file 4: Figure S2. The correlation exhibited by the agronomic traits between each other is outlined in Table 2. The traits SC and CPC are highly correlated (-0.7) and other traits showed moderate to weak correlation among each other. PHT was shown to be weakly correlated with HD and also with SC and CPC. TGW is found to be positively correlated with SC and negatively correlated with CPC. Substantial phenotypic differences were reported between two-rowed and six-rowed genotypes. The means for all traits were significantly different between the two groups (Additional file 5: Table S3). The variation observed was larger for all traits in six-rowed barleys than in two-rowed barleys. The greatest influence of spike morphology (two-rowed vs. six-rowed) on phenotypic variation was seen for TGW, whereas the greatest influence of population structure was observed for PHT (Additional file 6: Table S4). Best Linear Unbiased Estimates (BLUEs) of genotypic means were calculated from the fixed genotypic effects to avail unbiased mean estimates. Using Best Linear Unbiased Predictors (BLUPs) is less suitable as it would cause double shrinking [56]. Henceforth we used BLUEs in our further analysis. However, comparison of both BLUPs and BLUEs revealed very high concordance between both estimates, which is a direct consequence of the high heritabilities (Additional file 7: Figure S3).

Population structure and genetic diversity
From the high quality 985 SNPs, 957 markers had been genetically mapped and therefore were considered for this study. Of these, 39 SNPs (4%) were excluded because of a MAF below 0.05. Of the remaining SNPs, the majority revealed a MAF between 0.1 to 0.5 ( Figure 1). These SNP markers were distributed over all seven chromosomes with an average spacing of 1.18 cM. The distribution of SNP markers is not exactly uniform and varies within and among chromosomes with a minimum of 105 markers on chromosome 4H and a maximum of 164 markers on 5H (Table 3). Diversity statistics computed for each SNP are summarized in Additional file 8: Table  S5. PIC values for SNPs ranged from 0.09 to 0.5 with an average of 0.30. Most of the markers (726) displayed PIC values exceeding 0.25, demonstrating the informativeness of these markers in our panel. The average PIC values of the markers on each chromosome ranged between 0.29 (5H) to 0.33 (6H). The mean gene diversity value for the whole panel was 0.39 and spread within a range of 0.09 to 0.5. It was reported in several studies that the stratification of barley cultivars is concordant with spike morphology, mainly as a result of breeding history [57,58]. Therefore, similar molecular diversity statistics were generated separately for two-rowed and six-rowed barley groups within our panel and for the six subgroups. Observed mean PIC values are higher for the six-rowed group (0.31) than for two-rowed barleys (0.27). Similarly,  average gene diversity estimated was higher in six-rowed (0.38) than in two-rowed accessions (0.33). The population structure in the panel of 212 barley accessions was analyzed using 918 SNP markers and a model based approach in STRUCTURE. The LnP(D) appeared to be an increasing function of k for all the values observed. But the most significant increase of LnP(D) was observed when k was increased from 1 to 2 ( Figure 2). At k = = 2 the panel is clearly categorized into two-rowed and six-rowed barleys with few exceptions. The two main groups were further divided yielding six subgroups in total as LnP(D) values nearly reached a plateau at k = 6. Hence, we chose a value of k = 6 for our analysis as minimum number of groups present in the panel. Different values of k are still possible but will not qualitatively affect the results. An accession was assigned to a subgroup if at least 50% of the genome information was estimated to belong to one group. The accessions clustered into groups mostly according to their spike morphology and their geographical origins, as was demonstrated already by Haseneyer et al. [38]. The six groups are defined as: Group 1 (G1): 24 six-rowed barleys mostly from AM and WANA; G2: 31 accessions mostly six-rowed barley from EA; G3: 31 accessions mostly six-rowed barleys from EU; G4: 24 accessions mostly two-rowed from EU; G5: 79 accessions mostly two-rowed barleys from EU; G6: 23 accessions mostly two-rowed from WANA and AM ( Figure 3). The dominant stratification of the population according to spike morphology is confirmed by PCoA (Additional file 9: Figure S4) and NJ dendrogram (not shown). In the PCoA, it is obvious that the primary axis separates the accessions based on row type and further grouping is related to the region of origin. Overall, the clustering of accessions was consistent among various methods and we further explored the genetic diversity within these groups. The summary statistics for each group with 918 SNP markers is reported in Table 4. Observed gene diversity values ranged from 0.27 in G5 to 0.35 in G1; PIC values ranged from 0.22 in G5 to 0.29 in G1. Pairwise genetic distances ranged from 0.006 to 0.628, with an overall mean of 0.39. The average overall genetic distance between groups has been calculated, and the largest genetic distance of 0.36 was observed between the groups G2 (six-rowed, EA) and G5 (Two-rowed, EU). Similarly G4 (six-rowed, EU) and G5 (six-rowed, EU) are found to be closely related groups with an average genetic distance of 0.17 (Table 5).
Linkage disequilibrium LD analysis was performed using 918 SNPs for i) entire panel, ii) separately for two-rowed and six-rowed barleys, and iii) each of the six subgroups. Pairwise LD was estimated using the squared-allele frequency correlations (r 2 ) and was found to decay rapidly with the genetic distance. We studied different aspects of LD in our panel and observed that LD varies along the chromosomes with regions of high LD interspersed with regions of    low LD (Additional file 10: Figure S5). A critical value of r 2 , or basal LD, was calculated from inter-chromosomal LD analysis and is estimated to be 0.2 beyond which LD is assumed to be caused by genetic linkage. The point at which the LOESS curve intercepts the critical r 2 is determined as the average LD decay of the population. Based on these criteria the intra-chromosomal LD decayed between 5-10 cM for individual chromosomes and average LD decay of the whole genome was observed to be at 7 cM ( Figure 4). Extensive variability in the magnitude of r 2 at a given genetic distance was detected reflecting the wide local variation in the extent of LD across the chromosomes. The correlation between r 2 and marker distance was found to be significantly negative (r = -0.40) for markers below a distance of 10 cM, whereas marker pairs with larger distance showed no significant correlation with r 2 . Significant intra-chromosomal r 2 values (P < 0.001) ranged from 0.02 to 1 with an average of 0.12 for the whole panel. Among all significant loci in LD, 13.7% of the loci are above the critical r 2 value of 0.2 in the whole panel.
Pairs of loci are classified into 4 groups based on the inter-marker genetic distance: 0-10 cM (tightly linked markers), 11-20 cM (moderately linked markers), 21-50 (loosely linked markers) and > 50 (independent markers) [59]. The percentages of significant loci pairs and mean r 2 values for all classes of markers in the whole panel and different subgroups are presented in Table 6. Among all loci pairs, only 39.4% were in significant LD in the whole panel. The percentage of significant loci pairs decreased with the distance between loci; 62.2% of the tightly linked markers showed significant r 2 . Similarly 45.1%, of the moderately linked markers 38.3% of the loosely linked markers and 28.5% of independent markers were in significant LD. The portion of r 2 values exceeding the basal LD level of 0.2 decreased from 33.7% in the group of tightly linked markers to 10% for moderately linked markers to less than 4% for independent markers. Mean r 2 values decreased from 0.2 for closely linked marker loci to 0.08 for unlinked marker pairs. All loci pairs being in complete LD are spaced at genetic distance < 5 cM.

Patterns of linkage disequilibrium within subgroups
At the intra-chromosomal level, mean r 2 values for tworowed and six-rowed barley groups ranged between 0.18 and 0.17, which is slightly more than the mean r 2 of whole panel. The percentages of significant r 2 values were higher in the two-rowed than in the six-rowed subgroup for all classes of marker pairs except for the independent markers. This pattern is also similar to LD values above the basal level of 0.2, and a slightly slower LD decay was observed for two-rowed barley compared to the group of six-rowed types and to the whole panel. Similarly, the mean r 2 values were estimated for individual subgroups where they ranged from 0.3 (G5) to 0.49 (G4). The LD decay in the subgroups was much slower than in the whole panel. In Figure 5, binned r 2 values are mapped against the recombination distance (cM) across the genome. In the whole panel the average LD decays below a basal level (0.2) within 5 cM, while in the tworowed and six-rowed groups the basal level is reached between 10-15 cM and with LD in six-rowed barley decaying faster than in two-rowed barley. Within G5 LD decays to the basal level within 20-25 cM, while it does not reach the basal level in the remaining subgroups (G1,2,3,4,6). Average LD decay graphs for each group showed different patterns. Specifically, in the subgroups G4 and G5 at distances 45 and 74 cM we observed larger LD peaks. Scrutinizing these peaks revealed that high LD in these regions was caused by markers with low allele frequencies. The consequence of the reduced population size of the individual subgroups is that the presence of a solitary allele in single accession already might show a MAF above the critical threshold. Varying patterns of LD decay in different sub-populations are likely reflecting their breeding   histories [1] and may impinge on the QTL mapping resolution of the panel.

Evaluation of the association panel
All 918 SNPs were re-mapped using an LD approach. A model with kinship accounting for population structure was used for validating the genetic map position of the markers. We used each marker information as an individual trait and ran the analysis with the remaining SNPs to find the most significantly associated markers. The map distance between the target marker in question and the most highly associated marker was used to evaluate the resolution of the panel. More than 85% of the SNP markers had their genetic map position within 0-10 cM distance of their original map position and the majority of them remapped at the same position ( Figure 6). This re-mapping of markers shows that the resolution of QTL captured by AM approach in our panel will be within a range of 5-10 cM.

Association analysis Comparison of models
We tested several models to detect associations between SNP markers and agronomic traits. Owing to the complexity and the considerable amount of population structure present in our panel, we observed numerous spurious associations when using the naive (simple) model for AM. Hence, we assessed the usefulness of various linear models to account for population structure by comparing their ability to reduce the inflation of false positive associations.
To this end ranked P-values from GWAS were plotted in a cumulative way for each model by using spike morphology as phenotypic trait ( Figure 7). As demonstrated by Kang et al. [53] the distribution of P-values ideally should follow a uniform distribution with less deviation from the expected P-values. The models QK, PK and K showed a good fit for P-values, while the other models were characterized by the excess of small P-values which is tantamount to an abundance of spurious associations. This is particularly obvious in the case of the "naive" model, where nearly half of the P-values are smaller than 0.01. On the other hand, the K-model performed similar to the PK and QK model in displaying a highly uniform distribution Figure 6 Evaluating the mapping resolution of the panel.
Distribution of SNPs according to their re-mapped distances using the K-model of genome-wide association approach. The group 'identical' refers to the SNPs that mapped at exactly the same position and the group '0' refers to the SNPs that mapped within a distance of 0.01 to 0.99 cM Figure 7 Comparison of different GWA models. Cumulative distribution of P-values computed from 918 SNPs and row-type phenotype for different association models are presented. The more uniform the distribution of P-values, the better is the model Figure 5 Comparison of LD patterns and LD decay in the whole panel and subgroups. Mean r 2 values are plotted against the genetic distance for different groups of P-values and at the same time requiring less computational time. Irrespective of the model, major marker trait associations were constantly detected. However, the more stringent the model was the less spurious background associations were detected. All models considered for GWAS are presented for the trait spike morphology (Additional file 11: Figure S6). For all other traits only results from the K-model will be presented and discussed.

Barley spike morphology (row type)
Apart from comparing different AM models, we aimed to examine the spike morphology trait as a proof of concept for GWAS and to evaluate the resolution of the association panel. According to its spike morphology barley is classified as two-rowed and six-rowed types and the genes for this trait have been well documented with some of them already cloned [29,30,60]. We scored the row type character in the panel and considered 918 markers for AM using all models. A marker trait association was considered when the marker main effect was significant at 0.03 [-log 10 (0.03) = 1.5]. This results in a total of 34 markers that are significantly associated with the trait row type by using the K-model. (Additional file 11: Figure S6). The results are congruent with previous row type studies (see Figure 8).

Heading date
Thirty-four markers were found to be significantly associated with heading date (HD). These were grouped into 19 QTL located on all chromosomes. Significant marker trait associations within a genetic distance of 5-10 cM are delineated into a single QTL. Chromosome 2H harbors the maximum number of markers associated with the trait (Figure 9a). Some of these association results with the SNP markers effectively correspond to genomic regions of previously mapped flowering time QTL. These include genomic regions of various prominent flowering pathway genes like Ppd-H1, HvFT1, HvCO1 and HvCO3 (see Table 7).

Plant height
Thirty-two markers displayed significant associations with plant height (PHT). These markers detected 19 QTL (Table 8). Except for chromosome 1H, significantly associated markers were found on all chromosomes with the majority located on 2H and 3H (Figure 9b).

Thousand grain weight
Thirty-six markers yielding 21 QTL were significantly associated with Thousand Grain Weight (TGW, Figure  9c). Markers significantly associated with the trait were present on all chromosomes. As expected some of the TGW related QTL overlapped with QTL for spike morphology. The markers SNP56, SNP215, SNP385 and SNP458 are co-localized to the same region as Vrs3, Vrs1, Vrs4 and Int-c genomic regions (Table 9).

Starch content
Thirty-five markers were found to be significantly associated with the trait Starch Content (SC). These markers formed a total of 25 QTL (Figure 9d). Significantly associated markers for starch content were present on all chromosomes. Similar to TGW markers corresponding to the Vrs3 region (SNP56 & SNP66) are significantly associated with starch content. Several significant markers, co-localized with previously mapped genes and QTL for SC (Table 10).

Protein content
We found thirty-four markers to be significantly associated with crude protein content (CPC). These markers detected a total of 23 QTL (Figure 9e) and were distributed over all chromosomes. Some of the QTL for protein content overlapped with the QTL regions identified for CPC (Table 11).

Discussion
In the present study we describe the application of whole genome association mapping in a panel of diverse spring barley genotypes for agronomic traits. For each of the analyzed traits we identified 19 to 25 QTL. A substantial portion of the derived QTL locations are congruent with previously identified QTL in various biparental mapping populations (Tables 7, 8 , 9, 10, 11). GWAS are strongly influenced by the quality of the phenotypic data [79]. In the present study heritabilities for all traits exceeded 0.9 and phenotypic means reflected a broad variation in the panel. The observed differences for two-rowed and six-rowed groups were expected due to their different breeding histories and the pleiotropic effects of spike morphology (Additional file 5: Table S3). Phenotypic variation observed for all traits is higher in the six-rowed group than in the two-rowed group, which is in accordance with the higher genetic diversity  (Table 4). A more detailed analysis of population structure revealed six subgroups, which were mostly defined by spike morphology and geographical origin, both of which are known to impinge on the expression of agronomic traits.

Genetic diversity and population structure
Arguably an association mapping panel should suffice both phenotypic and molecular diversity for the outcome of reliable association results. Owing to the availability of a large number of mapped SNP markers that can be interrogated in a multiparallel manner [26], we were able to achieve a high marker coverage amounting to 1 marker per 1.18 cM. The average PIC (0.30) and Gene diversity (0.33) values observed in this panel of accessions are comparable with the results in previous studies using bi-allelic markers. PIC values differed among chromosomes and among different germplasm subgroups (Tables 3 & 4). Among all chromosomes, the highest average PIC value (0.33) was detected for chromosome 6H-which corresponds to the observations made by Rostoks et al. [24] in a set of European barley cultivars. We determined the population structure in our panel by implementing various approaches (STRUCTURE, PCoA and NJ-dendrogram) and found similar results. Several previous studies e.g. Maliysheva-Otto et al. [57], Rostoks et al. [24], Zhang et al. [58] and Hamblin et al. [80] have shown that growth habit, spike morphology and geographical origin are the major factors that mirror population structure in barley. Since the present study has been restricted to spring barley, spike morphology and geographical origin were the fundamental determinants for population substructuring (G1 to G6) (Figure 3). The 55 landrace accessions included in this panel were distributed among all groups. The subgroups G1, G2 and G3 are mainly six-rowed barleys and the subgroups G4, G5 and G6 include mainly two-rowed barleys. Two-rowed barleys in the panel are more closely related to each other and less diverse than the six-rowed barleys, which is in contrast to the findings of Zhang et al. [58] for Canadian germplasm. While in our panel two-rowed barleys even outnumbered the six-rowed accessions, the reason for their limited diversity might be that the majority originated from Europe. The geographical distribution of the accessions has a major influence on the diversity of alleles sampled in the population [57]. In Europe, tworowed barley is mainly grown as raw material for malt production. Malting quality is a quantitative trait. The use of a limited number of principal progenitors in the corresponding breeding programs has resulted in the reduction of genetic diversity and in the concomitant formation of a distinct subpopulation as it is seen in our present panel [81].

LD configuration and consequences
The resolution of LD mapping depends on the extent of LD across the genome and the rate of LD decay with genetic distance [82,83]. Genome-wide LD studies for barley have been previously reported in various populations using different molecular markers such as AFLP, SSR and DArT [57,58,84], with few studies, however, relying on more than 1000 markers. In our panel of spring barley accessions of worldwide origin, intra-chromosomal whole genome LD decays below the critical r 2value (0.2) within a genetic distance of 5 cM. It needs to be kept in mind that this is an average value, which summarizes substantial intra-chromosomal LD variation. The extent of intra-chromosomal LD for different chromosomes in our panel ranges from 5-10 cM with varying patterns along each chromosome (Additional file 10: Figure S5). Previous studies found various levels of LD decay in different barley populations [9,29,83] and among different chromosomes [24]. The LD decay was more rapid in the study of Comadran et al. [85] probably due to the inclusion of landraces in the collection. Caldwell et al. [9] also showed that LD decays more rapidly in barley landraces compared to elite barley cultivars. Less extensive LD beyond 10 cM has been found in our panel, as the majority of significant LD values above the basal level (33.7%) are due to tightly linked markers. Significant inter-locus LD values of unlinked markers (4%) may be the result of population structure (Table 6). We found some closely linked markers that are in complete Linkage Equilibrium (LE), while some distantly linked markers exhibited high LD values. This reflects the dynamic variation of LD patterns along the chromosomes as it has been shown in this panel at the sequence level for several transcription factors [27]. As to the individual subgroups, the portion of significant r 2 -values above the basal level (0.2) is higher within six-rowed than in two-rowed groups indicating high LD in these groups. Interestingly, LD in all subgroups extended beyond 30 cM except for G5 where LD extended to about 20-25 cM ( Figure 5). This is most likely because of the larger population size of G5 compared to the other subgroups. The extensive LD observed in the subgroups is probably due to their decreased population size and a concomitant increase in relatedness.

Genome-wide association mapping
Despite the advantages of GWAS to pinpoint genetic polymorphisms underlying agronomic traits, this approach may suffer from an inflation of false positives Significant markers associated for trait PHT, their MAF, P-value of association, variance explained by marker (R 2 ), effect of the most significant marker within the QTL interval, name of the QTL, and the reference QTL or gene from literature.
due to population structure [4,52,86]. Several statistical models to correct for the effect of population structure have been proposed and tested in previous studies [37,52,87]. Since we detected a considerable amount of structure in the present panel we used linear models to control for population structure and to reduce the false positive associations. Similar to the previous studies of comparing GWAS models in allogamous and autogamous species [37,52], our results suggest that K-model, QK model and PK model performed better than others ( Figure 7). Moreover, for the K-model computational time is faster and no additional steps like identifying appropriate population structure (Q-matrix) in the panel are required. Since in an exploratory analysis mostly consistent results were obtained for all three approaches, the K-model was employed in the complete analysis of all traits to avoid redundancy of data. Still it should be kept in mind that correcting for population structure not only reduces the frequency of false positives but also may entail false negatives in situations where a Significant markers associated for TGW, their MAF, P-value of association, variance explained by marker (R 2 ), effect of most significant marker within the QTL interval, name of the QTL, and reference QTL or gene from literature.
character state is strongly correlated with population structure [28]. In order to confirm the efficiency and resolution of the panel for association mapping using the range of available markers, we re-mapped all 918 SNPs using the K-model. From 918 SNPs, 783 were re-mapped within 10 cM of their original positions. Only 14% of the markers mapped beyond 10 cM. Among the successfully remapped markers more than 95% markers are within 5 cM distance from the original map position indicating the mapping resolution of our panel (see Figure 6).
Rostoks et al. [24] has used the same approach to evaluate their barley collection for GWAS with a subset of markers and successfully mapped 80% of the markers.
To demonstrate the suitability of the panel and the model for GWAS, we first analyzed spike morphology (row type) (Figure 8). This trait can be easily scored and is important from the agronomic and the domestication point of view. The genetic basis of row type is already well known and several QTL have been mapped and genes have been cloned [29,60,88]. We identified 34 marker-trait associations for this trait (Figure 8). Our identified marker-trait associations for row type are concurrent with all previously identified major loci-vrs1 [88], vrs2, vrs3, vrs4 and int-c [29,30]. Additional, less significant associations detected for row type could not be associated to any known major loci, and need to be further explored. These results for row type act as a proof of concept for GWAS in our spring barley panel and reflect the efficiency of GWAS for high resolution QTL mapping in inbreeding species. Some of the row type QTL overlapped with associated regions for other traits, especially with the traits TGW, SC and CPC (Additional file 12: Figure S7). As expected, two-rowed barley has higher TGW than the six-rowed types, as the number of sink organs (kernels) in two-rowed spikes is smaller than in six-rowed spikes. While the effect of spike architecture on TGW is clearly pleiotropic, its influence on SC and CPC is the result of breeding history and end use quality. In case of malting barley, varieties are generally bred for high starch and low protein content. In Europe mostly two-rowed barley is preferred for malting while six-rowed barley is primarily used as feed and is characterized by high protein content [22]. As a result, the two-rowed types in our panel have higher starch content and lower protein content than six-rowed types (Additional file 5: Table S3). As expected, the landraces included in the panel did not show this stratification as they did not underly this selection pressure. Heading date (HD) reflects the adaptation of a plant to its environment and is a complex trait effected by numerous QTL both in outbreeding [89] and in inbreeding species [61]. Many SNP markers were found to be associated with the trait HD ( Figure 9a) and we report a total of 34 significant SNPs defining 19 QTL. Some of these QTL hit genomic regions that were previously reported to harbor major genes including HvFT3, PpdH1, HvFT4, eps2, HvGI, HvCO3, HvFT1 and HvCO1 (Table 7). In a previous study using the same panel, fragments from three flowering time candidate genes were re-sequenced and SNPs within the gene PpdH1 revealed the largest effects on HD [14]. In the present GWAS, SNPs located in the vicinity (ca. 2 cM) of PpdH1 showed significant associations with HD (Table 7). By further including all PpdH1 SNPs from Stracke et al. [14] into our GWAS, these SNPs revealed the highest association of all markers used ( Figure 10). These findings lend strength to the hypothesis that a further increase in marker coverage will either lead to the detection of additional associations or improve the significance of existing QTLs.
For the trait PHT we found 19 putative QTL regions located on chromosomes 2H, 3H, 4H, 5H, 6H and 7H comprising 32 marker trait associations. Semi-dwarf and dwarf cultivars have been developed worldwide to reduce lodging and to improve the harvest index. Different genes/alleles have been deployed in different geographic regions: the GA sensitive sdw1 dwarfing gene has been deployed in America and Australia, while its allelic form, termed denso, is frequently seen in European two-rowed germplasm. The recessive uzu allele is found in Japanese, Chinese and Korean cultivars [70,90]. Many QTL for PHT coincide with previously mapped QTL and genes ( Table 8). The QTL4_PHT on chromosome 2H coincides with the mapping position of sdw3 which plays a major role in gibberellins-insensitive dwarfing barley [68]. Two allelic forms of the dwarfing gene denso/sdw1 map to the same genomic region as QTL8_PHT located on the long arm of chromosome 3H [70]. The QTL7_PHT is about 10 cM distant from the uzu locus based on the consensus map presented in grain genes database.
Thousand grain weight (TGW) is one of the major yield components having direct effect on the final yield. Altogether 21 QTL were found for this trait and some of them are in vicinity of row type genes. Some of the QTL were further confirmed by previously mapped QTL in same genomic regions (Table 9). QTL14_TGW on 5HL is observed to effect other traits like PHT, SC and CPC.
As outlined above, starch and protein content of the grain are major determinants of the end use quality.
Several of the 25 QTL detected for starch content coincided with the previously identified QTL (Table 10). These include QTL for related traits like acid detergent fiber (ADF) content, starch granule size and granule shape [76]. QTL21_SC on 7H is located in the region of the waxy locus known to encode granule-bound starch synthase I (GBSS I), which catalyses the synthesis of amylose [91,92]. For the total grain crude protein content we identified 23 QTL, located on all the seven chromosomes. Eleven of these QTL regions co-localize with previously mapped QTL, while 12 QTL are novel (Table 11). Interestingly, the majority of QTL for traits SC and CPC are located on chromosome 7H. Some of the QTL identified for SC coincide with QTL for CPC e.g. chromosomes 1H (55 cM), 2H (33.74 cM), 3H (55 cM), 5H (110 cM) and 7H (12 cM and 121 cM) ( Table 10 & 11). The coincidence of the QTL for these two traits can be expected due to their negative correlation (Table 2). On the other hand, we cannot rule out that some of the shared QTL are the result of linkage of underlying genes.

GWA reveals small effects only
Even the best associations observed in the present study showed only modest R 2 values (percentage of genetic trait variation explained) for the corresponding SNPs, implying low variance predicted by each SNP. This is exemplified by the QTL 'Qsch7a', which in a biparental QTL mapping study explained 47% of variation in SC [76]. In the present study, 'QTL23_SC' located in the same genomic region as 'Qsch7a' explains only 0.2% of the variation. Many GWAS in humans have reported low R 2 values and the rest of the unexplained variation is termed as 'unexplained missing heritability' [93]. Roy et al. [94], among others, reported R 2 -values to range from 0.2% to 3.95% in GWAS for plants, which corresponds well with our present results. In a consorted study for the trait "body height", an impressive number of 40 genotypic variants have been identified under a stringent threshold. Together these were only able to explain around 5% of the variation in human body height [95,96]. Possible explanations for this "missing heritability" include i) insufficient marker coverage, in cases where the causal polymorphism is not in perfect LD with the genotyped SNP reduces the power to detect associations and the variation explained by such a SNP marker. This has been demonstrated in the present study for the effect of the PpdH1 gene on HD; ii) rare alleles (MAF < 5%) with a major effect have been dropped from the analysis and will go undetected in cases where they are associated; iii) the expression of a character or trait depends on a large number of genes/QTL with small individual effects which escape statistical detection; iv) inadequacy of the statistical approaches available to detect epistatic interactions in GWAS and v) biased estimates of R 2 for individual SNPs due to the level of population stratification in the panel [93,95,[97][98][99]. Although the above mentioned reasons were mainly discussed in the context of GWAS in humans, they also pertain to GWAS in plants and other organisms. In addition to the above mentioned reasons, the statistical model employed for the analysis will affect the variation explained by the SNPs. As the stringency and threshold of the models increases, the power of detecting small effect SNPs will be reduced. We observed that in the case of using stringent models for GWAS the larger portion of the trait variation is explained by the model itself and the less variation is left to be explained by genetic effects. For the trait HD the K-model, explained nearly 70% of the variation of the trait. Reducing the stringency of the model would increase the variation explained by the marker, but at the same time would result in more false positives. Especially in inbreeding crops like barley, it is difficult to preclude completely the effect of relationship among genotypes by applying simpler models. Hence, GWAS in highly structured populations of inbreeding crops such as barley will depend on the careful optimization of the model regarding sensitivity vs. selectivity.

Conclusions
Overall, our results provide new details on the chances and pitfalls of GWAS in structured populations of inbreeding crops like barley. Results from the present study provide an insight into the genetic architecture of important agronomic traits for barley (HD, PHT, TGW, SC and CPC). In total, we identified 107 QTL for these traits. Some genomic regions harbor QTL for more than one trait and, based on map comparisons, 50 QTL have been found to concur with previously mapped QTL. For all traits together, 57 novel QTL have been detected. To mitigate the shortcomings of GWAS in inbreeding crops, future association studies might implement novel strategies such as joint linkage and LD mapping which were already successfully applied in various species [89,[100][101][102]. Furthermore, to fine map and "mendelize" selected QTLs, staggered patterns of LD decay observed for different genepools of barley (cultivars, landraces, wild barley) may be exploited in combination with biparental mapping and marker saturation strategies exploiting the ever increasing body of genomic sequence [30,103]. The feasibility of such an approach was recently demonstrated by identifying a candidate gene for the ANTHOCYANINLESS 2 locus using a combination of association mapping followed by a segregation analysis in a biparental population and a BAC contig analysis [104].