Genotyping-by-sequencing and microarrays are complementary for detecting quantitative trait loci by tagging different haplotypes in association studies

Background Single Nucleotide Polymorphism (SNP) array and re-sequencing technologies have different properties (e.g. calling rate, minor allele frequency profile) and drawback (e.g. ascertainment bias), which lead us to study the complementarity and consequences of using them separately or combined in diversity analyses and Genome-Wide Association Studies (GWAS). We performed GWAS on three traits (grain yield, plant height and male flowering time) measured in 22 environments on a panel of 247 diverse dent maize inbred lines using three genotyping technologies (Genotyping-By-Sequencing, Illumina Infinium 50K and Affymetrix Axiom 600K arrays). Results The effects of ascertainment bias of both arrays were negligible for deciphering global genetic trends of diversity in this panel and for estimating relatedness. We developed an original approach based on linkage disequilibrium (LD) extent in order to determine whether SNPs significantly associated with a trait and that are physically linked should be considered as a single QTL or several independent QTLs. Using this approach, we showed that the combination of the three technologies, which have different SNP distribution and density, allowed us to detect more Quantitative Trait Loci (QTLs, gain in power) and potentially refine the localization of the causal polymorphisms (gain in position). Conclusions Conceptually different technologies are complementary for detecting QTLs by tagging different haplotypes in association studies. Considering LD, marker density and the combination of different technologies (arrays and re-sequencing), the genotypic data presently available were most likely enough to well represent polymorphisms in the centromeric regions, whereas using more markers would be beneficial for telomeric regions.


Abstract 23
Background: Single Nucleotide Polymorphism (SNP) array and re-sequencing technologies 24 have different properties (e.g. calling rate, minor allele frequency profile) and drawback (e.g. 25 ascertainment bias), which lead us to study the complementarity and consequences of using 26 them separately or combined in diversity analyses and Genome-Wide Association Studies 27 (GWAS). We performed GWAS on three traits (grain yield, plant height and male flowering 28 time) measured in 22 environments on a panel of 247 diverse dent maize inbred lines using 29 three genotyping technologies (Genotyping-By-Sequencing, Illumina Infinium 50K and 30 Affymetrix Axiom 600K arrays). 31 Results: The effects of ascertainment bias of both arrays were negligible for deciphering 32 global genetic trends of diversity in this panel and for estimating relatedness. We developed 33 an original approach based on linkage disequilibrium (LD) extent in order to determine 34 whether SNPs significantly associated with a trait and that are physically linked should be 35 considered as a single QTL or several independent QTLs. Using this approach, we showed 36 that the combination of the three technologies, which have different SNP distribution and 37 density, allowed us to detect more Quantitative Trait Loci (QTLs, gain in power) and 38 potentially refine the localization of the causal polymorphisms (gain in position). 39 11 our set of loci. LD extent varied significantly between chromosomes for both high 226 recombinogenic (>0.5 cM/Mbp) and low recombinogenic regions (<0.5 cM/Mbp, Table 1). 227 Chromosome 1 had the highest LD extent in high recombination regions (0.062 ±0.007 cM) 228 and chromosome 9 the highest LD extent in low recombinogenic regions (898.6±21.7 kbp) 229 (Table 1). Linkage disequilibrium extent relative to genetic and physical distances was highly 230 and positively correlated in high recombinogenic regions (r = 0.86), whereas it was not in low 231 recombinogenic regions (r = -0.64). 232 233 Table 1: Variation of LD extent, and percentage of genome covered. 234 Genetic and Physical LD extent were obtained by adjusting Hill and Weir model's on 235 100 different sets of 500,000 loci randomly sampled in high and low recombination 236 regions, respectively. The value represented the average across these 100 sets. The 237 percentage of genome coverage was estimated using markers with MAF > 5% and 238 E(r²k) = 0.1, for each technology and for the three technologies combined 239 The effective population size (Ne) estimated from the Hill and Weir's model [37] using 242 genetic distance varied from 7.9±0.04 (Chromosome 1) to 41.2±0.12 (Chromosome 7) in high 243 recombinogenic regions. Noteworthy, the same approach lead to unrealistic values in low 244 Finally, we studied the variation of local LD extent along each chromosome by adjusting the 248 Hill and Weir's model against genetic distance within a sliding windows of 1cM (Additional 249 file 4: Figure S8). After removing intervals that did not reach our criteria (Absence of model 250 convergence, effective population size > 247, low number of loci), the 3,205 remaining 251 intervals (90%) showed a high variation for genetic LD extent along each chromosome, with 252 LD extent varying from 0.019 (Ne = 246) to 0.997 cM (Ne = 0.06) (Additional file 4: Figure  253 S8). 254

Large differences in genome coverage between technologies
We estimated the percentage of the genome that was covered by LD windows around SNPs, 255 calculated by using either physical or genetic distances (Table 1). We observed a strong 256 difference in coverage between the three technologies at both genome-wide and chromosome 257 scale, as illustrated in Figure 2 on chromosome 3 (Table 1, and Additional file 2: Figure S2). 258 For a LD extent of r²K = 0.1, 74%, 82% and 89% of the physical map, and 42%, 58% and 259 71% of the genetic map were covered by the 50K array, the GBS and the 600K array, 260 respectively (Table 1). For the combined data (50K + 600K + GBS), the coverage strongly 261 varied between chromosomes, ranging from 83% (chromosome 7) to 98% (chromosome 1) of 262 the physical map, and from 51% (chromosome 7) to 97% (chromosome 1) of the genetic map 263 (Table 1). For the physical map, increasing the LD extent threshold to r²K=0.4 reduced the 264 genome coverage from 89% to 49% for 600K, 82% to 28% for GBS, 74% to 20% for 50K 265 and 90% to 52% for the combined data. Increasing the MAF threshold reduced slightly the 266 genome coverage, with smaller reduction for the physical map than genetic map. Surprisingly, 267 increasing the SNP number by combining the markers from the arrays and GBS did not 268 strongly increase the genome coverage as compared to the 600K, regardless of the threshold 269 for LD extent (Figure 2 and Additional file 2: Figure S2). 270 We observed a strong variation of genome coverage along each chromosome with contrasted 271 patterns in low and high recombinogenic regions (Figure 2 and Additional file 2: Figure S2). 272 While low recombinogenic regions were totally covered with all the technologies (except for 273 few intervals using the 50K array), the genome coverage in high recombinogenic regions 274 varied depending on both technology and SNP distribution. 47% of the 2Mbp intervals in 275 high recombination regions were better covered by the 600K array than the GBS against only 276 1%, which were better covered by GBS than 600K. 277 278

Number of QTLs detected using genome-wide association studies increases with markers density
We observed a strong variation in the number of SNP significantly associated with the three 279 traits across the 22 environments ( Table 2). The mean number of significant SNPs per 280 environment and trait was 3.7, 44.7, 17.9 and 62.4 for the 50K, 600K, GBS and the three 281 technologies combined, respectively (Table 3). Considering the p-value threshold used, 28, 282 303 and 204 false positives were expected among the 243, 2,953 and 1,182 associations 283 detected for 50K, 600K and GBS, respectively. False discovery rate appeared therefore higher 284 for GBS (17.2%) than for DNA arrays (11.5% and 10.2% for 50K and 600K, respectively). It 285 could be explained by the higher genotyping error rate of GBS due to imputation and/or by its 286 higher number of makers with a lower MAF. Both reduce the power of GBS compared to 287 14 DNA arrays and therefore lead to a higher false discovery rate. Proportionally to the SNP 288 number, 50K and 600K arrays resulted in 1.5-and 1.7-fold more associated SNPs per 289 situation (environment × trait) than GBS (p-value<2x10 -6 , Table 3). This difference between 290 arrays and GBS was higher for grain yield (GY) and plant height (plantHT) than for male 291 flowering time (DTA, Table3). 292 293 294  We used two approaches based on LD for grouping significant SNPs: (i) considering that all 311 SNPs with overlapping LD windows for r²K=0.1 belong to the same QTL (LD_win) and (ii) 312 grouping significant SNPs that are adjacent on the physical map and are in LD (r²K > 0.5, 313 LD_adj). The QTLs defined by using the two approaches were globally consistent since 314 significant SNPs within QTLs were in high LD whereas SNPs from different adjacent QTLs 315 were not (Additional file 6: Figure  LD_adj approach increased strongly when the LD threshold was set above 0.5. Differences in 319 QTL groupings between the two methods were observed for specific LD and recombination 320 patterns. This occurred for instance on chromosome 6 for grain yield (Additional file 6: 321 Significant SNPs  QTLs  Technology  50K  600K  GBS  Combined  50K  600K  GBS  Combined  Marker Nb  42046  459191  308929  810580  42046  459191  308929  810580   Total Nb   DTA  52  759  345  1115  20  130  133  226  plantHT  68  778  299  1061  16  96  90  160  GY  123  1416  538  1941  33  166  120  238  Per trait  81  984  394  1372  23  131  114 Figure S9-LD-Adjacent and Additional file 7: Figure S9-LD-Windows). Within this region, 322 the recombination rate was low and the LD pattern between associated SNPs was complex. 323 While LD_adj splitted several SNPs in high LD into different QTLs (for instance QTL 232,324 235,237,249), LD_win grouped together associated SNPs that are genetically close but 325 displayed a low LD (Additional file 6: Figure S9-LD-Adjacent and Additional file 7: Figure  326 S9-LD-Windows). Reciprocally, for flowering time, we observed different cases where 327 LD_win separated distant SNPs in high LD into different QTLs whereas LD_adj grouped 328 them (QTL 25 and 26, 51 to 53, 95 to 97, 208 and 209, 218 and 219). As these differences 329 were specific to complex LD and recombination patterns, we used the LD_win approach for 330 the rest of the analyses. 331 332 Although a large difference in number of associated SNPs was observed between 600K and 333 GBS, little difference was observed between QTL number after grouping SNPs (Table 2,  334   Table 3). The mean number of QTLs was indeed 1.0, 5.9 and 5.2 and 9.5 for the 50K, 600K, 335 GBS, and the three technologies combined, respectively (Table 3). Note that the number of 336 QTLs continued to increase with marker density when SNPs from GBS, 50K and 600K were 337 combined ( Figure 4). The number of SNPs associated with each QTL varied according to the 338 technology (on average 3.7, 7.6, 3.4 and 6.6 significant SNPs for the 50K, 600K, GBS, and 339 the combined technologies, respectively). The total number of QTLs detected over all 340 environments by using the 600K array and GBS was close for flowering time (130 vs 133) 341 and plant height (96 vs 90). It was 1.4-fold higher for the 600K than GBS for grain yield (166 342 vs 120). 343

The 600K and GBS were highly complementary for association mapping
The 600K and GBS technologies were highly complementary to detect QTLs for the three 345 traits: 78%, 76% and 71% of the QTLs of flowering time, plant height and grain yield, 346 respectively, were specifically detected by 600K or GBS ( Figure 5). On the contrary, 50K 347 displayed very few specific QTLs. While only 9 out 69 QTLs from the 50K array were not 348 detected when the 600K array was used, 39 QTLs detected using the 50K array were not 349 detected when using GBS. When we combined the GBS and 600K markers, 7% of their 350 common QTLs had -log 10 (Pval) increased by 2 and 21% by 1 potentially indicating a gain in 351 accuracy of the position of the causal polymorphism (Additional file 8: Table S3). 352 This complementarity between GBS and 600K is well exemplified with two strong 353 association peaks for flowering time on chromosome 1 (QTL32) and 3 (QTL95) detected in 354 several environments (Additional file 8: Table S3 and Figure 6a). In order to better understand 355 the origin of the complementarity between GBS and 600K technologies for GWAS, we 356 scrutinized the LD between SNPs and the haplotypes within these two QTLs (Figure 6b and c, 357 and Additional file 9: Figure S10 for other examples). For example, QTL95 showed a gain in 358 power. It was only identified by the 600K array although the region included numerous SNPs 359 from GBS close to the associated peak. None of these SNPs was in high LD with the most 360 associated marker of the QTL95 (Figure 6b). Another example is QTL32, which was detected 361 by 1 to 10 GBS markers in 9 environments with -log(p-value) ranging from 5 to 7.6, whereas 362 it was detected by only two 600K markers in one environment (Ner12W) with -log(p-value) 363 slightly above the significance threshold (Additional file 8: Table S3 and Figure 6b). 364

365
Haplotype analyses showed that the SNPs from the GBS within QTL95 were not able to 366 discriminate all haplotypes (Figure 6c). In QTL95, using the 600K markers allowed one to 367 18 discriminate the three main haplotypes (H1, H2, H3), whereas using the GBS markers did not 368 allow discrimination of H3 against H1 + H2. As H1 contributed to an earlier flowering time 369 than H2 or H3, associations appeared more significant for the 600K than for GBS (Figure 6c). 370 In QTL32, the use of GBS markers allowed identifying late individuals that mostly displayed 371 H1, H2 and H3 haplotypes, against early individuals that mostly displayed H4 and H5 372 haplotypes ( Figure 6c). The gain of power for GBS markers as compared to 600K markers for 373 QTL32 originated from the ability to discriminate late individuals (black alleles) from early 374 individuals (red alleles) within H4 haplotypes (Figure 6c). 375

Stability, pleiotropy and distribution of QTL detected across environments
After combining the three technologies, we identified 226, 160, 238 QTLs for the flowering 376 time, plant height and grain yield, respectively (Table 3 and Additional file 8: Table S3). We 377 highlighted 23 QTLs with the strongest effects on flowering time, plant height and grain yield 378 (-log 10 (Pval) ≥ 8, Table 4). The strongest association corresponded to the QTL95 for 379 flowering time (-log 10 (p-value) = 10.03) on chromosome 3 (158,943,646 -159,005,990 bp), 380 the QTL135 for GY (-log 10 (p-value) = 18.7) on chromosome 6 (12,258,527 -29,438,316 bp) 381 and QTL78 on chromosome 6 (12,258,527 -20,758,095 bp) for plant height (-log 10 (p-value) 382 = 17.31). The QTL95 for flowering time trait was the most stable QTLs across environments 383 since it was detected in 19 environments (Additional file 8: Table S3). Moreover, this QTL 384 showed a colocalization with QTL74 for grain yield in 5 environments and QTL30 for plant 385 height in 1 environment suggesting a pleiotropic effect. More globally, 472 QTLs appeared 386 trait-specific whereas 70 QTLs overlapped between at least two traits (6,3%, 5.2% and 3.0% 387 for GY and plantHT, GY and DTA, and DTA and plantHT, respectively) suggesting that 388 some QTLs are pleiotropic (Additional file 10: Figure S11). This is not surprising since 389 average corresponding correlations within environments for these traits were moderate (0.47, 390 0.54 and 0.45, respectively). Only 0.7% overlapped between the three traits (Additional file 391 10: Figure S11). Twenty percent of QTLs were detected in at least two environments and 9% 392 in at least three environments (Additional file 10: Figure S12 and Table S4). We observed no 393 significant differences of stability between the three traits (p-value = 0.2). However, 6 out 7 394 most stable QTLs (Number of environments >5) were found for flowering time and a higher 395 proportion of QTLs were specific for plant height than grain yield and flowering time (85% vs 396 77% for both flowering time and grain yield, p-value = 0.09, p-value = 0.2), respectively. We 397 observed that QTLs that displayed a significant effect in more than one environment had 398 larger effects and -log(p-value) values than those significant in a single environment. This The distribution of QTLs was not homogeneous along the genome since 82%, 77% and 79% 414 of flowering time, plant height and grain yield QTLs, respectively, were located in the high 415 recombinogenic regions, whereas they represented 46% of the physical genome (Additional 416 file 10: Table S5 and Figure S13). The QTLs were more stable (≥ 2 environments) in low than 417 in high recombinogenic regions (12.8% vs 5.8%, p-value = 0.03). 418

GBS required massive imputation but displayed similar global trends than DNA arrays for genetic diversity organization
In order to reduce genotyping cost, GBS is most often performed at low depth leading to a 420 high proportion of missing data, thereby requiring imputation in order to perform GWAS. 421 Imputation can produce genotyping errors that can cause false associations and introduce bias 422 in diversity analysis [33]. We evaluated the quality of genotyping and imputation obtained by 423 different approaches, taking the 50K or 600K arrays as references. The best imputation 424 method that yielded a fully genotyped matrix with a low error rate for the prediction of both 425 heterozygotes and homozygotes was the approach merging the homozygous genotypes from 426 Tassel and the imputation of Beagle for the other data (GBS 5 in Additional file 1: Table S1). 427 The quality of imputation was high with 96% of allelic values consistent with those of the 428 50K and 600K arrays. This level of concordance is identical than in a study of USA national 429 maize inbred seed bank by Romay et al. [32]. It is higher than in a diversity study of 430 European flint maize collection (93%) by Gouesnard et al. [33], which was more distant from 431 the reference AllZeaGBSv2.7 database than for the panel presented here. 432

433
The ascertainment bias of arrays due to the limited number of lines used for SNP discovery 434 was reinforced by counter-selection of rare alleles during the design process of DNA arrays 435 [3,4]. For GBS, the polymorphism database to call polymorphisms included thousands of 436 diverse lines [35]. In our study, we used AllZeaGBSv2.7 database. After a first step of GBS 437 imputation (GBS 2 ), missing data dropped to 11.9% i.e. only slightly more than in Romay However, although highly correlated, level of relatedness differed between GBS and DNA 446 arrays, especially when the lines were less related as showed by the deviation (to the left) of 447 the linear regression from the bisector (Figure 3). 448

The extent of linkage disequilibrium strongly varied along and between chromosomes
Linkage disequilibrium extent in high recombinogenic regions varied to a large extent among 449 chromosomes, ranging from 0.012 to 0.062 cM. Similar variation of genetic LD extent 450 between maize chromosomes has been previously observed by Rincent et al. [14], but their 451 classification of chromosomes was different from ours. This difference could be explained by 452 the fact that we analyzed specifically high and low recombination regions. According to Hill 453 and Weir Model [37], the physical LD extent in a genomic region increased when the local 454 recombination rate decreased. As a consequence, chromosome 1 and 9 had the lowest and 455 highest physical LD extent and displayed the highest and one of the lowest recombination rate 456 in pericentromeric regions, respectively (0.26 vs 0.11 cM / Mbp, Table 1 and Additional file 457 10: Table S5). Unexpectedly, the genetic LD extent also correlated negatively with the 458 recombination rate. It suggested that chromosomes with a low recombination rate also display 459 a low effective population size. Background selection for deleterious alleles could explain this 460 pattern since it reduced the genetic diversity in low recombinogenic regions [38,39]. Finally, 461 23 we observed a strong variation of the LD extent along each chromosome (Additional file 4: 462 Figure S8). As we used a consensus genetic map [40] that represents well the recombination 463 within our population, it suggested, according to Hill and Weir's model, that the number of 464 ancestors contributing to genetic diversity varied strongly along the chromosomes. This likely 465 reflects the selection of genomic regions for adaptation to environment or agronomic traits 466 [38], that leads to a differential contribution of ancestors according to their allelic effects. 467 Ancestors with strong favorable allele(s) in a genomic region may lead ultimately to large 468 identical by descent genomic segments [41]. 469

SNPs were clustered into QTL highlighting interesting genomic regions
In previous GWAS, the closest associated SNPs were grouped into QTLs according to 470 either a fixed physical distance [1] or a fixed genetic distance [30,42]. These approaches 471 suffer of two drawbacks. First, the physical LD extent can vary strongly along chromosomes 472 according to the variation of recombination rate (Additional file 2: Figure S2). Second, the 473 genetic LD extent depends both on panel composition and the position along the genome 474 (Table 1). These approaches may therefore strongly overestimate or underestimate the number 475 of QTLs. To address both issues Cormier et al. [43] proposed to group associated SNPs by 476 using a genetic window based on the genetic LD extent estimated by Hill and Weir model in 477 the genomic regions around the associated peaks [37]. In our study, we improved this last 478 approach (LD_win): 479 -First, we used r²K that corrected r 2 for kinship rather than the classical r 2 since r²K 480 reflected the LD addressed in our GWAS mixed models to map QTL [17]. 481 -Second, we took advantage of the availability of both physical and genetic maps of 482 maize to project the genetic LD extent on the physical map. This physical window was useful 483 to retrieve the annotation from B73 reference genome, decipher local haplotype diversity 484 ( Figure 6) and estimate physical genome coverage (Table 1, Figure 2). 485 -Third, we considered an average LD extent estimated separately in the high and low 486 recombinogenic genomic regions. This average was estimated by using several large random 487 sets of pairs of loci in these regions rather than the local LD extent in the genomic regions 488 around each associated peaks. 489

490
We preferred this approach rather than using local LD extent in order to limit the effect of (i) 491 the strong variation of marker density along the chromosome (Additional file 2: Figure S2), 492 (ii) the local ascertainment bias due to the markers sampling (iii) the poor estimation of the 493 local recombination rate using a genetic map, notably for low recombination regions [3, 41] 494 (iv) errors in locus order due to assembly errors or chromosomal rearrangements. 495

496
We compared LD_win with LD_adj, another approach based on LD to group the SNPs 497 associated to trait variation into QTL. The discrepancies between the two approaches can be 498 explained by the local recombination rate and LD pattern. Since LD_adj approach was based 499 on the grouping of contiguous SNPs according to their LD, this approach was highly sensitive 500 to (i) error in marker order or position due to genome assembly errors or structural variations, 501 which are important in maize [44] (ii) genotyping or imputation errors, which we estimated at 502 ca. 1% and ca. 4%, respectively, for GBS (Additional file 1: Table S1) associated with different traits in the Cornfed dent panel [11]. Conversely, we did not identify 523 in our study any QTL associated to the florigen ZCN8, which showed significant effect in 524 these two previous studies. This relates most likely to the fact that we narrowed the flowering 525 time range in our study, in particular by eliminating early lines. This reduced the 526 representation of the early allele in the Zcn8, leading to a MAF of 0.27 in our study vs. 0.35 in 527 Rincent et al. [11], which can diminishes the power of the tests [14]. 528

Complementarity of 600K and GBS for QTL detection resulted mostly from the tagging of different haplotypes rather than the coverage of different genomic regions.
Number of significant SNPs and QTLs increased with the increase in marker number (Table  529 3, Figure 4). This could be explained partly by a better coverage of some genomic regions by 530 SNPs, notably in high recombinogenic regions which showed a very short LD extent and were 531 enriched in QTLs (Additional file 10: Figure S13). Numerous new QTLs identified by the 532 600K array and GBS as compared with those identified by the 50K array were detected in 533 high recombinogenic regions that were considerably less covered by the 50K array than the 534 600K array or GBS (Additional file 2: Figure S2). 535

536
The high complementarity for QTL detection between GBS and 600K array was only 537 explained to a limited extent by the difference of the SNP distribution and density along the 538 genome, since these two technologies targeted similar regions as showed by coverage analysis 539 ( Figure 2 and Additional file 2: Figure S2). However, at a finer scale, SNPs from the 600K 540 array and GBS could tag close but different genomic regions around genes. SNPs from the 541 600K array were mostly selected within coding regions of genes [4], whereas SNP from GBS 542 targeted more largely low copy regions, which included coding but also regulatory regions of 543 genes [32,35]. To further analyse the complementarity of the technologies, we analysed local 544 haplotypes. We showed that both technologies captured different haplotypes when similar 545 genomic regions were targeted ( Figure 6). Hence, we pinpointed that GBS and DNA arrays 546 are highly complementary for QTL detection because they tagged different haplotypes rather 547 than tagging different regions ( Figure 6). Based on the L-shaped MAF distribution, which 548 suggest no ascertainment bias, and the high number of sequenced lines used for the GBS, we 549 expect a closer representation of the variation present in our panel by this technology 550 27 compared to the 600K array, but this comes to the cost of an enrichment in rare alleles. Both 551 factors tend to counterbalance each other in terms of GWAS power. 552

553
Our results suggest that we did not reach saturation with our c. 800,000 SNPs because (i) 554 some haplotypes certainly remain not tagged (ii) the genome coverage was not complete, and 555 (iii) the number of significant SNPs and QTLs continued to increase with marker density 556 ( Figure 4). Considering LD and marker density, the genotypic data presently available were 557 most likely enough to well represent polymorphisms in the centromeric regions, whereas 558 using more markers would be beneficial for telomeric regions. New approaches based on 559 resequencing of representative lines and imputation are currently developed to achieve this 560 goal. 561 562

Plant Material and Phenotypic Data
The panel of 247 genotypes (Additional file 11: Table S6) Table S7). Adjusted means 579 of hybrids were combined with genotyping data of the lines to perform GWAS. 580

Genotyping and Genotyping-By-Sequencing Data
The inbred lines were genotyped using three technologies: a maize Illumina Infinium HD 50K 581 array [3], a maize Affymetrix Axiom 600K array [4], and Genotyping-By-Sequencing [2,35]. 582 In the arrays, DNA fragments are hybridized with probes attached to the array that flanked 583 SNPs that have been previously identified between inbred lines (Additional file 5: 584 Supplementary Text 1 for the description of the data from the two arrays). Genotyping-by-585 sequencing technology is based on multiplex resequencing of tagged DNA from different 586 individuals for which some genomic regions were targeted using restriction enzyme (Keygene 587 N.V. owns patents and patent applications protecting its Sequence Based Genotyping 588 technologies) [2]. Cornell Institute (NY, USA) processed raw sequence data using a multi-589 step Discovery and a one-top Production pipeline (TASSEL-GBS) in order to obtain genotypes 590 (Additional file 5: Supplementary Text 2). An imputation step of missing genotypes was 591 carried out by Cornell Institute [36], which utilized an algorithm that searches for the closest 592 neighbour in small SNP windows across the haplotype library [35], allowing for a 5% 593 29 mismatch. If the requirements were not met, the SNP was left ungenotyped for individuals. 594

595
We applied different filters (heterozygosity rate, missing data rate, minor allele frequency) for 596 a quality control of the genetic data before performing the diversity and association genetic 597 analyses. For GBS data, the filters were applied after imputation using the method 598 "Compilation of Cornell homozygous genotypes and Beagle genotypes" (GBS 5 in Additional 599 file 1: Figure S1; See section "Evaluating Genotyping and Imputation Quality"). We 600 eliminated markers that had an average heterozygosity and missing data rate higher than 0.15 601 and 0.20, respectively, and a Minor Allele Frequency (MAF) lower than 0.01 for the diversity 602 analyses and 0.05 for the GWAS. Individuals which had heterozygosity and/or missing data 603 rate higher than 0.06 and 0.10, respectively, were eliminated. 604 605

Evaluating Genotyping and Imputation Quality
Estimating the genotyping and imputation quality were performed using 245 lines since two 606 inbred lines have different seedlots between technologies. The 50K and the 600K arrays were 607 taken as reference to compare the concordance of genotyping (genotype matches) with the 608 imputation of GBS based on their position. While SNP position and orientation from GBS 609 were called on the reference maize genome B73 AGP_v2 (release 5a) [48], flanking 610 sequences of SNPs in the 50K array were primary aligned on the first maize genome reference 611 assembly B73 AGP_v1 (release 4a.53) [49]. Both position and orientation scaffold carrying 612 SNPs from the 50K array can be different in the AGP_v2, which could impair correct 613 comparison of genotype between the 50K array and GBS. Hence, we aligned flanking 614 sequences of SNPs from the 50K array on maize B73 AGP_v2 using the Basic Local 615 30 Alignment Search Tool (BLAST) to retrieve both positions and genotype in the same and 616 correct strand orientation (forward) to compare genotyping. The number of common markers 617 between the 50K/600K, 50K/GBS, GBS/600K and 50K/600K/GBS was 36,395, 7,018, 618 25,572 and 5,947 SNPs, respectively. The comparison of the genotyping and imputation 619 quality between the 50K/GBS, 50K/600K and 600K/GBS was done on 5,336 and 24,286 620 PANZEA markers [50] in common, and 26,154 markers in common, respectively. The 621 genotyping concordance of the 600K with the 50K array was extremely high (99.50%) but 622 slightly lower for heterozygotes (92.88%). In order to achieve these comparisons, we 623 considered the direct reads from GBS (GBS 1 ) and four approaches for imputation (GBS 2 to 624 GBS 5 , Additional file 1: Figure S1). GBS 2 approach consisted in one imputation step from the 625 direct read by Cornell University, using TASSEL software, but missing data was still present. 626 GBS 3 approach consisted in a genotype imputation of the whole missing data of the direct 627 read by Beagle v3 [13]. In GBS 4 , genotype imputation by Beagle was performed on Cornell 628 imputed data after replacing the heterozygous genotypes into missing data. GBS 5 , consisted in 629 homozygous genotypes of GBS 2 completed by values imputed in GBS 3 (Additional file 1: 630 Figure S1). 631

Diversity Analyses
After excluding the unplaced SNPs and applying the filtering criteria for the diversity 632 analyses (MAF > 0.01), we obtained the final genotyping data of the 247 lines with 44,729 633 SNPs from the 50K array, 506,662 SNPs from the 600K array, and 395,024 SNPs from the 634 GBS ( Figure 1). All markers of the 600K array and GBS 5 that passed the quality control were 635 used to perform the diversity analyses (estimation of Q genetic groups and K kinships). For 636 the 50K, we used only the PANZEA markers (29,257 SNPs) [50] in order to reduce the 31 ascertainment bias noted by Ganal et al. [3] when estimating Nei's index of diversity [51] and 638 relationship coefficients. Genotypic data generated by the three technologies were organized 639 as G matrices with N rows and L columns, N and L being the panel size and number of 640 markers, respectively. Genotype of individual i at marker l (G i,l ) was coded as 0 (the 641 homozygote for an arbitrarily chosen allele), 0.5 (heterozygote), or 1 (the other homozygote). 642 Identity-By-Descend (IBD) was estimated according to Astle and Balding [19]: 643 where p l is the frequency of the allele coded 1 of marker l in the panel of interest, i 645 and j indicate the inbred lines for which the kinship was estimated. We also estimated the 646 Identity-By-State (IBS) by estimating the proportion of shared alleles. For GWAS, we used 647 K_Chr [14] that are computed using similar formula as K_Freq, but with the genotyping data 648 of all the chromosomes except the chromosome of the SNP tested. This formula provides an 649 unbiased estimate of the kinship coefficient and weights by allelic frequency assuming Hardy-650 Weinberg equilibrium. Hence, relatedness is higher if two individuals share rare alleles than 651 common alleles. 652 653 Genetic structure was analysed using the sofware ADMIXTURE v1.22 [18] with a 654 number of groups varying from 2 to 10 for the three technologies. We compared assignation 655 by ADMIXTURE of inbred lines between the three technologies by estimating the proportion 656 of inbred lines consistently assigned between technologies two by two (50K vs GBS 5 , 50K vs 657 600K, 600K vs GBS 5 ) using a threshold of 0.5 for admixture.   Table S5: Recombination rate and proportion of low and high recombination regions. 1159 Average recombination rate ("RecRate") and proportion of the physical ("Phys") and genetic 1160 ("Genetic) map in low ("LowRec", <0.5 cM / Mbp) and high ("HighRec", >0.5 cM / Mbp) 1161 recombination regions for each chromosomes. "Chr" indicates the chromosome. Physical and 1162 genetic size columns indicated the size of each chromosome in bp and cM, respectively. 1163