Skip to main content

COMPILE: a GWAS computational pipeline for gene discovery in complex genomes



Genome-Wide Association Studies (GWAS) are used to identify genes and alleles that contribute to quantitative traits in large and genetically diverse populations. However, traits with complex genetic architectures create an enormous computational load for discovery of candidate genes with acceptable statistical certainty. We developed a streamlined computational pipeline for GWAS (COMPILE) to accelerate identification and annotation of candidate maize genes associated with a quantitative trait, and then matches maize genes to their closest rice and Arabidopsis homologs by sequence similarity.


COMPILE executed GWAS using a Mixed Linear Model that incorporated, without compression, recent advancements in population structure control, then linked significant Quantitative Trait Loci (QTL) to candidate genes and RNA regulatory elements contained in any genome. COMPILE was validated using published data to identify QTL associated with the traits of α-tocopherol biosynthesis and flowering time, and identified published candidate genes as well as additional genes and non-coding RNAs. We then applied COMPILE to 274 genotypes of the maize Goodman Association Panel to identify candidate loci contributing to resistance of maize stems to penetration by larvae of the European Corn Borer (Ostrinia nubilalis). Candidate genes included those that encode a gene of unknown function, WRKY and MYB-like transcriptional factors, receptor-kinase signaling, riboflavin synthesis, nucleotide-sugar interconversion, and prolyl hydroxylation. Expression of the gene of unknown function has been associated with pathogen stress in maize and in rice homologs closest in sequence identity.


The relative speed of data analysis using COMPILE allowed comparison of population size and compression. Limitations in population size and diversity are major constraints for a trait and are not overcome by increasing marker density. COMPILE is customizable and is readily adaptable for application to species with robust genomic and proteome databases.

Peer Review reports


Association mapping is commonly used to mine genetic diversity in large populations for identification of the genes and alleles underlying complex traits [1, 2]. Genome-wide association studies (GWAS) offer the advantage of high-resolution mapping without the requirement to create mapping populations [3]. Using GWAS for candidate gene identification in maize (Zea mays) presents several challenges because of a complex and dynamic genome containing numerous transposable elements [4, 5]. However, the richness of maize genetic diversity, exemplified by single-nucleotide polymorphisms identified across thousands of maize genomes, has allowed high-resolution mapping of traits to candidate genes by GWAS [6,7,8]. Sequenced polymorphisms are captured in several genotyped populations, including the Goodman Association Panel of 282 inbred lines and landraces (Goodman AP) [9], the Nested Association Mapping (NAM) Recombinant Inbred Line populations of 5600 lines derived from 26 parental lines intermated with B73 [10], and the 2815-member U.S. Department of Agriculture-Agricultural Research Service (USDA-ARS) North Central Regional Plant Introduction Station (NCRPIS) populations [8].

Robust identification of genes involved in complex traits is generally improved by using larger populations [8, 11]. However, the use of large numbers of lines densely populated with sequence markers magnifies the computational demand for GWAS data analysis. Several data reduction techniques have been applied to GWAS, such as the use of compression in the Mixed Linear Model (MLM) method [2], collapsing individuals of similar genotypes into clusters treated as individual taxa [12]. A significant additional effort is required to annotate candidate genes within QTL intervals and to determine sequence variation in both coding and non-coding sequences.

Here, we demonstrate a semi-automated computational tool we call ‘COMPILE’, by which GWAS, executed using a more advanced statistical approach pioneered by Rincent et al. [13], revealed significant markers associated with maize candidate gene sequences. These markers were associated with their adjacent genes, and these genes aligned with rice (Oryza sativa) and Arabidopsis (Arabidopsis thaliana) public proteome databases to gain context from sequence-similar proteins in those species. This approach is generalizable to any species with lists of genetic features and proteomes in GFF and FASTA formats, respectively. We also developed several scripting tools useful in generating annotated figures displaying GWAS results in the common “Manhattan” plot format, both genome-wide and in smaller chromosome regions at high-resolution.

The COMPILE automated pipeline dramatically increases the speed of data analysis for researchers interested in GWAS. We validated COMPILE using published data from Chen and Lipka [14] on genes involved in γ-tocopherol synthesis in the developing embryo of maize caryopses, and from Romay et al. [8], on flowering time in maize. In addition to the candidate genes identified in each of these studies, additional genes associated with each trait were identified. Based on transcript profiling, a long non-coding RNA expressed for only a few hours in the nucellus of the developing kernel is associated with tocopherol synthesis. As proof-of-concept, COMPILE identified new candidate genes in the Goodman AP that might contribute to resistance of maize stems to penetration by larvae of the European corn borer (Ostrinia nubilalis). We show here that population size and diversity are major limitations in defining QTL and candidate genes contributing to this resistance. These limitations cannot be overcome by increasing marker density.


COMPILE executes GWAS in an efficient one-step process

The COMPILE program executes GWAS by integration of phenotypic data with the K_chr Mixed Linear Model (MLM) without compression to generate a Manhattan plot, and then automatically searches for significant loci, matches those loci to adjacent maize genes, and finds the most similar sequences to those genes in rice and Arabidopsis (Dataset S1). This computational pipeline is adaptable to any species with structured genetically diverse populations that have been fully sequenced or well-populated with genetic markers. We built COMPILE on the GAPIT platform [15] using well-structured populations of maize as a test system. Use of COMPILE with established populations and published trait data confirmed the K_chr model [14], to be more robust for those populations. Use of the maize B73_RefGen_v4 genome assembly also allows non-coding RNA (ncRNA) features to be incorporated in the outputs.

Because COMPILE is built on the GAPIT platform, the parameters used within the GWAS are customizable. The system is a self-contained and self-constructing structure, including portable software installations and pre-installed R packages. COMPILE is made available on Github for ease of use and further development ( The speed of the analysis is improved in several ways: by pre-generation of the kinship matrices and annotation databases, by parallel execution of the MLM for all chromosomes, and by automated interpretation of significant results (Datasets S1-S3). The genotype files used in COMPILE are easily replaced with those produced or filtered by different parameters (different kinship algorithms, different marker filtering approaches, etc.), and we include scripts useful for processing these genotype files (Dataset S1). The parameters of the GAPIT execution are also modifiable within the COMPILE script (Dataset S2). Although we executed GWAS using a Mixed Linear Model without compression, the compression function can be customized into the analysis as described in Dataset S1. Association of identified significant markers is also possible either by selecting the nearest n genes or by specifying an LD window in base pairs, because LD can differ widely between different organisms. Different protein similarity databases could also be added in order to identify sequence-similar genes from other species.

Using a 7th generation Intel processor and 32GB of RAM, the approximate time to complete each analysis for a complete phenotype dataset (using the parallel chromosome processing feature of COMPILE when applicable) is approximately 5 min for the Goodman 2.7 version of COMPILE and 50 minutes for the NCRPIS 2.7 version of COMPILE.

To visualize chromosomal regions harboring QTL at high resolution takes 1 min per megabase using the FOCUS script. We observed a total compute time of ~ 2 days for full GWAS of the Goodman AP using the Goodman 3.2.1 marker set.

COMPILE identifies novel genes for γ-tocopherol synthesis in the caryopsis

We tested the analytical power and accuracy of COMPILE with published GWAS data obtained from the Goodman AP used to quantify the ratio of α- to γ-tocopherol in maize grain [16], as an example of a trait involving a small number of genes (Fig. 1, Table 1). This trait was among those revisited in Chen and Lipka [14], in their introduction of the K_chr model. We used a marker set similar to that of Chen and Lipka [14], but they used a random subset of 10% of their marker data to generate kinships, whereas we used the full set to generate kinships and did not include population structure covariates. Chen and Lipka [14] used the Benjamini-Hochberg protocol to control at 5% for significance. Consistent with the data from Lipka et al. [16], a SNP locus at about 205.8 Mbp on chromosome 5 lies within a window containing ZmVTE4, which COMPILE auto-annotated as a gene encoding a γ-tocopherol methyltransferase (Table 1). Three other genes were identified by Chen and Lipka [14] that were significant at a Benjamini-Hochberg FDR of 10%, an apparent cis-NAT pair of a transposase-like DUF659-containing protein and an EMBRYONIC FLOWER 1-like protein on Chromosome 8, and a gene of unknown function on Chromosome 7 (Table 1). However, COMPILE identified two additional strong QTL at an FDR of 5%, a MADS box-containing transcription factor gene (MADS36), homologous to those of Arabidopsis and rice related to floral identity and fruit development, and a long intergenic non-coding (linc) RNA that is the true cis-NAT pair with EMF1 on Chromosome 8 (Fig. S1).

Fig. 1
figure 1

Manhattan Plot showing GWAS results for ratio of α-tocopherol to γ-tocopherol conversion. Data from Lipka et al. [16] were input into COMPILE for GWAS analysis and gene discovery. Negative log10 p-values are plotted against physical position (B73_RefGen_v4) on all 10 chromosomes. Values in red are significant at Benjamini-Hochberg false discovery rate of 5%. A visual marker for the Bonferroni threshold (averaged from the individual threshold y-values of each chromosome) at (α = 0.1) is indicated by the blue horizontal line. The vertical line marks the position of the maize tocopherol O-methyltransferase (ZmVTE4). Dotted lines indicate the positions of genes identified by COMPILE not identified in the original study: a MADS-box 36 transcription factor (Zm00001d043589), a putative glycosyl transferase gene (Zm00001d019057), and a the long non-coding RNA occurred very early in embryo development, and only between 84 and 96 h post-pollination in the nucellus [17]

Table 1 Genes associated with QTL for α/γ-tocopherol ratio in maize kernels by COMPILEa

As the four genes identified by COMPILE were not indicated in the original study to be involved in tocopherol synthesis [16], we examined a compendium of metadata on Maize Expression Atlas available through ePlant (, which was assembled from published data for caryopsis development [18]. VTE4 expression in the developing caryopsis is largely confined to the embryo, scutellum and scutellar aleurone layer. All of the genes identified by COMPILE showed strong expression in tissues of the developing caryopsis (Fig. S2). Homologs closest in sequence in rice and Arabidopsis were annotated as floral identity genes, and the expression atlas showed high expression during kernel development. Expression of EMF1-like gene was highest in ear primordia and embryo, and during caryopsis development, but like VTE4, highest expression within the caryopsis was observed in the embryo, scutellum and scutellar aleurone layer, with moderate expression in the pericarp (Fig. S2). A gene encoding a DUF659-containing protein showed strong expression in the anthers, radicle and coleoptile, and embryo expression was confined to the aleurone adjacent to the scutellum. The putative glycosyl transferase gene displayed exceptionally high expression specifically in the scutellar aleurone layer.

COMPILE identifies known and novel genes associated with flowering time

We then applied COMPILE to published GWAS data obtained from the NCRPIS population used to identify loci that control the more complex trait of days-to-silking [8]. Using publicly available Best Linear Unbiased Polymorphisms (BLUPs) for the days-to-silking phenotype [8], COMPILE identified both known and novel QTL for this trait (Fig. 2a; Table 2). COMPILE recapitulated the findings of strong QTL attributed to known negative regulators of flowering time (Table 2). These include, on Chromosome 8, a Phosphatidyl-ethanolamine binding protein8 (PEBP8) gene (formerly called ZCN8 [19]), and the flowering repressor ZmRap2.7, and, on Chromosome 10, a Circadian-Clock Time, ZmCCT1, a gene that confers late-flowering (Fig. 2a). Also identified were, on Chromosome 3 coincident with a QTL identified in a maize NAM population, a gene encoding a DUF647-containing protein. The QTL cluster on chromosome 1, consisting of the sex- and internode-determinant Teosinte Branched1, flowering time-related Dwarf8, second photoperiodic response regulator Col2, and Phytochrome A1, and on chromosome 7 a QTLthe leucine-zipper transcription factor Delayed Flowering1 (DFL1) gene, were also identified, but at p-values above the Bonferroni cutoff.

Fig. 2
figure 2

Manhattan Plot showing NCRPIS 2.7 and Goodman 2.7 GWAS results for flowering time. Data from Romay et al. [8] were used in GWAS analysis and gene discovery. Negative log10 p-values are plotted against physical position (B73_RefGen_v4) on all 10 chromosomes. Markers significant at a Benjamini-Hochberg false discovery rate of 10% are shown in red. A visual marker for the Bonferroni threshold (averaged from the individual threshold y-values of each chromosome) at (α = 0.1) is indicated by the blue horizontal line. Vertical solid lines indicate positions of significant QTL and gene annotations for flowering time, as shown by Romay et al. [8]. Dotted lines indicate positions of novel genes identified. Gene identities are described in Romay et al. [8]. a Results for GWAS conducted using the 2279-member NCRPIS population and NCRPIS 2.7 marker data. b Results for GWAS conducted using the 282-member Goodman AP collection within the NCRPIS population and Goodman 2.7 marker data. c Results for GWAS conducted using the 282-member Goodman AP collection within in the NCRPIS population and the nearest neighbor of similar phenotype (564 lines) and NCRPIS 2.7 marker data. d Results for GWAS conducted using the 282-member Goodman AP collection within in the NCRPIS population and three nearest neighbors of similar phenotype (1128 lines) and NCPRIS 2.7 marker data

Table 2 Genes associated with QTL for growing-degree-day-adjusted days to flowering identified by COMPILEa

Six additional strong QTL were identified using COMPILE that were not observed by Romay et al. [8]. These included genes encoding on Chromosome 1, a C2H2 Zinc-finger domain protein, on Chromosome 2, a gene encoding a tyrosine kinase SH3 (Src-domain) protein involved in uncoating clathrin [20], on Chromosome 9, a gene encoding a cyclin-like F-box domain protein, and, on Chromosome 10, genes encoding transcription factor Teosinte branched1/Cincinnata/Proliferating cell factor (TCP), a TolB-like protein, and a Syntaxin81 (SYP81) t-SNARE-type protein (Fig. 2a; Table 2).

We then compared published relative expression data across plant organs for genes reported by Romay et al. [8], as contributing to the flowering time trait (Fig. S3) from a compendium of metadata for whole plant expression [21, 22], in the Maize Expression Atlas ( The Dwarf8, PhyA1, and DLF1 were more broadly expressed, with significant internode and ear primordia expression, whereas RAP2.7 was expressed in germinating grains and roots. Dwarf8 and PhyA1 were expressed in the silk tissues. TB1 was only expressed in ear primordia and female spikelets, whereas the CONSTANS-like Col2 and PEBP8 genes were expressed almost exclusively in leaves (Fig. S3). In addition to the genes contributing to flowering time identified by GWAS [8], COMPILE identified eight additional genes that might contribute to the trait. All but one gene was expressed in ear primordia and female spikelets, indicating an association with flowering organs. A gene encoding a C2H2-Zn finger-containing domain was expressed in the female spikelet, in ear primordia and embryo, while genes encoding an SH3-domain protein, a Syntaxin81 and a UbiE3 had highest relative expression in ear primordia (Fig. S4). The TolB-L gene had moderate expression in young leaves but was the only gene without significant expression in ear primordia or embryo.

Larval penetration by the European corn borer (Ostrinia nubilalis) is a quantitative trait within the Goodman AP

During the course of studies of maize stem development in the Goodman AP population, we observed during collection of the stover that wide variation existed in the extent of damage to the pith by the European corn borer (Ostrinia nubilalis). Stem damage during penetration of the larvae of European corn borer (Ostrinia nubilalis) at senescence varied widely among 274 field-grown genotypes of the Goodman AP collection (Fig. 3, a and b). The amount of internal damage in the pith correlated with the number of entry holes made by larvae along the internodes. We counted the number of entry holes created by larvae per unit of internode length and plotted the frequency of borer damage (Fig. 3c). Two-dimensional heat maps of field planting position against insect damage showed little to no correlation between damage and position within the field (Fig. S5), and quantile-quantile plots of the distribution of damage were generally linear except at extremes, which showed positive skewing (Fig. S6). Senescent internodes 4 and 5 were milled, and cell walls were isolated for analyses of cellulose and lignin composition. Pearson’s correlation coefficients measured for each trait showed insignificant correlation between insect damage and lignin composition (G-lignin R2 = 0.006; S-lignin R2 = 0.001), cellulose abundance (R2 = 0.003), and stover density (R2 = 0.000) (Fig. S7).

Fig. 3
figure 3

Micrographs of internodes of maize and histogram of insect damage. a Maize internodes without European corn borer damage. b Maize internodes with holes remaining after penetration of the corn borer larvae. c. Frequency distributions for insect damage in the Goodman AP. The two populations were normalized and assigned a damage index from 1, with zero damage, to 10, in increments of 0.032 holes/cm of internode length, with an average of 0.11 holes/cm of internode length, corresponding to an average index of 3.28. Mo17 with an index of 0.05 falls in bin 2, and B73 with an index of 0.15 falls into bin 5

Larval penetrance QTL contain candidate resistance genes

Several candidate genes for entry hole data were identified at the p-value (−log10 p) ≤ 1 × 10− 4 level in the Goodman AP (Fig. 4). The QTL with the lowest p-value, the only one significant with a Bonferroni threshold at α = 0.1, corresponded to a gene of unknown function located on Chromosome 9. The unknown protein of 123 amino acids and molecular weight of 12.6 kDa has a predicted signal peptide, and Kyte-Doolittle analysis across an 11-amino-acid window predicted that in addition to the signal peptide, one small hydrophobic domain is located upstream and one broad domain is located downstream of a central Arg- and Pro-rich cytoplasmic domain (Fig. S8a). Owing to the abundance of Arg residues throughout the protein, the pI is 10.3. Two rice homologs with over 90% sequence identity displayed similar hydropathy patterns and a search at NCBI using BLAST ( revealed homologs with high similarity mostly in grass species (Fig. S8, a-f). The rice sequences were nearly identical except for an extended N-terminus in the japonicum cultivar (Fig. S8, b and d). A shorter cytoplasmic domain occupies the N-terminal domain after the signal peptide, and the C-terminus contains a strongly hydrophilic SSRDDS common to a leucine-rich repeat Receptor-Like Kinase (RLK), but the proteins share no other sequence that would indicate homologous function. Neither maize nor rice sequences match any known protein. BLAST alignments show short sequences in the middle cytoplasmic regions align to a Ribonuclease Regulating Protein gene (RraA). A putative sugar transporter and a gene encoding a choline O-acetyltransferase (CHAT) domain-containing protein also contain short alignments in a strongly hydrophilic domain sandwiched by large hydrophobic domains, but overall these proteins are unrelated.

Fig. 4
figure 4

Manhattan Plot showing GWAS results for insect damage index in the Goodman AP using Goodman 2.7 data. Negative log10 p-values are plotted against physical position (B73 RefGen_v4). A visual marker for the Bonferroni threshold (averaged from the individual threshold y-values of each chromosome at (α = 0.1) is indicated by the blue horizontal line. One location, red circle, was significant. Gene identities are described in Table 3

Table 3 Genes associated with QTL for European corn borer stem penetration

Two QTL with near-significant p-values ≤1 × 10− 5 were located on Chromosome 2, including a Ser/Thr receptor kinase-like protein and a spectrin repeat-containing CHUP1 homolog involved in chloroplast movement and anchoring to the plasma membrane. On Chromosome 8 was a one-helix protein gene (OHP3) involved in photoprotection in the light-harvesting complex during light stress. Although significant to p-values of only ≤1 × 10− 4, on Chromosomes 1 and 2 were located WRKY31 and MYB-L transcription factor genes, respectively. Other genes of interest near significance included, on Chromosome 1, a Riboflavin synthase-like gene (RboS-L), on Chromosome 2, a Prolyl 4-hydroxylase (P4H1) gene involved in hydroxyproline synthesis, on Chromosome 3, an IAA-Ala Amino Hydrolase (IAH) gene involved in yielding IAA from amino acid conjugates, on Chromosome 5, a Guanosine Diphosphatase1 gene (GDA1) involved in transport of GDP-Mannose into the lumen of the Golgi apparatus, on Chromosome 7, a UDP-Glu:UDP-Gal 4-epimerase gene (UGE5), and, on Chromosome 10, a WRKY63-like transcription factor gene (Fig. 4, Table 3).

The incorporation of the high-density Goodman 3.2.1 marker set was particularly useful in discriminating between genes close to markers. As some of these QTL were not easily assignable to a single gene due to local genome architecture, we used results from both the 2.7 and 3.2.1 Goodman versions of the analysis. Examination of GWAS results at higher resolution enabled us to confidently map several QTL to genes, including, on Chromosome 9, the unknown gene (Fig. S9a), on Chromosome 10, a WRKY63 (Fig. S9b), and on Chromosome 8 a locus 3′ of the OHP3 gene and upstream of an Arabinogalactan-protein22-like gene (AGP22-L) (Fig. S9c). One near- significant location on Chromosome 1 is positioned 3′ of both a WRKY31 and a gene encoding a DUF1336-containing protein forming a cis-NAT pair (Fig. S9d).

Expression of candidate genes during maize stem development

In a previous study, we provided a comprehensive inventory of the cell-wall genes differentially expressed in rind tissues of individual internodes during the development of the maize B73 stem [23]. We defined several complex expression patterns associated primarily with elongation growth or secondary wall synthesis. Of the candidate genes identified by GWAS as contributing to larval penetration, IAH was more highly expressed in the lower internodes (Fig. 5a). The OHP3and RboS-L genes, although much lower in expression, were expressed in a secondary wall-related pattern. By contrast, P4H1 and UGE5 were expressed with greater relative expression during primary wall formation. The RLK-L and WRKY63 genes were weakly expressed in all internodes (Fig. 5a). The WRKY31, GDA1-L, MYB-L, AGP22-L, and the unknown gene were not expressed above 100 reads in any internode during the time-course of stem development.

Fig. 5
figure 5

Expression of genes associated with insect damage in developing stems of field-grown and greenhouse-grown maize. a Expression in maize B73 of genes associated with larval penetrations during stem development. Transcript levels in rind tissues from Internodes 9 through 2 from field-grown plants were normalized and compared as counts per 20 M reads. Values are the mean ± variance or S.D. of two or three independent rind collections, respectively. Genes with expression greater than 500 reads per 20 M were ordered by their ratio of expression (black diamonds) in secondary cell-wall-forming tissues (Internodes 5 and 4) to elongating tissue (Internodes 8 and 6). b Differential expression in maize B73 and Mo17 of genes associated with larval penetrations during stem development. Transcript levels in rind tissues of greenhouse-grown plants taken at elongation stages (Internodes 8 and 6) and secondary wall synthesis stages (Internodes 5 and 4) of each inbred were pooled and normalized and compared as counts per 20 M reads

We also compared expression between Mo17, which had a low borer damage index of 0.05 (bin 2), with that of B73, which gave an index of 0.15 (bin 5) (Fig. 3c). Greenhouse-grown plants free of insect pests were sampled at 49 days post-planting, where internodes 6 and 8 represented the elongation stage, and internodes 4 and 5 represented peak secondary wall formation [23]. Slight differences in expression between Mo17 and B73 were observed at either developmental stage with two exceptions. Transcripts of the RboS-L and IAH genes in B73 were less abundant at the secondary wall stage in greenhouse-grown plants than those sampled in the field, and the IAH, in particular, was expressed at over 10-fold higher levels in the field-grown plants (Fig. 5b). Both genes were much more highly expressed at both developmental stages in the greenhouse-grown Mo17 plants. Genes that were not expressed in B73 in field- or greenhouse-grown plants, were also not expressed in Mo17 plants (Fig. 5b).

As described for the genes identified in the Romay et al. [8] dataset, we probed the Maize Gene Expression Atlas for genes associated with larval stem damage. Five of the genes indicated by GWAS, including the unknown with the strongest p-value, exhibited insignificant expression in any internode of either B73 or Mo17 (Fig. 5). However, probing for whole plant organ expression, we found that all of them are expressed in other organs. Eight of the genes associated with larval damage exhibited significant expression in the internode, and five of these also showed strong expression in ear primordia and embryonal tissues. (Fig. S10). Seven other genes without significant internode expression displayed more specialized expression (Fig. S11). WRKY31 was also expressed only in roots, and expression is enhanced by drought stress but not pathogen stress. The AGP22-L gene was strongly expressed in mature leaves. Although we detected no expression in any internode, GDA1-L was broadly expressed across all organs. The Maize Gene Expression Atlas showed the unknown gene to exhibit solely highest relative expression in roots.

Although the Maize Gene Expression Atlas does not evaluate differential expression induced by insect damage, we probed the atlas for responses to abiotic and biotic stress as contributed by Hoopes et al. [22], Opitz et al. [24], Makarevitch et al. [25]. The unknown gene was unresponsive to abiotic drought, salt or temperature stress, but strongly upregulated in leaves upon infection with Colletotrichum or Cercospora fungi (Fig. S12a). Cell-wall related genes P4H and UGE5 also exhibited enhanced leaf expression upon fungal inoculation (Fig. S12, b and d). Enhanced expression of AGP22-L, IAH, and RLK-L was also recorded in young and old leaves, but expression of AGP22-L and IAH were also enhanced by salt stress (Fig. S12, c, e, and f).

Limitations of population size and marker density in establishing significant QTL

Candidate loci containing genes that might contribute to resistance to corn borer penetration achieved p-values of only ≤10− 5. Although the Goodman AP was sufficient to identify traits governed by a small number of genes (Fig. 4), we tested the impact of population size and structure on candidate gene identification using the complex trait of days-to-silking from Romay et al. [8]. When GWAS was applied to the Goodman AP alone represented in the 2815-member NCRPIS collection, five major loci were identified with p-values of ≥10− 6. To test the minimum population size needed for the major QTL to appear significant within the results, we created populations that paired the 282 Goodman AP lines with their nearest one and three neighboring lines from the NCRPIS population in trait value (564 lines and 1128 lines, respectively), and performed GWAS using the same NCRPIS 2.7 marker set [8]. The 564-line population was insufficient to improve the LOD values to below p = 10− 5, but the 1128-line population captured the PEBP8, RAP2.7, and CCT1 loci above the Bonferroni cut-off (Fig. 2). However, both subpopulations indicated other loci of significance not seen in GWAS of the full population.

We next determined if the LOD values could be improved for the PEBP8 and ZmRAP2.7 loci in the Goodman AP if marker density was increased. For this, we analyzed the region containing both QTL using the Goodman marker sets 2.7 and 3.2.1. The results using marker set Goodman 3.2.1 defined both PEBP8 and RAP2.7 within the 110 to 150 Mbp window of Chromosome 8, but with no improvement of LOD value (Fig. S13).


Validation of COMPILE

By integrating the K_chr model into the advanced GAPIT platform, we aimed to streamline GWAS with methods and scripts that rapidly generate the marker and kinship files (Dataset S1). We made these tools adaptable for any species with diverse populations. The introduction of the K_chr approach also presents the benefit of parallel processing, which reduces GAPIT run time by nearly an order of magnitude given sufficient memory and processing power. In addition, it alleviates the need for compression, which is another approach designed to avoid the confounding effect of testing a marker using a kinship structure including that marker. The construction of COMPILE also presents a major resource for rapid gene annotation, reducing a manual process to a simple automated process which executes within a few seconds for all candidate markers below a user-defined p-value. (Dataset S1). Our tools to plot GWAS results against genome architecture facilitate data visualization (Dataset S3). We found that all versions of COMPILE execute rapidly, even for large populations and marker sets.

We tested several parameters to probe the limitations of population size and marker density. Our reproduction of published GWAS highlights the improvement by larger populations that capture the broadest maize genetic diversity. Inclusion of additional markers did not increase statistical power in small populations, but they did increase resolution for identification of genes when using the FOCUS script. Our approach also highlights how small differences in protocols, such as marker set preparation and filtering approaches, kinship generation, and the inclusion of population structure covariates, can lead to different results with the same data. Although it is difficult to account for the best data preparation practices for every quantitative trait, the flexible framework of COMPILE and the methods made available here should allow fully customizable dataset preparation and GWAS execution.

We also used published expression data and our own RNAseq data during maize stem development at the time of corn borer infestation as additional validation that candidate genes identified were expressed at this time of development. In discriminating lignin abundance and saccharification yield in stover in two maize inbred lines, we performed whole genome differential expression to validate candidate genes within a QTL [26]. By coupling QTL mapping with transcriptome analysis, maize hypersensitive defense response genes that cluster in associated pathways were identified [27]. Differential expression of phenylpropanoid-related genes showed which were associated with lignin-abundance QTL [28], and combining QTL mapping with transcriptome sequencing revealed candidate genes for flowering time in Brassica species [29].

Novel genes associated with tocopherol synthesis and flowering time

COMPILE iterated identification of the VTE4 gene discovered by Chen and Lipka [14], as the major contributor to γ-tocopherol synthesis in the developing embryo of maize caryopses. According to the Maize Gene Expression Atlas (, additional genes associated with this trait, and identified by COMPILE were expressed in developing caryopses but with different patterns. Based on transcript profiling, the long non-coding RNA is expressed for only a few hours in the nucellus [17].

We also examined the expression profiles of the genes indicated by Romay et al. [8] to gain insight into their potential complexity of interaction. The two strongest candidates, RAP2.7 and PEBP8, had distinct expression patterns. Whereas RAP2.7 expression was primarily in root and germinating grains, PEBP8 was almost exclusively expressed in mature leaves (Fig. S3). Of the four genes clustered on Chromosome 1, DWARF8 and PhyA1 were broadly expressed, but Col2 expression was essentially confined to mature leaf and the female spikelet and TB1 had almost exclusive expression in the ear primordia and female spikelet (Fig. S3). By contrast, seven of the eight additional genes identified by COMPILE exhibited broad expression, including expression in ear primordia and female spikelet, indicating that they at least expressed in floral organs (Fig. S4). The exception was the TolB-like factor that had more broad expression, but was not expressed in ear primordia and young embryos.

Novel resistance genes for larval penetration

The European corn borer is a major maize pest in the US and Europe. Borers have a bivoltine growth habit, with the first generation feeding primarily on leaves and the second feeding on leaf sheath and stalk tissues [30]. Damage caused by stalk tunneling can result in broken stalks and lodging, but yield loss is attributed primarily to poor ear development. The challenge has been to identify the genetic basis for the second-generation larval damage [31]. Resistance to tunneling damage has been reported in inbred lines [32], and QTL have been established in recombinant inbred populations derived from resistant and susceptible parents [33, 34]. A laccase and a lignin-related cinnamoyl-3-hydroxylase were associated with tunneling resistance in a Multi-parent Advanced Generation InterCrosses (MAGIC) population [35].

Larvae must first penetrate the outer rind tissues to gain access to the pith, and differences in stalk strength and penetration resistance have also been evaluated as quantitative traits independent from pith tunneling [31, 36, 37]. The diversity of maize cell-wall hemicellulose, cellulose, and lignin composition [38], and hydroxycinnamate ester content [39], have also been targeted for QTL analysis for correlations with insect damage. When penetration resistance was assayed across Intermated B73 x Mo17 (IBM), NAM and NCRPIS populations, genes encoding a 4-Coumarate-CoA Ligase and a Caffeoyl-CoA O-methyltransferase were linked to loci for penetrometer resistance, whereas only a single Cellulose Synthase9 gene was identified [36]. We found little or no correlation between number of larval entrance holes and the bulk property of cellulose content or lignin content or composition (Fig. S7) and, thus, resistance to larval penetration must rely on more subtle alterations in wall composition and architecture. Nevertheless, rind penetrometer resistance, as a measure of stalk strength, indicate genes encoding enzymes of wall metabolism, such as a laccase, a UDP-GlcA decarboxylase, which generates UDP-Xyl, a pectin methylesterase, and a pectate lyase [40]. We found association with two cell wall-related genes, including prolyl 4-hydrolase1 (P4H1), which converts proline to hydroxyproline, and a UDP-Glc:UDP-Gal epimerase (UGE5) that interconverts these two nucleotide sugars (Fig. 4). The P4Hs are required for substrate synthesis for hydroxyproline-rich glycoproteins (HGRPs), which are known to play important roles in plant defense by insolubilizing and oxidatively cross-linking under stress, strengthening the cell wall [41]. Hydroxyprolines, along with Ara and Gal components of the glycans, are components of arabinogalactan-proteins (AGPs), which also play roles in defense [42]. UGE5 is essential in this synthesis [43].

Tunneling and stem damage resistance have been associated with genes of signaling pathways rather than those of wall synthesis. Among QTL for tunneling resistance were genes encoding a hexokinase1, a phospholipase A2, a histidine kinase, and a Ca2+/calmodulin-dependent protein kinase, whereas the stem damage was associated with an actin depolymerizing factor, an LRR receptor-like kinase, and two uncharacterized proteins [37]. However, lack of overlap between QTL defined in this study, and those by several others [33, 35, 36, 44], underscores the strong gene × environment variation in genes that contribute to resistance and the difficulties in comparisons between different populations, methods of data analysis, and trait proxies used as assays.

An RLK, CHUP1, and an OHP3 were significant at a p-value of ≤10− 5 (Fig. 4). The Arabidopsis and rice proteins closest in sequence to the RLK is De-Etiolated1 (DET1), whose products mediate cross-talk in abscisic-acid signaling pathways during growth and development [45]. Chloroplast Unusual Positioning 1 (CHUP1) encodes a chloroplast outer envelope protein that mediates anchorage to the plasma membrane and mediates chloroplast intracellular redistribution in response to light intensity [46], and one-helix proteins (OHPs) are part of the light-harvesting complex, where they participate in sensing of light intensity and energy dissipation, as well as triggering of photomorphogenesis [47]. If expression of these genes in the rind tissues in response to larval penetration, a broader role of stem chloroplast function beyond photosynthesis is inferred. Other significant hits with p-values ≤10− 4 include genes that encode WRKY and MYB regulators, enzymes for a member of the riboflavin synthase pathway, a hydrolase of IAA-amino acid conjugates, and GDA1 involved in GDP-Mannose transport. A wheat homolog closest in sequence to the Arabidopsis IAH gene (IAR3), TalAR3, was shown to have much stronger substrate specificity for the long side chain auxins indole butyric acid (IBA) and indole propionic acid (IPA) [48]. Nehela et al. [49] showed that these auxins were increased during herbivory by Diaphorina citri, especially when the insect vector transmitted a ‘Candidatus Liberibacter asiaticus’ (CLas) pathogen. They proposed a model where auxins increase the defense response in Citrus against both CLas and its insect vector, through cross talk between the salicylic and jasmonic acid pathways. The WRKY gene families of maize and their Arabidopsis and rice homologs play critical roles in biotic responses to prokaryotic and eukaryotic invasions, including those from insects [50], and specifically corn rootworm [51]. MYB factors are also implicated in abiotic and biotic resistance, including synthesis of metabolites toxic to insects [52], and MYB80 specifically in reactive-oxygen species (ROS) related to environmental stress [53].

Nitric oxide (NO) and ROS are induced upon attack by pathogens and mediate cell-wall cross-linking and the hypersensitive response [54]. Levels of riboflavin are linked with changes in NO [55], and ROS involved in pathogen defense [56]. However, the association of riboflavin with these responses is complex. Although synthesis of riboflavin is required for NO and ROS synthesis [57], down-regulation of free riboflavin induces ROS and pathogen defense [56]. In greenhouse-grown plants in the absence of insect challenge, RboS-L expression is over 10-fold higher in Mo17 than in B73 during secondary wall formation (Fig. 5b), and Mo17 stalks show about 3 times less insect damage in field-grown plants (Fig. 3c). Although some of the candidate genes might be false positives, others, including the unknown, P4H, UGE5, RLK-L, IAH, and AGP22-L, have been associated with responses to pathogen stress (Fig. S12).

Our most significant GWAS insect damage location was an unknown protein found to be a member of a family of genes mainly in grass species and not found in dicots by a BLASTp analysis of the entire plant protein sequence collection at NCBI. The closest by sequence rice homolog was found to be upregulated 4-fold when inoculated with the fungal pathogen Magnaporthe grisea isolate FR13 in rice leaves [58]. The same rice gene was found to be upregulated in rice mutants with an overexpressed calcium-dependent protein kinase that are salt and drought tolerant [59]. Thus, evidence exists for a role in pathogen biotic stress for the unknown gene that could be related to resistance to insect damage. Unlike RboS-L and IAH gene transcripts, which have a more constitutive expression as part of the larger salicylic acid and jasmonate pathways, the unknown gene may only be expressed at detectable levels during and at the location of the specific stress event.

Any candidate gene or cluster of genes identified by GWAS must eventually be validated by genetic functional analysis, for example, by over-expression or knockdown experiments or mutant analysis. In the absence of these experiments, identification of QTL that are in common with other independent GWAS studies offers some degree of confidence for further study. A meta-analysis across twenty-eight QTL studies in maize revealed 86 potential QTL involved in insect resistance [60]. Subsets of candidate genes were specific to tissue type or to one or more insect species. Ten of the 15 candidate penetration-resistance genes identified by COMPILE match genome locations from the meta-analysis, including our strongest candidate, the gene of unknown function on Chromosome 9 (Table S1).


The relative speed of data analysis using COMPILE allowed in-depth comparison of population size on trait analysis. Population size and diversity are major constraints for a trait and are not overcome by increasing marker density. COMPILE is customizable and is readily adaptable for application to species with robust, genotyped diversity panels and proteome databases. Even with the constraint of a small population size, novel candidate genes involved in tocopherol synthesis were identified. Application of COMPILE to a much larger maize population found novel candidate genes potentially contributing to flowering time. Using the Goodman AP, novel candidate genes were identified for potential resistance against penetration of the stem by the European corn borer. Analysis of candidate genes exhibiting different expression and levels of resistance both in silico using alternate pathogen systems and comparing inbred lines with varied stem damage gave greater confidence in their effect on corn borer damage. Stem penetration is a marker for one type of resistance to borer damage and identifies different sets of genes from those identified by resistance to pith damage.


Marker set construction and kinship matrix generation

Versions of our GWAS pipeline were generated for both the Goodman AP of 282 maize (Zea mays) inbreds and landraces [9], and the NCRPIS AP of 2815 maize lines [8]. For both the Goodman and NCRPIS APs, fully imputed data from the ZeaGBSv2.7 dataset [61], were converted to B73_RefGen_v4 (henceforth “v4”) coordinates and combined with markers from previous genotyping assays [2, 62,63,64], as described in Chen and Lipka [14]. All taxa names were standardized to the formats used in the phenotype data of Romay et al. [8], and a Perl script was developed to standardize the phenotype data format used (Dataset S1, Script 1). An additional version of the pipeline for the Goodman AP was generated using the HapMap 3.2.1 dataset (using the final marker set “Goodman 3.2.1”), which consisted of a much denser array of 83 million markers generated by Illumina paired-end sequencing [65]. Fully imputed marker data files using v4 coordinates were downloaded from Panzea ( and converted from VCF to HapMap format using a Perl script (Dataset S1, Script 2). All marker sets in this study were filtered to remove sites not anchored to the v4 reference genome, with > 75% missing genotype data, or with minor allele frequency (MAF) < 0.05.

Our Goodman 2.7 dataset contained ~ 310,000 markers, our NCRPIS 2.7 dataset ~ 231,000 markers, and our Goodman 3.2.1 dataset ~ 20.1 million markers. Final HapMap files for each of these three datasets were converted to the numerical format used by GAPIT ( [15];) using a Perl script, which conservatively imputed missing data with the major allele (Dataset S1, Script 3), and used to generate per-chromosome kinship matrices. The matrices were created using a separate Perl script (Dataset S1, Script 4) implementing the K_chr model [65], and the Loiselle pairwise kinship algorithm [66].

Generation of annotation databases

The maize gene list (Gramene 62, was filtered to contain only the chromosome, start/stop coordinates, and gene symbol for all transcripts and RNA features using a Perl script, which also converted transcript names to their corresponding protein names (Dataset S1, Script 5). A Perl script was used to create a list of maize genes and their midpoint coordinates from the maize gene list (Dataset 1, Script 6), and another Perl script was used to generate a file relating the position of each marker in the dataset to its nearest ten genes from the gene list (Dataset 1, Script 7). To generate a database relating each maize protein to its most sequence-similar protein in rice and Arabidopsis, the maize (B73_RefGen_V4,, rice (RAP-DB,, and Arabidopsis proteomes (Araport 11, were used. These FASTA files were each converted to BLAST databases using the NCBI standalone BLAST+ toolkit [67]. The BLAST+ toolkit and a pair of Perl scripts (Dataset S1, Scripts 8–9) were then used to obtain a database containing, for each maize gene, the gene name, the name and protein description of the closest BLAST matches, and their respective e-value and alignment scores.

Interpretation of GWAS results

Two methods were used for choosing significant SNPs from the results: 1) Bonferroni threshold [68], where the threshold p-value was defined as α/n, where α is the experiment-wise desired p-value and n is the number of statistical tests, i.e. the number of markers on a given chromosome, 2) a Benjamini-Hochberg false discovery rate (FDR) correction with a user-specified p-value, where each p-value was corrected according to the formula Q(i/m), where Q is the specified false discovery rate, i is the rank of the p-value (the smallest having rank 1), and m is the total number of tests [69]. Additional Perl scripts assembled the final reports, which contained the name, chromosome, and position of the original marker; the original and Benjamini-Hochberg FDR-corrected p-values of the marker; the distance to the midpoint of the closest gene, tRNA, or noncoding RNA; the identity of the genetic feature; the names of rice and Arabidopsis genes closest in sequence to the maize genes; and the description, alignment score, and e-value for each rice and Arabidopsis gene (Dataset S2, Scripts 1–3). For these analyses, significant markers were associated to genes by selecting the nearest 10 genes to the given marker, though the option also exists of specifying an LD window. Scripts were modified to perform the analysis upon the Goodman 2.7 dataset, the NCRPIS 2.7 dataset, and defined regions within the Goodman 3.2.1 and Goodman 2.7 datasets simultaneously (Dataset S2, Scripts 1–3).

Supplementary scripts and tools

Several additional Perl scripts and tools were developed to streamline the generation of figures and the analysis of GWAS results for this study. The first tool was developed to re-annotate the results according to different available significance thresholds of Bonferroni threshold, Benjamini-Hochberg correction, or manually-defined cutoff, as well as associate genes with significant markers either by LD window or nearest n genes (Dataset S3, Scripts 1). The second tool generates Manhattan plots with accurately placed vertical lines at genes and/or markers of interest for figure generation and annotation purposes (Dataset S3, Script 2). The third tool (Dataset S3, Script 3) aligns cognate Manhattan plots from low- and high-marker-density Goodman results with a customizable display of the genome architecture in a narrowly defined region, showing genes, non-coding RNA features, and exons. The fourth tool (Dataset S3, Script 4) is a generic script for genome architecture visualization easily generalizable for a given species. All scripts are available online on github at

Maize Stover collection and sample preparation

Genotypes comprising 274 lines of the Goodman Association Panel (AP) of 282 maize inbreds and landraces, including B73 and Mo17 [9], were originally obtained without restrictions from the Maize Genetics Cooperation Stock Center, at the University of Illinois, Urbana/Champaign (, and grown at the Purdue University Agricultural Center for Research Education. Lines in duplicate were distributed randomly across a rectangular plot, ten plants per 17-ft row with 30-in. spacing between rows and bordered by rows of B73 not included in the sampling. Plants received 200 kg hectare− 1 supplemental nitrogen. The lower 70 cm of senescent stems encompassing parts of internodes 3 through 6 from five field-dried plants per line, were harvested and air-dried at 50 °C. After scoring internodes 4 and 5 for the number of larval entry points, they were pooled and ground to 40-mesh in a Model 3 Wiley Mill (Thomas Wiley, Swedesboro, NJ). Milled samples were washed with warm 50% ethanol (v/v in water) and then warm water to remove soluble sugars and other metabolites, suspended in water, flash-frozen in liquid N2, and lyophilized.

Trait analyses

The number of holes generated by the corn borer larvae per unit length of internodes 4 and 5 were recorded as the quantitative trait in each of the stem stover samples for each genotype. The two populations were normalized and assigned a damage index from 1, with zero damage, to 10, in increments of 0.032 holes/cm of internode length, with an average of 0.11 holes/cm of internode length, corresponding to an average index of 3.28. Density, as packed volume of the milled stover after centrifugation, was determined before washing. One to 2 mL of milled stover were placed into pre-weighed 15-mL conical plastic centrifuge tubes (Corning) and the tubes with sample were weighed to the nearest 0.1 mg of sample and centrifuged for 2 min at 2500 rpm (1500 gmax) in a swinging-bucket rotor. Density was determined as mg mL− 1 packed volume.

Cellulose abundance was determined twice in triplicate as described previously [26]. Briefly, 5 ± 0.2 mg of cell wall material was hydrolyzed in 2 M trifluoroacetic acid at 120 °C for 90 min in borosilicate glass vials sealed with Teflon®-lined caps. The pellet was washed three times with water and adjusted to 1 mL for sampling. Cellulose was determined as Glc equivalents by phenol-sulfuric assay [70]. Technical replicates with variance greater than 5% were rerun.

Relative lignin abundance was determined by PyMBMS using a modified protocol from Sykes et al. [71], as described in Penning et al. [72]. Briefly, about 4 mg of freeze-dried cell wall material was added to an 80-μL stainless steel cup and pyrolyzed at 500 °C with a helium flow of 0.9 L min− 1 at standard temperature and pressure. Data acquisition time was 90 sec, but product evolution was essentially complete by 30 sec. A Merlin data acquisition system with 17-eV electron ionization was used to gather mass spectral data from m/z of 30–450 using a scan rate of 0.5 sec. Total ion counts were compiled from the 60 spectra obtained during the first 30 sec of product evolution. Relative abundance of fragments m/z 124, m/z 137, m/z 138, and m/z 151 (G-lignin) and m/z 154, m/z 167, m/z 168, and m/z 194 (S-lignin) were used estimate the proportion of total lignin [72].

RNA isolation and expression analysis

Maize (Zea mays) B73 and Mo17 lines were grown at the Purdue University Agricultural Center for Research and Education in West Lafayette, IN, and in campus greenhouses, as described by Penning et al. [23]. Rind tissues were harvested from internodes 2 through 9 of field-grown B73 plants grown for 35 to 63 days, at stages representing secondary cell wall biomass deposited in lower internodes to early and late elongation in upper internodes. A separate collection of rind tissues from B73 and Mo17 internodes 4 through 7, represented the same developmental stages of elongation through secondary wall development in greenhouse-grown plants 49 d after planting [23]. Rind tissues were flash-frozen in liquid N2, pulverized by mortar and pestle in liquid N2, and RNA extracted from approximately 2 mg of ground tissue in 1 mL of ice-cold TRIzol reagent (Invitrogen, Life Technologies).

Expression analysis was carried out as previously described by Penning et al. [23]. Briefly, cDNA libraries were constructed from pooled total RNA from three biological replicates and clustered on a HiSeq 2000 to produce paired-end 100 base sequences. High-quality trimmed sequences were mapped to the B73_RefGen_v2/3 from PlantGDB ( using Bowtie2 [73]. Because the high degree of gene duplication in maize, a custom Perl script was used was used to split duplicate reads between the two loci when they mapped together [26]. One count per million or greater was used as a threshold for the detection of transcript [74]. The RNAseq data are available at Expression of specific genes for specific internodes were found using Perl scripts described previously [26]. The averages and standard deviation or variance for each internode were calculated in Excel (

Additional published data on maize caryopsis and organ expression, and the degree of induction of expression by abiotic and pathogen stress, were obtained from the Maize Gene Expression Atlas at ePlant (

Availability of data and materials

All scripts used to generate COMPILE are in the three datasets included in this published article and its supplementary information files. The RNAseq data are available at NCBI with the following link:

Complete scripts for COMPILE are available online on github at





Goodman Association Panel


CONSTANS C-terminus-like transcriptional co-factor


choline O-acetyltransferase


Chloroplast unusual positioning1

CLas pathogen:

Candidatus Liberibacter asiaticus pathogen




Domain of unknown function


Guanosine Diphosphatase


Genome-Wide Association Studies


IAA-Ala Amino Hydrolase


Indole butyric acid


Indole propionic acid

LOD score:

Logarithm of the odds score


Mixed Linear Model


Proto-oncogene1-like transcription factor


Nested association mapping


North Central Regional Plant Introduction Station


nitric oxide


One-helix protein gene






Pyrolysis Molecular Beam-Mass Spectrometry


Quantitative trait locus (loci)


Apetala2-like transcription factor


Riboflavin synthase-like


Receptor-like kinase


Ribonuclease Regulating Protein


Reactive oxygen species


UDP-Glu:UDP-Gal 4-epimerase


WRKY sequence-containing transcription factor


  1. Remington DL, Thornsberry JM, Matsuoka Y, Wilson LM, Whitt SR, et al. Structure of linkage disequilibrium and phenotypic associations in the maize genome. Proc Natl Acad Sci U S A. 2001;98:11479–84.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  2. Yu JM, Pressoir G, Briggs WH, Bi IV, Yamasaki M, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006;38:203–8.

    CAS  Article  PubMed  Google Scholar 

  3. Brachi B, Morris GP, Borevitz JO. Genome-wide association studies in plants: the missing heritability is in the field. Genome Biol. 2011;12:232.

    Article  PubMed  PubMed Central  Google Scholar 

  4. Fu H, Dooner HK. Intraspecific violation of genetic colinearity and its implications in maize. Proc Natl Acad Sci U S A. 2002;99:9573–8.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  5. Morgante M, Brunner S, Pea G, Fengler K, Zuccolo A, Rafalski A. Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nat Genet. 2005;37:997–1002.

    CAS  Article  PubMed  Google Scholar 

  6. Tian F, Bradbury PJ, Brown PJ, Hung H, Sun Q, et al. Genome-wide association study of leaf architecture in the maize nested association mapping population. Nat Genet. 2011;43:159–62.

    CAS  Article  PubMed  Google Scholar 

  7. Yan J, Warburton M, Crouch J. Association mapping for enhancing maize (l.) genetic improvement. Crop Sci. 2011;51:433.

    Article  Google Scholar 

  8. Romay MC, Millard MJ, Glaubitz JC, Peiffer JA, Swarts KL, et al. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 2013;14:R55.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  9. Flint-Garcia SA, Thuillet AC, Yu J, Pressoir G, Romero SM, et al. Maize genes for lignin abundance and saccharification yield association population: a high-resolution platform for quantitative trait locus dissection. Plant J. 2005;44:1054–64.

    CAS  Article  PubMed  Google Scholar 

  10. Chia J-M, Song C, Bradbury PJ, Costich D, De Leon N, et al. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 2012;44:803–7.

    CAS  Article  PubMed  Google Scholar 

  11. Morrell PL, Buckler ES, Ross-Ibarra J. Crop genomics: advances and applications. Nat Rev Genet. 2011;13:85–96.

    CAS  Article  PubMed  Google Scholar 

  12. Zhang Z, Ersoz E, Lai C-Q, Todhunter RJ, Tewari HK, et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 2010;42:355–60.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  13. Rincent R, Moreau L, Monod H, Kuhn E, Melchinger AE, et al. Recovering power in association mapping panels with variable levels of linkage disequilibrium. Genetics. 2014;197:375–87.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Chen AH, Lipka AE. The use of targeted marker subsets to account for population structure and relatedness in genome-wide association studies of maize (Zea mays L.). G3 Genes Genom Genet. 2016;6:2365–74.

    CAS  Article  Google Scholar 

  15. Lipka AE, Tian F, Wang QS, Peiffer J, Li M, et al. GAPIT: genome association and prediction integrated tool. Bioinformatics. 2012;28:2397–9.

    CAS  Article  PubMed  Google Scholar 

  16. Lipka AE, Gore MA, Magallanes-Lundback M, Mesberg A, Lin HN, et al. Genome-wide association study and pathway-level analysis of tocochromanol levels in maize grain. G3. 2013;8:1287–99.

    Article  Google Scholar 

  17. Yi F, Gu W, Chen J, Song N, Gao X, Zhang X, et al. High temporal-resolution transcriptome landscape of early maize seed development. Plant Cell. 2019;31:974–92.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  18. Doll NM, Just J, Brunaud V, Caïus J, Grimault A, Depège-Fargeix N, et al. Transcriptomics at maize embryo/endosperm interfaces identifies a transcriptionally distinct endosperm subdomain adjacent to the embryo scutellum. Plant Cell. 2020;32:833–52.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  19. Zheng XM, Wu FQ, Zhang X, Lin QB, Wang J, et al. Evolution of the PEBP gene family and selective signature on FT-like clade. J System Evol. 2016;54:502–10.

    Article  Google Scholar 

  20. Lam BC-H, Sage TL, Bianchi F, Blumwald E. Role of SH3 domain-containing proteins in clathrin-mediated vesicle trafficking in Arabidopsis. Plant Cell. 2001;13:2499–512.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  21. Stelpflug SC, Sekhon RS, Vaillancourt B, Hirsch CN, Buell CR, De Leon N, et al. An expanded maize gene expression atlas based on RNA sequencing and its use to explore root development. Plant Genome. 2015;9:1.

    CAS  Article  Google Scholar 

  22. Hoopes GM, Hamilton JP, Wood JC, Esteban E, Pahsa A, Vaillancourt B, et al. An updated gene atlas for maize reveals organ-specific and stress-induced genes. Plant J. 2019;97:1154–67.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  23. Penning BW, Shiga TM, Klimek JF, SanMiguel PJ, Shreve J, et al. Expression profiles of cell-wall related genes vary broadly between two common maize inbreds during stem development. BMC Genomics. 2019;20:785.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  24. Opitz N, Paschold A, Marcon C, Malik WA, Lanz C, Piepho H-P, et al. Transcriptomic complexity in young maize primary roots in response to low water potentials. BMC Genomics. 2014;15:741.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  25. Makarevitch I, Waters AJ, West PT, Stitzer M, Hirsch CN, Ross-Ibarra J, et al. Transposable elements contribute to activation of maize genes in response to abiotic stress. Plos Genet. 2015;11:e1005566.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  26. Penning BW, Sykes RW, Babcock NC, Dugard CK, Held MA, et al. Genetic determinants for enzymatic digestion of lignocellulosic biomass are independent of those for lignin abundance in a maize recombinant inbred population. Plant Physiol. 2014;165:1475–87.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  27. Olukolu BA, Wang G-F, Vontimitta V, Venkata BP, Marla S, et al. A genome-wide association study of the maize hypersensitive defense response identifies genes that cluster in related pathways. Plos Genet. 2014;10(8):e1004562.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  28. Thomas J, Guillaumie S, Verdu C, Denoue D, Pichol M, Barrière. Cell wall phenylpropanoid-related gene expression in early maize recombinant inbred lines differing in parental alleles at a major lignin QTL position. Mol Breed. 2010;25:105–24.

    CAS  Article  Google Scholar 

  29. Jian H, Zhang A, Ma J, Wang T, Yang B, Shuang LS, et al. Joint QTL mapping and transcriptome sequencing analysis reveal candidate flowering time genes in Brassica napus L. BMC Genomics. 2019;20:21.

    Article  PubMed  PubMed Central  Google Scholar 

  30. Krakowsky MD, Lee M, Woodman-Clikeman WL, Long MJ, Sharopova N. QTL mapping of resistance to stalk tunneling by the European corn borer in RILs of maize population B73 x De811. Crop Sci. 2004;44:274–82.

    CAS  Article  Google Scholar 

  31. Flint-Garcia SA, Darrah LL, McMullen MD, Hibbard BE. Phenotypic versus marker-assisted selection for stalk strength and second-generation European corn borer resistance in maize. Theor Appl Genet. 2003;107:1331–6.

    CAS  Article  PubMed  Google Scholar 

  32. Barry D, Antonio AQ, Darrah LL. Registration of Mo45, Mo46, and Mo47 maize germplasm lines with resistance to European corn borer. Crop Sci. 1995;35:1232–3.

    Article  Google Scholar 

  33. Bohn M, Schulz B, Kreps R, Klein D, Melchinger AE. QTL mapping for resistance against the European corn borer (Ostrinia nubilalis H.) in early maturing European dent germplasm. Theor Appl Genet. 2000;101:907–17.

    CAS  Article  Google Scholar 

  34. Cardinal AJ, Lee M, Sharopova N, Woodman-Clikeman WL, Long MJ. Genetic mapping and analysis of quantitative trait loci for resistance to stalk tunneling by the European corn borer in maize. Crop Sci. 2001;41:835–45.

    CAS  Article  Google Scholar 

  35. Jiménez-Galindo JC, Malvar RA, Butrón A, Santiago R, Samayoa LF, et al. Mapping of resistance to corn borers in a MAGIC population of maize. BMC Plant Biol. 2019;19:431.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  36. Peiffer JA, Flint-Garcia SA, De Leon N, McMullen MD, Kaeppler SM, Buckler ES. The genetic architecture of maize stalk strength. Plos One. 2013;8:e67066.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  37. Samayoa LF, Malvar RA, Olukolu BA, Holland JB, Butrón A. Genome-wide association study reveals a set of genes associated with resistance to the Mediterranean corn borer (Sesamia nonagrioides L.) in a maize diversity panel. BMC Plant Biol. 2015;15:35.

    Article  PubMed  PubMed Central  Google Scholar 

  38. Santiago R, Barros-Rios J, Malvar RA. Impact of cell wall composition on maize resistance to pests and diseases. Int J Mol Sci. 2013;14:6960–80.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  39. López-Malvar A, Butrón A, Samayoa LF, Figueroa-Garrido DJ, Malvar RA, Santiago R. Genome-wide association analysis for maize stem cell wall-bound hydroxycinnamates. BMC Plant Biol. 2019;19:519.

    Article  PubMed  PubMed Central  Google Scholar 

  40. Li K, Yan J, Li J, Yang X. Genetic architecture of rind penetrometer resistance in two maize recombinant inbred lines. BMC Plant Biol. 2014;14:152

    CAS  Article  Google Scholar 

  41. Cannon MC, Terneus K, Hall Q, Tan L, Wang Y, Wegenhart BL, et al. Self-assembly of the plant cell wall requires an extensin scaffold. Proc Natl Acad Sci U S A. 2008;105:2226–31.

    Article  PubMed  PubMed Central  Google Scholar 

  42. Liu X, Wolfe R, Welch LR, Domozych DS, Popper ZA, Showalter AM. Bioinformatic identification and analysis of extensins in the plant kingdom. PLoS One. 2016;11(2):e0150177.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  43. Seifert GJ, Barber C, Wells B, Dolan L. Roberts K galactose biosynthesis in Arabidopsis: genetic evidence for substrate channeling from UDP-D-galactose into cell wall polymers. Curr Biol. 2002;12:1840–5.

    CAS  Article  PubMed  Google Scholar 

  44. Foiada F, Westermeier P, Kessel B, Ouzunova M, Wimmer V, et al. Improving resistance to the European corn borer: a comprehensive study in elite maize using QTL mapping and genome-wide prediction. Theor Appl Genet. 2015;128:875–91.

    CAS  Article  PubMed  Google Scholar 

  45. Zhang Z, Zheng X, Yang J, Messing J, Wu Y. Maize endosperm-specific transcription factors O2 and PBF network the regulation of protein and starch synthesis. Proc Natl Acad Sci U S A. 2016;113:10842–7.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  46. Oikawa K, Yamasato A, Kong S-G, Kasahara M, Nakai M, et al. Chloroplast outer envelope protein CHUP1 is essential for chloroplast anchorage to the plasma membrane and chloroplast movement. Plant Physiol. 2008;148:829–42.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  47. Hey D, Grimm B. ONE-HELIX PROTEIN2 (OHP2) is required for the stability of OHP1 and assembly factor HCF244 and is functionally linked to PSII biogenesis. Plant Physiol. 2018;177:1453–72.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  48. Campanella JJ, Olajide AF, Magnus V, Ludwig-Müller J. A novel auxin conjugate hydrolase from wheat with substrate specificity for longer side-chain auxin amide conjugates. Plant Physiol. 2004;135:2230–40.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  49. Nehela Y, Hijaz F, Elzaawely AA, El-Zahaby HM, Killiny N. Citrus phytohormonal response to Candidatus Liberibacter asiaticus and its vector Diaphorina citri. Physiol Molec Plant Path. 2018;102:24–35.

    CAS  Article  Google Scholar 

  50. Wei K, Chen J, Chen Y, Wua L-J, Xieb D. Multiple-strategy analyses of ZmWRKY subgroups and functional exploration of ZmWRKY genes in pathogen responses. Mol BioSyst. 2012;8:1940–9.

    CAS  Article  PubMed  Google Scholar 

  51. Lawrence SD, Novak NG, El Kayal W, Ju CJ-T, Cooke JEK. Root herbivory: molecular analysis of the maize transcriptome upon infestation by southern corn rootworm, Diabrotica undecimpunctata howardi. Physiol Plant. 2012;144:303–19.

    CAS  Article  PubMed  Google Scholar 

  52. Johnson ET, Dowd PF. Differentially enhanced insect resistance, at a cost, in Arabidopsis thaliana constitutively expressing a transcription factor of defensive metabolites. J Agric Food Chem. 2004;52:5135–8.

    CAS  Article  PubMed  Google Scholar 

  53. Chen X, Duan X, Wang S, Wu W, Zhang X. Virus-induced gene silencing (VIGS) for functional analysis of MYB80 gene involved in Solanum lycopersicum cold tolerance. Protoplasma. 2019;256:409–18.

    CAS  Article  PubMed  Google Scholar 

  54. Delledonne M, Xia Y, Dixon RA, Lamb C. Nitric oxide functions as a signal in plant disease resistance. Nature. 1998;394:585–9.

    CAS  Article  PubMed  Google Scholar 

  55. Yoshioka H, Mase K, Yoshioka M, Kobayashi M, Asai S. Regulatory mechanisms of nitric oxide and reactive oxygen species generation and their role in plant immunity. Nitric Oxide. 2011;25:216–21.

    CAS  Article  PubMed  Google Scholar 

  56. Deng B, Deng S, Sun F, Zhang S, Dong H. Down-regulation of free riboflavin content induces hydrogen peroxide and a pathogen defense in Arabidopsis. Plant Mol Biol. 2011;77:185–201.

    CAS  Article  PubMed  Google Scholar 

  57. Asai S, Mase K, Yoshioka H. A key enzyme for flavin synthesis is required for nitric oxide and reactive oxygen species production in disease resistance. Plant J. 2010;62:911–24.

    CAS  Article  PubMed  Google Scholar 

  58. Qi H, Jiang Z, Zhang K, Yang S, He F, Zhang Z. PlaD: a transcriptomics database for plant defense responses to pathogens, providing new insights into plant immune system. Genom Proteom Bioinform. 2018;16:283–93.

    Article  Google Scholar 

  59. Campo S, Baldrich P, Messeguer J, Lalanne E, Coca M, San SB. Overexpression of a calcium-dependent protein kinase confers salt and drought tolerance in rice by preventing membrane lipid peroxidation. Plant Physiol. 2014;165:688–704.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  60. Badji A, Otim M, Machida L, Odong T, Kwemoi DB, Okii D, et al. Maize combined insect resistance genomic regions and their co-localization with cell wall constituents revealed by tissue-specific QTL meta-analyses. Front Plant Sci. 2018;9:895.

    Article  Google Scholar 

  61. Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, Sun Q, et al. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS One. 2014;9:e90346.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  62. Cook JP, McMullen MD, Holland JB, Tian F, Bradbury PJ, et al. Genetic architecture of maize kernel composition in the nested association mapping and inbred association panels. Plant Physiol. 2012;158:824–34.

    CAS  Article  PubMed  Google Scholar 

  63. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, et al. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. Plos One. 2011;6:e19379.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  64. McMullen MD, Kresovich S, Villeda HS, Bradbury PJ, Li H, et al. Genetic properties of the maize nested association mapping population. Science. 2009;325:737–40.

    CAS  Article  PubMed  Google Scholar 

  65. Bukowski R, Guo X, Lu Y, Zou C, He C, et al. Construction of the third-generation Zea mays haplotype map. Gigascience. 2018;7:gix134.

    Article  Google Scholar 

  66. Loiselle BA, Sork VL, Nason J, Graham C. Spatial genetic structure of a tropical understory shrub, Psychotria officinalis (Rubiaceae). Am J Bot. 1995;82:1420–5.

    Article  Google Scholar 

  67. Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  68. Dunn OJ. Multiple comparisons among means. J Amer Stat Assoc. 1961;56:52–64.

    Article  Google Scholar 

  69. Benjamini Y, Hochberg Y. Controlling the false discovery rate – a practical and powerful approach to multiple testing. J Royal Stat Soc Ser B-Stat Meth. 1995;57:289–300.

    Article  Google Scholar 

  70. Dubois M, Gilles KA, Hamilton JK, Rebers PA, Smith F. Colorimetric method for determination of sugars and related substances. Anal Chem. 1956;28:350–6.

    CAS  Article  Google Scholar 

  71. Sykes R, Kodrzycki B, Tuskan G, Foutz K, Davis M. Within tree variability of lignin composition in Populus. Wood Sci Technol. 2008;42:649–61.

    CAS  Article  Google Scholar 

  72. Penning BW, Sykes RW, Babcock NC, Dugard CK, Klimek JF, et al. Validation of PyMBMS as a high throughput screen for lignin abundance in lignocellulosic biomass of grasses. Bioenerg Res. 2014;7:899–908.

    CAS  Article  Google Scholar 

  73. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  74. Li P, Ponnala L, Gandotra N, Wang L, Si Y, et al. The developmental dynamics of the maize leaf transcriptome. Nature Genet. 2010;42:1060–7.

    CAS  Article  PubMed  Google Scholar 

Download references


Not applicable.


We developed a computational pipeline COMPILE to accelerate identification and annotation of candidate genes in Genome-Wide Association Studies. COMPILE uses a Mixed Linear Model that links significant Quantitative Trait Loci (QTL) to candidate maize genes and RNA regulatory elements contained in the maize genome, and then matches maize genes to their closest rice and Arabidopsis homologs by sequence similarity. We validated COMPILE using published data to identify known and novel QTL and non-coding RNAs associated with α-tocopherol biosynthesis and flowering time. We then applied COMPILE to identify candidate loci contributing to resistance of maize stems to penetration by larvae of the European Corn Borer (Ostrinia nubilalis).


The Center for Direct Catalytic Conversion of Biomass to Biofuels (C3Bio), an Energy Frontier Research Center funded by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Award Number DE-SC0000997, supported of cell wall compositional and bioinformatic analyses. The U.S. Department of Energy, Office of Science, Feedstock Genomics Program, Office of Biological and Environmental Research, Award Number DE-FOA-0000598, supported the RNA deep-sequencing analyses. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data or in writing the manuscript. USDA is an equal opportunity provider and employer.

Author information

Authors and Affiliations



M.J.H designed and implemented the computational pipeline for GWAS; M.J.H. and B.W.P performed the borer damage and carbohydrate assays; B.W.P. performed the RNAseq and expression analyses; M.C.M. and N.C.C. directed the work and assisted in data analyses; all authors contributed to writing and editing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Nicholas C. Carpita.

Ethics declarations

Ethics approval and consent to participate

Genotypes comprising 274 lines of the Goodman Association Panel (AP) of 282 maize inbreds and landraces, including B73 and Mo17 [9], were originally obtained without restrictions from the Maize Genetics Cooperation Stock Center, at the University of Illinois, Urbana/Champaign (, and grown at the Purdue University Agricultural Center for Research Education. All local national or international guidelines and legislation were adhered to in the production this study.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Figure S1.

Association of a long intergenic non-coding RNA among QTL associated with α/γ tocopherol ratio in maize kernels. Figure S2. Expression of genes associated with α-tocopherol synthesis in developing caryopses. Figure S3. Expression profiles of genes associated with days-to-silking of genes identified by GWAS. Figure S4. Expression profiles of genes associated with days to silking identified by COMPILE. Figure S5. Heat map for larval penetrance in holes per cm internode length across field range and row. Figure S6. Quantile-Quantile plots for larval damage index in the Goodman AP population. Figure S7. Scatter plots comparing European corn borer damage index to cellulose abundance, G- and S-lignin abundance, and milled stover density. Figure S8. Hydropathy plots of maize and other grass proteins encoded by genes homologous to the candidate maize gene associated with resistance to larval penetrance. Figure S9. FOCUS plots of candidate genes among QTL associated with European Corn Borer damage in maize stalks. Figure S10. Expression profiles of genes associated with larval penetration with significant internode expression. Figure S11. Expression profiles of genes associated with larval penetration with little or no internode expression. Figure S12. Expression profiles of genes associated with larval penetration that are induced by abiotic and pathogen stress. Figure S13. Manhattan plots showing GWAS results for the 20Mbp Chromosome 8 region using both low- and high-density markers. Table S1. QTL associated with resistance to European corn borer penetration in maize stems compared with a meta-analysis of QTL associated with multiple stem and insect resistance and storage pests.

Additional file 2: Dataset S1.

Scripts 1-9, Dataset S2. Scripts 1-3. Dataset S3. Scripts 1-4.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Hill, M.J., Penning, B.W., McCann, M.C. et al. COMPILE: a GWAS computational pipeline for gene discovery in complex genomes. BMC Plant Biol 22, 315 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Zea mays
  • Maize
  • Genome
  • Computational biology
  • GWAS
  • QTL
  • γ-Tocopherol synthesis
  • Flowering time
  • Ostrinia nubilalis
  • European corn borer