Dissecting grain yield pathways and their interactions with grain dry matter content by a two-step correlation approach with maize seedling transcriptome

Background The importance of maize for human and animal nutrition, but also as a source for bio-energy is rapidly increasing. Maize yield is a quantitative trait controlled by many genes with small effects, spread throughout the genome. The precise location of the genes and the identity of the gene networks underlying maize grain yield is unknown. The objective of our study was to contribute to the knowledge of these genes and gene networks by transcription profiling with microarrays. Results We assessed the grain yield and grain dry matter content (an indicator for early maturity) of 98 maize hybrids in multi-environment field trials. The gene expression in seedlings of the parental inbred lines, which have four different genetic backgrounds, was assessed with genome-scale oligonucleotide arrays. We identified genes associated with grain yield and grain dry matter content using a newly developed two-step correlation approach and found overlapping gene networks for both traits. The underlying metabolic pathways and biological processes were elucidated. Genes involved in sucrose degradation and glycolysis, as well as genes involved in cell expansion and endocycle were found to be associated with grain yield. Conclusions Our results indicate that the capability of providing energy and substrates, as well as expanding the cell at the seedling stage, highly influences the grain yield of hybrids. Knowledge of these genes underlying grain yield in maize can contribute to the development of new high yielding varieties.


Background
Maize production in 2007 was about 800 million tonnes, more than rice or wheat (http://faostat.fao.org), and maize is likely to become the most important source for human nutrition by 2020 [1]. Conventional breeding approaches employing direct phenotypic selection with limited or no knowledge of the underlying morpho-physiological determinants have successfully improved yield [2]. Maize grain yield and its major components (kernel weight, kernel number per ear, and ear number per plant) have been studied by quantitative trait locus (QTL) mapping approaches [3]. The identified chromosome regions provide a starting point for further decoding the mechanisms affecting maize production. In European maize breeding, early maturity of high yielding varieties is an important breeding goal, since the short growing season limits productivity. Therefore, grain dry matter content, as an indicator for early maturity, is a major factor determining maize productivity.
Genes directly involved in grain yield, including those associated with grain number (e.g., OsCKX2), grain weight (e.g., GS3 and GW2) and grain filling, were identified in rice (see [4] for review). Further, genes indirectly associated with grain yield via plant height (e.g., Rht1, sd1, and BRI1) and tillering (e.g., TB1, FC1, and MOC1) were also identified. These findings underline the important roles of cell cycle, phytohormone signaling, carbohydrate supply, and the ubiquitin pathway and have increased our understanding of grain yield. However, the mechanisms and pathways controlling yield and yield-related traits still remain largely unknown.
Genome-scale oligonucleotide arrays have become a powerful tool in detecting the pathways and pathway interactions underlying biological processes. In maize, results on ear and kernel development have been reported [5,6]. However, no results focusing on maize yield or early maturity are available.
Our objectives were to investigate the genes and gene networks underlying grain yield in maize, and their interaction with genes underlying grain dry matter content, by employing a newly developed two-step correlation analysis that combines multi-environment field data and transcription profiles.

Grain yield-involved genes
The modified F-test with a false discovery rate (FDR) of 0.01 [7] revealed that 12,288 out of the 43,381 gene-oriented probes representing complementary maize genes were differentially expressed in the parental inbred lines of the 98 hybrids. For 10,810 of them, the fold change was greater than 1.3 and the log2 expression intensity was greater than 8.0. This set of significantly differentially expressed genes was subjected to further analyses. The average number of genes differentially expressed between the parents of a hybrid was 3350, which equals 7.7% of the genes on the array (see Additional file 1). The mid-parent expression level of 2511 differentially expressed genes was significantly (p < 0.01) correlated with hybrid performance (PY) or heterosis (HY) for grain yield. In Step 1 of the two-step selection approach (Figure 1), 540 genes were found to be highly significantly (p < 0.0001) correlated with PY or HY. In Step 2, an additional 205 genes were added to the set S of grain yield associated genes. The expression of 468 genes (62.8% of 745 genes) was positively and that of 277 genes (37.2%) negatively correlated with PY (see Additional file 2). Note, however, that these numbers are based on probes and may overestimate the actual number of differentially regulated genes, because there may not always be a one-to-one relationship between probes and genes.
In a cross validation procedure, three of the seven flint lines and five of the fourteen dent lines were randomly sampled with 100 repetitions. On average 190 of the 200 genes showing the strongest correlation with PY in the estimation set were among the set of the 200 genes with the strongest correlation in the complete data set. For HY the average number of agreeing genes was 185. This result confirms that the different genetic backgrounds of the inbred lines only marginally contributed to the random error in the correlation analysis.
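The resampling described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that an estimation set consists of the hybrids whose flint and dent parents were both sampled, and that genes are ranked by the absolute correlation between mid-parent expression and the trait. The names `pearson`, `top_genes` and `mean_top_overlap`, and the dictionary layout of the inputs, are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def top_genes(expr, trait, hybrids, k):
    """Return the k genes with the largest |r| between expression and trait.

    expr:  dict gene -> dict hybrid -> mid-parent expression level L
    trait: dict hybrid -> trait value (e.g. PY)
    """
    y = [trait[h] for h in hybrids]
    scores = {g: abs(pearson([vals[h] for h in hybrids], y))
              for g, vals in expr.items()}
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def mean_top_overlap(expr, trait, flint, dent, k=200,
                     n_flint=3, n_dent=5, rounds=100, seed=0):
    """Average number of top-k genes shared between estimation sets and the full set."""
    rng = random.Random(seed)
    all_hybrids = [(f, d) for f in flint for d in dent]
    reference = top_genes(expr, trait, all_hybrids, k)
    total = 0
    for _ in range(rounds):
        # Assumption: the estimation set is the sub-factorial of the sampled parents.
        sampled_flint = rng.sample(flint, n_flint)
        sampled_dent = rng.sample(dent, n_dent)
        subset = [(f, d) for f in sampled_flint for d in sampled_dent]
        total += len(top_genes(expr, trait, subset, k) & reference)
    return total / rounds
```

With the defaults (`k=200`, three flint and five dent lines, 100 rounds) the returned value corresponds to the kind of average agreement reported above; the actual ranking criterion used in the study may differ from this sketch.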

Interaction between grain yield and grain dry matter content associated genes
The negative correlation r(PY, PD) = -0.410 between hybrid performance for grain yield and grain dry matter content was significant (p = 0.002). This suggests that the gene networks involved in grain yield and grain dry matter content might be overlapping and negatively interacting with each other. Employing the two-step selection approach (Figure 1) we detected 622 genes associated with grain dry matter content. A total of 103 genes had an influence on both traits and had correlations of opposite sign with regard to grain dry matter content and grain yield (see Additional file 2). Some of these genes were located in the phytohormone signaling pathways (e.g., auxin-responsive factor, beta-glucosidase) and the flavonoid metabolism (e.g., isoflavone reductase, 2-hydroxyisoflavanone dehydratase; Table 1). Among the interacting genes, only 39 genes were identified in Step 1. However, 64 more genes were included in Step 2. About half of these additional genes were associated with only one trait (grain yield or grain dry matter content) at the 0.0001 level, but were highly correlated with a significant gene concerning the second trait.
Figure 1 Schematic representation of a two-step correlation approach. L, average expression level of a gene in the parents of a hybrid; g*, gene not included in set S in a previous repetition of Step 2; r, correlation coefficient; p, p-value for statistical significance; PY, hybrid performance for grain yield; HY, mid-parent heterosis for grain yield. Step 1: if r(L, PY) for gene g is significant (p < 0.0001), add gene g to the set S of genes involved in grain yield. Step 2: for all g in S and all g* not in S, if r(L_g*, L_g) > 0.9, add gene g* to set S and repeat; otherwise set S is complete.

Functional classification of trait-involved genes
To examine the functions of grain yield and grain dry matter content associated genes, these were grouped into functional categories based on the MIPS Functional Catalogue (Table 2, Additional file 2). The functional category METABOLISM contained most of the genes for both traits. For grain yield, it was followed by PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT and for grain dry matter content by CELL RESCUE, DEFENSE AND VIRULENCE. Furthermore, a large number of genes were related to processes involved in ENERGY. In Step 2 of the selection approach, additional genes in the categories CELL CYCLE AND DNA PROCESSING and CELL FATE were included in the set of grain yield associated genes, resulting in an enrichment of these two categories. The category CELL RESCUE, DEFENSE AND VIRULENCE included the largest number of genes associated with both traits.

Significantly regulated metabolic pathways
In an enrichment analysis of the grain yield associated genes with RiceCyc, we determined overrepresented pathways. These included sucrose degradation, cyclopropane and cyclopropene fatty acid biosynthesis, and plant respiration (Table 3, Additional file 2). Many grain yield associated genes were assigned to the pathways of glycolysis, fructose degradation to pyruvate and lactate, glucose fermentation to lactate, and the Calvin cycle. Two genes were involved in the biosynthesis of the growth hormone IAA; one of them was associated with both grain yield and grain dry matter content. One gene (MZ00042300), coding for a hexokinase involved in the degradation of sugars (e.g. sucrose), was associated with both traits (Figure 2).
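The text does not state which statistic was used to call a pathway overrepresented. A common choice for this kind of analysis is a one-sided hypergeometric test, sketched here as an assumption rather than as the procedure applied in the study. The function name `hypergeom_enrichment_p` and its parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_background: number of genes in the reference set
    n_pathway:    number of reference genes annotated to the pathway
    n_selected:   number of trait-associated genes
    n_overlap:    number of trait-associated genes annotated to the pathway
    """
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    # Sum the probabilities of observing n_overlap or more pathway genes by chance.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

Exact integer arithmetic keeps the sketch free of external dependencies; for many pathways tested at once, the resulting p-values would additionally need a multiple-testing correction.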

Maize transcriptome at seedling stage
Gene expression of the parental inbred lines was profiled at the seedling stage. This strategy largely reduced the variance during plant collection, since seedlings can be grown in large quantities under highly controlled conditions [9]. The seedling transcriptome employed in our study could not take into account trait-involved genes that are regulated by developmental and environmental conditions. However, from previous research [5,6,10] it is known that grain yield associated genes (Table 1) were also regulated in ear or kernel development or stress response. This supports the hypothesis that the relative expression patterns of grain yield associated genes have already been established in early developmental stages [11]. Therefore, the latent efficiency of these genes as determined at the seedling stage is expected to have a direct influence on grain yield.

Two-step selection of trait-involved genes
Our newly developed two-step correlation approach aims at identifying all genes associated with grain yield and grain dry matter content using our expression and field data. On the one hand, it detects the most relevant genes in Step 1 using the stringent significance level of p < 0.0001. On the other hand, it also includes further important genes with the less stringent significance level of p < 0.01 on the basis of co-expression (r > 0.9). Employing co-expression reduced the number of about 2500 genes, which were significant at the 0.01 level, to 640. In conclusion, the two-step approach allows a more focused detection of relevant genes with a possibly important biological significance than relying solely on a statistical significance level. In Step 1, only 39 genes associated with both traits were detected. This number would have been too small to examine the interaction between the pathways involved in both traits. However, the additional genes identified in Step 2 enabled us to decode major interaction networks of grain yield and grain dry matter content (Table 1).

Plant metabolism - sucrose degradation and glycolysis
Hexose phosphates derived from sucrose degradation are used to meet the energy and substrate requirements for plant growth. The finding that sucrose degradation was overrepresented in grain yield-involved genes (Table 3) suggests its significant role in maize production. Three genes encoding three types of invertases (MZ00005490, vacuolar invertase; MZ00026683, cytosolic invertase; MZ00033179, cell wall invertase) and one gene encoding a hexokinase (MZ00042300) were found to be positively associated with grain yield (Figure 2 and Table 1). This implies that sucrose degradation is up-regulated in high yielding hybrids, resulting in an increased hexose phosphate pool during the seedling stage (Figure 2). These results agree with the finding that the strong relationship between invertase activity and growth rate was largely explained by common chromosomal regions co-located with genes encoding invertase and other related enzymes [12].
A considerable number of grain yield associated genes were found to be involved in glycolysis, a pathway of integrated whole-plant metabolism that uses hexose phosphates (Table 3). PFK (MZ00013816, adenosine kinase/phosphofructokinase) is the principal enzyme regulating the entry of metabolites into glycolysis [13] through conversion of fructose-6-phosphate to fructose-1,6-bisphosphate. Its encoding gene was positively correlated with grain yield, indicating the up-regulation of glycolysis in high yielding hybrids. This result is supported by the fact that genes encoding the alpha and beta subunits of PFP (pyrophosphate-fructose 6-phosphate 1-phosphotransferase; MZ00024213 and MZ00024012, respectively), involved in the interconversion of fructose-6-phosphate and fructose-1,6-bisphosphate, were both positively correlated with grain yield. These findings suggest that glycolysis is involved in grain yield, and the up-regulation of glycolysis seems to be a downstream effect of the up-regulation of sucrose degradation. This results in an increase of hexose phosphate, supplying more energy and more substrates, which are necessary for strong seedling development. This deduction is supported by the fact that hexoses as well as sucrose have been recognized as important signal molecules in source-sink regulation and balance [14].
The relationship between carbohydrate metabolism and phytohormone signaling is illustrated by the fact that cytokinins enhance the gene expression of cell wall invertase and hexose uptake carriers [15]. One gene encoding a beta-glucosidase (MZ00035426) providing active cytokinins [16], one gene encoding a beta-glucosidase aggregating factor (MZ00013608), and a direct downstream gene of cytokinin (MZ00031351) encoding an A-type response regulator [17] were positively associated with grain yield (Table 1). This suggests that up-regulated carbohydrate metabolism could partially be the result of cytokinin signaling regulation.

Plant growth - cell expansion and endocycle
The growth of plant tissue generally proceeds in two stages: cell division, followed by cell expansion until differentiation is completed [18]. In an early developmental phase during endosperm development, cell division takes place and then organelle proliferation and cell expansion occur. In a later developmental phase, starch and proteins are deposited into the endosperm tissue. Through the total cell number and the size of the cells, the early developmental phase determines the final volume available for grain filling and consequently, in part, the final grain yield [19]. In our results, the marker genes of cell expansion encoding V-type H+-ATPase (MZ00013961) and aquaporins (MZ00043527) for water uptake [20], together with expansins (e.g. MZ00022872) and endo-1,3-beta-D-glucosidase (MZ00004156) for cell wall loosening [21], were positively associated with grain yield (Figure 3 and Table 1). This indicates that a high cell expansion rate in the seedling stage, and possibly also later in the early phase of endosperm development, is associated with high grain yield in hybrids. Larger cells, due to an increased cell expansion, have also been observed in maize roots of hybrids compared to their parental inbred lines [22]. The high expression of a gene (MZ00027266) encoding an FtsZ-like protein, which stimulates chloroplast division [23], indicates that hybrids with high grain yield may proliferate more chloroplasts along with cell expansion during seedling development and possibly also during endosperm development. This coincided with the regulation of genes located in the Calvin cycle and chlorophyllide a biosynthesis (Table 3).
DNA synthesis, persisting after transition to cell expansion without subsequent cell division (M-phase), leads to endocycle, which significantly contributes to cell expansion in higher plants (see [24] for review). The finding that the functional category CELL CYCLE AND DNA PROCESSING was overrepresented in grain yield associated genes (Table 2) suggests that this set of genes may play a significant role in grain yield regulation through their influence on endocycle, because most cells used for transcription profiling had already completed the cell division stage. For example, a gene (MZ00041750) encoding a DNA replication licensing factor and a gene (MZ00027598) encoding a subunit of a replication factor were positively associated with grain yield, which suggests that changes in the replication rate lead to alterations in the cell cycle of the hybrids. This deduction is also supported by the fact that several genes encoding enzymes involved in DNA repair were positively associated with grain yield. The ploidy level affects the cell size by increasing the metabolic output [25]. This supports the hypothesis that up-regulation of sucrose degradation and glycolysis in high yielding hybrids could be the result of a high ploidy level during cell expansion.
The endocycle is mediated by a down-regulation of cyclin-dependent kinase (CDK) activity in cells [25]. A gene (MZ00017440) encoding a B-type cyclin-dependent kinase (CDKB) was negatively associated with grain yield, implying that down-regulation of this CDKB could affect endocycle. Such a down-regulation could also be realized through less phosphorylation of CDK-inhibitors (ICK/KRPs) by CDKBs [26]. Another gene (MZ00021442), encoding an ICK/KRP, was positively associated with grain yield, which stimulates the endocycle by decreasing the CDK activity. The activation of the ubiquitin-proteasome pathway [25] is a further mechanism to decrease CDK activity. The genes (e.g. MZ00020431) encoding the anaphase-promoting complex (APC) and another gene (MZ00030283), which encodes an APC-activating protein and belongs to the CCS52A class [27], were positively associated with grain yield. This suggests that the APC-dependent proteasome pathway may influence the endocycle through the proteolysis of cyclins and regulation of cyclin/CDK complexes. This deduction is consistent with previous results, where higher expression levels of CCS52A coincided with higher levels of endocycle in Medicago nodules [27].
Cell expansion and endocycle are also controlled by further mechanisms. The orthologue of ZmDRP1A (MZ00014057) is a positive factor for cell expansion in Arabidopsis [28,29]. In our study, it was positively associated with grain yield. In contrast, the orthologue of ZmSMT2 (MZ00056596) in Arabidopsis impedes endocycle [30]. In our study, it was negatively associated with grain yield. This suggests a regulatory role of both genes in cell expansion during the maize seedling stage. Recently, a study demonstrated that transcriptional coactivators (AtMBF1s) play a significant role in controlling leaf cell expansion and the ploidy level [31]. In our results, a gene (MZ00003819; ZmMBF1c) encoding an orthologue of AtMBF1c was highly positively associated with grain yield and had a high fold-change across hybrids. This suggests that ZmMBF1c could significantly contribute to grain yield by controlling cell expansion along with regulating endocycle in the maize seedling.
Table legend: DG, differentially expressed genes; grain yield-involved, genes involved in grain yield; GDMC-interaction, the grain yield-involved genes which negatively interacted with grain dry matter content; GDMC-involved, genes involved in grain dry matter content; n, number of genes; p, p-value for statistical significance. The symbol "-" represents data unavailable. The numbers in boldface represent significance at p < 0.05. The percentages in italics represent the two largest categories in each set of genes.
Table legend: Grain yield-involved, genes involved in grain yield; GDMC-interaction, the grain yield-involved genes which negatively interacted with grain dry matter content; GDMC-involved, genes involved in grain dry matter content; n, number of genes; p, p-value for statistical significance. The symbol "-" represents data unavailable. The data in boldface represent significance at p < 0.05.
Auxin is a phytohormone that regulates cell expansion and is the most intensively studied of all phytohormones [32]. Four genes (MZ00038300, MZ00021497, MZ00024781 and MZ00044325) encoding auxin-responsive factors and two genes (MZ00040986 and MZ00026772) encoding proteins for IAA modification were associated with grain yield. Furthermore, two genes possibly involved in IAA synthesis were associated with grain yield, indicating that the auxin signaling pathway could directly contribute to grain yield of maize hybrids through cell expansion.

Overlap of pathways involved in grain yield and grain dry matter content
The fact that some metabolic genes were positively associated with grain yield but negatively associated with grain dry matter content suggests that overlaps exist at the metabolic level. A part of the grain yield associated genes located on regulatory or signaling pathways, such as the ubiquitin pathway or phytohormone pathways (Table 1 and Figure 3), were also associated with grain dry matter content, suggesting that regulatory genes involved in both traits are overlapping. When higher grain yield is achieved in breeding programs by accumulating genes positively associated with grain yield, these overlaps could lead to a decrease in grain dry matter content, resulting in higher post-harvest production costs due to artificial grain drying [3]. The selection of lines with a high expression of genes that are positively associated with one trait but not negatively associated with the other could result in a simultaneous increase of grain yield and grain dry matter content.

Conclusions
We found that a high expression of genes involved in cell expansion, assessed in the parental lines of hybrids, was positively correlated with high grain yield of the hybrids. Therefore, we hypothesize that hybrids with a high cell expansion rate have an advantage in growth and in grain development. At the same time, they probably can also provide more energy and substrates for growth, along with cell expansion. However, due to a negative correlation between grain yield and grain dry matter content, this latent ability of high yielding hybrids has a negative effect on grain dry matter content after harvest. Our study extends the understanding of the mechanisms underlying grain yield at the molecular level. The results suggest that selection of inbred lines after transcript profiling at the seedling stage can help increase selection efficiency in maize breeding.
The factorial crosses were evaluated in 2002 at six agroecologically diverse locations in Germany (Bad Krozingen, Eckartsweier, Hohenheim, Landau, Sünching, Vechta). The 21 inbred parents were evaluated for their per se performance in 2003 at four locations (Eckartsweier, Hohenheim, Sünching, Pocking) and in 2004 at three locations (Eckartsweier, Hohenheim, Bad Krozingen). The trials were evaluated in two-row plots using adjacent α designs with two to three replications. Hybrid performance for grain yield (PY) was assessed in Mg ha⁻¹ adjusted to 155 g kg⁻¹ grain moisture and hybrid performance for grain dry matter content (PD) in percent. The mid-parent heterosis of the hybrids for grain yield (HY) and grain dry matter content (HD) was determined. The field data were analyzed with a mixed linear model, which was described in detail in a previous study [33], where it was referred to as Experiment 1. The correlation between PY and PD was tested using a permutation test [34]. The distribution of the test statistic was approximated with Monte Carlo sampling using 9,999 samples.
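A Monte Carlo permutation test of a correlation, as mentioned above, could look roughly like the following sketch. It assumes a two-sided test on the absolute Pearson correlation and the common (count + 1) / (samples + 1) estimate of the p-value; neither detail is stated in the text. The function names and the `seed` parameter are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_test_correlation(x, y, n_samples=9999, seed=0):
    """Return (observed r, Monte Carlo p-value) for H0: x and y are unrelated."""
    rng = random.Random(seed)
    observed = pearson(x, y)
    y_perm = list(y)
    extreme = 0
    for _ in range(n_samples):
        # Shuffling y breaks any association with x while keeping both marginals.
        rng.shuffle(y_perm)
        if abs(pearson(x, y_perm)) >= abs(observed):
            extreme += 1
    # Assumption: the observed statistic is counted as one of the permutations.
    p_value = (extreme + 1) / (n_samples + 1)
    return observed, p_value
```

With 9,999 samples the smallest attainable p-value under this convention is 0.0001, which is why such odd-looking sample counts are often chosen.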

Microarray data
Seedlings of the 21 maize inbred lines were grown in a climate chamber under regulated growth conditions. RNA was isolated from a mixture of five seedlings of each line, which were 7 days old. The 46 k array from the maize oligonucleotide array project (http://www.maizearray.org/, University of Arizona, USA) was used for transcription profiling [7]. For the microarray experiment, an interwoven loop design [35] was applied. It resulted in 63 hybridizations of dent and flint lines by sampling each dent line five times and each flint line eight times. Blank and negative controls, which were located in all blocks of the array, were used to confirm the stability of the experiment. Because no Spike-in RNA was mixed into the isolated RNA, all Spike-in probes were used as blank or negative controls. For experimental validation of the microarray experiment, two genes in eight different lines were evaluated by quantitative RT-PCR; the results were essentially in accordance with the microarray data. The microarray data were deposited in Gene Expression Omnibus (GEO) under the series accession GSE17754.
Figure caption (beginning missing): [...] is represented by the numbers in the boxes. Positively (P) and negatively (N) associated genes are shown in brown and blue, respectively. The boxes with two frames show genes with interactions to grain dry matter content (GDMC). The representation of the cell cycle genes regulating endocycle was taken from a previous review [25].
The gene-oriented probes with intensities (on a log2 scale) greater than the average intensity plus three times the standard deviation of all Spike-in probes were considered to be reliably expressed. Genes were further analyzed for differential expression if their expression fold changes between at least one pair of parental lines were greater than 1.3. The gene-oriented probes together with Spike-in probes were tested for statistically significant differential expression across all comparisons with a moderated F-test and subsequently with a nested F-test for each comparison of parental lines. The LIMMA package [36] was applied for the tests. According to the most significant Spike-in probe with an adjusted p-value of 0.049, a false discovery rate (FDR) of 0.01 was chosen as a more conservative cutoff in order to detect significant differential expression between inbred lines. For each differentially expressed gene, we calculated the average L of the expression level (log2 scale) in the parents of each hybrid.
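The fold-change filter and the mid-parent expression level L can be illustrated with a short sketch. It assumes that expression values are stored per inbred line on the log2 scale, so that a fold change of 1.3 corresponds to a difference of log2(1.3), and that a hybrid is identified by its pair of parents. The function names and the data layout are hypothetical.

```python
from math import log2


def passes_fold_change(line_log2, min_fold=1.3):
    """True if at least one pair of lines differs by more than min_fold.

    line_log2: dict line -> log2 expression of one gene.
    Checking the largest against the smallest value covers every pair.
    """
    values = list(line_log2.values())
    return max(values) - min(values) > log2(min_fold)


def mid_parent_expression(line_log2, hybrids):
    """Average log2 expression L of the two parents for each hybrid.

    hybrids: iterable of (parent_1, parent_2) tuples.
    """
    return {(p1, p2): (line_log2[p1] + line_log2[p2]) / 2 for p1, p2 in hybrids}
```

Because the average is taken on the log2 scale, L corresponds to the geometric mean of the parental intensities on the original scale.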

Correlation analysis
The correlations r(L, PY), r(L, PD), r(L, HY), and r(L, HD) between the average expression level of a gene in the parental lines and the hybrid performance and heterosis for grain yield and grain dry matter content, respectively, were determined. Significance of the correlations was tested with a t-test with n - 2 degrees of freedom, where n = 98 is the number of hybrids in the factorial. A type I error rate of 0.01 adjusted for multiple testing using a false discovery rate [37] was employed and the p-value of each gene was adjusted accordingly. Confidence intervals for the correlations were determined based on the BCa (bias-corrected and accelerated) bootstrap (95% confidence level, 10,000 resamples) [38]. We employed a newly developed two-step correlation approach to identify genes associated with grain yield (Figure 1). In Step 1, all genes for which the correlations r(L, PY) or r(L, HY) were highly significant (p < 0.0001) were assigned to the set S. In Step 2, genes that were not included in set S in the previous step but were highly correlated (r > 0.9) with genes included in set S in the previous step were added to S. Step 2 was iteratively repeated until no new genes were added to set S.
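The iterative selection can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the p-values of the gene-trait correlations are taken as given, and Step 2 is restricted to candidate genes significant at a looser threshold (0.01, following the Discussion). The names `two_step_selection`, `p_strict`, `p_loose` and `r_min` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def two_step_selection(expr, p_values, p_strict=1e-4, p_loose=1e-2, r_min=0.9):
    """Return the set S of trait-associated genes.

    expr:     dict gene -> list of mid-parent expression levels L across hybrids
    p_values: dict gene -> p-value of the gene-trait correlation
    """
    # Step 1: genes whose correlation with the trait is highly significant.
    selected = {g for g, p in p_values.items() if p < p_strict}
    # Assumption: only genes significant at the looser level may enter in Step 2.
    pool = {g for g, p in p_values.items() if p < p_loose} - selected
    newly_added = set(selected)
    # Step 2: repeat until no further gene is co-expressed with a member of S.
    while newly_added:
        added = {
            g_star for g_star in pool
            if any(pearson(expr[g_star], expr[g]) > r_min for g in newly_added)
        }
        selected |= added
        pool -= added
        newly_added = added
    return selected
```

Comparing candidates only against the genes added in the previous round yields the same final set as comparing against all of S, because any gene co-expressed with an earlier member would already have been added.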
To determine a set of genes T associated with grain dry matter content we carried out a similar approach, but here only the correlations for hybrid performance r(L, PD) were considered in Step 1, because heterosis for grain dry matter content is low in maize [39].
The stability of the correlations was investigated with a cross validation procedure. In the cross validation, five dent and three flint lines were selected from the 7 × 14 factorial to compile the estimation set [40]. The set of trait associated genes was determined in the estimation sets generated by 100 rounds of cross validation. For each gene, it was determined how often it was assigned to the set of the trait associated genes in the 100 estimation sets. The genes were arranged according to this frequency and the sequence of the first 200 genes was compared to the sequence of the 200 genes with the smallest p-value determined from the complete data set. The difference between these two sets of genes was used as a measure for the instability of the correlations introduced by the genetic background.
[...] devised and planned the study, contributed to the lab analysis, and contributed to the writing of the paper. All authors read and approved the final manuscript.