Web interface
The WheatExp homepage includes a brief description of the database and project design as well as details of all currently available datasets (Fig. 1a). Our data processing pipeline allows for the rapid incorporation of complementary RNA-seq expression datasets as they are published and we invite suggestions for the addition of new datasets from the user community. We anticipate regular expansions of the database to broaden the range of developmental and temporal expression profiles included. This approach will maximize the utility of the database for researchers studying diverse aspects of wheat development and ensures access to the most relevant high-quality expression datasets.
From this main hub (Fig. 1a), the database can be queried in one of two ways; either by entering the DNA or protein sequence of a gene of interest as a BLAST query, or by a text search for a known gene ID from the Ensembl genomics annotation platform [5] (e.g. Traes_6AS_9E38A95CB.1) or for an annotated functional term associated with the gene’s encoded protein (e.g. “bHLH” or “Cytochrome P450”). For BLAST searches, results are displayed on a new page and include details of each BLAST alignment, sequence and a link to the corresponding gene ID page on the external Ensembl genomics hub for simple cross-referencing (Fig. 1b). A maximum of six matched results may be selected for side-by-side display within the same graph to allow simple comparisons between multiple genes. While this feature was originally implemented to enable comparisons among wheat homoeologues, any set of up to six genes may be selected for comparison, regardless of their relationship.
Likewise, when browsing using the text search function, up to six genes can be selected for addition to the results list, which can subsequently be viewed side-by-side in the results window. For larger-scale analyses, tabular expression data for any number of genes can be downloaded by providing a list of Ensembl gene IDs of interest. The functional terms associated with each gene are obtained through standard gene annotation files in GFF3 format from the IWGSC which are stored within the database for text search function. We chose to adhere to the widely-used standard gene nomenclature format employed by the IWGSC and Ensembl genomics platform [5] and selected the set of annotated cDNA sequences from this platform as our mapping reference. External links to the annotated sequences for each gene are included in the results. This nomenclature format is increasingly becoming the standard for gene annotations within the plant research community, so our use of this reference will allow for the simple translation between projects and will maintain complementarity with the IWGSC project. This will facilitate comparative genomics studies with model plant species and other economically-important crops, such as rice, barley and maize, as the genomic resources contained within the Ensembl platform in each of these species improves. Additionally, comparisons can be made with more distantly related species to analyze functional gene divergence during the course of evolution.
Graphical expression profiles from all datasets are presented on a single results page, displaying mean RPKM/FPKM values +/− Standard Error Mean (SEM) (Fig. 1c). Graphs can be downloaded in one of four image formats and data is also presented in an accompanying table, which can be exported in ‘.csv’ format (Fig. 1c). Gene-level expression data can be downloaded separately, or in bulk as a single tabular file containing all data.
Expression data
All expression profiles in WheatExp are generated from RNA-seq datasets. This approach has several advantages over existing expression studies derived from microarray data, which until recently, was the standard technology used for large-scale expression analysis (e.g. “Plant Expression database (PLEXdb)”, a database of microarray-based expression profiles in different plant species [20]). One of the advantages of RNA-seq is that it is an open platform that does not rely on predetermined sets of probes printed on a gene chip. In addition, this technology provides more reliable expression profiling across a broader dynamic range than is possible with microarrays.
An important advantage of the application of RNA-seq data in polyploid species is that it facilitates the distinction among homoeologues and recently-diverged paralogous genes by allowing the application of stringent read mapping thresholds. Our selection of only uniquely-mapped reads has the dual benefit that the expression data are not only robust, but also homoeologue-specific, since the differences between these genomes (average 97 % identical) are distinguished by the selected mapping parameters. This is illustrated in Fig. 2 by two examples: CIRCADIAN CLOCK ASSOCIATED1, where the expression of the three homoeologous genes is approximately equal (Fig. 2a) and CONSTANS1 where the D-genome homoeologue contributes the majority of transcripts to the overall expression (Fig. 2b).
Simulated RNA-seq data
One drawback of using uniquely-mapped RNA-seq reads for expression analysis is that any read which maps equally well to identical regions in different genes is discarded, potentially resulting in an underestimation of the expression levels of highly similar genes [21]. To determine the extent of this effect in our database, we performed a simulated RNA-seq experiment. We generated 29.4 M synthetic 100bp paired-end reads with random expression levels and Illumina HiSeq2000 error profiles (‘ART’, mode art-illumina, default parameters except -m 500, −s 100 –ss HS20 [22]). All reads were processed using the same pipeline as for all biological RNA-seq data. By comparing the known number of simulated reads with the number of mapped reads, we can determine for each contig the proportion of reads discarded during mapping. Using a set of 3,476 homoeologous triplets (=10,428 genes) identified from a previous study [7], we mapped the subset of reads originating from each homoeologue to a reference comprised only of their genome of origin (i.e. A-genome reads were mapped to A-genome transcripts etc.). For the A, B and D genomes, an average of 98.6, 98.4, and 98.4 % of reads mapped uniquely to their transcripts of origin, respectively, demonstrating that only a small proportion of reads are discarded during mapping when their homoeologous genes are absent from the reference. When we repeated the mapping of all generated reads to the full reference, unique mapping rates were reduced to 82.4, 83.6 and 80.6 % for the A, B and D homoeologous triplets. In each case, this was a slightly lower unique mapping rate than for all remaining transcripts in our dataset (84.4 %). Despite this reduction in the mapping rate, we observed a high level of correlation between the number of generated reads and the observed mapped reads (r = 0.95, 0.96, 0.95 for A, B and D homoeologous triplets, Fig. 3). Therefore, while the estimated expression levels of homoeologous genes in our database are, on average, slightly reduced due to their sequence similarity, the reported expression remains closely correlated with the true expression level. Furthermore, this effect is approximately equal for transcripts originating from the three homoeologous wheat genomes (Fig. 3), demonstrating the absence of bias when comparing homoeologue-specific expression profiles for a gene of interest.
Limitations
The main application of WheatExp is to compare the relative expression levels of the different homoeologues of a single gene across different tissues, developmental stages, environmental conditions and genetic backgrounds. For users interested in comparing the expression of different genes, we have included a statement on the website indicating that comparisons among genes are valid only when the genes being compared have the same number of homoeologues in the reference genome. Based upon results from our simulated RNA-seq experiment, genes where one homoeologue is absent from the reference will exhibit a higher proportion of uniquely-mapped reads and the expression levels of the two remaining homoeologues may also be inflated by the incorrect mapping of reads from the absent homoeologue. Additionally, no expression data will be reported for any genes which lack annotation within the current IWGSC release and any contig assemblies which are duplicated in the reference assembly will exhibit a reduced number of uniquely mapped reads. However, our project design allows for regular updates and refining of the mapping reference as this is expanded through the IWGSC project. As the mapping reference is improved we will re-map and re-process each dataset to generate updated expression sets using new versions of the reference, reducing the incidence and impact of such bias.
Our approach and data analysis pipeline can be applied to other polyploid species for which a homoeologue-specific genomic assembly is available to use as a reference. A critical parameter that must be considered in this application is the average level of identity among homoeologues, since this will affect the selection of the threshold for mapping uniquely mapped reads and thus the ability to discriminate between homoeologues.