Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case

Background: For most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, ‘omics’, and literature data. However, researchers encounter little guidance on how well they perform. Here, we used the recently sequenced potato genome as a case study. Potato was selected since its genome is newly sequenced and it is a non-model plant; at the same time, relatively ample information on individual potato genes exists, and multiple gene expression profiles are available. Results: We show that the automatic gene annotations of potato have low accuracy when compared to a “gold standard” based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases the number of highly co-expressed GO processes by over 50%, and agrees much more closely with the gold standard. Conclusions: We find that different annotation pipelines produce different results, and show how to integrate them into a unified annotation that is of higher quality than each single pipeline. We offer an improved functional annotation of both the PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of ‘omics’ datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.
Electronic supplementary material The online version of this article (doi:10.1186/s12870-014-0329-9) contains supplementary material, which is available to authorized users.

ITAG Gold Standard validation. For each annotation method (i.e., a pipeline or a variant of the ensemble algorithm) the F-measure of the Gold Standard validation is shown; see Materials and Methods for a full description of the F-measure calculation. A score of 1 means perfect agreement between an annotation flow and the Gold Standard; a score close to zero means that the annotation flow is not in line with the Gold Standard. (A) F-measure of the BP annotations. (B) F-measure of the MF annotations. The results show that in both BP and MF the ensemble algorithm improves the results considerably when k is 1 or 2.
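The F-measure referred to above is the standard harmonic mean of precision and recall; a minimal sketch over sets of (gene, GO term) pairs is shown below. The exact per-annotation calculation is described in Materials and Methods; the data in this example are illustrative.

```python
# Sketch of the F-measure between a predicted annotation set and the
# Gold Standard, both given as sets of (gene id, GO term) pairs.

def f_measure(predicted, gold):
    """Harmonic mean of precision and recall."""
    tp = len(predicted & gold)          # pairs the method got right
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)     # fraction of predictions that are correct
    recall = tp / len(gold)             # fraction of the Gold Standard recovered
    return 2 * precision * recall / (precision + recall)

predicted = {("g1", "GO:0006355"), ("g2", "GO:0009414")}
gold = {("g1", "GO:0006355"), ("g3", "GO:0005515")}
print(f_measure(predicted, gold))  # prints 0.5
```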

Supplemental Method 1. Composing the Gold Standard
The combined output of the BLAST2GO, Trinotate BLAST, Trinotate HMM, Phytozome, and OrthoMCL pipelines was checked against the set of identifiers of genes with known functional characterization. Only results with genes matching one in the set were processed. In order to manually assess them, we added the matching GO names and gene descriptions. Then we assigned scores to the results: 3 was assigned if the annotation was correct, 1 if it was not, and 2 if we were not able to say (neutral). The results with score 3 compose the Gold Standard. Figure 1 presents the initial assessment of the pipelines' outputs.

The bash scripts runGoldGenesMining_PGSC.sh and runGoldGenesMining_ITAG.sh search the parsed pipeline output for the results with genes matching any of the genes with experimental validation in the literature. Names of GO terms and descriptions of genes are added to the results, to make the manual assessment easier. The relevant fragment of the scripts (the two variants differ only in the awk command):

## 3. Standard "go.obo" file
##
## All files should be placed in the same (working) directory.
##
## The output file is placed in the "RESULTS" directory (created if it doesn't exist).
## Its name is a concatenation of the names of the first and second input files,
## with "-" between them and ".out" at the end:
## "RESULTS/[file_with_validated_gene_names]-[parsed_flow_output].out"

# PGSC variant:
awk -v i="$id" -v g="$go" -v n="$name" 'BEGIN{c=0} $1~i {printf i"\t"; for(j=2;j<=NF;++j) printf $j" "; print "\t"g"\t"n}' $gold_file

# ITAG variant (prints the second field in its own column):
awk -v i="$id" -v g="$go" -v n="$name" 'BEGIN{c=0} $1~i {printf i"\t"; printf $2"\t"; for(j=3;j<=NF;++j) printf $j" "; print "\t"g"\t"n}' $gold_file
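The filtering step above (keeping only results whose gene appears in the validated set) can be sketched as follows. The gene ids and GO terms in this example are illustrative; the published bash scripts remain the reference implementation.

```python
# Sketch: keep only pipeline results whose gene has experimental validation.

def filter_gold_candidates(rows, validated):
    """rows: (gene id, GO term) pairs from a parsed pipeline output.
    validated: set of gene ids with known functional characterization."""
    return [(gene, go) for gene, go in rows if gene in validated]

rows = [("geneA", "GO:0006355"), ("geneB", "GO:0009414"), ("geneC", "GO:0005515")]
validated = {"geneA", "geneC"}
print(filter_gold_candidates(rows, validated))
# prints [('geneA', 'GO:0006355'), ('geneC', 'GO:0005515')]
```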

RUN BLAST FOR 18 COMPLETE PLANT PROTEOMES AGAINST UNIPROT
As OrthoMCL does not assign GO terms to genes, we performed a BLASTP search of our protein sequences against the UniProt database and obtained the BLASTP output in tabular format.
Here is an example of the first lines of the output file for the A. thaliana proteome (we have one BLAST output per proteome). All the BLAST output files must be placed in the same directory and should have the extension .blast (for example "Athaliana_Uniprot.blast").
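A minimal sketch of reading such a tabular BLASTP file follows, assuming NCBI's standard tabular layout (query id, subject id, %identity, ..., e-value, bit score) and that hits for each query are listed best-first, as BLAST emits them. Only the 10 best UniProt hits per query are kept, matching the next step of the pipeline; the example lines are illustrative.

```python
# Sketch: collect up to the n best UniProt hits per query from a tabular
# BLASTP output, relying on BLAST's best-first ordering of hits per query.
from collections import defaultdict

def top_hits(lines, n=10):
    hits = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if len(hits[query]) < n:
            hits[query].append(subject)
    return hits

example = ["AT1G01010.1\tP12345\t98.5\t1e-100",
           "AT1G01010.1\tQ67890\t90.1\t1e-80"]
print(top_hits(example)["AT1G01010.1"])  # prints ['P12345', 'Q67890']
```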

RETRIEVE GO TERMS FOR ALL ANNOTATED UNIPROT ENTRIES
We then retrieve the GO terms from the UniProt entries of the 10 best UniProt hits and create another tabular file using the Java program AllBio_GetGOTermsFromSwissProt. Program calls:

javac AllBio_GetGOTermsFromSwissProt.java   (program compilation)
java AllBio_GetGOTermsFromSwissProt -i /path_to_the_Swissprot_db/ -o /path_to_the_out_file/

The first column is the UniProt ID of the hit; the second column contains a list of GO terms and GO IDs associated with this hit, separated by ; (semicolon). In this example we call it Swissprot_GO.txt. The first lines of this file look like this:
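A sketch of parsing this two-column file (UniProt ID, then a semicolon-separated list of GO terms and GO IDs) is given below. Only the layout described above is assumed; the entry contents in the example are illustrative.

```python
# Sketch: map each UniProt id to its list of GO annotations, given lines of
# the form "uniprot_id <TAB> term1 [GO:...]; term2 [GO:...]; ...".

def parse_uniprot_go(lines):
    annotations = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        uniprot_id, terms = line.split("\t", 1)
        annotations[uniprot_id] = [t.strip() for t in terms.split(";") if t.strip()]
    return annotations

example = ["P12345\tresponse to water deprivation [GO:0009414]; nucleus [GO:0005634]"]
print(parse_uniprot_go(example)["P12345"])
# prints ['response to water deprivation [GO:0009414]', 'nucleus [GO:0005634]']
```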

Supplemental Method 3. The Trinotate pipeline
Here we describe the steps necessary to perform the annotation with Trinotate. We used default settings throughout to annotate the input data with NCBI-BLAST (against the SwissProt database) and the HMMER tool (against the Pfam database). Trinotate uses specific releases of these databases. The results were collected with Trinotate, according to the guidelines (http://trinotate.sourceforge.net). We performed the annotation for two sets of gene models: • ITAG. The file with coding sequences was used as the main input: potato.Sotub.cds.itag.v1.fasta For that reason some comments are specific to these inputs. In general, the input with coding sequences is referred to as transcripts.fasta, while the input with peptide representatives is referred to as peptide.fasta. As Trinotate is designed to annotate transcripts produced by Trinity, some additional pre-processing is necessary if the input comes from another source.
All scripts can be run in a UNIX environment. Comments include additional hints on running the pipeline.

EXTRACTION OF GO-GENE PAIRS
We used a Python script with a main script (parse_trinotate_output.py) and a helper sub-script (helper_methods.py) to parse trinotate_output.txt into two files, one with annotations made with NCBI-BLAST and one with annotations made with HMMER. Both scripts are available in the Supplementary Materials. Format of each file: gene id, GO term, and score (tab-delimited). The core of the Pfam-based parsing, excerpted from the script's per-line loop:

###### Analyze the Pfam-based predictions #####
# step 1. Get the association E-scores
pfam_raw_info = arr[5]
# The actual E-score always starts with "E:"
score_regex = "E:"
scores_raw = re.split(score_regex, pfam_raw_info)
scores = []
# The E-score of the association ends with ' (or when the line ends)
for i in range(1, len(scores_raw)):
    scores.append(scores_raw[i].split("'")[0])
# step 2. Get the pfam ids for each association and get its score
# Note: if a certain pfam id has more than one E-score, we keep the lowest
pfam_col = re.split(r"\^|\W", pfam_raw_info)
# this hash keeps a mapping from pfam ids to their scores
pfam = {}
# this counter keeps where we are in the scores list
hit_counter = 0
# a pfam id starts with "PF" followed by numbers
for p in pfam_col:
    if not re.search(r"^PF\d+", p):
        continue
    # trinotate's annotations have the pfam id, then a dot, then a number;
    # to map pfam ids to GO terms we only need the prefix
    p = re.split(r"\.", p)[0]
    # get the minimal score
    pfam[p] = min(pfam.get(p, 1), float(scores[hit_counter]))
    hit_counter += 1
# this is a test for QA: if the pointer "hit_counter" did not
# reach the last value in scores then we have an error;
# note that if the length of scores is lower than the number
# of observed pfam ids then an exception will be thrown (i.e., the
# run stops with an error)
if hit_counter != len(scores):
    break
# step 3. go over the pfam ids, map them to GO terms
# and write the results to the output file (including the scores)
for p in pfam:
    if p not in pfam2go:
        continue
    escore = pfam[p]
    for g in pfam2go[p]:
        out1.write(gene + "\t" + g + "\t" + str(escore) + "\n")

###### Analyze the GO predictions #####
# here we analyze the column that directly predicts
# GO terms for each gene

Supplemental Method 4. Preprocessing of gene expression data
We collected potato expression profiles from over 20 studies covering 326 experimental conditions. The raw data contained 52,998 probes. We applied quantile normalization using the Limma package (Smyth, 2005) and subtracted the background intensity from the foreground intensity for each spot using the normexp method (Ritchie et al., 2007). We removed probes whose expression level was consistently low over the experimental conditions, thus reducing statistical noise when performing co-expression analysis (Tzfadia et al., 2012). We set an intensity threshold of 204 based on the histogram curve of the normalized intensities (see Fig. 2), and removed probes whose expression was lower than the threshold in all experiments. By additionally setting a standard deviation threshold of 312, we reduced the number of probes from 24,434 to 14,000 (see Fig. 3). These probes were mapped to 12,956 genes, approximately the same number of genes analyzed in Tzfadia et al. (2012).
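The low-intensity filter described above can be sketched as follows: a probe is removed only if its normalized intensity is below the threshold (204) in all experimental conditions. The actual analysis was performed in R with the Limma package; probe names and values in this example are illustrative.

```python
# Sketch of the low-intensity probe filter: keep a probe if it passes the
# intensity threshold in at least one condition.

def filter_low_intensity(expression, threshold=204):
    """expression: dict mapping probe id -> intensities across conditions."""
    return {probe: values
            for probe, values in expression.items()
            if any(v >= threshold for v in values)}

expression = {"probe_a": [150, 300, 210],  # kept: passes the threshold somewhere
              "probe_b": [90, 120, 180]}   # removed: below 204 in all conditions
print(sorted(filter_low_intensity(expression)))  # prints ['probe_a']
```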

Supplemental Method 5. How to use the R scripts
This document describes the scripts for comparing pipelines, merging pipelines, and validations using gene expression data and a list of gold standard annotations. All scripts were written in R (2.15.1) and were tested both in Windows and UNIX.

REQUIREMENTS
The calculations of our tools can be computationally demanding. For example, to get GO-based similarity measures between pipelines we calculate semantic similarities between all GO terms that are present in the flows; in the potato data, this meant calculating similarities among >5000 GO biological process terms using the GOSemSim R package. As a rule of thumb, we recommend using a machine with at least 8GB of RAM. To run the scripts the following R packages should be installed: hash, GO.db, GOSemSim, corrplot, and multicore (optional, required only for running in Unix).

INPUT
To run our analyses the user should create a directory and put all data in three sub-directories:
• flows - output of the pipelines. For each pipeline add a tab-delimited file with two columns: gene id and GO term id.
• gs - gold standards. For each gold standard add a tab-delimited file with two columns: gene id and GO term id.
• ge_data - gene expression datasets. Each gene expression dataset is a tab-delimited text file in which columns are conditions and rows are genes. The first row contains the condition names (i.e., the header of the table) and the first column contains the gene names.

R FILES
• flow_comp_methods.R: all auxiliary functions. All data loading processes, measurements, and algorithms are implemented in this script.
• MAIN.R: the main script that runs the analysis. This script receives as input the name of the folder that contains all the data (divided into the three sub-directories explained above) and runs the following stages:
a. Read the flows and remove duplications (genes and GO terms).
b. Split the flows by GO types: MF, BP, and CC.
c. Get GO-free similarity scores between the pipelines.
d. Get GO-based similarity scores between the pipelines.
e. Get the gold standard validation scores for each flow (GO-based and GO-free).
f. Get the gene expression validation scores for each flow.
g. Calculate the ensemble solutions of the pipelines.
h. Validate the ensembles (gold standard and gene expression).
i. Save all the results in RData files (see details in Output R objects). These files will be saved in a new directory called output_objects.
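The unsupervised ensemble computed in stage g can be sketched as a simple support-counting rule: keep a gene-GO pair if at least k pipelines support it. The published R implementation also exploits the GO structure (ancestor terms), which is omitted here; all annotation sets below are illustrative.

```python
# Minimal sketch of the Ensemble:k idea, ignoring GO-structure propagation.
from collections import defaultdict

def ensemble_k(pipelines, k=2):
    """pipelines: list of sets of (gene id, GO term) pairs, one per pipeline.
    Returns the pairs supported by at least k pipelines."""
    support = defaultdict(int)
    for annotations in pipelines:
        for pair in annotations:
            support[pair] += 1
    return {pair for pair, count in support.items() if count >= k}

p1 = {("g1", "GO:0006355"), ("g2", "GO:0009414")}
p2 = {("g1", "GO:0006355"), ("g3", "GO:0005515")}
p3 = {("g1", "GO:0006355")}
print(sorted(ensemble_k([p1, p2, p3], k=2)))  # prints [('g1', 'GO:0006355')]
```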
Since the supervised approach uses the gold standard annotations as input, we compared the two ensemble methods only based on gene co-expression, as described in the main text (Figure 6). The comparison using the potato PGSC data is shown in the figure below.
The results show that for the potato data, the two ensemble approaches provide comparable results. The original unsupervised ensemble (named Ensemble:k in the figure above, as in the main text) has superior results with k=1 or k=2 in terms of the number of highly co-expressed GO terms. However, with l=2, the supervised ensemble has a slight advantage over the unsupervised ensemble (k=2) in terms of the percentage of highly co-expressed GO terms.
Based on the comparison above, we conclude that for the currently available potato data, the two approaches have similar performance. As the unsupervised ensemble is simpler and uses less information, we chose to use it for annotating the potato genes. The supervised ensemble implementation is also available as part of the R code, and can be utilized in future studies.
Supplemental Method 7. Co-expression of the gold standard genes
The figure below shows histograms of the co-expression values between all gene pairs, and between gene pairs in the gold standard (GS) set. The distribution for the GS genes has a slightly higher mean (0.014 vs. 0.007), but both scores are very close to zero and the distributions are highly overlapping.
The genes in the gold standard fall into 17 different GO process categories. The average co-expression of these categories is not higher than that for random gene sets of the same size.
Both tests suggest that the co-expression information is complementary to the set of GS genes and does not bias the co-expression analysis.