Functional annotation of creeping bentgrass protein sequences based on convolutional neural network

Background Creeping bentgrass (Agrostis soionifera) is a perennial grass of Gramineae, belonging to cold season turfgrass, but has poor disease resistance. Up to now, little is known about the induced systemic resistance (ISR) mechanism, especially the relevant functional proteins, which is important to disease resistance of turfgrass. Achieving more information of proteins of infected creeping bentgrass is helpful to understand the ISR mechanism. Results With BDO treatment, creeping bentgrass seedlings were grown, and the ISR response was induced by infecting Rhizoctonia solani. High-quality protein sequences of creeping bentgrass seedlings were obtained. Some of protein sequences were functionally annotated according to the database alignment while a large part of the obtained protein sequences was left non-annotated. To treat the non-annotated sequences, a prediction model based on convolutional neural network was established with the dataset from Uniport database in three domains to acquire good performance, especially the higher false positive control rate. With established model, the non-annotated protein sequences of creeping bentgrass were analyzed to annotate proteins relevant to disease-resistance response and signal transduction. Conclusions The prediction model based on convolutional neural network was successfully applied to select good candidates of the proteins with functions relevant to the ISR mechanism from the protein sequences which cannot be annotated by database alignment. The waste of sequence data can be avoided, and research time and labor will be saved in further research of protein of creeping bentgrass by molecular biology technology. It also provides reference for other sequence analysis of turfgrass disease-resistance research. Supplementary Information The online version contains supplementary material available at 10.1186/s12870-022-03607-8.


Introduction
Creeping bentgrass (Agrostis soionifera) is a perennial grass of Gramineae, belonging to cold season turfgrass.Due to its excellent characteristics, such as, strong adaptability, good ornamental, it is a preferred grass species in golf course, lawn tennis court, courtyard, park and other green areas.However, creeping bentgrass with shallow adventitious roots has poor disease-resistance.For example, it is susceptible to coin spot and brown spot.The innate immunity can be induced in plant, which relies on a surprisingly complex response mechanism to recognize and counteract different invaders.The induced physical and chemical barriers are activated to effectively combat invasion by microbial pathogens, as well as inducible defensive mechanisms upon attack [1,2].Among them, the induced systemic resistance (ISR) is often activated by plant growth promoting bacteria in soil rhizosphere, and has broad-spectrum resistance to bacteria, fungi and pathogens [3,4].
Since without disease resistance-inducing factor the resistance of plants may not be induced [5], Butanediol (BDO) is often adopted as a new type of disease resistanceinducing factor, which provides durable disease resistance.ISR produced by BDO effectively inhibits grass leaf diseases [6,7].Studies have shown that many resistance proteins enter the nucleus to activate the immune response and triggers the signal transduction pathway, including resistance signal activation, transcription factor regulation and hormone signal pathway activation [8].For instance, a number of preliminary proteome analyses in rice successfully identified some known pathogenesis-related proteins that accumulate abundantly after Jasmonic acid treatment or inoculation by the pathogenic fungus M. grisea [9,10].Oh et al. [11] analyzed the secreted protein encoding the lipase with antimicrobial activity in Arabidopsis.However, except for these few preliminary studies, the study about the proteins relevant to the ISR mechanism is still scarce, especially for the turfgrass.One major reason is that many proteins identified and analyzed involving in signaling processes are below the threshold of detection [12].Hence, more efforts are worth and urgent to be put into the study of disease-resistance related proteins.
In our previous work [13], BDO was used to induce ISR resistance in creeping bentgrass infected with Rhizoctonia solani, and a genetic research of creeping bentgrass by the transcriptome analysis was performed to analyze ethylene-dependent signal transduction pathways involved in ISR mechanisms.In that work, only the sequences annotated by the database alignment were analyzed.However, there are a large number of protein sequences, which were not aligned in seven databases, or aligned but not annotated.Since these sequences were non-annotated, analysis cannot be performed, so not reported in previous work.Considering the important role played by the protein in ISR mechanism, in this work, we will provide an explicit analysis about these protein sequences of creeping bentgrass.It is very helpful for fully understanding of the ISR mechanism and further study of the disease resistance of creeping bentgrass if we can annotate these protein sequences correctly even not exhaustively.
The function of protein is usually analyzed and annotated by biochemical experiments, which are time-and labor-consuming.At present, the number of protein sequences in UniPort database exceeds 100 million, and still increases rapidly [14].The traditional methods are not enough to make up the increasing gap between the requirement and the speed of protein annotation by experimental means [15].Therefore, the protein function prediction methods, such as machine learning, were proposed and widely adopted in the research [16].However, traditional computational methods have disadvantages such as high false positive control rate and low accuracy.In the recent year, the deep learning technology becomes an important method in the field of protein research [17,18] while it is scarcely applied to the study of grass.In Ref. [19], an accurate and stable function prediction model was built to extract protein features with the convolutional neural network (CNN).It may solve the shortcomings of other methods, especially false positive control rate.
In the current work, the protein sequences from the creeping bentgrass seedlings with BDO treatment will be reported.To provide more helpful information about the protein sequences obtained, with the deep learning algorithm, we performed following analysis about non-annotate protein sequences of creeping bentgrass, (1) Based on CNN and protein binary encoding representation strategy, a functional prediction model was established following Ref.[19] with some modifications and adjusted to achieve high false positive control rate with annotated protein sequences from some Gene Ontology (GO) terms, which were collected from the Uniport database; (2) The established model was applied to non-annotated protein sequences of creeping brntgrass to predict functional classification; (3) The prediction model was further applied to select the proteins relevant to disease-resistance and signal transduction from non-annotated protein sequences of creeping bentgrass.Such treatment avoids waste of data, and supplements the analysis of proteins of creeping bentgrass.The research lays foundation for further mining ISR response proteins of creeping bentgrass and exploring the disease-resistance mechanism of turfgrass.

Plant growth conditions and production of sequencing libraries
Seeds from creeping bentgrass 'PennA-4' (Chinese Academy of Agricultural Sciences) were grown by modified method of Kroes et al [13,20].The surface of seeds was disinfected with 70% ethanol for 1 min, disinfected with 15% sodium hypochlorite for 5min, cleaned with sterile water for 10 min, and finally dried with filter paper.Seeds were sown in 50-mL culture flask with 10 mL of MS medium containing 100 μmol L -1 of BDO.About 20 seedlings per flask were cultured in a growth chamber at 22 ℃ under 100 μEm -2 s -1 light.The experimental materials were seedlings cultured for twelve-dayold under the above conditions.Rhizoctonia solani (#3.2888 from China General Microbiological Culture Collection Center) in PDO liquid medium (potato 200 g•L -1 , glucose 20 g•L -1 ) was shaking culture for 2-3 day with 120 r•min -1 at 25℃.Concentration of the bacterial sample was a final OD340 of 0.8.Roots of seedlings were directly sprayed with 2 mL of the bacterial fluid.The brown blotch symptoms of creeping bentgrass seedlings were observed, and the mycelium began to grow after 3-5-day post-inoculation.After 24, 48 and 72 h post-inoculation, the seedlings with different treatment were removed, and the leaves were cut.The treated materials were tested and analyzed, and all the samples were mixed and spliced (Eukaryotic Nonreference Transcriptome).The transcriptome analysis was performed by Illumina Sequencing.Sequencing libraries were produced by NEBNext UltraTM RNA Library Prep Kit (NEB, San Diego, CA, USA), and each sample attributes sequences for index codes.

Establishing of protein function prediction model 2.2.1 Constructing the data sets of training and testing
At first, we should construct database for establishing the prediction model under CNN frame.Some of obtained protein sequences of creeping bentgrass can be aligned and classified into the GO terms.Excluding some GO terms with too few protein sequences, we chose 7 terms in cellular component (CC) domain, 10 terms in molecular function (MF) domain, and 12 terms in biological process (BP) domain.The annotated protein sequences with these GO terms were collected from the UniPort database (see Fig. 1).With the protein sequences collected, we constructed positive and negative data for establishing prediction model by adopting binary classification, which would be also used to treat protein sequences of creeping bentgrass.For a GO term studied, we considered the proteins in this GO term as positive data.The negative data were selected from the left GO terms after removing the repeated sequences.To avoid overemphasizing one of all left GO terms, for a GO term studied with  sequences, we selected the sequences from the left GO terms in order until 3N sequences were selected.A data set with 4 sequences was obtained.After shuffling, 60% of 4 proteins were used as training set, 20% as testing set, and 20% as valuation set.Here, imbalance binary classification was adopted to emphasize the negative data.Such treatment can improve the false positive control rate that we focused on in the current work, but lower sensitivity, which is less important in the present work.

Prediction model based on CNN
In the current work, we adopted a deep learning algorithm, CNN, to analyze the protein sequences.The protein is expressed as a one-dimensional sequence of amino acids, which is quite analogous to the sentence classification.We chose an explicit CNN frame, textCNN [21], which has been successfully applied to analyze the text, and to study the proteins [19], which we followed in the current work.The model was implemented with the Tensorflow3 library and the python programming language with some modifications to get best performance for the data sets considered in the current work.The binary cross-entropy loss function was adopted in all models training, and the Adam [22] optimizer with default parameters was used for the optimization during back-propagation.The weight parameters were initialized with the He initialization method [23], and biases were initialized to zero.For a protein sequence, we need first encode the amino acids into binary vectors.Because only about twenty amino acids were discovered, we encoded an amino acid to a 5-bit binary vector as shown in Fig. 2. For example, the alanine was encoded as [0,0,0,0,1].In the current work we did not distinguish the rare and undetermined amino acids, and encoded them all as [0,0,0,0,0].The lengths of protein sequences are different while the CNN requires a fixed length.We considered the proteins of sequence length less than  = 800 amino acids, which constitute the majority (>95%) of the protein sequences in the studied GO terms (and almost 100% of non-annotated sequences of creeping bentgrass).For the protein sequences less than 800 amino acids, the left positions were complemented by binary vector [0,0,0,0,0].With such encoding, a protein sequence was converted into an  × 5 matrix.
After encoding, the convolution layer with He normalization as kernel initializer was adopted to extract the information from the digitized protein sequences as shown in Fig. 2. To obtain the information on different length levels, convolution kernels with different sizes were adopted with layers.Conv2D function in Tensorflow3.In the current work, we chose  = 120 convolution kernels with size as 2 !× 5 with  = 0, 1, 2, … , 8.After convolution, nine ( − 2 !+ 1) ×  arrays were obtained as , where  is for the number of kernels with size as 2 !× 5. To accelerate the learning speed, the batch normalization was applied with parameters momentum and epsilon being 0.99 and 0.001, respectively, which was followed by the ReLu activation.The max pooling was adopted by selecting the maximum in ( − 2 !+ 1) elements of  "# ! for certain  and .Then, the  vectors with the same size  were concatenated to a vector with size  = 1080.Additional fully connected layer was not applied here because it was found not helpful to improve the results and made the model hard to converge.Instead, modified from Ref. [19], after batch normalization, we adopted the layers.densefunction to provide the classification probability.It includes a layer with 1080 input neurons and 2 output neurons.The softmax activation was then adopted to transfer the values of the two output neurons  (,-to  (,-with differentiable softmax function / " #&/ " % .One can find that  (,-is value between 0 and 1 and  ( +  -= 1, which is just the classification probability.If the possibility  ( > 0.5, the sequence will be classified into the GO term.Here the L2 regularization was also adopted to avoid overfitting.

Model's performance with the protein sequences from the Uniport database
When establishing the prediction model above, we trained the model by the protein sequences collected from the Uniport database in 29 GO terms, which numbers were shown in Fig. 1.The model and parameters were adjusted to obtain the best performance of false positive control rate because in the current work we want to establish a prediction model to select proteins with certain function correctly but not exhaustively.To evaluate the performance of the model, we introduce five widely-used measurements, sensitivity (SE), specificity (SP), precision (PR), accuracy (AC) and Matthews correlation coefficient (MCC), which explicit definitions are given in Additional file 1. Amongst five measurements, the specificity (SP) is the most important factor because it reflects the false positive.In Fig. 3, we presented the violin plots to show the overall picture of the results for three domains.Generally speaking, the model works well for the GO terms in three domains, especially for the GO terms in MF domain, which is comparable to the model in Ref. [19] where the studied GO terms are also in MF domain.Amongst five measurements, SP values are quite well for three domains (Fig. 3b), which satisfies our requirement to ensure the correctness of functional annotation of protein.Besides, smaller MCC values for BP domain than other two domains suggests that the functions of proteins in BP domain are more complex and harder to be learned by the prediction model.

Screening of non-annotated sequences
To annotate the protein sequences of the transcripts obtained in Section 2.1, based on four databases, NR, KOG, SwissProt, and KEGG, the BLASTALL package (release 2.2.28) [24] from the NCBI was adopted with a significant threshold of E-value 10 −5 .The KOBAS software in the KEGG pathways was used to check the statistical enrichment of differential expression genes (DEGs) [25].The Blast2go v2.5 software (Biobam, Spain) was used to functionally categorize the sequences based on Gene Ontology (GO) with an E-value filter 1 × 10 '0 .The total number of annotated protein sequences is 208,672.GO classification associations, combined with statistically transcripts were uploaded on the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under accession number SRR5658390.The annotated protein sequences can be found in these data.In the followings, we will focus on the protein sequences which is not annotated.
The non-annotated sequences used in our study referred to the sequences that were not aligned with databases, or the sequences that were aligned but no predicted results.According to the database alignment, 118,856 amino acid sequences were not annotated.All non-annotated amino acid sequences are provided in the Additional file 2. The exploration of this part of protein functions has great significance for further understanding the ISR mechanism of creeping bentgrass.The annotation results of these non-annotated protein sequences with established model are given in the followings.

Functional annotation of non-annotated sequences of creeping bentgrass by GO terms in three domains
In the above, we established the prediction model with good performance, especially for the false positive control rate, by training the GO terms in three domains.With the trained model, we can analyze non-annotated protein sequences of creeping bentgrass and classify them into the GO terms considered above.The protein sequences with the function of certain GO term selected by prediction model are provided in the Additional file 3.In Table 1, the number of the proteins belonging to a GO term and the ratio of the number to the total number of the protein sequences are listed.

Table 1
The number and ratio of the proteins belonging to a GO term predicted from non-annotated protein sequences of creeping bentgrass.In the CC domain, the number of selected protein sequences for GO: 0005576 ('extracellular region') is the largest, accounting for 36.72%,while the number for GO:0055044 ('symplast') is the smallest, only 0.99%.In the MF domain, the number of proteins belonging to GO:0016247 with function 'channel regulator activity' is the largest in non-annotated protein sequences, accounting for 20.79%.The number for GO: 0005085 ('guanyl-nucleotide exchange factor activity') is smallest, only 0.86%.From Ancestor Chart, the function 'antioxidant activity' (DO: 0016209) is a part of 'cellular response to stimulus' (GO: 0051716), which is closely related to the disease-resistance of creeping bentgrass.Its predicted number accounts for 9.50%.In the BP domain, GO: 0001906 ('Cell killing') has the largest number, accounting for 26.58%, followed by DO:0048519 ('negative regulation of biological process'), accounting for 24.96%.3.8% of non-annotated protein sequences belongs to GO:0002376 ('immune system process').

Functional annotation of non-annotated sequences of creeping bentgrass by GO terms relevant to the disease-resistance response and signal transduction
In the above, the established prediction model was applied to select the protein sequences from non-annotated sequences of creeping bentgrass belonging to the GO terms chosen to establish the model.In this subsection, we focused on 13 GO terms with functions relevant to stimulus response and signal transduction related proteins (which was not used when establishing the prediction model).The model was trained by protein sequences of these GO terms, which were also collected from the UniPort database.The performance of the model is presented in Table 2.It is well known that the disease-resistance response and signal transduction process belong to BP domain.One can expect that the results are analogous to the results in the BP domain in Fig. 3(b).The SP values are considerable large, and spans from 96% up to 100%, which satisfies high false positive control rate required in the current work.

Table 2
The performance of prediction model and the number of the proteins for disease-resistance and signal transduction related GO terms.The GO ID, term, and number NGO of protein sequences in a GO term collected from the UniPort database are listed in the first to third columns, respectively.The measurements are listed in the fourth to eighth columns.The number of the proteins belonging to a GO term predicted from non-annotated protein sequences of creeping bentgrass is listed in last column.With the model trained by the data from UniPort, the protein sequences annotated from the non-annotated sequences of the creeping bentgrass are provided in the Additional file 4. In Table 2, we list the number of protein sequences with certain function.The number of protein sequences with function of 'negative regulation of molecular function' (GO:0044092) is the largest, accounting for 47,437.In Ancestor Chart, 'negative regulation of molecular function' is the biological regulation process.There are 24,287 annotated proteins with function of 'response to biological stimulus' (GO:0009607), 11,555 annotated proteins with function of 'negative regulation of response to external stimulus' (GO:0032102), and 3022 annotated proteins with function of 'regulation of cellular response to stress' (GO:0080135).These protein functions are closely related to the disease-resistance response of creeping bentgrass, and also reflect the positive response of creeping bentgrass to external stimulation after being infected by Rhizoctonia solani by inducing a large number of disease-resistance related proteins.Besides, there are 1260 annotation proteins with function of 'immune response regulating signaling pathway' (GO:0002764) and 148 annotation proteins with function of 'negative regulation of signal transduction proteins' (GO:0009968), both of which are closely related to signal transduction process.The annotation of the above protein functions is of the great significance for further explore the disease-resistance and signal transduction of creeping bentgrass.

Conclusions
To understand the ISR mechanism of turfgrass, the high-quality protein sequences were obtained from the creeping bentgrass seedlings infected by Rhizoctonia solani with BDO treatment to induce the ISR response.Amongst protein sequences obtained, some were functionally annotated according to the database alignment while the rest of the protein sequences were left non-annotated.A functional prediction model was established based on CNN with emphasizing high false positive control rate following Ref.[19], and applied to treat these left non-annotated proteins sequences to find the sequences relevant to the disease-resistance response and signal transduction which play important roles in ISR mechanism.
To establish the model, the data sets were collected from UniPort database in 29 GO terms in three domains.The sequences were encoded into a matrix, and convolution kernels were adopted to extract the information of sequences in different length levels.With the classification probability obtained, the protein sequences can be annotated.Compared the annotations with current model and these from UniPort database for the sequences in testing dataset, the prediction model was evaluated by five measurements, SE, SP, PR, AC, and MCC.A significant performance of the current model is the high SP values obtained for the data sets in 29 GO terms in three domains, which reflect the good false positive control rate.It guarantees the correctness of the annotation, though some sequences with certain function maybe omitted.With the established and trained model, the protein sequences belonging to 29 GO terms were selected from the nonannotated sequences of creeping bentgrass.
The established model was retrained by the data from UniPort in 13 GO terms relevant to the ISR mechanism.The non-annotated protein sequences of the creeping bentgrass were analyzed with retrained model.The protein sequences were annotated as different functions, which mainly involve 'response to biological stimulus', 'negative regulation of response to external stimulus', 'negative regulation of molecular function', 'regulation of cellular response to stress' and 'immune response regulating signaling'.These protein molecules play different roles in the disease-resistance process of creeping bentgrass.The results provide good candidates of the proteins with certain functions from all obtained protein sequences, which can be studied by molecular biology technology in further studies.With selected protein sequences, the waste of experimental data of protein sequence of creeping bentgrass can be avoided, and the experiment consumption of time and labor in further molecular biological studies can be saved.The current results are helpful to understand the ISR mechanism, and also provide reference for other sequence analysis of turfgrass disease-resistance research.

Fig. 1 .
Fig. 1.The number of protein sequences in every GO term used in training.

Fig. 2 .
Fig. 2. The workflow of CNN adopted in the current work.

Fig. 3 .
Fig. 3. Violin plots for the performance of prediction model in three domains.