The family of Deg/HtrA proteases in plants

Background: The Deg/HtrA family of ATP-independent serine endopeptidases is present in nearly all organisms from bacteria to human and vascular plants. In recent years, multiple deg/htrA protease genes were identified in various plant genomes. During genome annotations most proteases were named according to the order of discovery, hence the same names were sometimes given to different types of Deg/HtrA enzymes in different plant species. This can easily lead to false inference of individual protease functions based solely on a shared name. Therefore, the existing names and classification of these proteolytic enzymes do not meet our current needs and a phylogeny-based standardized nomenclature is required.
Results: Using phylogenetic and domain arrangement analysis, we improved the nomenclature of the Deg/HtrA protease family, standardized protease names based on their well-established nomenclature in Arabidopsis thaliana, and clarified the evolutionary relationship between orthologous enzymes from various photosynthetic organisms across several divergent systematic groups, including dicots, a monocot, a moss and a green alga. Furthermore, we identified a “core set” of eight proteases shared by all organisms examined here that might provide all the proteolytic potential of Deg/HtrA proteases necessary for a hypothetical plant cell.
Conclusions: In our proposed nomenclature, the evolutionarily closest orthologs have the same protease name, simplifying scientific communication when comparing different plant species and allowing for more reliable inference of protease functions. Further, we propose that the high number of Deg/HtrA proteases in plants is mainly due to gene duplications unique to the respective organism.


Background
Proteolysis, the enzyme-mediated hydrolysis of peptide bonds, is a vital process for every organism. It is associated with many intracellular and extracellular events, e.g. the removal of damaged proteins, nutrient uptake, processing of protein precursors, and signaling [1,2]. A huge variety of proteolytic enzymes, utilizing several different catalytic mechanisms, mediate these processes. The family of Deg proteases (for degradation of periplasmic proteins) [3], also known as HtrA proteases (for high temperature requirement A) [4], is one important group of these proteolytic enzymes. They are ATP-independent serine endopeptidases found in all domains of life, including Bacteria, Archaea and Eukarya. Deg/HtrA proteases belong to the S1B subfamily of the clan PA according to MEROPS nomenclature [5], which features a catalytic domain of the trypsin type, with His-Asp-Ser as catalytic triad. Most Deg/HtrA family members contain one to four PDZ protein-protein interaction domains [6], but members without PDZ domains have been described in plants [7][8][9] and animals [8,10]. Deg/HtrA proteases are best studied in Escherichia coli and mammals, where three (DegP, DegQ and DegS) or five (HtrA1-4 and Tysnd1) Deg/HtrA paralogs are present, respectively. DegP from E. coli is a protein quality control enzyme in the periplasm, acting either as a protease that degrades irreversibly damaged proteins or as a chaperone that assists with the refolding of denatured proteins [11]. A second E. coli protease, DegS, acts in a stress signaling cascade, sensing misfolded proteins in the periplasm and transducing the signal to the cytoplasm [12]. Human Deg/HtrA proteases have been shown to play critical roles in severe diseases, such as Alzheimer's disease, age-related macular degeneration and several cancers (reviewed in [13]).
Compared to the vast literature on prokaryotic and mammalian Deg/HtrA proteases, relatively little is known about members of this family in plants. Searches in genomic databases revealed 16 genes encoding putative Deg/HtrA proteases in Arabidopsis thaliana [14], 15 in Oryza sativa [15] and 20 in Populus trichocarpa [16]. However, to date only a few Deg/HtrA proteases from A. thaliana have been studied in detail. It was experimentally shown that six AtDeg proteases are located in chloroplasts [17][18][19][20][21][22], one in peroxisomes [8], one in mitochondria [E. Zeiser, C. Huber, P. Huesgen, H. Schuhmann, I. Adamska, unpublished], and one in the nucleus [23]. Two more Deg proteases are predicted to reside in chloroplasts, five in mitochondria (one of them with a possible dual chloroplastidial/mitochondrial localization), and the subcellular location of one protein is uncertain (reviewed in [24]). The chloroplast-located Deg/HtrA proteases were reported to be involved in the degradation of damaged photosynthetic proteins, especially the photosystem II (PSII) reaction center D1 protein under light stress conditions (reviewed in [24]). Additionally, the thylakoid lumen-located AtDeg1 protease acts as a chaperone, assisting in the assembly of PSII dimers and supercomplexes [25].
Little is known about Deg/HtrA proteases targeted to compartments other than the chloroplast. However, it was demonstrated that the peroxisomal AtDeg15 protease is a processing enzyme, cleaving the N-terminal peroxisomal targeting signal 2 that is present in some nuclear-encoded peroxisomal proteins [7,8].
Based on the evolutionary relationship of the conserved trypsin domain, Deg/HtrA proteases from Archaea, Bacteria and Eukarya cluster into four distinct clades, whereby plants are the only organisms containing proteases from all four clades [7]. The relatively high number of Deg/HtrA proteases and their diversity in plants, together with the observation that some of them localize to the same compartment and have similar domain arrangements and comparable sizes [7,14,16], carries a high risk of confusion. This risk is compounded by the fact that during genome annotation of vascular plants (e.g. A. thaliana and O. sativa), Deg/HtrA proteases were numbered according to the order of their discovery, thus giving orthologous proteins different numbers and names depending on the organism. For rice, the situation is even more complex, with two independent genome annotation databases for O. sativa ssp. japonica, i.e. the Rice Annotation Database [26] and the MSU Rice Genome Annotation Project Database [27]. Therefore, one gene might occur in the literature under more than one identifier or name.
In the study presented here, we reassessed the number of Deg/HtrA proteases in several photosynthetic eukaryotic model organisms from the Viridiplantae line, such as the dicots A. thaliana and P. trichocarpa, the monocot O. sativa, the moss Physcomitrella patens and the unicellular green alga Chlamydomonas reinhardtii, whose genomes are completely sequenced. Using phylogenetic comparison and domain structure analysis, we propose a unified nomenclature for Deg/HtrA proteases in green plants (including green algae) based on the long-established nomenclature reported for A. thaliana [28]. Furthermore, we were able to identify a "core set" of eight Deg/HtrA proteases shared by all organisms examined here and postulate that the high number of Deg/HtrA proteases in plants is mainly due to gene duplications unique to the respective organism.

Results and discussion
An inventory of Deg/HtrA proteases
To establish a standardized nomenclature, we reassessed the number of Deg/HtrA proteases in the vascular plants O. sativa ssp. japonica and P. trichocarpa, the moss P. patens and the green alga C. reinhardtii by searching annotated genome databases for the presence of deg/htrA sequences (see Methods for details). The secondary structure of these sequences was analyzed using the HHpred platform [29] in order to confirm the presence of a Deg/HtrA protease domain, thereby excluding false positives from the database searches (data not shown). This approach also yielded the domain architecture of confirmed Deg/HtrA proteases, which is included in Tables 1, 2, 3, 4, 5. Table 1 summarizes the Deg/HtrA proteases from A. thaliana, which were reported before based on amino acid (aa) sequence alignments [14] (Table 1, columns 1-3). Using the HHpred platform [29], the presence of a Deg/HtrA-like protease domain could be confirmed for all of these proteins (Table 1, column 5), although two proteins seem to be proteolytically inactive. In AtDeg6 the protease domain is truncated, and the protease domain of AtDeg16 lacks the Asp residue of the catalytic triad (Table 1, column 5, and Additional file 1, which shows all protease sequences analyzed in this study). The remaining 14 Deg/HtrA proteases contain the conserved catalytic triad of His, Asp and Ser required for proteolytic activity (Table 1, column 5). Of the potentially active proteases, AtDeg5 and AtDeg15 (the latter with an elongated N-terminus) do not contain any recognizable PDZ domain. AtDeg7 possesses two predicted protease domains, one potentially active and a second, degenerated one with a mutated catalytic triad [6,30], as well as four PDZ domains arranged in tandems (Table 1, column 5).
Considering the domain arrangement and length of AtDeg7, which is twice as long as most other Deg/HtrA family members, it was proposed that this protease arose from a gene duplication and fusion event, whereafter the second protease domain lost its proteolytic activity and acquired a new function in protein-protein interaction [30].
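The activity criterion used in this inventory (a protease domain counts as potentially active only when His, Asp and Ser of the catalytic triad are all present) can be illustrated with a minimal Python sketch. The function name, the toy sequences and the site indices are hypothetical; in practice the triad positions would come from an alignment against a reference protease domain, which is not shown here.

```python
from typing import Optional, Sequence

# One-letter codes of the catalytic triad residues: His, Asp, Ser.
TRIAD = ("H", "D", "S")


def has_intact_triad(seq: str, sites: Sequence[Optional[int]]) -> bool:
    """Return True if seq carries His, Asp and Ser at the given 0-based sites.

    `sites` holds the aligned positions of the three triad residues
    (assumed to be supplied by an upstream alignment step). A site of
    None means the position is missing, e.g. in a truncated domain.
    """
    if len(sites) != len(TRIAD):
        raise ValueError("expected exactly three triad positions")
    for expected, pos in zip(TRIAD, sites):
        if pos is None or not 0 <= pos < len(seq):
            return False  # residue absent: domain truncated or gapped
        if seq[pos].upper() != expected:
            return False  # residue mutated
    return True


# Toy sequences (hypothetical, not real protease domains).
intact = "AAHAADAASAA"      # H, D, S at positions 2, 5, 8
asp_mutated = "AAHAANAASAA"  # Asp replaced by Asn at position 5

print(has_intact_triad(intact, (2, 5, 8)))        # True
print(has_intact_triad(asp_mutated, (2, 5, 8)))   # False
print(has_intact_triad(intact, (2, None, 8)))     # False
```

A missing site (as in a truncated domain) and a substituted residue are both reported as "not potentially active", mirroring the two kinds of inactive variants described for AtDeg6 and AtDeg16.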
The 15 deg/htrA protease genes that were reported earlier for O. sativa [15] were confirmed in this study (Table 3, columns 1-3). However, the protease previously reported as OsDegP4 (LOC_Os03g62900) was only found in the MSU Rice Genome Annotation Project Database [27], but not in the Rice Annotation Database [26], and an additional potential OsDeg protease was identified (Os03g0608600/LOC_Os03g41170) by BLAST search and homology prediction (Table 3, columns 1-3). Both proteases lack recognizable PDZ domains. The protein Os02g0712000 (LOC_Os02g48180), originally named OsDegP2, possesses a similar domain arrangement to AtDeg7, since it contains two protease domains (a putative active one and a second with mutated catalytic triad residues) and four PDZ domains (Table 3, column 5). Proteins Os01g0278600 (OsDegP1, LOC_Os01g17070), Os08g0144400 (OsDegP11, LOC_Os08g04920), and Os12g0141600 (OsDegP14, LOC_Os12g04750) appear to be proteolytically inactive due to mutated active site residues, with the latter containing two inactive protease domains and lacking a PDZ domain (Table 3, column 5, and Additional file 1).
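Because each rice gene can appear under a Rice Annotation Database identifier, an MSU locus and a former protein name, a small cross-reference index helps to avoid treating one gene as several. The following Python sketch uses only identifier triples quoted in the paragraph above; the function names and the data layout are illustrative assumptions, not part of either database.

```python
from typing import Dict, Iterable, Optional, Tuple

# (Rice Annotation Database ID, MSU locus, former name), as quoted in the text.
Record = Tuple[str, str, str]

RICE_IDS = [
    ("Os02g0712000", "LOC_Os02g48180", "OsDegP2"),
    ("Os01g0278600", "LOC_Os01g17070", "OsDegP1"),
    ("Os08g0144400", "LOC_Os08g04920", "OsDegP11"),
    ("Os12g0141600", "LOC_Os12g04750", "OsDegP14"),
]


def build_index(records: Iterable[Record]) -> Dict[str, Record]:
    """Map every identifier or name (case-insensitive) to its full record."""
    index: Dict[str, Record] = {}
    for record in records:
        for key in record:
            index[key.lower()] = record
    return index


INDEX = build_index(RICE_IDS)


def resolve(identifier: str) -> Optional[Record]:
    """Return the full record for any known identifier, or None if unknown."""
    return INDEX.get(identifier.strip().lower())


# All three spellings point to the same gene.
print(resolve("LOC_Os02g48180"))
print(resolve("OsDegP2") == resolve("os02g0712000"))  # True
```

Resolving every identifier to one record before counting or comparing genes is one simple way to keep the two annotation systems from inflating the apparent number of proteases.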
Seventeen genes encoding Deg/HtrA proteins are present in the genome of the moss P. patens (Table 4, columns 1 and 2). Two of these proteins, Pp1s176_111V6 and Pp1s67_44V6, have mutated active site residues in their protease domain and are predicted to be proteolytically inactive (see Additional file 1 for aa sequences), while Pp1s63_95V6 and Pp1s196_28V6 do not contain any detectable PDZ domain. Two other proteins, Pp1s237_5V6 and Pp1s21_327V6, have, similarly to AtDeg7, a potentially active and an inactive protease domain (Table 4, column 4).
In the genome of C. reinhardtii, 15 deg/htrA genes were identified (Table 5, columns 1-3). Three of these genes, Cre38.g785300, Cre03.g203700, and Cre13.g579900.t1, encode proteolytically inactive enzymes, since at least one residue of the catalytic triad is missing in each of these proteins (column 5, see Additional file 1 for aa sequences). Cre19.g752200 contains, in addition to a Deg/HtrA protease domain, a beta-glycanhydrolase domain in the same ORF, but at present it is not clear whether this constitutes a new type of domain combination or is the result of an erroneous gene annotation. During the analysis of the Deg/HtrA sequences from C. reinhardtii, the occurrence of long (i.e. 10-20 aa) single aa repeats reduced the quality of sequence alignments and hints at a general problem with the assembly of the C. reinhardtii genome. Therefore, the number of Deg/HtrA proteases might change with future genome database updates, similar to the situation in P. trichocarpa.
As mentioned earlier, the number of Deg/HtrA proteases present in non-plant organisms is much lower. A general trend toward an increased number of protein family members in plants has also been observed for other serine protease families [31]. However, the reasons for this phenomenon remain elusive. Compared to other organisms, plants have acquired an additional, highly structured and complex compartment, the chloroplast, and perform oxygenic photosynthesis, a process that is connected to the generation of reactive oxygen species. It is tempting to speculate that this might contribute to an increased need for proteolytic capabilities, and therefore higher protease numbers. On the other hand, although land plants are sessile and therefore cannot escape from stress conditions, the high number of genes encoding Deg/HtrA proteases is unlikely to reflect an adaptation to this lifestyle, since the motile green alga C. reinhardtii possesses a comparable number of Deg/HtrA-encoding genes.
Table footnotes: a First model identifier is from Phytozome v7.0 (http://www.phytozome.net), the second identifier is the corresponding identifier according to [16]. Discrepancies between the suggested gene model and the UniprotKB entry were resolved by analyzing the EST data (if present) and by analysis of the genomic sequence for the presence of ORFs yielding aa sequences similar to ortholog or paralog proteins, with respect to potential splicing sites. b According to [16].
Phylogenetic analysis of "green" Deg/HtrA proteases: proposal of a standardized nomenclature
To establish a nomenclature system based on homologies, we next examined the evolutionary relationship of the Deg/HtrA proteases retrieved from the database searches. The aa sequences of protease domains containing an intact catalytic triad, as identified by the sequence alignment, were phylogenetically analyzed using the maximum likelihood (ML) method. Proteases HtrA [UniProt: P73354], HhoA [UniProt: P72780], and HhoB [UniProt: P73940] from the cyanobacterium Synechocystis sp. PCC6803 [32] were included in the tree for comparison, due to the cyanobacterial origin of chloroplasts [33]. As the focus of this study is on green plants, no sequences from other photosynthetic eukaryotes (e.g. red algae, diatoms) were included. Proteins lacking the catalytic triad or with an incomplete protease domain (Tables 1, 2, 3, 4, 5) were not included in this analysis to avoid misleading positions in the resulting phylogenetic tree. The presence of such inactive protease variants in plant genomes suggests that they might have acquired roles other than proteolysis, resulting in altered evolutionary pressure on the protease domain and the potential for higher mutagenesis rates.
Initial phylogenetic analysis showed that four proteins, namely Os12g0141500 (LOC_Os03g62900), Os12g0141500 (LOC_Os12g04740) and Os03g0608600 (LOC_Os03g41170) from O. sativa and Cre07.g332050 from C. reinhardtii (Tables 3 and 5), did not cluster with any other analyzed Deg/HtrA protease and seemed to be only distant relatives of this protease family (see Additional file 2 for the respective ML tree). Hence, these proteases were excluded from the further analysis for clarity (see Additional file 3 for final input data).
The Deg/HtrA proteases investigated here form four distinct clades (Figure 1; see Additional file 4 for a tree containing the original gene model names), similar to an earlier study that included Deg/HtrA proteases from evolutionarily very distant taxa and only a few plant orthologs [7]. Clade I is further split into two subgroups, where subgroup IA includes orthologs of Deg1, Deg5 and Deg8 (Figure 1, Additional file 4). Subgroup IB comprises the prokaryotic (cyanobacterial) Deg/HtrA proteases, and one protease each from the land plants A. thaliana (AtDeg14, Table 1) and P. patens (PpDeg14, Table 4). Notably, the Deg14 protease is missing in the green alga C. reinhardtii (Table 5).
PpDeg1-group-like (Pp1s152_166V5.1), which passed all validation procedures as described above and in the 'Methods' section, seems to be more distantly related to Deg/HtrA proteases from groups IA and IB (Figure 1). Based on its position in the tree and the comparably low bootstrap support, it was not possible to decide whether it can be included in subgroup IA or IB. Alternatively, the gene model and the respective protein sequence might require improvement. Clade II includes AtDeg2-AtDeg4 and AtDeg9-AtDeg13 and their orthologs (Figure 1, Additional file 4). Clades III and IV gather AtDeg7 and AtDeg15 and their orthologs, respectively (Figure 1, Additional file 4).
Based on the phylogenetic tree, we grouped all orthologous Deg/HtrA proteases from the analyzed plant species and propose a common name for enzymes from the same group in order to unify the nomenclature between different plant species (Tables 1, 2, 3, 4, 5, last two columns). Since the majority of detailed studies on plant Deg/HtrA proteases focused on A. thaliana enzymes, we used their well-established nomenclature [14,28] as a guideline for renaming Deg/HtrA orthologs in the other organisms analyzed here (Tables 2, 3, 4, 5, last columns).
In P. trichocarpa, we renamed PtDeg5.1 (Pt771291) to PtDeg5, since only one isoform of this protein is present in this organism, and combined PtDeg14.1 (Pt662713) and PtDeg14.2 (Pt662714), encoded by the same ORF (see above), under the common name PtDeg14 (Table 2). A new gene model (POPTR_0008s07940) similar to AtDeg10 was named PtDeg10.
For Deg/HtrA proteases from O. sativa, we propose to change the existing nomenclature present in the TIGR/MSU database [27], and we also provide preliminary new names for the more distantly related Deg/HtrA-like proteases or proteins without an intact protease domain (Table 3). For these proteins, we suggest using the names "OsDeg-like1-6", in order to prevent confusion between e.g. OsDeg1 (Os05g0568900, LOC_Os05g49380) and the more distantly related protein formerly known as "OsDegP1", now OsDeg-like1 (Os01g0278600, LOC_Os01g17070) (Table 3).
Since no names were given for annotated Deg/HtrA proteases in P. patens we propose to name them based on phylogeny as suggested in Table 4 (last column).
For C. reinhardtii, the proposed nomenclature of Deg/HtrA proteases partially matched that present in the Phytozome 7.0 and UniProt databases (Table 5). However, we suggest changing the names of Deg1 (Cre02.g088400), Deg11 (Cre12.g498500) and Deg13 (Cre14.g630550) to CrDeg1.1, CrDeg1.2, and CrDeg1.3 (Table 5), since all three proteases are more closely related to AtDeg1 than to AtDeg11 or AtDeg13 (Figure 1, Additional file 4). For Cre19.g752200, we propose the name CrDeg9.1, since its protease domain seems to be evolutionarily related to AtDeg9, although the domain arrangement of this protease (it contains a beta-glycanhydrolase domain in the C-terminal half of the protein) is unusual for these enzymes (Table 5). The protease domain of Cre14.g617600, described as Deg9 in both the Phytozome 7.0 and UniProt databases, seems to be more closely related to those of Deg10 proteases, but the bootstrap support is insufficient to justify its renaming. For this reason we suggest the name CrDeg9.2 for this protein (Table 5). A new gene model, Cre12.g548200, was named CrDeg15 (Table 5), since its protease domain was most closely related to that of AtDeg15 (Figure 1, Additional file 4).
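The practical benefit of the proposed names is that the ortholog group can be read directly from the name. The Python sketch below records the C. reinhardtii renamings stated above and parses a standardized name into its group. The regular expression reflects our reading of the naming pattern (two-letter species prefix, "Deg" plus a number, optional isoform suffix) and is an assumption; names of the "Deg-like" type are deliberately not handled.

```python
import re

# Gene model -> proposed name, as stated in the text for C. reinhardtii.
PROPOSED_NAMES = {
    "Cre02.g088400": "CrDeg1.1",
    "Cre12.g498500": "CrDeg1.2",
    "Cre14.g630550": "CrDeg1.3",
    "Cre19.g752200": "CrDeg9.1",
    "Cre14.g617600": "CrDeg9.2",
    "Cre12.g548200": "CrDeg15",
}

# Assumed pattern: species prefix, ortholog group, optional isoform number.
NAME_PATTERN = re.compile(
    r"^(?P<species>[A-Z][a-z])(?P<group>Deg\d+)(?:\.(?P<isoform>\d+))?$"
)


def ortholog_group(name: str) -> str:
    """Return the ortholog group (e.g. 'Deg1') encoded in a standardized name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a standardized Deg/HtrA name: {name!r}")
    return match.group("group")


# The former 'Deg11' and 'Deg13' of C. reinhardtii now fall into group Deg1.
print(ortholog_group(PROPOSED_NAMES["Cre12.g498500"]))  # Deg1
print(ortholog_group("AtDeg1") == ortholog_group("CrDeg1.3"))  # True
```

With the old database names, the same comparison would have paired the algal protein with AtDeg11 or AtDeg13, which is exactly the false inference the standardized nomenclature is meant to prevent.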

Analysis of domain arrangement supports proposed nomenclature
Analysis of the protein aa sequences with the HHpred platform yielded predictions for the number and the arrangement of protease and PDZ domains in each Deg/HtrA protease (Figure 1 and Tables 1, 2, 3 and 5, column 5; Table 4, column 4). These data support the presence of four major Deg/HtrA clades (Figure 1), as reported before [7]. Proteases from clade I contain one protease domain and one PDZ domain (with the exception of all Deg5 orthologs, where the PDZ domain is missing), whereas proteases from clade II contain one protease domain and two PDZ domains (Figure 1). Clades III and IV contain Deg/HtrA proteases with non-canonical domain arrangements: Clade III consists of very large proteins (approximately 1,000 aa), which according to prediction contain one active and one inactive protease domain, and four PDZ domains (Figure 1). Recently, it was shown that the inactive protease domain in AtDeg7 is involved in trimerization of this enzyme [30]. Whether this holds true for other Deg7 orthologs remains to be examined. Proteins from clade IV do not contain any detectable PDZ domain, and their protease domain is shifted towards the C-terminus (Figure 1). Since this domain arrangement is unusual for Deg/HtrA proteases [6], proteins from this group are sometimes not classified as members of this family, e.g. the mammalian ortholog of plant Deg15, called Tysnd1 [10]. However, due to the presence of a Deg/HtrA protease domain we classified Deg15 orthologs as Deg/HtrA family members (Tables 1, 2, 3, 4, 5). Although the phylogenetic tree and, as a consequence, the standardized protease nomenclature are built on the aa sequences of the protease domains alone, they are supported by the analysis of the domain arrangements, using the aa sequence of the full-length protein. All proteases share the same domain arrangement with their nearest ortholog, e.g. all Deg1 proteins from the five analyzed organisms possess one PDZ domain, all Deg5 proteins contain none and all Deg7 proteins contain two protease and four PDZ domains (Tables 1, 2, 3 and 5, column 5; Table 4, column 4).
Figure 1. Maximum likelihood phylogenetic tree of Deg/HtrA proteases in selected plant species. The following species were investigated: Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, Physcomitrella patens, Chlamydomonas reinhardtii, and the cyanobacterium Synechocystis sp. PCC6803. The phylogenetic tree is labeled with the new names suggested by this study. Filled circles indicate a bootstrap support (100 replicates) of > 90%, empty circles indicate a bootstrap support of > 70%. Additionally, the domain arrangement representative for proteases from each group is indicated. Deg/HtrA proteases from clade I contain one protease domain (oval shapes) and one PDZ domain (diamonds), with the exception of Deg5 proteases, which possess a protease domain only. Proteases from clade II contain an additional PDZ domain, clade III gathers proteases with one active (oval shape) and one inactive (discontinuous oval shape) protease domain and four PDZ domains, whereas enzymes from clade IV contain a single protease domain, which is shifted toward the C-terminus.
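The cross-check described in this section, namely that proteins sharing a name also share a domain arrangement, amounts to grouping proteins by ortholog group and flagging any group with more than one arrangement. A minimal Python sketch follows. The AtDeg1, AtDeg5 and AtDeg7 arrangements are taken from the text; the entry "XxDeg5" is a purely hypothetical protein added only to show what a flagged inconsistency looks like.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (protein name, ortholog group, number of protease domains, number of PDZ domains)
Record = Tuple[str, str, int, int]


def inconsistent_groups(records: Iterable[Record]) -> Dict[str, Set[Tuple[int, int]]]:
    """Return ortholog groups whose members show more than one domain arrangement."""
    arrangements: Dict[str, Set[Tuple[int, int]]] = defaultdict(set)
    for _name, group, n_protease, n_pdz in records:
        arrangements[group].add((n_protease, n_pdz))
    return {g: a for g, a in arrangements.items() if len(a) > 1}


records = [
    ("AtDeg1", "Deg1", 1, 1),  # one protease domain, one PDZ domain
    ("AtDeg5", "Deg5", 1, 0),  # no recognizable PDZ domain
    ("AtDeg7", "Deg7", 2, 4),  # two protease domains, four PDZ domains
    ("XxDeg5", "Deg5", 1, 1),  # hypothetical outlier, not from the study
]

print(inconsistent_groups(records))
print(inconsistent_groups(records[:3]))  # {}
```

An empty result corresponds to the situation reported here, in which every protease shares its domain arrangement with its nearest ortholog.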
A "core set" of Deg/HtrA proteases in plants
All organisms examined here contain between 15 and 17 deg/htrA-encoding genes, whereas the number of potentially active enzymes is slightly lower. Although the total number of Deg/HtrA proteases is similar in all plants analyzed in this study, the distribution of the proteases within the phylogenetic tree (Figure 1) differs for each species.
From this collection of Deg/HtrA protease-encoding genes, we extracted the hypothetical minimum number of Deg/HtrA proteases present in plants. This "core set" represents conserved Deg/HtrA protease types found in every organism examined here, in the lowest possible copy number. For example, the genome of P. trichocarpa contains three Ptdeg7 genes, whereas A. thaliana and O. sativa contain only one; therefore, the "core set" contains one Deg7 protease. For plants, this conserved "core set" consists of eight proteases (Table 6), namely Deg1, Deg5, and Deg8, detected in the thylakoid lumen [9][10][11][12][13][14][15][16][17], Deg2 and Deg7 in the chloroplast stroma [18,21], Deg9 in the nucleolus [36], Deg15 in the peroxisome [8], and Deg10, which is predicted to have a mitochondrial localization [14]. C. reinhardtii, for example, possesses only "core set" proteases as Deg/HtrA enzymes, although some are present in duplicates. This "core set" seems to provide all the proteolytic potential of Deg/HtrA proteases that is necessary for a hypothetical plant cell.
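The derivation of the "core set" can be expressed as a set intersection followed by a minimum over copy numbers. The Python sketch below shows the idea under stated assumptions: the Deg7 copy numbers for A. thaliana, O. sativa and P. trichocarpa are those given above, whereas "DegX" is a hypothetical protease type included only to show that a type missing from one species drops out. The function name and input layout are ours.

```python
from typing import Dict


def core_set(counts: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Return protease types present in every species with their lowest copy number.

    `counts` maps species -> {protease type -> number of genes}.
    """
    species = list(counts)
    if not species:
        return {}
    # Types with at least one gene copy, per species, intersected across species.
    shared = set.intersection(
        *({name for name, n in counts[s].items() if n > 0} for s in species)
    )
    return {name: min(counts[s][name] for s in species) for name in sorted(shared)}


counts = {
    "A. thaliana": {"Deg7": 1, "DegX": 1},   # DegX is hypothetical
    "O. sativa": {"Deg7": 1, "DegX": 0},
    "P. trichocarpa": {"Deg7": 3},
}

print(core_set(counts))  # {'Deg7': 1}
```

Applied to the full inventory of all five organisms, this procedure would be expected to return the eight protease types listed above, each with a copy number of one.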

Conclusion
In this study, we present the first detailed analysis of the Deg/HtrA protease family in green plants, including genomes from vascular plants, a moss, and a green alga. Based on phylogenetic analysis of the protease domains and analysis of the domain arrangement in the full-length protease, we propose a standardized nomenclature for Deg/HtrA proteases in plants. Although biochemical data are only available for selected proteases from A. thaliana, our data suggest (within the limits of a sequence-only analysis) that proteases with the same name might indeed execute comparable physiological functions. Compared to animals and prokaryotes, the number of Deg/HtrA proteases encoded in plant genomes is much higher, which is partially due to genome or gene duplications. However, the exact reasons are probably different for every organism. A "core set" of eight protease genes was identified for plants, of which at least one copy is present in every genome examined here. This seems to be the minimum number of Deg/HtrA proteases necessary for plants. We are confident that the work presented here will be a valuable tool and guideline for future research on plant Deg/HtrA proteases that will allow easy communication between research groups working with different photosynthetic organisms.
Table 6 legend: The presence of a protease in a particular organism is indicated by +, its absence by -. If more than one isoform is present, the names are given. Proteases of the "core set" are depicted in bold. At, Arabidopsis thaliana; Cr, Chlamydomonas reinhardtii; Os, Oryza sativa; Pp, Physcomitrella patens; Pt, Populus trichocarpa.