Genome-wide identification of expansin gene family reveals expansin genes are involved in fibre cells growth in cotton

Background: Expansins (EXPs), a group of proteins that loosen plant cell walls and cellulosic materials, are involved in regulating cell growth and diverse developmental processes in plants. However, the biological functions of this gene family remain largely unknown in cotton. Results: In this paper, we identified a total of 93 expansin genes in Gossypium hirsutum. These genes were classified into four subfamilies, including 67 GhEXPAs, eight GhEXPBs, six GhEXLAs, and 12 GhEXLBs, and divided into 15 subgroups. The 93 expansin genes are distributed over 24 chromosomes, with none located on Ghir_A02 or Ghir_D06. All GhEXP genes contain multiple exons and each GhEXP protein has multiple conserved motifs. Transcript profiling and qPCR analysis revealed that the expansin genes have distinct expression patterns at different stages of cotton fibre development. Among them, three genes (GhEXPA4o, GhEXPA1a, and GhEXPA8h) were highly expressed in the initiation stage, nine genes (GhEXPA4a, GhEXPA13a, GhEXPA4f, GhEXPA4q, GhEXPA8f, GhEXPA2, GhEXPA8g, GhEXPA8a, and GhEXPA4n) had high expression during the fast elongation stage, while GhEXLA1c and GhEXLA1f were preferentially expressed in the transition stage of fibre development. Conclusions: Our results provide a solid basis for further elucidation of the biological functions of expansin genes in cotton fibre development and valuable genetic resources for future crop improvement.


Background
Expansins are cell wall loosening proteins that are widely present in higher plants, bacteria, and fungi. Expansins can loosen plant cell walls and cellulosic materials without lytic activity [1,2], and they may unlock the network of wall polysaccharides, permitting turgor-driven cell enlargement [3]. The plant expansin superfamily is divided into four subfamilies: α-expansin (EXPA), β-expansin (EXPB), expansin-like A (EXLA), and expansin-like B (EXLB). EXPA and EXPB proteins are involved in cell expansion and other developmental events during which cell-wall modification occurs, but no confirmed enzymatic activity has been detected for these proteins [4]. EXLA and EXLB are two smaller subfamilies of the expansin superfamily, and phylogenetic analysis shows that they constitute separate and well-resolved groups. Their structure is predicted to be the same as that of other expansins, and they can also target the cell wall for modification; however, their biological functions remain uncertain [1].
Typical plant expansins are torpedo-shaped proteins containing two domains, domain I and domain II. Plant expansins are usually 250 to 275 amino acid residues in length and the majority have a signal peptide at the N-terminus; the signal peptides are usually 20 to 30 amino acid residues long [1,4]. Domain I is a six-stranded double-psi beta-barrel (DPBB), which has similar characteristics to the catalytic domain of glycoside hydrolase family 45 proteins (GH45) and contains a conserved His-Phe-Asp (HFD) motif. The DPBB domain does not, however, possess the same catalytic activity as GH45. Domain II is homologous to group-2 grass pollen allergens [1], and it was recently classified as a family-63 carbohydrate binding module (CBM63) [1,5].
Expansins were first identified as endogenous proteins that induce cell wall extension in plants by the McQueen-Mason group [6]. Studies of expansins have since shown that they participate in many developmental processes and are thought to function in cell growth and enlargement, pollen tube invasion of the stigma (in grasses), wall disassembly during fruit ripening, abscission, stress resistance, and other cell separation events [1,3,7,8]. Cotton is the most important natural fibre crop worldwide and expansins play an important role in fibre cells [9,10]. The mRNA levels of expansins were first reported to be high during cell elongation and to decrease when cell elongation ceased [11]. Field experiment data showed that expansins improved cotton fibre length and micronaire value [10].
Subsequently, some expansin genes preferentially expressed in cotton fibres were isolated and identified using several different approaches, including cDNA arrays, subtractive PCR, RT-PCR, and SNP-based chromosomal assignment [12,13], and researchers have gained a preliminary understanding of the function of these cotton expansin genes at the transcript level [12,14,15]. The functions of expansin genes have been further investigated in cotton fibre development [9,10,16,17]. For example, GhRDL1 is localized in the cell wall and interacts with GhEXPA1. Overexpressing GhRDL1 increases fibre length, and cotton plants overexpressing GhRDL1 and GhEXPA1 produce many more fruits [17]. GhEXPA1 expression levels are regulated by the transcription factor GhHOX3, which promotes cotton fibre elongation [16]. GbEXPATR, a Gossypium barbadense-specific expansin, can also enhance cotton fibre elongation through cell wall restructuring [9].
Cotton is the most important global fibre crop and accounts for 90% of natural fibre production in the world [18,19]. The cotton genome has been sequenced and resequenced in succession [20][21][22][23][24]. These genome data make genome-wide identification of gene families possible. We conducted genome-wide identification of cotton expansin genes in this paper. This research can provide genome-wide information of cotton expansin genes and promote further investigation of the biological function of expansin genes during cotton fibre development and other developmental processes.

Results
Identification and sequence analysis of the cotton expansin gene family
To identify expansin genes in cotton, we searched the annotation file of the TM-1 genome using expansin as the keyword [24]. As a result, a total of 98 expansin gene ID records were initially obtained and the corresponding protein sequences were extracted. All expansin proteins were submitted to the NCBI website (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) to analyse their conserved domains. The expansin proteins included two conserved domains, DPBB-1 and Pollen_allerg_1. Consequently, 93 expansin genes with both DPBB-1 and Pollen_allerg_1 domains were ultimately identified for further analysis, and each expansin gene ID was named according to nomenclature guidelines [25]. The detailed information is shown in Table S1. The expansin gene family contained four subfamilies, including EXPA, EXPB, EXLA, and EXLB.
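The domain-based filtering step described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in the study: the protein IDs and hit list are invented, the helper names `normalise` and `filter_canonical` are hypothetical, and the sketch assumes that the domain hits have already been parsed into (protein ID, domain name) pairs.

```python
# Domains that a canonical expansin must carry, as stated in the text.
REQUIRED = {"DPBB_1", "Pollen_allerg_1"}


def normalise(domain):
    """Map spellings such as 'DPBB-1' or 'DPBB_1 superfamily' to 'DPBB_1'."""
    name = domain.strip().replace("-", "_")
    suffix = " superfamily"
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


def filter_canonical(hits):
    """Return the sorted protein IDs that have hits to both required domains."""
    domains = {}
    for protein_id, domain in hits:
        domains.setdefault(protein_id, set()).add(normalise(domain))
    return sorted(pid for pid, found in domains.items() if REQUIRED <= found)


# Invented example input; the IDs are placeholders, not real gene IDs.
example_hits = [
    ("prot_1", "DPBB_1"),
    ("prot_1", "Pollen_allerg_1"),
    ("prot_2", "DPBB-1"),  # only one of the two domains, so it is dropped
    ("prot_3", "DPBB_1 superfamily"),
    ("prot_3", "Pollen_allerg_1 superfamily"),
]
print(filter_canonical(example_hits))
```

Treating superfamily hits as equivalent to specific hits follows the wording of the Methods, where each domain is described as including its superfamily.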
To determine the biochemical properties of the cotton expansin family, the isoelectric point (pI), molecular weight (MW), and signal peptide of the expansin proteins were analysed. The pI values of EXPA and EXLA members were above 7.0 except for GhEXPA8b and GhEXPA7d. In contrast, the pI values of EXPBs and EXLBs were below 7.0 except for GhEXPB3a, GhEXPB3b, GhEXPB1a, GhEXPB1b, GhEXLB1d, and GhEXLB1j (Additional file 1: Table S1). The pI values of expansin family members ranged from 4.65 to 12.01, with an average of 8.47. The average MW of expansin family members was 27.42 kD, ranging from 14.29 to 41.53 kD. The length of the expansin protein sequences ranged from 150 amino acids (aa) (GhEXLA17e) to 366 aa (GhEXPA5e), and the signal peptide size ranged from 17 to 35 aa (Additional file 1: Table S1).
To further characterise each cotton expansin protein, the 93 expansin protein sequences were used for multiple sequence alignment (Additional file 3: Figure S1). The results showed that expansins have similar sequence characteristics: the majority of them consist of a signal peptide and conserved domains I and II, which is consistent with a previous study [4]. The amino acid sequence of domain I was more conserved than that of domain II, especially among EXPA members (Additional file 3: Figure S1). Notably, almost all of the EXPAs (excluding GhEXPA13a, GhEXPA13b, and GhEXPA15d) and three EXLA members (GhEXLA17a, GhEXLA17b, and GhEXLA17c) contained a conserved motif (HFD) in domain I (Additional file 3: Figure S1). Members of EXPB, EXLB, and the other six EXLA members did not have the HFD motif. Six EXLA members contained an extra segment at the C-terminus, named the EXLA extension. In addition, a conserved motif named BOX 1 was found in almost all the expansin members (Additional file 3: Figure S1).

Phylogenetic relationships, gene structure and protein motifs of the cotton expansin genes
In order to evaluate the evolutionary relationships of cotton expansins, a phylogenetic tree was inferred. The expansins were divided into four major subfamilies, EXPA, EXPB, EXLA and EXLB [25]. The EXPA subfamily was the largest group with 67 members, and the other subfamilies contained eight (EXPB), six (EXLA), and 12 (EXLB) members. The four expansin subfamilies comprised 15 subgroups (Fig. 1). We discovered that EXPA-IV was the largest subgroup, which included 17 expansin members, and EXPA-VII, EXPA-VIII, and EXPA-IX were the smallest subgroups with only two expansin members each.
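Because the subfamily is part of each gene name (for example GhEXPA4o or GhEXLB1j), subfamily totals such as those above can be tallied directly from a list of names. The sketch below is illustrative only: it assumes names of the form "Gh", subfamily code, number, and an optional lowercase letter, which matches the names quoted in this paper but is an assumption about the full set in Table S1. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed name pattern: Gh + subfamily + number + optional lowercase letter.
NAME_PATTERN = re.compile(r"^Gh(EXPA|EXPB|EXLA|EXLB)\d+[a-z]?$")


def subfamily(gene_name):
    """Return the subfamily code embedded in an expansin gene name."""
    match = NAME_PATTERN.match(gene_name)
    if match is None:
        raise ValueError(f"unrecognised expansin name: {gene_name!r}")
    return match.group(1)


def tally_subfamilies(gene_names):
    """Count how many of the given gene names fall in each subfamily."""
    return Counter(subfamily(name) for name in gene_names)


# A handful of names taken from the text, not the full list of 93 genes.
names = ["GhEXPA4o", "GhEXPA2", "GhEXPB3a", "GhEXLA1c", "GhEXLB1j"]
print(tally_subfamilies(names))
```

Applied to the complete name list, the same tally would be expected to give the 67, eight, six, and 12 members reported for EXPA, EXPB, EXLA, and EXLB.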
Based on the genomic data for the cotton expansin family, gene structure was analysed using the online tool GSDS 2.0 (http://gsds.cbi.pku.edu.cn/index.php) (Fig. 2). The results showed that the gene structure (exon-intron organization) of the expansin members included two to five exons and that members of the same subfamily had similar exon organization (Fig. 2a, b). Most of the EXPA members had three exons (51 of 67 EXPA members). Twelve EXPAs had two exons, and four EXPAs had four exons. All members of the EXPB subfamily had four exons except for GhEXPB1a (five exons). Four EXLA members contained five exons and two members had four exons. EXLB members had four (seven EXLBs) or five exons (five EXLBs). These structural differences might be caused by the insertion or loss of exons in the course of long-term evolution and natural selection. Moreover, the similarity in gene structures also suggests that functional complementation exists among these genes.
In order to identify the conserved motifs in expansin proteins, MEME online software (http://meme-suite.org/index.html) was used to analyse the potential motifs. As a result, a total of ten distinct motifs were identified (Fig. 2c, Additional file 4: Figure S2). The results showed that the motifs of all cotton expansins had unifying features; for example, each expansin protein contained motif 5 and almost all of them contained motif 4 except for GhEXPA15d, GhEXPA17e, and GhEXLB1j. In addition, the type, arrangement, and number of the motifs were similar within the same subfamily. More than half of the EXPA members (38/67) had seven motifs, and 21 members had six motifs. EXPB, EXLA, and EXLB subfamilies possessed similar motif characteristics and most of their members contained five motifs (motifs 4, 5, 7, 8, and 9). GhEXPB2d and GhEXPB3b of the EXPB subfamily included four motifs, and three members of EXLB (GhEXLB1g, GhEXLB1c, and GhEXLB1j) also had four motifs. These results suggested that the EXPB, EXLA, and EXLB subfamilies have close evolutionary relationships. The similarities between gene structures and sequence motifs implied that cotton expansin family genes underwent duplication events over evolutionary time.

Chromosomal location and collinear analysis of the expansin gene family
To determine the chromosomal location of GhEXP genes in G. hirsutum, genomic data from G. hirsutum were used [24]. The physical position of cotton expansin genes was determined using positional information files downloaded from the CottonFGD website (https://cottonfgd.org/). As shown in Fig. 3, the 93 expansin genes were distributed on 24 chromosomes, excluding Ghir_A02 and Ghir_D06. Chromosome Ghir_A05 contained the most expansin genes (eight), whereas Ghir_A06 included only one expansin gene. The numbers of expansin genes located on the other chromosomes ranged from two to seven. In addition, some of the expansin genes were located on the chromosomes in clusters; for example, both Ghir_A08 and Ghir_D08 possessed a gene cluster with four distinct EXLBs (Fig. 3).
These results showed that the expansin genes were unevenly distributed across the chromosomes. Collinearity analysis showed that expansin genes were frequently collinear between the A and D sub-genomes (Fig. 4), which indicates that expansin genes with collinear relationships may have similar functions.
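The per-chromosome tally behind this kind of summary can be sketched as follows. This is an illustration under stated assumptions: the chromosome names follow the Ghir_A01 to Ghir_D13 pattern implied by the names used in the text, the gene-to-chromosome mapping is invented toy data with placeholder gene IDs, and the function name is hypothetical.

```python
from collections import Counter

# Assumed naming of the 26 chromosomes in the A and D sub-genomes.
CHROMOSOMES = [f"Ghir_{sub}{num:02d}" for sub in "AD" for num in range(1, 14)]


def genes_per_chromosome(gene_to_chrom):
    """Return (counts per chromosome, list of chromosomes with no genes)."""
    counts = Counter(gene_to_chrom.values())
    unknown = sorted(set(counts) - set(CHROMOSOMES))
    if unknown:
        raise ValueError(f"unknown chromosome names: {unknown}")
    empty = [chrom for chrom in CHROMOSOMES if chrom not in counts]
    return counts, empty


# Invented toy mapping; these are not the real gene positions.
toy_mapping = {
    "gene_1": "Ghir_A05",
    "gene_2": "Ghir_A05",
    "gene_3": "Ghir_D08",
}
counts, empty = genes_per_chromosome(toy_mapping)
print(counts.most_common(1), len(empty))
```

With the real positional file as input, the `empty` list would be expected to contain only Ghir_A02 and Ghir_D06.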

Expression patterns of the expansin gene in cotton fibre
To comprehensively investigate the temporal expression patterns of the cotton expansin gene family, fibre samples of different developmental stages were used for transcriptome analysis. A heat map was constructed with these transcriptome data ( (GhExp2) have also been reported to be highly expressed in cotton fibre [15].

qRT-PCR analysis of the special expansin genes in cotton fibres
In order to further identify the key expansin genes involved in fibre cell growth, 14 expansin genes that are predominantly expressed at different stages of cotton fibre development were selected, and their expression levels were verified by qRT-PCR.
These expansin genes were evidently up-regulated at the initiation, elongation, or transition stages. In particular, GhEXPA4o, GhEXPA1a, and GhEXPA8h were predominantly expressed at 0 DPA (Fig. 6a), suggesting that these three genes may function in the initiation stage of fibre cells.
Fibre elongation is a critical stage of cotton fibre cell development. Nine expansin genes showed higher expression levels at the fibre elongation stages, with distinct expression characteristics (Fig. 6b). The expression level of GhEXPA4a reached a peak at 3 DPA, and GhEXPA13a and GhEXPA4f peaked at 5 DPA. The expression levels of GhEXPA4q, GhEXPA8f, and GhEXPA2 were the highest at 7 DPA, and GhEXPA8g, GhEXPA8a, and GhEXPA4n peaked at 10 DPA (Fig. 6b). GhEXPA4f and GhEXPA2 are homologous genes in allotetraploid cotton species that are respectively located in the A and D sub-genomes of the 10th chromosomes, and both genes have specific expression in cotton fibre cells [15].
In addition, GhEXPA8a and GhEXPA8g are two important genes that we identified during cotton fibre elongation; their homologues can promote hypocotyl elongation in Arabidopsis [26]. These results revealed that the expression peaks of the majority of genes appeared from 7 to 10 DPA, which is usually called the fast elongation stage.
Moreover, we obtained two expansin genes that were predominantly expressed at the transition stage, named GhEXLA1c and GhEXLA1f (Fig. 6c). Both of these genes belong to the EXLA subfamily, whose biological roles are unclear. The expression levels of GhEXLA1c and GhEXLA1f were the highest at 20 DPA, which is the transition stage of fibre cells from fast elongation to secondary cell wall synthesis. These data suggest that these two genes may act in the transition stage and help prepare fibre cells for cellulose synthesis during secondary wall thickening.
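Assigning each gene to the stage at which its expression peaks, as done in this section, amounts to taking the time point with the maximum value in its profile. The sketch below illustrates the idea only: the expression values are invented, the gene labels are placeholders, the function names are hypothetical, and the stage map simply restates the time points named in the text (0 DPA for initiation, 3 to 10 DPA for elongation, 20 DPA for transition).

```python
# Time points (days post anthesis, DPA) named in the text, mapped to stages.
STAGE_BY_DPA = {
    0: "initiation",
    3: "elongation",
    5: "elongation",
    7: "elongation",
    10: "elongation",
    20: "transition",
}


def peak_dpa(profile):
    """Return the DPA at which a {DPA: expression} profile is highest."""
    if not profile:
        raise ValueError("empty expression profile")
    return max(profile, key=profile.get)


def peak_stage(profile):
    """Return the developmental stage that contains the expression peak."""
    return STAGE_BY_DPA[peak_dpa(profile)]


# Invented relative expression values for two placeholder genes.
profiles = {
    "gene_x": {0: 1.0, 3: 2.5, 5: 6.1, 7: 4.0, 10: 3.2, 20: 0.4},
    "gene_y": {0: 0.2, 3: 0.3, 5: 0.5, 7: 0.9, 10: 1.4, 20: 5.8},
}
for gene, profile in profiles.items():
    print(gene, peak_dpa(profile), peak_stage(profile))
```

A single maximum is a coarse summary; it ignores replicate variation, so in practice it would be read alongside the full profiles in Fig. 6.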

Discussion
Cotton is one of the world's most important economic crops, as well as a powerful single-cell elongation model plant [18]. Cotton fibre is the main raw material of the textile industry, and its developmental process has been studied for many years. This article focused on expansins, which are cell wall loosening proteins that function in cotton fibre development. Many expansin gene families have been identified from eudicots and monocotyledons, including Arabidopsis, grape, jujube, Chinese cabbage, maize, and rice [4,27,28,29,30]. However, expansins have rarely been analysed in cotton, except for a few studies on the expression levels of GhEXP genes, transcriptional regulation, and genetic transformation [10,15,16,17]. In recent years, the completion of the cotton genomic sequence through multiple sequencing technologies has provided powerful data support for analysing gene families [22,24,31,32], which will greatly facilitate the identification and analysis of cotton gene families at the whole genome scale.
In this paper, we present the first report of the expansin gene family from upland cotton, which included 93 members. The sequences of the 93 expansins all included two conserved domains, DPBB_1 and Pollen_allerg_1. Phylogenetic analysis revealed that the 93 cotton expansins were divided into 15 subgroups of four subfamilies (Fig. 1). The number of expansin subgroups was consistent with the number of expansin ancestors, estimated at 15 to 17 expansin genes, each of which evolved into an extant clade in the phylogenetic tree [33]. Thus, we speculate that each clade of the existing expansin family may have expanded from its own ancestral gene.
In addition, this study showed that the EXPA subfamily genes in cotton were significantly expanded, including 67 total EXPAs (Fig. 1) [4,29], with the most significant difference that the EXPBs are more numerous in monocotyledons than eudicots [1]. These results provide significant insights into the evolution and functions of expansin genes in cotton.
According to the chromosomal location information, the 93 expansin genes were unevenly distributed across the chromosomes of the A and D sub-genomes, and no expansin genes were found on Ghir_A02 or Ghir_D06 (Fig. 3). This uneven distribution on chromosomes is common in both monocotyledonous and dicotyledonous plants [29,34,35], and there are certain tendencies of distribution on different chromosomes. For example, several EXLB genes had a clustered distribution in Ghir_A08 and Ghir_D08, and expansin genes from the same subfamily were distributed on the majority of chromosomes, such as EXPAs in Ghir_A05 (Fig. 3). Previous research has shown that tandem duplications and segmental duplications can affect major variation in family size and the distribution of most gene families; nevertheless, counts of tandem and segmental duplications were negatively correlated, and no families exhibited high levels of both tandem and segmental duplication [36].
The uneven chromosomal distribution of cotton expansin genes suggested that segmental duplication events may play an important role in the course of expansin evolution. Some expansin genes are clustered on chromosomes, which may reveal the existence of tandem duplication of expansin genes during expansin evolution. However, the fact that only a few EXLB genes are clustered suggests that tandem duplication is not the main reason for expansin superfamily evolution, but that segmental duplication is the main evolutionary force. Abundant collinearity relationships of expansin genes were present in cotton within the A or D sub-genomes and between the A and D sub-genomes; these relationships revealed possible homologous gene pairs with similar functions. In the same subfamily category and even subgroup, most members had almost the same conserved gene structure and motif distribution (Fig. 2 b, c), thus further confirming their close evolutionary relationships and phylogenetic classification [37].
Gene expression patterns can provide important clues to gene function in the growth and development of plants. Transcriptome analysis showed that cotton expansin genes participate widely in different fibre development stages (Fig. 5). From the transcriptome data, we obtained 14 genes that were predominantly expressed at distinct stages of cotton fibre development (Fig. 6). Three of these genes, GhEXPA4o, GhEXPA1a, and GhEXPA8h, had higher expression in the initial stage of fibre development (Fig. 6a) and are reported here for the first time in the early phase of fibre development, with high expression levels confirmed by qPCR; however, their role in the initial stages of fibre development still needs to be clarified. In addition, we obtained nine expansin genes with higher expression levels in the elongation stages (Fig. 6b) and two expansin genes that were predominantly expressed at the transition stage (Fig. 6c).
The expression pattern of GhEXPA4f was highly consistent with the report of Harmer et al., in which GhExp1 transcripts were highly abundant in the fibre [15], and its homologous gene GhEXPA2 showed a similar expression pattern, which was essentially identical to that of GhExp2 [15]. Overexpression of GhEXPA1 (named GhEXPA4f in this study according to the nomenclature guidelines) increased upland cotton fibre length [17]. Our qRT-PCR results showed that this gene was highly expressed in the fast elongation stages (Fig. 6b), further supporting the importance of this expansin gene in cotton development. Moreover, the homologous gene of GhEXPA4f, referred to as GhEXPA2, was located in the A subgenome. Sequence alignment showed that GhEXPA2 corresponds to the gene named GhEXPA8 by Bajwa et al. [10], and transgenic cotton plants expressing GhEXPA8 showed significantly improved fibre length in field experiments [10].
In addition, the other seven new expansin genes we identified were predominantly expressed in the elongation stages of cotton fibre development; their roles in promoting fibre elongation need further study. More importantly, we found that GhEXPA8a and GhEXPA8g are homologous with AtEXP8 in Arabidopsis thaliana, which can promote hypocotyl elongation [26]. This result implies that GhEXPA8a and GhEXPA8g may promote fibre elongation in cotton. The above results showed that the expansin genes with higher expression levels in the initial and elongation stages of fibre development were all members of the EXPA subfamily. This may be because the EXPA subfamily accounts for most members of the expansin family (67/93); moreover, these data also suggest that expansin genes of the EXPA subfamily are important in the first two stages of cotton fibre development.
EXLA and EXLB are two smaller expansin subfamilies. Phylogenetic analysis shows that these proteins constitute separate and well-resolved groups; however, their biological functions are uncertain [1]. In particular, there are relatively few studies on EXLA functions besides AtEXLA2 in Arabidopsis thaliana. AtEXLA2 was reported to have obvious expression in both the hypocotyl and root; over-expression of AtEXLA2 resulted in slightly thicker walls in non-rapidly elongating etiolated hypocotyl cells [38]. In this paper, we found two EXLA genes, referred to as GhEXLA1c and GhEXLA1f, with higher expression at 20 DPA (Fig. 6c). Interestingly, this phase is called the fibre development transition stage [18], and transitional cell wall remodelling is a distinct, stable developmental stage lasting at least four days (18 to 21 DPA) [39]. In addition, it was reported that an expansin-like protein from Hahella chejuensis could bind cellulose and enhance cellulase activity [40]. These results imply that EXLA subfamily genes may function in facilitating the transition of fibre development from elongation to secondary cell wall synthesis. However, the detailed biological functions of EXLAs in cotton fibre development remain to be assessed; more research is needed to understand and make use of EXLAs in the cotton transition stage, both in theory and in practice.

Conclusions
Overall, we successfully performed a genome-scale analysis of the expansin family genes in upland cotton species with a special emphasis on fibre development. A total of 93 cotton expansin genes were obtained. Our analysis has provided information for understanding the cotton expansin superfamily, including gene evolution, gene structure, protein motifs, collinear relationships, and gene expression patterns. Moreover, we obtained expression patterns of 14 expansin genes in cotton fibre development at different stages. Among them, three genes were highly expressed in the initiation stage, nine genes had high-level expression during the fast elongation stage, while GhEXLA1c and GhEXLA1f were preferentially expressed in the transition stage of fibre development.
The results will lay the foundation for further clarification of the biological functions of expansin genes and the molecular mechanism of many important agricultural traits in cotton, especially at the elongation stage of cotton fibre development.

Methods
Identification and sequence analysis of the cotton expansin genes
The cotton expansin gene sequences were obtained from the new cotton genome [24] (http://cotton.hzau.edu.cn/EN/download.php). Expansin was used as the keyword for retrieval from the genome annotation file named Ghirsutum_Integrated_Function Annotation; we then extracted the corresponding expansin gene IDs. According to these expansin gene IDs, all expansin protein sequences were obtained from another file named Ghirsutum_gene_peptide by local BLAST. Then, all expansin protein sequences were submitted to the NCBI CDD (conserved domain database) (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi), where conserved domains were identified. To speed up the search, we used the Batch web CD-search tool (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi), where the maximal number of protein queries per request is 4000, which was adequate for our purposes. We executed the search program using default parameters. The canonical expansin protein contains both conserved domains: DPBB-1 (including DPBB_1 superfamily) and Pollen_allerg_1 (including Pollen_allerg_1 superfamily). We acquired the final gene sequences of the upland cotton expansin family for further analysis.
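The keyword retrieval step can be sketched as a simple case-insensitive text search. The layout of the annotation file is not described here, so the sketch assumes a tab-delimited file with the gene ID in the first column and a free-text description after it; that layout, the example lines, the gene IDs, and the function name are all hypothetical.

```python
def keyword_hits(annotation_lines, keyword="expansin"):
    """Return gene IDs whose description mentions the keyword, in file order."""
    needle = keyword.lower()
    hits = []
    for line in annotation_lines:
        # Assumed layout: gene ID, a tab, then the functional description.
        gene_id, _, description = line.rstrip("\n").partition("\t")
        if needle in description.lower() and gene_id not in hits:
            hits.append(gene_id)
    return hits


# Invented annotation lines; the IDs are placeholders, not real gene IDs.
example_lines = [
    "gene_0001\tExpansin-A4\n",
    "gene_0002\tCellulose synthase\n",
    "gene_0003\tputative expansin-like B1\n",
    "gene_0001\tExpansin-A4\n",  # a duplicate record is reported only once
]
print(keyword_hits(example_lines))
```

A keyword search of this kind is deliberately permissive, which is why the candidate list is subsequently narrowed by the conserved-domain check.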

Plant materials
Using the ExPASy online tools [41] (https://www.expasy.org/resources/), we analysed the molecular properties of the identified expansin proteins, including the molecular weight (MW) and isoelectric point (pI), and predicted their signal peptide sequences with the SignalP 5.0 Server (http://www.cbs.dtu.dk/services/SignalP/). Sequence alignment of the expansin protein sequences was performed in Vector NTI Advance 11 software (version 11.5), followed by a search for conserved amino acids and conserved domain properties of the expansin proteins.
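For readers who want to see what an MW estimate involves, the calculation can be approximated by summing average residue masses and adding one water molecule for the free termini. The sketch below is not the ExPASy implementation: the residue masses are commonly tabulated average values entered by hand, the function name is hypothetical, and pI is not computed because it depends on the choice of pKa set.

```python
# Commonly tabulated average residue masses in daltons (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # added once for the free N- and C-termini


def molecular_weight(sequence):
    """Estimate the average molecular weight (Da) of a protein sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    try:
        return sum(RESIDUE_MASS[residue] for residue in seq) + WATER
    except KeyError as err:
        raise ValueError(f"unsupported residue: {err.args[0]}") from None


# Glycylglycine as a small sanity check; divide by 1000 to express kD.
print(round(molecular_weight("GG"), 2))
```

The estimate depends on whether the signal peptide is included in the input sequence, so the two cases should not be mixed when comparing proteins.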

Phylogenetic tree construction
To analyse phylogenetic relationships, Arabidopsis thaliana (A. thaliana) expansin protein sequences were downloaded from TAIR (https://www.arabidopsis.org/) and EXPANSIN CENTRAL (http://www.personal.psu.edu/fsl/ExpCentral/). Multiple sequence alignment of the identified cotton expansin and A. thaliana expansin proteins was executed in MEGA software (version 6.0) [42], and a phylogenetic tree was constructed in the same software, using the neighbour-joining method. The number of bootstrap replications was 1000, and the rest of the parameters were set as the defaults.

Analysis of expansin gene structures and motifs
Gene structures were analysed to identify exons, introns, and UTRs. The GFF data for the identified expansin gene IDs were extracted from the GFF file named Ghirsutum_gene_model in the new cotton genome data [24] (http://cotton.hzau.edu.cn/EN/download.php), and the expansin GFF data were then analysed using the online tool GSDS (version 2.0, http://gsds.cbi.pku.edu.cn/) [43]; the results were saved in SVG image format. Motifs of expansin protein sequences were analysed using the online tool MEME (http://meme-suite.org/index.html) [44]. We submitted the expansin sequences to the online tool in the required file format. The maximum number of motifs was set to 10, the repeat number was set to 0 or 1, and the remainder of the parameters were set to system defaults. The output draft images of gene structure and motifs were further modified with the Adobe Illustrator CS3 software (version 13.0.0).
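Exon numbers like those reported in the Results can be checked independently of GSDS by counting exon features in the GFF data. The sketch below assumes GFF3-style lines with nine tab-separated columns in which each exon carries a `Parent=` attribute naming its transcript; the actual layout of Ghirsutum_gene_model may differ, and the example lines, IDs, and function names are invented.

```python
from collections import Counter


def parse_attributes(field):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dictionary."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def exon_counts(gff_lines):
    """Count exon features per parent transcript."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment or directive lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "exon":
            continue
        parent = parse_attributes(columns[8]).get("Parent")
        if parent:
            counts[parent] += 1
    return counts


# Invented GFF3 lines for one placeholder transcript with three exons.
example_gff = [
    "##gff-version 3\n",
    "Ghir_A05\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1\n",
    "Ghir_A05\tsrc\texon\t100\t200\t.\t+\t.\tID=tx1.e1;Parent=tx1\n",
    "Ghir_A05\tsrc\texon\t300\t450\t.\t+\t.\tID=tx1.e2;Parent=tx1\n",
    "Ghir_A05\tsrc\texon\t600\t900\t.\t+\t.\tID=tx1.e3;Parent=tx1\n",
]
print(exon_counts(example_gff))
```

Counting per transcript rather than per gene avoids double counting when a gene has more than one annotated transcript.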

Chromosomal locations and collinearity relationships of expansin genes
We obtained the length of each chromosome from the new genome data [24] and compiled a file of the lengths of all TM-1 chromosomes. The positional information of the expansin genes on the chromosomes was then extracted from the GFF file named Ghirsutum_gene_model in the new cotton genome data (http://cotton.hzau.edu.cn/EN/download.php) [24], giving a second file of expansin gene positions. Afterwards, the two files were submitted to the online tool MG2C (http://mg2c.iask.in/mg2c_v2.0/) for analysis of expansin gene locations on chromosomes. Collinearity analysis of cotton expansin genes was performed with the MCScanX software [45], and the results were visualized using Circos software [46]. The results were exported in SVG format, and the SVG image was further modified with the Adobe Illustrator CS3 software (version 13.0.0).