Comparative analysis of de novo genomes reveals dynamic intra-species divergence of NLRs in pepper

Peppers (Capsicum annuum L.) containing distinct capsaicinoids are the most widely cultivated spices in the world. However, extreme genomic diversity among species represents an obstacle to breeding pepper. Here, we report de novo genome assemblies of Capsicum annuum ‘Early Calwonder (non-pungent, ECW)’ and ‘Small Fruit (pungent, SF)’ along with their annotations. In total, we assembled 2.9 Gb of ECW and SF genome sequences, representing over 91% of the estimated genome sizes. Structural and functional annotation of the two pepper genomes generated about 35,000 protein-coding genes each, of which 93% were assigned putative functions. Comparison between newly and publicly available pepper gene annotations revealed both shared and specific gene content. In addition, a comprehensive analysis of nucleotide-binding and leucine-rich repeat (NLR) genes through whole-genome alignment identified five significant regions of NLR copy number variation (CNV). Detailed comparisons of those regions revealed that these CNVs were generated by intra-specific genomic variations that accelerated diversification of NLRs among peppers. Our analyses unveil an evolutionary mechanism responsible for generating CNVs of NLRs among pepper accessions, and provide novel genomic resources for functional genomics and molecular breeding of disease resistance in Capsicum species.

sequencing and assembly of plant genomes. To date, hundreds of plant genomes have been published, and these sequences represent essential resources for breeding research [9,10]. Specifically, population genetic studies have been conducted using those reference genomes, along with resequencing data, to identify genomic variations associated with important agronomic traits [11,12]. However, due to extreme intra-genomic variations, one reference genome cannot represent the whole-gene repertoire of a plant species [9,13]. To overcome such limitations, pan-genome projects have been conducted in major crops such as rice, maize, and tomato, with the goal of constructing integrated genome sequences that represent the whole-gene repertoires of the target species [14][15][16]. These pan-genome analyses using multiple genome resources could provide a platform for plant breeding-associated agronomic traits such as resistance to biotic and abiotic stresses [13].
Nucleotide-binding, leucine-rich repeat genes (NLRs) have been rapidly amplified and diversified in plants. The domain architecture of NLRs was classified into three main components: An N-terminal domain including a toll/interleukin-1 receptor homology, a coiled-coil, or a resistance to powdery mildew 8, central nucleotide-binding (NB-ARC) domain, and a C-terminal domain including leucine-rich repeat. In particular, the conserved NB-ARC domain is mostly used for defining NLRs [17]. Extensive intra-and inter-species comparisons have been performed on NLRs [18,19]. For example, a species-wide study in 64 Arabidopsis thaliana accessions, termed the pan-NLRome, revealed the process of NLR evolution, including the diversification of NLR domain architectures and their specific selection patterns within the species [18]. Although pepper has a large expanded pool of NLRs [6], the complexity and variation of these genes within the species have not been previously examined.
Here, we report genome assemblies and annotations for two C. annuum accessions: the sweet bell pepper 'Early Calwonder (ECW)' and the hot chili pepper 'Small Fruit (SF)' . Comparative analyses of newly assembled and publicly available pepper genomes revealed the evolutionary relationships and genomic variations among pepper accessions. We also re-annotated NLRs and constructed a physical NLR map, based on the reference pepper genome 'Criollo de Morelos-334' (CM334), with the two assembled genomes from this study (ECW and SF) and other publicly available C. annuum accessions (Zunla-1 and Chiltepin). We identified genomic regions in the NLR map where intra-specific repertoires of NLR genes exhibited significant copy number variations (CNVs) among pepper accessions. Extensive comparison of those regions indicated that CNVs of NLRs could have arisen by accumulation of intra-specific sequence mutations in pepper genomes. The newly constructed genome assemblies and annotations, along with the NLR map, provide a valuable resource for functional genomics and molecular breeding of disease resistance in Capsicum spp.

Genome sequencing, assembly, and annotation
Two pepper genomes were assembled and annotated by the described pipeline ( Supplementary Fig. 1). Using the Illumina HiSeq X-ten and NovaSeq 6000 platforms, we generated 460.  (Table 1). We detected 1,323 (96.2%) and 1,316 (95.7%) conserved single-copy orthologs in genome assemblies of ECW and SF using BUSCO [20], respectively, indicating equivalent assembly quality compared to other pepper genomes (Supplementary Table 4).
Gene annotation predicted 35,355 and 35,158 proteincoding genes in ECW and SF, respectively (Table 1). Of those, 32,983 (ECW, 93.3%) and 32,838 (SF, 93.4%) have been assigned putative functional descriptions in public databases (Table 1 and Supplementary Table 5). Comparison of the genes in the ECW and SF genomes with publicly available pepper gene annotations revealed that the lengths of annotated genes were similar among all pepper genomes ( Supplementary Fig. 3). Validation of annotated genes using BUSCO detected 1,254 (91.2%) and 1,269 (92.3%) conserved single-copy orthologs in ECW and SF, respectively (Table 1 and Supplementary Table 4). Given the similar gene structure and validation of genome assemblies and annotated genes relative to publicly available pepper genomes, these results indicate that our assemblies and annotations of ECW and SF are reliable.
Repeat analysis revealed that 2.64 (ECW, 84.1%) and 2.63 Gb (SF, 82.7%) were annotated as repeat sequences, whereas only 1-2% of the assembled genomes were assigned as genes. LTR/Gypsy elements represented 68.8% of all annotated repeat types (Supplementary Table 6), consistent with the repeat contents of other pepper genomes [3].

Clustering and phylogenetic analyses of annotated genes
Gene annotations from ECW, SF, and five publicly available Capsicum genomes (CM334, Zunla-1, and Chiltepin in C. annuum; PI159236 in C. chinense; and PBC81 in C. baccatum) were clustered into 35,037 groups. Subsequently, we classified the genes as single-copy (cluster containing one gene in all species), multi-copy (cluster containing more than one gene in all species), or other, based on the number of genes in each group ( Fig. 1a and Supplementary Table 7). A total of 11,419 (32.6%) groups contained single-copy orthologs (Fig. 1a). ECW and SF had the smallest numbers of unclustered genes (541 [1.5%] and 627 [1.8%] genes, respectively), indicating that most of the protein sequences were very similar to those of other pepper gene annotations (Fig. 1a).
To further verify the evolutionary relationships, we constructed a phylogenetic tree using concatenation of the single-copy orthologs from pepper annotations (Fig. 1b). The four accessions in C. annuum, ECW, SF, Zunla-1, and Chiltepin, were more closely related to one another than to CM334 (Fig. 1b). A closer look at the gene clusters of C. annuum revealed that 108,533 genes (approximately 21,700 genes [61%] in each genome) were shared among 17,202 clusters (Fig. 1c). In the top 20 functional domain descriptions of genes in the core cluster, domains involved in defenses against pathogen and developmental functions were predominant (Supplementary Fig. 4).
On the other hand, ECW and SF contained 750 (2.1%) and 873 (2.5%) genes, respectively, which did not cluster with the other three accessions (Fig. 1c-d). Among them, genes containing NB-ARC, leucine-rich repeat, and protein kinase domains were most abundant ( Fig. 1d and Supplementary Table 8). Gene ontology (GO) analyses of ECW and SF specific genes also demonstrated enrichment of disease resistance-related proteins such as late blight and TMV resistance proteins and LRR receptorlike kinases ( Supplementary Fig. 5). Moreover, 2,204 (6.1%), 1,672 (4.7%), and 2,652 (7.7%) genes in CM334, Zunla-1, and Chiltepin, respectively, were not grouped with any accessions and also contained large numbers of defense-related domains against pathogens (Supplementary Fig. 6). Taken together, these observations indicate that specific gene repertoires, including disease resistance genes such as NLRs, have been dynamically changed among pepper genomes.

Identification and classification of NLRs
To elucidate NLR repertoires, we re-annotated NLRs in five pepper genomes (Fig. 2, Supplementary Table 9, and  Supplementary Table 10). Between 760 (in ECW) and 972 NLRs (in Chiltepin) were identified (Supplemental Table 10). Among them, ECW and SF had the smallest numbers of NLRs, with 760 and 761, respectively, whereas CM334 and Chiltepin had the largest number of NLRs, with 951 and 972, respectively. Subsequently, we constructed a phylogenetic tree of NLRs and determined their subgroups. The analyses also revealed CNV of NLR subgroups among pepper accessions. In particular, G1 and G2, the largest subgroups in pepper NLRs [21], exhibited variable copy numbers among all accessions (Supplemental Table 10). On the other hand, GT, which includes TIR-NLR (TNL), and G10, referred to as the ancient and autonomous NLR (ANL) [22], had a moderate number of NLR subgroups and moderate CNVs. Moreover, GR, which includes RPW8-type helper NLR (RNL), and G8, which contains the NLRs required for cell death (NRC) helper group [23], had small numbers of NLRs and the fewest CNVs. Collectively, these results suggest that there are CNVs between accessions within the same NLR groups as well as between groups within a species. Table 1 Summary of genome assembly, gene annotation, and BUSCO validation a The genome and annotation described in Kim et al. [3,6] b The genome and annotation described in Qin et al. [4]

Intra-specific diversification of NLRs in pepper accessions
To detect accurate ortholog relationships of individual NLR genes among pepper accessions, we constructed an NLR map based on the CM334 reference genome and four other accessions (ECW, SF, Zunla-1, and Chiltepin) ( Fig. 3 and Supplementary Table 11). As a result of this analysis, a total of 4,278 NLRs (98.2% of a total of 4,357) were assigned to the NLR map. The number of core NLR orthologs annotated in all accessions was 1,955 (391 pairs, 44.9%) (Fig. 3a). In addition to the core NLRs, a total of 1,670 dispensable NLRs were shared between two or more accessions (555 pairs, 38.3%), and the numbers of specific NLRs existing in only one accession were 161 (3.7%), 43 (1.0%), 44 (1.0%), 160 (3.7%), and 245 (5.6%) for CM334, ECW, SF, Zunla-1, and Chiltepin, respectively (Fig. 3b). The chromosomal distribution of NLRs based on the CM334 reference genome revealed that NLRs, including functional resistance genes, were enriched at both ends of chromosomes, and that subgroups were located on specific chromosomes (Fig. 3c). For example, NLRs in G1 and G2 were enriched at chromosomes 5 and 9, respectively (Supplementary Table 12).
To identify CNV regions in which NLRs were not evenly distributed among pepper accessions, we performed a chi-square test on the number of NLRs in physical clusters where intergenic region of NLRs was less than 50 kb. Two chromosomal regions and three scaffolds exhibited significant CNVs in NLR (Supplementary  accession. By contrast, we identified only one NLR in the SF genome and seven out of 12 NLRs in Chiltepin with orthologous genes in the same region (Fig. 4a). These results indicate that extreme CNVs of NLRs could be the result of low NLR copy numbers in the SF and Chiltepin genomes.
Specifically, CA.PGAv.1.6.scaffold1090.36 in CM334, ECW.scaffold2598.10 in ECW, Chr09.76 in Zunla-1, and Chr09.55 in Chiltepin did not match the NLRs in SF because the corresponding region was omitted during SF genome assembly. The NLRs of CA.PGAv.1.6.scaffold1090.36 and ECW.scaffold2598.10 also did not match the NLRs of Chr09.76 in Zunla-1 and Chr09.55 in Chiltepin because large insertions of 7,841 bp and 8,113 bp were located in Zunla-1 and Chiltepin genomes, respectively. Consequently, orthologous relationship between those genes were broken (Fig. 4b). When we mapped CA.PGAv.1.6.scaffold1090.36 in CM334 and ECW.scaffold2598.10 in ECW to the Zunla-1 and Chiltepin genomic regions, we identified early stop codons due to point mutation that generated abnormal termination of translation (Fig. 4b). In another case, we identified 14 stop codons in CM334 and ECW genomic regions corresponding to Chr09.70 in Zunla-1 (Fig. 4c). These stop codons were the consequence of both point mutations and frameshifts resulting from a combination of insertions or deletions (InDels). Mapping the Chr09.47 protein of Chiltepin to similar genomic regions in CM334 and ECW revealed six stop codons resulting from point mutations and insertions. Instead, Fig. 2 Phylogenetic relationship and classification of NLRs in C. annuum. The maximum-likelihood phylogenetic tree of NLRs from five pepper genomes was reconstructed using intact NB-ARC domains. Subgroups were assigned using ultra-fast bootstrap (UFBoot) and previous classification information [6,21]. The branch between GR and GT subgroups was used as a root. NRC helper-dependent groups were marked as a black outline. Red diamonds at the nodes represent UFBoot values above 90 an NLR in each of the genomes were annotated between the regions where the Chr09.70 and Chr09.47 proteins were mapped. When Chr09.70 and Chr09.47 were mapped to each genome, we identified three and six stop codons, respectively, resulting from point mutations. PCR amplification and Sanger sequencing also confirmed that small variation-mediated early stop codons resulted in truncated proteins of NLRs (Supplementary Fig. 7). Therefore, small genomic variations such as point mutation and InDels caused changes in the NLR repertoire.
Taken together, these results suggest that small sequence variations mediate the CNVs of NLRs within a species.

Discussion
Recently, plant pan-genomes have replaced the role of reference genomes [13]. However, the limited number of high-quality de novo genome assemblies, especially for large plant genomes, hinders the implementation of their pan-genome studies. Specifically, the pan-genome of pepper constructed based on low-depth sequencing The distribution and orthologous relationship of core, dispensable, and specific NLRs among pepper genomes. c Chromosomal distribution of NLRs per 1 Mb window in CM334 genome as a reference. A-E: The heatmap represents NLR copy numbers of CM334, ECW, SF, Zunla-1, and Chiltepin, respectively, in each window. F: The enriched NLR subgroups in each window were marked as a rectangle with group colors. The subgroups G1 and G2 were marked with black borders. The two significant CNV regions located on chromosomes are marked with red asterisks approaches [24] is still limited in understanding genomic diversity among pepper genomes. Here, we presented the genome assemblies with annotations of two pepper accessions, ECW and SF, containing large contig N50 of over 100 kb via short-read sequencing ( Table 1). The assessments of genome assemblies and annotations using BUSCO revealed that quality of two de novo genome assemblies and annotations are adequate for comparing to the other publicly available genomes for further analyses (Supplementary Table 4). The phylogenetic tree from single-copy orthologous genes was slightly different from a previous study [4] probably because of the application of different methodology such as maximum-likelihood and neighbor-joining methods resulted in different topology (Fig. 1b). Nevertheless, the closest relationship between Zunla-1 and its wild progenitor Chiltepin suggested that our phylogenetic tree represents reasonable topology. These results indicate that these two newly assembled and annotated genomes are of sufficient enough quality to compare gene repertoires with other genome assemblies and construct chromosome level assembly for pan-genome in C. annuum.
In general, annotation bias could be generated by different annotation methods and resources and prevent accurate comparative analyses of genes. In this study, re-annotation of NLRs was performed using the same method [25] with same protein sequences and finally could provide improved NLR resources in the five pepper genomes. Phylogenetic and comparative genomic analyses suggested that the G1 and G2 subgroups, which had been greatly duplicated in pepper [21], were diversified with large CNVs, not only after speciation but also after divergence within species (Supplementary Table 9 and 10). Conversely, the GR and G8 subgroups  Table 9 and 10). Because these groups contain helper NLRs that interact with multiple sensor NLRs to recognize pathogen effectors and mediate immune signaling [23], the evolution of NLRs in these groups may be more stringently regulated and conserved. Furthermore, we constructed an NLR map using whole-genome alignment to accurately predict the orthologous relationship of NLRs in C. annuum (Fig. 3). Comparative analysis based on the NLR map identified regions in which CNVs of NLRs differed significantly in C. annuum. This phenomenon has also been observed in other species, including Arabidopsis and tomato [18,19]. However, the detailed comparison of the CNV region revealed that truncated protein structures of NLRs mediated CNVs due to genomic sequence variations such as InDels and other mutations (Fig. 4). These results indicate that small genomic variations are crucial to the evolutionary process for NLR diversification.
We identified five statistically significant CNV regions which apparently appears to be a small number compared to more significant CNVs detected in the NLR groups based on the phylogenetic tree (Supplementary Table 10 and 13). This was due to the limited number of pepper genomes used to construct the NLR map. Nevertheless, analysis of NLR gene family integrated with phylogeny, synteny, and statistical test could provide comprehensive understanding of NLR diversity. Recently, a pan-genome analysis using 14 multiple reference genomes and 100 diverse lines in tomato elucidated the relationship between the number of copies and functional variations of genes such as NON-SMOKY GLYCO-SYL TRANSFERASE1 (NSGT1) and NSGT2 [16]. This suggests that comprehensive analyses of NLRs combined with multiple strategies and more genome assemblies could detect more CNVs and elucidate the evolutionary and functional mechanisms of NLRs associated with genomic variation.

Conclusion
In conclusion, we assembled and annotated two pepper genomes and construction of NLR map in pepper with publicly available genomes revealed the CNVs of NLR gene family. These two new pepper genome assemblies, annotations, and the NLR map represent a valuable resource for identification of functional disease resistance genes, as well as for studying the evolutionary mechanisms of disease resistance in genus Capsicum.

DNA extraction and sequencing
Since the pepper reference genome (CM334) is a landrace close to a wild species, two pepper accessions were selected to generate basic resources for pan-genome analysis. The accessions used in this study were 'Early Calwonder (ECW, IT158295)' , a non-pungent, bell-shaped pepper, and 'Small Fruit (SF, IT218615)' , a pepper with a high content of capsaicinoids. In addition, the cultivar ECW is known to be susceptible to the bacterial spot pathogens (Xanthomonas spp.) and used as near-isogenic lines for bacterial spot resistance genes (Bs1, Bs2, Bs3,  Bs4, and bs6) [26,27]. Both were obtained from the RDA-Genebank Information Center of National Agrobiodiversity Center (NAAS, RDA, Republic of Korea). The plants were grown at 24 °C under a 16/8 h light/dark cycle in an environmentally controlled growth chamber. Leaves from 3-week-old plants were frozen immediately in liquid nitrogen for isolation of genomic DNA. Genomic DNA (gDNA) with high molecular weight was extracted from frozen leaves, and the quality of gDNA was confirmed by spectrophotometric analysis (DS-11 Spectrophotometer; DeNovix Inc.) and agarose gel electrophoresis (1.0% w/v agarose TAE 1X gel containing 1X EcoDye; BIO-FACT, Daejeon, Korea). The paired-end (PE) and matepair (MP) libraries for NGS were constructed using the TruSeq DNA Nano Kit for 350-, 550-, and 600-800 bp insert sizes and the Nextera Mate Pair Kit for 2-and 5-kb insert sizes, respectively (Illumina, San Diego, CA, USA). The quality of each library was validated by qPCR. PE and MP libraries were sequenced with the HiSeqX-ten and NovaSeq6000 sequencing platforms, respectively (Illumina).
After gene annotation was completed, repeat annotation was performed for both pepper genomes by Repeat-Masker v4.0.3 (http:// www. repea tmask er. org) with default options and the pepper genome repeat library constructed in the previous study [6].

Identification and classification of NLR genes
To identify additional NLRs, we re-annotated each pepper genome assembly using TGFam-Finder v1.20 with default parameter [25]. Briefly, we used the same genome assemblies and annotations described above for C. annuum (CM334, ECW, SF, Zunla-1, and Chiltepin) and searched domains. After six-frame translation of the genomes, the target regions containing NB-ARC domain (PF00931) with 100 kb flanking sequence were searched. The NLRs of 50 plants used by Kim et al. [25], and each pepper annotation containing NB-ARC domains were used as resource proteins for protein mapping. RNA-seq reads obtained by the previous report [3] were used for transcriptome mapping. Ab-initio gene prediction was performed and final gene models were generated by combining gene models from protein alignments, assembled transcripts, and ab-initio gene prediction.
To assign putative NLR groups, the pipeline of NLR classification established by previous studies [6,19,21] was used with some modifications. Known NLR genes from GenBank and Plant Resistance Genes database (PRGdb) v3.0 (Supplementary Table 1) [45], and NLR group information from Kim et al. [6], were used as references for group assignment. The NB-ARC domains of NLRs were searched and extracted using NLR-parser v1.0 (P-value cutoff = 1.9e-5) [46]. We defined an intact NB-ARC domain with at least three of four major motifs (P-loop, GLPL, Kinase2, and MHDV) placed in sequence order and a length of at least 160 amino acids. These intact NB-ARC domains were aligned using MAFFT v7.407 (-maxiterate 1000 -globalpair) [47] and positions with gaps above 92% in aligned sequences were removed using trimAl v1.4.rev22 [48]. A maximum-likelihood phylogenetic relationships were inferred from IQ-TREE v1.6.12 [41] with ultrafast bootstrap (UFBoot) [49] of 1000 (-bb 1000 -alrt 1000 -safe). The substitution model was selected with ModelFinder [42] implemented in IQ-TREE. The bestfit model was JTT + F + R7. The group of intact NLRs was assigned based on known NLR genes, UFBoot value > 90% and previously assigned group information [6]. For partial NLRs without an intact NB-ARC domain, a putative NLR group was assigned using the group of intact NLRs and BLASTp (-evalue 1e-4). The group with the highest number of matches above 50% similarity and 30% coverage versus the NB-ARC domain of intact NLRs was assigned to partial NLRs.