TCM-Blast for traditional Chinese medicine genome alignment with integrated resources

The traditional Chinese medicine (TCM) genome project aims to reveal the genetic information and regulatory networks of herbal medicines, and to clarify their molecular mechanisms in the prevention and treatment of human diseases. Moreover, TCM genomes could provide the basis for discovering the functional genes behind the active ingredients in TCM, and for the breeding and improvement of TCM plants. The Traditional Chinese Medicine Basic Local Alignment Search Tool (TCM-Blast) is a web interface for TCM protein and DNA sequence similarity searches. It contains approximately 40 GB of genome data on TCMs, including protein and DNA sequences for 36 TCMs with high medical value. TCM-Blast is a publicly accessible TCM genome alignment database hosted on the TCM-Blast website (http://viroblast.pungentdb.org.cn/TCM-Blast/viroblast.php). It queries multiple sequence databases to obtain TCM genome data and provides user-friendly output for easy analysis and browsing of BLAST results. The genome sequencing of TCMs helps to elucidate the biosynthetic pathways of important secondary metabolites and provides an essential resource for gene discovery studies and molecular breeding. The TCM genomes provide a valuable resource for the investigation of novel bioactive compounds and drugs from these TCMs under the guidance of TCM clinical practice. Our database could be expanded to other TCMs once their genome data have been determined. Supplementary Information The online version contains supplementary material available at 10.1186/s12870-021-03096-1.


Background
Whole-genome sequencing of the plants that form the basis of traditional Chinese medicine (TCM) is an important means for gene discovery and cultivation, synthetic biology, drug discovery and molecular breeding involving TCMs [1][2][3][4]. The genomic sequence provides a valuable resource not only for fundamental and applied research, but also for evolutionary and comparative genomics analyses, particularly in TCMs [5][6][7][8][9].
Information on the TCM genome datasets is summarized online at the TCM-Blast website. The TCM genome data used in TCM-Blast were collected from the Herbal Medicine Omics Database (http://herbalplant.ynau.edu.cn/html/Genomes/), the Medicinal Plant Genomics Resource (http://medicinalplantgenomics.msu.edu), and the BIG Data Center in Beijing Institute of Genomics (http://bigd.big.ac.cn/gsa/statistics) (for further details on the genome data sources for the thirty-six TCMs, see Table 1). These data resources have been published in professional journals and in plant gene databases maintained by academic institutions or government departments, and they offer abundant data sources and reliable data quality. Compared with other data resources, our database has the following advantages: 1) it is currently the largest Chinese medicine genome database; 2) it includes the plant genetic data of Chinese medicine sources; and 3) it supports the breeding and cultivation of TCMs and the discovery of active ingredients in TCMs.

Overview of TCM-Blast
We have developed TCM-Blast, a web-based database for TCM genome alignment (Fig. 1). TCM-Blast offers an interface for choosing among TCM genome databases, including TCM protein and DNA sequence datasets, and provides query functions through a BLAST implementation [40]. TCM-Blast currently contains approximately 40 GB of TCM genome data, including the protein and DNA sequences of 36 TCMs.

The main functions of TCM-Blast
The user can enter the query sequence either by pasting it into the query box or by uploading it as a FASTA file from a local disk. TCM-Blast provides multiple TCM sequence databases. Users can then select specific TCM genome databases and run one of five programs (blastn, blastp, blastx, tblastn, tblastx). These correspond to the five general BLAST form types [27][41][42][43], applied here to TCM genome data:
blastn: search TCM nucleotide databases using a nucleotide query.
blastp: search TCM protein databases using a protein query.
blastx: search TCM protein databases using a translated nucleotide query.
tblastn: search TCM translated nucleotide databases using a protein query.
tblastx: search TCM translated nucleotide databases using a translated nucleotide query.
TCM-Blast also provides an optional advanced search function (Fig. 2) that lets users set parameters such as the expected threshold, word size, and max target sequences in order to obtain more specific results. The TCM-Blast sequence alignment results are displayed in a summary table, which contains the query sequence name, subject sequence name, subject source database, position score, identity percentage, and E value (Fig. 3).
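The choice among the five programs follows from the type of the query and the type of the database. The short Python sketch below illustrates that mapping and shows how the advanced parameters named above (expected threshold, word size, max target sequences) might be passed to a locally installed NCBI BLAST+ command-line program. It is not the TCM-Blast implementation, which the text does not describe; the function names, the file name `query.fasta`, and the database name `salvia_protein` are hypothetical placeholders, and the sketch only builds the command without running it.

```python
# Illustrative sketch only; not the TCM-Blast server code.
# Database and file names below are hypothetical placeholders.

# (query type, database type) -> BLAST program, for the non-tblastx cases.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, db_type, translated=False):
    """Pick one of the five BLAST programs from the query and database types.

    translated=True requests tblastx, which translates both the nucleotide
    query and the nucleotide database.
    """
    if translated:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type}/{db_type}") from None


def build_command(program, query_fasta, db, evalue, max_target_seqs, word_size=None):
    """Build (but do not run) a BLAST+ argument list with tabular output."""
    cmd = [
        program,
        "-query", query_fasta,
        "-db", db,
        "-evalue", str(evalue),                    # expected threshold
        "-max_target_seqs", str(max_target_seqs),  # max target sequences
        "-outfmt", "6",                            # tab-separated hit table
    ]
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]
    return cmd


if __name__ == "__main__":
    program = choose_program("protein", "protein")
    print(build_command(program, "query.fasta", "salvia_protein",
                        evalue=1e-5, max_target_seqs=10, word_size=3))
```

Keeping the command as a list of arguments, rather than a single shell string, avoids quoting problems if a file or database name contains spaces.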

A case study of this database
For example, the user can select the Salvia miltiorrhiza protein database together with the blastp program and obtain BLAST results by entering a protein sequence. In Fig. 4, the user has entered the protein sequence fragment "MEKKQEDEKKTKLQGLPVDT SPY TQYKDLD -DYKKQAYGTEGHLQPNPGRG AAA STDAPTTTAAD-DPNKQLSSTDAINRQGVP" in the "Enter query sequences" box, selected the Salvia miltiorrhiza protein database, and obtained the BLAST result by clicking the "Basic Search" button. The top-scoring subject in this search was "evm.model.C153610.1", indicating that the input sequence fragment has high similarity to this Salvia miltiorrhiza protein. For more detailed use cases for this database, please refer to the Supplementary file.
In the future, we will collect more Chinese medicine genome data to provide data support for Chinese medicine research.
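The case study reads the top-scoring subject off the summary table. For readers who run BLAST+ locally against downloaded data, the same step can be sketched in Python as below, assuming the hits are in the standard BLAST+ tabular format (`-outfmt 6`) with its default twelve columns. The two rows in `EXAMPLE` are made-up placeholders for illustration, not results from TCM-Blast, and the function names are hypothetical.

```python
# Illustrative sketch: pick the top-scoring hit from BLAST+ tabular output.
# The example rows are invented placeholders, not real TCM-Blast results.

# Default column order of BLAST+ "-outfmt 6".
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_tabular(text):
    """Parse tab-separated BLAST hits into a list of dictionaries."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        # Convert the fields used for ranking and filtering to numbers.
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def top_hit(hits):
    """Return the hit with the highest bit score, or None if there are none."""
    return max(hits, key=lambda h: h["bitscore"]) if hits else None


# Made-up rows for illustration only.
EXAMPLE = (
    "query1\tsubject_A\t98.5\t80\t1\t0\t1\t80\t5\t84\t1e-40\t150.0\n"
    "query1\tsubject_B\t60.0\t75\t30\t0\t3\t77\t10\t84\t2e-10\t55.5\n"
)

if __name__ == "__main__":
    best = top_hit(parse_tabular(EXAMPLE))
    print(best["sseqid"], best["pident"], best["evalue"])
```

Ranking by bit score is one reasonable choice; ranking by E value would usually give the same order within a single database search.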

Conclusions
Here, we report TCM-Blast, a database that integrates several database resources and markedly improves the efficiency of TCM genomic research. This database allows users to perform batch sequence searches against integrated TCM genomic sequence databases. TCM-Blast therefore provides comprehensive Chinese medicine genome resource data for TCM scientific research and eliminates the latent redundancy that occurs in other platforms.
Additional file 1: Figure S1. Setting of protein sequence alignment options with the Glycyrrhiza uralensis protein database through the program 'blastp'. Figure S2. BLAST result of protein sequence alignment with the Glycyrrhiza uralensis protein database by inputting the query protein sequence. Figure S3. Setting of protein sequence alignment options with the Glycyrrhiza uralensis Nucleotide Database through the program 'tblastn'. Figure S4. BLAST result of protein sequence alignment with Glycyrrhiza Uralensis protein database by the program of 'tblastn'. Figure S5. Setting of nucleotide sequence alignment options with the Glycyrrhiza uralensis Nucleotide Database through the program 'blastn'.

Availability of data and materials
TCM-Blast is a free database and visualization tool open to all users with no login requirements and can be accessed at the following URL: http://viroblast.pungentdb.org.cn/TCM-Blast/viroblast.php. The web tool is functional in all modern web browsing environments, including Google Chrome, Mozilla Firefox and Safari. All related species genome data can be downloaded from http://viroblast.pungentdb.org.cn/TCM-Blast/db.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Fig. 4 The BLAST result of Salvia miltiorrhiza protein alignment with the input of a Salvia miltiorrhiza protein sequence fragment into TCM-Blast. In the first section (a), the user checks their protein sequence. In the second section (b), the BLAST results for the input protein sequence are briefly displayed in the table. Furthermore, detailed score information on this alignment can be checked by clicking each score item button.