This work focuses on the comparative genomic analysis of the draft nuclear genome assembly of Cicer arietinum genotype ICC4958, recently published by our group [1]. At the top level, the current assembly is organized as Ca_LG_1 to Ca_LG_8, representing WGS contigs matched to the eight chickpea linkage groups, while Ca_LG_0 represents scaffolds that could not be matched to any of the eight pseudomolecules. The chickpea genomic resource can be accessed through the webpage http://nipgr.res.in/CGWR/home.php. Figure 1 provides a flowchart summary of the CGWR and its components. In the following sections, we describe the main menu items individually, followed by a brief account of the available tools and genomic tracks in the browser (represented as colored, collinear blocks with text labels and strand annotation) along with their salient features.
CGWR tools
The Tools menu of CGWR comprises a simple user-friendly GUI that enables rapid scanning and extraction of desired regions of the genome as well as pairwise alignments with user specified sequences, for identification of paralogs and orthologs. Various options are available to users from this pull down menu, including SSR Search, BLAST, CDS, Protein and Keyword Search, as shown in Figure 2.
SSR search
This page enables microsatellite analysis for the 64418 simple sequence repeats (SSRs) detected in the chickpea genome through in-silico identification. It contains a form where users can provide the motif of an SSR of interest, such as ACA. The tool finds all SSRs in the chickpea genome that match the input string and splits the data for ease of interpretation, resulting in a table with the frequency of occurrence at each number of iterations of the SSR. For example, ACA repeats occur at a total of 231 locations in the chickpea nuclear genome, of which five iterations of ACA (i.e. ACAACAACAACAACA) occur at 112 positions whereas ten iterations (ACA10) occur at only three positions, and so on. At the bottom of the results page, this tool also returns an assessment of all ‘related’ SSR sequences whose motifs are one or two nucleotides longer than the input SSR motif. For the ACA example, ‘related’ SSRs would include all instances of ACAX, XACA and XACAX, which are scanned and reported along with the repeat number and frequency (X here refers to any of the four nucleotides A/C/T/G).
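The kind of table described above can be illustrated with a minimal Python sketch. This is not the CGWR implementation, which is not described here; the function names, the `min_repeats` threshold and the assumption that the two X positions in XACAX vary independently are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def ssr_iteration_table(sequence, motif, min_repeats=2):
    """Count maximal tandem runs of `motif`, keyed by the number of iterations.

    `min_repeats` is an assumed threshold; the cut-off used by CGWR is not stated.
    """
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(motif), min_repeats))
    counts = Counter()
    for match in pattern.finditer(sequence.upper()):
        iterations = len(match.group(0)) // len(motif)
        counts[iterations] += 1
    return dict(counts)


def related_motifs(motif, alphabet="ACGT"):
    """Return motifs one or two nucleotides longer than `motif`.

    Covers the forms motif+X, X+motif and X+motif+X, where each X is
    assumed to range independently over the four nucleotides.
    """
    related = set()
    for x in alphabet:
        related.add(motif + x)
        related.add(x + motif)
        for y in alphabet:
            related.add(x + motif + y)
    return sorted(related)
```

Calling `ssr_iteration_table` on each pseudomolecule sequence and summing the resulting dictionaries would give a genome-wide table of the same shape as the one returned by the SSR search page.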
BLAST
The Basic Local Alignment Search Tool is a commonly used alignment program for detecting sequence similarity. Users can select the specific BLAST program and database based on the nature of their query which may be DNA, protein, translated RNA, or translated DNA. Database options for this tool include the complete set of CDS and proteins for ICC4958 as well as the nuclear genomic sequence. Input sequence can be entered as text, in FASTA format, or multiple sequences can be uploaded as files, if required. The tool returns alignments in HTML format and a summary of the output can be downloaded. In addition to text based summary, this page directly connects the tools menu to the chickpea browser within CGWR through a ‘browser’ link, as shown in Figure 2, to enable detailed investigation of the genomic region of interest. Upon every BLAST search, an additional ‘Alignment Track’ labeled “BLAST” gets added in the Browser for users to see the exact region of the genome aligned to the query, without any requirement for manual intervention.
CDS search
Apart from BLAST, the CGWR also enables a direct Coding DNA Sequence (CDS) search for a known gene model, the ID of which can be identified by a BLAST search, as described above. The coding region of a gene is the portion that is composed of exons and codes for protein. For an organism, the CDS set represents the total portion of the genome that is composed of protein-coding regions. All CDSs predicted computationally for chickpea [1] can be searched with this tool, where users can paste one or more IDs of interest and obtain the respective CDSs. Results are directly connected to the chickpea browser to enable detailed investigation of the genomic region of interest, as explained above.
Protein search
Similar to the CDS search, the computationally predicted complement of translated regions for the chickpea genome (as per ref [1]) can be scanned by ID number. This menu supports text-based search providing quick and precise access to any desired protein of chickpea. Results can be downloaded and directly visualized in the genome browser, with examples provided within the form.
Keyword search
In case users do not have any prior information such as the sequence of interest or CIDs, the keyword tool allows a search of all potential chickpea IDs that contain a given input string of text in their annotation. For example, the tool returns six potential matches to the term ‘reductoisomerase’, and the list of these six can be downloaded along with information of each matched ID, including gene description, locus, PFAM ID and GO slim term. Further, the CGWR algorithm automatically generates an interactive map for the searched query, so that users can directly visualize the spatial patterns of occurrence of the list of IDs obtained from their search.
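The behaviour of such a keyword lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the case-insensitive matching and the gene IDs in the usage note are hypothetical and do not reflect the actual CGWR backend or its identifiers.

```python
def keyword_search(annotations, keyword):
    """Return the sorted IDs whose annotation text contains `keyword`.

    `annotations` maps an ID to its free-text description. Matching is
    case-insensitive, which is an assumption about the intended behaviour.
    """
    needle = keyword.lower()
    return sorted(
        gene_id
        for gene_id, description in annotations.items()
        if needle in description.lower()
    )
```

For instance, with a hypothetical table `{"gene_A": "ketol-acid reductoisomerase", "gene_B": "chalcone synthase"}`, searching for ‘reductoisomerase’ would return only `gene_A`.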
CGWR maps
The Maps menu provides interactive chromosomal maps of gene families, i.e. locations of desired gene models on their respective pseudomolecules. This tool produces genomic, sequence-based maps and displays pseudomolecules with the coordinates being in base pairs. It can also be used to click on any desired gene or cluster on the map in order to evaluate and visualize clustering of the mapped gene models on the chickpea genome. Users can obtain interesting single view snapshots of the chickpea genome wherein spatial position(s) of requested gene(s) can be displayed simultaneously across the eight pseudomolecules. This menu offers two procedures, one for visualization of pre-existing maps for selected chickpea gene families, and the other for customized construction of maps for desired sets of gene models by the user.
Chickpea gene family maps
Of the 640 unique gene models identified to be associated with the metabolism of flavonoids in chickpea, those that could be mapped to the eight distinct pseudomolecules have been depicted in Figure 3A, and the highest number of flavonoid gene models were found clustered on pseudomolecule 3. Such a tendency to cluster was not observed for gene models predicted to be associated with carotenoid metabolism (data available on CGWR website under Maps menu). Each gene or cluster can be analyzed in detail by clicking on the respective bar on the map image. For example, the top three flavonoid genes on LG-3 fall into one cluster that can be clicked to see full details of each member of the cluster, including gene name, functional annotation, gene ontology, TF binding sites, and complete sequence. More information can be noted by clicking the link that connects each gene or cluster to the CGWR genome browser. Our analysis across the entire plant kingdom revealed 9990 legume specific gene models and 2751 chickpea specific gene models in the chickpea genome and panel B of Figure 3 shows the mapped subset of the chickpea specific genes. Further, the putative resistance related gene models (R-genes) as identified through screening of the chickpea unigene set were also mapped and these appear to reside throughout the chickpea genome, although clustering may occur within the specific conserved classes that R-genes were assigned during the analysis (Figure 3D). Almost one third of chickpea genome repeats were identified to be various kinds of transposable elements, a majority of which represented retrotransposons and about 5% constituted DNA transposons. Of the latter group, Figure 3C shows the mapped RC helitrons, i.e. transposons that are thought to replicate by a rolling circle mechanism, and it can be seen that they are interspersed all over the chickpea genome and clustered in a few regions. 
It is notable that several types of LINEs and other gene families also appear to be clustered on the chickpea genome and it may be interesting to find out whether the clustering occurs in other legume genomes as well. This possibility can be queried within the CGWR by using a combination of the browser and tools menu as described in the following sections.
Customized maps
Figure 4 shows a flowchart based outline of the Maps Tool in the CGWR. The ‘create your own map’ option allows users to paste the IDs of their desired set of gene models to visualize spatial location maps similar to the ones depicted in Figure 3. On the submission form, users can type the gene ID into the input box, and hit enter on the keyboard. An example set of gene IDs is provided within the form itself. Users can also determine the CIDs of genes of interest through the keyword search. As shown in Figure 4, the map tool returns a table listing out the loci, start and end positions of the specified gene IDs provided as input. At the top of this table is a link to view Map that leads to the mapped image. If the gene of interest lies on one of the unassembled scaffolds, rather than one of the eight pseudomolecules, the program assigns it to an independent unassembled unit or virtual LG termed as ‘UN’. An example of such a case is provided within the CGWR pre-generated maps. These custom generated maps are interactive, allowing users to click any region of interest on the map, to visualize details about the respective region as described in the previous section. In addition, users can directly find links to the chickpea browser from any of the input gene models mapped by this algorithm, as clicking on these links redirects users to the corresponding regions of the chickpea genome, as shown in Figure 4. These maps can also be downloaded as high-resolution images for publication purposes. Thus, the CGWR provides direct connection between its various features by connecting Tools, Maps and the Browser at its backend.
Gene clustering
As shown in Figure 4, whenever two or more mapped gene models are found to occur within a pre-computed distance cut-off with reference to each other, they are considered to be part of a physical genomic cluster. In such cases, the output of the map will provide an additional ‘Clustering’ link. With the help of this feature, users can directly visualize the number of clusters and composition of each cluster identified in the input set of gene models. The maps are interactive and each cluster on the map can be clicked manually for gaining insights into its members, while users can also view the entire genomic region containing such gene clusters on the chickpea genome browser, for further analyses as shown in Figure 4, such as the presence of common upstream regulatory elements, or to identify nearby gene models and their functions.
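The clustering rule described above can be sketched as a single pass over position-sorted gene models. The sketch below is an assumption-laden illustration rather than the CGWR algorithm: the distance is measured from the end of the current cluster to the start of the next gene, and `max_gap` stands in for the pre-computed cut-off, whose value is not given here.

```python
from collections import defaultdict


def cluster_genes(genes, max_gap):
    """Group gene models on the same linkage group that lie within `max_gap` bp.

    `genes` is an iterable of (gene_id, linkage_group, start, end) tuples.
    Returns a list of (linkage_group, [gene_ids]) with at least two members each.
    """
    by_lg = defaultdict(list)
    for gene_id, lg, start, end in genes:
        by_lg[lg].append((start, end, gene_id))

    clusters = []
    for lg in sorted(by_lg):
        current, current_end = [], None
        for start, end, gene_id in sorted(by_lg[lg]):
            if current and start - current_end <= max_gap:
                # Close enough to the running cluster: extend it.
                current.append(gene_id)
                current_end = max(current_end, end)
            else:
                # Gap too large: report the finished cluster, start a new one.
                if len(current) >= 2:
                    clusters.append((lg, current))
                current, current_end = [gene_id], end
        if len(current) >= 2:
            clusters.append((lg, current))
    return clusters
```

Singletons are dropped from the output, mirroring the description that a ‘Clustering’ link appears only when two or more mapped gene models fall within the cut-off.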
Chickpea genome browser
The genome browser is one of the primary capabilities of the CGWR. Currently, the May 2013 assembly is available; the next freeze of the assembly will be made accessible as soon as it is released, in the near future. Figure 5 shows the default browser display i.e. the first 10 kbp data on the first Chickpea pseudomolecule LG1, although users can select any of the eight LGs from the pull-down list in the Data Source, and positional information can also be typed into the landmark or position box on the top left corner, e.g., Ca_LG_1 for the whole of chromosome 1 and Ca_LG_2:1..10,000 for the region from position 1 to 10,000 on chromosome 2. The region expanded in the browser will be highlighted in pale blue in the Overview section as shown in Figure 5. For the unassembled scaffolds, users can select Ca_LG_0 from the pull down data source list in the Search section, and type the name and position of the desired scaffold. Zooming and scrolling controls help to narrow or broaden the displayed chromosomal range to focus on the exact region of interest. Default browser display can be altered as desired by using track controls offered at the bottom of the browser enabled through the ‘configure tracks’ button, where about fifty different tracks are available to choose from, as shown in Figure 5.
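For users who script against browser coordinates, the landmark notation shown above (for example Ca_LG_2:1..10,000) can be parsed with a short Python helper. This is a sketch of the notation as it appears in the examples in this section, not of the browser's own parser; the function name and the choice to return `None` for a whole-pseudomolecule landmark are assumptions.

```python
import re

# Landmark name, optionally followed by ":start..end" with optional thousands commas.
REGION_RE = re.compile(
    r"^(?P<ref>[^:\s]+)(?::(?P<start>[\d,]+)\.\.(?P<end>[\d,]+))?$"
)


def parse_region(text):
    """Parse 'Ca_LG_1' or 'Ca_LG_2:1..10,000' into (ref, start, end)."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognized region: %r" % text)
    ref = match.group("ref")
    if match.group("start") is None:
        # Whole landmark requested; no coordinates were given.
        return ref, None, None
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if start > end:
        raise ValueError("start exceeds end in region: %r" % text)
    return ref, start, end
```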
In order to avoid information overload on account of such a large number of tracks, GBrowse controls can be coordinated in such a manner that display for some browser tracks may be turned off, and others may be collapsed into a condensed single-line display. Tracks can thus be hidden or filtered according to user preferences using track-based toggles for on/off and hide/show modes, apart from download, share, density and favorite modes. There is also a configure mode on each track that allows users to edit the display characteristics with respect to that track. Hovering on the colored bar corresponding to each track display releases an information bubble describing the respective track, and its data source(s), wherever applicable. Clicking on individual colored bars or features within a track opens a details page containing a summary of the respective properties of the track, with additional feature-specific information such as alignments or links to external information depending on the nature of the track. In the following section, we provide a list of tracks and examples of typical cross-track analyses that the CGWR browser can be used for.
Gene structure prediction
Currently, the browser has seven independent tracks for genes and gene predictions that describe various aspects of gene structure, including tracks for selecting 5′ and 3′ UTRs, coding regions (CDS), exons and introns for genomic DNA. The mRNA sequence for the predicted protein sequence is also available, along with GC content and six-frame translations of the genomic DNA.
Functional annotations
For protein or RNA coding genes, functional annotations are provided in the ‘Region’ and ‘Details’ sections of the main browser window. The uppermost ‘Named Gene’ track within the Region section allows visualization of gene models outside the user-selected highlighted area that is expanded in all subsequent (lower) tracks. For visualization of gene annotation within the user-selected highlighted region, the ‘Annotation’ track can be used. These gene models are shown as yellow bars, and hovering the mouse over them opens a bubble with functional annotation and PFAM domain information, wherever available. Clicking on a gene returns a page with detailed locus information, gene description, protein family classification and gene ontology information, as well as the nucleotide sequence of the respective gene in FASTA format.
Molecular markers
The CGWR has a total of 12 individual tracks for assessment of molecular markers at the genomic level in chickpea. These include simple sequence repeats of two kinds, namely, in-silico SSRs and sequencing based SSRs, PIP markers, as well as tandem base substitutions and indels with reference to three other chickpea varieties. A total of 1,644,016 markers are depicted in these tracks. All SSRs identified on the genome can be visualized through an SSR track that enables further data analysis of various kinds. Hovering over an SSR shows the type of that repeat and its ordinal number among the SSRs of that type in the genome. For example, a given SSR may be the fiftieth tetrameric SSR or the thousandth dimeric SSR. Clicking on the SSR returns a page detailing locus information, type, length, number and iteration of the SSR, along with the exact SSR motif. This track also allows retrieval of the DNA flanking the feature, including 100 bases upstream and downstream, to support primer design. In addition, the CGWR browser enables further interactive SSR analysis wherein users can find the number and type of any desired SSR. This page contains a form where the length and motif of the SSR of interest can be typed in, and it returns a table indicating whether SSRs of the respective kind are present and, if so, how many occur on the chromosome concerned. The SSR search option in the pull-down ‘Tools’ menu on the home bar at the top of the browser further enables a scan of all kinds of ‘related’ SSRs that differ in length by one or two nucleotides from the input SSR, as described earlier. Nucleotide diversity has been measured at the genomic scale by comparing ICC4958 with three other cultivated and wild chickpea genotypes, namely, desi-type JG62/ICC4951, kabuli-type ICCV2/IC12968, and wild-type P1489777.
Variations have been analyzed between these four varieties, revealing 32,919 InDels, 1,504,646 SNPs and 41,824 tandem base substitutions, all of which can be browsed in the CGWR through nine individual tracks representing each of these three categories compared pairwise between ICC4958 and one of the above-mentioned genotypes. Each track, when clicked, provides details of gene structural variation via alignments between ICC4958 and the respective variety being compared, along with additional flanking alignments from upstream and downstream regions, in order to assist in marker based studies and for acquiring the DNA for additional features and reverse complementation. The potential intron length polymorphism marker track (PIP markers) shows the markers that have been predicted using the PIP database.
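Retrieving flanking DNA for primer design, and reverse complementing it, can be reproduced offline on a downloaded sequence. The Python sketch below is illustrative only: it assumes 1-based inclusive feature coordinates, takes 100 bases on each side as the default, and truncates a flank at the sequence boundary; none of these details are confirmed for the CGWR itself.

```python
# Translation table for complementing unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def flanking_regions(sequence, start, end, flank=100):
    """Return (upstream, feature, downstream) for a feature on `sequence`.

    `start` and `end` are assumed to be 1-based and inclusive. Flanks are
    shortened when the feature lies close to either end of the sequence.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("feature coordinates fall outside the sequence")
    upstream = sequence[max(0, start - 1 - flank):start - 1]
    feature = sequence[start - 1:end]
    downstream = sequence[end:end + flank]
    return upstream, feature, downstream
```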
Comparative genomics
As highlighted earlier, the most important and voluminous data in CGWR represents the comparative assessment of the chickpea genome with other leguminous as well as non-leguminous plant species. In all, the CGWR comprises 24 individual tracks for comparative genomics, of which nine tracks representing nucleotide variation between four chickpea genotypes have been described above under the molecular markers section. Additionally, there are nine more conservation tracks for depiction of orthologous gene models between ICC4958 and six other legumes including the kabuli genotype chickpea ‘CDC Frontier’, Glycine max, Medicago truncatula, Phaseolus vulgaris, Cajanus cajan, and Lotus japonicus, apart from Arabidopsis thaliana and Vitis vinifera (both non-legumes). These ortholog tracks show measures of evolutionary conservation and highlight regions of the genome that may be functionally important between the pair being considered. Clicking on the track leads to details of locus position, gene IDs and FASTA format sequences for both the orthologous gene and the chickpea gene under consideration. At the bottom of this page is a link to the alignment between the orthologs. The BLAST [12] search engine described earlier in the Tools menu is also available to meet specific needs of comparison and alignment. Apart from these 18 tracks, CGWR also has six tracks for synteny evaluation of genotype ICC4958 with each of the six legume genomes listed above. We recommend a large window size to enable visualization of the direction of synteny for a given region, as well as to explore multiple syntenic matches between two genomes in a given region. Clicking on the synteny regions pops up a bubble that provides the start and end site for each matched locus. Color codes have been maintained for each plant species across the 24 comparative genomic tracks. Blue, for example, represents Phaseolus vulgaris while red represents Cajanus cajan, and so on.
Transcriptome
The browser contains tracks for detailed transcriptome analysis, with over 27000 ESTs and 274 million filtered reads representing transcripts of chickpea from independent tissue/organ based samples. The track for EST returns locus and strand information along with full nucleotide sequence of the EST. For each transcript, the track provides locus information, transcript description, data from gene ontologies of molecular function, biological process and cellular location, apart from enabling users to view expression across each of the six tissue samples studied.
Regulatory regions
Specific binding of transcription factors (TFs) to short and degenerate oligonucleotides on the genome is key to transcriptional regulation and gene expression. The CGWR browser contains tracks for predicted TF binding sites (TFBS) according to both PSSM based computational scores (JASPAR track) [15], as well as literature based data correspondence (PLACE tracks) [14]. For each track, the browser provides information regarding strand, locus, TFBS sequence motif, family-based classification as well as the plant species with evidence of a similar binding site. The family based classification allows one to decipher what transcription factor might bind to the region of interest and CGWR further provides details of the identifier and each family. For example, the site ‘CAACTC’ is known to bind to transcription factor CAREOSREP1, and is from the family of CAREs (CAACTC regulatory elements) found in the promoter region of a cysteine proteinase (REP-1) gene in rice. Regulatory region analyses can usually result in multiple TFBS predictions for the same site and therefore incorporation of two independent tracks for this purpose in CGWR provides the additional advantage of cross-referencing and evidence from multiple sources. We recommend a low window size in the range of 5 to 10 kb in order to visualize multiple predictions in an individual manner.
Transposable elements
Over 40% of the assembled draft genome represents interspersed repeats including various classes of transposable elements, and these can be displayed individually. Tracks are available for DNA transposons, retrotransposons and other repetitive elements. The track for DNA transposons can be used to visualize more than 80 different kinds of DNA transposons as well as RC Helitrons. The track for retrotransposons enables visualization and analysis of various families of LINEs, SINEs and LTRs. Elements that could not be classified into either of these two tracks have been assigned to a third track within transposable elements, namely, the other repetitive elements track, which consists of unknowns, simple repeats, satellites etc. Each transposable element has been assigned a unique ID based on its genomic position. Transposable element IDs that occur in multiple copies with identical scores have been assigned sub-IDs such as N.1, N.2, N.3 and so on, depending upon the number of occurrences. Clicking on an element returns a page detailing locus information, type and family-based classification of that specific transposon or retrotransposon, along with its entire sequence.
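The sub-ID scheme (N.1, N.2, N.3) can be illustrated with a small Python sketch. It reflects one plausible reading of the description above, namely that every copy of a repeated ID receives a numbered suffix in order of appearance while unique IDs stay unchanged; the function name and input format are hypothetical.

```python
from collections import Counter


def assign_sub_ids(base_ids):
    """Append .1, .2, ... to IDs that occur more than once, in input order.

    IDs that occur only once are returned unchanged. This is an assumed
    interpretation of the sub-ID scheme, not the documented CGWR procedure.
    """
    totals = Counter(base_ids)
    seen = Counter()
    labelled = []
    for base in base_ids:
        if totals[base] == 1:
            labelled.append(str(base))
        else:
            seen[base] += 1
            labelled.append("%s.%d" % (base, seen[base]))
    return labelled
```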
Non-coding RNA predictions
Gene models for non-coding RNAs have been predicted for the chickpea genome, resulting in identification of about 121 distinct Rfam families including miRNA, snoRNA, rRNA, tRNA etc. These have been mapped to the genome and can be visualized via five individual tracks. Clicking these tracks returns a page containing features of the respective RNA locus, such as anticodon and amino acid specification (for tRNA), unique family classification (for miRNAs and snoRNAs), strand information, complete sequence, and 2-dimensional structure notation.
Nucleosome positioning
Predictions for nucleosome states and linker DNA states for chickpea have been made using Arabidopsis as the index species [16],[17], and results have been normalized and mapped to the chickpea genome as described in methods. This track provides a plot with information about nucleosome and linker DNA states. It superimposes occupancy and binding affinity scores, Viterbi predictions for optimal nucleosome positioning and the posterior probability of a genomic position being the start of a nucleosome. For convenience of interpretation, regions of the genome that have a higher tendency toward nucleosome states are depicted on the positive Y-axis, while regions with a higher tendency toward linker DNA states are shown on the negative Y-axis. A preliminary computational correspondence between nucleosome occupancy likelihood and gene structure reveals that coding regions of the genome are significantly more enriched for nucleosome states than the regulatory regions, while the promoters are significantly depleted of nucleosomes. We also find that introns have a much higher density of nucleosome states than any other genomic region (data not shown). These and other interesting aspects of the chickpea nucleosome positioning predictions are currently being investigated further in our laboratory.
Component integration and other services
In order to facilitate seamless exchange of data between its various components and capabilities, the CGWR backend enables dynamic inter-connections and frequent coupling of results between its three main sections, the Tools, Maps and the genome Browser. For example, users can carry out a BLAST run for identification of orthologs of their sequence of interest, and use links on the output page for direct access to the genomic regions containing the orthologs of interest. The list of potential orthologs can also be obtained by typing in a keyword of interest. The IDs thus identified can be mapped to the assembled genome for a high-resolution interactive image via the Maps menu, where again, direct connection to the browser is provided. For instance, the gene IDs of the chickpea orthologs obtained in the BLAST search (under Tools menu) can be pasted into the Maps menu to visualize where these domains lie on the chickpea genome, and whether they show any tendency towards spatial clustering. Whenever a gene of interest is found to lie within close proximity of other gene models of the same family, it is assigned to a cluster that can be visualized on the interactive map as well as the CGWR browser for detailed analysis of other aspects of the clustered region. Apart from these backend provisions, the Links menu of CGWR provides access to a wide variety of datasets and links to important external information regarding chickpea and legume-based genomics research. This menu also enables downloads of various datasets used in this work.
Example of a typical analysis
A user may have a gene of interest for which they want to find legume and non-legume homologs. It is possible to start by using BLAST of the sequence of interest against the chickpea genome in the Tools menu, and find the link to the browser on the BLAST output page. This will lead to the chickpea homologous gene in the browser display, and from here, similarity with six legumes and two non–legume plants can be found by using one of the 15 pre-computed comparative genomics tracks. Alternatively, nine pre-computed tracks enable detailed nucleotide level assessment of structural variations between chickpea and its wild- and desi-type cultivars. Clicking on these tracks will enable viewing comparative alignments as well as information about the co-ordinates of the alignment on both genomes. In case multiple homologs of the gene of interest are found within chickpea, these can all be mapped using the Maps menu for interactive and convenient detection of clusters within the respective gene family. The clusters, if any, can be further analyzed for other features of interest in and around the region via integrative links between maps and the genome browser. For example, users can toggle the ortholog tracks for the cluster to find out whether the concerned set of gene models is clustered in any of nine other plant genomes as well. Shared transcription factor binding sites, if any, in the upstream regulatory regions of clustered gene models can also be visualized through designated tracks in the browser. With the EST, mRNA and transcript tracks visible, it is possible to see the extent, if any, of tissue specific or organ specific expression for these gene models in chickpea. Additionally, turning on some of the TFBS prediction tracks would suggest whether there is evidence for any specific TF binding to the promoters of the identified homologs. 
Patterns of nucleosome arrays can also be visualized for assessment of DNA accessibility in these regions of the genome, and compared for overlap with other features such as coding and non-coding areas.