A BAC end view of the Musa acuminata genome
© Cheung and Town; licensee BioMed Central Ltd. 2007
Received: 28 December 2006
Accepted: 11 June 2007
Published: 11 June 2007
Musa species include the fourth most important crop in developing countries. Here, we report the analysis of 6,252 BAC end-sequences, undertaken to survey the sequence composition of the Musa acuminata genome in a cost-effective and efficient manner.
BAC end sequencing generated 6,252 reads representing 4,420,944 bp, including 2,979 clone pairs, with an average read length after cleaning and filtering of 707 bp. All sequences have been submitted to GenBank under the accession numbers DX451975 – DX458350. The BAC end-sequences were searched against several databases, and significant homology was found to mitochondria and chloroplast (2.6%), transposons and repetitive sequences (36%) and proteins (11%). Functional interpretation of the protein matches was carried out by Gene Ontology assignments from matches to Arabidopsis and was shown to cover a broad range of categories. From protein-matching regions of Musa BAC end-sequences, it was determined that the GC content of coding regions was 47%. Where protein matches encompassed a start codon, GC content as a function of position (5' to 3') across 129 bp sliding windows generated a "rice-like" gradient. A total of 352 potential SSR markers were discovered. The most abundant simple sequence repeats in four size categories were AT-rich. After filtering mitochondria and chloroplast matches, thousands of BAC end-sequences had a significant BLASTN match to the Oryza sativa and Arabidopsis genome sequences. Of these, a small number of BAC end-sequence pairs were shown to map to neighboring regions of the Oryza sativa genome, representing regions of potential microsynteny.
Database searches with the BAC end-sequences and ab initio analysis identified those reads likely to contain transposons, repeat sequences, proteins and simple sequence repeats. Approximately 600 BAC end-sequences contained protein sequences that were not found in the existing available Musa expressed sequence tag, repeat or transposon databases. In addition, gene statistics, GC content and GC profile could be estimated based on the region matching the top protein hit. A small number of BAC end-sequence pairs could be mapped to neighboring regions of the Oryza sativa genome, representing regions of potential microsynteny. These results suggest that a large-scale BAC end sequencing strategy has the potential to anchor a small proportion of the genome of Musa acuminata to the genomes of Oryza sativa and possibly Arabidopsis.
Until novel technologies that enable extremely low-cost genomic DNA sequencing are developed, funding bodies will remain very selective when choosing new plant genomes to sequence. Current technologies can produce the sequence of a mammalian-sized genome at the desired data quality only for $10 to $50 million or more. The initial goal of many genome projects is therefore to gain a glimpse of the genome of interest at low cost and in an effective manner. In plants there is often some advantage in leveraging the finished genomes of Arabidopsis thaliana and Oryza sativa through comparative genomics. A. thaliana was chosen as a model for the dicotyledons due to its small genome size (125 Mb), and rice (O. sativa) was the first cereal and monocot to be sequenced.
Musa species (bananas and plantains) comprise very important crops in sub-Saharan Africa, South and Central America and much of Asia. The Musa species Musa acuminata (AA genome) and Musa balbisiana (BB genome), both with 2n = 22 chromosomes, represent the two main progenitors of cultivated banana varieties. The haploid genome of Musa species has been estimated to vary between 560 and 800 Mb in size [4–6], over four times larger than that of the model plant A. thaliana (125 Mb) and over 30% larger than that of O. sativa (390 Mb).
Comparative genomics in the monocots has focused on the extent of synteny between closely related species belonging to the family Poaceae. Extensive micro- and macrosynteny has been shown between O. sativa, barley, maize and wheat [9, 10], and the degree of conservation often varies between chromosomal locations. Synteny between distantly related plants is more challenging to elucidate bioinformatically and probably occurs less frequently.
In order to understand the sequence content and sequence complexity of the Musa genome, it is necessary to sequence a large number of randomly selected clones that are representative of the entire genome. An alternative approach is to end-sequence a large number of Bacterial Artificial Chromosomes (BACs) randomly selected from a BAC library. This latter approach does not provide a truly random sampling of the genome, since regions in which the restriction site for the enzyme used for library construction is under-represented will themselves be under-represented. Nevertheless, BAC end sequencing does provide a quasi-random sampling of the genome and carries the advantage that BAC clones that appear to contain targets of interest provide excellent material for other analyses, such as fluorescent in situ hybridization (FISH) to metaphase or pachytene chromosomes or in-depth sequencing for gene discovery. A large collection of BAC end-sequences (BES) is also an essential component of a genome sequencing project. Here, we examined whether Musa BES can lead to insights into the Musa genome composition using bioinformatic comparisons to protein, repeat, expressed sequence tag (EST) and other databases. From the BES, we investigated Musa gene density, GC content, protein and SSR content, and putative comparative-tile BACs that represent potential regions of microsynteny between O. sativa and Musa.
Results and discussion
Sequence searches, simple sequence repeats, GC profiling and protein discovery are discussed first, followed by an analysis of genome mapping to O. sativa and A. thaliana to identify comparative-tile BACs from the Musa library that are likely to be collinear (i.e. to show microsynteny).
BAC end sequencing
Table: Sequence statistics of the Musa BES. Rows: total number of sequences; total base count (bp); minimum length (bp); average length (bp); maximum length (bp).
Database sequence searches
Table: Sequence similarity search results. Column: number of hits (%). Rows: mitochondria + chloroplast; transposons + repeats; TIGR protein database; total number of BAC ends.
Table: Summary of transposon content. Column: number of BES.
GC content varies between organisms and between the genomic, intron and exon regions within an organism, ranging from as low as 22% (Plasmodium falciparum) to more than 70% (Zea mays). Here, GC content was determined on the region of each BES that matched its top protein hit. The mean GC content of all BES was 39% and the coding sequence GC content was 47%, consistent with a previous study based on two BACs, which reported an overall GC content of 38% and an exon GC content of 49%. This and the previous section show that BES with protein matches allow GC content and GC profiles to be calculated with some degree of accuracy. Further confirmation using a larger dataset was carried out using ESTs: 2,280 Musa ESTs were downloaded from GenBank, clustered and assembled to give 1,123 unique sequences, of which 179 were contigs. The unique sequences generated 1,056 potential open reading frames with an average GC content of 51%. These results are consistent with previous studies on GC content within monocots and dicots.
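The GC calculations described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names `gc_content` and `gc_profile` are hypothetical, ambiguous bases are simply ignored, and the `step` parameter is an assumption, since the text specifies 129 bp windows but not how far consecutive windows were shifted.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases are reported as 0.0 rather than raising.
    return 100.0 * gc / total if total else 0.0


def gc_profile(seq: str, window: int = 129, step: int = 129) -> list[float]:
    """GC percentage of successive windows, read 5' to 3'.

    window=129 follows the text; step is an assumption (non-overlapping
    windows by default). Only full-length windows are reported.
    """
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Applied to the region of a BES downstream of a matched start codon, `gc_profile` would return one value per window; a gradient of the kind described for rice would appear as values that decline from the first window onwards.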
Simple sequence repeats
Distribution of SSRs
Musa BAC end tiling on the O. sativa and A. thaliana genomes
Table: Musa BAC end tiling on the O. sativa genome. Column: O. sativa chromosomal location.
Musa BACs whose two end-sequences had top BLAST hits to the same chromosome and showed no homology to mitochondrial or chloroplast sequences were deemed candidate comparative-tile BACs, which potentially represent regions of highly conserved gene content and organization. The predicted size of the Musa BACs (and thus the distance between the end-sequences) was compared to the span by which the paired matches are separated in the O. sativa and A. thaliana genomes, respectively. Separations between the Musa BES matches that exceeded our arbitrary cut-off of 500 kb may represent expansions of the syntenic regions or may be due to rearrangements during the evolution of the two genomes.
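The selection criteria above can be sketched as a small Python classifier. This is an illustration under stated assumptions rather than the code used in the study: `EndHit` and `classify_pair` are hypothetical names, each BES is assumed to have already been reduced to its single top hit, and only the 500 kb upper cut-off from the text is applied.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SPAN_BP = 500_000  # the arbitrary 500 kb cut-off given in the text


@dataclass
class EndHit:
    """Top reference-genome hit of one BAC end (hypothetical record layout)."""
    chromosome: str           # reference chromosome of the top hit
    position: int             # coordinate of the hit on that chromosome (bp)
    organellar: bool = False  # True if the BES matched mitochondria or chloroplast


def classify_pair(fwd: Optional[EndHit], rev: Optional[EndHit]) -> str:
    """Classify one BAC from the top hits of its forward and reverse ends."""
    if fwd is None or rev is None:
        return "unpaired"                # at least one end had no significant hit
    if fwd.organellar or rev.organellar:
        return "organellar"              # excluded from tiling
    if fwd.chromosome != rev.chromosome:
        return "different_chromosomes"
    span = abs(fwd.position - rev.position)
    # Pairs beyond the cut-off may reflect expansion or rearrangement.
    return "candidate" if span <= MAX_SPAN_BP else "expanded_or_rearranged"
```

For example, `classify_pair(EndHit("Chr3", 1_000_000), EndHit("Chr3", 1_120_000))` returns `"candidate"`, because both ends hit the same chromosome 120 kb apart.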
In this study, two major ideas were examined: firstly, that Musa BES can lead to insights into the Musa genome, with specific reference to gene density, GC content, and protein and SSR discovery; and secondly, that the sequences can be used to identify regions of potential microsynteny between Musa and other species. The BAC end-sequences were shown to contain homology to proteins, expressed sequence tags, transposons and repeat sequences, and to be useful for simple sequence repeat identification and for estimation of gene statistics and GC content. Proteins encoded in these BES were shown to cover a broad range of GO categories. Although there is only limited microsynteny between Musa and O. sativa, the results suggest that a large-scale BAC end sequencing strategy has the potential to anchor at least a small portion of the Musa genome onto the O. sativa genome sequence. Large-scale BAC end sequencing would show whether there are more regions of microsynteny between the reference genome and the genome of interest, and whether unique gene features and genome characteristics support whole-genome sequencing. BAC end data would be one useful indicator, along with existing EST or genomic sequences, for funding bodies to use when selecting new plant genomes to sequence and when assessing the potential of leveraging the finished genomes of A. thaliana and O. sativa through comparative genomics. We expect that a similar analysis of other plant or animal species would provide insights into their genomes in a very cost-effective and efficient manner through database searches and synteny to model species.
BAC end sequencing
The BES were generated from a Musa bacterial artificial chromosome (BAC) library constructed from leaves of the wild diploid 'Calcutta 4' clone (Musa acuminata subsp. burmannicoides, 2n = 2x = 22) with an average insert size of 100 kb.
DNA template was prepared in 384-well format by a standard alkaline lysis method. End sequencing was performed using Applied Biosystems (ABI) Big Dye terminator chemistry and analyzed on ABI 3730xl machines. Base calling was performed using TraceTuner, and sequences were trimmed for vector and low-quality sequence using Lucy.
BAC end database searches
Sequences were compared to all entries in the TIGR Plant Gene Indices using blat and to the TIGR non-identical amino acid database, which contains non-identical protein data from a number of databases including GenBank, RefSeq and UniProt, using blastx (cut-off value 1e-5). The BAC end-sequences were also compared with repetitive sequences in the TIGR Repeat Database and an in-house transposon database using blastx with a cut-off value of 1e-5. The BAC end-sequences were compared with the TIGR rice genome sequence assembly and the A. thaliana genome sequence from TAIR using blastn with a cut-off value of 1e-10. To identify comparative-tile BACs from the Musa library that were likely to be collinear (i.e. to show microsynteny) with the reference genomes, the searches against the Musa genomic sequence were parsed for the top pair of BES for which both ends had the highest significant match to a stretch of O. sativa or A. thaliana sequence and where the two regions on the Musa genome were between 100 kb and 500 kb apart. The BAC end data sets for O. sativa, A. thaliana, maize and M. truncatula used for GC profiling were originally downloaded from GenBank, and the vector-trimmed and cleaned sequences were then downloaded from estinformatics.org.
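The parsing step, keeping the most significant match per BES under an E-value cut-off, can be sketched as follows. The sketch assumes the search results are available in the standard 12-column tab-delimited BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the output format actually used in this study is not stated, and the function name `top_hits` is hypothetical.

```python
def top_hits(tabular_text: str, max_evalue: float) -> dict[str, dict]:
    """Return the best-scoring hit per query among hits with E-value <= max_evalue.

    Assumes the standard 12-column BLAST tabular layout (an assumption;
    the study does not state its output format).
    """
    best: dict[str, dict] = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        sstart, send = int(cols[8]), int(cols[9])
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query]["bitscore"]:
            best[query] = {
                "subject": subject,
                # Minus-strand hits list sstart > send, so normalize the order.
                "start": min(sstart, send),
                "end": max(sstart, send),
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best
```

With `max_evalue=1e-10` this mirrors the blastn threshold quoted above for the genome comparisons, and `max_evalue=1e-5` mirrors the blastx threshold; the per-query dictionaries could then be paired by clone name to look for ends that land on the same chromosome.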
EST clustering and assembly
Identification and analyses of simple sequence repeats
Perfect dinucleotide to hexanucleotide simple sequence repeats were identified using the MISA Perl scripts, specifying a minimum of six dinucleotide and five tetranucleotide to hexanucleotide repeats and a maximum interruption of 100 nucleotides for compound repeats. The minimum length for mononucleotide repeats was 20 bases.
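To make the repeat thresholds concrete, the following Python sketch scans a sequence for perfect SSRs. It is a simplified stand-in for MISA, not a reimplementation: the names `MIN_REPEATS`, `is_redundant` and `find_ssrs` are hypothetical, compound repeats (the 100-nucleotide interruption rule) are not handled, and no trinucleotide threshold is set because none is given in the text.

```python
# Minimum repeat counts per motif length, taken from the text.
# No trinucleotide threshold is stated there, so none is set here;
# add an entry such as {3: n} to include trinucleotide repeats.
MIN_REPEATS = {1: 20, 2: 6, 4: 5, 5: 5, 6: 5}


def is_redundant(motif: str) -> bool:
    """True if the motif is itself a tandem repeat of a shorter unit (e.g. ATAT)."""
    return any(
        len(motif) % k == 0 and motif == motif[:k] * (len(motif) // k)
        for k in range(1, len(motif))
    )


def find_ssrs(seq: str, min_repeats: dict[int, int] = MIN_REPEATS) -> list[tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for each perfect SSR, sorted by start."""
    seq = seq.upper()
    found = []
    for unit, need in sorted(min_repeats.items()):
        i = 0
        while i + unit * need <= len(seq):
            motif = seq[i:i + unit]
            if set(motif) <= set("ACGT") and not is_redundant(motif):
                n = 1
                # Extend while the next unit-sized chunk equals the motif.
                while seq[i + n * unit:i + (n + 1) * unit] == motif:
                    n += 1
                if n >= need:
                    found.append((i, motif, n))
                    i += n * unit  # jump past the repeat just reported
                    continue
            i += 1
    return sorted(found)
```

Rejecting motifs such as AA or ATAT keeps a mononucleotide or dinucleotide run from being reported again as a longer-unit repeat. Start positions are 0-based.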
This work was supported by the International Network for the Improvement of Banana and Plantain (INIBAP), now part of Bioversity International, through a grant under the USAID linkage fund scheme.
- Meinke DW, Cherry JM, Dean C, Rounsley SD, Koornneef M: Arabidopsis thaliana: a model plant for genome analysis. Science. 1998, 662: 679-682.
- International Rice Genome Sequencing Project: The map based sequence of the rice genome. Nature. 2005, 436: 793-800. 10.1038/nature03895.
- Zhao W, Wang J, He X, Huang X, Jiao Y, Dai M, Wei S, Fu J, Chen Y, Ren X, Zhang Y, Ni P, Zhang J, Li S, Wang J, Wong GK, Zhao H, Yu J, Yang H, Wang J: BGI-RIS, An integrated information resource and comparative analysis workbench for rice genomics. Nucleic Acids Res. 2004, 32: D377-82. 10.1093/nar/gkh085.
- Lysak MA, Dolezelova M, Horry JP, Swennen R, Dolezel J: Flow cytometric analysis of nuclear DNA content in Musa. Theor Appl Genet. 1999, 98: 1344-1350. 10.1007/s001220051201.
- Kamate K, Brown S, Durand P, Bureau JM, De Nay D, Trinh TH: Nuclear DNA content and base composition in 28 taxa of Musa. Genome. 2001, 44: 622-627. 10.1139/gen-44-4-622.
- Bartos J, Alkhimova O, Dolezelova M, De Langhe E, Dolezel J: Nuclear genome size and genomic distribution of ribosomal DNA in Musa and Ensete (Musaceae): taxonomic implications. Cytogenet Genome Res. 2005, 109: 50-7. 10.1159/000082381.
- Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.
- Singh NK, Raghuvanshi S, Srivastava SK, Gaur A, Pal AK, Dalal V, Singh A, Ghazi IA, Bhargav A, Yadav M, Dixit A, Batra K, Gaikwad K, Sharma TR, Mohanty A, Bharti AK, Kapur A, Gupta V, Kumar D, Vij S, Vydianathan R, Khurana P, Sharma S, McCombie WR, Messing J, Wing R, Sasaki T, Khurana P, Mohapatra T, Khurana JP, Tyagi AK: Sequence analysis of the long arm of rice chromosome 11 for rice-wheat synteny. Funct Integr Genomics. 2004, 4: 102-17. 10.1007/s10142-004-0109-y.
- Gu Y, Coleman-Derr D, Kong X, Anderson O: Rapid genome evolution revealed by comparative sequence analysis of orthologous regions from four triticeae genomes. Plant Physiol. 2004, 135: 459-470. 10.1104/pp.103.038083.
- Salse J, Piegu B, Cooke R, Delseny M: New in silico insight into the synteny between rice (Oryza sativa L.) and maize (Zea mays L.) highlights reshuffling and identifies new duplications in the rice genome. Plant J. 2004, 38: 396-409. 10.1111/j.1365-313X.2004.02058.x.
- Lai CW, Yu Q, Hou S, Skelton RL, Jones MR, Lewis KL, Murray J, Eustice M, Guan P, Agbayani R, Moore PH, Ming R, Presting GG: Analysis of papaya BAC end sequences reveals first insights into the organization of a fruit tree genome. Mol Genet Genomics. 2006, 276: 1-12. 10.1007/s00438-006-0122-z.
- Vilarinhos AD, Piffanelli P, Lagoda P, Thibivilliers S, Sabau X, Carreel F, D'Hont A: Construction and characterization of a bacterial artificial chromosome library of banana (Musa acuminata Colla). Theor Appl Genet. 2003, 106: 1102-6.
- SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL: The paleontology of intergene retrotransposons of maize. Nat Genet. 1998, 20: 43-5. 10.1038/1695.
- Aert R, Sagi L, Volckaert G: Gene content and density in banana (Musa acuminata) as revealed by genomic sequencing of BAC clones. Theor Appl Genet. 2004, 109: 129-139. 10.1007/s00122-004-1603-2.
- Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, Lin H, Hamilton J, Haas B, Sultana R, Cheung F, Wortman J, Buell CR: The Institute for Genomic Research Osa1 rice genome annotation database. Plant Physiol. 2005, 138 (1): 18-26. 10.1104/pp.104.059063.
- The Arabidopsis Information Resource. [http://www.arabidopsis.org]
- Kuhl JC, Cheung F, Yuan Q, Martin W, Zewdie Y, McCallum J, Catanach A, Rutherford P, Sink KC, Jenderek M, Prince JP, Town CD, Havey MJ: A unique set of 11,008 onion expressed sequence tags reveals expressed sequence and genomic differences between the monocot orders Asparagales and Poales. Plant Cell. 2004, 16: 114-25. 10.1105/tpc.017202.
- Wong GK, Wang J, Tao L, Tan J, Zhang J, Passey DA, Yu J: Compositional gradients in Gramineae genes. Genome Res. 2002, 12: 851-856. 10.1101/gr.189102.
- Santos CM, Martins NF, Horberg HM, de Almeida ER, Coelho MC, Togawa RC, da Silva FR, Caetano AR, Miller RN, Souza MT: Analysis of expressed sequence tags from Musa acuminata ssp burmannicoides, var. Calcutta 4 (AA) leaves submitted to temperature stresses. Theor Appl Genet. 2005, 110: 1517-1522. 10.1007/s00122-005-1989-5.
- Thiel T, Michalek W, Varshney RK, Graner A: Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor Appl Genet. 2003, 106 (3): 411-422.
- Katti MV, Ranjekar PK, Gupta VS: Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol Biol Evol. 2001, 18: 1161-7.
- Jung S, Abbott A, Jesudurai C, Tomkins J, Main D: Frequency, type, distribution and annotation of simple sequence repeats in Rosaceae ESTs. Funct Integr Genomics. 2005, 5: 136-43. 10.1007/s10142-005-0139-0.
- Creste S, Benatti TR, Orsi MR, Risterucci AM, Figueira A: Isolation and characterization of microsatellite loci from a commercial cultivar of Musa acuminata. Molecular Ecology Notes. 2006, 6: 303-306. 10.1111/j.1471-8286.2005.01209.x.
- Raboin LM, Carreel F, Noyer JL, Baurens FC, Horry JP, Bakry F, Tezenas Du Montcel H, Ganry J, Lanaud C, Lagoda PJL: Diploid ancestors of triploid export banana cultivars: molecular identification of 2n restitution gamete donors and n gamete donors. Molecular Breeding. 2005, 16: 333-341. 10.1007/s11032-005-2452-7.
- Chou HH, Holmes MH: DNA sequence quality trimming and vector removal. Bioinformatics. 2001, 17 (12): 1093-1104. 10.1093/bioinformatics/17.12.1093.
- Quackenbush J, Liang F, Holt I, Pertea G, Upton J: The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res. 2000, 28: 141-145. 10.1093/nar/28.1.141.
- Ouyang S, Buell CR: The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants. Nucleic Acids Res. 2004, 32: D360-3. 10.1093/nar/gkh099.
- estinformatics.org. [http://www.estinformatics.org]