The repetitive DNA landscape in Avena (Poaceae): chromosome and genome evolution defined by major repeat classes in whole-genome sequence reads

Background: Repetitive DNA motifs – not coding genetic information and repeated hundreds to millions of times – make up the majority of many genomes. Here, we identify the nature, abundance and organization of all the repetitive DNA families in oats (Avena sativa, 2n = 6x = 42, AACCDD), a recognized health-food, and its wild relatives.

Results: Whole-genome sequencing followed by k-mer and RepeatExplorer graph-based clustering analyses enabled assessment of repetitive DNA composition in the genomes of common oat and its wild relatives. Fluorescence in situ hybridization (FISH)-based karyotypes were developed to understand chromosome and repetitive sequence evolution of common oat. We show that some 200 repeated DNA motifs make up 70% of the Avena genome, with fewer than 20 families making up 20% of the total. Retroelements represent the major component, with Ty3/Gypsy elements representing more than 40% of all the DNA, nearly three times more abundant than Ty1/Copia elements. DNA transposons are about 5% of the total, while tandemly repeated, satellite DNA sequences fit into 55 families and represent about 2% of the genome. The Avena species are monophyletic, but both bioinformatic comparisons of repeats in the different genomes, and in situ hybridization to metaphase chromosomes from the hexaploid species, show that some repeat families are specific to individual genomes, or to the A and D genomes together. Notably, the terminal regions of many chromosomes show different repeat families from the rest of the chromosome, suggesting the presence of translocations between the genomes.

Conclusions: The relatively small number of repeat families shows there are evolutionary constraints on their nature and amplification, with mechanisms leading to homogenization, while repeat characterization is useful in providing genome markers and to assist with future assemblies of this large genome (c. 4100 Mb in the diploid). The frequency of inter-genomic translocations suggests optimum strategies to exploit genetic variation from diploid oats for improvement of the hexaploid may differ from those used widely in bread wheat.

Electronic supplementary material: The online version of this article (10.1186/s12870-019-1769-z) contains supplementary material, which is available to authorized users.


Background
Genome evolution involves multiple processes including whole genome duplications (WGDs or polyploidy), segmental genome deletions or duplications, chromosome restructuring (fusion, fission, translocation, and inversion), and amplification or loss of gene and repetitive sequences, along with DNA mutation [1,2]. There is a growing interest in reconstructing ancestral genomes of fungi [3], animals [4] and plants [5], revealing principles governing genome evolution and diversification leading to speciation and adaptation.
Repetitive DNA constitutes a substantial fraction, typically between 25 and 85%, of plant genomes, and can be referred to as the repeatome. Repeat motifs vary extensively in sequence and dispersion patterns [6][7][8]. Several major groups of repetitive elements are found: ribosomal DNAs (rDNAs) [both 45S (18S-5.8S-26S) and 5S rDNAs with intergenic spacers], the telomeric repeats, class I retrotransposons (amplified through an RNA intermediate), class II DNA transposons (amplified through DNA copies), and tandem repeats (postulated to be generated/modified by slippage replication, uneven crossing-over or rolling circle amplification) [7]. Their presence and similarity, and their variation in copy number and sequence, pose a major challenge to genome assembly and gene annotation [9]. Repetitive DNA has been postulated to have multiple roles in the genome, including genome stability, recombination, chromatin modulation and modification of gene expression [7]. Copy number variations in repeats, representing 5 to 10% of the human genome, are important for disease and population variation [10].
Through the decades up to 2010, repeatome knowledge came largely from DNA annealing experiments, screens of random clones, restriction fragment analyses, or amplification of conserved elements with primers. Now whole-genome shotgun sequencing approaches can be used for genome-wide, unbiased repeat analysis [11][12][13]. A k-mer analysis counts the number of motifs k bases long in whole-genome sequence reads [14], to identify abundant motifs without using reference genomes. Graph-based clustering analysis (e.g. RepeatExplorer [11,15]) is another approach to identify and classify repeats from raw reads. Both are de novo identification strategies, and results can be used for repeat identification or protein domain searches. Because repeats occur at multiple genomic locations and are difficult to assemble, in situ hybridization to chromosomal preparations is essential to identify the genomic locations and specificity of repetitive motifs [16]. These approaches have been used to quantify the genome repetitive landscape in banana, radish, soybean and tobacco [17][18][19][20][21].
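The k-mer counting principle described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function name `count_kmers` and the toy reads are hypothetical, reverse complements are not collapsed, and a dedicated k-mer counter would be needed for real whole-genome data.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-base window across a collection of reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip windows containing ambiguous bases
                counts[kmer] += 1
    return counts

# Hypothetical toy reads, not data from this study
reads = ["ACGTACGTAC", "CGTACGTTTT"]
counts = count_kmers(reads, k=4)
print(counts.most_common(3))
```

Motifs with high counts are candidate repeats; no reference genome is needed because only the reads themselves are scanned.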
Common oat (Avena sativa L., 2n = 6x = 42, AACCDD) is a temperate crop (annual production of 23 million tons in 2017; http://faostat.fao.org) with an approved health claim as a 'superfood' because of oat beta-glucan, which helps reduce blood cholesterol level and heart disease risk [22,23]. Genomic resource development of common oat, important for breeding and improvement, has lagged behind other major crops [24,25]. There are several genetic maps of diploid and hexaploid species [26,27] but no draft genome sequences for the hexaploid crop (6x genome size 12,600 Mb/1C) [28] or its diploid relatives. The oat genome contains numerous families of repeats and apparently frequent chromosome translocations [29,30]. In recent phylogenetic analyses, common oat was inferred to have experienced ancient allotetraploidy and recent allohexaploidy events involving C-, A- and D-genome ancestors [31,32], while genome reshuffling obscures the contributions of different candidate maternal A-genome progenitors (for the bipaternal genome definition, see [32]).
Here, we aimed to elucidate structure, organization, and relationship of all major repetitive DNA classes in diploid and hexaploid oats, examine their chromosomal locations, and understand the significance of repeatome in genome and chromosome evolution of Avena in the context of genomic, bioinformatic and cytogenetic evidence. The complete picture of repetitive DNAs provides new evidence for events occurring during evolution and speciation in the genus, including hybridization and chromosomal translocation events.

Graph-based clustering and repeat composition of Avena
Raw sequence reads (Illumina 250 bp paired end) obtained from Avena sativa, A. brevis, A. hirtula, and A. strigosa averaged 43.23% GC (guanine-cytosine) content (Additional file 1: Figure S1, Additional file 13: Table S1, Additional file 14: Table S2a). For graph-based clustering of reads, subsets of 1.72 to 2.87 Gb were analysed using RepeatExplorer [11] (Additional file 14: Table S2b). In total, more than 70% of reads were assigned to just 200 graph clusters of highly related sequence reads (Additional file 2: Figure S2, Table 1), with 12 to 18 clusters (depending on species) each representing more than 1% of all the reads (Additional file 15: Table S3b).
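The two summary figures quoted above (the share of reads falling into clusters, and the number of clusters each exceeding 1% of reads) follow from simple arithmetic on cluster sizes. The sketch below shows that arithmetic under stated assumptions: `summarize_clusters` is a hypothetical helper, and the cluster sizes are invented toy values, not the counts in Additional file 15.

```python
def summarize_clusters(cluster_sizes, total_reads, threshold=0.01):
    """Return (fraction of reads in clusters, number of clusters above threshold)."""
    clustered_fraction = sum(cluster_sizes) / total_reads
    n_major = sum(1 for size in cluster_sizes if size / total_reads > threshold)
    return clustered_fraction, n_major

# Hypothetical read counts per cluster out of 1000 analysed reads
toy_sizes = [300, 150, 50, 8, 5, 2]
clustered_fraction, n_major = summarize_clusters(toy_sizes, total_reads=1000)
print(f"{clustered_fraction:.1%} of reads clustered; {n_major} clusters above 1%")
```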
After manual verification by checking domain homology or satellite motifs (Additional file 16: Table S5). The clustering analysis groups solo-LTRs and SINEs with their parental elements; cluster graphs showed the greater abundance of LTRs compared to the coding sequences (Fig. 1h). The BLAST results identified fewer than 20 chloroplast sequence clusters (abundance ranking 46-168, removed from further analysis), and about 11 rDNA clusters (Additional file 16: Table S4). Only a small proportion (2.44-4.49%) of clusters were unclassified (Additional file 17: Table S5), showing little similarity to characterized sequences, but some had adenine-thymine-rich domains (Additional file 18: Table S6 [34,35]). Our analyses were not designed to identify most microsatellite arrays (including telomeric sequences), whose motifs are typically shorter than 10-mers.
Clusters showed characteristic graph patterns (Fig. 2a-p) that we used to classify the repeat families. As examples, 0.15% of reads formed the star-shape cluster graph for the tandemly repeated satellite DNA As-T119 (Fig. 2h), 0.77% of reads formed the ray-shape retrotransposon repeat Ah-R31 (Fig. 2n), 0.04% of reads formed the circular-shape repeat Ab-T159 (Fig. 2o), and 0.02% of reads formed the line-shape simple repeat Ast-R176 (Additional file 19: Table S7). Eight of the fragments used for in situ hybridization had 335-360 bp monomers (Additional file 20: Table S8a [36][37][38]). 312CL151C2 (no PCR products without designation, Additional file 20: Table S8b) was unique as it showed a higher order repeat structure: a 232 bp dimer consisting of two closely related 116 bp monomers.
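A higher order repeat such as the 232 bp dimer of two related 116 bp monomers leaves a recognisable signature when a sequence is compared with a shifted copy of itself: identity is high but imperfect at the monomer period and highest at the dimer period. The sketch below illustrates this on a toy array built from 8 bp monomers. The function `shift_identity` and the sequences are hypothetical, and this is a simplification of the dotplot inspection used in the study.

```python
def shift_identity(seq, shift):
    """Fraction of positions where seq[i] equals seq[i + shift]."""
    if not 0 < shift < len(seq):
        raise ValueError("shift must be between 1 and len(seq) - 1")
    pairs = len(seq) - shift
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + shift])
    return matches / pairs

# Toy higher order repeat: two monomers differing at one base, repeated as a dimer
monomer_a = "ACGTTGCA"
monomer_b = "ACGTTGGA"
array = (monomer_a + monomer_b) * 4

for shift in (4, 8, 16):
    print(shift, round(shift_identity(array, shift), 3))
```

In this toy case the monomer period (8) gives high but incomplete identity, while the dimer period (16) gives complete identity, mirroring the relationship between the 116 bp and 232 bp units.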

k-mer analyses of Avena
For cumulative repetitivity frequency plots of 10- to 64-mers, a steeper slope indicates a faster change in cumulative percentage; slopes were relatively gentle for short k-mers (10- to 17-mers) and became progressively steeper for longer k-mers (18- to 64-mers) (Fig. 3a-c). For the same repetitivity frequency, a shorter k-mer motif has a higher cumulative percentage (and higher frequency in raw reads; Fig. 3a-c). Among the four Avena species, 16-mer motifs occurring ≥10 times accounted for 44% of the A. sativa genome, somewhat higher than in the other species (28% of A. brevis, 34% of A. hirtula, and 40% of A. strigosa; Fig. 3d). The 64-mer motifs occurring ≥10 times accounted for 11% of the A. strigosa genome (Fig. 3e), higher than in the other species (5% of A. sativa, 2% of A. brevis, and 4% of A. hirtula). Overall, the graphs were consistent with the RepeatExplorer analysis, with a group of very abundant sequences representing about a quarter of the genome (inflection in e.g. the 16-mer graph, Fig. 3d), and other abundant sequences representing about 70%, before a flatter region of the graph with motifs represented fewer than 10 times per genome. For 16-mer motifs occurring ≥10 times, the cumulative percentage of common oat was nearly equivalent to that of Petunia axillaris, followed by A. strigosa, sorghum, A. hirtula, A. brevis, tomato, and potato (Fig. 3f [21][39][40][41]); it was notable that the cumulative percentage of 16-mer motifs occurring ≥1000 times converged for Avena species, tomato, and potato.
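One plausible reading of the cumulative percentages reported above is the share of all k-mer occurrences contributed by motifs seen at least a given number of times. The sketch below computes that quantity for a toy count table. The function name `fraction_at_or_above` and the counts are hypothetical, and the exact definition used to draw Fig. 3 may differ.

```python
from collections import Counter

def fraction_at_or_above(kmer_counts, min_count):
    """Share of all k-mer occurrences from k-mers seen >= min_count times."""
    total = sum(kmer_counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in kmer_counts.values() if c >= min_count)
    return repeated / total

# Hypothetical k-mer counts, not data from this study
toy = Counter({"AAAA": 50, "ACGT": 30, "CCGG": 12, "GATC": 5, "TTGA": 3})
for threshold in (1, 10, 50):
    print(threshold, fraction_at_or_above(toy, threshold))
```

Evaluating the function over a range of thresholds gives the points of a cumulative curve; an inflection in that curve separates very abundant motifs from the remainder.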

Chromosomal location and genome specificity of highly repetitive motifs
Repetitive fragments used as FISH probes
To localise repeats on Avena sativa chromosomes (Figs. 4 and 5, Additional files 3-10: Figures S3-S10), 25 probes were designed from representative sequences identified by k-mer and RepeatExplorer analyses to use for in situ hybridization, including nine satellite, one DNA transposon, four LTR-Gypsy retroelements and eleven unclassified sequences (Additional file 19: Table S7). Except for AF226603_45bp and pAs120a [36,37], we selected sequences with little or no homology to repetitive elements (TEs or tandem repeats) previously isolated by PCR or cloning strategies [30,42]. 45S and 5S rDNA [34,35] were used to identify some chromosomes (Additional file 18: Table S6).
Copy numbers and relative proportions of the selected probes were analysed in silico in A. sativa and three A-genome diploids (Additional file 20: Table S8a) to check abundance and genome specificity. Most repeats (92%) were present in all four genomes (Additional file 20: Table S8a), with expected variation from over 3 million copies per genome (As-16mer43bp in A. Additional file 19: Table S7, Additional file 20: Table S8a). The monomer number shown in dotplots (Fig. 2f, j, m and o) was a consequence of variability between monomers and of the clustering algorithm, and was not related to genome structure: tandem repeat counts require very long reads (e.g. Nanopore or PacBio Sequel) or chromosome walking (e.g. BAC clones). Repeat copy numbers in the three diploid A-genome species analysed were not the same (Fig. 6a), showing the whole spectrum of distribution and indicating differential amplification or loss after evolutionary separation.
No repeat was predominant in A. strigosa, a species where repeats As-T153 and As-16mer43bp were also absent. One repeat family (Ast-R171) was much more abundant in A. hirtula, and four families (As-T153, Ab-T166, Ast-T125 and Ab-T145) were dominant in the A. brevis genome (Fig. 6a, Additional file 20: Table S8a). The As-16mer43bp repeat was abundant only in A. brevis, being absent in A. strigosa and present in only 260 copies in A. hirtula.
Repeat names include species origin of the exemplar family member (Ab, Avena brevis; Ah, A. hirtula; Ast, A. strigosa; As, A. sativa) and repeat type (T, tandem; R, retrotransposon).

For in situ hybridization, sequence fragments were synthesized as end-labelled oligonucleotides, or amplified by PCR from genomic DNA of Avena (A-genome species A. brevis, A. hirtula, A. strigosa, A. atlantica, A. wiestii, and A. longiglumis and C-genome species A. eriantha; Additional file 18: Table S6). Results from in situ hybridization of the 25 repetitive sequences identified here to A. sativa metaphase chromosomes, using two or three probes simultaneously, are summarized in Additional file 21: Table S9, Figs. 4 and 5, and Additional files 3-10: Figures S3-S10. As predicted from k-mer and RepeatExplorer analyses (Additional file 20: Table S8a), all probes gave hybridization signals, and signal strength was generally in accordance with the copy numbers estimated in silico.
Repeat As_16mer43bp is highly abundant with over 3 million copies, representing 1.45% of the Avena sativa genome (Fig. 2a, Additional file 20: Table S8a; a synthetic labelled oligonucleotide was used as probe), and showed strong signals, being dispersed along all C-genome chromosome arms with stronger signals at most pericentromeric regions, and weak dispersed signals on distal regions of 14 A- or D-chromosome long arms (17/18 & 29-40; Fig. 5a). The less abundant repeat AF226603_45bp (0.33% of A. sativa genome; Additional file 20: Table S8a) showed a similar distribution pattern: abundant on 14 C-chromosomes (755,507 copies; Fig. 2b, Additional file 20: Table S8a). Several retrotransposon repeats, Ab-R18, Ab-R19 and Ast-R87 (Fig. 2c, d and e), but also tandem repeats Ab-T145, As-T153 and Ah-T118 (Fig. 2f, g and i), showed dispersed signals, with high abundance on 14 C-chromosomes (Fig. 4b, Additional file 3: Figure S3a-S3b) and much weaker or no signals on A- and D-chromosomes (Fig. 4a and c).
Other probes, e.g. As-T175 and As-T119 (Fig. 2h and j), labelled only some C-genome chromosomes and additionally showed more uniform signals on all chromosomes; FISH signals visible as double or multiple dots (Additional file 3: Figure S3d and S3e) indicate large tandem arrays of at least 20 kb.

Discussion
Identification and abundance of repetitive DNAs
Genome-wide analysis of unprocessed Avena genomic DNA sequence reads using motif counting (k-mer analysis) and graph-based clustering shows that repetitive DNA sequences represent some 72% of the genome (Fig. 1, Table 1). Combining the in silico analysis with molecular cytogenetics on chromosomes in situ, we could identify the nature of the motifs and measure their abundance to give a comprehensive survey of the repeat landscape of oat and its evolutionary relationships (Figs. 1 and 6). Notably, 96% of the sequences examined here could be classified as being related to either transposable elements or a relatively small number of tandemly repeated motifs (Figs. 1 and 2). Our strategy would not be expected to reveal microsatellite motifs, short runs of dinucleotide or trinucleotide repeats with unique flanking regions, known to have an uneven distribution across the genome [44]. While there are increasing reports of genome-wide repeat surveys [13,45,46], most sequence assemblies collapse repeats to variable extents [21,47], while library screening or PCR amplification with primers is selective. Thus detailed comparisons between our results and many published analyses using whole genome assemblies, reference repeats (e.g. RepeatMasker), or targeted screening may not be valid. Furthermore, classification of "families" within major groups of repeats is flexible, with some distinct families, and others where there are intermediates between sequences that would otherwise be regarded as distinct. Many of the major families of repeats identified here have been identified previously in selective screens of DNA libraries [30,36], although these studies could not quantify their abundance in the various diploid and the hexaploid genomes. Importantly, unlike the analysis of unprocessed random reads here, selective screens cannot show that all the repetitive components of the genome have been surveyed.
The 16-mers identified by our k-mer analysis with more than 10 copies per genome correspond to the figures from potato and tomato (see Fig. 3f). The 16-mers occurring fewer than 10 times represented between 24 and 40% of the oat genome (Fig. 3), indicating a relatively high variation within repeat sequence motifs; these families may not be detected by reassociation kinetics (experimental) or graph-clustering (bioinformatics). Overall, the proportion of 16-mers occurring 10 or fewer times in Avena is similar to the 30% in Petunia or Sorghum (Fig. 3f). However, the variation between the four Avena species (24, 28, 34 and 40%; Fig. 3) is hard to explain but may suggest greater homogenization in A. strigosa. A change in slope, as seen in A. strigosa for k-mers longer than 16 bp (Fig. 3), could be related to the frequencies of different repetitive DNA classes or their homogenization, but there were no conspicuous differences in repeat classes between A. strigosa and the other diploids (Additional file 14: Table S2). A. sativa shows a weaker change in graph slope (Fig. 3), consistent with addition of the genomes from one A. strigosa-like and two other species. LTR retrotransposons are largely responsible for the dramatic differences in genome sizes between related plant species, e.g. the six-fold size difference between maize and rice genomes [48,49], so they could equally play a role here.
All genomes have mechanisms controlling TE amplification. Schorn et al. [54] have shown in Arabidopsis how RNA-driven DNA methylation is responsible for silencing, as is most likely the case for pararetrovirus sequences [55]. Large genomes bear higher proportions of TE sequence, and Lyu et al. [56] suggested that TE load reduction is the most important driver of genome diminution in mangroves. Here, it is notable that oat retrotransposon-related repetitive sequence families vary in abundance between diploid species, and some are essentially specific to one or two of the genomes (Fig. 6), suggesting loss and gain of particular families and directed turnover.

Tandem/satellite repeats
Tandem repeats, or satellite DNA, are a feature of most eukaryotic genomes. Here, we found 12 tandem repeat families, in eight of which the monomer lengths were 335-360 bp (Additional file 22: Table S10). This has been noted as a monomer length required to wrap around two nucleosomes (~150 bp DNA for a single nucleosome) spaced by a variable unwrapping linker region of ~30-60 bp [16,57]. Structural interactions between nucleosomes and DNA repeats can impact chromatin dynamics [58,59], and the stable wrapping of tandem repeats could be important for genome stability and methylation of domains leading to silencing. Tandem repeat probes show discrete signals on common oat chromosome arms (Additional file 3: Figure S3d-S3e, Additional file 5: Figure S5e-S5f) indicating large arrays of at least 20 kb, but some also show dispersed signals at intercalary sites (Fig. 5b-d), likely to represent multiple smaller arrays.
Submotifs of a repeat family can be used as genome-specific probes for in situ hybridization, e.g. the Brassica C-genome specific CACTA transposon [60]. Here, the probe motif As_16mer43bp was found in 3,346,757 reads in A. sativa, but was absent in A. strigosa (Additional file 20: Table S8a). Another repeat motif, AF226603_45bp, was first identified by Southern blot analysis [61] as being abundant. In contrast, two short 45 bp motifs of unknown family produced uniformly dispersed signals on common oat C-chromosomes (Fig. 4a, c-e). This is a relatively unusual length for plant repeat motifs, although c. 60 bp-length minisatellites are common in mammals [62].
rDNAs were used as chromosome-specific probes. Based on phylogenetic evidence, in Avena, two NORs (45S rDNA sites) per haploid chromosome set were ancestral characters, while chromosome complements with 4 or more NORs were derived characters [63]. Structurally, the elimination of C-genome rDNAs and partial elimination of A-genome rDNAs following a hexaploidization event in A. sativa indicates that rDNA from one ancestor (the paternal genome; see [31]) might be silenced and lost following hexaploidization in A. sativa. Similar rapid loss of 45S rDNA sites is also seen in the tetraploid wheat-relative Aegilops ventricosa (with DDNN genome designation), where D-genome 45S rDNA sites are lost [64].

Evolution of repeats
Diploid speciation and repetitive DNAs
The karyotypes with repetitive sequence locations provide a fresh perspective in understanding evolution in Avena. The A-genome-specific repeat pAs120a was isolated long ago by Linares et al. [36], who discussed the repeat length and the existence of four monomers inserted within pAs120a, suggesting cautiously that the pAs120a sequence could be classified as a satellite DNA sequence. In contrast, our sequence clusters homologous to pAs120a showed high similarity to Ogre/Tat and chromovirus retrotransposons (Fig. 2k), indicating that this repeat originated from a retroelement, and might have been generated as a tandem repeat from a retroelement subregion by rolling circle amplification. However, twenty years later, we still share the uncertainty of Linares et al. [36] about the evolution of this sequence to become A-genome-specific. Other sequence families also show differential amplification or reduction in individual Avena A-genomes (Fig. 6a), and in the hexaploid compared to the diploid ancestral genomes; e.g. the higher abundance of C-chromosome specific motifs identified in the A. sativa genome (e.g. As-T153) or high abundance of D-chromosome specific motifs identified in A. hirtula (Ast-T116; Fig. 6a, Additional file 20: Table S8a).
From molecular dating analyses, the crown age of the C-genome diploid lineage was ~20 Mya, older than the crown age of A-genome polyploid lineages [31,32]. This is supported by the greater proportion of C-genome specific motifs, diverging from the common ancestors before the radiation of A- and D-genome specific motifs, as the A- and D-genome specific motifs amplified independently in common oat (Fig. 6b). This evolutionary scenario is also supported by repeats common to the A- and D-genomes or all three genomes, but no repeats were found to be specific for the C- and A- or C- and D-genomes. Retrotransposons may have a role in genome behaviour by acting as nuclei for RNA-dependent DNA methylation (as in [65]), leading to position effect variegation via heterochromatinization around repetitive elements affecting adjacent gene expression [66,67].

Distal chromosome regions and translocations
The repetitive sequences used as probes here show major genomic changes through chromosome translocations in common oat [43]. In situ hybridization with the repetitive probes is unique as a method to show the nature and extent of these translocations. Translocations can alter chromosome recombination frequencies, and can lead to genetic and evolutionary isolation of new hybrids. The result also suggests the opportunity for introduction of genetic variation as small chromosome segments from wild diploid Avena species into the hexaploid, with recombinant segments involving any genome, or potentially more distant diploid relatives. The oat intergenomic translocations contrast with wheat, where there are no reports of stable intra-genomic translocations with the possible exception of the 4A and 4B linkage groups, with the 4A chromosome differing substantially in repeat content from other chromosomes in the A and B genomes. In oats, the rearranged chromosomes may have adaptive value by enabling different expression levels and modulation of expression of genes from the homoeologous genomes.

Large scale genome organization and implications
Genomic repeats provide the physical basis for integrating different genomic regions for coordinating interdependent aspects of genome functions [68]. Common oat was inferred to have originated following ancient allotetraploidy (D- and C-genomes) and recent allohexaploidy (ACD-genome) events in the subfamily Pooideae [31]. Given the repeat abundance spanning A- and D-chromosomes, we speculate that the diverged repeats of A. strigosa (D-genome) and A. atlantica-A. brevis-A. longiglumis-A. wiestii (A-genome) might represent the basis of evolutionary separation of A- and D-genome progenitors, as in other species [69]. Genome-specific repeat amplification, followed by subgenome-function divergence, has been suggested to provide a mechanism driving cold acclimatization in Avena [70], and polyploidization with subgenome dominance may support rust resistance phenotypes that ultimately correspond to agronomic traits [26]. Many evolutionary models suggest that polyploid formation should be associated with a selective advantage, favouring parental genome divergence. Given that six of 13 retrotransposons plus three tandem repeats were more abundant in common oat than in diploid relatives (Additional file 20: Table S8a), it is reasonable to speculate that a burst of ancient repeat-associated genomic duplication may explain expansion of the oat genome size.

Conclusion
Here, the complete repetitive DNA content of Avena has been surveyed in whole genome data, and the repeats make up 70% of all the DNA. The most abundant elements were previously described, and we show that a small number of repeat families, some not described before, contribute a high proportion of all the repetitive DNA. It is clear that repeat amplification and turnover of repeat families have been involved during evolutionary separation of the ancestors of common oat, and, in combination with frequent intergenomic translocations (not seen in other cereals) and further turnover events, have led to the rapid evolution seen in the hexaploid. Transposable elements are a major contributor to the genome, although the families present, and their relative abundances, differ in Avena from other species groups. Used as a source of DNA markers or chromosomal probes, retroelements have utility in crop breeding [71] and tracking chromosomes in hybrids and translocation lines. With increasing data coming from long-read technologies (including Nanopore, PacBio Sequel and high-throughput chromosome conformation capture), knowledge of the repeat landscape is useful in optimizing the approach to genome sequence assembly by accounting for the abundance and genome distribution of only a small number of repeat families.

Plant material
Eight Avena species were used in this study (origins of samples are given in Additional file 13: Table S1; for chromosome and genome designations see Liu et al. [31]; seeds were obtained from CN-Saskatchewan and USDA-Beltsville Germplasm System). The 171.9 Gb of raw sequences representing 4.58× to 7.06× coverage of Avena genomes [A. sativa (66.1 Gb), A. brevis (34.6 Gb), A. hirtula (35.3 Gb), and A. strigosa (35.9 Gb)] were generated by whole-genome shotgun sequencing with 2 × 250 bp reads from 500 bp paired-end libraries (Nanjing Genepioneer Biotechnologies Co. Ltd., Illumina HiSeq2500 platform; Additional file 1: Figure S1, Additional file 14: Table S2a). Project data have been deposited at the National Center for Biotechnology Information (NCBI) under BioProject PRJNA407595 (SRR6056489-6056492).

Repeat discovery
Graph-based clustering of sequences
Similarity-based clustering, repeat identification, and classification of a subset of paired-end raw reads (1.72-2.87 Gb, representing 2.60-8.29% of Avena genomes; Additional file 14: Table S2b) were performed by RepeatExplorer analysis (Additional file 15: Table S3). Read pairs whose overlaps covered ≥50% of the read length with 90% similarity were used as graph edges, to avoid the potential error of "bridge" reads with partial similarity linking two unrelated communities [15]. The longest contigs in each of 821 clusters were analysed by BLAST search against the NCBI database to check for repeat identification (Fig. 1, Additional file 16: Table S4) and repetitive DNA composition was summarized manually (Additional file 17: Table S5). Primer pairs were designed from one contig of each retroelement or tandem repeat belonging to clusters with no (or few) 1st-order neighbours [72] (Additional file 12: Figure S12a-S12t, Additional file 18: Table S6, Additional file 19: Table S7). Probe designations use the genus and species abbreviation plus T for tandem or R for retrotransposon type, followed by the cluster number. Primers designed from clones pTa71 [34] and pTa794 [35] were used for 45S and 5S rDNA amplification respectively (Additional file 18: Table S6). Cluster graphs, dotplots and FISH probe copy numbers were investigated by SeqGrapheR v.3.3.1 [15] and Geneious [38] (Fig. 2a-p, Additional file 20: Table S8).
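The edge criterion described above (overlap of at least 50% of read length at 90% similarity) can be illustrated with a minimal sketch that groups reads into clusters. This is a deliberate simplification: it uses plain connected components via union-find, whereas RepeatExplorer applies community detection on the graph. The function `cluster_reads`, the tuple layout of `hits`, and the toy values are all hypothetical.

```python
from collections import defaultdict

def cluster_reads(hits, min_overlap=0.5, min_identity=0.90):
    """Group reads linked by qualifying pairwise hits (connected components).

    hits: iterable of (read_a, read_b, overlap_fraction, identity) tuples.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, overlap, identity in hits:
        root_a, root_b = find(a), find(b)  # registers both reads
        if overlap >= min_overlap and identity >= min_identity:
            parent[root_a] = root_b  # accept the edge, merge components

    clusters = defaultdict(set)
    for read in list(parent):
        clusters[find(read)].add(read)
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical pairwise similarity hits, not data from this study
hits = [
    ("r1", "r2", 0.8, 0.95),
    ("r2", "r3", 0.6, 0.92),
    ("r3", "r4", 0.3, 0.99),  # rejected: overlap too short
    ("r4", "r5", 0.7, 0.85),  # rejected: identity too low
    ("r5", "r6", 0.9, 0.97),
]
clusters = cluster_reads(hits)
print([sorted(c) for c in clusters])
```

Rejecting short or low-identity overlaps keeps a single "bridge" read from merging two otherwise unrelated groups, which is the rationale given for the thresholds.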

Multicolour fluorescence in situ hybridization (FISH)
Root tips were fixed in 96% ethanol:glacial acetic acid (3:1) for at least 1.5 h and stored in the fixative at −20°C overnight. An enzyme solution with 0.2% Cellulase Onozuka R10 (Yakult Pharmaceutical, Tokyo), 2% Cellulase (C1184; Sigma-Aldrich) and 3% Pectinase (P4716; Sigma-Aldrich, St Louis, USA) was used to digest root tips for 90 min at 37°C. Root tips were macerated in a drop of 60% acetic acid, and roots were squashed gently under a coverslip.

Additional files
Additional file 1: Figure S1. Spikelet morphology of eight sampled Avena species. a A. atlantica: the dispersion units-the upper florets are attached to the lower floret and only the lower floret shows disarticulation. b A. brevis: spikelets show persistent florets with bidenticulate lemma tips at maturity. c A. hirtula: a Mediterranean wild type with lemma bristles 6-10 mm. d A. longiglumis: 2-3 florets/spikelet, each floret is disarticulated; lemma back is covered with dense hairs. e A. strigosa: 2-3 florets/spikelet and persistent florets. f A. wiestii: desert and steppe wild type with lemma bristles 5-8 mm. g A. eriantha: glumes markedly unequal in size. h A. sativa: spikelets 1.5-4 cm with typically spread glumes at maturity. Scale bars = 1 cm. (TIF 4210 kb)

Additional file 2: Figure S2. Distribution of graph-based clusters. Hierarchical agglomeration of RepeatExplorer analyses of four Avena species genomes are shown. a A. sativa S312. b A. brevis B289. c A. hirtula H299. d A. strigosa S135. Coloured bars denote clusters ≥0.01% of genome: x-axis denotes the cumulative read number percentage while y-axis denotes the read numbers in the clusters. Bars are coloured according to the repeat types of cluster annotation (Additional file 15: Table S3). (TIF 1895 kb)

Additional file 3: Figure S3. Localization of selected C-genome specific repetitive sequences on Avena sativa metaphase chromosomes by multicolour FISH. Probe signals were captured individually with a black and white CCD camera and then pseudo-coloured to create overlaid images. For detailed description of signal distribution see Additional file 21: Table S9. a AF226603_45bp (hybridization sites displayed in red), Ab-R18 amplified from A. atlantica (green), and pAs120a from A. atlantica (blue). Note that overlapping signals of the red and green probe give yellow signals. b Ab-R19 (red) and Ab-R126 (green), counterstain DAPI (blue) shown on all chromosomes. Note that overlapping signals of the red and green probe appear white, but several chromosome ends not labelled by either probe appear blue or show green double dots. c AF226603_45bp (red), Ah-T118 from A. hirtula (green), and pAs120a from A. hirtula (blue). Overlapping signals of the red and green probe appear yellow and show non-uniform labelling of chromosomes. An interphase nucleus is visible at the bottom of the image. d As-T119 (green), double-dots (starred) appearing in yellow on top of the red signal of As_16mer43bp. DAPI fluorescence shown in blue. e As-T175 (green, double-dots) and As_16mer43bp (red) showing large blocks of hybridization signal on C genome chromosomes (starred). DAPI fluorescence shown in blue. f TET labeled AF226603_45bp (red), As_16mer43bp (pink), and DAPI (blue). Scale bars = 5 μm. (TIF 6648 kb)