To contribute to the long-term sustainability of flax production and diversification, the germplasm stored in PGRC has been comprehensively characterized for morphological, phenological and agronomic characteristics. This valuable phenotypic information enabled the construction of a flax core collection of 407 accessions to further flax genetic studies and improvement. Here, we report on the genetic characterization of the core collection based on 448 microsatellite loci, which represents one of the largest flax genetic studies published to date [14–16, 18–21, 23–26, 28, 29, 41].
Genetic relationships and population structure
Understanding the genetic relationships and structure of core collections is critical to control false positives in AM. The NJ tree grouped the 407 flax accessions mainly, but not exclusively, according to geographical origin. The presence of accessions from countries outside a given geographical cluster could be explained by occasionally unreliable passport data, in which the donor country is recorded as the country of origin. As a consequence, the names of the sub-groups were assigned according to the geographic origin of the majority of the accessions within them.
The South Asian sub-group of G1 was the most genetically distinct. Fu reported similar results in 2,727 flax accessions assessed with 149 RAPD markers. However, in his study, the Indian subcontinent and Central Asia were considered related groups rather than a unified cluster. Differences in the marker systems and the extent of genome coverage (414 mapped microsatellite vs. 149 RAPD markers) could explain the resolution differences between studies. The active exchanges of flax germplasm between France, Germany, the United Kingdom and Hungary provide support for the Western European grouping. The genetic relationships among G1 accessions were also supported by a weak population differentiation among sub-groups (FST = 0.05–0.11, Additional file 2: Figures S1a, b). Within G3, the North American sub-group reflects historical germplasm exchange between the U.S.A. and Canada. The Eastern European sub-group contained most of the fiber flax accessions from the Netherlands and the former Soviet Union, but it also included linseed accessions that were not intermixed with them. The two types were separated by a small group of U.S.A. accessions, mostly of fiber type, clustered within this sub-group. Similar results observed in the population structure analyses and the lowest FST (0.02) between sub-groups (Additional file 2: Figures S1a, b) could explain the interstitial presence of the U.S.A. accessions. The two major groups supported by our combined approach showed weak population subdivision in support of the breadth of the genetic diversity captured in this collection, making it ideal for AM.
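To make the FST values quoted above concrete, the following minimal sketch computes a single-locus FST from sub-group allele frequencies. The estimator used in this study is not stated in this section; the sketch assumes Nei's formulation, FST = (HT - HS) / HT, with equal weights for all sub-groups, and the function names and example frequencies are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for one locus."""
    return 1.0 - sum(p * p for p in freqs.values())


def fst_single_locus(subpops):
    """Nei-style FST for one locus (assumed estimator, equal weights).

    subpops: list of {allele: frequency} dicts, one per sub-group.
    """
    n = len(subpops)
    # HS: mean expected heterozygosity within sub-groups
    h_s = sum(heterozygosity(f) for f in subpops) / n
    # HT: expected heterozygosity of the pooled (averaged) frequencies
    alleles = set().union(*subpops)
    pooled = {a: sum(f.get(a, 0.0) for f in subpops) / n for a in alleles}
    h_t = heterozygosity(pooled)
    if h_t == 0.0:
        raise ValueError("locus is monomorphic across all sub-groups")
    return (h_t - h_s) / h_t


# Hypothetical frequencies for two sub-groups at one locus
example = [{"A": 0.6, "B": 0.4}, {"A": 0.4, "B": 0.6}]
print(round(fst_single_locus(example), 3))
```

With these illustrative frequencies the value is about 0.04, which falls within the range of weak differentiation reported for the sub-groups (0.02 to 0.11). A multi-locus estimate would typically combine HS and HT across loci before taking the ratio.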
Strong population structure, familial relatedness, or both, may be significant in a core collection and would negatively impact AM. Yu et al. developed a mixed linear model (MLM) which incorporates the pairwise kinship (K matrix) to correct for relatedness. Spurious associations cannot be controlled completely by population structure (Q matrix) alone [37, 43]. Models incorporating a K matrix are generally superior in controlling the rate of false positives while maintaining statistical power as compared to those using only a Q matrix.
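For reference, the Q + K mixed model is commonly written as below. The notation is the usual textbook form rather than a quotation from this study, and the independent, identically distributed residual is a simplifying assumption.

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{S}\boldsymbol{\alpha}
           + \mathbf{Q}\mathbf{v} + \mathbf{Z}\mathbf{u} + \mathbf{e},
\qquad
\operatorname{Var}(\mathbf{u}) = 2\mathbf{K}\,\sigma^2_g,
\quad
\operatorname{Var}(\mathbf{e}) = \mathbf{I}\,\sigma^2_e
```

Here y is the vector of phenotypes, β the fixed effects other than marker or population effects, α the marker effects being tested, v the fixed effects of population structure (Q matrix), u the random polygenic background effects whose covariance is proportional to the kinship matrix K, and e the residual. Dropping the Zu term gives the Q-only model.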
In self-pollinated crops or inbred lines, coancestry estimates tend to be higher than in outcrossing species because the high heterozygosity of outcrossers reduces the probability that two alleles observed at a locus are identical by state. In our core collection, approximately 80% of the pairwise coancestry estimates ranged from 0.1 to 0.3, indicating that most of the lines had weak relatedness (Figure 2a). We anticipate that, with the weak population structure and relatedness of the core collection, an MLM correcting for K should provide sufficient statistical power to control most of the false positive associations in future AM studies.
A suitable core collection for AM should encompass as much phenotypic and molecular diversity as can be reliably measured in a given environment [36, 37]. An average of 5.32 alleles per locus over 414 microsatellites was observed in our core collection. This value is higher than the range previously reported (2.72–3.46) [28, 41, 45, 46]. This allelic diversity even exceeded that of a diverse sample of L. usitatissimum L. subsp. angustifolium (Huds.) Thell. (wild progenitor) and L. usitatissimum L. subsp. usitatissimum (4.62). This high value may be the result of the number of genotypes analyzed (407), the choice of the germplasm, the number of microsatellite loci (414 neutral out of 448) and the microsatellite repeat type and length [29, 47].
A higher number of private alleles were observed in G1 as compared to G3 (Table 1). The Western European sub-group was particularly rich in private alleles with 246. Novel genetic variations, not previously sampled or utilized in modern flax breeding programs, may be present in this sub-group, offering unique alleles for broadening the diversity of flax gene pools. This is contrary to previous studies that have reported generally low genetic diversity of flax germplasm [18, 21, 23, 26–28]. Although 85% of the accessions of our core collection are cultivars and breeding materials, the collection possesses abundant genetic diversity, an advantageous attribute for dissecting the genetic basis of QTL for immediate application in flax breeding [36, 48].
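The two diversity summaries discussed here, mean alleles per locus and private alleles per group, can be illustrated with a short sketch. It assumes one allele call per accession and locus (homozygous inbred lines); the group labels, locus names and allele sizes in the example are hypothetical, and this is not the software used in the study.

```python
from collections import defaultdict


def alleles_by_group(calls):
    """Index allele calls as locus -> group -> set of alleles.

    calls: iterable of (group, locus, allele) tuples.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for group, locus, allele in calls:
        seen[locus][group].add(allele)
    return seen


def mean_alleles_per_locus(seen):
    """Average number of distinct alleles per locus over all groups."""
    counts = [len(set().union(*groups.values())) for groups in seen.values()]
    return sum(counts) / len(counts)


def private_allele_counts(seen):
    """Number of alleles found in exactly one group, summed over loci."""
    private = defaultdict(int)
    for groups in seen.values():
        for group, alleles in groups.items():
            others = set().union(
                *(a for g, a in groups.items() if g != group)
            )
            private[group] += len(alleles - others)
    return dict(private)


# Hypothetical calls: (group, locus, allele size in bp)
calls = [
    ("G1", "Lu1", 150), ("G1", "Lu1", 152), ("G3", "Lu1", 150),
    ("G1", "Lu2", 200), ("G3", "Lu2", 200), ("G3", "Lu2", 204),
    ("G3", "Lu2", 206),
]
seen = alleles_by_group(calls)
print(mean_alleles_per_locus(seen))
print(private_allele_counts(seen))
```

In this toy data set, allele 152 at the first locus is private to G1, while alleles 204 and 206 at the second locus are private to G3. Because private-allele counts depend on sample size, comparisons between groups of unequal size are usually made after rarefaction.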
Low LD demands the use of dense marker sets resulting in tight linkage between markers and QTL, an advantageous criterion for breeding applications because the predictive ability of a marker will be robust through generations. The average r² of the entire core collection was 0.036 and the average genome-wide LD decayed within 1.5 cM (Figure 2b). In self-pollinated species, where recombination is less effective than in outcrossing species, LD declines more slowly. Nonetheless, the germplasm that makes up the collection plays a key role in LD variation because the extent of LD is influenced by the level of genetic variation captured by the target population. For example, in wild barley (Hordeum vulgare ssp. spontaneum), despite its high rate of self-fertilization (~98%), LD decayed within 2 kb, a value similar to that observed in maize, an outcrossing species. The low LD of this core collection dictates the need for higher marker saturation to provide superior mapping resolution and QTL detection power by AM as compared to using biparental linkage maps. Alternatively, selection of sub-groups with low FST and higher but similar levels of LD would require a reduced number of individuals and markers for exploratory AM.
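As an illustration of the LD statistic reported above, the sketch below computes r² between two loci from phased allele calls. It is simplified to biallelic loci coded 0/1, with one haplotype per inbred accession; microsatellites are multi-allelic, so the calculation used in this study would differ, and the function name and data are hypothetical.

```python
def r_squared(locus_a, locus_b):
    """Squared allele-frequency correlation between two biallelic loci.

    locus_a, locus_b: equal-length sequences of 0/1 allele codes,
    one entry per accession (assumed homozygous, so phase is known).
    """
    n = len(locus_a)
    if n == 0 or n != len(locus_b):
        raise ValueError("loci must be non-empty and of equal length")
    p_a = sum(locus_a) / n          # frequency of allele 1 at locus A
    p_b = sum(locus_b) / n          # frequency of allele 1 at locus B
    # frequency of the haplotype carrying allele 1 at both loci
    p_ab = sum(1 for a, b in zip(locus_a, locus_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b            # disequilibrium coefficient D
    return d * d / denom


# Hypothetical calls for four accessions
print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # alleles always co-occur
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # alleles are independent
```

Averaging such pairwise values within bins of genetic distance (in cM) and locating the distance at which the mean falls below a chosen threshold is one common way to describe LD decay.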
The percentage of loci pairs in significant LD was fairly similar in each sub-group except for the North American and Eastern European sub-groups, which registered the highest values, possibly reflecting their more intensive artificial selection and narrow germplasm. Although our core collection did not behave as an unstructured large population, our combined analyses of population structure showed that G1 and G3 were weakly differentiated, representing two ancestral populations that minimize differences in LD and potentially the number of spurious associations (Figures 1a, b). Thus, our characterization of LD within diverse genetic groups provides the foundation for cost-effective AM studies in flax and demonstrates the usefulness of the collection for AM.
Identification of non-neutral loci
Flax is one of the few domesticated plants that have been subjected to disruptive selection. North America almost exclusively grows linseed and, up until recently, the stems were considered more problematic than beneficial because of their slow field biodegradation. However, the use of short fibers has received increased attention in North America in the last few years because of the interest in extracting value from the stem of linseed varieties. Stem fiber content does not appear to be associated with qualitative or quantitative plant characteristics in flax germplasm, indicating that there are no major biological restrictions for pyramiding agronomic and seed quality traits with high fiber content.
Crops have been subjected to strong selective pressure directed at genes controlling traits of agronomic importance during their domestication and subsequent episodes of selective breeding. Under positive selection, favourable alleles will increase in frequency until fixation. As an effect of genetic hitchhiking, loci closely linked to beneficial alleles might present distortions from neutral expectations. Genome scans have allowed the identification of candidate loci involved in domestication and breeding traits in several crops [47, 51] and domesticated animals [52, 53]. However, population structure and bottlenecks can mimic the effect of selection and create false positives. The combination of several methods based on different assumptions can reduce false positives.
We applied four different tests of neutrality to identify the genomic regions that deviate from neutral expectations and are potentially associated with divergent selection for fiber and linseed. Collectively, 86 candidate genes were identified at nine loci (Additional file 4: Table S2). Among our candidate genes, we found a β-tubulin involved in cell morphogenesis and elongation of fiber in cotton, a glucan endo-1,3-β-glucosidase associated with cell wall biogenesis/degradation in flax, a chitinase involved in polysaccharide degradation, a MYB transcription factor that influences cellulose microfibril angle in Eucalyptus and a class III HD-Zip protein 4 (HB4) involved in xylem identity in flax (Additional file 4: Table S2). Candidate genes such as pyruvate dehydrogenase E1 and fatty acid alpha-hydroxylase, involved in fatty acid biosynthetic processes, were also identified (Additional file 4: Table S2). However, β-galactosidase and cellulose synthase, two key enzymes for cell-wall modification and cellulose synthesis in flax [56, 58], were not present at any of the nine loci. Genes previously identified in flax microarray analyses of hypocotyl and phloem fiber development, and genes differentially expressed between flax inner and outer stem tissues, were found among our candidate genes (Additional file 4: Table S2).
Although preliminary, our scans provided the first insights into non-neutral loci potentially affected by divergent selection in flax. Candidate genes, especially those previously reported [56, 58], will require further investigation and validation. To enhance the probability of identifying additional candidate loci, a high density of markers would be desirable. Currently, next-generation sequencing technology enables the re-sequencing of a large number of accessions at a reasonable price. Thus, high-quality and dense single nucleotide polymorphism (SNP) markers promise to provide comprehensive genome coverage for the identification of non-neutral genomic regions in flax. Such genomic tools for flax genetic studies are being developed and more comprehensive genomic scans will be possible in the near future.