SSR genotyping
To fully represent the racial diversity of maize in NEA and to complement the data presented by Bracco et al. [14], 10 SSR loci (bnlg1866, phi037, bnlg1182, bnlg252, bnlg1287, bnlg1732, bnlg1209, bnlg1018, bnlg1070 and bnlg1360) were genotyped on 232 individuals from the Argentinean provinces of Chaco, Corrientes, Entre Ríos, Formosa and Misiones (Additional file 1: Figure A1). Landraces were collected in 1977, 1978 and 2005 directly from farmer fields and preserved at the Banco de Germoplasma EEA INTA Pergamino, or at the Laboratorio de Recursos Genéticos Vegetales “N.I. Vavilov”, Universidad de Buenos Aires. Racial identification, voucher specimens, collection sites and sample sizes of the accessions genotyped here are given in Additional file 1: Table A1. Seed germination, DNA extraction, and SSR genotyping were performed as detailed in Bracco et al. [19]. Primer sequences are available at the MaizeGDB website (http://www.maizegdb.org/data_center/locus).
SSR Data analysis
To put the SSR data in a continental context, our dataset was analysed in combination with those of Bracco et al. [14], Lia et al. [15] and Vigouroux et al. [8], yielding a final data matrix of 10 SSR and 1288 individuals. This compiled SSR data matrix consisted of 709 individuals belonging to 29 landraces from NEA and NWA, plus 579 individuals from the main genetic groups recognised by Vigouroux et al. [8], namely, Tropical Lowland (74 landraces, 187 individuals), Andean (89 landraces, 235 individuals), Highland Mexico (HM) (37 landraces, 87 individuals) and Northern and Southwestern US (US) (49 landraces, 70 individuals). Individuals identified as admixed in the original studies were excluded from the analysis.
Allele size equivalences between maize landraces from NWA and those of Matsuoka et al. [1] were determined by Lia et al. [15]. Individuals from NWA were then used as allele size markers to genotype NEA landraces. The SSR data of Matsuoka et al. [1] are included within the data set of Vigoroux et al. [8], thus allowing the integration of all sources of data. The compiled SSR matrix is given in Additional file 2: Table A2.
Model-based clustering
Genetic clusters were inferred using the Bayesian approach implemented in STRUCTURE 2.3.4 [20]. However, given the differences between the number of loci genotyped in NEA and NWA landraces and those used by Vigouroux et al. [8], we first checked the consistency of the groups defined by these authors against the reduction of the number of loci (from 84 to 10) using STRUCTURE. As a result, we retrieved the same four groups reported by them, thus validating the use of this subset of SSR in our integrative analysis (data not shown). The overall probability of identity and the probability of identity given the similarity between siblings were estimated across the complete SSR data matrix according to Waits et al. [21], using GeneAlEx 6 [22]. The discriminant power of the selected loci on the combined dataset was supported by a probability of identity of 4.7 × 10−14 and probability of identity among siblings of 4.7 × 10−5. Analyses were performed using K values from 1 to 10, 10 replicate runs per K value, a burn-in period length of 105 and a run length of 106. No prior information on the origin of individuals was used to define the clusters. All the analyses were run under the correlated allele frequency model [23]. The run showing the highest posterior probability was considered for each K value. A measure of the second order rate of change in the likelihood of K (ΔK) [24] was calculated using Structure Harvester [25]. An individual was assigned to one of the clusters on the basis of having a membership coefficient higher than an arbitrary cut-off value of 0.80 (Q > 0.80) (Additional file 2: Table A2). Results were plotted with DISTRUCT 1.1 software [26]. The model of correlated allele frequencies of Falush et al. [23] was applied to estimate the drift parameter (F) for the genetic groups inferred by STRUCTURE. This parameter measures the extent of a cluster’s differentiation relative to a hypothetical population of origin and has a direct interpretation as the amount of genetic drift to which the cluster has been subjected [27]. A graphical representation of the densities of F was obtained using the density function of R 3.0.2 (R Core Team 2014).
Discriminant Analysis of Principal Components (DAPC)
Population structure was also examined by the Discriminant Analysis of Principal Components (DAPC) [28], using the adegenet 1.3-1 package [29] implemented in R 2.13.2 (R Core Team 2014). The function dapc was executed by retaining 150 principal components, which accounted for 96 % of total genetic variation and five discriminant functions, and using the clusters identified by the K-means algorithm [30]. The number of clusters was assessed using the function find.clusters, with n.iter = 1000000 and n.start = 25, evaluating a range from 1 to 40.
Genetic diversity and differentiation
Diversity indices were calculated for the genetic clusters inferred by Bayesian analysis, using only individuals unequivocally assigned to their respective clusters (Q > 0.80). The mean number of alleles per locus (A), allelic richness (Rs) [31] and gene diversity (He) [32] were estimated using the software Fstat 2.9.3.2 [33]. The presence of group-specific alleles (hereafter referred to as private alleles) was examined for each group. Private allelic richness, i.e., the mean number of private alleles per locus as a function of standardised sample size, was computed with ADZE 1.0 [34]. File conversion was conducted with Convert 1.31 [35].
Differentiation between STRUCTURE groups was assessed by the Allele frequency divergence estimate provided by the software. Cluster analysis was carried out applying the Neighbour-joining algorithm [36] implemented in PHYLIP 3.6 [37]. Branch support was estimated by bootstrapping (1000 pseudoreplicates) with Powermarker 3.25 [38]. Resulting trees were visualised with FigTree 1.3.1 [39].
Adh2 microsatellite analysis
Individuals of 16 landraces from Northern Argentina (NWA + NEA) were sequenced for the Adh2 microsatellite region (Additional file 1: Table A3). The Adh2 gene segment employed for primer design (GenBank X02915) included part of the 5’ untranslated promoter region, the exons 1 and 2, and the intervening intron. PCR primers were designed with the Primer3 software 0.4.0 [40]: upstream, 5’-AAAATCCGAGCCTTTCTTCC-3’; downstream, 5’-CTACCTCCACCTCCTCGATG-3’. Cycling was carried out as in Freitas et al. [7], and the PCR products were checked and visualised as in Bracco et al. [14]. To determine whether individuals were homozygous or heterozygous, PCR products were subjected to native PAGE as described in Bracco et al. [19]. Bands from homozygous individuals (ca. 350 bp in length) were excised from agarose gels (2 % w/v) purified using an AccuPrep® Gel purification kit (BIONEER) and sequenced in an ABI 3130XL apparatus (Applied Biosystems). Chromatograms were edited with BioEdit 7.1.3.0 [41]. Both strands were assembled with the same program, and the microsatellite motifs were extracted for frequency calculation. All sequences were deposited in GenBank under accession numbers KU304471 to KU304494.
The data gathered were analysed together with Adh2 microsatellite alleles derived from the archaeological landraces examined by Goloubinoff et al. [42], Freitas et al. [7] and Grimaldo Giraldo [18], under the assumption that genealogical continuity exists between the archaeological genetic structure and that of extant landraces. In addition, we extracted the SSR alleles from the modern, primitive and historic landraces studied by Freitas et al. [7] and Grimaldo Giraldo [18], and from Adh2 sequences retrieved from GenBank corresponding to Brazilian and Bolivian landraces (EU119984-9 and EF070151-6, respectively). The compiled data set (N = 497) was divided into three allele types following Freitas et al. [7], i.e. (GA)n, (GA)nTA and GAAA(GA)n.
Statistical associations between allele types and geographic regions were evaluated with the likelihood ratio Chi-square test implemented in Infostat software 2013 [43].
Modelling of geographic distributions
Geographic distributions were modelled using the Maximum Entropy method [44], implemented in MaxEnt 3.3.3.a (https://www.cs.princeton.edu/~schapire/maxent/). This program uses presence-only data in the form of geo-referenced occurrence records, and a set of environmental variables to produce a model of the distribution range of the species under study. The raw and logistic outputs of MaxEnt are monotonically related and can be interpreted as an estimate of habitat suitability [45, 46].
We modelled the geographic distribution of the genetic groups retrieved from Bayesian clustering, considering only individuals with Q > 80 %. Following the guidelines provided in Scheldeman & van Zonneveld [47], a minimum of 20 occurrence points was set as threshold for modelling. Thus, a total of 317 spatially unique records were used, corresponding to the following clusters: 22 for NEA Flours, 92 for Tropical Lowland, 120 for Andean, and 87 for Highland Mexico and US (HM-US). The NEA Popcorns could not be subjected to analysis because of the low number of unique sampling localities (<5). Given the stability of the MaxEnt method in the face of correlated variables [45], and to facilitate comparisons with previous models of the geographic distribution of maize [16], we used the variable Altitude and the 19 bioclimatic variables available at WORLDCLIM 1.4 (http://www.worldclim.org/) [48] at 2.5 arc-min resolution. These 19 variables are derived from monthly temperature and precipitation records, reflecting seasonal and annual variations. Models were generated using 20,000 background points from all over the world. This implies that landrace could have dispersed anywhere in the globe, which is a reasonable assumption for a domesticated crop such as maize. It has been shown that the predictive ability of MaxEnt models is influenced by the choice of feature types and regularisation parameters, particularly for small sample sizes [49]. By default, the program uses class-specific regularisation parameters tuned on the basis of a large international dataset [45]. In addition, when few samples are available, MaxEnt restricts the model to simple feature classes (i.e., linear, quadratic, hinge) [49]. Because the number of occurrences for the different groups ranged from 22 to 118, we followed the recommendations of Elith et al. [45]; thus, we constructed the models under two different settings: 1) hinge features only and default regularisation parameters; and 2) default feature and regularisation parameters. Models were compared using sample-size corrected Akaike’s Information Criteria (AICc) with ENMtools 1.3 [50, 51].
Model performance was assessed using the area under the receiver operating characteristic curve (AUC) [52] for both training and testing data sets. Ten-fold cross-validation was used to estimate errors around fitted functions and predictive performance on held-out data, except for the NEA Flours in which 4-fold cross-validation was used due to the low number of spatially unique records. Variable importance was determined by measuring the contribution of each variable to model improvement during the training process (percent of contribution), and by jackknife tests implemented in MaxEnt. Model predictions were visualised from MaxEnt logistic outputs using DIVA-GIS 7.2.1.1. [53]. Pairwise comparisons of model predictions were carried out by calculating the I statistic of Warren et al. [54] and Schoener’s D [55] as implemented in ENMtools 1.3. To evaluate whether the models generated for each genetic cluster are more different than expected if they were drawn from the same underlying distribution, we performed the niche identity test [54] included in ENMtools 1.3 with 100 replicates.
All the maps used in this study were freely available at http://www.diva-gis.org/Data.