Biodiversity is restricted by genome combinatorics?

Some friends and I are interested in opinions on the following:


The maximum number of species must be limited by the maximum combinatorial/permutational space that can be occupied by DNA. Thus if there is a maximum physical genome size this is what will determine the maximum number of species that can possibly exist.


E.g., say the maximum number of DNA base pairs able to fit in a genome was $3$, and each base pair can be one of $\{A,G,T,C\}$. Then there are $4^3 = 64$ possible genomes. Extrapolating to genome sizes of $x$ base pairs, there are $4^x$ combinations.


Would it be possible to claim that the underlying "blueprint" that codes for living diversity sets the absolute maximum for the total "diversity space"?

**Does it make sense to define the total number of species life can achieve with the simple function:

$S < 4^x$, where $x$ is the maximum genome size measured in DNA base pairs?**

Notable Comments

@Shigeta: for $S<4^x$ the combinations involved quickly dwarf the number of atoms in the universe, at ~$10^{80}$.

@rg255: Even with the simplification $S<20^{x/3}$ there are $1.024 \times 10^{13}$ possible combinations with just 10 codons, many more than there are likely to be species in the world.

Yes, we can say the number of species is limited as you conjecture. However, quick estimation shows that the limitation has no apparent usefulness:

A reasonable estimate of the largest known genome is 150 Gb (150,000,000,000 or 1.5e11 nucleobases). The limit would be 4 raised to that power. That limit is so high that it is too large for most calculators to compute. For instance, one calculator fails to compute 4 raised to 1e11, its maximum being about 4 raised to 1e9. The result of 4 to the 1.0e9 power is about 1.0e602059991, or 10 raised to the 602059991st power.
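The order of magnitude of these bounds can be checked without forming the full number, because $\log_{10}(4^x) = x \log_{10} 4$. The following is a minimal Python sketch of that arithmetic; the function name is only illustrative.

```python
import math


def decimal_exponent_of_power(base: int, exponent: int) -> int:
    """Return floor(log10(base**exponent)) without computing the power itself."""
    return math.floor(exponent * math.log10(base))


# 4 raised to 1e9: the case quoted above
print(decimal_exponent_of_power(4, 10**9))

# 4 raised to 1.5e11: the bound for a 150 Gb genome
print(decimal_exponent_of_power(4, 150_000_000_000))
```

The first call reproduces the exponent 602059991 quoted above; the second shows that the bound for the full 1.5e11-base genome has an exponent of roughly 9.0e10.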

That rough result, 1.0e602059991, is so enormous that it is exponentially greater than the number of atoms in the universe (which is less than 1.0e100). Hence, assuming the definition of species requires an organism to use at least one atom for its body, there is no consequence to saying the number of species must be less than this number.

The number of possible combinations of nucleotides is so outrageously large that it does not constrain the number of unique individual organisms or species.

That's an interesting conjecture about the total amount of genetic variation that is possible. I would modify a few things. First, since the size of genomes varies greatly among organisms (from 0.5 Mb to 15 Mb just for prokaryotes), there should be a fifth character in your set, representing the absence of a nucleotide.

There are also issues of whether various combinations are equivalent. For instance, bacterial genomes are often circular, so if we could convert one genome into another simply by rotating it, we would consider those genomes identical. For example, if the genomes each had 99 As and 1 T, it would not be meaningful to say that they are different just because of the location of the T. I think this would require use of the multinomial coefficient to count the number of identical variants.
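One way to count genomes that differ only by rotation is Burnside's lemma (the "necklace" count), which is a different tool from the multinomial coefficient suggested above. The sketch below assumes only rotations are treated as equivalent (strand and reverse-complement symmetry are ignored), and the function names are illustrative.

```python
from math import gcd


def count_circular_sequences(length: int, alphabet_size: int = 4) -> int:
    """Count sequences of a given length, treating rotations as identical.

    Burnside's lemma: average, over all rotations, the number of sequences
    each rotation leaves unchanged. A rotation by `shift` positions fixes
    alphabet_size ** gcd(shift, length) sequences.
    """
    total = sum(alphabet_size ** gcd(shift, length) for shift in range(length))
    return total // length


def distinct_rotations(sequence: str) -> int:
    """Number of different linear strings obtained by rotating one circular genome."""
    return len({sequence[i:] + sequence[:i] for i in range(len(sequence))})


print(count_circular_sequences(3))          # circular genomes of length 3
print(distinct_rotations("A" * 99 + "T"))   # linear spellings of the 99 A + 1 T genome
```

Because each circular genome has at most as many linear spellings as it has bases, this correction shrinks $4^x$ by a factor of at most $x$, so the bound stays astronomically large.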

Regarding your main thesis, your use of the term "species" has no relation to how that term is used among biologists. Biological species include genetic variation, so each of them would include many of your species. Also, one criterion for identifying species is the clustering of these sequences (and the absence of other, intermediate sequences). This implies both that many possible DNA sequences simply cannot produce a viable organism and that competition among similar genotypes is an important aspect of species identity.

I'm going to define a species according to the biological species concept, probably the most widely "accepted" species concept where a species is a group of individuals that reproduce, or have the potential to do so. Using a simplified example I will show you that gene*environment interactions affecting phenotype can allow separate species to occur despite being genetically identical.

Imagine a very short section of DNA which affects seminal fluid proteins in a fruit fly. It is just 10 bases long. The first three triplets of bases code for a protein constructed of three amino acids which affects the female's behavior upon receipt of the male ejaculate. The final base regulates the expression of the first codon and is sensitive to developmental environmental factors (let us say nutrient richness). When the male fly develops in a nutrient-rich background, the section of DNA produces all three codons in equal amounts to construct a protein, and that protein stimulates the female to release eggs for fertilization. If we take a genetically identical male reared in a nutrient-poor environment, then the regulatory base increases the production of the first codon's amino acid. This makes a different protein structure which no longer stimulates the female into releasing eggs. Therefore the males from nutrient-poor backgrounds cannot reproduce with the females from a nutrient-rich background, and these two genetically identical populations are different species.

Thus I would say, no, the theoretical maximum number of species is not capped by the length of the DNA. However, given that just the potential number of bases (current highest estimates recorded are 150,000,000,000) far far far exceeds the number of species in the world, we can say that the number of base combinations is not the limiting factor on the biodiversity we see. That is down to evolution (selective and neutral processes). Phenotypes determine the ability of two individuals to reproduce, phenotypes are the result of more than just genetics:

$\text{phenotype} = \text{genotype} + \text{environment} + \text{genotype} \times \text{environment}$

Further, as @mgkrebbs has already stated, the number of possible species (given by $4^x$ when $x$ is 150,000,000,000) is not only far more than the number of species that do exist, but also far more than the number of atoms in the universe. Assuming each species requires at least one atom, the number of atoms available will halt increasing numbers of species before the number of possible base combinations does.

Well… this is true in the same way that books are constrained by letters. We could have more letters and create more combinations with the same length of DNA. But, as with books, there are already so many combinations.

Insights into the Evolution of the New World Diploid Cottons (Gossypium, Subgenus Houzingenia) Based on Genome Sequencing

We employed phylogenomic methods to study molecular evolutionary processes and phylogeny in the geographically widely dispersed New World diploid cottons (Gossypium, subg. Houzingenia). Whole genome resequencing data (average of 33× genomic coverage) were generated to reassess the phylogenetic history of the subgenus and provide a temporal framework for its diversification. Phylogenetic analyses indicate that the subgenus likely originated following transoceanic dispersal from Africa about 6.6 Ma, but that nearly all of the biodiversity evolved following rapid diversification in the mid-Pleistocene (0.5-2.0 Ma), with multiple long-distance dispersals required to account for range expansion to Arizona, the Galapagos Islands, and Peru. Comparative analyses of cpDNA versus nuclear data indicate that this history was accompanied by several clear cases of interspecific introgression. Repetitive DNAs contribute roughly half of the total 880 Mb genome, but most transposable element families are relatively old and stable among species. In the genic fraction, pairwise synonymous mutation rates average 1% per Myr, with nonsynonymous changes being about seven times less frequent. Over 1.1 million indels were detected and phylogenetically polarized, revealing a 2-fold bias toward deletions over small insertions. We suggest that this genome down-sizing bias counteracts genome size growth by TE amplification and insertions, and helps explain the relatively small genomes that are restricted to this subgenus. Compared with the rate of nucleotide substitution, the rate of indel occurrence is much lower, averaging about 17 nucleotide substitutions per indel event.


[Figure: Approximate geographic ranges of Houzingenia species. D1 = G. thurberi, D2-1 = …]

[Figure: Nuclear phylogeny of Houzingenia without (left) and including (right) the introgressed accession of …]

[Figure: Comparison of phylogeny from reference-guided assembly of chloroplast-derived reads in Houzingenia (left ML-derived …]

[Figure: Mean transposable element content for each category in each species of Houzingenia, …]

[Figure: Rate of gene gain or loss, per million years. Boxplot distributions show distribution …]


Haplotypes are combinations of alleles from multiple genetic loci on the same chromosome that are inherited together; the term haplotype can encompass as few as two loci or refer to a whole chromosome (that is, chromosome-scale haplotype). For diploid genomes, a given length of chromosomal DNA will have two haplotypes, one inherited from each parent, whereas several haplotypes exist for any given chromosomal region at the population level or for polyploid genomes. DNA microarrays and short-read sequencing can determine the collection of alleles at genetic loci (that is, genotypes) but provide no information at the level of haplotypes: whether alleles are co-located on the same copy of a chromosome, or which of the parental chromosomes harbors a particular allele. Hence, computational reconstruction of haplotypes using upcoming sequencing technologies, by either read mapping to a reference genome or de novo assembly, is required.
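The difference between genotypes and haplotypes can be made concrete with two heterozygous sites. In the toy Python sketch below (allele letters and names are invented for illustration), a cis and a trans arrangement of the same alleles collapse to identical unphased genotypes.

```python
def unphased_genotypes(hap1: str, hap2: str) -> list:
    """Collapse two haplotypes into per-site unordered allele pairs.

    This mimics what an array or unphased short-read call reports:
    which alleles are present at each site, not which copy carries them.
    """
    return [tuple(sorted(pair)) for pair in zip(hap1, hap2)]


# Two sites; site 1 has alleles A/G, site 2 has alleles T/C.
cis = ("AT", "GC")    # A and T travel together on one chromosome copy
trans = ("AC", "GT")  # A and T sit on different chromosome copies

print(unphased_genotypes(*cis))
print(unphased_genotypes(*trans))
print(unphased_genotypes(*cis) == unphased_genotypes(*trans))
```

Both arrangements give the same per-site genotypes, so phase has to be recovered from other evidence, such as reads that span both sites.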

Haplotype information is fundamental for medical and population genetics [1, 2], where it is used to study genetic variation associated with human diseases [3, 4]. Traditionally, the association of specific SNP loci with diseases was studied with respect to a linear reference sequence; for example, two SNPs, rs9494885 and rs2230926, in the TNFAIP3 gene (with respect to the GRCh37 reference) have a known correlation with scleritis [5]. However, individual haplotypes (or their collection in the form of a pan-genome graph [6], which represents the genetic variations from populations and medical samples) can help to discover highly complex variations such as nested structural variation, inversions, and other complex rearrangements (reviewed in [7]) and to access the full spectrum of rare inherited variants and de novo mutations [8]. For example, haplotype information is helpful for detecting a rare case of keratitis-ichthyosis-deafness syndrome that exhibits a spontaneous correction of a pathogenic mutation by another mutation on the whole-chromosome scale [9]. Additionally, the phenomenon of compound heterozygosity on homologous chromosomes is responsible for recessive Mendelian disorders [4]. Chromosome-scale haplotypes also have functional relevance: the distribution of cis- and trans-acting variants between homologous chromosomes, that is, the phase of variants, can affect gene expression, and chromosome-scale haplotypes help study interactions between variants in regulatory elements (long-range promoter-enhancer interactions) [4]. Another highly relevant example is understanding the context of aneuploidy (chromosome loss or gain) in cancer genomes; for example, studying large copy number gain in centromere 17 associated with chromosomal instability in breast cancer [10] also requires recent haplotyping approaches. The inference of whole-chromosome haplotypes has clinical relevance: having both variants on the same allele (cis) can lead to a specific (for example, super-responder) phenotype, whereas having those variants on separate alleles (trans) does not. Haplotypes also play an important role in understanding the interplay of evolutionary processes that shape genetic variation, such as recombination, gene conversion, mutation, and selection. For example, modification of plant breeding strategies based on evolutionary processes identified through haplotype reconstruction can result in agricultural improvements [11]. Another highly relevant application occurs in the analysis of viral infections [12], where haplotype reconstruction can help to identify drug resistance and virulence factors and aid treatment decisions [13, 14].

Despite recent advances, sequencing technologies are limited in their ability to cover repetitive genomic regions to produce chromosome-scale haplotypes. Therefore, local (short-range) and genome-wide (long-range) information must be computationally integrated to assemble chromosome-scale haplotypes [15]. The integrative algorithms used for reconstruction must be tuned for the specific genome characteristics of a species, such as genome size, number of haplotypes, and repeat or haplotype variation. Many large-scale sequencing initiatives, such as the Vertebrate Genomes Project [16], the DNA Zoo project, Darwin Tree of Life, the Human Microbiome Project, and the Human Pangenome Project, have begun to leverage diverse recent sequencing data types (Table 1) to reconstruct haplotypes for various species. These projects have designed and integrated bioinformatic pipelines in a common platform for large-scale genome analyses [24].

In this Review, we discuss the bioinformatic methods (reference-based, de novo, and strain-resolved metagenome assembly) used to reconstruct haplotypes in diploids, polyploids, and microbial communities. We present the strengths and weaknesses of these methods, alongside examples of their biological applications. Finally, we conclude with challenges and future directions, with an emphasis on both the algorithmic and technological advances required to achieve routine high-quality haplotypes for further biological discoveries.

Biodiversity and biogeography of phages in modern stromatolites and thrombolites

Viruses, and more particularly phages (viruses that infect bacteria), represent one of the most abundant living entities in aquatic and terrestrial environments. The biogeography of phages has only recently been investigated and so far reveals a cosmopolitan distribution of phage genetic material (or genotypes). Here we address this cosmopolitan distribution through the analysis of phage communities in modern microbialites, the living representatives of one of the most ancient life forms on Earth. On the basis of a comparative metagenomic analysis of viral communities associated with marine (Highborne Cay, Bahamas) and freshwater (Pozas Azules II and Rio Mesquites, Mexico) microbialites, we show that some phage genotypes are geographically restricted. The high percentage of unknown sequences recovered from the three metagenomes (>97%), the low percentage similarities with sequences from other environmental viral (n = 42) and microbial (n = 36) metagenomes, and the absence of viral genotypes shared among microbialites indicate that viruses are genetically unique in these environments. Identifiable sequences in the Highborne Cay metagenome were dominated by single-stranded DNA microphages that were not detected in any other samples examined, including sea water, fresh water, sediment, terrestrial, extreme, metazoan-associated and marine microbial mats. Finally, a marine signature was present in the phage community of the Pozas Azules II microbialites, even though this environment has not been in contact with the ocean for tens of millions of years. Taken together, these results prove that viruses in modern microbialites display biogeographical variability and suggest that they may be derived from an ancient community.

Fish genome projects

The genome of the pufferfish Takifugu rubripes was, after the human genome, the second vertebrate genome to be sequenced (Aparicio et al, 2002). Its sequencing through a whole-genome shotgun (WGS) strategy allowed the first genome-wide comparison between two vertebrate species. Pufferfish and mammals have approximately the same number of genes, but the Torafugu genome is 8–9 times smaller than the human genome. This is principally due to the fact that nonexonic regions (intronic and intergenic sequences) are generally – but not always – much shorter in the pufferfish than in humans, because of a relative paucity of repetitive sequences. The third assembly of the Torafugu genome is available and consists of approximately 8000 genomic scaffolds covering approximately 95% of the nonrepetitive fraction of the genome. Genome compaction is also observed in the green spotted pufferfish Tetraodon nigroviridis, the genome of which has been almost completely sequenced too (Jaillon et al, 2004). Sequence data have been assembled in approximately 50 000 contigs covering 312 Mbp of the 385 Mbp genome. These contigs have been further linked, particularly through fluorescent in situ hybridization of genomic clones on Tetraodon chromosomes (Jaillon et al, 2004). The genome sequence of both smooth pufferfishes is useful for the identification of coding and regulatory sequences in humans and other vertebrates through sequence comparison, since sequences with functions should be more conserved than ‘useless’ sequences. The sequence of T. nigroviridis has been used to predict the number of genes in the human genome (Roest Crollius et al, 2000). Importantly, hundreds of putative novel human genes have been discovered by comparing the pufferfish and human genome sequences (Aparicio et al, 2002; Jaillon et al, 2004). In addition, analysis of the T. nigroviridis genome revealed the basic structure of the ancestral bony vertebrate genome, which was composed of 12 chromosomes, and allowed the reconstruction of many of the chromosome rearrangements which led to the modern human karyotype (Jaillon et al, 2004). Finally, Takifugu/Tetraodon comparisons might provide new information about differences between relatively closely related species, in a manner similar to the comparison between rat and mouse.

The sequencing of the genomes of both zebrafish and medaka is close to completion. Sequence reads and preliminary contigs are publicly available, but the final analyses have not been published to date. The sequencing of the zebrafish genome was initiated in 2001 following two strategies: clone mapping and sequencing from genomic clone libraries, and WGS sequencing with subsequent assembly (Table 1). The fourth WGS assembly, Zv4, was released in July 2004. This assembly consists of approximately 21 000 contigs covering 1500 Mbp of the zebrafish genome. The sequencing of the genome of the medaka Oryzias latipes, mainly based on a whole-genome shotgun strategy, was started in 2002. A nine-fold coverage of the genome had already been obtained in May 2004, and a first WGS assembly was released in July 2004 (about 116 000 sequences covering 840 Mbp; see Naruse et al (2004a) for additional information and useful www resources). The availability and comparison of the zebrafish and medaka genome drafts will allow linking mutant phenotypes to gene functions and shedding a new ‘evo-devo’ light on the fish and vertebrate lineages.

Finally, for other fishes (Table 1), important expressed sequence tag (EST) resources are already available. There is no doubt that some of these species will be subjected to genome sequencing in the near future due to their economic and/or academic relevance, and several proposals have already been submitted to funding agencies. On the subgenomic level, a project dealing with the physical mapping and sequencing of the sex chromosomes of the platyfish Xiphophorus maculatus has been started (Froschauer et al, 2002).


In conclusion, our study shows that generating high-quality plant genome assemblies is feasible with relatively modest amounts of resources and data. Using nanopore sequencing, we were able to produce contiguous, chromosome-level genome assemblies for cultivars in a rice variety group that contains economically and culturally important varieties. Our reference genome sequences have the potential to be important genomic resources for identifying single-nucleotide polymorphisms and larger structural variations that are unique to circum-basmati rice. Analyzing de novo genome assemblies for a larger sample of Asian rice will be important for uncovering and studying hidden population genomic variation too complex to study with only short-read sequencing technology.


Population creation

The 16 NDM founders were chosen to capture the greatest genetic diversity using PowerMarker genetic analysis software [58]. They were chosen from 94 NW European wheats released in the UK that were genotyped with 546 DArT and 61 SSR markers; the full panel also included 96 US and 50 Australian varieties, which were excluded based on STRUCTURE analysis [59]. The founder selection process was run iteratively, with the varieties “Robigus” and “Soissons” first fixed for inclusion to coincide with the founders of the 8-founder NIAB Elite MAGIC population [14]. Then the most frequently selected additional 4, then 9, and 12 varieties were fixed in multiple iterative selection runs, and finally the most frequently selected 16 were chosen. Seed for the founding varieties was sourced from the John Innes Centre Germplasm Resource Unit.

These founders were intercrossed in a balanced funnel crossing scheme, based on a Latin square field trial design, over four generations to create 16-way crosses with all the founders equally represented in their pedigree. First, all 120 possible 2-way crosses between founders were made in a half diallel scheme. Two-way plants were then crossed in 60 4-way combinations. Multiple plants from each family were used in crossing from 2-way onwards, in order to maintain maximum founder allelic diversity within the population. 30 crossing combinations were made between 4-way plants to create 8-way crosses, making between five and eight replicate crosses per combination using different plants. These were intercrossed in 15 combinations to create balanced 16-way crosses, with each combination replicated between six and fifteen times using different 8-way plants. This resulted in 174 16-way plants from which one to sixteen inbred lines per 16-way family were made through single seed descent (SSD). In total, 596 RILs were advanced to the F7 stage, when seed for phenotyping was multiplied in 1 × 1 m nursery plots. Additional file 1: Table S9 gives details of the number of plants involved in each cross, and Fig. 2a shows the pedigree for the 504 RILs used in our main analysis only.
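The cross counts at each stage of the funnel can be reproduced with a few lines of Python. This sketch assumes that each later stage pairs every cross from the previous stage exactly once, which is consistent with the reported numbers but is not spelled out above; the variable names are illustrative.

```python
from math import comb

FOUNDERS = 16

# Half diallel: every unordered pair of founders is crossed once.
two_way = comb(FOUNDERS, 2)

# Assumed: each subsequent stage pairs up the crosses of the previous stage,
# so the number of combinations halves each generation.
four_way = two_way // 2
eight_way = four_way // 2
sixteen_way = eight_way // 2

print(two_way, four_way, eight_way, sixteen_way)
```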


RILs from the population were phenotyped in field trials over multiple environments near Cambridge, UK. Yield trials were conducted in the growing seasons 2016–2017 and 2017–2018, hereafter year 1 and year 2 (phenotype suffix codes _1 and _2). Information on location, soil type, key dates, and inputs for both years is given in Additional file 1: Table S4. Yield plot dimensions were 2 m wide and 4 m long, and plots were sown at a density aiming to achieve 300 plants m⁻². In year 1, 596 lines were included in two replicates, the sixteen founders in four replicates, and the commercial control variety “KWS Santiago” in 24 replicates, in a randomized nested block design with 16 main blocks of 80 adjacent plots, which comprised each row in the trial, and eight sub-blocks of ten plots nested within each main block. In year 2 trials, 596 lines and the 16 founders were included in two and four replicates respectively, but three control varieties (“KWS Santiago,” “Skyfall,” and “Shabras”) were all included in four replicates. Plots were again randomized in a nested block design but including additional plots making a larger trial, consisting of 20 main blocks of 115 adjacent plots, which comprised each row, and 23 sub-blocks of five plots nested within each main block.
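As a quick consistency check on the year 1 yield trial layout described above, the number of entries can be compared with the number of plots implied by the block structure. This is only arithmetic on the figures quoted in the text; the variable names are illustrative.

```python
# Year 1 entries: RILs, founders, and the control variety, each with its replicate count.
year1_entries = 596 * 2 + 16 * 4 + 24

# Year 1 layout: 16 main blocks of 80 plots, each holding 8 sub-blocks of 10 plots.
year1_layout = 16 * 80
plots_per_main_block_year1 = 8 * 10

# Year 2 layout: each main block of 115 plots holds 23 sub-blocks of 5 plots.
plots_per_main_block_year2 = 23 * 5

print(year1_entries, year1_layout)
print(plots_per_main_block_year1, plots_per_main_block_year2)
```

For year 1 the entries fill the layout exactly; for year 2 the text notes that additional plots were included, so the same equality is not expected there.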

Disease observation trials (DOTs) were conducted near Cambridge, UK, in the same years as the yield trials to assess resistance to crop diseases. These plots consisted of two 1.2-m length rows, treated with no fungicide but otherwise standard inputs. Due to local conditions, DOTs had a high natural likelihood of yellow rust infection (Puccinia striiformis f.sp. tritici) and were not experimentally challenged with pathogens. In both years, DOTs included two replicates of 596 RILs, four replicates of the 16 founders, and 68 additional replicates of the susceptible founder “Robigus.” Trial designs included two main blocks of 660 plots, with 11 sub-blocks of 60 plots nested within main blocks. All trial designs for both yield and disease observation trials were made using the package “blocksdesign” in R. Phenotyping of some traits was also carried out in 1 × 1 m seed nursery plots where lines were not replicated but the founders were in three replicates and randomized across the nurseries (phenotype code _N).

A wide range of traits were phenotyped across the field trials, including traits for crop developmental morphology, phenology, plant stature and canopy architecture, yield and yield components such as spike and grain morphology, disease resistance, pigmentation, plant glaucosity, indications of stress response, lodging, grain protein content, and vernalization requirement. A summary of these traits and abbreviations is presented in Table 1, and details of phenotyping methods are listed in Additional file 1: Table S5.

Trials analysis

Adjusted phenotype values were calculated as best linear unbiased estimates (BLUEs) for each trait separately for each trial year using mixed effects models with ASReml [60]. Genotype was considered a fixed effect, while experimental blocking structure and other covariates such as harvesting day, where relevant, were included as random effects. Spatial models including first- and second-order auto-regressive spatial models were also used. For model simplification, models with all possible combinations of random effect terms and spatial terms for row and column were run, and the best-fitting model was chosen based on the Akaike information criterion (AIC). Model residuals were visually checked for normality and equal variance. The BLUEs for all phenotypes for the 16 founders and for the 504 RILs used in our main analysis (see below) are provided in Additional file 1: Table S6. We used symmetrical Theil-Sen regression (implemented in the “deming” R package) after phenotype normalization to characterize the relationship between protein content (GPC) and yield (GY). The protein-yield deviation (PYD) phenotype is calculated as the Euclidean distance from this regression line.
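To illustrate the idea behind the protein-yield deviation, the sketch below fits a line by ordinary Theil-Sen regression (median of pairwise slopes) and measures each point's perpendicular distance from it. This is not the symmetrical variant from the R package named above, and the function names and the sign convention (positive above the line) are assumptions made for illustration.

```python
import math
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Ordinary Theil-Sen fit: slope is the median of all pairwise slopes."""
    points = list(zip(xs, ys))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1  # skip vertical pairs, which have no finite slope
    ]
    slope = median(slopes)
    intercept = median([y - slope * x for x, y in points])
    return slope, intercept


def perpendicular_deviation(x, y, slope, intercept):
    """Signed Euclidean (perpendicular) distance of a point from the line."""
    return (y - (slope * x + intercept)) / math.sqrt(1 + slope ** 2)


# Toy normalized values, not data from the study.
slope, intercept = theil_sen([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
print(perpendicular_deviation(0, 1 + math.sqrt(5), slope, intercept))
```

Taking the absolute value of the result gives an unsigned distance if the direction of the deviation is not needed.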

Genotyping array data

All DNA extraction was performed using the Qiagen DNeasy Plant Kit on leaf tissue samples taken from emerging leaves of seedlings. First, genotyping was performed at the Bristol Genomics Facility using the Axiom 35k wheat breeders’ array [7]. Initially, two 384-sample plates were genotyped. Seed from the plants used as founders were genotyped on each plate (32 samples) along with extra seed from the original varietal seed stock used (28 samples) and seed from founders propagated to 2017 (16 samples). In addition, 596 RILs were genotyped after 5 generations of selfing (F6). To account for genotyping failures and to ensure the accuracy of sample labels, 150 RILs were re-genotyped in the F7 generation along with a further replicate of each founder.

Genotype calling was performed using the Affymetrix Power Tools (v1.19) and SNPolisher R packages, following the recommended Axiom analysis pipeline. All samples except two-way crosses were given the standard inbreeding penalty, 4, which penalizes calling heterozygous genotypes. Four samples failed the “dish quality control” threshold (0.82) and a further 28 samples with call rates below 97% were excluded. Marker classifications were performed using “ps-classification” and “ps-classification-Supplementary” functions with options --species-type polyploid --hom-ro false. All calls were adjusted using the standard 0.025 confidence threshold using the Ps_CallAdjust function.

Samples were compared to one another using the 14,935 markers classified as “PolyHighResolution” only. Overall, 46 RIL pairs were found to be > 92% similar (mean 98.5% genotype similarity), where all other comparisons between MAGIC lines were, at most, 84% similar (mean 67.8%). These apparently duplicated genotypes could indicate genotyping, labelling, or propagation errors so only one RIL from each pair was used for sequencing (550 RILs). To ensure pedigree accuracy, we chose the RIL in each pair that was genotypically most similar to other RILs derived from the same 16-way cross (i.e., in the same family).
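The pairwise comparison described above can be sketched as the fraction of matching calls between two lines. How missing calls were handled is not stated, so skipping any marker with a missing call in either line is an assumption here, as are the function name and the toy data.

```python
def genotype_similarity(calls_a, calls_b):
    """Fraction of markers with identical calls, ignoring markers missing in either line."""
    compared = [
        (a, b) for a, b in zip(calls_a, calls_b)
        if a is not None and b is not None
    ]
    if not compared:
        return float("nan")
    return sum(a == b for a, b in compared) / len(compared)


DUPLICATE_THRESHOLD = 0.92  # pairs above this were treated as apparent duplicates

# Toy calls coded as 0/2 for the two homozygous classes; None marks a missing call.
line_1 = [0, 2, 2, None, 0]
line_2 = [0, 2, 0, 2, 0]
similarity = genotype_similarity(line_1, line_2)
print(similarity, similarity > DUPLICATE_THRESHOLD)
```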

Sequencing data

For whole-genome sequencing, DNA was extracted from 550 RILs at the F7 generation. DNA for RILs that failed quality control was extracted again at the F8 generation (n = 50). Sequencing and library preparation were performed at Novogene, where libraries were generated from 1.0 μg DNA per sample using the NEBNext DNA Library Prep Kit. Sequencing was performed on a NovaSeq 6000 instrument (Illumina) to get at least 6 Gb of raw sequence data (2 × 150 bp paired end reads) per sample. One founder (Holdfast) was sequenced to 15.8× coverage using the same method.

The other founders were sequenced after capture using two recently designed probe sets targeting promoter and genic regions, respectively [19]. Capture was performed at the Earlham Institute following the SeqCap EZ Library SR v5.1 protocol (Roche NimbleGen Inc., Madison, WI, USA) with 1 μg of genomic DNA sheared to 300 bp [19]. Four captures were performed using 8 samples per set (2× promoter captures and 2× genic captures). Samples for the founder Stetson were included on all four capture experiments, so roughly double the sequence data was obtained for this founder (Additional file 1: Table S1). Sequencing with 2 × 150 bp reads was performed at the Earlham Institute on a NovaSeq 6000 instrument (Illumina) with 16 promoter capture libraries on one lane and 16 genic capture libraries on another lane.

Variant calls and imputation

All reads were aligned to the bread wheat reference genome from cv. Chinese Spring (RefSeq v1.0) [3] using bwa-mem (version 0.7.12) [61] and sorted using samtools (version 1.3.1) [62], which was also used to calculate coverage. For compatibility with the bam file format, we split each chromosome in the reference genome at the halfway point before alignment. We called variants from the founder sequences within the high-confidence gene, promoter and 5′ UTR regions targeted by the capture probes [19] using GATK [63] HaplotypeCaller and GenotypeGVCFs (options --interval-padding 100 --minimum-mapping-quality 30). We used vcftools (version 0.1.15) to include only biallelic single-nucleotide polymorphisms (SNPs) with average coverage depth between 5 and 60 (all per-sample coverages between 2 and 120) and no missing calls. We also filtered with bcftools (version 1.2) [64] using standard quality control options --exclude "QD <2 || FS >60.0 || MQRankSum<-12.5 || ReadPosRankSum<-8.0 || SOR >3.0 || MQ <40". This left 1.78 M SNPs, of which we only use the 1.13 M sites with no heterozygous calls (--genotype ^het option) for our main analyses.
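The logic of the quality-control exclusion expression above can be mirrored in plain Python to show which sites it removes. This is an illustrative sketch operating on a dictionary of annotation values, not a replacement for the bcftools command; the function name is invented, and treating a missing annotation as not triggering its condition is an assumption.

```python
# Each annotation is paired with the condition under which a site is excluded.
EXCLUSION_RULES = {
    "QD": lambda v: v < 2,
    "FS": lambda v: v > 60.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
    "SOR": lambda v: v > 3.0,
    "MQ": lambda v: v < 40,
}


def is_excluded(annotations: dict) -> bool:
    """Return True if any exclusion condition holds (the '||' in the expression)."""
    return any(
        annotations.get(name) is not None and rule(annotations[name])
        for name, rule in EXCLUSION_RULES.items()
    )


# Toy annotation values, not taken from the study.
good_site = {"QD": 25.1, "FS": 1.2, "MQRankSum": 0.0,
             "ReadPosRankSum": 0.3, "SOR": 0.8, "MQ": 60.0}
strand_biased_site = dict(good_site, FS=75.0)

print(is_excluded(good_site), is_excluded(strand_biased_site))
```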

We first called genotypes in the RILs at these 1.13 M SNP sites directly using GATK HaplotypeCaller in GENOTYPE_GIVEN_ALLELES mode, using the same options as above. We assessed the concordance between array genotypes and these direct calls (AD) at overlapping sites (see below). For 10 RILs, the directly called sequencing variants most closely matched genotyping array data for a different line than expected. These were excluded because the source of the discrepancy (sequence data or array data) cannot be established. The concordance between our genotyping array data and direct calls (AD) was below 95% for a further 36 RILs, which were excluded (mean AD 84.7% for removed lines), leaving 504 RILs. We estimated heterozygosity in these 504 RILs using only genotypes called from at least four reads. Of 2.6 M such genotype calls, only 0.67% were called as heterozygotes.

We imputed genotypes at the 1.13 M SNP sites using the alignments and STITCH software (version 1.5.7) [26]. Because alignments were to a reference genome with chromosomes split in half, we first ran STITCH with the generateInputOnly option, and then joined the input files for each chromosome half before imputation. For all runs, we used the parameters nGen = 3, minRate = 0.001, bqFilter = 30, method = “diploid-inbred” and then filtered all sites with an info score below 0.4, minor allele frequency below 2.5%, or missingness above 10%. For our main analysis, we used the genotype calls in the founders as a reference panel and outputted the estimated ancestry dosages of each founder at each position in each RIL using the outputHaplotypeProbabilities and output_haplotype_dosages options. When using the founders as a reference panel, we removed options that estimate and update the haplotypes in the population (shuffleHaplotypeIterations, reference_shuffleHaplotypeIterations, refillIterations). To test accuracy when reference panels are not available, we re-ran imputation without providing the founder genotypes, using 40 iterations to estimate the haplotype space and recombination mosaics. We also used the downsampleFraction option to randomly sample a fraction of alignments with/without using the founder reference panel. Finally, we tested imputation accuracy (without a reference panel), when fewer than sixteen haplotypes were assumed to segregate in the population by varying the K parameter (Additional file 2: Figure S3).

Genotype comparisons

For comparison against the sequencing dataset, we used all genotyping array markers. Replicates of founders and MAGIC RILs (where available) were used to make a consensus call where the most common genotype across replicates was taken as the consensus and only retained when more than 50% of the non-missing calls were in agreement. In addition, markers where one homozygous genotype was missing from all RILs were converted such that all heterozygous calls were assumed to be in the missing homozygous class. The failure to detect a homozygous class is likely to be a result of polyploidy, which can reduce differentiation between the three genotype classes and make them hard to distinguish. Finally, to get genome coordinates for the genotyped markers, BLASTn v2.2.30 [25] was used to compare the 75-bp probe sequences [7] against the reference genome [3]. When matching the SNP array data with the sequenced SNPs, array sites were excluded if they had missing or heterozygous founder calls or if the genotypes and targeted SNP alleles did not match the founder sequence data. We found 5877 sites that overlapped between the genotyping array data and the sequencing data (Additional file 1: Table S2).
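The replicate consensus rule can be sketched in a few lines of Python. This is an illustrative reading of the rule as worded above (most common genotype, kept only when it accounts for more than 50% of non-missing calls); the function name and the use of `None` for missing calls are assumptions.

```python
from collections import Counter


def consensus_call(replicate_calls):
    """Consensus genotype across replicates, or None if there is no clear majority.

    `replicate_calls` is a list of genotype strings, with None for missing calls.
    """
    called = [g for g in replicate_calls if g is not None]
    if not called:
        return None
    genotype, count = Counter(called).most_common(1)[0]
    # Retain only when strictly more than half of the non-missing calls agree.
    return genotype if count / len(called) > 0.5 else None
```

For example, two replicates that disagree give no consensus, because neither genotype exceeds 50% of the calls.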

To compare against global wheat diversity, we called founder genotypes at 113,457 genotyping array sites that were polymorphic among 4506 diverse global wheat accessions [8]. We called genotypes from alignments with mapping quality scores of at least 30 using GATK HaplotypeCaller in EMIT_ALL_SITES mode with the --emit-ref-confidence BP_RESOLUTION option, providing a bed file of the 113,139 genotyping array sites [8]. We only considered sites where genotypes could be called in all 16 founders (n = 56,063). We used genotyping array calls for cv. Chinese Spring to determine reference/non-reference alleles on the genotyping array, ignoring sites called as heterozygous (n = 109) or missing (n = 306) in Chinese Spring. Seven of the MAGIC founders were also present in the global genotype set (Brigadier, Copain, Maris Fundin, Soissons, Spark, Steadfast, Stetson) 7. The average concordance of the global genotype calls and our sequencing calls for these founders was 94.3% (sd 0.63%). We excluded 5491 (9.8%) sites that had mismatches across these founders, many of which are likely to reflect differences in the underlying genetic variation picked up by the different genotyping technologies. Two other founder variety names were in the genotyping array dataset 7 (Banco and Holdfast) but the genotyping calls did not match (concordances 74.2% and 71.4%, respectively), which may reflect differences in the seed stock used.

Haplotype diversity among founders

First, we used the SNPs called within each promoter-gene pair to estimate haplotypic diversity among the founders. We calculated absolute (Manhattan) pairwise genetic distances between founders at each site and then used complete linkage clustering to define haplotypic groups using dist and hclust functions implemented in R statistical software (version 3.6.0) [65]. This was repeated using different similarity thresholds to define haplotypes.

Second, we determined haplotype breakpoints using a dynamic programming algorithm. For each pairwise founder combination, our algorithm calculates a mosaic of genotypic similarity/dissimilarity akin to the Viterbi path from a hidden Markov model. Genotype matches and mismatches are allocated a score (1 by default). To prevent excessive switching between states, there is also a “transition penalty” for inferring a change between matching and mismatching states. Based on their pairwise matching/mismatching states, we then infer the total number of haplotypes at each site. We repeat this procedure with different transition penalty parameter choices (Additional file 2: Figure S3). Figure 2c shows founder similarity using a transition penalty of 200.
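One way to read the pairwise procedure is as a two-state, Viterbi-style dynamic program. The Python sketch below is an interpretation, not the authors' algorithm: it assumes each site contributes the score when the inferred state agrees with the observed match or mismatch, and that the transition penalty is subtracted whenever the state changes. The function name and input format are hypothetical.

```python
def similarity_mosaic(matches, transition_penalty=200, score=1):
    """Label each site as 'same' (True) or 'different' (False) for one founder pair.

    `matches` is a list of booleans, True where the two founders share a genotype.
    The returned path maximises (agreeing sites * score) minus
    (number of state changes * transition_penalty).
    """
    if not matches:
        return []
    states = (True, False)
    # best[s]: best total score of any path that ends in state s at the current site
    best = {s: (score if matches[0] == s else 0) for s in states}
    back = []  # back[k][s]: state at site k on the best path into s at site k + 1
    for obs in matches[1:]:
        new_best, pointers = {}, {}
        for s in states:
            stay = best[s]
            switch = best[not s] - transition_penalty
            if stay >= switch:
                prev, base = s, stay
            else:
                prev, base = (not s), switch
            new_best[s] = base + (score if obs == s else 0)
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the better final state to recover the whole path.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With a large penalty, an isolated mismatch inside a long run of matches is absorbed into the surrounding "same" segment, which is the smoothing behaviour the transition penalty is meant to provide.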

Genetic mapping and heritability

For mapping, we used the full set of 1,065,185 high-quality SNP sites called in 504 RILs after imputation and quality control filters. From these, we selected a subset of $p$ = 55,067 SNPs such that every other SNP was tagged at $R^2 > 0.99$ by a member of the subset using PLINK (version 1.90) with option --indep-pairwise 500 10 0.99. The genotype dosages at each tagging SNP were standardized to produce a 504 × 55,067 genotype dosage matrix $G$, which was used to calculate the genetic relationship matrix (GRM) $K = GG'/p$. The phenotypic variance-covariance matrix for a given vector $y$ of standardized phenotype values was modelled as $V = K\sigma_g^2 + I\sigma_e^2$, where $\sigma_g^2$ and $\sigma_e^2$ are the additive genetic and environmental variance components, estimated by maximum-likelihood [66]. The heritability of a trait was defined as $h^2 = \sigma_g^2/\left(\sigma_g^2 + \sigma_e^2\right)$. The matrix square root of the variance matrix was calculated by eigendecomposition of $V$ as $A^2 = V$, and the mixed model transformation of the data performed, i.e., $y \to A^{-1}y$, $G \to A^{-1}G$, $V \to I$, to remove the inflationary effects of unequal relatedness on genetic associations before association mapping.
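The relationship matrix and heritability definitions can be illustrated on a toy dosage matrix. This is a plain-Python sketch for clarity rather than speed; standardizing with the population standard deviation is an assumption (the text does not say which estimator was used), and all function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(dosages):
    """Centre and scale each SNP column (rows = lines, columns = SNPs)."""
    n_snps = len(dosages[0])
    columns = []
    for j in range(n_snps):
        col = [row[j] for row in dosages]
        mu, sd = mean(col), pstdev(col)
        if sd == 0:
            raise ValueError("monomorphic SNP cannot be standardized")
        columns.append([(v - mu) / sd for v in col])
    # Transpose back to rows = lines.
    return [list(row) for row in zip(*columns)]


def genetic_relationship_matrix(G):
    """K = G G' / p for a standardized dosage matrix G with p SNP columns."""
    p = len(G[0])
    return [[sum(a * b for a, b in zip(gi, gk)) / p for gk in G] for gi in G]


def heritability(var_g, var_e):
    """h^2 = sigma_g^2 / (sigma_g^2 + sigma_e^2)."""
    return var_g / (var_g + var_e)
```

With this scaling the diagonal of the GRM averages to 1, so the genetic variance component is on the same scale as the standardized phenotype.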

We performed association tests at the level of both SNPs and founder haplotypes using R statistical software (version 3.6.0) [65], using purpose-written R scripts available on GitHub (see “Availability of data and materials”). Initially, we tested the null hypothesis of no association at each SNP site in the 55 k tagging SNPs. We then determined genome-wide thresholds for statistical significance using 1000 permutations on the transformed phenotypes, as described in reference [67]. If any association exceeded the 0.05 threshold (smaller p value than found across at least 950 phenotypic permutations), then we repeated the association test at all of the 1.1 M SNPs on the chromosome with the strongest association signal (lowest p value). Mapping intervals were defined to include SNPs surrounding the peak SNP, with $-\log_{10}(p)$ values within $d$ units of $x$, using $d = \max(2, 0.1x)$, where $x$ is the peak $-\log_{10}(p)$ value. The interval for founder-haplotype-based tests includes the range of sites that have $-\log_{10}(p)$ values within $d$ units of $x$. SNP-based intervals were calculated using the same measure but then extended by the minimum of 5 Mb or the distance to the next SNP in either direction that has the same “strain distribution pattern” [47] as any highly associated SNPs (SNPs with $-\log_{10}(p)$ values within $d$ units of $x$). The “strain distribution pattern” is the pattern of major/minor alleles across founders. This procedure is designed to capture the uncertainty in the positioning of relevant recombination events on either side of the QTL peak. We fitted QTLs in a stepwise manner by fitting the phenotype against the most strongly associated SNP (or founder haplotype dosage) whenever genome-wide significant QTLs were detected. The above association test procedure was then repeated using the phenotype residuals after fitting all previously identified QTLs. This allows closely linked QTLs to be detected when they have different patterns of causal variants among RILs. Where QTL associations were found for different genotypes, they were judged to be at the same locus if they had overlapping mapping intervals and at least one matching strain distribution pattern at highly associated SNP sites.
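The basic interval rule can be sketched as follows. This Python example assumes association scores are on the −log10(p) scale, so that larger values are stronger and the rule d = max(2, 0.1x) is meaningful; it takes the interval as the span of all sites on the chromosome within d units of the peak, and it leaves out the 5-Mb and strain-distribution-pattern extension used for SNP-based intervals. Function names are hypothetical.

```python
def interval_half_width(peak_score):
    """d = max(2, 0.1 * x), where x is the peak association score."""
    return max(2.0, 0.1 * peak_score)


def mapping_interval(positions, scores):
    """Span (start, end) of sites whose score lies within d units of the peak.

    `positions` and `scores` are parallel lists for one chromosome, with
    scores assumed to be -log10(p) values.
    """
    peak_idx = max(range(len(scores)), key=scores.__getitem__)
    x = scores[peak_idx]
    d = interval_half_width(x)
    inside = [pos for pos, s in zip(positions, scores) if s >= x - d]
    return min(inside), max(inside)
```

For a peak score of 30 the drop allowed is 3 units, while for any peak below 20 it stays at the floor of 2 units.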

For 40 QTLs identified using SNP-based associations, we looked in the set of 55 k SNPs that were called in 4500 global wheat accessions [8] for markers within the mapping interval that had founder genotype calls consistent with the QTL. Where more than one phenotype measurement was mapped to the same locus, we used the smallest QTL interval for matching. Twenty-two QTLs could be “matched” in this way and we can therefore estimate the frequency of these functional variants in the global germplasm (Additional file 2: Figure S2). Where more than one candidate SNP from the set of 55 k could be plausibly matched with the QTL, we used the average global MAF. We evaluated all QTLs to identify potentially causal variants that we estimated to be rare in the global germplasm. The QTLs that are rarest in the global population are for yellow rust resistance (Yr17 on chromosome 2A [30], estimated global MAF 6.7%) and grain yield in year 2 (3D:12–24, estimated global MAF 10.4%). A caveat to this analysis is that the linkage disequilibrium between SNPs and the underlying causal variation could break down in the wider population. Furthermore, the design of genotyping arrays biases them towards the detection of common variation [68]. We are therefore likely to underestimate the degree to which rare functional alleles have been detected in our population.

Genomic prediction

We evaluated the accuracy of trait prediction within NDM and estimated the extent of polygenic variation beyond genome-wide significant QTLs. We conducted genomic prediction across all phenotypes using three shrinkage-based methods: ridge regression (RR), elastic nets (EN), and least absolute shrinkage and selection operator (LASSO), using the R package glmnet [69], which estimates optimal shrinkage parameters for each genomic prediction method based on the training set. For each method, we conducted 50 rounds of cross validation by randomly sampling 90% of the RILs (n = 454) as a training set in each round to train the model, which was then used to predict the remaining 10% of RILs (n = 50), the test set. For the three methods, the model equation can be written generally as $y = \mu + G\beta + \varepsilon$, where $y$ is the estimated trait value, $\mu$ is the model intercept, $\beta$ is the vector of SNP effects, $G$ is the genotype dosage matrix, and $\varepsilon$ is the residual error. With appropriate choice of ridge parameter $\lambda = \sigma_e^2/\sigma_g^2$, RR is equivalent to a mixed model in the sense that the RR estimated SNP effects are identical to the mixed model best linear unbiased predictors (BLUPs) [50, 70]. This explains the near perfect correspondence between estimates of heritability and RR prediction accuracy (Fig. 5c).

We then predicted phenotypes in the test set by multiplying all SNP coefficient estimates by their corresponding genotypes in the test set (and adding the intercept term). We reported the training and test set prediction accuracy as the mean Pearson correlation coefficient of the predicted trait values and the actual phenotype values over 50 rounds of cross validation.
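The bookkeeping around each cross-validation round (splitting lines, applying fitted coefficients, scoring by Pearson correlation) can be sketched in Python. Model fitting itself is left out, since the study used the R package glmnet for that step; the function names and the rounding rule used to obtain a 454/50 split from 504 lines are assumptions.

```python
import random
from math import sqrt


def train_test_split(line_ids, test_fraction=0.1, rng=None):
    """Randomly split line IDs into (training set, test set)."""
    rng = rng or random.Random()
    shuffled = list(line_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def predict(intercept, coefficients, dosages):
    """Predicted phenotype for one line: intercept plus sum of effect * dosage."""
    return intercept + sum(b * g for b, g in zip(coefficients, dosages))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)
```

The reported accuracy would then be the mean of `pearson(predicted, observed)` on the test set over the 50 rounds.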

We used these genomic prediction models to evaluate the potential for phenotypic change in a simulated NDM population of 20,160 RILs, assuming the same patterns of recombination as actually observed. We did this by simulating new breeding funnels. Thus, we permuted the population founder haplotype identities 40 times across the 504 RILs and then projected the permuted founder genotypes onto the new lines. This creates new genetic combinations while retaining the mosaic breakpoints, genetic map, and linkage disequilibrium found in the real population. We applied the LASSO models (trained as above on the 504 RILs) to predict phenotypes for the simulated MAGIC RILs. We further calculated the theoretical maximum and minimum phenotype values that are possible given the genomic prediction models and the variants segregating in the population, by summing the estimated effects for all positive or negative SNP coefficients, respectively.
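Two steps of this simulation lend themselves to a short sketch: relabelling founders across a fixed recombination mosaic, and summing positive and negative coefficients to bound the predicted phenotype. The Python below is an interpretation under stated assumptions: a mosaic is represented as hypothetical `(start, end, founder)` segments, each permutation is assumed to relabel founders consistently, and the extremes are plain sums of coefficients that ignore the intercept and any dosage scaling.

```python
import random


def permute_founder_labels(mosaic, founders, rng):
    """Relabel founders in a mosaic while keeping every breakpoint in place.

    `mosaic` is a list of (start, end, founder) segments for one line.
    """
    shuffled = list(founders)
    rng.shuffle(shuffled)
    relabel = dict(zip(founders, shuffled))
    return [(start, end, relabel[f]) for start, end, f in mosaic]


def effect_extremes(coefficients):
    """(minimum, maximum): sums of all negative and all positive SNP coefficients."""
    lower = sum(b for b in coefficients if b < 0)
    upper = sum(b for b in coefficients if b > 0)
    return lower, upper
```

Applying 40 such permutations to 504 lines yields 40 × 504 = 20,160 simulated lines, which matches the population size quoted above.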

Gene deletion analysis

We asked if gene-level coverage variation among founders might explain phenotypic variation. In each founder $f$ and at each gene feature $g$, we computed a gene deletion index $D_{gf}$ based on the number of reads aligning to the associated capture sequences, normalized by the overall coverage for that founder. The gene deletion score (GDS) for each MAGIC RIL $i$ and feature $j$ was computed as $\mathrm{GDS}_{ij} = \sum_f H_{ijf} D_{jf}$, where $H_{ijf}$ is the founder haplotype dosage for founder $f$ in RIL $i$ at gene $j$, as computed by STITCH. For each phenotype, a mixed model GWAS was performed, using the GDS in place of SNP dosages and with a genetic relationship matrix computed from the GDS (Additional file 1: Table S8). We also repeated the genomic prediction analysis described above by replacing the SNP genotype dosage matrix with the GDS matrix (Additional file 2: Figure S5).
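The gene deletion score is a dosage-weighted sum of the founders' deletion indices at one gene. A minimal Python sketch, assuming the dosages and indices for a single RIL and gene are held in hypothetical dicts keyed by founder name:

```python
def gene_deletion_score(haplotype_dosages, deletion_index):
    """GDS for one RIL at one gene: sum over founders of dosage * deletion index.

    `haplotype_dosages` maps founder -> haplotype dosage at this gene in this RIL.
    `deletion_index` maps founder -> gene deletion index at this gene.
    """
    return sum(h * deletion_index[f] for f, h in haplotype_dosages.items())
```

A RIL that carries one founder's haplotype with certainty simply inherits that founder's deletion index, while uncertain ancestry gives a weighted average.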


Evidence for introgressions was based on summary statistics (coverage, non-reference allele frequency in founders and RILs) calculated in 10-Mb windows moved in 5-Mb steps. Within introgressions, carriers should have a high proportion of non-reference alleles due to the alignment of inter-specific genetic material to the bread wheat reference genome. Introgression boundaries were defined by the extent of 10-Mb windows where all introgression carriers had a higher proportion of non-reference alleles than all non-carriers. Within these regions, we then checked the relative coverage of carriers and the extent to which the alleles of carriers are over- or under-represented among the RILs. This evidence is summarized in Additional file 1: Table S3.
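The windowing scheme and the boundary criterion can be sketched as follows. This Python example is illustrative only: it assumes zero-based, half-open window coordinates and truncates the final window at the chromosome end, neither of which is specified in the text, and the function names are hypothetical.

```python
def sliding_windows(chrom_length, size=10_000_000, step=5_000_000):
    """Overlapping (start, end) windows of `size` bp, moved in `step` bp steps."""
    windows = []
    if chrom_length <= 0:
        return windows
    start = 0
    while True:
        end = min(start + size, chrom_length)
        windows.append((start, end))
        if end == chrom_length:
            break
        start += step
    return windows


def separates_carriers(nonref_fraction, carriers):
    """True if every carrier has a higher non-reference fraction than every non-carrier.

    `nonref_fraction` maps line name -> proportion of non-reference alleles in a window.
    """
    carrier_vals = [v for name, v in nonref_fraction.items() if name in carriers]
    other_vals = [v for name, v in nonref_fraction.items() if name not in carriers]
    if not carrier_vals or not other_vals:
        return False
    return min(carrier_vals) > max(other_vals)
```

An introgression's extent would then be the run of consecutive windows for which `separates_carriers` holds.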



Published by the Royal Society under the terms of the Creative Commons Attribution License, which permits unrestricted use, provided the original author and source are credited.



2012 How did this snail get here? Several dispersal vectors inferred for an aquatic invasive species . Freshw. Biol. 58, 88-99. (doi:10.1111/fwb.12041) Crossref, ISI, Google Scholar

Fischer SF, Poschlod P, Beinlich B

. 1996 Experimental studies on the dispersal of plants and animals on sheep in calcareous grasslands . J. Appl. Ecol. 33, 1206-1222. (doi:10.2307/2404699) Crossref, ISI, Google Scholar

Brauer A, Allen J, Mingram J, Dulski P, Wulf S, Huntley B

. 2007 Evidence for last interglacial chronology and environmental change from Southern Europe . Proc. Natl Acad. Sci. USA 104, 450-455. (doi:10.1073/pnas.0603321104) Crossref, PubMed, ISI, Google Scholar

Meyer MC, Spötl C, Mangini A

. 2008 The demise of the Last Interglacial recorded in isotopically dated speleothems from the Alps . Quat. Sci. Rev. 27, 476-496. (doi:10.1016/j.quascirev.2007.11.005) Crossref, ISI, Google Scholar

. 2018 Abrupt high-latitude climate events and decoupled seasonal trends during the Eemian . Nat. Comm. 9, 1-10. (doi:10.1038/s41467-018-05314-1) Crossref, PubMed, ISI, Google Scholar

. 2014 The Last Interglacial-Glacial cycle (MIS 5–2) re-examined based on long proxy records from central and northern Europe . Quat. Sci. Rev. 86, 115-143. (doi:10.1016/j.quascirev.2013.12.012) Crossref, ISI, Google Scholar

. 2017 A multi-proxy palaeoenvironmental and geochronological reconstruction of the Saalian-Eemian-Weichselian succession at Klein Klütz Höved, NE Germany . Boreas 47, 114-136. (doi:10.1111/bor.12255) Crossref, ISI, Google Scholar

Sellinger T, Awad DA, Tellier A

. 2020 Limits and convergence properties of the sequentially Markovian coalescent . bioRxiv (doi:10.1101/2020.07.23.217091) Google Scholar

. 2010 Approximate Bayesian computation in evolution and ecology . Annu. Rev. Ecol. Evol. Syst. 41, 379-406. (doi:10.1146/annurev-ecolsys-102209-144621) Crossref, ISI, Google Scholar

Smith CCR, Flaxman SM, Scordato ESC, Kane NC, Hund AK, Sheta BM, Safran RJ

. 2018 Demographic inference in barn swallows using whole-genome data shows signal for bottleneck and subspecies differentiation during the Holocene . Mol. Ecol. 27, 4200-4212. (doi:10.1111/mec.14854) Crossref, PubMed, ISI, Google Scholar

. 2004 Model selection in ecology and evolution . Trends Ecol. Evol. 19, 101-108. (doi:10.1016/j.tree.2003.10.013) Crossref, PubMed, ISI, Google Scholar

Capblancq T, Després L, Rioux D, Mavárez J

2015 Hybridization promotes speciation in Coenonympha butterflies . Mol. Ecol. 24, 6209-6222. (doi:10.1111/mec.13479) Crossref, PubMed, ISI, Google Scholar

Wang L, Wan ZY, Lim HS, Yue GH

. 2016 Genetic variability, local selection and demographic history: genomic evidence of evolving towards allopatric speciation in Asian seabass . Mol. Ecol. 25, 3605-3621. (doi:10.1111/mec.13714) Crossref, PubMed, ISI, Google Scholar

Fraïsse C, Roux C, Gagnaire PA, Romiguier J, Faivre N, Welch JJ, Bierne N

. 2018 The divergence history of European blue mussel species reconstructed from approximate Bayesian computation: the effects of sequencing techniques and sampling strategies . PeerJ 6, e5198. (doi:10.7717/peerj.5198) Crossref, PubMed, ISI, Google Scholar

. 2002 Land snails as a model to understand the role of history and selection in the origins of biodiversity . Popul. Ecol. 44, 129-136. (doi:10.1007/s101440200016) Crossref, ISI, Google Scholar

Städler T, Haubold B, Merino C, Stephan W, Pfaffelhuber P

. 2009 The impact of sampling schemes on the site frequency spectrum in nonequilibrium subdivided populations . Genetics 182, 205-216. ( Crossref, PubMed, ISI, Google Scholar

Marques DA, Lucek K, Sousa VC, Excoffier L, Seehausen O

. 2019 Admixture between old lineages facilitated contemporary ecological speciation in Lake Constance stickleback . Nat. Comm. 10, 1-14. (doi:10.1038/s41467-019-12182-w) Crossref, PubMed, ISI, Google Scholar

Teske PR, Sandoval-Castillo J, Golla TR, Emami-Khoyi A, Tine M, von der Heyden S, Beheregaray LB

. 2019 Thermal selection as a driver of marine ecological speciation . Proc. R. Soc. B 286, 20182023. (doi:10.1098/rspb.2018.2023) Link, ISI, Google Scholar

. 2013 Genome-wide evidence for speciation with gene flow in Heliconius butterflies . Genome Res. 23, 1817-1828. (doi:10.1101/gr.159426.113) Crossref, PubMed, ISI, Google Scholar

. 2012 Divergence hitchhiking and the spread of genomic isolation during ecological speciation-with-gene-flow . Phil. Trans. R. Soc. B 367, 451-460. (doi:10.1098/rstb.2011.0260) Link, ISI, Google Scholar

Parmakelis A, Kotsakiozi P, Rand D

. 2013 Animal mitochondria, positive selection and cyto-nuclear coevolution: insights from Pulmonates . PLoS ONE 8, e61970. (doi:10.1371/journal.pone.0061970) Crossref, PubMed, ISI, Google Scholar

Martínez-Fernández M, Bernatchez L, Rolán-Alvarez E, Quesada H

. 2010 Insights into the role of differential gene expression on the ecological adaptation of the snail Littorina saxatilis . BMC Evol. Biol. 10, 356-414. (doi:10.1186/1471-2148-10-356) Crossref, PubMed, ISI, Google Scholar

Parmakelis A, Pfenninger M, Spanos L, Papagiannakis G, Louis C, Mylonas M

. 2005 Inference of a radiation in Mastus (Gastropoda, Pulmonata, Enidae) on the island of Crete . Evolution 59, 991-1005. (doi:10.1111/j.0014-3820.2005.tb01038.x) Crossref, PubMed, ISI, Google Scholar

. 2008 Speciation and the evolution of gamete recognition genes: pattern and process . Heredity 102, 66-76. (doi:10.1038/hdy.2008.104) Crossref, PubMed, ISI, Google Scholar

. 2008 Speciation through evolution of sex-linked genes . Heredity 102, 4-15. (doi:10.1038/hdy.2008.93) Crossref, PubMed, ISI, Google Scholar

. 1986 Variation in land-snail shell form and size and its causes: a review . Syst. Biol. 35, 204-223. (doi:10.1093/sysbio/35.2.204) Crossref, Google Scholar

Vulture genomes reveal molecular adaptations underlying obligate scavenging and low levels of genetic diversity

Obligate scavenging on dead and decaying animal matter is a rare dietary specialization that in extant vertebrates is restricted to vultures. These birds perform essential ecological services, yet many vulture species have undergone recent steep population declines and are now endangered. To test for molecular adaptations underlying obligate scavenging in vultures, and to assess whether genomic features might have contributed to their population declines, we generated high-quality genomes of the Himalayan and bearded vultures, representing both independent origins of scavenging within the Accipitridae, alongside a sister taxon, the upland buzzard. By comparing our data to published sequences from other birds, we show that the evolution of obligate scavenging in vultures has been accompanied by widespread positive selection acting on genes underlying gastric acid production and immunity. Moreover, we find evidence of parallel molecular evolution, with amino acid replacements shared among divergent lineages of these scavengers. Our genome-wide screens also reveal that both the Himalayan and bearded vultures exhibit low levels of genetic diversity, equating to around half of the mean genetic diversity of the other bird genomes examined. However, demographic reconstructions indicate that population declines began at around the Last Glacial Maximum, predating the well-documented dramatic declines of the past three decades. Taken together, our genomic analyses imply that vultures harbour unique adaptations for processing carrion, but that modern populations are genetically depauperate and thus especially vulnerable to further genetic erosion through anthropogenic activities.

Keywords: conservation; gastric acid; immune system; scavenging; vulture.

© The Author(s) 2020. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.