When looking at population genomic data, regions of low diversity (lower than expected; such as in a region of high recombination) can indicate either purifying selection of deleterious mutations or a selective sweep of an adaptive mutation. What are some ways one can tell which one occurred?
One way I've heard of is to use an outgroup that did not live through the same events as the main group. If the outgroup also has low diversity, it means nonsynonymous mutations in that region are deleterious regardless of events, and so those homogeneous regions likely underwent purifying selection. If the outgroup has high diversity in those regions, the main group likely went through a selective sweep.
What are other ways to distinguish between the two?
Welcome to Biology.SE!
I recently went through the literature on this and similar subjects, so I'll be happy to answer.
The answer will not be easy to formulate as a number of authors are working and arguing on the question. I will try to give a quick overview of methods.
Definitions: Background selection and selective sweep
First off, let's use the correct terms. As you described, both positive and purifying selection reduce genetic diversity at nearby loci. When the reduction in genetic diversity is caused by positive selection, we call the process a selective sweep. When the reduction in genetic diversity is caused by purifying selection, we call the process background selection.
What affects background selection?
- Recombination rate
- Strength of selection.
- For a given neutral locus at distance $r$ centimorgans from a locus under purifying selection, the selective coefficient $s$ that causes the largest decrease in genetic diversity is $s=r$ (Nordborg 1997).
- Mutation rate
- Population structure
What affects a selective sweep?
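The $s=r$ result above can be checked numerically. The sketch below assumes a commonly used single-locus approximation for the relative diversity at a linked neutral site, $B \approx \exp(-u\,t/(t+r)^2)$, where $t$ is the selection coefficient against heterozygotes, $u$ the deleterious mutation rate and $r$ the recombination fraction (not centimorgans). The formula, the function names and the parameter values are assumptions for illustration and are not taken from the answer.

```python
import math


def b_value(u, t, r):
    """Approximate pi/pi0 at a neutral site linked to one selected locus (assumed formula)."""
    return math.exp(-u * t / (t + r) ** 2)


def strongest_reduction(u, r, grid):
    """Return the selection coefficient in `grid` that gives the lowest B."""
    return min(grid, key=lambda t: b_value(u, t, r))


r = 0.01                                      # hypothetical recombination fraction
grid = [i / 10000 for i in range(1, 1001)]    # selection coefficients 0.0001 .. 0.1
t_star = strongest_reduction(1e-3, r, grid)   # hypothetical mutation rate u = 1e-3
print(t_star)
```

Under this approximation the exponent $t/(t+r)^2$ peaks at $t=r$, so very weak and very strong selection both reduce linked diversity less than intermediate selection does.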
- Strength of selection
- Population structure
- Number of loci involved under adaptation
- Whether adaptation comes from de novo mutation or standing genetic variation
By the way, you might want to have a look at the terms 'soft sweep' vs 'hard sweep' in relation to the last two elements of the above list and in relation to local adaptation.
How to disentangle Background selection from selective sweep?
There are a number of techniques, but again all of this is work in progress. I would like to classify these methods into three themes.
- Environmental covariate
This first element is about disentangling local selection from background selection. It is maybe not exactly what you asked for, but this discussion often takes center stage in the literature dealing with both.
If we assume we know the environmental variable causing the adaptation, then we can compare divergence between populations living in different environments and those living in the same environment. If the loss of genetic diversity is caused by local selection, then divergence should be higher between populations occurring in different environments. If the loss of genetic diversity is caused by background selection, then all populations will show similar divergence. Bayesian techniques such as BayEnv2 (Gunther and Coop 2013) take advantage of this approach.
- Among lineages comparisons
By comparing related species, it is possible to find regions of low genetic diversity. If all related lineages show a similar loss of genetic diversity irrespective of the existence of an adaptive event, then the loss of genetic diversity is likely caused by background selection.
From such methods (but I might not totally understand them, I should read the paper again), some authors (such as McVicker 2009) have built B-maps, that is, maps of the genome showing the intensity of background selection as measured by the B-value introduced by Charlesworth (history of the term reviewed in Charlesworth 2012). These maps were recently implemented in BayeScan (Huber et al. 2016; BayeScan is an Fst outlier method to detect local adaptation).
- Allele frequency spectrum
Both background selection and a selective sweep affect the allele frequency spectrum so as to cause an excess of rare alleles (Tajima's D < 0). However, the strength of the effect and its detailed shape in the allele frequency spectrum are not quite the same, and some authors have suggested using such differences to disentangle the two.
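To make the "Tajima's D < 0" statement concrete, here is a minimal sketch of how D can be computed from a small sample of aligned haplotypes, using the standard constants from Tajima (1989). The function name and the toy haplotypes are hypothetical; real analyses would use an established package and many more sites.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for equal-length haplotype strings (one character per site)."""
    n = len(haplotypes)
    # S: number of segregating (polymorphic) sites.
    s = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    if n < 4 or s == 0:
        return None  # the variance term is zero or D is undefined
    # pi: mean number of pairwise differences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy data: every variant is a singleton (excess of rare alleles).
rare = ["00000", "10000", "01000", "00100", "00010"]
# Toy data: variants at intermediate frequency.
balanced = ["11", "11", "00", "00"]
print(tajimas_d(rare), tajimas_d(balanced))
```

A sample dominated by singletons gives a negative D, while variants at intermediate frequency give a positive D. A negative D alone does not say whether a sweep, background selection or demography produced the excess of rare alleles.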
There is a lot to read on the subject. I would recommend the recent special issue in Molecular Ecology DETECTING SELECTION IN NATURAL POPULATIONS including the review Haasl and Payseur 2016.
Distinguishing Between Selective Sweeps and Demography Using DNA Polymorphism Data
In 2002 Kim and Stephan proposed a promising composite-likelihood method for localizing and estimating the fitness advantage of a recently fixed beneficial mutation. Here, we demonstrate that their composite-likelihood-ratio (CLR) test comparing selective and neutral hypotheses is not robust to undetected population structure or a recent bottleneck, with some parameter combinations resulting in a false positive rate of nearly 90%. We also propose a goodness-of-fit test for discriminating rejections due to directional selection (true positive) from those due to population and demographic forces (false positives) and demonstrate that the new method has high sensitivity to differentiate the two classes of rejections.
The substitution of a strongly selected advantageous mutation is expected to alter the frequencies of linked neutral variation (Maynard-Smith and Haigh 1974; Kaplan et al. 1989; Stephan et al. 1992). Several statistical tests have been proposed for inferring such a “selective sweep” event based on predicted effects relative to the standard neutral model. These include (1) a depression of expected heterozygosity relative to divergence at the target of selection (Hudson et al. 1987), (2) an excess of rare alleles compared to the standard neutral model (Tajima 1989; Braverman et al. 1995; Fu 1997), (3) an excess of high-frequency-derived alleles (Fay and Wu 2000), and (4) increased linkage disequilibrium (Przeworski 2002; Kim and Nielsen 2004). Since these signatures are localized to regions adjacent to the targets of selection, it seems reasonable to attempt to identify loci subject to recent directional selection by analyzing genomic patterns of presumably neutral polymorphism (e.g., Harr et al. 2002; Kim and Stephan 2002; Vigouroux et al. 2002).
A potential problem in this endeavor, however, is the low power to discriminate patterns expected under hitchhiking from similar patterns produced by chance under nonequilibrium conditions in the absence of selection. For example, recovery from a recent population bottleneck may result in an excess of rare alleles (Tajima 1989a,b) as can population expansion (Fu and Li 1993). More troubling is the fact that selection against linked deleterious mutation can also lead to an excess of rare alleles when effective population sizes are small (e.g., Charlesworth et al. 1993). More recently, Fay and Wu (2000) suggested that an excess of high-frequency-derived alleles in a sample is more likely due to hitchhiking than to other scenarios. However, they also pointed out that if there are many fixed differences between populations that exchange rare migrants, polymorphisms in the population would tend to be at very low or high frequencies. Furthermore, Przeworski (2002) demonstrated that a variety of demographic models have the same effect on Fay and Wu's H-statistic as a selective sweep. Recent bottlenecks and metapopulation structures (Wakeley and Aliacar 2001) were also shown to result in high-frequency-derived alleles more often than would be expected under the standard neutral model. Despite these clear effects of nonselective forces, many have argued that one may still distinguish selective sweeps from demography, since the former generates a localized signature around the target of selection while the latter affects the entire genome equally. However, in the absence of selective sweeps, we may still observe local fluctuations of variation along a sequence, which are likely to be amplified by demographic forces and recombination that resemble the expected pattern of a selective sweep.
Thus, while the pattern of variation along a chromosome produced by hitchhiking is quite predictable, it is often difficult to be certain that a given departure from neutrality is due to hitchhiking and not some stochastic effects manifested in the single realization of the evolutionary process.
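As an illustration of the Fay and Wu H-statistic discussed above, the sketch below computes H from an unfolded site frequency spectrum as the difference between π and θ_H, the estimator that weights high-frequency derived alleles most heavily. The function name and the example spectra are hypothetical and are not data from the paper.

```python
def fay_wu_h(sfs, n):
    """Fay and Wu's H from an unfolded SFS.

    sfs[i - 1] is the number of sites whose derived allele is carried by
    i of the n sampled chromosomes, for i = 1 .. n - 1.
    """
    denom = n * (n - 1)
    # pi: mean pairwise differences expressed through the SFS.
    pi = sum(2 * k * i * (n - i) for i, k in enumerate(sfs, start=1)) / denom
    # theta_H: weights each site by the squared derived-allele count.
    theta_h = sum(2 * k * i * i for i, k in enumerate(sfs, start=1)) / denom
    return pi - theta_h


n = 10
high_freq_derived = [0] * 8 + [5]   # five sites with the derived allele in 9 of 10
low_freq_derived = [5] + [0] * 8    # five derived singletons
print(fay_wu_h(high_freq_derived, n), fay_wu_h(low_freq_derived, n))
```

An excess of high-frequency derived alleles drives H strongly negative. As the text notes, demographic scenarios can produce the same signal, so a negative H is not by itself evidence of hitchhiking.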
Kim and Stephan (2002) present a composite-likelihood method for distinguishing selective sweeps from stochastic, neutral variation, assuming the sample of DNA sequences is drawn from a randomly mating population of constant size. They demonstrate that their method has considerable power to detect a recent selective sweep and yields unbiased estimates of the location and strength of the beneficial mutation. Here, we examine the extent to which bottlenecks and undetected population structure affect the type I error of their composite-likelihood-ratio (CLR) test. The CLR test was studied for two main reasons. First, it has been shown to have high power, indicating that it may be useful for whole-genome scans for adaptively evolving genes. Second, the test statistic (as is discussed below) is the ratio of the likelihood of the data given a recently completed selective sweep vs. an equilibrium neutral model. Therefore, one might predict that population processes that create large deviations from the latter model may lead to the spurious rejection of the null hypothesis of neutrality and thus to the erroneous inference of a recent selective sweep. Using coalescent simulations, we demonstrate that the CLR test as proposed by Kim and Stephan (2002) is not robust to the assumption of constant population size and random mating. However, through the use of the proposed goodness-of-fit test, it may be possible to distinguish data sets rejecting neutrality due to directional selection from those due to nonselective effects.
MATERIALS AND METHODS
Bacteriophage T7: Starting from a single plaque (designated WT), a population of T7 was propagated for 500 lytic cycles, ∼2500 generations, by C. W. Cunningham and J. J. Bull at the University of Texas, Austin (Figure 1). At each lytic transfer ∼2 μl of the 2-ml lytic culture of viruses (∼10⁵ individuals) was passaged to the next tube, and at no point was the lineage bottlenecked to a single individual. The lineage was sampled at three time points named for the age of the lineage in number of lytic cycles (one lytic cycle is ∼5 generations; J. J. Bull, personal communication): populations CW100, CW400, and CW500. The genes sequenced (identified in Dunn and Studier 1983) from the single ancestral plaque and the descendant populations were the first 285 bp of 0.3, which inactivates host restriction (the rest of the gene was truncated by a deletion event, as in Cunningham et al. 1997); 17.0 (1662 bp), a tail fiber protein; 17.5 (204 bp), which is associated with lysis; and 18.0 (270 bp), a DNA maturation protein. Sequences are GenBank nos. AF419412-AF419511. T7 was grown in 2-ml cultures of Escherichia coli strain W3110 in the presence of the mutagen N-methyl-N′-nitro-N-nitrosoguanidine (20 μg/μl). See Hillis et al. (1992) and Bull et al. (1993) for further details on the growth and maintenance of the T7 phage.
Modification of current tests: As an exemplar, we illustrate our method using Fu and Li’s D test (1993), but it is important to point out that the same idea applies to Fu and Li’s F, D*, F* (1993), and Tajima’s D tests (1989). The main idea behind our method is that under selective neutrality of polymorphism the distribution of nonsynonymous and synonymous mutations should be proportional across a genealogy. In terms of Fu and Li’s D test, this means that the ratio of nonsynonymous mutations on internal branches of a genealogy to those on external branches should equal the ratio of synonymous mutations on internal to external branches. As a result of homogeneous processes, such as population expansion or a selective sweep, there is an excess of mutations on external branches (i.e., rare alleles) but this affects both nonsynonymous and synonymous mutations equally. Purifying selection and any resulting segregating deleterious mutations have heterogeneous effects across a locus: nonsynonymous and synonymous mutations are not affected equally, so the distribution of mutations is disproportional. Under purifying selection there will be an excess of mutations on external branches, but nonsynonymous mutations will be disproportionately represented because they are being actively selected against and thus kept at low frequencies. To test for heterogeneous effects, therefore, we calculated D for two sets of data: nonsynonymous and synonymous mutations.
Our procedure (heterogeneity test) for testing for differences in Fu and Li’s D between synonymous and nonsynonymous mutations was relatively simple. First we calculated, for each gene, D and θW (the population mutation parameter, 2Neμ, based on the number of segregating sites; Watterson 1975) separately for the nonsynonymous and synonymous data sets, and then we calculated ΔD (synonymous D - nonsynonymous D). Using a PERL version of the make tree program of Hudson (1990; available upon request or on the Web at http://www.duke.edu/
mwh3), we conducted Monte Carlo coalescent simulations of 10,000 gene genealogies with no recombination; the assumption of no recombination makes our test conservative. Each of the 10,000 genealogies was simulated with the values of both synonymous and nonsynonymous θW. For each tree the value of Fu and Li’s D was then calculated for both synonymous and nonsynonymous mutations and the difference, ΔD, was recorded. This distribution of the values of ΔD was then used to calculate the probability, P, of observing a difference in D values between synonymous and nonsynonymous mutations as great or greater than the observed difference. A one-tailed test is used because we have an a priori expectation that D for nonsynonymous mutations will be more negative due to segregating deleterious mutations. This program can also be used on Fu and Li’s F, D*, F*, and Tajima’s D statistics.
Data analysis: Sequences used in this study were visually aligned; there were no gaps in any of the aligned sequences we used. Calculations of Fu and Li’s D, π (the average number of pairwise nucleotide differences per site; Tajima 1983), πa/πs (the ratio of pairwise nonsynonymous and synonymous differences per site), and θW were done using DNAsp 3.5 (Rozas and Rozas 1999). The outgroup used for the calculation of D was the known ancestral sequence (WT).
The population recombination parameter, γ (2Nec), was analyzed using SITES (Hey and Wakeley 1997). This is used because Hudson’s C (1987) is unreliable for small sample sizes (Hey and Wakeley 1997; Hudson 1987). For some populations SITES (Hey and Wakeley 1997) cannot calculate γ, and in these cases C is used; these are marked by a superscript a in Table 2. SITES cannot generate an estimate of γ for some data sets either because they have too few informative sites that are shared in subsets of four lines or because of the spacing of those sites with regard to whether or not they show evidence of recombination (Hey and Wakeley 1997; J. Hey, personal communication). Estimates of C are almost always >γ because of error involved in calculating C from small sample sizes.
McDonald and Kreitman (1991) suggested a comparison of the ratio of polymorphism to fixed differences of both synonymous and nonsynonymous mutations as a statistical test for evaluating the role of natural selection in causing substitutions in protein-coding genes. This test suggests the action of positive selection when there is a relative excess of nonsynonymous fixed differences (McDonald and Kreitman 1991). We performed the McDonald and Kreitman (M-K) test using Fisher’s exact test to evaluate significance; fixed differences were calculated between the WT ancestral sequence and the evolved populations.
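The M-K comparison described above reduces to a 2×2 contingency table (nonsynonymous vs. synonymous, fixed vs. polymorphic). The sketch below is a minimal standard-library implementation of a two-sided Fisher's exact test on such a table. The function name and the counts are hypothetical and are not the values from this study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: rows = nonsynonymous, synonymous; columns = fixed, polymorphic.
p_value = fisher_exact_two_sided(7, 2, 17, 42)
print(p_value)
```

A small P together with a relative excess of nonsynonymous fixed differences is the pattern the text associates with positive selection. An excess of nonsynonymous polymorphism would instead point toward segregating deleterious mutations.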
A selective sweep occurs when, due to strong positive natural selection, a beneficial allele quickly goes to fixation in a population, reducing or eliminating variation among the nucleotides near that allele. A selective sweep can occur when a rare or formerly absent allele that improves the fitness of its carrier relative to other members of the population increases in frequency quickly due to natural selection. As the frequency of such a beneficial allele increases, genetic variants that happen to be present in the DNA neighborhood of the beneficial allele also become more prevalent; this phenomenon is called genetic hitchhiking. A selective sweep arises when rapid changes in the frequency of a beneficial allele, driven by positive selection, distort the genealogical history of samples from the region around the selected locus. It is now recognized that not all sweeps reduce genetic variation in the same way; selective sweeps can be grouped into three main categories. First, the classic selective sweep, or hard sweep, is expected to occur when beneficial mutations are rare, but once a beneficial mutation has occurred it increases in frequency rapidly, drastically reducing genetic variation in the population. Second, a soft sweep from standing genetic variation (SGV) occurs when previously neutral mutations that were present in a population become beneficial because of an environmental change. Such a mutation may be present on several genomic backgrounds, so that when it rapidly increases in frequency it does not erase all genetic variation in the population. Finally, a multiple-origin soft sweep happens when mutations are common, for example in a large population, so that the same or similar beneficial mutations occur on different genomic backgrounds and no single genomic background hitchhikes to high frequency. Whether a selective sweep has occurred can be explored in various ways.
One method is to measure linkage disequilibrium, that is whether a given haplotype is overrepresented in the population. Under neutral evolution, genetic recombination will result in the reshuffling of the different alleles within the haplotypes, and no single haplotype will dominate the population. However, during a selective sweep, selection for a positively selected gene variant will also result in hitchhiking of neighboring alleles and less opportunity for recombination. Therefore, the presence of strong linkage disequilibrium might indicate that there has been a selective sweep and can be used to identify sites recently under selection. There have been many scans for selective sweeps in humans and other species using a variety of statistical approaches and assumptions. 
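One common way to quantify linkage disequilibrium between two biallelic sites is the squared correlation r². The sketch below computes it from phased haplotypes coded as strings of '0' and '1'. The function name and the toy samples are hypothetical.

```python
def ld_r2(haplotypes, i, j):
    """Squared correlation r^2 between biallelic sites i and j."""
    n = len(haplotypes)
    p_a = sum(h[i] == "1" for h in haplotypes) / n
    p_b = sum(h[j] == "1" for h in haplotypes) / n
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ab - p_a * p_b                      # coefficient of disequilibrium D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None                           # one of the sites is monomorphic
    return d * d / denom


swept_like = ["11", "11", "00", "00"]   # alleles always travel together
shuffled = ["11", "10", "01", "00"]     # all four combinations equally common
print(ld_r2(swept_like, 0, 1), ld_r2(shuffled, 0, 1))
```

Values near 1 mean the two sites are almost always inherited together, as expected shortly after a sweep. Values near 0 mean recombination has reshuffled the alleles.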
The main difference between soft and hard selective sweeps lies in the expected number of different haplotypes carrying the beneficial mutation or mutations, and therefore in the expected number of haplotypes that hitchhike to considerable frequency during the selective sweep and remain in the population at the time of fixation. This key difference results in different expectations for both the site frequency spectrum and linkage disequilibrium, and consequently for the commonly used test statistics based on these signals. If hard sweeps facilitate evolutionary rescue, then just a single ancestor is responsible for the spread of the advantageous variants, and so genetic diversity will be removed from the population as a consequence of adaptation as well as demographic decline. On the other hand, a soft sweep, in which the beneficial allele is independently derived in multiple ancestors, will retain some of the ancestral diversity that existed prior to the environmental shift that initiated the fitness changes.
Is there any way to separate soft and hard sweeps? Obviously, only recent adaptive events leave a measurable signal at all (hard or soft). Signals from the site frequency spectrum (like the excess of rare alleles that is picked up by Tajima 1989) usually fade on time scales of 0.1Ne generations, while signals based on linkage disequilibrium or haplotype statistics only last
Ways to distinguish between purifying selection and selective sweep - Biology
We evaluated the performance of a composite likelihood ratio test for detecting selective sweeps (Nielsen et al. 2005) when including fixed differences in the likelihood ratio in addition to SFS information, using extensive simulations. We show that there can be a marked increase in power as well as a reduction in FPR for a number of different scenarios in several different models of mutation rate variation, population bottlenecks and background selection. We also show that estimates of the strength of background selection can be included into the framework, to prevent false positives in regions with strong, long-term background selection. By applying the method to human genetic data, we detect novel regions that are not identified as candidate regions with the standard sweepfinder approach.
Using invariant sites increases power and robustness
Given that both diversity and divergence change proportionally with mutation rate, we integrate variation in mutation rates by including a measure of divergence to an out-group species. More specifically, we include sites that are not polymorphic within the species under investigation, but differ from an out-group sequence, that is, inferred fixed differences. If the sweepfinder CLR is calculated including all sites (CLR3), variation in mutation rates can create false positives (Fig. 4). However, if only fixed differences are added to the SFS (CLR2), the power, but not the FPR, increases. This strongly suggests using CLR2 instead of CLR3 when out-group information is available.
Furthermore, including invariant sites can increase robustness to certain bottleneck scenarios if the bottleneck is of intermediate to high strength, but not too recent (Boitard et al. 2009; Pavlidis et al. 2010). However, like many other methods for detecting selective sweeps (Barton 1998; Jensen et al. 2005; Voight et al. 2006; Boitard et al. 2009; Pavlidis et al. 2010; Crisci et al. 2013), the CLR test can suffer from a disturbingly high FPR in the presence of recent bottlenecks in population size. The use of an empirically derived demographic background SFS does not eliminate the sensitivity to demographic assumptions, because the CLR does not model the correlation in coalescence times along the sequence correctly irrespective of the demographic model. A bottleneck will force many lineages to coalesce in a short amount of time. If the duration of the bottleneck is such that at least some lineages escape the bottleneck in most regions, the few regions in which all lineages coalesce during the bottleneck may very much resemble regions that have been affected by a selective sweep. Realistic demographic models should be used if assigning P-values to individual sweeps.
Background selection as a null model for sweep detection
What is often neglected in previous discussions of diversity-based sweep detection methods is variation in diversity across the genome that is not caused by variation in mutation rate (or conservation level), but by variation in background selection, that is, by the effect of deleterious mutations on linked neutral variation (Charlesworth et al. 1993; Hudson & Kaplan 1995; Charlesworth 2012; Cutter & Payseur 2013). A locally increased level of background selection will lead to a reduction in diversity similar to that expected after a selective sweep.
As data sets and methods for estimating the effect of background selection for each position in the genome are becoming available (McVicker et al. 2009), the objective of developing methods for detecting positive selection that can take background selection into account is becoming tenable. We present the first such method by including a map of predicted B-values in the calculation of the CLR. McVicker et al. (2009) provide such a B-value map for humans by defining functional elements based on mammalian sequence conservation, and fitting parameters to phylogenetic data. Therefore, reductions in neutral diversity in regions of the human data do not influence the local estimation of B. Our approach considers a local reduction in diversity as evidence for a selective sweep only if it is not also predicted by a local drop in B-values, that is, background selection is our evolutionary null model (Cutter & Payseur 2013). We simulated background selection levels typical for humans (McVicker et al. 2009), and by accounting for background selection, we could effectively prevent false positives without losing power. If one does not account for background selection, the proportion of false positives is large and similar to that of a HKA test (Fig. 7a).
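The idea of using background selection as the null model can be illustrated with a deliberately simplified sketch: scale the neutral expectation of diversity by the local B-value, and flag only those windows whose observed diversity falls well below that B-adjusted expectation. This is not the CLR computation described above. The function name, the threshold and the numbers are hypothetical.

```python
def flag_unexplained_dips(pi_obs, b_values, pi0, threshold=0.5):
    """Return indices of windows whose diversity is low even after B-scaling.

    pi_obs    observed diversity per window
    b_values  predicted B per window (1.0 means no background selection)
    pi0       assumed diversity in the absence of linked selection
    threshold flag a window when observed / (B * pi0) drops below this
    """
    flagged = []
    for idx, (pi, b) in enumerate(zip(pi_obs, b_values)):
        expected = b * pi0
        if expected > 0 and pi / expected < threshold:
            flagged.append(idx)
    return flagged


pi_obs = [0.0010, 0.0004, 0.0003]    # hypothetical window diversities
b_values = [1.0, 0.4, 0.9]           # hypothetical B-map values
candidates = flag_unexplained_dips(pi_obs, b_values, pi0=0.001)
print(candidates)
```

In this toy example the second window has low diversity that a low B-value fully predicts, so it is not flagged. The third window has low diversity despite a high B-value, so it remains a sweep candidate.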
Application to human data
Finally, by applying our method to human genetic variation data, we show that the new method detects novel regions that were not identified as candidates using the standard sweepfinder approach. Based on our simulations, we would expect those regions to be enriched for old selective sweeps that started between 0.2 and 0.8 Ne generations ago, a time range where the power of other SFS-based, FST- and LD-based methods is low (Sabeti et al. 2006). Interestingly, the strongest signal we find, which has been missed by most previous scans, is near KIAA1217, a gene affecting lumbar disc herniation susceptibility. We speculate that the selection in this region may possibly be related to changes in human muscular–skeletal function subsequent to the evolution of erect bipedal walk. Increased risk of lumbar disc herniation is a likely consequence of bipedal walk. We may still be evolving to optimize muscular–skeleton functions after this recent, radical change in skeletal structure and function.
Population geneticists and evolutionary biologists have a long-standing interest in understanding the ecological and genetic mechanisms that allow species to adapt to local environmental conditions. The recent advent of next-generation sequencing (NGS) (Shendure & Ji 2008) and the high density SNP arrays it generates has allowed rapid advances in this field and has fostered the emergence of the population genomics approach (Luikart et al. 2003). This new paradigm is focused on the use of genomewide data to distinguish between locus-specific effects (mainly selection but also mutation, and recombination) and genomewide effects such as genetic drift. It has proven particularly useful to detect signatures of selection and has been used to uncover genes involved in local adaptation, disease susceptibility, resistance to pathogens and other phenotypic traits of interest to plant and animal breeders.
At the genetic level, local adaptation involves a process whereby directional selection induced by local environmental conditions will favour the spread of genetic variants associated with beneficial phenotypic traits. If selection is strong at the level of an individual locus, the selected variant will increase in frequency. Additionally, selection will modify the pattern of diversity around the selected locus through genetic hitchhiking (Smith & Haigh 1974; Barton 2000). This process, known as a selective sweep, has been extensively studied using models of isolated populations (Smith & Haigh 1974; Sabeti et al. 2002; Kim & Nielsen 2004; Hermisson & Pennings 2005; Pennings & Hermisson 2006a, b; Voight et al. 2006) but much less studied under structured population scenarios. In this latter case, analyses focused on either a universally favoured mutation that spreads from its deme of origin to other demes (Slatkin & Wiehe 1998; Barton 2000; Bierne 2010) or on a scenario where the new selected variant is favoured in one part of the species range but counter-selected in the other half (Bierne 2010). However, there is a third scenario still poorly understood but frequently assumed by studies of local adaptation, particularly in humans. Under this scenario, a selected variant is favoured in one part of the species range and is neutral elsewhere (e.g. lactase persistence, skin pigmentation, high altitude adaptation; Jeong & Di Rienzo 2014).
A third type of genome-scan methods considers explicitly the physical linkage among SNPs surrounding a selected variant, either by focusing on patterns of long-range haplotype homozygosity (Sabeti et al. 2002; Voight et al. 2006) or by modelling the effect of linkage on multilocus genetic differentiation (Chen et al. 2010). These methods are more recent, and their properties have not been extensively investigated. Moreover, although they are focused on either a single population (Sabeti et al. 2002; Voight et al. 2006; Ferrer-Admetlla et al. 2014) or on pairs of populations (Sabeti et al. 2007; Chen et al. 2010; Fariello et al. 2013), they are being used to study structured populations consisting of many subpopulations without a clear understanding of how migration and complex population structure may affect their power and error rates. Thus, the objective of this study is to carry out a thorough evaluation of the performance of these methods under various scenarios of population structure. We focus mainly on the case where the selected variant is beneficial in part of the species range and neutral elsewhere, as it is the underlying scenario envisaged by many recent studies of adaptation (Lao et al. 2007; Hancock et al. 2008; Foll et al. 2014). Additionally, we consider both hard and soft selective sweeps. These two scenarios differ in the origin of the selected variant. In a hard selective sweep, the favoured allele appears through de novo mutation, while in a soft sweep, it is already segregating at low frequency in the population (standing genetic variation) or it arises from recurrent mutations (Hermisson & Pennings 2005; Pennings & Hermisson 2006a, b; Pritchard et al. 2010).
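To make "long-range haplotype homozygosity" concrete, here is a minimal sketch of extended haplotype homozygosity (EHH): among the chromosomes carrying a given core allele, it is the probability that two randomly chosen ones are identical over the extended region. The function name, the string encoding of haplotypes, and the symmetric window around the core SNP are simplifying assumptions made for illustration. Published implementations typically extend in each direction separately and work in physical or genetic distance.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, distance):
    """EHH among carriers of `core_allele`, using the SNPs within
    `distance` positions on either side of the core SNP.

    `haplotypes` is a list of equal-length strings of '0'/'1' alleles
    (a hypothetical toy encoding).
    """
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # homozygosity is undefined with fewer than two carriers
    lo = max(0, core_index - distance)
    hi = min(len(carriers[0]), core_index + distance + 1)
    # Count how many carriers share each distinct extended haplotype.
    counts = Counter(h[lo:hi] for h in carriers)
    # Fraction of carrier pairs that are identical over the window.
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy sample: four carriers of allele '1' at the core SNP (index 2).
sample = ["00100", "00100", "00100", "10101", "01000"]
ehh_near = ehh(sample, core_index=2, core_allele="1", distance=1)
ehh_far = ehh(sample, core_index=2, core_allele="1", distance=2)
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break up the carrier haplotypes. An allele that rose in frequency quickly tends to show an unusually slow decay.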
In the present analysis, we compare the performance of seven recent methods to detect selective sweeps. We include methods that were developed to study a single population, a pair of populations or multiple populations. We explain in detail the ability of each method to capture the signal of selection left by both hard and soft sweeps under different scenarios of structured populations and a range of parameter values (migration and selection). The principle is to examine these methods on the same simulated data sets and draw conclusions about how the different model parameters affect their performance as described by power and false discovery rate (FDR). The goal of this analysis is to guide scientists in choosing the method best suited to their biological model.
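For readers unfamiliar with the two performance measures, the sketch below shows how power and FDR could be computed on simulated data where the truly selected loci are known. The function name and the boolean-list input format are assumptions for illustration, not taken from the study.

```python
def power_and_fdr(is_selected, is_detected):
    """Compute (power, FDR) from per-locus truth and detection flags.

    power = true positives / all truly selected loci
    FDR   = false positives / all detected loci
    By convention here (an assumption), FDR is 0.0 when nothing is detected.
    """
    if len(is_selected) != len(is_detected):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(is_selected, is_detected))
    tp = sum(1 for s, d in pairs if s and d)
    fp = sum(1 for s, d in pairs if not s and d)
    fn = sum(1 for s, d in pairs if s and not d)
    if tp + fn == 0:
        raise ValueError("no truly selected loci: power is undefined")
    power = tp / (tp + fn)
    fdr = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    return power, fdr


# Toy simulation outcome: two selected loci, one of them detected,
# plus one neutral locus wrongly detected.
truth = [True, True, False, False, False]
calls = [True, False, True, False, False]
power, fdr = power_and_fdr(truth, calls)
```

Reporting both numbers matters because a method can reach high power simply by flagging many loci, which inflates the FDR.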
Ways to distinguish between purifying selection and selective sweep - Biology
Variation in human skin and hair color is one of the most striking aspects of human variability, and explaining this diversity is one of the central questions of human biology. Only in the last decade or so has it been realistically possible to address this question experimentally using population genetic approaches. On the basis of earlier studies in mice, and on studies in humans with various Mendelian disorders, many of the genes underpinning population variation in skin color have been identified. More recently, genome-wide approaches have identified other loci that appear to contribute to pigmentary variation. The ability to study sequence diversity from world populations has allowed examination of whether the observed variability is due to random genetic drift or is a result of natural selection. The genetic evidence taken as a whole provides strong evidence for natural selection, functioning so as to increase pigment diversity across the world's populations. Future larger studies are likely to provide more details of this process and may provide evidence for exactly which mechanistic pathways have mediated selection.
Several recent studies have confirmed that mitochondrial DNA variation and evolution are not consistent with the neutral theory of molecular evolution and might be inappropriate for estimating effective population sizes. Evidence for the action of both positive and negative selection on mitochondrial genes has been put forward, and the complex genetics of mitochondrial DNA adds to the challenge of resolving this debate. The solution could lie in distinguishing genetic drift from ‘genetic draft’ and in dissecting the physiology of mitochondrial fitness.
Additional file 1: Figure S1.
Number of SNPs in 500-kb windows for the LRH test.
Additional file 2: Figure S2.
Number of SNPs in 500-kb windows for the XP-EHH test.
Additional file 3: Figure S3.
PCA analysis on Chinese Holstein population (HOL_CHI) with the world reference dataset and the Simmental reference population included in WIDDE.
Additional file 4: Figure S4.
PCA analysis on Chinese Simmental population (SIM_CHI) with the world reference dataset and the Simmental reference population included in WIDDE.
Additional file 5: Table S1.
The top 5 genetically closest populations to Chinese Holstein and Simmental individuals in breed assignment analyses.
Additional file 6: Figure S5.
Genome-wide distribution of SNP-based LRH values for Holstein and Simmental. The dashed line indicates the threshold for the LRH test (LRH > 2.6).
Additional file 7: Figure S6.
Genome-wide distribution of 500-kb window-based maximum |XP-EHH| and average FST. The dashed line indicates the threshold for the FST test.
Additional file 8: Table S2.
Overlapping windows between LRH, XP-EHH and FST tests.
Additional file 9: Table S3.
Candidate genes used for functional annotation.
Additional file 10: Table S4.
DAVID analyses on candidate genes.
Additional file 11: Figure S7.
Rootgrams of the posterior class probability for FST values.
Additional file 12: Table S5.
Distribution of six components inferred by FlexMix analysis and number of SNPs in each component.
Additional file 13: Figure S8.
A graphical representation of pairwise D’ for the DGAT1 region (A) and GHR region (B).
Additional file 14: Figure S9.
Distribution of unweighted means of minor allele frequencies for non-genic and exonic SNPs in the low-MAF bin (0-0.05).