How to compare SNPs from genotyping results for multiple people with a known phenotype?

I'm looking at different genotyping profiles available at and am trying to compare profiles for people with different phenotypes.

For example, given 10 profiles of people who can roll their tongue and those who cannot, I am trying to identify what is different between these people and what's similar.

A typical record is in this format:

Rsid, chromosome, Position on chromosome, actual 2 letter mutation

How can such records be compared to identify similarities for a given phenotype? Does RSID uniquely identify a mutation, or is it a combination of rsid and position?

In other words, if one user has rsid 123 at position 555 and another one has rsid 123 at position 666, and the mutation is the same, is this a similarity, or does position matter?

Each rsid identifies a unique SNP in the genome. Thus there should not be any entries in the files that have the same rsid but different chromosomes and/or positions. If you do find this, it is likely that your data come from different versions of the genome assembly.
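A minimal sketch of that consistency check, assuming each profile is a header-free text file with comma-separated `rsid, chromosome, position, genotype` records as described in the question. The function names and the sample data are hypothetical.

```python
import csv
from collections import defaultdict

def parse_profile(lines):
    """Parse 'rsid, chromosome, position, genotype' records into a dict."""
    records = {}
    for row in csv.reader(lines):
        if not row or row[0].lstrip().startswith("#"):
            continue  # skip blank lines and comment lines
        rsid, chrom, pos, genotype = (field.strip() for field in row[:4])
        records[rsid] = (chrom, int(pos), genotype)
    return records

def conflicting_rsids(profiles):
    """Return rsids that map to more than one (chromosome, position) pair."""
    seen = defaultdict(set)
    for profile in profiles.values():
        for rsid, (chrom, pos, _genotype) in profile.items():
            seen[rsid].add((chrom, pos))
    return {rsid: locs for rsid, locs in seen.items() if len(locs) > 1}

# Hypothetical data: rs123 sits at two different positions in the two files.
alice = parse_profile(["rs123, 1, 555, AG", "rs456, 2, 1000, CC"])
bob = parse_profile(["rs123, 1, 666, AG", "rs456, 2, 1000, CT"])
print(conflicting_rsids({"alice": alice, "bob": bob}))
```

Any rsid reported here suggests the files were produced against different genome builds and should be brought to a common build before genotypes are compared.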

To find genetic variants associated, i.e. correlated, with your trait, you need to focus on the genotype at each of the SNPs. For example, SNP rs3094315 is an A/G polymorphism, i.e. an individual can either have the genotype AA, AG, or GG. To find if it is associated with tongue rolling, you would count the number of AA, AG, and GG individuals (alternatively you could sum the number of A and G alleles) in the group that can roll their tongue and compare those numbers to the group that cannot. You would then repeat this for each SNP that you want to test. For most SNPs, there will be no difference in proportion.
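The allele-counting comparison described above can be sketched as follows. This is an illustration only: the genotypes are made up, the function names are hypothetical, and a Pearson chi-square test on a 2x2 allele table is just one reasonable choice. With only ten profiles per group the chi-square approximation is unreliable, and an exact test would be preferable.

```python
import math
from collections import Counter

def allele_counts(genotypes, alleles=("A", "G")):
    """Count each allele across two-letter genotypes such as 'AG'."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)  # counts the individual letters
    return [counts[a] for a in alleles]

def chi_square_2x2(row1, row2):
    """Pearson chi-square (1 d.f., no continuity correction) for a 2x2 table."""
    a, b = row1
    c, d = row2
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # an empty row or column: nothing to compare
    stat = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square tail for 1 d.f.
    return stat, p_value

# Hypothetical genotypes at one SNP for the two phenotype groups.
rollers = ["AA", "AG", "AA", "AG", "AA"]
non_rollers = ["GG", "AG", "GG", "GG", "AG"]
stat, p = chi_square_2x2(allele_counts(rollers), allele_counts(non_rollers))
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

The same two functions would be applied to every SNP in turn, which is where the multiple-testing problem mentioned in the answer comes from.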

That being said, this is not a modest undertaking, as association testing is a cornerstone of human and statistical genetics. There are many issues that could strongly bias your results (e.g. population stratification, genotyping error, linkage disequilibrium, multiple testing). To learn more, you can read about the Cochran-Armitage trend test and the PLINK software. If you are really serious, I recommend Applied Statistical Genetics with R by Andrea S. Foulkes.
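For orientation, here is a small sketch of the Cochran-Armitage trend test on genotype counts, assuming additive weights 0, 1, 2 for the AA, AG and GG genotypes. The counts are invented and the function name is hypothetical. In practice PLINK or a statistics package would be used rather than hand-rolled code.

```python
import math

def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Trend test for a 2 x k table of genotype counts.

    cases, controls: counts per genotype, e.g. (n_AA, n_AG, n_GG).
    Returns the z statistic and a two-sided p-value.
    """
    r_total = sum(cases)
    s_total = sum(controls)
    n_total = r_total + s_total
    col_totals = [r + s for r, s in zip(cases, controls)]
    # T = sum_i w_i * (r_i * S - s_i * R)
    t_stat = sum(w * (r * s_total - s * r_total)
                 for w, r, s in zip(weights, cases, controls))
    sum_w2n = sum(w * w * n for w, n in zip(weights, col_totals))
    sum_wn = sum(w * n for w, n in zip(weights, col_totals))
    # Var(T) = (R * S / N) * (N * sum(w^2 n) - (sum(w n))^2)
    variance = r_total * s_total / n_total * (n_total * sum_w2n - sum_wn ** 2)
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts of AA, AG, GG in each phenotype group.
z, p = cochran_armitage(cases=(10, 20, 30), controls=(30, 20, 10))
print(f"z = {z:.3f}, p = {p:.2e}")
```

The function assumes both groups are non-empty and that the SNP is polymorphic, otherwise the variance is zero.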

Replicating genotype–phenotype associations

What constitutes replication of a genotype–phenotype association, and how best can it be achieved?

The study of human genetics has recently undergone a dramatic transition with the completion of both the sequencing of the human genome and the mapping of human haplotypes of the most common form of genetic variation, the single nucleotide polymorphism (SNP) 1,2,3 . In concert with this rapid expansion of detailed genomic information, cost-effective genotyping technologies have been developed that can assay hundreds of thousands of SNPs simultaneously. Together, these advances have allowed a systematic, even 'agnostic', approach to genome-wide interrogation, thereby relaxing the requirement for strong prior hypotheses.

So far, comprehensive reviews of the published literature, most of which reports work based on the candidate-gene approach, have demonstrated a plethora of questionable genotype–phenotype associations, replication of which has often failed in independent studies 4,5,6,7 . As the transition to genome-wide association studies occurs, the challenge will be to separate true associations from the blizzard of false positives attained through attempts to replicate positive findings in subsequent studies. The purpose of a replication study is to evaluate a positive finding from a previous study, to provide credibility that the initial finding is valid. Replication is essential for establishing the credibility of a genotype–phenotype association, whether derived from candidate-gene or genome-wide association studies. However, there is a lack of agreement about what constitutes a finding deserving of replication, what constitutes an adequate replication study and what constitutes a replication or refutation.

Investigators and journal editors have offered guidelines for how to address this problem 8,9,10,11,12 , but these initial efforts have been hampered by limited experience and conflicting empirical data. However, as evidence has accumulated, several instructive examples have emerged of genotype–phenotype associations being reproduced reliably in follow-up studies. These include peroxisome proliferator-activated receptor-γ (PPARG) 13 and the transcription factor TCF7L2 (refs 14–19), related to diabetes; nucleotide-binding oligomerization domain containing 2 (NOD2) and Crohn's disease 20,21,22 ; complement factor H (CFH) and age-related macular degeneration 23,24,25,26 ; and chromosome region 8q24 and prostate cancer risk 27,28,29,30,31 .

Many instances have arisen in which initial findings have not been reproduced in follow-up studies because of issues in either the initial study or the attempted replication 4,5,6,32,33 . Small sample size is a frequent problem and can result in insufficient power to detect minor contributions of one or more alleles. Similarly, small sample sizes can provide imprecise or incorrect estimates of the magnitude of the observed effects. Poor study design — particularly a lack of comparability between cases and controls — can increase the risk of biases because there can be heterogeneity in exposure to environmental challenges and population stratification. The latter arises when investigators fail to account for case–control differences in the genetic structure of the underlying population. Heterogeneity in classification of outcomes across studies can undermine the opportunity to compare among them. Similarly, data 'dredging' can be a major problem, especially when criteria for defining phenotypes are altered to achieve statistical significance worthy of publication.

Another challenge arises when follow-up studies analyse different variants. An example is the reported association between DTNBP1 and schizophrenia, initially identified in Irish pedigrees 34 and 'confirmed' in independent European studies 35 . Unfortunately, different risk alleles and haplotypes were reported in each study, making comparison difficult 36,37,38,39 . Although it is plausible that more than one variant could contribute to schizophrenia risk at the DTNBP1 locus, it is difficult to draw this conclusion from the literature because follow-up studies have not consistently analysed the same markers or those in perfect linkage disequilibrium (r² = 1.0). Other recent examples for which initial reports of association have been inconsistently replicated include insulin-induced gene 2 (INSIG2) and obesity 40,41,42,43,44 , and cyclic-AMP-specific phosphodiesterase (PDE4D) and stroke 45,46 . These have been accompanied by controversies about what actually constitutes replication.
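To make the r² measure concrete, the sketch below computes it for two biallelic SNPs from phased haplotype counts, using the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)) with D = pAB − pA pB. The counts are invented and the function name is hypothetical. Unphased genotype data would first need haplotype frequencies to be estimated, which is not shown.

```python
def ld_r_squared(n_ab, n_a_only, n_b_only, n_neither):
    """r^2 between two biallelic SNPs from phased haplotype counts.

    n_ab:      haplotypes carrying allele A at SNP 1 and allele B at SNP 2
    n_a_only:  haplotypes carrying A but not B
    n_b_only:  haplotypes carrying B but not A
    n_neither: haplotypes carrying neither A nor B
    """
    total = n_ab + n_a_only + n_b_only + n_neither
    p_a = (n_ab + n_a_only) / total
    p_b = (n_ab + n_b_only) / total
    d = n_ab / total - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom

print(ld_r_squared(50, 0, 0, 50))    # markers in perfect LD
print(ld_r_squared(40, 10, 10, 40))  # correlated but not interchangeable markers
```

Only in the first case would testing one marker be equivalent to testing the other, which is the condition the paragraph above sets for comparing follow-up studies.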

This paper presents the conclusions of a working group on the replication of genotype–phenotype associations — whether identified in genome-wide or candidate-gene studies — convened by the National Cancer Institute and the National Human Genome Research Institute. The group was composed of experts from diverse disciplines, including biostatistics, clinical medicine, epidemiology, genetics and scientific publishing. The purpose was to review the current state of the field and propose best practices for the design, conduct and publication of replication studies that aim to follow up notable findings, particularly in genome-wide association studies. The group addressed three topics. First, assessment of the validity and limitations of any single genetic association study. Second, criteria for establishing replication in genetic association studies. Third, points to consider for publication of high-quality genotype–phenotype association reports (Box 1).

Initial association studies

The initial study of any association represents an important discovery tool. In the near future, it is unlikely that a single study will unequivocally establish a valid genotype–phenotype association and not require replication. A number of points relating to the study design and reporting should be considered in determining whether a finding in an initial genome-wide or candidate-gene study merits follow-up replication studies (Box 2). Attempts to replicate a reported association are often complicated by lack of methodological detail in the initial report or lack of methodological rigour in the original study.

Because of the enormous number of genotype–phenotype associations tested in each genome-wide study, spurious associations will substantially outnumber true ones unless rigorous statistical thresholds are applied. Although no universal threshold can be specified for statistical significance in all circumstances, smaller P-values generally provide greater support for a true association. Extremely small P-values should be interpreted carefully, however, until completion of replication studies, because many can be due to inappropriate reliance on asymptotic distributions of test statistics, or to technical artefact or genotype errors that are distributed differently between cases and controls. Cluster plots for highly significant markers should be examined carefully. It may be desirable to include confirmatory data from a second genotyping technology in the initial report to verify genotype accuracy. Cases and controls should be drawn from populations that are generally comparable both in terms of genetic background and environmental exposures 47 , and should be analysed for confounding population stratification. This may require genotyping of ancestry informative markers (AIMs), which should be strongly encouraged as genotype costs fall and AIMs become increasingly well-characterized within marker sets. Family-based studies are robust to population stratification when analysed with transmission disequilibrium methods 48 . They may be particularly valuable in the initial study if there is evidence for ethnic differences in the genetic effect of a trait, although at the cost of increased genotyping. Cautious interpretation is required either if significance is observed only for unusual or highly specific phenotypes (especially if they represent a small proportion of the study sample) or if significance depends on a particular analytical method that is not publicly available for confirmation.
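As an illustration of the transmission disequilibrium approach mentioned above, the basic test for a biallelic marker compares how often heterozygous parents transmit one allele rather than the other to affected offspring, using a McNemar-type statistic (b − c)² / (b + c). The sketch below assumes the transmissions have already been counted from parent-offspring trios. The counts and the function name are hypothetical.

```python
import math

def tdt(transmitted, untransmitted):
    """Basic transmission disequilibrium test for one biallelic marker.

    transmitted:   heterozygous parents who passed the test allele
                   to an affected child
    untransmitted: heterozygous parents who passed the other allele
    Returns the chi-square statistic (1 d.f.) and its p-value.
    """
    informative = transmitted + untransmitted
    if informative == 0:
        return 0.0, 1.0  # no heterozygous parents, so no information
    stat = (transmitted - untransmitted) ** 2 / informative
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts from 100 informative parental transmissions.
stat, p = tdt(60, 40)
print(f"TDT chi-square = {stat:.2f}, p = {p:.4f}")
```

Because each parent serves as their own control, differences in allele frequency between subpopulations do not inflate the statistic.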

Approaches for dealing with multiple comparisons are beyond the scope of this report, but more robust methods are clearly needed 49 . Permutation testing is an effective strategy to address the problem of multiple comparisons, especially if a large number of phenotypes are being analysed. Many methods for addressing the problem of multiple comparisons invoke a conservative approach, namely a standard Bonferroni correction, which assumes the independence of all tests performed. In many association studies, markers are not independent because they are in linkage disequilibrium, and so a standard Bonferroni correction is overly conservative. Lowering the threshold for calling a finding of particular variants — such as non-synonymous coding SNPs — positive in the analysis scheme (weighting) has merit but must be declared before initiation of the analysis and not once the analysis has begun 49,50 . The number of variants for which there is either credible laboratory evidence or a validated in silico prediction a priori is quite small. However, the temptation to create a credible biological hypothesis post hoc can be quite strong.
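The contrast between the two strategies can be sketched in a few lines. The example assumes genotypes coded as allele dosages (0, 1 or 2) and a simple difference-in-means statistic. All names and data are hypothetical. The permutation routine shuffles only the phenotype labels, so correlation between markers is preserved and the p-value for the strongest marker is adjusted for the effective number of tests rather than the nominal one.

```python
import random

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def mean_difference(dosages, labels):
    """Absolute difference in mean dosage between cases (1) and controls (0)."""
    cases = [d for d, y in zip(dosages, labels) if y == 1]
    controls = [d for d, y in zip(dosages, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def max_stat_permutation_p(dosages_by_snp, labels, n_perm=1000, seed=0):
    """Family-wise p-value for the strongest SNP via label permutation."""
    rng = random.Random(seed)
    observed = max(mean_difference(d, labels) for d in dosages_by_snp)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link only
        best = max(mean_difference(d, shuffled) for d in dosages_by_snp)
        if best >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one form avoids a p-value of zero

print(bonferroni_threshold(0.05, 500_000))  # threshold for 500,000 SNPs

labels = [1] * 10 + [0] * 10
snp1 = [2] * 10 + [0] * 10                   # strongly associated marker
snp2 = [1, 0, 1, 2, 1, 1, 0, 1, 2, 1] * 2    # unrelated marker
print(max_stat_permutation_p([snp1, snp2], labels, n_perm=200))
```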

At present, many studies are barely powered to identify, much less to establish, associations of common alleles of weak effect in complex diseases 51,52 . Recently, appreciation of this crucial issue has led to larger, more definitive studies, such as the Cancer Genetic Markers of Susceptibility (CGEMS) project and the Wellcome Trust Case Control Consortium (WTCCC). An estimated large effect (that is, with an odds ratio greater than 2) in a well-powered study can lend credence to an association, because unknown confounding factors are less likely to produce large effects 53 . Unfortunately, many risk variants contribute less than this. Small studies are prone to large variation in risk estimates, of which only selected strong positives are initially detected and reported. Furthermore, the estimate of the effect declines as replication studies are pursued, a phenomenon known as 'winner's curse' 54,55 .
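For reference, an odds ratio and an approximate 95% confidence interval can be computed from a 2x2 table of carrier status by case status. The sketch uses the usual log-odds standard error (Woolf's method), and the counts are invented for illustration. The width of the interval shows directly why small studies give unstable effect estimates.

```python
import math

def odds_ratio_ci(case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers, z=1.96):
    """Odds ratio with a Wald-type confidence interval on the log scale.

    Assumes all four cell counts are positive.
    """
    a, b, c, d = (case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers)
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical study: 100 cases and 100 controls.
or_small, lo_small, hi_small = odds_ratio_ci(40, 60, 25, 75)
print(f"OR = {or_small:.2f} (95% CI {lo_small:.2f} to {hi_small:.2f})")

# The same proportions with ten times the sample size give a tighter interval.
or_big, lo_big, hi_big = odds_ratio_ci(400, 600, 250, 750)
print(f"OR = {or_big:.2f} (95% CI {lo_big:.2f} to {hi_big:.2f})")
```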

Consortial studies comprised of multiple independent studies combined into a pooled analysis can be viewed as a practical approach that overcomes many of the disadvantages of a disconnected set of underpowered studies. In addition, consortia may meet the need for rapid replication by achieving sufficiently large sample size 40,56 . Collaborations among multiple independent studies can offer important advantages over a single large study, particularly regarding the generalizability of findings observed in multiple studies that typically have greater diversity of populations and/or exposures.

As far as possible, similarly rigorous criteria should be considered for evaluation of genotype–phenotype association studies with limited or no availability of subjects for replication, such as studies of rare diseases or severe toxicity due to therapy or environmental exposures. In these circumstances, additional information gathered from laboratory techniques, bioinformatic tools and a priori biological insight should be used to provide plausibility for interpreting genetic association findings. The expectation for demonstrated replication might be relaxed if it is unethical to attempt replication — such as in studies that link genetic variation with adverse effects of therapy or environmental exposure (for example, benzene or cigarette smoke). Similarly, the public health impact of a finding may lessen the stringency of expectation for replication before initial publication — for example, in an urgent situation in which effective intervention is available and can be readily implemented.

Genotype–phenotype associations that have been replicated widely have often used clearly defined phenotypes classified by standard and widely-accepted criteria, such as diabetes and age-related macular degeneration 57,58 . Use of accepted criteria should reduce misclassification rates 59 . Some association studies have reported intermediate phenotypes (known as endophenotypes) but have provided little detail on the actual measure or its reliability 60 . In the absence of standard criteria, sufficient detail should be provided for both the definition of the phenotypes investigated and assessment of their validity and comparability across studies.

Replication of initial studies

To establish a positive replication of a genotype–phenotype association, many of the same considerations important for genome-wide association or candidate-gene studies should be fulfilled (Box 3). In replication studies, every effort should be made to analyse phenotypes comparable to those reported in the initial study. In the first attempt to replicate a finding, comparable populations should be analysed not only for the main effect but also to guard against confounding population stratification, either in the initial or replication studies 61,62 . Because many initial studies and replication studies have been reported in populations of European descent, the challenge remains to extend the studies to other populations. It has already been shown that many variants that have a significant association with disease in several studies in one population may not necessarily have the same association in another (such as TCF7L2 in West Africa and East Asia 18,63,64 ; in this case, it has provided an opportunity to refine the signal to a restricted region). In some circumstances, it might be impossible to conduct follow-up studies because of the uniqueness of a study population or the lack of availability of additional subjects for replication. If replication is not an option, interpretation of association findings could be supplemented by biological insights derived from the laboratory.

Evaluation of an association in populations of different ancestry from that of the initial report would generally be expected, because genomic variation is greater when compared across populations, and should increase confidence in the finding. By contrast, failure to replicate in a population different from that of the initial report does not necessarily invalidate the original finding. In some cases, the differences in linkage disequilibrium relationships across populations can be used to narrow the region of interest for later genetic and possible functional analysis. Owing to their robustness to population stratification, as noted above, family-based studies can also serve as valuable replication studies for notable findings 48 .

Reports of attempts at replication should distinguish between tests of the same SNP as in the original study, SNPs in strong linkage disequilibrium with the reported SNP, and other SNPs that were genotyped to search for additional variants associated with disease in the region (Fig. 1). In some circumstances, the initial study might have identified a marker that is not in strong linkage disequilibrium with the causal variant, which could lead to a false refutation in a different population, whereas testing additional SNPs in the region might reveal another association worthy of follow-up. For clarity, if new, previously untested SNPs are included, they should be clearly identified and the rationale for their inclusion explicitly stated. If differences in linkage disequilibrium patterns across populations are used to invoke an association at a new marker but not at the originally tested marker, the different linkage disequilibrium patterns should be empirically demonstrated in the appropriate populations and shown to be a plausible and consistent explanation for both the new and original results. Otherwise, the new association cannot be considered a replication.

Figure 1: Black diamonds represent four single nucleotide polymorphisms (SNPs rs11200014, rs2981579, rs1219648 and rs2420946) for which associations with breast cancer were replicated in multiple studies 73,74 . Estimates of the square of the correlation coefficient (r²) were calculated for each pairwise comparison of SNPs in the initial genome-wide association study across the FGFR2 region 73 . The log10 r² values are colour-coded.

Publication of associations

The evaluation of a publication addressing one or more genotype–phenotype associations is a daunting task in the age of large, dense datasets. To this end, published genome-wide association reports should include detailed descriptions of design, genotyping and statistical methods, and results, even if available only through online supplements, or perhaps in a separate journal. A checklist of key possible issues is provided in Box 1 — this could be used as a guide for authors, editors, reviewers and the general readership.

It is a challenge to make the case for the importance of the replication finding(s) without exaggerating the significance of the observation. Remarks about possible follow-up of genetic markers and corroborative studies to investigate plausibility should be brief and well referenced. Authors should practise sound judgement and temper enthusiasm based on prior publications (especially from the same investigative group), particularly if the replication study results differ from those of the initial study. Disclosure of known previous attempts to replicate the reported findings, whether positive or negative, by the authors or others is important for interpreting the replication study.

Although it is desirable for the initial report of a genotype–phenotype association to include adequately powered replication studies, requiring replication with every initial study may not be necessary, as long as the preliminary nature of a study without replication is emphasized. Such studies can still provide valuable information if the entire set of results is made available, and releasing such results before replication would be of value to the field. However, there is substantial added value in presenting robust findings based on an initial scan together with follow-up replication, and an appropriate balance is needed that facilitates rapid publication of valid findings and encourages collaboration 19,65 . If replication studies are included, each should be described or referenced in the same detail as the initial study and should include the results for all SNPs tested at each stage. As noted above, replication studies should preferably investigate the same or a very similar phenotype.

In many cases, the follow-up study will fail to replicate the initial results. Such findings are valuable for distinguishing false-positives from the true-positive signals that should be pursued for putative causal variants. The preference for publishing positive findings, even if derived from suboptimal studies, presents a formidable barrier to the dissemination of well-conducted negative studies. Failure to disseminate results from well-conducted negative studies withholds essential pieces of evidence for investigators who may be deciding whether to launch a follow-up study to replicate or to extend the original study. Thus, high-quality instances of 'meaningful negativity' are useful and should be reported succinctly in the literature. Criteria for a meaningful negative replication study are the same as those for a positive study (Box 3), with the added requirements that the same trait should be studied in a population of comparable underlying structure with sufficient power to measure the appropriate effect size and yield a negative result.

Negative studies are difficult to publish but they are crucial for separating true-positive from false-positive findings. Journals are strongly encouraged to publish high-quality negative studies refuting earlier positive reports of genotype–phenotype associations. The journal in which the initial scan is published is encouraged to solicit and publish well-conducted follow-up studies within a specified time frame, perhaps between 3 and 9 months of the initial report. A case in point is the recent collection of reports published by The American Journal of Human Genetics 66,67,68,69,70,71 that failed to replicate the initial findings of a genome-wide association study on Parkinson's disease. A handful of journals (such as Cancer Epidemiology, Biomarkers and Prevention and the new PLoS series 72 ) currently feature well-conducted negative reports, and such efforts are to be lauded. The value of a well-executed negative study cannot be overemphasized; more venues are needed to capture these valuable results.

Although there are challenges to making data on individual research participants available to other investigators, every effort should be made to provide researchers with an opportunity to reproduce the reported results and to investigate new hypotheses and methods. To facilitate this research in genome-wide association studies, a public data archive known as the Database of Genotypes and Phenotypes, or dbGaP, has been established at the National Library of Medicine's National Center for Biotechnology Information and will be used by many National Institutes of Health (NIH)-supported studies. dbGaP will provide study documentation and aggregated genotype and phenotype data through its website with no account or authorization required. Access to individual, de-identified genotype and phenotype data will require an authorization and approval process that is currently under development. Whether through dbGaP or other venues, genotype summaries of computed analyses should be published online unless there are strong reasons not to do so, such as data derived from special populations (that is, isolated populations or minority communities) or other groups that will not permit such sharing. There are substantial informatic challenges for data presentation and data archiving, especially on public and journal websites. Best practices for retrieval and analysis of such data continue to evolve.

The history of genotype–phenotype association studies has focused on initial discoveries as opposed to careful replication. Earlier attention to the appropriate design of subsequent replication studies might have helped limit the plethora of false-positive results. Determination of valid genotype–phenotype associations presents a series of challenges that will require a logical strategy for conducting well-designed studies, based on excellent quality control practices interwoven with sound analytical methods and judicious interpretation. Other than the obvious differences in the drawbacks involved in multiple comparisons, standards for assessing the validity of the initial findings of a genotype–phenotype association should not differ substantially between the candidate-gene approach and genome-wide association studies. As experience accumulates, we can look forward to methodological advances that will facilitate our interpretation of studies, such as continued improvement of proposed methods for lowering the threshold for positive findings, adjustments for population structure, and exploitation of linkage disequilibrium structure in a candidate region.

The best practices suggested here for reporting initial and replication studies are based on sufficient disclosure of study methods to permit independent confirmation of study findings. Often a sequence of studies will be required to establish a valid genotype–phenotype association, perhaps involving several rounds of replication studies. And, of course, the conclusive demonstration of a replicated association represents only the beginning of the process towards finding the causal genetic variant(s). Labour-intensive and costly investigation will subsequently be required to sequence the candidate interval in depth, genotype all the common and perhaps uncommon variants that are markers for the outcomes of interest in multiple population samples, understand their functional consequences, examine their potential interactions with other genes or environmental factors, and devise strategies for preventative or therapeutic interventions. None of these steps should proceed far, however, without conclusive replication of findings from an initial genotype–phenotype association study.

Box 1: Points to consider in genotype–phenotype association reports

This checklist is intended to serve as a guide for authors, journal editors and referees to allow clear and unambiguous interpretation of the data and results of genome-wide and other genotype–phenotype association studies.

A detailed description of the study design and its implementation

The source of cases and controls (or cohort members, if based on cohort design), including time period and location(s) of subject recruitment

Methods for ascertaining and validating affected or unaffected status and reproducibility of classification

Participation rates for cases, controls or cohort members

Presentation of case and control selection in a flow chart, including exclusion points for missing and erroneous data (possibly as supplementary tables)

Initial table comparing relevant characteristics (such as demographics, risk factors and exposures) of cases and controls

Success rate for DNA acquisition, including comparisons of those with and without collection, extraction failures and exclusions due to inconsistent data

Statement on availability of results and data so that, as far as possible, others can analyse them independently

Links to supplemental online resources and database accession numbers

Genotyping and quality control procedures

Sample tracking methods, such as bar-coding, to ensure accuracy of analysis

Description of genotyping assays and protocols, particularly when new or applied in a non-standard method

Description of genotyping calling algorithm

Genotype quality control design for samples, including numbers, plating locations, selection criteria for:

External control samples from standard accepted sets (such as HapMap)

Internal control samples (duplicate samples; it should be specified whether these are from the same or different DNA collection, extraction or aliquot)

Assay and DNA quality metrics by locus, sample, plate or 'batch'

Average error rates estimated by internal duplicates or external samples

Assay reproducibility: concordance for performance of extraction, aliquoting (internal control samples) and assay reproducibility

Concordance with published or previously generated genotypes

Mendelian consistency checks if related individuals are present

Detection of inconsistent or cryptic relatedness in study subjects

Evaluation of deviations from Hardy–Weinberg proportions to detect failed assays or large-scale stratification (for example, testing Hardy–Weinberg equilibrium 'violations') separately in cases and controls

Assessment of population heterogeneity, including

Average or median value of chi-square and full distribution

Q–Q plots of chi-square analysis and P-values (with specific description of type of test used to generate the values)

Validation of most critical results on an independent genotyping platform

Analysis methods in sufficient detail to reconstruct the analytical approach and reproduce all reported results

Description of any pre-analysis weighting scheme for selecting variants for replication

Simple single-locus and multi-marker (haplotype) association analyses

Genetic models tested (unconstrained genotype effects — dominant, additive, multiplicative or trend)

Graphical display of genotype clustering for assays of high interest

Verification of results at highly correlated loci

Discussion of choice of threshold for significance and the statistical basis for any adjustment for multiple testing and the relationship to overall study power

Significance of any known 'positive controls' (that is, loci established in previous genetic associations)

Consistency of results before and after application of quality control filters

Description of replication samples, including source, ascertainment and comparability to initial sample

Discussion of choice of threshold for significance and the statistical basis for any adjustment for multiple testing and the relationship to overall study power

Summary of replication and analysis attempts by authors

Summary of all known replication attempts by others, including non-replications

Genotyping data and specifications for deposition in standard databases

Availability of 'raw' genotype data in the technology and vendor format, consistent with the requirements or restrictions imposed by funding agencies or informed consent

Data extraction and processing protocols

Normalization, transformation and data selection procedures and parameters

Points for reviewers and authors to consider regarding priority for publication

Suitably large sample size

Sufficiently stringent criteria for significance (small P-values)

High quality of study design, including selection of study population, reliability of phenotypes, measurement and adjustment for potential confounders

Discussion and conclusions commensurate with sample size, power, P-value and epidemiological quality of study design

Quality control standards used, including assessment of genotype quality and completeness

Usefulness of observations to others for subsequent research

Value of initial hypothesis described

Brief presentation of implications, especially as they relate to further follow-up both of genetic markers and for corroborative studies to investigate plausibility

Explanations of notable findings

Appropriate alternative explanations proposed and briefly discussed

Biological or functional explanations based firmly on available data

Box 2: Suggested criteria for establishing the soundness of an initial association report

These criteria are intended for studies of genotype–phenotype associations assessed by genome-wide or candidate-gene approaches.

Statistical analyses demonstrating the level of statistical significance of a finding should be published or at least available so that others can attempt to reproduce the reported results

Explicit information should be provided about the study's power to detect a range of effects

The study should be epidemiologically sound, with careful accounting for potential biases in selection of subjects, characterization of phenotypes, comparability of environmental exposures (when possible) and underlying population structure in cases and controls

Phenotypes should be assessed according to standard definitions provided in the report

Associations should be consistent (within the range of expected statistical fluctuation) and reported for the same phenotypes across study subgroups or across similar phenotypes in the entire study group

Significance should not depend on altering the quality control methods beyond standard approaches that could change inclusion or exclusion of large numbers of samples or loci

Measures to assess the quality of genotype data should include results of known study sample duplicates or publicly available samples

The results for concordance between duplicate samples (if applicable) as well as completion and call rates per SNP and per subject should be disclosed, along with rates of missing data

A subset of notable SNPs should be evaluated with a second technology that verifies the same result with excellent concordance, because no technology is error-free

Associations with nearby SNPs in strong linkage disequilibrium with the putatively associated SNP should be reported (and should be similar)

The results of replication studies of previous findings should be reported even if the results are not significant

Testing for differences in underlying population structure in case and control groups should be performed and reported

Appropriate correction for multiple comparisons across all statistical tests examined should be reported. Comparison to genome-wide thresholds should be described. Similarly, for bayesian approaches, the choice of prior probabilities should be described
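The call-rate and duplicate-concordance criteria in Box 2 can be illustrated with a small sketch. The data layout (a subjects-by-SNPs list of lists with `None` for a missing call) and the function names `call_rates` and `concordance` are assumptions made for this example; the guidelines do not prescribe any particular implementation.

```python
def call_rates(genotypes):
    """Return (per_subject, per_snp) call rates for a subjects x SNPs matrix.

    genotypes[i][j] is the genotype of subject i at SNP j, or None when the
    call is missing (hypothetical layout chosen for this sketch).
    """
    n_subjects = len(genotypes)
    n_snps = len(genotypes[0])
    per_subject = [
        sum(g is not None for g in row) / n_snps for row in genotypes
    ]
    per_snp = [
        sum(genotypes[i][j] is not None for i in range(n_subjects)) / n_subjects
        for j in range(n_snps)
    ]
    return per_subject, per_snp


def concordance(calls_a, calls_b):
    """Fraction of SNPs called in both duplicates with identical genotypes."""
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a is not None and b is not None]
    if not pairs:
        return float("nan")
    return sum(a == b for a, b in pairs) / len(pairs)


# Toy data: 4 subjects, 3 SNPs; subjects 0 and 3 stand in for a duplicate pair.
toy = [
    ["AA", "AG", None],
    ["AG", "GG", "CC"],
    ["AA", None, "CT"],
    ["AA", "AG", "CC"],
]
subject_rates, snp_rates = call_rates(toy)
duplicate_concordance = concordance(toy[0], toy[3])
```

Reporting both per-subject and per-SNP rates matters because a low rate on one axis points to a poor sample, while a low rate on the other points to a poor assay.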

Box 3: Suggested criteria for establishing positive replication

These criteria are intended for follow-up studies of initial reports of genotype–phenotype associations assessed by genome-wide or candidate-gene approaches.

Replication studies should be of sufficient sample size to convincingly distinguish the proposed effect from no effect

Replication studies should preferably be conducted in independent data sets, to avoid the tendency to split one well-powered study into two less conclusive ones

The same or a very similar phenotype should be analysed

A similar population should be studied, and notable differences between the populations studied in the initial and attempted replication studies should be described

Similar magnitude of effect and significance should be demonstrated, in the same direction, with the same SNP or a SNP in perfect or very high linkage disequilibrium with the prior SNP (r² close to 1.0)

Statistical significance should first be obtained using the genetic model reported in the initial study

When possible, a joint or combined analysis should lead to a smaller P-value than that seen in the initial report 75

A strong rationale should be provided for selecting SNPs to be replicated from the initial study, including linkage-disequilibrium structure, putative functional data or published literature

Replication reports should include the same level of detail for study design and analysis plan as reported for the initial study (Box 1)
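Box 3 refers to SNPs in very high linkage disequilibrium (r² close to 1.0). As a rough illustration, r² between two biallelic SNPs can be computed from phased haplotypes. The 0/1 coding and the function name `ld_r2` are assumptions for this sketch; real studies usually estimate r² from unphased genotypes with dedicated software.

```python
def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic SNPs from phased haplotypes coded 0/1.

    hap_a[k] and hap_b[k] are the alleles carried by haplotype k at each SNP.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                                  # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                                  # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n   # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    return d * d / denom


snp1 = [1, 1, 0, 0, 1, 0, 1, 0]
snp2 = [1, 1, 0, 0, 1, 0, 1, 0]   # identical pattern: perfect LD
snp3 = [1, 0, 1, 0, 1, 0, 1, 0]   # partly correlated pattern

r2_perfect = ld_r2(snp1, snp2)
r2_partial = ld_r2(snp1, snp3)
```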


Geese possess strong/variable broodiness and poor egg-laying performance, which are affected by many factors, such as genetics, nutrition, environment and disease. As the heritability of reproduction is low, it is hard to improve reproductive traits using traditional selection methods. Marker-assisted selection (MAS) is an effective way to improve such traits with low heritabilities. However, mining trait-linked sequence variation or functional genes is needed for developing MAS strategies. Single nucleotide polymorphism (SNP) is the most abundant type of genetic marker, and its high genetic stability makes it ideal for studying the inheritance of genomic regions [1,2]. However, there is as yet no genome sequence data available for geese, which largely hinders research on any economic traits at the molecular level in this species.

The candidate gene approach is a common method for identifying genetic markers linked to important economic traits. Chen et al (2012) found more than 30 SNPs in Prolactin (PRL) intron 2, and 5 SNPs in Prolactin Receptor (PRLR) exon 10 in Wanjiang white geese. These polymorphisms were significantly related to egg productivity [3]. Zhao et al (2011) found two SNPs, on Gonadotropin-releasing Hormone (GnRH) and PRL respectively, that were associated with reproduction traits in Wulong geese [4]. Zhang et al (2013) demonstrated the gene expression of Luteinizing Hormone (LH), PRL and their receptors at different stages in Zi geese [5], and Ding et al (2006) identified many differentially expressed genes in livers of laying geese compared with prelaying geese using suppression subtractive hybridization (SSH). These genes included Vitellogenin I, apoVLDL-II, ethanolamine kinase, G-protein gamma-5 subunit, and leucyl-tRNA synthase [6]. Recently, Guo et al (2011) used a similar approach to find several differentially expressed genes between the laying and broodiness stages, including PRLR, estrogen receptor 1 and anti-mullerian hormone receptor II [7].

Next-generation high-throughput DNA sequencing techniques have accelerated animal genomic research. These techniques have been widely used in whole-genome sequencing, target resequencing, and transcriptome sequencing [8]. Most recently, Xu et al (2013) identified 572 differentially expressed genes, 294 up-regulated and 278 down-regulated, in the ovarian tissue libraries of laying and broody geese by de novo transcriptome assembly using short-read sequencing technology (Illumina) [9]. Unfortunately, the resultant transcriptome provided only limited restriction site information from coding regions, where nucleotide diversity is much lower than in non-coding regions.

Restriction-site associated DNA (RAD) sequencing, a newly developed method for rapid and large-scale SNP discovery, can effectively reduce the complexity of the genome [10]. It has become an economical and efficient method for SNP discovery and genotyping [11,12]. It allows smaller research groups, or groups studying organisms that do not yet possess a reference genome, to conduct “genome wide studies” [13]. The RAD sequencing approach has been successfully applied in a number of organisms, including guppy [14], salmon [15], Eurasian beaver [16], cutthroat and rainbow trout [17], sturgeon [18], and rapeseed [10].

In this study, we applied pool-based RAD sequencing to discover novel SNPs across the goose genome. Candidate SNPs for laying performance were selected by comparing allelic frequencies between the two DNA pools with lowest estimated breeding value (LEBV) and highest estimated breeding value (HEBV). Using an allele-specific PCR (AS-PCR) assay for individual-based genotyping, the candidate SNP-trait association pattern was first confirmed in the LEBV and HEBV cohorts, and further verified in the population of 492 geese. Novel genes harboring laying-related SNPs were cloned for geese.
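Comparing allelic frequencies between two pools can be sketched as a 2x2 table of allele read counts tested with Fisher's exact test. This is only an illustration: the read counts below are invented, the function name is hypothetical, and the excerpt does not say which statistic the authors actually used. Read counts from pooled DNA are also not independent draws of individuals, so a P-value obtained this way is best treated as a ranking score.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Invented read counts at one SNP: (allele A reads, allele G reads) per pool.
lebv_reads = (30, 10)
hebv_reads = (12, 28)

freq_lebv = lebv_reads[0] / sum(lebv_reads)
freq_hebv = hebv_reads[0] / sum(hebv_reads)
freq_diff = freq_lebv - freq_hebv
p_value = fisher_exact_two_sided(*lebv_reads, *hebv_reads)
```

SNPs with a large frequency difference and a small P-value would then be the candidates carried forward to individual genotyping, as the paragraph above describes.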

Genetic differentiation and diversity upon genotype and phenotype in cowpea (Vigna unguiculata L. Walp.)

The evolution of species is complex and subtle, and is always associated with genetic variation and environmental adaptation during active or passive spread or migration. In crops, this process is usually driven and influenced by human activities such as domestication, cultivation and immigration. One way to study this process is to analyze the genetic diversity of those crops in different regions. This research first assessed the similarity and differentiation between genotypic and phenotypic diversity in 768 worldwide cowpea germplasm accessions collected by USDA and US breeding programs. In total, 1048 single nucleotide polymorphisms (SNPs) derived from genotyping by sequencing (GBS) and 17 agronomic traits were used to analyze genetic diversity, distance, clustering and phylogeny. Group differentiation was analyzed based on both the genotype distances from the 1048 SNP markers and the phenotypic (Mahalanobis) distance D² from 11 traits. Consistent diversity results for genotype (polymorphism information content, PIC) and phenotype (Shannon and Simpson indices) indicated that the East Africa and South Asia sub-continents were the original and secondary regions of cowpea domestication. Both dendrograms built by genetic distance present relationships among different regions, and the Mantel coefficient showed a medium level of correlation (r = 0.58) between genotype and phenotype. Information on both genotypic and phenotypic differentiation may help us to understand the evolution and migration of cowpea more comprehensively and will also inform breeders how to use cowpea germplasm in breeding programs.
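The diversity measures named in the abstract can be written down compactly. The sketch below uses the standard PIC formula for a marker with given allele frequencies, plus Shannon and Simpson indices for a set of class frequencies. The Simpson index appears in several forms in the literature; the form used here (1 minus the sum of squared frequencies) is an assumption, since the abstract does not say which variant was applied.

```python
from math import log


def pic(freqs):
    """Polymorphism information content for one marker.

    PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2
    """
    homo = sum(p * p for p in freqs)
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return 1 - homo - cross


def shannon(freqs):
    """Shannon diversity index H = -sum(p_i * ln p_i)."""
    return -sum(p * log(p) for p in freqs if p > 0)


def simpson(freqs):
    """Simpson diversity in the 1 - sum(p_i^2) form (assumed variant)."""
    return 1 - sum(p * p for p in freqs)


# A biallelic SNP with equal allele frequencies reaches the maximum PIC of 0.375.
snp_pic = pic([0.5, 0.5])
# Phenotype classes (for example seed-coat colour categories) with made-up frequencies.
trait_classes = [0.5, 0.3, 0.2]
trait_shannon = shannon(trait_classes)
trait_simpson = simpson(trait_classes)
```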

The challenges currently facing agriculture, that is, rapid human population growth, climate change, and the need to balance increasing production with reduced environmental effects, make it necessary to optimize the use of available resources. Genomic data can be used to address these challenges by helping breeders to compare individual plant genomes and optimize the characterization, discovery, and use of functional genetic variation [38]. Germplasm banks around the world curate thousands of maize accessions that, in combination with genomic data, can be explored through GWAS or GS, and could potentially be used for improving agriculturally significant quantitative traits. Inexpensive methods to obtain dense genetic marker information on large samples of germplasm are needed to take full advantage of this tremendous resource [39].

The enormous progress in sequencing technologies that has occurred over the past few years has allowed better understanding of the maize genome. High-density genome sequencing has been used to study maize diversity [4, 23–25]. In addition, several studies [39–42] have taken advantage of recently developed SNP genotyping arrays for maize, which have evolved quickly from only a few thousand SNPs to more than 50,000. Although high-density genome sequencing can provide a larger number of markers and a more accurate vision of the genome, its expense has restricted it to only a few hundred samples per study. SNP arrays are cheaper and can analyze larger samples of germplasm; however, diversity studies can be confounded by the fact that SNPs are developed using reference sources of diversity, which may cause an important ascertainment bias (Ganal et al. [19] describe an example with B73 and Mo17 in the maizeSNP50 chip). GBS has been shown to be a less expensive method for genotyping large numbers of samples, and provides many more SNPs than do SNP arrays. Although the use of a reference genome for calling SNPs from GBS data might cause bias and underestimate the amount of diversity from the groups more distant from the reference, the diversity picture obtained when analyzing the distance matrix seems to be closer to the expectations from simple sequence repeats studies [8], whole-genome sequencing, and maize domestication data [23] than that obtained with SNP arrays.

The percentage of missing data from GBS with enzymes such as ApeKI and the levels of coverage obtained here may be a problem for some applications, especially GWAS and GS. Although better coverage can be achieved with more repetitions of the samples, this will increase cost, and quickly reaches a point where there is little reduction in missing data with increased investment in repeated sequencing runs. Given the importance of PAV in maize [2, 3, 24, 43] some of the missing data are very probably due to the absence of some regions of the B73 genome in other inbred lines. As shown here, simple imputation procedures based on identifying the most similar haplotype can be used to supply some of those missing data, and this imputation may be sufficiently accurate provided that similar haplotypes are present in the sample of genotypes. This kind of procedure may work better as the total number of maize samples in the GBS database increases, but it may also cause over imputation of data that are actually biologically missing as a result of a PAV. Alternative methods for handling missing SNP data in GBS datasets include an approach that avoids using a reference genome, such as the one recently used for switchgrass [44], or one that genetically maps individual GBS sequence tags as dominant markers [13].
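The idea of imputing from the most similar haplotype can be shown with a deliberately simplified sketch. It compares whole samples rather than local haplotype windows, breaks ties by taking the first best donor, and uses invented data and a hypothetical function name, so it illustrates the principle only and is not the procedure used in the study.

```python
def impute_nearest(samples):
    """Fill missing calls (None) from the most similar other sample.

    Similarity is the fraction of matching alleles at sites called in both
    samples. Sites missing in the chosen donor stay missing.
    """
    imputed = []
    for i, target in enumerate(samples):
        best, best_score = None, -1.0
        for j, donor in enumerate(samples):
            if i == j:
                continue
            shared = [(t, d) for t, d in zip(target, donor)
                      if t is not None and d is not None]
            if not shared:
                continue
            score = sum(t == d for t, d in shared) / len(shared)
            if score > best_score:
                best, best_score = donor, score
        filled = [
            t if (t is not None or best is None) else best[k]
            for k, t in enumerate(target)
        ]
        imputed.append(filled)
    return imputed


inbreds = [
    ["A", "C", "G", None, "T"],   # one missing call
    ["A", "C", "G", "T", "T"],    # closest relative of the first line
    ["G", "T", "A", "C", "A"],    # distant line
]
completed = impute_nearest(inbreds)
```

A sketch like this cannot tell a technically missing call from a site that is biologically absent because of a PAV, which is exactly the over-imputation risk the paragraph above raises.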

Another important difference between the results obtained with GBS and the results from SNP array methods seems to be the MAF distribution. Whereas array assays seem to oversample SNPs with intermediate frequencies [45] even when analyzing diverse maize collections [9, 41], more than half of GBS SNPs within our collection are rare (this is especially true within some of the more diverse germplasm groups). As sequencing technologies improve, the number of rare alleles detected is increasing. In humans, recent studies have found that the majority of variable genomic sites are rare, and exhibit little sharing between diverged populations [46]. The importance of rare alleles is not yet completely clear, and further studies to understand the magnitude of their role causing observable phenotypic variation are underway [38]. There are strong arguments both in favor and against the rare allele model, which hypothesizes that quantitative traits are largely controlled by rare alleles of large effect [15, 17].

GWAS studies have shown that variation in some traits is related to rare alleles, and that those rare variants could explain an additional fraction of the missing heritability [15]. However, identifying rare variants through GWAS is challenging, and requires large sample sizes [38]. With the present work, we present an extensive genetic characterization of the maize inbred lines preserved by one of the largest crop germplasm banks in the world, using a method that detects rare alleles with high confidence levels. Moreover, our data show that when there are not enough resources to extensively evaluate the entire collection, a smaller number of samples (such as the maize association panel or even the NAM parents), can, if chosen based on appropriate criteria to maximize haplotype diversity, capture a high portion of the rare alleles, allowing detection of rare allele effects that may be desirable to incorporate into breeding programs.

A complication of using the entire USDA-ARS maize inbred collection for breeding or GWAS is the close relationships between some of the lines. When the seed yield of a few inbreds derived from the Iowa Stiff Stalk Synthetic and their derivatives facilitated the transition to single-cross hybrids, these inbreds became the female parents of choice for many breeding programs [47]. For example B73, the main founder of the stiff stalk group, is closely related to more than 50 other inbred lines from different programs in the collection. Several germplasm sources were used to generate the male pool (non-stiff stalk). However, the visualization of the genetic relationships through the MDS shows that even if the non-stiff stalk group forms a larger cluster (revealing a higher amount of diversity), an overlap between the stiff stalk and non-stiff stalk group still exists.

As shown by the MDS plot and Fst values, most of the germplasm from classic breeding programs of the Corn Belt region is closely related. The bottleneck is even narrower when ExPVPs are examined. Using a much smaller sample of SNP markers, Nelson et al. [48] reported that most of the ExPVPs released in the past three decades could be clustered into six primary groups represented by six prominent public inbred lines. More recently, Mikel [49] studied the pedigree records of several inbreds registered until 2008, and found that the genetic contribution of the inbred Mo17 decreased, whereas that of Oh43 increased. Our analysis shows that the ExPVP inbreds tend to cluster into three main groups, with B73, Mo17/Oh43, and PH207 being the principal connectors within each cluster. Although all of the major private seed companies are represented within each group (consistent with the small value of divergence between companies), Pioneer germplasm is represented more in the iodent group (including PH207) and more of its germplasm falls outside the three main clusters (B73, Mo17/Oh43, and PH207). This result is in concordance with the observed smaller average haplotype length of Pioneer germplasm.

Although the recycling of elite lines as breeding parents has markedly reduced the amount of diversity used by maize breeders over the past few decades, breeders have also been aware of the importance of maintaining and introducing diversity into their programs [50]. The determination of breeders to search for new sources of promising, exotic germplasm is reflected in the Ames inbred collection. For instance, the GEM program aims to broaden the germplasm base of corn hybrids grown by farmers in the USA [51]. Combining the efforts of public and private cooperators, this project has introduced tropical alleles into elite USA germplasm. Our molecular characterization of these materials shows that the GEM program has been effective, as most of its inbreds lie somewhere between the ExPVPs and tropical materials on the MDS plot. According to our results, other public programs that have succeeded in incorporating tropical diversity into their materials are North Carolina State University and the University of Missouri. On the other side of the graph, adaptation to colder climates has been accomplished using different heterotic pools within the Northern USA and Canadian programs. Overall, although inbred lines from breeding programs from other parts of the globe might have different haplotype combinations (related to the use of different breeding pools), the USA and Canadian public inbred lines preserved at NCRPIS capture most of the total allelic diversity uncovered in this study.

GBS has yielded the greatest number of SNPs ever obtained from a large maize association panel to date. As seen with our GWAS analysis, the data can provide accurate mapping of simple and complex traits for the most important genes. Van Inghelandt et al. [52] suggested that with an association panel of 1,537 elite maize inbred lines, 65,000 SNPs should be sufficient to detect associations with the genes with biggest effects. Lu et al. [41] used a panel containing tropical and temperate materials, and suggested that 230,000 to 460,000 markers would be needed. However, when comparing the results for the two locations with the best flowering time associations in our study, we observed that the most important flowering time gene, ZmCCT, was targeted with only one SNP, meaning that it could easily have been missed. By contrast, the Vgt1 peak showed more than 80 SNPs associated with the trait (Figure 11). The main difference between these two important QTL is that the ZmCCT polymorphism is very rare in temperate materials with very low levels of LD, whereas the Vgt1 variation is common in temperate inbred lines that have higher LD. When GBS data are used to perform GWAS, the probability of finding the causative SNPs in the dataset is highly dependent on the trait itself and the germplasm in which it is expressed. The length and number of the haplotypes detected vary enormously, depending on the region of the genome and the germplasm group. Some germplasm groups are currently under-represented in our maize dataset. As a result, population bottlenecks can leave a polymorphism at a frequency too low to pass the GBS pipeline quality filters. Therefore, it is unlikely that a causative polymorphism is present in the GBS dataset if it is unique to one of these germplasm groups.
In addition, if the region has high haplotype diversity, rapid LD decay indicates that it is very likely that, even with approximately 700,000 SNPs we might not find a marker in LD with a particular causative polymorphism of interest. This situation is reflected in a large portion of chromosome 10 where the ZmCCT gene is located, and tropical inbreds have much greater haplotype diversity than the rest of the collection. This means that, although 700,000 SNP markers are likely to be sufficient for analysis of temperate alleles, they are not sufficient to perform accurate GWAS with tropical alleles.

However, numerous inbreds in the collection are IBD for specific regions, allowing a strategy of accurate imputation. Based on common local haplotypes defined with GBS SNPs, high-density markers for a representative inbred obtained through whole-genome sequencing can be imputed between GBS markers, thereby increasing marker density.

In summary, our GWAS results for days to silking showed that this association panel combined with the GBS information can help to dissect the genetic architecture of important agronomic complex traits. Our best association signals corresponded to regions in which a priori candidate genes or previously identified flowering time QTL are located. Nevertheless, identifying the causal gene is complex. Excluding the ZmCCT gene hit on chromosome 10, all other major associations contain several SNPs. These hits cover regions that can extend for more than 10 Mb, even though our average LD decays very rapidly. For Arabidopsis [53] and rice [54], the results suggest that the occurrence of these 'mountain landscapes' could be related to the presence of several linked genes across the region. In maize, the dissection of a candidate region contributing to flowering time variation on chromosome 6 suggests that a cluster of tightly linked genes are responsible for the phenotypic variation [55]. In our study, the linked associations on chromosome 8 correspond with the position of two known flowering time genes, ZmRap2.7 [30] and ZCN8 [56]. A similar situation occurs for the hits on chromosome 7 with candidates DLF1 and FRI. Lastly, on our chromosome 1 region, extended haplotype lengths for some subpopulations and a strong correlation between the region and population structure have been reported [37]. Within 3 Mb, there are genes that have been under selection since the domestication of maize including tb1 and d8 [25, 36] and two strong candidate genes for flowering time (CCT and PhyA1). All these results for our candidate regions support the hypothesis of the presence of some multigene complexes that may have evolved together during the process of maize domestication and adaptation. Further studies to unravel these regions and better understand the genetic architecture of flowering time are needed. 
Flowering time and adaptation to temperate climates are complex traits that seem to be controlled by several genes with small effects, organized in clusters across the genome.

SNP Judgments and Freedom of Association

Genetic association studies using single nucleotide polymorphisms (SNPs) and insertion/deletion variants are a common feature on the atherosclerosis research landscape. A recent Medline search using the terms “[gene] AND [polymorphism] AND [X],” where X was “atherosclerosis,” “vascular biology,” “thrombosis,” or “lipoprotein,” found >4000 original articles. Furthermore, the yearly number of new reports has been growing exponentially since 1983 (Figure). The allure of SNPs, the release of millions of SNP-based markers from dedicated consortia, and the availability of cost-effective high-throughput detection methods are converging to create a potential explosion of genetic association studies in atherosclerosis. Although the standards and quality seem to be improving, there is nevertheless a risk that SNP-based association analyses will squander academic trust and scientific resources owing to unsatisfactory design and/or analysis.

The figure shows the results of a recent Medline search using the terms “[gene] AND [polymorphism] AND [X],” where X was “atherosclerosis,” “vascular biology,” “thrombosis,” or “lipoprotein.” The number of articles published each year for each combination of search terms is plotted.

Like all experimental designs and model systems, genetic association studies in human samples have strengths and limitations. 1–3 Their potential strengths include the simplicity of design, ease of noninvasive sampling, reliability and cost-effectiveness of genotyping, uncomplicated statistical analysis, and the potential for clear interpretation and direct relevance to human biology. But many factors collude to undermine confidence in association studies. Often the initial publication of a positive association is followed by reports of non-replication or refutation. There can be good reasons for non-replication, including complexity of mechanisms, multiplicity of causative genes, confounding by gene-environment interactions, and context-dependency of the associations. However, the pattern of non-replication of genetic associations is frequent, familiar, and disconcerting. An index of the tenuous nature of genetic associations in atherosclerosis is that few DNA markers are in routine clinical use, such as in risk stratification protocols, although the application of markers for disease prediction is admittedly distinct from their use in experimental hypothesis testing.

Limiting the potential to publish false-positive (FP) or false-negative (FN) results can be achieved in many ways. For instance, one journal recently expanded its editorial criteria for rapid rejection to include “genetic association studies related to complex disorders, including … atherosclerotic heart disease.” This policy can be justified as an effective means to eliminate the likelihood of ever publishing a false, non-replicable result from a genetic association study. However, the bath water could hold an occasional baby. In general, the principle of editorial scrupulousness and placing restrictions on the publication of genetic association studies is valid in the current research environment, but might there be some standards for association studies that would not close the door on their publication, while simultaneously maximizing confidence in their validity? It is hoped that this editorial will stimulate a dialogue among the contributors, editorial staff, and readers of Arteriosclerosis, Thrombosis, and Vascular Biology regarding desirable attributes for genetic association studies in atherosclerosis.

The Basics of Genetic Association Analysis

Using genetic markers as putative atherosclerosis determinants, even risk factors, is intuitively attractive. The classic Hill criteria for a causative relationship between a determinant and an outcome require that an association be temporally correct, specific, strong, and consistent, with clear biological plausibility, with an evident dose-response relationship. 4 Genetic markers for atherosclerosis, long regarded as the molecular correlates of family history, can be visualized as fitting such classic criteria. Genetic association studies seem to provide a direct way to probe the relationship between genome and disease. But there is more to association analysis than meets the optimistic eye.

In contrast to genetic linkage studies, which investigate correlations between inheritance of a trait and chromosomal regions within family units such as sibling pairs or multigenerational pedigrees, association studies test for differences in genetic marker frequency between affected cases and unaffected controls. Linkage studies have had many notable successes in identifying the molecular basis of monogenic diseases, but fewer successes with common, more complex phenotypes, such as atherosclerosis. True lasting, replicable successes with association studies are also infrequent, despite the fact that they are more commonly performed because in their simplest form they are not constrained by a requirement for family units.

One common study design for association analysis is: 1) ascertain cases with a trait of interest, such as atherosclerosis or a related phenotype; 2) assemble matched controls without the trait; 3) obtain DNA samples; 4) genotype all subjects for a marker thought to be etiologically important, or with a set of markers covering the genome; and 5) statistically compare allele or genotype frequencies in cases versus controls. Perhaps an even more commonly used alternate procedure for quantitative measures, which would apply to many atherosclerosis-related phenotypes, is to enter the genotype as an independent variable in a multivariate regression analysis that assesses sources of variation of the continuous trait within a single study sample.
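Step 5 of the case-control design, comparing allele frequencies in cases versus controls, can be sketched as a chi-square test on a 2x2 table of allele counts. The genotype strings, group sizes and function name below are invented for illustration. The P-value uses the identity that, for one degree of freedom, the chi-square tail probability equals `erfc(sqrt(chi2 / 2))`. A real analysis would also check Hardy-Weinberg equilibrium and population structure, as the surrounding text stresses.

```python
from math import erfc, sqrt


def allelic_test(cases, controls, allele="A"):
    """Chi-square test (1 df) on allele counts in cases versus controls.

    cases and controls are lists of two-letter genotype strings such as
    "AA", "AG" or "GG". Returns (chi2, p_value).
    """
    def counts(group):
        hits = sum(g.count(allele) for g in group)
        return hits, 2 * len(group) - hits   # two alleles per person

    a, b = counts(cases)       # a = test allele, b = other allele
    c, d = counts(controls)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("SNP is monomorphic or a group is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2, erfc(sqrt(chi2 / 2))


# Invented genotype counts for one A/G SNP.
cases = ["AA"] * 30 + ["AG"] * 50 + ["GG"] * 20
controls = ["AA"] * 15 + ["AG"] * 45 + ["GG"] * 40
chi2, p_value = allelic_test(cases, controls)
```

Repeating this test for every genotyped SNP yields the list of P-values to which a multiple-testing adjustment must then be applied.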

Weaknesses of Genetic Association Studies

Case-control genetic association studies suffer from both general problem types that afflict non-genetic case-control studies and from specific problem types that are unique to genetic studies. As with any case-control approach, there are many sources of bias. For instance, unsatisfactory designation of cases and controls is a limitation in this study design. Cases and controls must be representative and must be matched as closely as possible, except for the phenotype of interest, which must be clearly defined, using reliable diagnostic procedures with stated performance attributes.

Etiologic and genetic heterogeneity underlying the affected status of cases or the quantitative trait of interest can obscure the identification of an association, especially in small samples, thus increasing the probability of a FN result. Narrowing the selection criteria for cases, such as specifying defined sub-phenotypes using stringent criteria, can reduce the risk of heterogeneity, and can possibly increase the signal-to-noise ratio for a true association. In addition, defining cases and controls should not be influenced by knowledge of individual genotypes. As well, the genotype assay must be uniformly performed using reference standards, and must be blinded to the phenotype. In theory, it is possible to match cases and controls molecularly, by using DNA markers from random genomic locations that have low-to-no a priori possibility of being associated with the trait of interest. However, procedures and standards for such molecular matching have not yet been established.

Most criticisms of association studies stem from their potential to generate FP results. High-throughput genotyping and multiple hypothesis testing increase the threat that associations will result from chance alone. For example, if 20 markers are genotyped, but only the two best results are reported without commenting on the actual amount of genotyping, the findings are misleading to the reader. Reports of association studies should thus include a statement of the numbers of markers actually genotyped. Realistic analysis of statistical power should be included, with appropriate assumptions of homogeneity and genetic effect sizes.

Increasing the stringency of the significance level by using an accepted adjustment procedure is a crucial step to control the risk of a FP result. However, this also could reduce statistical power in some association studies that may already be under-powered. Another way to reduce FP results is to increase the prior probability of causation by selecting candidate genes based on evidence of a role in the phenotype from functional, expression, or genetic mapping experiments. Most association studies thus use candidate gene SNPs.
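The effect of testing many markers can be made concrete with the 20-marker example. The sketch below computes the chance of at least one false positive across independent tests and the Bonferroni-adjusted threshold. Independence between markers is an assumption, since SNPs in linkage disequilibrium are correlated, and Bonferroni is only one of several accepted adjustment procedures.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at or below alpha."""
    return alpha / n_tests


# Twenty markers, each tested at the conventional 0.05 level.
fwer_20 = familywise_error(0.05, 20)          # roughly 0.64
per_test = bonferroni_threshold(0.05, 20)     # 0.0025
fwer_adjusted = familywise_error(per_test, 20)
```

With 20 unadjusted tests, a nominally significant result somewhere is more likely than not even when no marker is truly associated, which is why the number of markers genotyped has to be reported.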

FP associations can also result from inadequate matching of controls and cases, often due to systematic differences. For instance, some population substrata may be more susceptible to the phenotype because of an unmeasured factor, like ethnicity or genetic background, for which the genotype is merely an indirect marker. There are several approaches to reduce population stratification artifacts, such as the use of more restrictive family-based association designs, although some of these approaches may require additional scrutiny to fully define their appropriate use and limitations. In any event, greater awareness of this problem seems to be reducing the hazard of stratification artifacts for recent better-designed studies.

Although true positive or true negative findings should be replicable, there may be valid reasons why this is not always the case. For instance, imagine that four independent investigators conduct separate case-control association studies of the same genetic marker and atherosclerosis phenotype. Assuming a true association and similarity of effect and sample sizes, and a power of 0.85 for each study, the probability that all four studies will detect the association is ≈50%. Furthermore, because atherosclerosis and related phenotypes are so complex, the actual assortment of etiologies may vary across samples, making replication less likely between samples.
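The replication figure quoted above follows from multiplying the per-study power across independent studies. A minimal check, assuming the four studies are independent and that the association is real:

```python
def prob_all_replicate(power: float, n_studies: int) -> float:
    """Probability that every one of n independent studies detects a true effect."""
    return power ** n_studies


p_all = prob_all_replicate(0.85, 4)
print(f"All four studies detect the association: {p_all:.3f}")
print(f"At least one study misses it: {1 - p_all:.3f}")
```

So even with each study powered at 0.85, there is close to an even chance that at least one of the four fails to replicate a true association.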

Publication bias for positive results occurs early in a polymorphism’s history. Also, for some polymorphisms, just as negative associations between a marker/locus and a phenotype were being reported, new positive associations involving the same marker and more distal phenotypes were being published, with a lag period before failure to replicate is demonstrated. The ACE intron 16 insertion/deletion genotype is illustrative: while large meta-analyses were discounting the association with myocardial infarction [5,6], smaller case-control studies were reporting new associations with seemingly more remote traits, such as serum triglycerides, exercise tolerance and “leukoaraiosis” [7–10]. Replication and persistence of the initial association through systematic, planned meta-analysis and review would appear to be crucial. The rationale to search for new associated phenotypes would need to be justified and withstand critical peer-review for each new study. Also, there might be more convenient ways to publish association studies, with smaller article formats for both initial reports and replication studies.

There are also problems related to the use of SNPs themselves as genetic markers. These include the low power afforded in general by biallelic systems and the inability to examine haplotype phase in unrelated study samples. Most identified SNPs are biologically neutral or have no known function. When a SNP is selected as a marker for a locus, there may be no functional basis for a relationship with the phenotype. Instead, the SNP may be in linkage disequilibrium with an unmeasured functional variant, which may be weak or variable in large complex populations. Using larger numbers of SNPs might help provide a few functional polymorphisms. If many SNPs at a locus are used, linkage disequilibrium between them should be estimated. Finally, and most importantly, tests of direct association are convincing if a selected SNP has a demonstrated functional consequence.
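Because haplotype phase is usually unknown in unrelated samples, linkage disequilibrium between two SNPs is often approximated by the squared correlation of their genotype dosages (0, 1, or 2 copies of a chosen allele). The sketch below uses that approximation; the function name and the toy genotypes are illustrative assumptions, and this is not a phased haplotype-based estimate.

```python
from statistics import mean


def dosage_r2(g1, g2):
    """Squared Pearson correlation between two SNPs' genotype dosages."""
    m1, m2 = mean(g1), mean(g2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return float("nan")  # a monomorphic SNP carries no LD information
    return cov * cov / (var1 * var2)


# Toy dosages for six individuals at three SNPs (hypothetical data).
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 1, 2, 0, 1, 2]   # perfectly correlated with snp_a
snp_c = [1, 1, 1, 1, 1, 1]   # monomorphic in this sample
print(dosage_r2(snp_a, snp_b))
print(dosage_r2(snp_a, snp_c))
```

SNPs with a value near 1 are largely redundant as markers, so testing both adds little information while still adding to the multiple-testing burden.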

Is There Any Hope for Genetic Association Studies?

While genetic association studies have significant limitations, they sometimes represent the only practical approach to begin to address a particular biological hypothesis. For new genes or for known genes with an interesting protein product, genetic association analysis may be the fastest initial approach to show a relationship with a biological end point. Because interest in SNPs and association analyses will continue to grow, it is important to recognize their limitations. Some weaknesses of genetic association studies and possible ways to confront them are shown in the Table.

Table 1. Partial List of Desirable Attributes for Genetic Association Studies

At a bare minimum, a desirable genetic association study should have: 1) justifiable biological rationale; 2) appropriate selection and sampling procedures; 3) rigorous phenotyping and genotyping procedures; 4) large samples; 5) appropriate probability values; and 6) physiologically meaningful evidence supporting a functional role of the polymorphism. Reports that contain an initial study and an independent replication would be particularly desirable, especially if different sampling and/or analytical strategies were used. To reduce the risk that the authors, editorial staff, and readers will be confronted with false results, explicit guidelines for genetic association studies may be required. While few studies will meet all criteria, the confidence in the results will probably be proportional to the number of met criteria. The standards for “desirability” will continue to evolve as insights into complex traits and analytic strategies improve.

The author is grateful for laboratory support by grants from the Canadian Institutes for Health Research, the Heart and Stroke Foundation of Ontario, the Canadian Genetic Diseases Network, the Canadian Diabetes Association, and the Blackburn Group. The author is a Career Investigator of the Heart and Stroke Foundation of Ontario and holds a Canada Research Chair (Tier I) in Human Genetics.

Parabon® Snapshot®

Snapshot is a cutting-edge forensic DNA analysis service that provides a variety of tools for solving hard cases quickly:

Snapshot is ideal for generating investigative leads, narrowing suspect lists, and solving human remains cases, without wasting time and money chasing false leads.

Snapshot Genetic Genealogy

Genetic Genealogy (GG) is the combination of genetic analysis with traditional historical and genealogical research to study family history. For forensic investigations, it can be used to identify remains by tying the DNA to a family with a missing person or to point to the likely identity of a perpetrator.

By comparing a DNA sample to a database of DNA from volunteer participants, it is possible to determine whether there are any relatives of the DNA sample in the database and how closely related they are (see Snapshot Kinship Inference for more details). This information can then be cross-referenced with other data sources used in traditional genealogical research, such as census records, vital records, obituaries and newspaper archives.

Why Use Genetic Genealogy?

Genetic genealogy gives you a powerful new tool to generate leads on unknown subjects. When a genetic genealogy search yields useful related matches to an unknown DNA sample, it can narrow down a suspect list to a region, a family, or even an individual. Paired with Snapshot DNA Phenotyping to further reduce the list of possible matches, there is no more powerful identification method besides a direct DNA comparison. Identity can then be confirmed using traditional STR analysis.

How Does This Technique Differ From Familial Searches in the CODIS Database?

Our genetic genealogy service is somewhat like familial search, but it differs in three very important ways: (1) we only search public genetic genealogy databases, not government-owned criminal (STR profile) databases, such as CODIS; (2) because the DNA SNP profiles we generate contain vastly more information than traditional STR profiles, genetic relatedness can be detected at a far greater distance (see Snapshot Kinship Inference); and (3) because genetic genealogy matches can be cross-referenced by name with traditional genealogy sources, existing family trees can be used to expedite tree-building and case-solving. This technology and our innovative techniques combine to create a groundbreaking system for forensic human identification.

How Genetic Genealogy Works

Genetic genealogy uses autosomal DNA (atDNA) single nucleotide polymorphisms (SNPs) to determine how closely related two individuals are. Unlike other genetic markers, such as mitochondrial DNA or Y chromosome DNA, atDNA is inherited from all ancestral lines and passed on by both males and females and thus can be used to compare any two individuals, regardless of how they are related. However, atDNA SNPs are more difficult to obtain from forensic samples, which is why Parabon has created an optimized laboratory protocol to ensure high-quality results even from small, degraded DNA samples.

The standard atDNA metric used by genetic genealogists is the amount of DNA that two people are likely to have inherited from a recent common ancestor. This can be estimated by looking for long stretches of identical DNA. While alleles can easily be shared by chance at one or a few SNPs, it is highly unlikely for two unrelated people to share a long stretch of DNA. Therefore, only segments above a certain length are counted. The length of these shared segments is measured in centimorgans (cM), a measure of genetic distance, and the total number of cM shared across all chromosomes can be used to determine approximately how closely related two people are. The figure below shows how shared segments of DNA on a single chromosome are broken up with each generation, leading to shorter shared segments for more distant relatives. Using a public genetic genealogy database, DNA from an unknown person can be compared to roughly 1 million other people to see whether any of them are related.
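The segment-counting rule described above can be sketched in a few lines. The 7 cM cutoff and the segment lengths are illustrative assumptions only; the text does not state which threshold is actually used.

```python
def total_shared_cm(segments_cm, min_segment_cm=7.0):
    """Sum shared-segment lengths, ignoring segments below the cutoff.

    Short segments are dropped because they can easily match by chance.
    """
    return sum(length for length in segments_cm if length >= min_segment_cm)


# Hypothetical shared segments (in cM) between two people.
segments = [3.2, 12.5, 48.0, 5.9, 22.1]
total = total_shared_cm(segments)
print(f"Total shared: {total:.1f} cM")
```

Only the three segments of 12.5, 48.0, and 22.1 cM are counted here; the total would then be compared with the ranges expected for different relationships.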

DNA database matches serve as clues on which traditional genealogy methods can build, starting with building the matches' family trees using a wide variety of information sources. During the tree building process, the genetic genealogist searches for common ancestors who appear across multiple family trees of the matches. Ideally, marriages between the descendants of the identified common ancestors are discovered. Then descendancy research is employed to search for descendants at the intersection of these common ancestors who were born at a time that is consistent with the subject's estimated age range. The goal of this search is to narrow down the possible individuals to a set of names, a family, or even an individual.

Depending on the amount of information available from the matches, genetic genealogy can produce a wide range of leads. In all cases that proceed to analysis, genetic genealogy will significantly narrow the scope of possible identities for the person-of-interest. In some cases, the identity will be narrowed to descendants of a particular ancestor or from a particular region. In others, our analysts can produce the name and address of the person-of-interest. In all cases, identity must be confirmed through traditional forensic DNA matching.

Genetic Genealogy Use Cases

Genetic genealogy has traditionally been used to discover new relatives and build a full family tree. However, it can also be used to discover the identity of an unknown individual by using DNA to identify relatives and then using genealogy research to build family trees and deduce who the unknown individual could be. These techniques have primarily been used to discover the family history of adopted individuals, but they apply equally well to forensic applications. Genetic genealogy has been used to identify victims' remains, as well as suspects, in a number of high-profile cases.

Because genetic genealogy uses the same type of data generated for Snapshot DNA Phenotyping and Snapshot Kinship, the analysis can quickly be performed on existing cases, and new cases have a wide array of options for generating new leads from a single DNA sample.

Snapshot Featured in
National Geographic
Magazine Cover Story
[UPDATE: Solved]

Watch NBC Nightly News
Put Snapshot To The Test

Watch Snapshot
Workflow Video

The Snapshot DNA Phenotyping Service

DNA Phenotyping is the prediction of physical appearance from DNA. It can be used to generate leads in cases where there are no suspects or database hits, to narrow suspect lists, and to help solve human remains cases.

DNA carries the genetic instruction set for an individual's physical characteristics, producing the wide range of appearances among people. By determining how genetic information translates into physical appearance, it is possible to "reverse-engineer" DNA into a physical profile. Snapshot reads tens of thousands of genetic variants ("genotypes") from a DNA sample and uses this information to predict what an unknown person looks like.

Over the past four years, using deep data mining and advanced machine learning algorithms in a specialized bioinformatics pipeline, Parabon, with funding support from the US Department of Defense (DoD), developed the Snapshot Forensic DNA Phenotyping System, which accurately predicts genetic ancestry, eye color, hair color, skin color, freckling, and face shape in individuals from any ethnic background, even individuals with mixed ancestry.

Because some traits are partially determined by environmental factors and not DNA alone, Snapshot trait predictions are presented with a corresponding measure of confidence, which reflects the degree to which such factors influence each particular trait. Traits, such as eye color, that are highly heritable (i.e., are not greatly affected by environmental factors) are predicted with higher accuracy and confidence than those that have lower heritability; these differences are shown in the confidence metrics that accompany each Snapshot trait prediction.

How DNA Phenotyping Works

Whereas traditional DNA forensics matches STRs from a sample to a known suspect or a database, DNA phenotyping can generate new leads about an individual, even if they have not previously been identified in a database. DNA phenotyping takes advantage of modern SNP technology to read the parts of the genome that actually code for the differences between people.

The Snapshot DNA Phenotyping System translates SNP information from an unknown individual's DNA sample into predictions of ancestry and physical appearance traits, such as skin color, hair color, eye color, freckling, and even face morphology. Each phenotype prediction is made with a measure of confidence, including those that can be excluded with high confidence.

SNP Technology

Recent advances in genomic technology have made it practical and affordable to read the sequence of millions of pieces of DNA from a small quantity of sample. This data captures a large proportion of the genomic variation between people and thus contains much of the genetic blueprint that differentiates people's appearance. These SNP genotypes can then be paired with phenotypes from thousands of subjects to create a genotype-and-phenotype (GaP) dataset for analysis.

Using genomic data from large populations of subjects with known phenotypes, Parabon's bioinformatics scientists have built statistical models for forensic traits, which can be used to predict the physical appearance of unknown individuals from DNA.

Data Mining

Beginning with large GaP datasets containing genetic information and measures of phenotype for thousands of subjects, Parabon's bioinformatics team performs large-scale statistical analysis on hundreds of thousands of individual SNPs and billions of SNP combinations to identify genetic markers that are associated with a trait. This mining process can take weeks of compute time running on hundreds, sometimes thousands, of computers. In the end, those SNPs with the greatest likelihood of contributing biologically to the trait's variation are selected for potential use in predictive models.

Data Modeling

In the modeling phase, Parabon's scientists use machine learning algorithms to combine the selected set of SNPs into a complex mathematical equation for the genetic architecture of the trait. A new, unknown individual's SNP data can then be plugged into this equation to produce a prediction of the trait in that individual.

Model accuracy is assessed by making predictions on new subjects with known phenotypes ("out-of-sample predictions"). By comparing predicted versus actual phenotypes, Parabon scientists are able to calculate confidence statements about new predictions and, more importantly, exclude highly unlikely traits. For example, if 99% of brown-eyed people have an eye color prediction value greater than 2, then we can have very high confidence that a prediction of 1.5 most likely did not come from a brown-eyed person.
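The exclusion logic in the eye color example can be illustrated with an empirical fraction computed from out-of-sample scores. The scores below are fabricated purely for illustration (99 of 100 hypothetical brown-eyed subjects score above 2), and the function name is an assumption; this is not Parabon's actual calibration procedure.

```python
def fraction_at_or_below(known_scores, value):
    """Share of known subjects whose prediction value is at or below `value`."""
    return sum(score <= value for score in known_scores) / len(known_scores)


# Hypothetical out-of-sample prediction values for 100 brown-eyed subjects:
# 99 score above 2, one scores 1.2.
brown_scores = [2.1 + 0.02 * i for i in range(99)] + [1.2]

frac = fraction_at_or_below(brown_scores, 1.5)
print(f"Brown-eyed subjects scoring <= 1.5: {frac:.0%}")
```

If only about 1% of known brown-eyed subjects ever score that low, a new prediction of 1.5 can be used to exclude brown eyes with high confidence.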

The final models are calibrated with all available data before being installed into the Snapshot production service that is used to generate phenotype predictions for investigators.

Snapshot Success Stories

Snapshot has been used by hundreds of law enforcement agencies around the world to help generate leads, narrow their suspect pools, and solve human remains cases, in both active and decades-old investigations.

Featured Case Summaries: Read detailed case descriptions, including how Snapshot helped solve the following cases:

Case Summary
Albuquerque, NM
2008 Aggravated Assault

Just before noon on 11 September 2008, Diane Marcell returned to her home in Albuquerque, NM, to meet her daughter, Brittani Marcell, for lunch. Brittani, then 17 years old, had driven home from her nearby high school. Upon entering her home, Diane found Brittani lying on the floor, covered in blood. A male subject, unknown to Diane, was standing near Brittani holding a shovel.

Startled, he dropped the shovel, ran into...

Case Summary
Rockingham County, NC
2012 Double Homicide

In the early hours of 4 Feb 2012, Troy and LaDonna French were gunned down in their home in Reidsville, NC. The couple awoke to screams from their 19-year old daughter, Whitley, who had detected the presence of a male intruder in her second floor room. As they rushed from their downstairs bedroom to aid their daughter, the intruder attempted to quiet the girl with threats at knifepoint. Failing this, he released Whitley and raced down the stairs.

After swapping his knife for the handgun in his pocket...

Case Summary
Tacoma, WA
1986 Rape and Murder of 12-Year-Old Girl

On Wednesday 26 March 1986, Michella Welch, a petite 12-year old girl with long blond hair and glasses, went missing. She had taken her two younger sisters to Puget Park in Tacoma, Washington at about 10 a.m. and then rode her bicycle home about 11 a.m. to make lunch for them. When she returned, she chained her bike next to one of her sister's bikes, set the lunches on the table and went looking for her siblings, who had gone to a nearby business to use the restroom.

A 13-year-old classmate later told detectives he saw a man in the park that day under the Proctor Bridge who...

Case Summary
Anne Arundel County, MD
2017 Unidentified Remains

On Wednesday 14 June 2017, members of the Anne Arundel County Police Department responded to a call reporting that a body had been found in the area of East Ordnance Road and East Avenue in Glen Burnie, MD. Upon arrival, officers located badly decomposed human skeletal remains that had been covered up by a tarp. The Office of the Chief Medical Examiner later determined that the decedent was a female approximately 20 years of age and that foul play was suspected in her death.

In the fall of 2017, after initial investigative efforts failed to reveal the victim's identity...

Case Summary
Lake Brownwood, TX
2016 Sexual Assault and Murder

On Friday 13 May 2016, the Brown County Texas Sheriff's Office (BCSO) received a missing person report for 25-year-old Rhonda Chantay Blankinship. Family members reported Blankinship had last been seen late Friday evening, walking near her home in the Tamarack Mountain/Thunderbird Bay area of Lake Brownwood. Friends, family and volunteers began a search for her while deputies followed up on possible leads into her disappearance.

Blankinship's body was found...

Testimonials: To read about how Snapshot has helped our clients with their investigations, see:

Published Investigations: To learn how Snapshot is being used by additional law enforcement agencies, and to read about other solved cases, please visit the published police investigation page at:

Blind Evaluations: Snapshot was built by Parabon NanoLabs for the defense, security, justice, and intelligence communities with funding from the United States Defense Threat Reduction Agency. As part of the development and validation process, Snapshot was tested on thousands of out-of-sample genotypes and was shown to be extremely accurate.

To see examples of Snapshot predictions from blind evaluation studies, visit:

Example of How To Use Snapshot: To learn how you can use Snapshot to narrow a suspect pool, watch:

Predicting Genetic Ancestry With Snapshot

Scientific analysis of human genomes from different parts of the world has shown that, on a global scale, modern humans divide genetically into seven continental populations: African, Middle Eastern, European, Central/South Asian, East Asian, Oceanian, and Native American [1]. These genetic divisions stem simply from the fact that these groups were isolated from one another for many generations, and thus each group has a unique genetic signature that can be used for identification. In order to determine a new subject's genetic ancestry, Parabon Snapshot analyzes tens of thousands of SNPs from a DNA sample to determine a person's percent membership in each of these global populations. Other forensic ancestry approaches assume that every individual comes from only a single population, so they can easily be confounded by admixed individuals, but Snapshot allows for contributions from multiple populations, so it can detect even low levels of admixture (<5%).

Global ancestry map showing mostly East Asian and Native/South American ancestry, with some European ancestry as well.

After global ancestry is determined, Snapshot's ancestry algorithm investigates which subpopulations (e.g., Northwest vs. Northeast Europe) an individual comes from. This analysis is robust to admixture, such that each piece of continental ancestry can be precisely localized within that continent. For example, the admixed East Asian and Latino example from the global map above was determined to have specifically Japanese, Central American, and Southwest European ancestry, as shown in the map below.

Regional ancestry map showing mostly Japanese, Southwest European, and Central American ancestry.

Using all of this information, Snapshot builds a precise profile of an individual's ethnic ancestry using only his or her DNA.

How Genetic Ancestry Determination Works

Parabon has built a powerful system for determining ethnic ancestry from DNA. Most other forensic ancestry systems use only a small number of SNPs and thus are limited to very coarse populations and cannot detect admixture between populations. Snapshot uses tens of thousands of SNPs across the genome to obtain very precise estimates of ancestry, even for admixed individuals. Parabon's scientists have collected data from many published scientific articles, totalling more than 9,000 individuals with clearly defined ancestry from more than 150 populations around the world, as shown in the map below.

Each point represents a population from which we have obtained ancestry background data. Efforts are ongoing to increase the representation of Native American populations.

Academic research using hundreds of thousands of SNPs from across the genome has shown that human groups generally divide into seven continental populations, which have been established over the past 50,000 years during the migration out of Africa. The 150 populations collected as the ancestry background can thus be divided into these seven continental groups according to their origin.

Snapshot builds on this research by mapping a new person's genome onto these established populations. Our algorithm calculates how similar the new individual's DNA is to each of the background populations, determining which population(s) the person comes from. This allows for contributions from multiple groups, so even small amounts of admixture (<5%) can be detected.
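As a rough illustration of how contributions from more than one background population might be estimated, the sketch below fits a single mixing fraction between two reference populations by maximum likelihood over a grid. It is a textbook-style toy model, not Parabon's algorithm: it assumes only two populations, independent SNPs, known reference allele frequencies, and genotypes coded as 0, 1, or 2 copies of the reference allele. All names and numbers are hypothetical.

```python
import math


def estimate_mixing_fraction(genotypes, freqs_a, freqs_b, steps=100):
    """Grid-search the fraction of ancestry from population A (0.0 to 1.0).

    Each genotype is treated as two independent draws of an allele whose
    frequency is a weighted mix of the two populations' frequencies.
    """
    best_q, best_ll = 0.0, -math.inf
    for i in range(steps + 1):
        q = i / steps
        ll = 0.0
        for g, pa, pb in zip(genotypes, freqs_a, freqs_b):
            f = q * pa + (1 - q) * pb
            f = min(max(f, 1e-9), 1 - 1e-9)  # guard against log(0)
            ll += g * math.log(f) + (2 - g) * math.log(1 - f)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q


# Hypothetical allele frequencies at 50 SNPs that differ strongly
# between the two reference populations.
freqs_a = [0.9] * 50
freqs_b = [0.1] * 50
heterozygous_everywhere = [1] * 50
print(estimate_mixing_fraction(heterozygous_everywhere, freqs_a, freqs_b))
```

An individual who is heterozygous at every such SNP is best explained by an even mix of the two populations, while someone carrying two copies everywhere is assigned entirely to population A. Real analyses use many populations and far more markers.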

Snapshot takes a similar approach to identifying within-continental (regional) ancestry, although the local populations were identified through empirical analysis performed by our bioinformatics team. Each piece of continental ancestry is partitioned according to its regional ancestry (e.g., if an individual is 50% European and 50% East Asian, the precise origin of each of those pieces will be determined). The person's genome is also plotted against all of the known individuals in each region to show visually where he or she falls.

Below is an example plot for an individual who was determined to be 50% East Asian and 50% Latino. Latino ancestry is a mixture of European and Native American ancestry, so these groups are shown as well.

Ancestry clustering diagram: this individual is half Japanese and half Latino.

Ancestry Determination Use Cases

Ethnic ancestry is one of the most informative traits that can be predicted from DNA. In an ancestry analysis, Snapshot will determine an individual's precise genetic origins, as well as whether there is any evidence of admixture (contribution from multiple populations). This information can be used to help identify remains or to significantly focus an investigation by excluding a wide range of possible suspects or even pointing to a very small group.

Snapshot Kinship Inference™

Snapshot Kinship Inference provides highly accurate inferences about the familial relationship between two people based on their DNA, even if they are distantly related. Unlike traditional forensic DNA methods, which are extremely limited in their ability to determine kinship (see tan region in the figure below), Snapshot can detect relatedness out to 9th-degree relatives (fourth cousins). This powerful forensic analysis tool gives investigators valuable, previously unobtainable information about the DNA samples found at a crime scene, information that can save time and money and lead to more solved cases.

Thanks to the massive amount of information contained in genome-wide SNP data, using DNA extracted from two biological samples, it is possible to precisely calculate the degree of relatedness between the contributors, even if the relationship is very distant.

Built with advanced machine learning algorithms, the Snapshot kinship model can distinguish up to 9th-degree relatives (fourth cousins) from unrelated pairs.

Traditional STR-based kinship analysis is limited to distinguishing parent/offspring relationships, often yielding inconclusive results for siblings or second-degree relatives. Snapshot's kinship model, on the other hand, uses hundreds of thousands of SNPs to detect relatedness out to 9th-degree relationships (e.g., fourth cousins). Moreover, the precise degree of the relationship can be determined out to 6th-degree relatives (second cousins once removed) while minimizing false positives, i.e., unrelated pairs mistakenly inferred to be related.

How Snapshot Kinship Inference Works

Traditional autosomal kinship analysis uses fewer than 20 short tandem repeat (STR) loci, which lack the resolution to establish relatedness beyond parent-offspring or full siblings, and is easily confounded by mutation or mistaken testing of a close relative of the true parent [1]. Other forensic analyses use pieces of DNA that are directly transmitted through the maternal (mitochondrial DNA) or paternal (Y-chromosome) lines; however, these approaches are limited to a small subset of relationships and have very low resolution. For example, 7% of unrelated Europeans share the same mitochondrial haplotype, meaning that they cannot be assigned to a specific family. MtDNA and Y-STRs can only suggest that two individuals may be related but cannot say whether that relationship is close or very distant.

Dissatisfied with these limitations, Parabon's scientists set out to develop a novel algorithm that takes advantage of the massive amount of autosomal data made available by genome-wide SNP typing to compare two genomes and determine the precise degree of relatedness between the two individuals. The result is a revolutionary new test that redefines the state-of-the-art in kinship analysis.

Parabon's kinship algorithm analyzes the similarity between two genomes and uses a machine learning model to predict the degree of relatedness of the two individuals. In thousands of out-of-sample predictions, this method has proven to be highly accurate while maintaining a very low false-positive rate (i.e., unrelated pairs are almost never mistakenly inferred to be related). This is true across subjects from a range of ethnic backgrounds, including related pairs with different ethnic backgrounds. Absolute accuracy is >90% out to 3rd-degree relatives (first cousins), and Snapshot can distinguish 6th-degree relatives (e.g., second cousins once removed) from unrelated pairs with greater than 98% accuracy.
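For intuition about what "degree of relatedness" means numerically, a standard approximation is that the expected fraction of shared autosomal DNA halves with each additional degree. The sketch below applies that rule of thumb; it is not the machine learning model described above, and actual sharing between real relatives varies around these expectations.

```python
import math


def expected_shared_fraction(degree: int) -> float:
    """Expected fraction of autosomal DNA shared by relatives of a given degree."""
    return 0.5 ** degree


def nearest_degree(shared_fraction: float) -> int:
    """Degree whose expected sharing is closest (on a log2 scale) to the observed value."""
    return round(-math.log2(shared_fraction))


for degree, label in [(3, "first cousins"),
                      (6, "second cousins once removed"),
                      (9, "fourth cousins")]:
    print(f"{degree} ({label}): {expected_shared_fraction(degree):.4%}")
```

At the 9th degree the expected sharing is only about 0.2% of the genome, which helps explain why a small panel of STR loci cannot resolve such relationships and genome-wide SNP data are needed.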

Snapshot Kinship Accuracy, measured as the frequency of correct predictions of the exact degree of relatedness (absolute accuracy) and the frequency of predictions within one degree of actual relatedness (n = 3,654 relationships).

As shown in the figure above, even when Snapshot incorrectly infers the degree of relatedness between two individuals, it is almost always correct within one degree. For example, Snapshot may occasionally incorrectly predict a 4th-degree relationship to be a 5th-degree relationship, but it rarely makes the mistake of predicting a 4th-degree relationship to be a 6th-degree relationship. With this level of accuracy, you can be confident that the inferences provided by Snapshot are reliable and actionable.
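The two accuracy measures discussed above, exact-degree accuracy and within-one-degree accuracy, are straightforward to compute from paired true and predicted degrees. The data below are invented for illustration and do not reproduce the reported results.

```python
def kinship_accuracy(true_degrees, predicted_degrees):
    """Return (exact accuracy, within-one-degree accuracy)."""
    pairs = list(zip(true_degrees, predicted_degrees))
    exact = sum(t == p for t, p in pairs) / len(pairs)
    within_one = sum(abs(t - p) <= 1 for t, p in pairs) / len(pairs)
    return exact, within_one


# Hypothetical true and predicted degrees for six pairs of relatives.
true_d = [1, 2, 3, 4, 4, 5]
pred_d = [1, 2, 3, 5, 4, 7]
exact, within_one = kinship_accuracy(true_d, pred_d)
print(f"Exact: {exact:.2f}, within one degree: {within_one:.2f}")
```

In this toy set, the 4th-degree pair predicted as 5th-degree counts against exact accuracy but not against within-one-degree accuracy, while the 5th-degree pair predicted as 7th-degree counts against both.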

[1] Chakraborty, R., et al. (1999). The utility of short tandem repeat loci beyond human identification: implications for development of new DNA typing systems. Electrophoresis, 1682–1696.

How Snapshot Kinship Inference is Used

Snapshot Kinship Inference can be used to establish familial relationships between a DNA sample and previously collected DNA samples or among a set of new samples, e.g.:

  • If there is a chance that the perpetrator of a crime is related to the victim, Snapshot can compare the victim's DNA to a crime scene DNA sample to determine whether they are related. With just one test, investigators can include or exclude the entire extended biological family of the victim.
  • If DNA from a suspect cannot be obtained, but a consenting family member is willing to contribute a sample, Snapshot can establish whether that family member is related to a crime scene DNA sample.
  • If the identity of unidentified remains is suspected, but only distant relatives are available, Snapshot can compare DNA from the remains (even bone) to that of a relative to determine whether they are related.

According to the U.S. Department of Justice (DOJ) Bureau of Justice Statistics, over 60% of all violent crimes in 2016 [the latest period for which data is available] were committed by persons known to the victim [1].

Knowledge of these relationships can be used to validate claims of distant kinship, establish relationship networks within groups of interest, or identify remains when close relatives are not available, such as cold cases, mass disasters, or casualties of past conflicts.

[1] Morgan R. and Kena G., Criminal Victimization, 2016, US Department of Justice, Office of Justice Programs, Bureau of Justice Statistics, NCJ 251150, Dec 2017. Retrieved: 19 Feb 2018.

Snapshot Featured in
National Geographic
Magazine Cover Story
[UPDATE: Solved]


Forensic Art Enhancement

While DNA can reveal much about the appearance of a subject, information about features such as age, body mass index (BMI) or the presence of facial hair are not available within an individual's genetic code. Snapshot forensic art services provide a means of incorporating such information into a Snapshot composite when it is available from non-DNA sources.

Examples of age progression and accessorization with Snapshot Forensic Art Services. By default, Snapshot produces composites from DNA at 25 years of age (A). Composite (A) is shown after age progression to age 50 years (B), with the addition of a light beard (C), after further age progression to age 75 years with reading glasses (D), and with a full beard (E).

Our Forensic Art Department, under the direction of Thom Shaw, who is certified by the International Association for Identification (IAI) in the discipline of forensic art, offers age progression, BMI alteration, and accessorization services, which may include the addition of facial hair, eyeglasses, piercings, etc. We can also create composite sketches from eyewitness accounts and combine them with traditional Snapshot composites, thereby corroborating the witness account or adding objective phenotype information to help produce the most accurate composite possible.

Composite (A) shown after age progression to 50 years old, including a beard (B) as compared to the actual subject (C)

In cases involving unidentified remains where a skull or partial skull is available, our forensic artists are also trained to perform digital facial reconstruction, using bone structure to enhance or give nuance to a Snapshot composite.

Snapshot predictions for Yolanda McClary, investigator for TV's "Cold Justice",
shown at age 25 and age progressed to 49 years old

Collectively, these forensic art services perfectly complement what Snapshot can provide from DNA alone and together they represent a revolution in how DNA can be used in an investigation.

How Forensic Art Enhancement Works

Forensic artists are artists with special training to address forensic challenges. They have an expert understanding of the human face and how the effects of aging and body mass index (BMI) change appearance. Those trained in facial reconstruction learn how to infer the most likely distribution of muscle and soft tissue from a skull. Forensic artists who create composite sketches from eyewitness accounts are trained to conduct cognitive interviews, so as to get the most accurate portrayal from a witness' memory.

As in many domains, forensic artists are beginning to rely heavily on modern software applications to facilitate their work. Sketches formerly drawn with pencil and pad can now be drawn digitally, and facial reconstructions once sculpted in clay can now be sculpted digitally. In the right hands, graphics software can ease the task of adding or subtracting hair, scars, and other accessories. In all cases, great skill and specialized training are still required, but the work can be more efficient and realistic thanks to these tools.

Forensic Art Enhancement Use Cases

Age Progression or Regression

Because age is not genetically encoded, Snapshot predicts subjects at 25 years of age by default. When investigators have reason to believe a person of interest is younger or older, our artists can adjust a composite accordingly, based on standard aging principles.

Examples of age progression with Snapshot Forensic Art Services: the predicted composite at 25 years old (A) shown after age progression to age 50 years (B) and after further age progression to 75 years of age (C)

Composites Based on Eyewitness Account

Our forensic artists are trained to conduct cognitive interviews and produce composites solely from an eyewitness account. The interview and composite production is conducted online with screen sharing technology, so eyewitnesses do not have to travel. When DNA is available for the same person of interest as seen by the eyewitness, Snapshot can provide a corresponding composite from "the genetic witness" perspective. Our artists can combine a composite from an eyewitness account with one produced by Snapshot to produce a single, highly accurate rendering that contains the best that both sources of information can offer.


Accessorization

In some instances, descriptive information about a subject's accessories or distinguishing features is available that can be used to enhance a Snapshot composite. For example, a surveillance camera image may be too grainy for identification, but nevertheless suggestive that a suspect has facial hair. Similarly, an eyewitness may recall a tattoo or scar, even though they were too traumatized to remember much else. In such cases, our forensic artists can accessorize a Snapshot composite to include all available descriptive information about a subject.

Examples of accessorization with Snapshot Forensic Art Services: the predicted composite at 25 years old (A) shown after age progression to age 50 years, with the addition of a light beard (B) and after further age progression to age 75 years with reading glasses and a full beard (C)

Body Mass Index (BMI) Alteration

Besides the effects of aging, changes in BMI have among the largest effects on appearance. By default, Snapshot produces composites assuming the subject has a BMI of 22, which is considered average. When information is available that suggests a subject has a lower or higher than average BMI, forensic artists can appropriately alter the BMI of a Snapshot composite.

Extreme examples of body mass index (BMI) alteration: the original prediction (A) shown with significantly less body mass (B) and again with a significantly larger amount of body mass (C)

Unidentified Remains

When unidentified human remains include a skull, our forensic artists can perform facial reconstruction, literally building up the corresponding face using knowledge of facial musculature and soft tissues. Although facial features cannot be perfectly inferred from a skull, bone structure can be immensely informative about the shape of an individual's face. Snapshot predicts exterior face morphology, but when a skull is available, a forensic artist can use it to confirm or enhance a Snapshot composite based on facial reconstruction.



Craniosynostosis (CRS), the premature fusion of the cranial sutures, is a heterogeneous disorder with a prevalence of ∼ 1 in 2000. Environmental factors, polygenic inheritance and single-gene or chromosomal abnormalities all contribute to its complex manifestations. Variants in >60 genes have been identified as recurrently associated with CRS, with an underlying genetic cause being found in ∼ 24% of patients overall. 1,2,3 The proportion in whom a cause can be determined varies widely depending on clinical diagnosis: from 88% for bicoronal synostosis down to 8% for sagittal synostosis (SS). 2 Until recently, success in identifying a genetic diagnosis has been particularly low in nonsyndromic midline CRS, under 1% for both sagittal and metopic suture fusions. 2

In 2016, Timberlake et al. 4 performed exome sequencing of 132 parent–offspring trios and 59 additional probands presenting with clinically nonsyndromic SS, metopic (MS), or combined metopic/sagittal synostosis, seeking evidence for major monogenic contributions to these disorders. Based on enrichment of de novo variants and inherited damaging variants, this study identified a single significant gene, SMAD6, located at 15q22.3. 4

SMAD6, originally identified in mammals by homology-based cloning, 5,6 encodes one of two (with SMAD7) inhibitory members of the SMAD family required for regulated intracellular signal transduction by members of the transforming growth factor β/bone morphogenetic protein (TGFβ/BMP) superfamily. 7,8,9 Intriguingly, enrichment of rare SMAD6 variants has also been reported in association with several other distinct phenotypes, namely congenital heart disease, 10,11,12 bicuspid aortic valve (BAV) and ascending thoracic aortic aneurysm (TAA), 13,14,15 intellectual disability, 16 and radioulnar synostosis. 17

In a follow-up study, Timberlake et al. increased the sample size of probands with midline CRS and no other genetic diagnosis to 379 (45 pedigrees included ≥1 additional affected family member). 18 They found damaging SMAD6 variants in 4/234 (1.7%) SS, 11/135 (8.1%) MS, and 2/10 (20%) combined metopic/sagittal synostosis probands. Although de novo variants (DNMs) were identified in four families, in the remainder, the SMAD6 variant was transmitted by an apparently unaffected (i.e., nonpenetrant) parent. Similar observations of nonpenetrance of SMAD6 variants were made for several of the other described disease associations. 13,14,17 To seek an explanation for the unpredictable penetrance, Timberlake et al. 4,18 genotyped a single-nucleotide polymorphism (SNP), rs1884302, previously reported in a genome-wide association study (GWAS) of nonsyndromic SS to be the most significant associated SNP, which may differentially regulate the most proximal gene BMP2. 19,20 The risk-conferring C allele (prevalence in non-Finnish Europeans of 32.7%, gnomAD), 21 was found to be present in 15/21 individuals with CRS but only 1/20 unaffected relatives heterozygous for the SMAD6 variant but without CRS, suggesting a two-locus mechanism to account for variable manifestation of CRS. 18
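The contrast reported above (risk allele present in 15/21 affected versus 1/20 unaffected SMAD6 variant carriers) can be put through a 2×2 exact test. The sketch below implements a two-sided Fisher exact test using only the standard library. It is an illustration of how such counts can be compared, not necessarily the test the cited authors used, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Counts from the text. Rows: CRS / unaffected carriers; columns: C allele present / absent.
p_value = fisher_exact_two_sided(15, 6, 1, 19)
print(f"two-sided p = {p_value:.2e}")
```

A small p-value here only says the two groups differ in this cohort. As the authors note in the next paragraph, whether the two-locus model holds in an independent cohort is a separate question.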

Although the studies described above 4,18 represent an important advance in delineating the contribution of single-gene variants to nonsyndromic midline CRS, they raise several questions. First, what is the contribution of SMAD6 variants in all presentations of CRS (including syndromic diagnoses and fusion of coronal or lambdoid sutures)? Second, can it be assumed that all rare SMAD6 missense variants affect protein function? Third, can the two-locus (SMAD6/rs1884302) model be confirmed in an independent cohort? Here, we address these questions. We confirm the primary finding that SMAD6 variants are enriched in CRS, especially metopic synostosis, but find a more diverse pattern of clinical presentation. In addition, we illustrate the importance of combining functional studies with frequency-based evaluation of variants to refine the likelihood of pathogenicity. Finally, we report that the two-locus model does not account for inconsistencies of penetrance of damaging SMAD6 variants in our data set.


Development of CottonSNP80K

High-density SNP arrays have been developed for a number of economically important crops, such as rice [9,10,11], maize [31, 32], soybean [33], wheat [34], and cotton [16]. These chips have been successfully used for functional genomics studies and molecular breeding. To date, only one array (CottonSNP63K) has been reported in cotton [16]. In this study, we developed a new upland cotton genotyping SNP array (CottonSNP80K), an improved version of CottonSNP63K for intraspecific genotyping of upland cotton. Of the 82,259 SNPs selected, 77,774 (94.55%) were successfully synthesized on the CottonSNP80K array. Compared with the CottonSNP63K array [16], the CottonSNP80K array offers several significant improvements. Firstly, whereas the SNP loci in the CottonSNP63K array were collected from 13 different discovery sets of G. hirsutum germplasm and five other species, the SNP loci in the CottonSNP80K array benefited from the whole-genome sequencing of G. hirsutum acc. TM-1 [22] and from 1,372,195 intraspecific non-unique SNPs identified by re-sequencing of G. hirsutum accessions [27]; the selected SNPs could therefore be distributed along the entire genome. Secondly, the CottonSNP63K array contains 63,058 markers, including 45,104 intraspecific SNPs and 17,954 interspecific SNPs, whereas the CottonSNP80K array increases the total number of markers to 77,774. With a requirement of MAFs > 0.1 in the re-sequencing data of different cotton accessions, the SNPs in CottonSNP80K showed five to six times as much upland cotton intraspecific polymorphism as those in CottonSNP63K. In recent reports using the CottonSNP63K array, Huang et al. (2017) [19] detected 11,975 quantified polymorphic SNPs in a diverse, nationwide population of 503 G. hirsutum accessions, and Sun et al. (2017) [18] detected 10,511 polymorphic SNPs using 719 diverse accessions of upland cotton.
In the present study, the number of polymorphic markers for upland cotton intraspecific genotyping was increased to 59,502 using the CottonSNP80K array. Thirdly, compared with the CottonSNP63K array, each SNP marker in the CottonSNP80K array is addressable, which avoids interference from homeologous/paralogous genes. During the development of the CottonSNP80K array, we also considered factors affecting array quality, including flanking sequence information, Illumina design scores, heterozygosity rates, and cluster results, to ensure high-quality genotyping of upland cotton.

Upland cotton accounts for more than 90% of the world’s cotton production; however, modern upland cotton cultivars have narrow genetic diversity. The CottonSNP80K array is better suited to upland cotton intraspecific genotyping and can help overcome the limitations imposed by this narrow genetic background and low genetic diversity. Using the CottonSNP80K array, genotyping analysis was performed on 352 cotton accessions. Of the 77,774 SNPs on the array, 59,502 (76.51%) were polymorphic loci, with 95.91% (57,071) and 82.25% (48,940) of these showing MAFs greater than 0.05 and 0.1, respectively. In the CottonSNP63K project, these proportions were much lower, at 66.8% (MAF > 0.05) and 55.8% (MAF > 0.1) [16]. We also investigated the genetic diversity of upland cotton accessions from different ecological areas. The Yangtze River Valley and the Yellow River Valley groups showed a higher degree of polymorphism than the Northwestern inland and Northern groups. In addition, the number of SNPs with MAF > 0.1 was higher in the Yangtze River Valley accessions than in the Yellow River Valley accessions, indicating that more rare alleles exist in the Yellow River Valley group. We also found that the rate of polymorphisms between G. hirsutum and G. barbadense was greater than 30%, implying that the array has potential for interspecific genotyping analysis.
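The polymorphism and MAF figures above come from per-SNP allele counts across accessions. The following is a minimal sketch of that calculation, assuming biallelic SNPs stored as two-letter calls with "--" for a missing call. The function names, the input layout, and the example data are hypothetical and are not the authors' pipeline.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF for one biallelic SNP from two-letter calls such as 'AA' or 'AG'.

    Missing calls (None or '--') are skipped. Returns 0.0 for a monomorphic
    SNP and None if no sample was called.
    """
    counts = Counter()
    for call in genotypes:
        if call and call != "--":
            counts.update(call)  # counts both alleles of the call
    total = sum(counts.values())
    if total == 0:
        return None
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / total


def summarize(snp_calls, thresholds=(0.05, 0.1)):
    """Count polymorphic SNPs and the share of them above each MAF threshold."""
    mafs = [minor_allele_frequency(calls) for calls in snp_calls.values()]
    poly = [m for m in mafs if m is not None and m > 0]
    shares = {
        t: (sum(m > t for m in poly) / len(poly) if poly else 0.0) for t in thresholds
    }
    return len(poly), shares


# Hypothetical calls for three SNPs (illustration only).
snp_calls = {
    "snp1": ["AA", "AG", "GG", "AG"],   # MAF 0.5
    "snp2": ["CC", "CC", "CC", "CC"],   # monomorphic
    "snp3": ["TT"] * 7 + ["TC", "--"],  # MAF 1/16
}
n_poly, shares = summarize(snp_calls)

# The percentage quoted in the text follows from the reported counts.
pct_polymorphic = round(59502 / 77774 * 100, 2)
```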

To evaluate the reproducibility and the distinguishability of the CottonSNP80K array, three parent/F1 combinations, several mutants and their corresponding donors with similar genetic backgrounds, and the replicated cotton varieties were used for genotyping analysis. In the replication analysis, technical DNA replicates showed perfect consistency (100%) and different levels of variability were detected with biological replication types. Considerable genetic variation existed in cotton lines of different origins, which might be related to the cross-pollination nature of upland cotton with a 10–15% natural hybridization rate. Similarly, this variation was also found in other array projects [16, 35]. Thus, the CottonSNP80K array provides an efficient tool to characterize the inconsistencies between upland cotton accessions despite their similar genetic backgrounds. In addition, the CottonSNP80K array presented an excellent ability to detect heterozygous loci with an accuracy of more than 98%. Taken together, these findings suggest that the CottonSNP80K array is of high quality with high levels of convenience and cost-effectiveness, and could be widely used in diverse types of research.
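The text does not say how replicate consistency or heterozygous-call accuracy were computed. One common approach is sketched below: a concordance rate between two sets of calls, plus the genotype expected in an F1 when both parents are homozygous. All names and data are hypothetical.

```python
def concordance(calls_a, calls_b, missing="--"):
    """Share of SNPs with identical genotype calls in two samples.

    Heterozygous calls match regardless of allele order ('AG' == 'GA').
    SNPs missing in either sample are excluded from the comparison.
    """
    compared = matched = 0
    for snp, ga in calls_a.items():
        gb = calls_b.get(snp)
        if gb is None or ga == missing or gb == missing:
            continue
        compared += 1
        matched += sorted(ga) == sorted(gb)
    return matched / compared if compared else float("nan")


def expected_f1(parent1, parent2):
    """Expected F1 genotype when both parents are homozygous, else None."""
    if len(set(parent1)) == 1 and len(set(parent2)) == 1:
        return "".join(sorted(parent1[0] + parent2[0]))
    return None


# Hypothetical replicate calls (illustration only).
rep1 = {"s1": "AA", "s2": "AG", "s3": "CC", "s4": "--"}
rep2 = {"s1": "AA", "s2": "GA", "s3": "CT", "s4": "TT"}
rate = concordance(rep1, rep2)
```

For a parent/F1 trio, comparing the observed F1 call with `expected_f1` at loci where the parents are homozygous for different alleles gives a direct check on heterozygote detection.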

Applications of CottonSNP80K

Cotton (Gossypium spp.) is the world’s most important fiber crop plant. While most of the >50 Gossypium species are diploid (n = 13), five are allopolyploids (n = 26), originating from an interspecific hybridization event between A- and D-genome diploid species [36]. There are three wild tetraploid cotton species (G. tomentosum, G. mustelinum and G. darwinii) and two cultivated species, upland cotton (G. hirsutum) and sea-island cotton (G. barbadense). In addition, upland cotton consists of seven semi-wild races, G. hirsutum race punctatum, morrilli, yucatanense, richmondii, marie-galante, latifolium and palmeri, and the domesticated upland cotton cultivars, which constitute about 90% of the world cotton production. In this study, using the CottonSNP80K array, 5, 13 and 312 cotton accessions from wild species, G. hirsutum races and cultivated upland cotton accessions, respectively, as well as 2 sea-island cotton accessions, were selected for phylogenetic analysis. We found that the 20 accessions from wild, semi-wild and sea-island cotton clustered together, with closer genetic relationships between the 2 sea-island cotton accessions and the wild species, and more similarity between the semi-wild races and cultivated upland cotton. This might be because these SNP loci originated from intraspecific upland cotton variation. We also found that all 312 G. hirsutum accessions sub-clustered into ten groups, but these groupings were not related to their geographical distribution. A similar analysis was reported in a previous study that used 81,675 SNPs from the SLAF-seq of 355 cotton accessions, in which population structure analysis showed that the tested accessions could be separated into nine subpopulations with no obvious geographic relationship [8]. In the present study, we grouped 312 G. hirsutum accessions, including 299 modern improved Chinese cultivars/accessions and 13 introduced landraces, into ten clusters, cluster I to cluster X.
We found that the different introduced landraces were clustered in different subgroups. King (introduced in the 1920s), Foster6 (in 1933), Stoneville2B (in 1947), DPL15 (in 1950) and DPL16 (in 1970), which were successively introduced into China from the USA, were grouped into clusters IV, X, V, VI and III, respectively. In addition, Stoneville 4, which was introduced from the USA in 1934, was grouped into cluster II. Uganda3, which was introduced from Uganda in 1959, was grouped into cluster I. Junmian1, developed from the filial generation of a multi-parent cross involving C1470, C3521, and 147φ, which was introduced from the former Soviet Union in the 1960s, was grouped into cluster IX. These results suggest that the introduced upland cotton landraces have high genetic diversity, which provides rich genetic resources for breeding modern improved cotton varieties in China.
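Clustering of accessions starts from a pairwise genetic distance between samples. The excerpt does not state which distance or tree-building method was used, so the sketch below shows one simple and widely used option, an allele-sharing distance on genotypes coded as 0, 1 or 2 copies of a reference allele. The function names and the data are hypothetical.

```python
def ibs_distance(g1, g2):
    """Mean allele-sharing distance between two samples.

    Genotypes are coded as 0, 1 or 2 copies of a reference allele; None marks
    a missing call. The result lies in [0, 1]; 0 means identical genotypes at
    every SNP that was called in both samples.
    """
    diffs = [
        abs(a - b) / 2 for a, b in zip(g1, g2) if a is not None and b is not None
    ]
    return sum(diffs) / len(diffs) if diffs else float("nan")


def distance_matrix(samples):
    """Pairwise distances for a dict of sample name -> genotype list."""
    names = list(samples)
    return {(x, y): ibs_distance(samples[x], samples[y]) for x in names for y in names}


# Hypothetical genotypes for three accessions at four SNPs (illustration only).
samples = {
    "acc1": [0, 1, 2, 2],
    "acc2": [0, 1, 2, 0],
    "acc3": [2, 1, 0, None],
}
dist = distance_matrix(samples)
```

The resulting matrix can be passed to any standard hierarchical clustering or neighbor-joining routine to obtain groupings like those described above.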

Soil salinization, one of the main factors leading to soil desertification and land degradation, has become a serious threat to agricultural production and ecological environments throughout the world. More than 800 million hectares (~6%) of the world’s total land area are salt affected [37]. Compared with several other crops such as rice, maize and soybean, cotton is a pioneer crop in saline-alkali land. Based on SSR markers, some QTLs related to salt tolerance traits have been reported by family-based QTL mapping [38, 39] and association mapping approaches [30, 40,41,42]. To date, no studies have reported the identification of genes/QTLs associated with salinity tolerance in cotton using high-density SNP arrays. In previous studies, we carried out a large-scale evaluation of ten salt tolerance related traits in 304 upland cotton cultivars/accessions [30]. Here, we integrated 288 cotton accessions with genome-wide SNP genotyping data, and a total of 54,588 SNPs (MAF > 0.05) were used for GWAS analysis of salt stress related traits. We detected eight significant SNPs for three salt stress traits at the P < 1 × 10⁻⁵ level (Table 4, Fig. 6). Further, in the peak regions of these SNPs, 36 and 21 genes were annotated as response to stimulus and response to stress, respectively. Of these candidate stress response genes, mitogen-activated protein kinase kinase 5 (MKK5), which acts upstream of MPK3/MPK6, plays key roles in mediating many different stress signals and in plant development [43]. Overexpression of MKK5 in wild-type plants enhanced tolerance to salt treatments, while mkk5 mutants exhibited hypersensitivity to salt stress during germination on salt-containing media in Arabidopsis [44]. Another candidate gene, lesion simulating disease 1 (LSD1), together with phytoalexin deficient 4 (PAD4) and enhanced disease susceptibility 1 (EDS1), forms a molecular hub that integrates plant responses to several stresses such as hypoxia [45], drought [46], cold [47] and oxidative stress [48]. A reactive oxygen species (ROS) scavenging related gene, peroxidase (POD), was also found amongst the candidate genes.
ROS, typically in the forms of H2O2 and O2−, can be rapidly generated in plants exposed to adverse environments, such as high salinity, drought, heat or cold. An excess of ROS leads to oxidative damage of cellular components, such as proteins, lipids, carbohydrates and nucleic acids. High ROS levels also disturb protein synthesis and the cellular membrane, resulting in cellular and tissue damage [49]. Therefore, ROS scavenging capability is closely related to plant stress tolerance. Transcription factors (TFs) are considered upstream regulatory proteins that play a major role in cellular metabolism and abiotic stress responses. One TF, homeobox 7 (HB7), which encodes a putative transcription factor containing a homeodomain closely linked to a leucine zipper motif, was found in a region associated with relative germination rate (RGR) in this study. AtHB7 has essential functions in the primary response to drought, as a mediator of a negative feedback effect on ABA signaling in the plant response to water deficit [50]. Independent expression of AtHB7 resulted in improved stress tolerance in Arabidopsis [51]. In summary, with the identification of increasing numbers of genes/QTLs related to important traits, high-throughput genotyping platforms will provide an effective pathway for high-resolution dissection of complex traits and molecular breeding by design in cotton.
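The GWAS result above rests on a per-SNP significance cutoff of P < 1 × 10⁻⁵ applied to 54,588 tested SNPs. As a rough illustration of what that cutoff means, the sketch below filters a table of per-SNP p-values at that level and computes the stricter Bonferroni threshold for the same number of tests for comparison. The SNP identifiers and p-values are hypothetical, and the association model that produced the study's p-values is not shown here.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that controls family-wise error at alpha."""
    return alpha / n_tests


def significant_snps(p_values, threshold=1e-5):
    """Return SNP ids with p below the threshold, most significant first."""
    hits = [(snp, p) for snp, p in p_values.items() if p < threshold]
    return [snp for snp, _ in sorted(hits, key=lambda item: item[1])]


# Hypothetical association results (illustration only).
p_values = {"snpA": 3.2e-6, "snpB": 4.0e-3, "snpC": 8.5e-7, "snpD": 1.0e-5}
hits = significant_snps(p_values)      # threshold used in the text
strict = bonferroni_threshold(54588)   # number of SNPs tested in the text
strict_hits = significant_snps(p_values, threshold=strict)
```

The 1 × 10⁻⁵ cutoff is more permissive than the Bonferroni value for this number of tests, so SNPs passing only the looser cutoff are best read as candidates for follow-up.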

With reduced computational requirements for downstream data processing, high call frequency, low error rates and ease of use, high-density SNP arrays are an attractive genotyping tool that is widely applied in diversity studies and high-resolution dissection of complex traits [34], variety verification, trait introgression [10], and genome-wide association studies [18]. Compared with genome-wide sequencing technology, the SNP loci in an array are known and addressable, and data generation and analysis are more convenient and cost-effective. In this study, CottonSNP80K worked efficiently for variety verification and genome-wide association studies of salt stress traits, indicating great application potential in future cotton molecular breeding.

Electronic supplementary material


Additional file 1: Comparison of effect of different preprocessing steps. A detailed comparison of calling results with different preprocessing steps in terms of dbSNP rate, Ti/Tv ratio, novel Ti/Tv ratio and NRD for all regions, inside target regions, outside ≤ 200 bp regions, and outside > 200 bp regions from Illumina whole-exome sequencing data. Raw (blue), filterY (green), trim (black) and filterY&trim (red). (PDF 196 KB)


Additional file 2: Comparison of effect of marking duplication, realignment and recalibration. A detailed comparison of results using different steps, marking duplication, realignment and recalibration, in terms of dbSNP rate, Ti/Tv ratio, novel Ti/Tv ratio and NRD for all regions, inside target regions, outside ≤ 200 bp regions, and outside > 200 bp regions from Illumina whole-exome sequencing data. Initial alignment (black), marking duplication (yellow), realignment (violet), recalibration (blue), marking duplication followed by realignment (red), marking duplication followed by realignment and recalibration (brown). (PDF 285 KB)


Additional file 3: Comparison of effect of different arrangements of marking duplication, realignment and recalibration. A detailed comparison of results by arranging three steps, marking duplication, realignment and recalibration, in different orders in terms of dbSNP rate, Ti/Tv ratio, novel Ti/Tv ratio and NRD for all regions, inside target regions, outside ≤ 200 bp regions, and outside > 200 bp regions from Illumina whole-exome sequencing data. Marking duplication followed by realignment and recalibration (red), marking duplication followed by recalibration and realignment (red), realignment followed by recalibration and marking duplication (gray). (PDF 146 KB)
