WHICHLOCI

Executable application for determining relative discriminatory power among candidate genetic loci.

Download WHICHLOCI [zipped]

Requirements: This program will run on legacy Windows operating systems 95, 98, 2000 or NT (including Macintosh emulations of these operating systems) and has no specific hardware requirements.

The marked increase in the amount of information yielded by highly polymorphic molecular marker types such as microsatellites has significantly increased resolving power for discrimination among closely related populations. Together with increased automation of techniques for resolving genetic variation, this has produced a wealth of new information. Individual-based methods for assessing most likely population origin are among the innovative statistical techniques emerging to take advantage of this information (Paetkau et al. 1995; Waser and Strobeck 1998; Banks and Eichert 2000). The program WHICHLOCI applies these individual-based population assignment methods to evaluate the loci themselves. Trial assignments with loci one at a time allow ranking of loci in terms of their efficiency for correct population assignment and, conversely, their propensity to cause false assignments. Subsequent trials with increasing numbers of loci determine the minimum number of loci, and which specific loci, are required to attain a defined power for population assignment.

Input File: The program requires data from the populations under consideration, listed either as genotypes per sample (in the same format used for GENEPOP; Raymond and Rousset 1995) or as allele frequencies per population (in the same format as allele frequency files created in WHICHRUN; Banks and Eichert 2000). The program is written to analyze co-dominant as well as haploid data.

Theory and Program Outline: A resample option allows creation of test data for all populations under consideration. Computer-generated random numbers specify sampling from an allele table created from the frequency data for each population. This table consists of an array of the alleles observed in each population, with each allele repeated in proportion to its observed frequency in that population. The user defines how many samples to generate in this manner and has the option to vary sample size among populations.
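The resampling step described above can be sketched as follows. This is a minimal illustration and not the WHICHLOCI source code: the function names, the `table_size` parameter, the allele labels and the frequencies are all assumptions made for the example, and it handles a single diploid locus only.

```python
import random

def build_allele_table(allele_freqs, table_size=1000):
    """Build an array in which each allele is repeated in proportion to its frequency.

    allele_freqs maps allele label -> frequency in one population (hypothetical input).
    table_size is an assumed resolution for the table, not a WHICHLOCI setting.
    """
    table = []
    for allele, freq in allele_freqs.items():
        table.extend([allele] * round(freq * table_size))
    return table

def resample_genotypes(allele_table, n_samples, rng):
    """Generate n_samples diploid genotypes by drawing two table entries at random."""
    return [(rng.choice(allele_table), rng.choice(allele_table))
            for _ in range(n_samples)]

# Hypothetical allele frequencies for one locus in one population.
rng = random.Random(42)
table = build_allele_table({"101": 0.5, "103": 0.3, "105": 0.2}, table_size=10)
genotypes = resample_genotypes(table, 5, rng)
```

Sample size can be varied among populations by calling `resample_genotypes` with a different `n_samples` for each population's table.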

Optimum loci combinations that will match user-defined accuracy for population discrimination are determined through two basic procedures. First, repeated iterations of assignment of test data, using the method applied in WHICHRUN (Banks and Eichert 2000), are performed with data from each locus separately, scoring the number of correct assignments to the appropriate source population for each locus. A rank order for successful assignment is thus determined among all loci. A second round of iterations draws loci from this ranking, increasing the number of loci one at a time until the correct assignment score matches the accuracy criterion set by the user.

The above description covers the procedure for scoring accuracy across all populations. An alternate critical population routine allows focus on accuracy of assignment to a specific population. Iterations using data from each locus separately occur as above, but loci are scored according to how many of the trial samples from the critical population are assigned correctly. The number of samples that originate from other populations but are falsely assigned to the critical population is also tallied. Rank order under the critical population routine is determined by applying the following formula:

LocusScore = % correctly assigned - (% incorrectly assigned * scoreMultiplier)

where:

% correctly assigned = % of members of the critical population that were correctly assigned

% incorrectly assigned = # from other populations assigned to critical population / # from other populations

scoreMultiplier = (100 - User specified accuracy) / User specified inaccuracy

This allows the user to weight correct assignment or misses according to how important accuracy or inaccuracy might be to the application at hand. An allele frequency differential following methods described in Shriver et al. (1997) can also be implemented as an alternate means of ranking loci. As above, a second round of iterations determines empirically how many of which loci are required to match accuracy criteria.
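The critical population ranking and the subsequent accumulation of loci can be sketched as follows. This is an illustration under stated assumptions, not the WHICHLOCI implementation: all function names are hypothetical, the incorrect-assignment term is assumed to be expressed as a percentage like the correct-assignment term, and the tallies and panel accuracies are invented placeholders standing in for real trial assignments.

```python
def locus_score(n_correct, n_critical, n_false, n_other,
                target_accuracy, tolerated_inaccuracy):
    """LocusScore = % correct - (% incorrect * scoreMultiplier)."""
    pct_correct = 100.0 * n_correct / n_critical
    # Assumption: the ratio in the formula is scaled to a percentage.
    pct_incorrect = 100.0 * n_false / n_other
    score_multiplier = (100.0 - target_accuracy) / tolerated_inaccuracy
    return pct_correct - pct_incorrect * score_multiplier

def rank_loci(scores):
    """Return locus names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def minimal_panel(ranked_loci, accuracy_fn, target):
    """Add loci in rank order until accuracy_fn(panel) reaches the target."""
    panel, accuracy = [], 0.0
    for locus in ranked_loci:
        panel.append(locus)
        accuracy = accuracy_fn(panel)
        if accuracy >= target:
            break
    return panel, accuracy

# Hypothetical single-locus tallies: (correct of 100 critical, false of 200 other).
tallies = {"locusA": (80, 20), "locusB": (90, 60), "locusC": (70, 2)}
scores = {name: locus_score(c, 100, f, 200, 95.0, 5.0)
          for name, (c, f) in tallies.items()}
ranked = rank_loci(scores)

# Hypothetical multi-locus accuracies; real values would come from trial assignments.
trial_accuracy = {("locusA",): 80.0,
                  ("locusA", "locusC"): 93.0,
                  ("locusA", "locusC", "locusB"): 97.0}
panel, accuracy = minimal_panel(ranked, lambda p: trial_accuracy[tuple(p)], 90.0)
```

In this toy case locusB assigns the most critical-population samples correctly but ranks last because of its false assignments, which is the behaviour the score multiplier is meant to control.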

There has been increasing interest in the estimation of confidence intervals for assignment results from individual-based methods. Accuracy of this estimation is closely linked to the accuracy of allele frequency information for the populations under consideration, and is addressed by ensuring that sample sizes among baseline populations match the estimates required to provide accurate allele frequencies for polymorphic marker types (see Banks et al. 2000). The issue of confidence interval estimation in the context of population assignment, however, becomes multidimensional, given a comparison between alternate likelihoods that a sample may come from each of the populations under study. The critical population routine presented above provides a convenient means of summarizing these multidimensional likelihoods from the perspective of the critical population. WHICHLOCI provides a means for creating multiple trial data sets. Summary statistical parameters such as variance, standard deviation and standard error across results from each data set are determined following typical formulae (Sokal and Rohlf 1995). A sub-routine written in WHICHLOCI allows users to bypass the loci ranking routine to determine assignment accuracy, variance, standard deviation and standard error for a user-selected bank of loci.
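The summary statistics across trial data sets can be sketched as follows. This is a minimal example that assumes the usual sample formulae (variance with an n - 1 denominator, standard error as standard deviation divided by the square root of n); the function name and the accuracy values are hypothetical.

```python
import math
import statistics

def summarize_trials(accuracies):
    """Summarize assignment accuracy (in %) across several trial data sets."""
    n = len(accuracies)
    variance = statistics.variance(accuracies)  # sample variance, n - 1 denominator
    std_dev = math.sqrt(variance)
    return {
        "mean": statistics.mean(accuracies),
        "variance": variance,
        "std_dev": std_dev,
        "std_error": std_dev / math.sqrt(n),
    }

# Hypothetical % correct assignment from five resampled trial data sets.
summary = summarize_trials([90.0, 92.0, 94.0, 96.0, 98.0])
```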

We thus present an empirical method for determining which specific combination of loci would most likely provide defined population assignment power for individuals as well as statistical bounds on the performance of any particular group of loci. Our hope is that this method will allow researchers to maximize power limits in focused population assignment contexts.

Authors: Michael A. Banks (1), Will Eichert (1) and J.B. Olsen (2)

(1) Bodega Marine Laboratory, University of California Davis, Bodega Bay, CA 94923 USA

(2) Gene Conservation Laboratory, Alaska Department of Fish and Game, 333 Raspberry Road, Anchorage, Alaska 99518-1599

Email: Michael Banks, Jeff Olsen

Note: This program is under review for Bioinformatics under the title: Which Loci Have The Diagnostic Power You Need?

Acknowledgments: From The Bodega Marine Laboratory, University of California at Davis, P.O. Box 247, Bodega Bay and The Gene Conservation Laboratory, Alaska Department of Fish and Game, USA. Research and development of WHICHLOCI was supported by funds obtained from CALFED and the California Department of Water Resources.

References

Banks, M.A., Rashbrook, V.K., Calavetta, M.J., Dean, C.A. and Hedgecock, D. (2000) Analysis of microsatellite DNA resolves genetic structure and diversity of chinook salmon in California’s Central Valley. Can. J. Fish. Aquat. Sci. 57:915-927.

Banks, M.A. and Eichert, W. (2000) WHICHRUN (version 3.2): A computer program for population assignment of individuals based on multilocus genotype data. J. of Hered. 91:87-89.

Raymond, M. and Rousset, F. (1995) GENEPOP (Version 1.2): Population genetics software for exact tests and ecumenicism. J. of Hered. 86:248-250.

Paetkau, D., Calvert, W., Stirling, I. and Strobeck, C. (1995) Microsatellite analysis of population structure in polar bears. Mol Ecol 4:347-354.

Shriver, M.D., Smith, M.W., Jin, L., Marcini, A., Akey, J.M., Deka, R. and Ferrell, R.E. (1997) Ethnic-affiliation estimation by use of population-specific DNA markers. Amer. J. Hum. Genet. 60:957-964.

Sokal, R.R. and Rohlf, F.J. (1995) Biometry. San Francisco: W.H. Freeman.

Waser, P.M. and Strobeck, C. (1998) Genetic signatures of interpopulation dispersal. Trends Ecol. Evol. 13:43-44.

WHICHPARENTS

Executable application for determining the most likely parents of offspring, using multilocus genotype data. If parental mating history is known, this program also makes use of that information.

Download WHICHPARENTS [zipped]

Requirements: This program will run on legacy Windows operating systems 95, 98, 2000 or NT (including Macintosh emulations of these operating systems) and has no specific hardware requirements.

Authors: Written by Will Eichert for Dennis Hedgecock, Bodega Marine Laboratory, University of California Davis

WHICHRUN 4.1

Executable application for population assignment of individuals based on multilocus genotype data.

If you have Whichrun 4.0, be sure to download version 4.1, which corrects an error that was affecting the jackknife results.

Download WHICHRUN 4.1 [zipped]

Requirements

This program will run on legacy Windows operating systems 95, 98, 2000 or NT (including Macintosh emulations of these operating systems) and has no specific hardware requirements.

Microsatellite DNA provides essentially limitless, highly varied information within species. That this provides a means for distinguishing not only among populations but also individuals has not escaped current theoretic interest (Smouse and Chevillon 1998, Waser and Strobeck 1998). Here, we present a C++ computer program named WHICHRUN that uses multilocus genotypic data to allocate individuals to their most likely source population.

Input File: WHICHRUN requires baseline genotype data for all potential source populations as well as genotype data for candidate individuals for which population origin is to be determined. Data should be provided in simple ASCII format as required for GENEPOP (Raymond and Rousset 1995, http://www.cefe.cnrs-mop.fr/). WHICHRUN's help file describes preparing input data in detail, and the download includes a sample baseline, unknown and output file.
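For orientation, a reader for this kind of input might look like the sketch below. It assumes the commonly described GENEPOP layout (a title line, locus names, then a "Pop" line before each population, with each individual written as an identifier, a comma, and one allele-pair code per locus); the sample text and the function name are invented for illustration, and WHICHRUN's help file remains the authority on the exact format it accepts.

```python
def parse_genepop(text):
    """Parse GENEPOP-style text into (title, loci, populations).

    populations is a list (one entry per 'Pop' block) of
    (sample_id, {locus: (allele1, allele2)}) tuples.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title, loci, populations = lines[0], [], []
    for line in lines[1:]:
        if line.lower() == "pop":
            populations.append([])          # start a new population block
        elif not populations:
            # Locus names appear before the first Pop line, one per line
            # or comma-separated on a single line.
            loci.extend(n.strip() for n in line.split(",") if n.strip())
        else:
            sample_id, _, genotype_text = line.partition(",")
            genotypes = []
            for code in genotype_text.split():
                half = len(code) // 2       # e.g. "0102" -> ("01", "02")
                genotypes.append((code[:half], code[half:]))
            populations[-1].append((sample_id.strip(), dict(zip(loci, genotypes))))
    return title, loci, populations

# Invented example data in the assumed layout.
example = """Example baseline
locus1
locus2
Pop
fish1 , 0101 0203
fish2 , 0102 0303
Pop
fish3 , 0202 0103
"""
title, loci, populations = parse_genepop(example)
```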

Theory and Program Outline: It is assumed that each baseline population ($B_1 \ldots B_k$) has Hardy-Weinberg-Castle (H-W-C) genotype frequencies and that the genetic loci employed are independent. The likelihood that an individual sample ($s_1 \ldots s_n$) comes from each of the source populations ($B_1 \ldots B_k$) is taken to equal the H-W-C frequency of its specific genotype at each locus in the respective source population. Thus, for homozygotes, the likelihood that a sample $s_1$ is an element of baseline population $B_1$ ($s_1 \in B_1$) is $p_1^2$ (the square of its allele frequency $p_1$ in population $B_1$); for heterozygotes, the likelihood that $s_2 \in B_1$ is $2 p_1 q_1$ ($q_1$ being the frequency of the alternate allele in population $B_1$); and in general the likelihood that $s_n \in B_k$ is $p_k^2$ or $2 p_k q_k$. Likelihood values for each locus are multiplied to give a series of multi-locus likelihood functions for assignment to each of the source populations. Alternate hypotheses that the individual samples in question may come from each source population are considered in three ways:

  1. Multi-locus likelihood functions may be grouped to form ratios considering all possible pairs of baseline populations under consideration. If the ratio of the most likely allocation to the second most likely allocation approaches one, there is ambiguity in the assignment of the particular sample under study. Conversely, samples for which this ratio is large in comparison to all other ratios can be assigned to a single population with more confidence. For the two populations considered in the ratio, the chance of error is equal to the inverse of this ratio. Stringency for population allocation can be applied by defining a selection criterion for the $\log_{10}$ of this ratio. For example, by selecting only assignments that have a log of the odds (LOD) ratio of at least 2, all results will have a 1/100 chance of error or less.
  2. Multi-locus likelihood functions may be expressed relative to the maximum likelihood according to the equation L(n)/L(max), where L(n) is the likelihood for population n and L(max) is the largest likelihood among the populations. This yields a series of ratios between 1 (most likely) and close to 0 (least likely). Analysis of variance of log-transformed data followed by Tukey's multiple comparison test enables evaluation of statistical significance in the classical sense.
  3. Jackknife iterations provide an empirical means for evaluating baseline data and the chances of correct allocation. Iterations sample individuals from the baseline one at a time, recalculating allele frequencies in the absence of each individual genotype sampled before determining most likely population origin for that individual. Experimenting with alternate loci and populations enables one to determine which population comparisons and loci combinations enable reliable population re-allocation.
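The likelihood calculation and the first two ways of comparing hypotheses can be sketched as follows. This is an illustrative reimplementation, not WHICHRUN's C++ code: the function names, population labels and allele frequencies are invented, and every allele is assumed to have a nonzero frequency in every baseline.

```python
import math

def genotype_likelihood(genotype, freqs):
    """H-W-C genotype frequency: p^2 for homozygotes, 2pq for heterozygotes."""
    a, b = genotype
    p, q = freqs[a], freqs[b]
    return p * p if a == b else 2.0 * p * q

def multilocus_likelihood(individual, baseline):
    """Multiply single-locus likelihoods, assuming independent loci."""
    likelihood = 1.0
    for locus, genotype in individual.items():
        likelihood *= genotype_likelihood(genotype, baseline[locus])
    return likelihood

def assign(individual, baselines):
    """Return (best population, LOD vs. runner-up, L(n)/L(max) per population)."""
    likes = {pop: multilocus_likelihood(individual, freqs)
             for pop, freqs in baselines.items()}
    ranked = sorted(likes, key=likes.get, reverse=True)
    best, second = ranked[0], ranked[1]
    lod = math.log10(likes[best] / likes[second])
    relative = {pop: likes[pop] / likes[best] for pop in likes}
    return best, lod, relative

# Hypothetical baseline allele frequencies for two populations and two loci.
baselines = {
    "B1": {"locus1": {"A": 0.9, "B": 0.1}, "locus2": {"C": 0.8, "D": 0.2}},
    "B2": {"locus1": {"A": 0.1, "B": 0.9}, "locus2": {"C": 0.2, "D": 0.8}},
}
individual = {"locus1": ("A", "A"), "locus2": ("C", "D")}
best, lod, relative = assign(individual, baselines)
```

Here the likelihood ratio is 81, so the LOD score falls just short of 2 and this sample would be excluded under a LOD criterion of at least 2.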

Reporting options and special cases

Sample ID, genotypic data, and multilocus likelihoods for population allocation can be displayed for verification. A critical population routine allows one to select a target population for calculation of LOD scores. All scores are then calculated with the critical population as the numerator in the ratio. A special case where test samples may have an allele or pair of alleles not observed in one or all of the baseline populations is treated as follows. For source populations in which the allele is not observed an estimated allele frequency of 1/(2N + 1) is applied. This hypothesizes that the non-observance of the allele in question is due to sampling error and that the allele in question would have been observed in the baseline population if one more allele had been sampled. Note that this estimation may introduce substantial bias if baseline population size (N) is small as would be likely for any allele frequency estimation given small N, particularly when dealing with highly polymorphic marker types. The program implements a warning describing this consideration when small baseline population sizes (N < 30) are encountered. Alternatively, if sampling error is low, an unknown sample allele not observed in a baseline population may constitute strong evidence that the sample in question may indeed not originate from the particular baseline population under consideration. Any alleles for which the 1/(2N+1) estimation is necessary are noted on the genotype output.
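The treatment of unobserved alleles can be sketched as follows. This is a minimal illustration that assumes N is the number of diploid individuals in the baseline sample (so 2N alleles were sampled); the function names, the returned flag and the counts are hypothetical.

```python
SMALL_BASELINE_THRESHOLD = 30  # baseline sizes below this trigger the warning

def allele_frequency(allele, allele_counts, n_individuals):
    """Return (frequency, was_estimated) for one allele in one baseline.

    Observed alleles use count / 2N. Unobserved alleles receive 1 / (2N + 1),
    as if the allele would have appeared had one more allele been sampled.
    """
    total_alleles = 2 * n_individuals
    count = allele_counts.get(allele, 0)
    if count > 0:
        return count / total_alleles, False
    return 1.0 / (total_alleles + 1), True

def needs_small_sample_warning(n_individuals):
    """True when the baseline is small enough for the estimate to be badly biased."""
    return n_individuals < SMALL_BASELINE_THRESHOLD

# Hypothetical baseline of 20 individuals (40 alleles) at one locus.
counts = {"101": 30, "103": 10}
observed = allele_frequency("101", counts, 20)
unobserved = allele_frequency("105", counts, 20)
warn = needs_small_sample_warning(20)
```

With only 20 individuals the substituted frequency (1/41, about 0.024) is comparable to that of a genuinely rare allele, which illustrates why the bias matters for small baselines.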

A technique such as WHICHRUN will only be effective if there is reasonable reproductive isolation among the populations under consideration. Three other considerations are also important. First, the rate of accumulation of variance for the molecular loci employed should be closely matched with estimated divergence times among the populations under study. For example, highly polymorphic microsatellites prone to homoplasy would not be suited for diagnosis among populations that have been diverged for substantial evolutionary time. However, highly polymorphic microsatellites are likely one of a few molecular marker types that have sufficient information to resolve diagnosis among recently diverged populations such as the global radiation of Drosophila melanogaster, which is estimated to have occurred within the last 10,000 to 15,000 years (David and Capy 1988; Bénassi and Veuille 1995). Second, the accuracy of determination is crucially dependent upon the lack of differential sampling error among baseline allele frequencies. While this problem is partially addressed through ensuring that sample size is equal for all populations, highly polymorphic marker types such as microsatellites require substantial sampling. Third, for population origin diagnoses where source populations are recently diverged, there will be a number of loci that have not accumulated differences in the time since divergence. As a result, simply increasing the number of loci employed may not necessarily increase the power of diagnosis. For closely related populations, additional loci that have marked differences in allele frequency profiles among populations will be necessary to achieve increased power.

Authors: Michael A. Banks and Will F. Eichert, Bodega Marine Laboratory, University of California at Davis, Bodega Bay, CA 94923 USA

Published reference: Banks, M.A. and W. Eichert. 2000. WHICHRUN (Version 3.2) a computer program for population assignment of individuals based on multilocus genotype data. Journal of Heredity. 91:87-89. Note: Copyright has been awarded to the American Genetics Association.

Email: Michael Banks

Acknowledgments: From The Bodega Marine Laboratory, University of California at Davis, P.O. Box 247, Bodega Bay, USA. We thank V.K. Rashbrook, H. A. Fitzgerald and J. Olsen for beta testing various versions of this program and for a number of useful suggestions and improvements that resulted from our collaboration. We are also grateful to F.J. Saminiego for discussion on statistical aspects during the development of WHICHRUN. Research and development of WHICHRUN was supported by funds obtained from the California Department of Water Resources and the US Fish and Wildlife Service.

References

Bénassi, V., and Veuille, M. 1995. Comparative population structuring of molecular and allozyme variation of Drosophila melanogaster Adh between Europe, West Africa and East Africa. Genetical Research. 65:95-103.

David, J.R., and Capy, P. 1988. Genetic variation of Drosophila melanogaster natural populations. Trends in Genetics. 4:106-111.

Raymond, M., and Rousset, F. 1995. GENEPOP (Version 1.2): Population genetics software for exact tests and ecumenicism. Journal of Heredity. 86:248-250.

Smouse, PE, and Chevillon, C. 1998. Analytical aspects of population-specific DNA fingerprinting for individuals. Journal of Heredity. 89:143-150.

Waser, PM, and Strobeck, C. 1998. Genetic signatures of interpopulation dispersal. Trends in Ecology and Evolution. 13:43-44.