\name{Bruvo.distance}
\alias{Bruvo.distance}
\title{Genetic Distance Metric of Bruvo et al.}
\description{
This function calculates the distance between two individuals at one microsatellite locus using the method of Bruvo et al. (2004).
}
\usage{Bruvo.distance(genotype1, genotype2, maxl=9, usatnt=2, missing=-9)}
\arguments{
\item{genotype1}{A vector of alleles for one individual at one locus. Allele length is in nucleotides. Each unique allele corresponds to one element in the vector, and the vector is no longer than it needs to be to contain all unique alleles for this individual at this locus.}
\item{genotype2}{A vector of alleles for another individual at the same locus.}
\item{maxl}{If both individuals have more than this number of alleles at this locus, NA is returned instead of a numerical distance.}
\item{usatnt}{Length of the repeat at this locus. For example, usatnt=2 for dinucleotide repeats and usatnt=3 for trinucleotide repeats. If the alleles in genotype1 and genotype2 are expressed in repeat count instead of nucleotides, set usatnt=1.}
\item{missing}{A numerical value that, when in the first allele position, indicates missing data. NA is returned if this value is found in either genotype, or if either genotype has a length of zero.}
}
\details{
Since allele copy number is frequently unknown in polyploid microsatellite data, Bruvo et al. developed a measure of genetic distance similar to band-sharing indices used with dominant data, but taking into account mutational distances between alleles. A matrix is created containing all differences in repeat count between the alleles of two individuals at one locus. These differences are then geometrically transformed to reflect the probabilities of mutation from one allele to another. The matrix is then searched for the one-to-one pairing of alleles between the two individuals that gives the minimum sum of transformed differences. This sum is divided by the number of alleles per individual.
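As a rough illustration of the steps just described, the following R sketch handles only two genotypes with the same number of alleles. The name bruvo_sketch is hypothetical and is not part of polysat, and the transformation 1 - 2^(-|x|) is assumed from Bruvo et al. (2004) rather than taken from the Bruvo.distance source.

```r
# Hypothetical helper, not part of polysat: brute-force sketch for two
# genotypes that contain the same number of alleles.
bruvo_sketch <- function(g1, g2, usatnt = 2) {
  stopifnot(length(g1) == length(g2), length(g1) > 0)
  n <- length(g1)
  # Matrix of absolute differences in repeat count between all allele pairs
  repeat_diff <- abs(outer(g1, g2, "-")) / usatnt
  # Geometric transformation (assumed form): 0 for identical alleles,
  # approaching 1 as the number of repeats separating them grows
  geo <- 1 - 2^(-repeat_diff)
  # Recursively try every one-to-one pairing and keep the smallest sum
  min_sum <- function(row, free_cols) {
    if (row > n) return(0)
    best <- Inf
    for (j in free_cols) {
      total <- geo[row, j] + min_sum(row + 1, setdiff(free_cols, j))
      if (total < best) best <- total
    }
    best
  }
  # Divide the minimum sum by the number of alleles per individual
  min_sum(1, seq_len(n)) / n
}

bruvo_sketch(c(202, 206), c(204, 206), usatnt = 2)
```

Because every pairing is enumerated, the number of evaluations in this sketch grows with the factorial of the number of alleles.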
If one genotype has more alleles than the other, 'virtual alleles' must be created so that both genotypes are the same length. There are three options for the value of these virtual alleles, but Bruvo.distance implements only the simplest one, which assumes that it is not known whether differences in ploidy arose from genome addition or genome loss. Virtual alleles are set to infinity, such that the geometric distance between any allele and a virtual allele is 1.
}
\value{
A number ranging from 0 to 1, with 0 indicating identical genotypes, and 1 being a theoretical maximum distance if all alleles from genotype1 differed by an infinite number of repeats from all alleles in genotype2. NA is returned if both genotypes have more than maxl alleles or if either genotype has the symbol for missing data as its first allele.
}
\references{
Bruvo, R., Michiels, N. K., D'Souza, T. G., and Schulenburg, H. (2004) A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. \emph{Molecular Ecology} \bold{13}, 2101-2106.
}
\note{
The processing time grows with the factorial of the number of alleles, since each possible combination of allele pairs must be evaluated. The maxl argument exists to skip such cases: on my personal computer, if both genotypes had more than nine alleles, the calculation could take an hour or more, and so nine is the default limit. When the limit is exceeded, Bruvo.distance returns NA. For genotypes with that many alleles, it may be more efficient to estimate distances manually by creating the matrix in Excel and visually picking out the shortest distances between alleles.
}
\seealso{ }
\examples{
Bruvo.distance(c(202,206,210,220),c(204,206,216,222))
Bruvo.distance(c(202,206,210,220),c(204,206,216,222),usatnt=4)
Bruvo.distance(c(202,206,210,220),c(204,206,222))
Bruvo.distance(c(202,206,210,220),c(204,206,216,222),maxl=3)
Bruvo.distance(c(202,206,210,220),c(-9))
}
\author{Lindsay V.
Clark}
\keyword{arith}

\name{read.GeneMapper}
\alias{read.GeneMapper}
\title{Read GeneMapper Genotypes Tables}
\description{
Given a list of tab-delimited text files containing genotype data and a list of loci that correspond to the files, read.GeneMapper produces a list of genotypes that can be read by other functions in the polysat package.
}
\usage{
read.GeneMapper(infiles, loci)
}
\arguments{
\item{infiles}{A character vector of paths to the files to be read.}
\item{loci}{A character vector of names of the loci. infiles[x] contains the data for loci[x].}
}
\value{
The object produced is a list of lists of vectors. The top level is a list of loci, and each locus is a list of genotypes. The genotypes are indexed by the names of the loci and samples. For example, mygenotypedata$locus1$individual1 would be a numeric vector of all alleles belonging to individual1 at locus1.
}
\details{
read.GeneMapper can read the genotypes tables that are exported by the Applied Biosystems GeneMapper software. The only alterations to the files that the user may have to make are 1) make sure that each file contains all the data for one locus, and no other data, 2) delete any rows with missing data or (preferably) fill in a numerical missing data symbol of your choice (such as -9) in the first allele slot for that row, 3) make sure that all allele names are numeric representations of fragment length (no question marks or dashes), and 4) put sample names into the \emph{Sample Name} column, if the names that you wish to use in analysis are not already there. Each file should have the standard header row produced by the software.

The file format is simple enough that the user can easily create files manually if GeneMapper is not the software used in allele calling. The files are tab-delimited text files. There should be a header row with column names. The column labeled \emph{Sample Name} should contain the names of the samples.
You can have as many or as few columns as needed to contain the alleles, and each of these columns should be labeled \emph{Allele X} where X is a number unique to each column. Row labels and any other columns are ignored. For any given sample, each allele is listed only once and is given as an integer that is the length of the fragment in nucleotides. Alleles are separated by tabs. If you have more allele columns than alleles for any given sample, leave the extra cells blank so that read.table will read them as NA. Example data files in this format are included in the package.

read.GeneMapper will read all of your data at once. It takes as its first argument a character vector containing paths to all of the files to be read. Each file should contain the data for one locus. The second argument is another character vector containing the names of the loci, in the same order as the file paths.

Sample names should be consistent between data files for ease of indexing later. However, it is not necessary for all of the data files to contain the same set of samples (for example, if you have related data for two different studies, using different sets of markers). Because the object produced by read.GeneMapper is a list of lists that is indexed by locus name and sample name, it should be easy to use index vectors to select subsets of loci and individuals for downstream analysis.
}
\references{
\url{http://www.appliedbiosystems.com/genemapper}
}
\seealso{ }
\examples{
\dontrun{
myinfiles<-c("data\\sample CBA15.txt","data\\sample CBA23.txt","data\\sample CBA28.txt")
myloci<-c("CBA15","CBA23","CBA28")
read.GeneMapper(myinfiles, myloci)
}
}
\author{Lindsay V. Clark}
\keyword{file}

\name{distance.matrix.1locus}
\alias{distance.matrix.1locus}
\title{Pairwise Genetic Distances at One Locus}
\description{
Given all genotypes for one locus, create a pairwise genetic distance matrix.
}
\usage{
distance.matrix.1locus(gendata, distmetric=Bruvo.distance, progress=TRUE, ...)
} \arguments{ \item{gendata}{A list of vectors, where each vector contains all the alleles in the genotype of one sample at this locus. names(gendata) should be the sample names corresponding to the genotypes.} \item{distmetric}{This is the function that will be used to calculate each pairwise distance. This should be a function that, given two vectors of alleles, returns a numerical distance.} \item{progress}{If TRUE, distance.matrix.1locus will print the names of sample pairs as it finishes each calculation with distmetric. For large datasets, this is intended so that the user can monitor the progress of the calculations.} \item{...}{These arguments will be passed to distmetric. For example, with Bruvo.distance, maxl, usatnt, or missing may be used.} } \value{ A symmetrical matrix of distances, with the names of samples used as row and column names. } \details{ Given a list of genotypes at one locus, distance.matrix.1locus produces a symmetrical matrix of pairwise distances between genotypes. If using a polysat genotype object such as that produced by read.GeneMapper, the gendata argument should be one of the sublists, for example mygenotypedata$locus1. The measure of distance can be any that is provided with polysat, or any function written by the user, so long as it takes genotypes as vectors of alleles (or any other type of object that is given as elements of the list gendata) and returns a numerical distance. Any arguments that need to be passed to the distmetric function can be given to distance.matrix.1locus. To save processing time, each pairwise distance is only calculated once and then written to both locations in the matrix simultaneously. The user also has the option to have each pair of sample names printed after the distance is calculated, so that progress can be monitored if evaluation is expected to take a long time. 
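The idea of computing each pair once and writing it to both positions can be sketched as follows. This is only an illustration under stated assumptions, not the polysat source: pairwise_sketch and toy_metric are hypothetical names, and the toy metric (proportion of alleles not shared) stands in for any function that takes two allele vectors and returns a number.

```r
# Hypothetical sketch, not the polysat source: fill a symmetric distance
# matrix while evaluating each pair of samples only once.
pairwise_sketch <- function(gendata, distmetric, ...) {
  samples <- names(gendata)
  n <- length(samples)
  dmat <- matrix(NA_real_, nrow = n, ncol = n,
                 dimnames = list(samples, samples))
  for (i in seq_len(n)) {
    for (j in i:n) {
      # Extra arguments are forwarded unchanged to the distance function
      d <- distmetric(gendata[[i]], gendata[[j]], ...)
      dmat[i, j] <- d  # upper triangle (and diagonal)
      dmat[j, i] <- d  # mirrored entry in the lower triangle
    }
  }
  dmat
}

# Toy metric, assumed for illustration only: proportion of alleles not shared
toy_metric <- function(g1, g2) {
  1 - length(intersect(g1, g2)) / length(union(g1, g2))
}

mygenotypes <- list(IND1 = c(124, 127, 133),
                    IND2 = c(130, 139, 145, 151),
                    IND3 = c(118, 127, 133, 154))
dm <- pairwise_sketch(mygenotypes, toy_metric)
dm
```

Looping over j from i to n roughly halves the number of calls to the distance function compared with filling every cell separately.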
}
\references{ }
\seealso{
\code{\link{Bruvo.distance}}, \code{\link{read.GeneMapper}}
}
\examples{
mygenotypes<-list(IND1=c(124,127,133),IND2=c(130,139,145,151),IND3=c(118,127,133,154))
distance.matrix.1locus(mygenotypes,usatnt=3)
}
\author{Lindsay V. Clark}
\keyword{array}

\name{mean.distance.matrix}
\alias{mean.distance.matrix}
\title{Mean Pairwise Distance Matrix}
\description{
Given a list of lists of genotypes, mean.distance.matrix produces a symmetrical matrix of pairwise distances between samples, averaged across all lists (where each list usually represents one locus).
}
\usage{
mean.distance.matrix(gendata, samples, loci, all.distances=FALSE, usatnts=NULL, ...)
}
\arguments{
\item{gendata}{A list of lists of genotypes, such as that produced by read.GeneMapper. All samples to be analyzed should be present in each list (if that locus is going to be analyzed), with the missing data symbol used if necessary.}
\item{samples}{A character vector of samples to be analyzed. These should be all or a subset of the sample names used in gendata.}
\item{loci}{A character vector of loci to be analyzed. These should be all or a subset of the loci names used in gendata.}
\item{all.distances}{If FALSE, only the mean distance matrix will be returned. If TRUE, a list will be returned containing an array of all distances by locus and sample as well as the mean distance matrix.}
\item{usatnts}{A numerical vector that contains the length of nucleotide repeats for each locus. For example, 3 would be used to indicate a locus with trinucleotide repeats. 1 should be used if alleles are written in terms of repeat number, not fragment length in nucleotides. names(usatnts) should be the same as names(gendata) (the names of the loci). This argument can be omitted if repeat length is irrelevant to the distance metric.}
\item{...}{If distmetric or progress are given here they will be passed to distance.matrix.1locus.
Any other arguments will be passed to distmetric.} } \value{ A symmetrical matrix containing pairwise distances between all samples, averaged across all loci. Row and column names of the matrix will be the sample names provided in the samples argument. If all.distances=TRUE, a list will be produced containing the above matrix as well as a three-dimensional array containing all distances by locus and sample. The array is the first item in the list, and the mean matrix is the second. } \details{ mean.distance.matrix uses distance.matrix.1locus once for each locus to be analyzed, then averages values across these matrices. Any arguments that need to be passed to distance.matrix.1locus may be given to mean.distance.matrix. If the loci are of different repeat types and the type of repeat is important for the distance metric being used (e.g. Bruvo.distance), the usatnts argument can be used to pass a different usatnt argument to distmetric depending on the locus. Because the user may want to omit samples or loci, the samples and loci arguments are given for convenient indexing of the data to be analyzed. If gendata contains only the data that the user wants to analyze, the user can simply use names to create these indices, for example: mean.distance.matrix(mygendata, names(mygendata[[1]]), names(mygendata)) The samples argument is also important for making sure that all distances are calculated in the same order for each call of distance.matrix.1locus, so that the distances can be averaged correctly across the array. Missing data must be signified by using a missing data symbol (such as -9 for the first allele), rather than having that sample be absent from one of the lists in gendata. 
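The averaging step can be sketched as follows, assuming the per-locus matrices have already been computed. This is an illustration only, not the polysat source: mean_matrix_sketch is a hypothetical name, the two small input matrices are made-up values, and handling of NA distances is deliberately left out.

```r
# Hypothetical sketch, not the polysat source: stack one distance matrix per
# locus into a three-dimensional array, then average over the locus dimension.
mean_matrix_sketch <- function(locus_matrices, samples) {
  loci <- names(locus_matrices)
  arr <- array(NA_real_,
               dim = c(length(loci), length(samples), length(samples)),
               dimnames = list(loci, samples, samples))
  for (loc in loci) {
    # Index by sample name so every locus is stored in the same order
    arr[loc, , ] <- locus_matrices[[loc]][samples, samples]
  }
  # Array first, mean matrix second
  list(all = arr, mean = apply(arr, c(2, 3), mean))
}

s <- c("ind1", "ind2")
m1 <- matrix(c(0, 0.2, 0.2, 0), 2, 2, dimnames = list(s, s))
m2 <- matrix(c(0, 0.6, 0.6, 0), 2, 2, dimnames = list(s, s))
res <- mean_matrix_sketch(list(locus1 = m1, locus2 = m2), s)
res$mean
```

Indexing each matrix by the same vector of sample names is what keeps the cells aligned across loci before they are averaged.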
}
\references{ }
\seealso{
\code{\link{distance.matrix.1locus}}, \code{\link{read.GeneMapper}}
}
\examples{
mygendata <- list(locus1=list(ind1=c(124,128,138),ind2=c(122,130,140,142),
                              ind3=c(122,132,136),ind4=c(122,134,140)),
                  locus2=list(ind1=c(203,212,218),ind2=c(197,206,221),ind4=c(200,218)),
                  locus3=list(ind1=c(140,144,148,150),ind2=c(-9),ind4=c(152,154,158)),
                  locus4=list(ind1=c(233,236,280)))
myloci <- c("locus1","locus2","locus3")
mysamples <- c("ind1","ind2","ind4")
myusatnts <- c(2,3,2)
names(myusatnts) <- myloci
mean.distance.matrix(mygendata, mysamples, myloci, all.distances=TRUE, usatnts=myusatnts)
}
\author{Lindsay V. Clark}
\keyword{array}