# Simple correction for sampling bias

12 Apr

**Simple correction for sampling bias**

Allele frequency databases are small compared with the populations from which they are drawn, so sampling uncertainty remains. A simple method for addressing this uncertainty, which is inherent in allele frequency databases, is suggested by Balding [20]. The allelic information in the evidential material is incorporated into the database to adjust for the potential under-representation of alleles. When there are matching DNA profiles there must be two DNA profiles: one from the crime scene and one from the reference sample. The alleles from both profiles are added to the allele frequency database. By adding both profiles we are making the assumption that the material found at the crime scene did not come from the suspect.

In the profile in Table 8.1, the vWA locus is heterozygous with alleles 14 and 17, which have frequencies of 0.0850 and 0.2500 respectively. Multiplying each allele frequency by the total number of alleles in the database shows that allele 14 was observed 34 times and allele 17 was observed 100 times among 400 alleles. We now have two profiles to add to the database, giving a total of four new alleles: 14, 17 in the crime scene sample and 14, 17 in the suspect's sample. These are added to the database and the frequencies recalculated. The database now has 36 observations of allele 14 out of a total of 404 observed alleles, which gives an allele frequency of 36/404 = 0.0891. Similarly, for allele 17 we now have 102/404, which gives an allele frequency of 0.2525. This procedure is repeated for each heterozygous locus.

In Table 8.1 the FGA locus is homozygous for allele 21, which has 71 observations out of 400 in the original database. The two profiles contribute four more observations of allele 21 (21, 21 and 21, 21), which are added to both the count for allele 21 and the total number of alleles, so the new frequency is 75/404 = 0.1856.
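The arithmetic of this correction can be sketched in a few lines of Python. This is only an illustration: the function name `balding_corrected_frequency` and its parameters are hypothetical, and the counts are the vWA and FGA figures quoted in the text (34, 100 and 71 observations out of 400 alleles).

```python
def balding_corrected_frequency(count, total, copies_added, profiles=2):
    """Allele frequency after adding the evidential profiles to the database.

    count:        observations of the allele in the original database
    total:        total number of alleles in the original database
    copies_added: copies of this allele contributed by the added profiles
    profiles:     number of profiles added (each contributes two alleles per locus)
    """
    return (count + copies_added) / (total + 2 * profiles)


# vWA 14,17 heterozygote: each allele appears once per profile, so twice in total.
vwa_14 = balding_corrected_frequency(34, 400, copies_added=2)
vwa_17 = balding_corrected_frequency(100, 400, copies_added=2)

# FGA 21,21 homozygote: allele 21 appears twice per profile, so four times in total.
fga_21 = balding_corrected_frequency(71, 400, copies_added=4)

print(f"{vwa_14:.4f} {vwa_17:.4f} {fga_21:.4f}")  # 0.0891 0.2525 0.1856
```

Because the same four alleles are added regardless of database size, the relative change shrinks as `total` grows, which is why the correction matters most for small databases and rare alleles.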
The profile is recalculated using this correction method in Table 8.2. The Balding correction for size bias has the greatest impact when the database contains a small number of alleles or when the allele is rare. If the allele is common and the database is large, the effect is negligible. The above methods both compensate for the limitations of allele frequency databases that are caused by sampling effects. Other, more complex methods, such as calculating the 95% confidence interval, can be employed but are not widely used [23, 24].

**Subpopulations**

CORRECTIONS TO ALLELE FREQUENCY DATABASES

Table 8.2 The profile frequency has been recalculated from Table 8.1 using the Balding correction for sampling bias. The impact of this correction factor is greatest on the rare alleles

In addition to correcting for sampling effects, it may also be necessary to allow for the presence of subpopulations when calculating profile frequencies. Even within populations of the same broad ethnic group, the population is not homogeneous but comprises related subpopulations. The subpopulations form because people do not mate randomly, but tend, for example, to have children with people from the same geographical area or the same social group. Allelic databases are normally composed of samples that have been drawn from the general population, and not from one subpopulation, and therefore provide an average estimate of the allele frequencies in the whole population. Ignoring subpopulations has been shown to lead to errors in the estimation of profile frequencies [25]. Within a subpopulation, individuals are more closely related to one another than they are to the population as a whole, i.e. there is a higher probability that two individuals have some genetic markers in common through descent from a common ancestor (identical by descent) than through a random match (identical by state) [26]. To incorporate this substructure into the profile frequency calculations, a theta value (θ) is used to describe the degree of differentiation between subpopulations (the amount of inbreeding) [27]. The level of population substructure, and therefore the theta values at the STR loci, have been demonstrated to be low [23, 24, 28, 29]. In general a theta value of 0.01 is used for seemingly homogeneous populations, while for more isolated or differentiated populations a theta value of 0.03 has been recommended [23]. To calculate profile frequencies that allow for subpopulations, the following equations are commonly used [21]:
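The equations most often quoted for this purpose are the Balding and Nichols forms, also given in the 1996 US National Research Council report. They are shown here on the assumption that these are the equations meant; $p_i$ and $p_j$ are the database frequencies of alleles $i$ and $j$, and $\theta$ is the theta value.

For a homozygous locus with genotype $A_iA_i$:

$$
P(A_iA_i) = \frac{\left[2\theta + (1-\theta)\,p_i\right]\left[3\theta + (1-\theta)\,p_i\right]}{(1+\theta)(1+2\theta)}
$$

For a heterozygous locus with genotype $A_iA_j$:

$$
P(A_iA_j) = \frac{2\left[\theta + (1-\theta)\,p_i\right]\left[\theta + (1-\theta)\,p_j\right]}{(1+\theta)(1+2\theta)}
$$

Setting $\theta = 0$ recovers the familiar Hardy-Weinberg values $p_i^2$ and $2p_ip_j$.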

**STATISTICAL INTERPRETATION OF STR PROFILES**

The impact of a theta value of 0.01 on this particular profile is a modest threefold increase in the profile frequency, whereas a theta value of 0.03 gives a profile frequency that is over 20 times higher, although the profile remains exceedingly rare (Table 8.3). It should be noted that the impact of applying theta to a profile frequency calculation differs between profiles. The current practice in most legal systems is to use a theta value of between 0.01 and 0.03, except in exceptional circumstances where very high levels of inbreeding may have occurred.
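To see how theta changes the frequency at individual loci, the short Python sketch below evaluates the vWA (14, 17) and FGA (21, 21) genotypes from the earlier example at theta values of 0, 0.01 and 0.03. It assumes the Balding and Nichols form of the subpopulation correction and uses the uncorrected database frequencies quoted earlier (0.0850, 0.2500 and 71/400); the function names are hypothetical, and the figures it prints are per-locus values, not the full profile frequencies of Table 8.3.

```python
def het_freq(p_i, p_j, theta):
    """Heterozygote genotype frequency with a theta correction
    (Balding-Nichols form, assumed here)."""
    numerator = 2 * (theta + (1 - theta) * p_i) * (theta + (1 - theta) * p_j)
    return numerator / ((1 + theta) * (1 + 2 * theta))


def hom_freq(p_i, theta):
    """Homozygote genotype frequency with a theta correction
    (Balding-Nichols form, assumed here)."""
    numerator = (2 * theta + (1 - theta) * p_i) * (3 * theta + (1 - theta) * p_i)
    return numerator / ((1 + theta) * (1 + 2 * theta))


for theta in (0.0, 0.01, 0.03):
    vwa = het_freq(0.0850, 0.2500, theta)   # vWA 14,17
    fga = hom_freq(71 / 400, theta)         # FGA 21,21
    print(f"theta={theta:.2f}  vWA={vwa:.5f}  FGA={fga:.5f}")
```

Each locus frequency rises only slightly as theta increases, but a profile frequency is the product of the frequencies at every locus, so these small per-locus increases compound across the whole profile.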