This repo collects together information about two errors in msprime simulations. The manuscript describes the problems and background in detail.
In short, there are two distinct problems here:
- The three population Out-of-Africa model given as an example in the msprime documentation was not an accurate description of the true model. In the most ancient time period, migration was allowed to continue between ancestral African and European populations. Fortunately, the difference is a subtle one, and the difference in expected diversity measures between the models is small. However, this code has been extensively copied (see below).
- The simulation pipeline for the analysis in the influential Martin et al. paper contained an error, leading to the simulations being of a substantially different model from what was expected.
See the manuscript for more information and analysis.
The error present in the msprime documentation was found as part of the quality control process for stdpopsim, as described in the preprint.
See these issues for more details:
- popsim-consortium/stdpopsim#496 (comment)
- popsim-consortium/stdpopsim#496
- popsim-consortium/stdpopsim#516
If you have copied incorrect code, you have two basic options to fix it:
Defining demographic models is hard and error-prone. To reduce the duplicated effort involved in reimplementing published models multiple times, the PopSim Consortium developed stdpopsim, a standard library of population genetic models. The correct version of the three population Out-of-Africa model is defined there and can be run as simply as:
$ stdpopsim HomSap -d OutOfAfrica_3G09 -c chr22 10 -o ooa.trees
There is also a Python API which can plug directly into your existing pipeline and significantly simplify your code.
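As a rough illustration of the Python API route, the sketch below runs the same catalog model from Python. It assumes a recent stdpopsim release in which samples are given as a dictionary of diploid individuals per population (older releases used a different sampling interface), so it is not an exact equivalent of the command line call above. The function name simulate_ooa, the sample counts and the seed are placeholders chosen for this example.

```python
def simulate_ooa(seed=1):
    """Simulate chr22 under the stdpopsim OutOfAfrica_3G09 model (sketch)."""
    # Imported here so that stdpopsim is only needed when a simulation is run.
    import stdpopsim

    species = stdpopsim.get_species("HomSap")
    model = species.get_demographic_model("OutOfAfrica_3G09")
    contig = species.get_contig("chr22")
    # Assumption: samples are diploid individuals keyed by population name.
    # The counts below are illustrative only.
    samples = {"YRI": 5, "CEU": 5, "CHB": 0}
    engine = stdpopsim.get_engine("msprime")
    return engine.simulate(model, contig, samples, seed=seed)
```

Calling `simulate_ooa()` should return a tree sequence, which can then be written to disk with `ts.dump("ooa.trees")` or passed directly to the rest of a pipeline.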
The problem with the OOA model is that migration is allowed between two ancestral populations indefinitely far into the past, when we should have only a single ancestral population. The solution is to add MigrationRateChange events to ensure that this erroneous migration does not happen.
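To make the effect of the missing event concrete, here is a small plain-Python sketch that mimics how rate-change events update a migration matrix going backwards in time. It does not use msprime: the helper apply_rate_changes and the tuple encoding of events are hypothetical, introduced only for this illustration, while the rates and times are those of the model.

```python
# Rates and split times (in generations) taken from the model description.
M_AF_B, M_AF_EU, M_AF_AS, M_EU_AS = 25e-5, 3e-5, 1.9e-5, 9.6e-5
T_EU_AS, T_B = 848, 5600

INITIAL_MATRIX = [
    [0, M_AF_EU, M_AF_AS],
    [M_AF_EU, 0, M_EU_AS],
    [M_AF_AS, M_EU_AS, 0],
]


def apply_rate_changes(matrix, events, until):
    """Return a copy of matrix after applying all events with time <= until.

    Each event is (time, rate, index). index=None sets every off-diagonal
    entry to rate; otherwise index is a (row, column) pair.
    """
    current = [row[:] for row in matrix]
    # sorted() is stable, so events at the same time keep their listed order.
    for time, rate, index in sorted(events, key=lambda event: event[0]):
        if time > until:
            break
        if index is None:
            for j in range(len(current)):
                for k in range(len(current)):
                    if j != k:
                        current[j][k] = rate
        else:
            j, k = index
            current[j][k] = rate
    return current


faulty_events = [
    (T_EU_AS, 0, None),
    (T_EU_AS, M_AF_B, (0, 1)),
    (T_EU_AS, M_AF_B, (1, 0)),
]
# The fix: switch off all migration once population B merges into Africa.
fixed_events = faulty_events + [(T_B, 0, None)]

ancient_faulty = apply_rate_changes(INITIAL_MATRIX, faulty_events, until=T_B + 1)
ancient_fixed = apply_rate_changes(INITIAL_MATRIX, fixed_events, until=T_B + 1)
```

Without the extra event, the entries between populations 0 and 1 in ancient_faulty still hold the rate m_AF_B after time T_B, whereas every entry of ancient_fixed is zero.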
Here is the correct model with a single randomly mating ancestral population:
import math
import msprime


def out_of_africa():
    # First we set out the maximum likelihood values of the various parameters
    # given in Table 1.
    N_A = 7300
    N_AF = 12300
    N_B = 2100
    N_EU0 = 1000
    N_AS0 = 510
    r_EU = 0.004  # 0.4% EU growth
    r_AS = 0.0055  # 0.55% AS growth
    # Migration rates during the various epochs.
    m_AF_B = 25e-5
    m_AF_EU = 3e-5
    m_AF_AS = 1.9e-5
    m_EU_AS = 9.6e-5
    # Times in Table 1 are provided in years, calculated on the assumption
    # of 25 years per generation: we need to convert back into generations.
    generation_time = 25
    T_AF = 220e3 / generation_time
    T_B = 140e3 / generation_time
    T_EU_AS = 21.2e3 / generation_time
    # We need to work out the starting (diploid) population sizes based on
    # the growth rates provided for these two populations
    N_EU = N_EU0 / math.exp(-r_EU * T_EU_AS)
    N_AS = N_AS0 / math.exp(-r_AS * T_EU_AS)
    # Population IDs correspond to their indexes in the population
    # configuration array. Therefore, we have 0=YRI, 1=CEU and 2=CHB
    # initially.
    population_configurations = [
        msprime.PopulationConfiguration(
            sample_size=0, initial_size=N_AF),
        msprime.PopulationConfiguration(
            sample_size=1, initial_size=N_EU, growth_rate=r_EU),
        msprime.PopulationConfiguration(
            sample_size=1, initial_size=N_AS, growth_rate=r_AS)
    ]
    migration_matrix = [
        [      0, m_AF_EU, m_AF_AS],
        [m_AF_EU,       0, m_EU_AS],
        [m_AF_AS, m_EU_AS,       0],
    ]
    demographic_events = [
        # CEU and CHB merge into B with rate changes at T_EU_AS
        msprime.MassMigration(
            time=T_EU_AS, source=2, destination=1, proportion=1.0),
        msprime.MigrationRateChange(time=T_EU_AS, rate=0),
        msprime.MigrationRateChange(
            time=T_EU_AS, rate=m_AF_B, matrix_index=(0, 1)),
        msprime.MigrationRateChange(
            time=T_EU_AS, rate=m_AF_B, matrix_index=(1, 0)),
        msprime.PopulationParametersChange(
            time=T_EU_AS, initial_size=N_B, growth_rate=0, population_id=1),
        # Population B merges into YRI at T_B
        msprime.MassMigration(
            time=T_B, source=1, destination=0, proportion=1.0),
        msprime.MigrationRateChange(time=T_B, rate=0),  # NB THIS EVENT WAS MISSING!!!!
        # Size changes to N_A at T_AF
        msprime.PopulationParametersChange(
            time=T_AF, initial_size=N_A, population_id=0)
    ]
    return {
        'population_configurations': population_configurations,
        'migration_matrix': migration_matrix,
        'demographic_events': demographic_events}
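The unit conversions at the top of the model are easy to get wrong when copying it, so it can help to check them separately. The following standalone sketch repeats only that arithmetic: times in years divided by the 25-year generation time, and present-day sizes obtained from the sizes at the split and the growth rates. The helper name size_at_present is hypothetical; the formula is the one used in the model above.

```python
import math

generation_time = 25  # years per generation, as assumed in the model

# Times from the model, in years, converted to generations.
times_years = {"T_AF": 220e3, "T_B": 140e3, "T_EU_AS": 21.2e3}
times_generations = {
    name: years / generation_time for name, years in times_years.items()
}


def size_at_present(size_at_split, growth_rate, generations):
    """Present-day size after exponential growth since the split.

    Same expression as in the model: N0 / exp(-r * T), i.e. N0 * exp(r * T).
    """
    return size_at_split / math.exp(-growth_rate * generations)


T_EU_AS = times_generations["T_EU_AS"]
N_EU = size_at_present(1000, 0.004, T_EU_AS)
N_AS = size_at_present(510, 0.0055, T_EU_AS)

print(times_generations)
print(round(N_EU), round(N_AS))
```

The three times come out as 8800, 5600 and 848 generations, and the present-day European and Asian sizes are roughly 29,700 and 54,100 respectively.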
These GitHub repos have a copy of the faulty Out-of-Africa model that was in the msprime documentation. Each link is to a file containing either a direct copy of the code, or code that is obviously derived from it.
The list is probably not exhaustive.
- DomNelson/wf_coalescent
- Ephraim-usc/egrm
- OasisYE/MsprimeSimul
- TishkoffLab/data_simulation
- YingZhou001/POPdemog
- abwolf/msprime_scripts
- armartin/ancestry_pipeline
- arundurvasula/migration
- astheeggeggs/msprime_sim
- awohns/relative-allele-age
- awohns/tsdate_paper
- carjed/primeval
- carjed/topmed_singleton_clusters
- cran/POPdemog
- dmctong/rv_imp
- fbaumdicker/AIMsetfinder
- isaacovercast/gimmeSAD
- jiahuanglin/GSoC2019
- jshleap/Cotagging_playground
- mathii/spectrum
- mccoy-lab/sim_introgression
- mcveanlab/treeseq-inference
- mcveanlab/tskit-workshop
- messDiv/MESS
- nikbaya/risk_gradients
- pkalbers/ScriptCollection
- pmckenz1/quartet_proj
- popgengent/pipeline
- slowkoni/out-of-africa-model
- tszfungc/scripts
These repos previously had the faulty model and have now been fixed.
The analysis for the Martin et al. paper is defined in armartin/ancestry_pipeline. We found the following repos that have code derived from it:
By searching for the GitHub repository URLs above, we were able to identify a number of papers that may be affected by the erroneous model.
- Dating genomic variants and shared ancestry in population-scale sequencing data. The OOA model was used as an example of a complicated demography when evaluating the accuracy of allele age estimates. The precise details of the model are not important and it is highly unlikely the incorrect model specification has any impact.
- Inferring whole-genome histories in large population datasets. The OOA model was used here as an example of a more complex demographic history, used to test ancestry inference methods. The specifics of the demography are not important, and the model does not affect the conclusions of the paper in any way.
- Population genetic simulation study of power in association testing across genetic architectures and study designs. The authors use an implementation of the Tennessen model that is based on the incorrect msprime OOA example. It would appear that this implemented model also does not switch off migration in the most ancient time period. However, the method is not concerned with detecting detailed population structure, and so the details of the model used are unlikely to be significant.
- An integrated model of population genetics and community ecology. The OOA model appears in the GitHub repository associated with this paper (isaacovercast/gimmeSAD), but it appears to have been used only as a temporary debugging example.
- POPdemog: Visualizing Population Demographic History From Simulation Scripts. POPdemog is a method for visualising demographic histories as described by a number of population genetic tools. The OOA model is included as an example of how they can convert msprime input into ms-compatible demography descriptions, which they then process.
- How to choose sets of ancestry informative markers: A supervised feature selection approach. In this paper the OOA model was used to evaluate a new method for choosing ancestry informative markers. Given the very subtle effect of the incorrect model on demography (and the fact that the method was evaluated using other simulations and real data), it seems unlikely that the model specified will have any effect on the conclusions of the paper.