The Caballo Moro Astyanax mexicanus cave population consists of both eyed and eyeless individuals. The purpose of this pipeline is to compare genomic differences within this population. We sequenced 42 individuals: eyed cavefish (E series sample names), eyeless cavefish (C series sample names), and surface fish (S and Sr sample names).
config.yaml
is easily editable to define new genomes, paths, samples, etc.
align_scripts.py
: fastq files > align with bwa-mem2 > mark duplicates > rename to biological sample name
variant_calling.py
: Splits aligned genomes into mapped/unmapped > genotypes each individual (ref vs. alt with HaplotypeCaller) > combines GVCFs across all individuals > joint calling
hard_filtering.py
: Applies hard filters to all.vcf, then subsets by monomorphic sites, SNPs, and indels for downstream analysis.
genotype_filter.py
: Performs SNP comparisons by phenotype according to a genotype scheme, e.g. sites that are alt in eyeless but ref in eyed and surface (aka eyeless recessive, aka fixed eyeless alleles). Also filters by the number of no-calls across the population. To get accurate counts of overlap (eyeless only, eyed only, common to eyed, eyeless, and surface, etc.), use vcf-compare
for summary stats.
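As a rough illustration of the "fixed eyeless alleles" scheme described above, here is a minimal sketch that classifies a single site from VCF-style GT strings. It is not the code in genotype_filter.py: the helper names, the dictionary layout of a site, and the 20% no-call cutoff are all assumptions made for the example.

```python
# Illustrative only: helper names, the site layout, and the 20% no-call
# cutoff are assumptions, not taken from genotype_filter.py.

def parse_gt(gt):
    """Return allele indices for a VCF GT string, or None for a no-call."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return [int(a) for a in alleles]


def is_hom_ref(gt):
    """True if every allele is the reference allele (e.g. 0/0)."""
    alleles = parse_gt(gt)
    return alleles is not None and all(a == 0 for a in alleles)


def is_hom_alt(gt):
    """True if every allele is the same non-reference allele (e.g. 1/1)."""
    alleles = parse_gt(gt)
    return (alleles is not None and alleles[0] > 0
            and all(a == alleles[0] for a in alleles))


def nocall_fraction(gts):
    """Fraction of genotypes at a site that are no-calls."""
    return sum(parse_gt(g) is None for g in gts) / len(gts)


def is_fixed_eyeless(site, max_nocall=0.2):
    """site maps 'eyeless', 'eyed', 'surface' to lists of GT strings."""
    all_gts = [g for gts in site.values() for g in gts]
    if nocall_fraction(all_gts) > max_nocall:
        return False  # too many missing genotypes across the population
    eyeless = [g for g in site["eyeless"] if parse_gt(g) is not None]
    others = [g for g in site["eyed"] + site["surface"]
              if parse_gt(g) is not None]
    if not eyeless or not others:
        return False
    # alt in every called eyeless fish, ref in every called eyed/surface fish
    return (all(is_hom_alt(g) for g in eyeless)
            and all(is_hom_ref(g) for g in others))


site = {
    "eyeless": ["1/1", "1/1"],
    "eyed": ["0/0", "0/0"],
    "surface": ["0/0", "./."],
}
print(is_fixed_eyeless(site))  # True
```

Other schemes (eyed only, shared with surface, and so on) would follow the same pattern with different per-group conditions.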
snpeff_annotation.py
: Generates a basic script to take a genotype-filtered vcf and run snpEff.
popgen_windows.py
: Runs Simon Martin's popgenWindows.py to calculate Dxy, Fst, and pi across windows of the genome.
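For readers unfamiliar with these statistics, the sketch below shows one textbook way to compute pi and Dxy for a window from per-site alt allele counts at biallelic sites. It is not taken from popgenWindows.py, which handles missing data and site counting in its own way. The function names are hypothetical, and dividing by the window length assumes every site in the window was called. Fst is left out because the estimator used matters.

```python
# Textbook-style sketch, not the popgenWindows.py implementation.
# Function names are hypothetical.

def site_pi(alt_count, n):
    """Per-site nucleotide diversity from n sampled haplotypes (biallelic)."""
    if n < 2:
        return 0.0
    p = alt_count / n
    # Expected pairwise differences, with the n/(n-1) small-sample correction
    return 2 * p * (1 - p) * n / (n - 1)


def site_dxy(alt1, n1, alt2, n2):
    """Per-site divergence: chance that one haplotype from each pop differs."""
    p1, p2 = alt1 / n1, alt2 / n2
    return p1 * (1 - p2) + p2 * (1 - p1)


def window_stats(sites, window_len):
    """sites: list of (alt1, n1, alt2, n2) for variant sites in the window.

    Assumes all other sites in the window are called and invariant,
    so they contribute zero to each sum.
    """
    pi1 = sum(site_pi(a1, n1) for a1, n1, _, _ in sites) / window_len
    pi2 = sum(site_pi(a2, n2) for _, _, a2, n2 in sites) / window_len
    dxy = sum(site_dxy(*s) for s in sites) / window_len
    return pi1, pi2, dxy


# One fixed difference between two populations in a 10 bp window
print(window_stats([(2, 2, 0, 2)], 10))
```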
preplink.py
: Generates plink files that are common to both the GWAS analysis and the population structure analysis. It converts RefSeq chromosome names to common chromosome names and filters for linkage disequilibrium. It also runs the population stratification analysis (PCA and ADMIXTURE) and plots the ADMIXTURE cross-validation error, because I don't like R (literally that's the only reason; if I could do all the plotting in Python without having to use Jupyter notebooks, I would).
gwas_fixandtest.py
: Does the GWAS trend analysis. This includes some prep work to get the family, map, etc. files ready. It then runs a variety of GWAS tests depending on what you need. The most important is the ".model" test, which tests several models of association; the one we care about is "TREND", a Cochran-Armitage trend test. The script also filters the output down to just the trend test and pulls in gene information from the .gtf
file found in ~/genome
that has been narrowed to just accession and gene name. It also handles some weirdness around assigning (or reassigning) chromosome names and positions.
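To make the trend test concrete, here is a small standalone sketch of one common form of the Cochran-Armitage statistic for a 2x3 genotype table with additive weights (0, 1, 2). It is not the plink implementation and is not part of gwas_fixandtest.py. The function name is hypothetical, and which phenotype counts as "case" is an assumption left to the caller.

```python
import math

# Hypothetical helper, not part of gwas_fixandtest.py or plink.

def trend_test(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on genotype counts.

    case_counts / control_counts: counts of (hom ref, het, hom alt).
    Returns (chi2, p) with chi2 on 1 degree of freedom.
    """
    totals = [a + b for a, b in zip(case_counts, control_counts)]
    n_all = sum(totals)
    n_case = sum(case_counts)
    sw_case = sum(w * r for w, r in zip(weights, case_counts))
    sw_tot = sum(w * t for w, t in zip(weights, totals))
    sw2_tot = sum(w * w * t for w, t in zip(weights, totals))
    denom = n_case * (n_all - n_case) * (n_all * sw2_tot - sw_tot ** 2)
    if denom == 0:
        return 0.0, 1.0  # no variation in phenotype or genotype
    chi2 = n_all * (n_all * sw_case - n_case * sw_tot) ** 2 / denom
    # Survival function of a chi-square with 1 df
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# All cases hom alt, all controls hom ref
chi2, p = trend_test([0, 0, 10], [10, 0, 0])
print(chi2, p)
```

This form equals N times the squared correlation between phenotype and genotype score, so a site that perfectly separates the two groups gives chi2 equal to the number of individuals.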
popgen/loter
: has its own README.md for running and analyzing local ancestry inference with Loter.
reports/CM_graphics.Rmd
does the following:
- Generates PCA plots (default is 50kb window)
- Makes admixture plots for all K/window combinations
- Plots selected trend tests generated by gwas_fixandtest.py