Caballo Moro pipeline

The Caballo Moro Astyanax mexicanus cave population includes both eyed and eyeless individuals. This pipeline compares genomic differences within that population. We sequenced 42 individuals: eyed cavefish (E-series sample names), eyeless cavefish (C-series sample names), and surface fish (S and Sr sample names).

config.yaml is easily editable to define new genomes, paths, samples, etc.

Alignment

align_scripts.py: fastq files > align with bwa-mem2 > mark duplicates > rename to biological sample name
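
As a rough illustration of the alignment steps, the sketch below only builds the shell commands for one paired-end sample; it does not run them and it is not the code in align_scripts.py. The function name, output layout, read-group fields, and the choice of GATK MarkDuplicates for the duplicate-marking step are all assumptions.

```python
from pathlib import Path


def alignment_commands(sample, ref, fq1, fq2, outdir="aligned", threads=8):
    """Return shell command strings for one paired-end sample (nothing is run).

    `sample` is the biological sample name, so the outputs already carry it.
    """
    out = Path(outdir)
    sorted_bam = out / f"{sample}.sorted.bam"
    dedup_bam = out / f"{sample}.dedup.bam"
    metrics = out / f"{sample}.dup_metrics.txt"
    # bwa expects a literal backslash-t between read-group fields.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # Align with bwa-mem2 and coordinate-sort the output in one pipe.
        f"bwa-mem2 mem -t {threads} -R '{read_group}' {ref} {fq1} {fq2} "
        f"| samtools sort -@ {threads} -o {sorted_bam} -",
        # Assumption: duplicates are marked with GATK/Picard MarkDuplicates.
        f"gatk MarkDuplicates -I {sorted_bam} -O {dedup_bam} -M {metrics}",
        f"samtools index {dedup_bam}",
    ]


if __name__ == "__main__":
    for cmd in alignment_commands("E01", "ref.fa", "run1_R1.fq.gz", "run1_R2.fq.gz"):
        print(cmd)
```

Returning the commands as strings makes it easy to write them into a job script or inspect them before anything is submitted.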

Variant Calling

variant_calling.py: Splits each alignment into mapped/unmapped reads > genotypes each sample (ref vs. alt, with HaplotypeCaller) > combines gVCFs across all individuals > joint calling
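
The chain of steps can be sketched as a command builder. This assumes the GATK4 tools HaplotypeCaller, CombineGVCFs, and GenotypeGVCFs plus samtools for the mapped/unmapped split; the function name, file naming, and the use of `--include-non-variant-sites` (to keep monomorphic sites) are illustrative assumptions, not a description of what variant_calling.py actually does.

```python
def variant_calling_commands(samples, ref, bam_dir="aligned", outdir="vcf"):
    """Return shell command strings for the per-sample and joint steps."""
    cmds, gvcfs = [], []
    for s in samples:
        bam = f"{bam_dir}/{s}.bam"
        mapped = f"{outdir}/{s}.mapped.bam"
        unmapped = f"{outdir}/{s}.unmapped.bam"
        gvcf = f"{outdir}/{s}.g.vcf.gz"
        # SAM flag 4 = read unmapped: -F 4 drops those reads, -f 4 keeps only them.
        cmds.append(f"samtools view -b -F 4 -o {mapped} {bam}")
        cmds.append(f"samtools view -b -f 4 -o {unmapped} {bam}")
        cmds.append(f"samtools index {mapped}")
        # Per-sample genotyping in GVCF mode.
        cmds.append(f"gatk HaplotypeCaller -R {ref} -I {mapped} -O {gvcf} -ERC GVCF")
        gvcfs.append(gvcf)
    variants = " ".join(f"-V {g}" for g in gvcfs)
    combined = f"{outdir}/combined.g.vcf.gz"
    cmds.append(f"gatk CombineGVCFs -R {ref} {variants} -O {combined}")
    # Joint calling; the all-sites flag is an assumption made so that
    # monomorphic sites survive into the output.
    cmds.append(
        f"gatk GenotypeGVCFs -R {ref} -V {combined} "
        f"-O {outdir}/all.vcf.gz --include-non-variant-sites"
    )
    return cmds


if __name__ == "__main__":
    print("\n".join(variant_calling_commands(["C01", "E01", "S01"], "ref.fa")))
```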

hard_filtering.py: Applies hard filters to all.vcf, then subsets it into monomorphic sites, SNPs, and indels for downstream analysis.
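
To make the idea of hard filtering concrete, here is a small pure-Python sketch. The thresholds are the commonly cited GATK starting values for SNPs and are an assumption; the values actually used by hard_filtering.py may differ. Both function names are hypothetical.

```python
# Assumed thresholds: the generic GATK hard-filter starting points for SNPs.
# Each predicate returns True when the site FAILS that filter.
HARD_FILTERS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "SOR": lambda v: v > 3.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def failed_filters(info):
    """Return the names of the filters that a site's INFO values fail.

    Annotations missing from `info` are skipped rather than counted as failures.
    """
    return [name for name, fails in HARD_FILTERS.items()
            if name in info and fails(info[name])]


def site_type(ref, alts):
    """Classify a site as 'monomorphic', 'snp', or 'indel' from REF and ALT alleles.

    Simplification: any site with a length-changing allele is called an indel,
    so mixed SNP/indel sites are lumped in with indels.
    """
    alts = [a for a in alts if a != "."]
    if not alts:
        return "monomorphic"
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return "snp"
    return "indel"


if __name__ == "__main__":
    print(failed_filters({"QD": 1.5, "FS": 10.0, "MQ": 60.0}))
    print(site_type("A", ["G"]))
```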

genotype_filter.py: Compares SNPs by phenotype according to a genotype scheme, e.g. sites that are alt in eyeless fish but ref in eyed and surface fish (aka eyeless recessive, aka fixed eyeless alleles). It also filters sites by the number of no-calls across the population. For accurate overlap counts (eyeless only, eyed only, common to eyed, eyeless, and surface, etc.), use vcf-compare to get summary stats.
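
The "fixed eyeless allele" scheme can be written as a short predicate. This is a minimal sketch for biallelic sites, not the logic in genotype_filter.py: the function name, the group labels, the `max_nocalls` default, and the choice to ignore no-call samples when testing the pattern are all assumptions.

```python
HOM_REF, HOM_ALT, NOCALL = "0/0", "1/1", "./."


def is_fixed_eyeless(genotypes, groups, max_nocalls=2):
    """True if every called eyeless fish is hom-alt and all other called fish are hom-ref.

    genotypes: dict of sample name -> GT string (e.g. "0/0", "1|1", "./.")
    groups:    dict of sample name -> "eyeless", "eyed", or "surface"
    """
    # Treat phased and unphased genotypes the same way.
    gts = {s: gt.replace("|", "/") for s, gt in genotypes.items()}
    # Drop the site if too many samples have no call.
    if sum(gt == NOCALL for gt in gts.values()) > max_nocalls:
        return False
    called = {s: gt for s, gt in gts.items() if gt != NOCALL}
    eyeless = [gt for s, gt in called.items() if groups[s] == "eyeless"]
    others = [gt for s, gt in called.items() if groups[s] != "eyeless"]
    # Require at least one called sample on each side of the comparison.
    if not eyeless or not others:
        return False
    return all(gt == HOM_ALT for gt in eyeless) and all(gt == HOM_REF for gt in others)


if __name__ == "__main__":
    groups = {"C1": "eyeless", "C2": "eyeless", "E1": "eyed", "S1": "surface"}
    site = {"C1": "1/1", "C2": "1|1", "E1": "0/0", "S1": "0/0"}
    print(is_fixed_eyeless(site, groups))
```

Other schemes (for example, alt in eyed only) follow the same shape with the group test swapped.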

snpeff_annotation.py: Generates a basic script to take a genotype-filtered vcf and run snpEff.

Pop gen analysis

popgen_windows.py: Runs Simon Martin's popgenWindows.py to calculate Dxy, Fst, and pi in windows across the genome.
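
For intuition about what the windowed statistics measure, the sketch below computes per-site pi and Dxy from alt-allele counts at biallelic sites and averages them in fixed coordinate windows. It is not popgenWindows.py's implementation: the function names, the input tuple layout, and the Hudson-style Fst estimator (1 minus mean within-population pi over Dxy) are assumptions chosen for brevity.

```python
from collections import defaultdict


def pi_site(alt, n):
    """Mean pairwise differences at one biallelic site among n sampled chromosomes."""
    p = alt / n
    return 2 * p * (1 - p) * n / (n - 1)


def dxy_site(alt1, n1, alt2, n2):
    """Probability that one chromosome from each population differs at this site."""
    p1, p2 = alt1 / n1, alt2 / n2
    return p1 * (1 - p2) + p2 * (1 - p1)


def window_stats(sites, window_size):
    """Average pi, Dxy, and Fst per window.

    sites: iterable of (pos, alt1, n1, alt2, n2) with 1-based positions.
    Monomorphic sites should be included so the per-site averages are not inflated.
    Returns {window_start: {"pi1", "pi2", "dxy", "fst", "sites"}}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for pos, a1, n1, a2, n2 in sites:
        w = sums[(pos - 1) // window_size]
        w[0] += pi_site(a1, n1)
        w[1] += pi_site(a2, n2)
        w[2] += dxy_site(a1, n1, a2, n2)
        w[3] += 1
    out = {}
    for idx, (s1, s2, sd, count) in sorted(sums.items()):
        pi1, pi2, dxy = s1 / count, s2 / count, sd / count
        # Assumed estimator: Fst = 1 - (mean within-population pi) / Dxy.
        fst = 1 - ((pi1 + pi2) / 2) / dxy if dxy > 0 else float("nan")
        out[idx * window_size + 1] = {
            "pi1": pi1, "pi2": pi2, "dxy": dxy, "fst": fst, "sites": count,
        }
    return out


if __name__ == "__main__":
    example = [(1, 0, 4, 4, 4), (2, 0, 4, 0, 4), (15, 2, 4, 2, 4)]
    for start, stats in window_stats(example, 10).items():
        print(start, stats)
```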

PCA, Admixture, Loter, and GWAS

preplink.py: Generates the plink files shared by the GWAS analysis and the population structure analysis. It converts RefSeq chromosome names to common chromosome names and filters for linkage disequilibrium. It also runs the population stratification analyses: PCA and admixture. This script also plots the admixture cross-validation error, because I don't like R (literally that's the only reason; if I could do all the plotting in Python without having to use Jupyter notebooks, I would).
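
Before plotting cross-validation error, the values have to be collected from the ADMIXTURE logs. The sketch below assumes ADMIXTURE was run with `--cv`, that each run's output was captured as text, and that it contains the usual `CV error (K=...): ...` line; the function names are hypothetical and this is not taken from preplink.py.

```python
import re

# Line written by ADMIXTURE when run with --cv, e.g. "CV error (K=3): 0.52345"
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def cv_errors(log_texts):
    """Map K -> cross-validation error from the text of several ADMIXTURE logs."""
    errors = {}
    for text in log_texts:
        match = CV_LINE.search(text)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    return dict(sorted(errors.items()))


def best_k(errors):
    """Return the K with the lowest cross-validation error."""
    return min(errors, key=errors.get)


if __name__ == "__main__":
    logs = [
        "Summary:\nCV error (K=2): 0.61200\n",
        "Summary:\nCV error (K=3): 0.58750\n",
        "Summary:\nCV error (K=4): 0.60310\n",
    ]
    errors = cv_errors(logs)
    print(errors, "best K:", best_k(errors))
```

The resulting dictionary can be passed straight to a plotting library, with K on the x-axis and CV error on the y-axis.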

gwas_fixandtest.py: Runs the GWAS trend analysis. This includes some prep work to get the family, map, etc. files ready. It then runs a variety of GWAS tests depending on what you need. The most important is the ".model" test, which runs several association models. The key one is "TREND", a Cochran-Armitage trend test. The script also filters the results down to just the trend test and pulls in gene information from the .gtf file in ~/genome, which has been narrowed to just accession and gene name. There's some weirdness around assigning (or reassigning) chromosome names and positions, which this script handles.
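
For readers unfamiliar with the Cochran-Armitage trend test, here is a self-contained sketch of the statistic for a single SNP. It is an illustration of the test itself, not plink's code or the code in gwas_fixandtest.py; the function name, the additive weights (0, 1, 2), and which phenotype is treated as the "case" group are assumptions.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on a 2 x 3 genotype table.

    cases, controls: counts per genotype class (hom-ref, het, hom-alt).
    Returns (chi2, p), with chi2 on 1 degree of freedom.
    """
    totals = [a + b for a, b in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sw_cases = sum(w * r for w, r in zip(weights, cases))
    sw_total = sum(w * t for w, t in zip(weights, totals))
    sw2_total = sum(w * w * t for w, t in zip(weights, totals))
    denom = n_cases * n_controls * (n * sw2_total - sw_total ** 2)
    if denom == 0:
        # No variation in phenotype or genotype: the test is uninformative.
        return 0.0, 1.0
    chi2 = n * (n * sw_cases - n_cases * sw_total) ** 2 / denom
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


if __name__ == "__main__":
    # Hypothetical counts: eyeless fish as cases, eyed fish as controls.
    chi2, p = cochran_armitage_trend(cases=[2, 5, 13], controls=[12, 6, 2])
    print(f"chi2={chi2:.3f} p={p:.3g}")
```

With these weights the statistic asks whether the proportion of cases rises or falls steadily with the number of alt alleles, which is why it is often preferred to a plain allelic test when Hardy-Weinberg equilibrium cannot be assumed.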

popgen/loter: Has its own README.md for running and analyzing local ancestry inference with Loter.

Candidate gene selection

Has its own README.md.

Visualization

reports/CM_graphics.Rmd does the following:
- Generates PCA plots (default is 50kb window)
- Makes admixture plots for all K/window combinations
- Plots selected trend tests generated by gwas_fixandtest.py

rikellermeyer/Caballo_Moro
