CAW currently uses GRCh38 by default. The settings are in genomes.config and can be tailored to your needs. The buildReferences.nf script can be used to build the indexes from the reference files.
Use --genome GRCh37 to map against GRCh37. If you are not on UPPMAX, first adjust the settings in genomes.config to your needs.
To get the needed files, download the GATK bundle for GRCh37.
The following files need to be downloaded:
- 242c0df2a698a76fc43bdd938ba57c62 - '1000G_phase1.indels.b37.vcf.gz'
- 00b0e74e4a13536dd6c0728c66db43f3 - 'dbsnp_138.b37.vcf.gz'
- dd05833f18c22cc501e3e31406d140b0 - 'human_g1k_v37_decoy.fasta.gz'
- a0764a80311aee369375c5c7dda7e266 - 'Mills_and_1000G_gold_standard.indels.b37.vcf.gz'
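The downloads can be checked against the checksums listed above with something like the following sketch. It assumes that the checksums apply to the compressed files exactly as named, that all four files sit in one directory, and that GNU md5sum is available. The function name check_grch37_bundle is hypothetical and not part of CAW.

```shell
# Sketch: compare downloaded GRCh37 bundle files with the MD5 sums listed above.
# check_grch37_bundle is a hypothetical helper, not part of CAW.
# Assumption: the checksums refer to the .gz files as downloaded.
check_grch37_bundle() {
  local dir="${1:-.}" status=0 expected file actual
  while read -r expected file; do
    if [ ! -f "$dir/$file" ]; then
      echo "MISSING  $file"
      status=1
      continue
    fi
    # md5sum prints "<hash>  <path>"; keep only the hash
    actual=$(md5sum "$dir/$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
      echo "OK       $file"
    else
      echo "MISMATCH $file"
      status=1
    fi
  done <<'EOF'
242c0df2a698a76fc43bdd938ba57c62 1000G_phase1.indels.b37.vcf.gz
00b0e74e4a13536dd6c0728c66db43f3 dbsnp_138.b37.vcf.gz
dd05833f18c22cc501e3e31406d140b0 human_g1k_v37_decoy.fasta.gz
a0764a80311aee369375c5c7dda7e266 Mills_and_1000G_gold_standard.indels.b37.vcf.gz
EOF
  return "$status"
}
```

Called as check_grch37_bundle <path to bundle>, it prints one line per file and returns non-zero if any file is missing or does not match.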
From our repo, get the intervals list file. More information about this file is in the intervals documentation.
The rest of the reference files are stored on export.uppmax.uu.se and also in the CAW-References repository, using Git LFS:
- '1000G_phase3_20130502_SNP_maf0.3.loci'
- 'b37_cosmic_v74.noCHR.sort.4.1.vcf'
You can create your own COSMIC reference for any human reference as specified below.
To annotate with COSMIC variants during MuTect1/2 variant calling, you need to create a compatible VCF file. Download the coding and non-coding VCF files from COSMIC and process them with the Create_Cosmic.sh script. The script requires a FASTA index (.fai) of the reference file you are using.
Example:
samtools faidx human_g1k_v37_decoy.fasta
sh Create_Cosmic.sh human_g1k_v37_decoy.fasta.fai
Note: CosmicCodingMuts.vcf.gz and CosmicNonCodingVariants.vcf.gz must be in the same folder as Create_Cosmic.sh when it is executed.
To index the resulting VCF file, use igvtools:
igvtools index <cosmicvxx.vcf>
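Since the script depends on several files being in place, a small pre-check along the following lines may save a failed run. This is only a sketch: check_cosmic_inputs is a hypothetical helper, not part of CAW, and it only tests that the files named above exist.

```shell
# Sketch: confirm the inputs Create_Cosmic.sh expects are in place before running it.
# check_cosmic_inputs is a hypothetical helper, not part of CAW.
# $1: folder containing Create_Cosmic.sh and the two COSMIC VCF files
# $2: path to the .fai index of the reference
check_cosmic_inputs() {
  local script_dir="$1" fai="$2" missing=0 f
  for f in "$script_dir/Create_Cosmic.sh" \
           "$script_dir/CosmicCodingMuts.vcf.gz" \
           "$script_dir/CosmicNonCodingVariants.vcf.gz" \
           "$fai"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_cosmic_inputs . human_g1k_v37_decoy.fasta.fai returns zero only when the script, both COSMIC VCF files, and the index are all present.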
Use --genome GRCh38 to map against GRCh38. If you are not on UPPMAX, first adjust the settings in genomes.config to your needs.
To get the needed files, download the GATK bundle for GRCh38 from ftp://[email protected]/bundle/hg38/.
The MD5SUM of the Homo_sapiens_assembly38.fasta included in that bundle is 7ff134953dcca8c8997453bbb80b6b5e.
From the beta/ directory, which seems to be an older version of the bundle, only Homo_sapiens_assembly38.known_indels.vcf is needed. You can also omit the dbsnp_138_ and dbsnp_144 files, as we use dbsnp_146. The old ones also use the wrong chromosome naming convention.
Afterwards, the following needs to be done:
gunzip Homo_sapiens_assembly38.fasta.gz
bwa index -6 Homo_sapiens_assembly38.fasta
Use --genome smallGRCh37 to map against a small reference genome based on GRCh37. smallGRCh37 is the default genome for the testing profile (-profile testing).
The buildReferences.nf script can download and build the files needed for smallGRCh37, or build the references for GRCh37/smallGRCh37.
The --download option works only with --genome smallGRCh37. If this option is specified, the smallRef repository is automatically downloaded from GitHub. Do not use it on the UPPMAX cluster Bianca or on similarly secured clusters where such downloads do not work or are not allowed.
nextflow run buildReferences.nf --download --genome smallGRCh37
Use --refDir <path to smallRef> to specify where the files to process are located.
nextflow run buildReferences.nf --refDir <path to smallRef> --genome <genome>
--genome <genome> is the same parameter as used for main.nf. Possible values:
- GRCh37
- GRCh38 (not yet supported)
- smallGRCh37
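A wrapper script could guard against unsupported values before launching Nextflow. The sketch below encodes only the list above; build_refs_genome_ok is a hypothetical helper, not part of CAW.

```shell
# Sketch: reject genome names that buildReferences.nf does not list as supported.
# build_refs_genome_ok is a hypothetical helper, not part of CAW.
build_refs_genome_ok() {
  case "$1" in
    GRCh37|smallGRCh37)
      return 0 ;;
    GRCh38)
      echo "GRCh38 is not yet supported by buildReferences.nf" >&2
      return 1 ;;
    *)
      echo "unknown genome: $1" >&2
      return 1 ;;
  esac
}
```

It can then gate the actual call, for example: build_refs_genome_ok smallGRCh37 && nextflow run buildReferences.nf --refDir <path to smallRef> --genome smallGRCh37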