Genomes and reference files

CAW currently uses GRCh38 by default. The settings are in genomes.config and can be tailored to your needs. The script can be used to build the indexes based on the reference files.


Use --genome GRCh37 to map against GRCh37. Before doing so, if you are not on UPPMAX, adjust the settings in genomes.config to your needs.

GATK bundle

To get the needed files, download the GATK bundle for GRCh37.

The following files need to be downloaded:

  • 242c0df2a698a76fc43bdd938ba57c62 - '1000G_phase1.indels.b37.vcf.gz'
  • 00b0e74e4a13536dd6c0728c66db43f3 - 'dbsnp_138.b37.vcf.gz'
  • dd05833f18c22cc501e3e31406d140b0 - 'human_g1k_v37_decoy.fasta.gz'
  • a0764a80311aee369375c5c7dda7e266 - 'Mills_and_1000G_gold_standard.indels.b37.vcf.gz'
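The checksums listed above can be compared against the downloaded files before building anything on top of them. The following is a minimal sketch, assuming md5sum from GNU coreutils is installed; the function name verify_md5 is illustrative and not part of CAW.

```shell
#!/bin/sh
# verify_md5 EXPECTED FILE
# Returns 0 if the MD5 checksum of FILE equals EXPECTED, 1 otherwise.
# Assumption: md5sum (GNU coreutils) is on the PATH.
verify_md5() {
    expected=$1
    file=$2
    if [ ! -f "$file" ]; then
        echo "MISSING  $file" >&2
        return 1
    fi
    # md5sum prints "<checksum>  <file>"; keep only the checksum.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK       $file"
    else
        echo "MISMATCH $file (got $actual)" >&2
        return 1
    fi
}

# Example calls, using the checksums from the list above:
#   verify_md5 242c0df2a698a76fc43bdd938ba57c62 1000G_phase1.indels.b37.vcf.gz
#   verify_md5 00b0e74e4a13536dd6c0728c66db43f3 dbsnp_138.b37.vcf.gz
#   verify_md5 dd05833f18c22cc501e3e31406d140b0 human_g1k_v37_decoy.fasta.gz
#   verify_md5 a0764a80311aee369375c5c7dda7e266 Mills_and_1000G_gold_standard.indels.b37.vcf.gz
```

A mismatch usually points to an interrupted transfer, so downloading the file again is the first thing to try.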

Other files

From our repo, get the intervals list file. More information about this file is available in the intervals documentation.

The rest of the reference files are also available in the CAW-References repository, stored with Git LFS:

  • '1000G_phase3_20130502_SNP_maf0.3.loci'
  • 'b37_cosmic_v74.noCHR.sort.4.1.vcf'

You can create your own COSMIC reference for any human reference, as described below.

COSMIC files

To annotate with COSMIC variants during MuTect1/2 variant calling, you need to create a compatible VCF file. Download the coding and non-coding VCF files from COSMIC and process them with the script. The script requires the FASTA index (.fai) of the reference file you are using.


samtools faidx human_g1k_v37_decoy.fasta
sh human_g1k_v37_decoy.fasta.fai

Note: CosmicCodingMuts.vcf.gz and CosmicNonCodingVariants.vcf.gz must be in the same folder as the script when it is executed.

To index the resulting VCF file, use igvtools.

igvtools index <cosmicvxx.vcf>


Use --genome GRCh38 to map against GRCh38. Before doing so, if you are not on UPPMAX, adjust the settings in genomes.config to your needs.

To get the needed files, download the GATK bundle for GRCh38 from ftp://[email protected]/bundle/hg38/.

The MD5SUM of Homo_sapiens_assembly38.fasta included in that file is 7ff134953dcca8c8997453bbb80b6b5e.

From the beta/ directory, which seems to be an older version of the bundle, only Homo_sapiens_assembly38.known_indels.vcf is needed. Also, you can omit dbsnp_138_ and dbsnp_144 files as we use dbsnp_146. The old ones also use the wrong chromosome naming convention.
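To see which chromosome naming convention a given VCF uses, it is enough to look at the CHROM column of its first record. The following is a rough sketch, assuming the GRCh38 reference uses chr-prefixed contig names (chr1, chr2, ...); the function name vcf_contig_style is hypothetical and not part of CAW.

```shell
#!/bin/sh
# vcf_contig_style FILE
# Prints "chr" if the first record's CHROM starts with "chr",
# "nochr" if it does not, and "empty" if the file has no records.
# Files ending in .gz are decompressed on the fly with gzip.
vcf_contig_style() {
    file=$1
    case "$file" in
        *.gz) reader="gzip -dc" ;;
        *)    reader="cat" ;;
    esac
    # Skip header lines (starting with '#') and take the first column
    # of the first data line, then stop reading.
    chrom=$($reader "$file" | awk '!/^#/ { print $1; exit }')
    case "$chrom" in
        "")   echo "empty" ;;
        chr*) echo "chr" ;;
        *)    echo "nochr" ;;
    esac
}

# Example call (file name is a placeholder):
#   vcf_contig_style <dbsnp file>.vcf.gz
```

Under the stated assumption, a file that reports "nochr" would not match the GRCh38 reference and is one of the files that can be left out.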

Afterwards, the following needs to be done:

gunzip Homo_sapiens_assembly38.fasta.gz
bwa index -6 Homo_sapiens_assembly38.fasta
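Indexing a full human reference takes a while, so it can help to confirm afterwards that every index file was written. The sketch below assumes that the -6 option makes bwa name its index files <fasta>.64.<ext> rather than <fasta>.<ext>, as stated in the bwa index usage message; the function name missing_bwa_index is illustrative only.

```shell
#!/bin/sh
# missing_bwa_index FASTA
# Prints every expected 64-bit bwa index file that is absent or empty.
# Returns 0 if all five files are present, 1 otherwise.
# Assumption: `bwa index -6` writes FASTA.64.{amb,ann,bwt,pac,sa}.
missing_bwa_index() {
    prefix=$1
    missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$prefix.64.$ext" ]; then
            echo "$prefix.64.$ext"
            missing=1
        fi
    done
    return $missing
}

# Example call:
#   missing_bwa_index Homo_sapiens_assembly38.fasta && echo "index complete"
```

If any file is listed, rerunning the bwa index command above is the simplest fix.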


Use --genome smallGRCh37 to map against a small reference genome based on GRCh37. smallGRCh37 is the default genome for the testing profile (-profile testing).

The script can download and build the files needed for smallGRCh37, or build the references for GRCh37/smallGRCh37.


Only with --genome smallGRCh37. If this option is specified, the smallRef repository will be automatically downloaded from GitHub. Do not use it on the UPPMAX cluster Bianca or on similarly secured clusters where such downloads do not work or are not allowed.

nextflow run --download --genome smallGRCh37


Use --refDir <path to smallRef> to specify where the files to process are located.

nextflow run --refDir <path to smallRef> --genome <genome>


Same parameter used for

  • GRCh37
  • GRCh38 (not yet supported)
  • smallGRCh37