Somatic mutations from the TCGA portion of the PCAWG dataset are only available to approved researchers. Since we cannot publicly release the full PCAWG somatic mutation dataset in Dig format, we instead provide instructions and scripts so that you can create the files yourself.

NOTE: There are a lot of mutations to parse. Each command will take some time to run.

Required files:
* A version of the hg19 reference genome (see ../../../dig_dat_files for a copy if you do not already have one)
* Simple somatic mutation annotation file for the ICGC portion of PCAWG
    - Download link: https://dcc.icgc.org/api/v1/download?fn=/PCAWG/consensus_snv_indel/final_consensus_passonly.snv_mnv_indel.icgc.public.maf.gz
    - The name will be something like: final_consensus_passonly.snv_mnv_indel.icgc.public.maf.gz
* Simple somatic mutation annotation file for the TCGA portion of PCAWG
    - Download link for approved researchers: https://icgc.bionimbus.org/files/0e8a845d-a4f4-40bc-890b-5472702d087c
    - The name will be something like: final_consensus_passonly.snv_mnv_indel.tcga.controlled.maf.gz

Required software:
* Dig
* bash

Step 0:
* Download all files in this directory to a directory on your machine.
* Place the simple somatic mutation annotation files for ICGC and TCGA in that directory.

Step 1:
* Run ./01_merge_and_parse.sh

Step 2:
* Run ./02_to_Dig_format.sh

Step 3 (optional):
* Run ./03_split_msi.sh
    - Splits MSI-high and MSI-low samples into separate mutation files for each cohort

Step 4 (optional):
* Run ./04_filter_hypermut.py
    - Removes samples with >3000 coding mutations
    - Filtered mutation files are placed in a new directory: filter_hypmerut/
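
To make the Step 4 filter concrete, here is a minimal Python sketch of the idea: count coding mutations per sample and drop every mutation from samples above the threshold. It is not the code in 04_filter_hypermut.py. The function name `filter_hypermutated`, the `(sample_id, variant_class)` record layout, and the set of MAF `Variant_Classification` values treated as "coding" are all assumptions made for illustration; the actual script may define coding mutations differently.

```python
from collections import Counter

# Assumption: these MAF Variant_Classification values count as "coding".
# The real script may use a different definition.
CODING_CLASSES = {
    "Missense_Mutation",
    "Nonsense_Mutation",
    "Nonstop_Mutation",
    "Silent",
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "In_Frame_Del",
    "In_Frame_Ins",
    "Splice_Site",
    "Translation_Start_Site",
}


def filter_hypermutated(mutations, max_coding=3000):
    """Drop all mutations from samples with more than `max_coding` coding mutations.

    `mutations` is a list of (sample_id, variant_class) tuples (hypothetical layout).
    Returns a new list; the input is left unchanged.
    """
    # Count only coding mutations per sample.
    coding_counts = Counter(
        sample for sample, variant_class in mutations
        if variant_class in CODING_CLASSES
    )
    # A sample is hypermutated when its coding count is strictly above the threshold.
    hypermutated = {s for s, n in coding_counts.items() if n > max_coding}
    # Remove every mutation (coding or not) that belongs to a hypermutated sample.
    return [m for m in mutations if m[0] not in hypermutated]


if __name__ == "__main__":
    # Toy data with a small threshold so the effect is easy to see.
    toy = [
        ("S1", "Missense_Mutation"),
        ("S1", "Silent"),
        ("S1", "Intron"),
        ("S2", "Missense_Mutation"),
    ]
    print(filter_hypermutated(toy, max_coding=1))
```

With `max_coding=1`, sample S1 has two coding mutations and is removed entirely, including its non-coding `Intron` record, while S2 is kept. The comparison is strict (`>`), matching the ">3000" wording in Step 4, so a sample with exactly 3000 coding mutations is retained.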