If you are interested in calling indels with GATK, read on. If not, don’t.
So, this annoying guy asked me to add an analysis of indels* to a paper that has been itching to get off my desk for months. I finally got around to doing this, so I thought I would write down the process in case anyone else is interested. I’m going to be using GATK, as this is our standard variant caller, so it would be simpler to add to our routine pipeline if we ever want to.
1. Have a sorted bam file.
2. Run GATK (v.2.6.5 in my case) with the following command:
java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -nt 1 -R reference.fa -I sorted.bam -o indel.vcf -glm INDELS
2a. I investigated GATK’s realignment around indels (IndelRealigner), but it didn’t make any significant difference in the one sample I checked.
3. Have a look at the vcf file you just generated. Visually double-check the variants it has identified against the BAM using a viewer such as Tablet.
Fig 1: Good indel – QD = 12.43
Fig 2: Bad indel – QD = 3.56
4. Parse indel.vcf and only keep variants with a good QD score. I used QD > 10, as it is a nice round number ;), and because the variants below this seemed to show some evidence of alternate alleles when I looked at them, while those above did not. DO NOT APPLY YOUR USUAL AD RATIO CUTOFFS, as these will be calculated based on the last position before the indel and will therefore not reflect what is going on ‘inside’ your indel.
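For what it’s worth, the parsing in step 4 can be sketched in a few lines of Python. This is a rough sketch rather than the script I actually ran: the function names (parse_info, filter_by_qd, filter_vcf_file) are made up for illustration, it assumes QD sits in the INFO column (the 8th column) of each record, and it drops any record with no usable QD value, which is my own choice and may not suit you.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AC=2;DP=40;QD=12.43' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")  # flag entries get an empty value
        info[key] = value
    return info


def filter_by_qd(lines, min_qd=10.0):
    """Yield header lines unchanged, plus records whose QD is above min_qd.

    Assumption: records with a missing or non-numeric QD are dropped.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep the VCF header as it is
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        try:
            qd = float(parse_info(fields[7])["QD"])
        except (KeyError, ValueError):
            continue  # no usable QD annotation
        if qd > min_qd:
            yield line


def filter_vcf_file(in_path, out_path, min_qd=10.0):
    """Write the QD-filtered version of in_path to out_path."""
    with open(in_path) as vcf_in, open(out_path, "w") as vcf_out:
        vcf_out.writelines(filter_by_qd(vcf_in, min_qd))
```

Calling filter_vcf_file("indel.vcf", "indel.qd_filtered.vcf") would then write the filtered calls to a new file (the output name is just an example). Note that it deliberately ignores AD, for the reason given in step 4.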
p.s. if anyone has any more info on gotchas for indel analysis, I would be very interested to hear it.
* if the indel analysis turns up something really interesting, I may remove this, and claim the idea as my own 😉