As the final part of my assembly optimisation I was interested in the impact of Titus Brown’s Khmer/Diginorm approach on speed and quality.
Ten samples (VTECs sequenced on GAII) were assembled with a variety of pipelines:
- SPAdes corrected fastqs, assembled with SPAdes
- Khmer-ed fastqs, SPAdes corrected, assembled with SPAdes
- SPAdes corrected fastqs, assembled with Velvet
- Khmer-ed fastqs, assembled with Velvet
So, the two things I’m testing here are:
- Whether Khmer/Diginorm can speed up SPAdes assembly without significantly impacting the assembly quality.
- Whether Khmer could be used in front of Velvet as an alternative to the much slower SPAdes correction step.
I used a two-step Khmer approach: normalize-by-median to a coverage of 20 with a k-mer size of 20, followed by filter-abund (for more details see the Khmer package). After Velvet assembly the contigs were corrected with REAPR to obtain corrected N50s (cN50). SPAdes was used without the --careful flag and with four k-mer sizes (21, 33, 55, 77). No REAPR correction was done on the SPAdes assemblies, as previous work found no significant misassemblies with SPAdes.
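For anyone unfamiliar with what normalize-by-median is doing, here is a toy sketch of the idea in Python. It is not Khmer's implementation: it uses exact k-mer counts in a `Counter` rather than Khmer's memory-efficient probabilistic counting, and it ignores reverse complements and read pairing. The function names are my own. It only illustrates the rule "keep a read if the median abundance of its k-mers so far is below the coverage cutoff"; the filter-abund step is not sketched.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Toy digital normalisation: keep a read only while its estimated
    coverage (median count of its k-mers among reads kept so far) is
    below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read too short to contain a k-mer
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Tiny demonstration with made-up reads and small k/cutoff values.
reads = ["ACGTTGCA"] * 10 + ["GGGGCCCCAA"]
kept = normalize_by_median(reads, k=4, cutoff=3)
print(len(kept), "of", len(reads), "reads kept")
```

In this toy run the redundant copies of the first read are discarded once its k-mers reach the cutoff, while the read from the low-coverage "region" is kept.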
Table 1: Impact of Khmer reduction on N50 and time to assemble of SPAdes
Khmer reduction of reads certainly speeds up assembly, reducing the time taken from 40 to 15 minutes. If you are assembling hundreds of genomes then this is a tasty decrease (you need to add a few minutes to this time for the Khmer steps, but these are very quick). However, it also reduces the N50 by 30%. If I had to take a quick guess as to why, I would say that either SPAdes requires high depth or that Khmer is not doing enough to take the paired reads into account.
Table 2: Comparison of Khmer and SPAdes correction for N50 and corrected (by REAPR) N50.
With the Velvet assembly, Khmer reduction improved assembly (corrected N50) significantly. Previous work has shown that SPAdes correction improves Velvet assembly cN50 ~3-fold (Musket correction also works well in this regard). Khmer reduction of the reads increases that by another 58%. However, if you compare the raw N50 with the REAPR cN50, you can see that Velvet is making a lot of spurious joins with the Khmer-Velvet approach compared with the SPAdes-Velvet approach, so you would really want to check your contigs after Khmer-Velvet.
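To make the raw N50 versus cN50 comparison concrete, here is a minimal Python sketch. It assumes that the corrected N50 is simply the N50 recomputed after contigs have been broken at suspected misassemblies. It does not reproduce how REAPR decides where to break; the contig lengths and break position below are made up, and `n50` and `break_contig` are my own helper names.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L together
    cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def break_contig(length, breakpoints):
    """Split a contig of the given length at internal positions,
    returning the lengths of the resulting pieces."""
    cuts = [0] + sorted(breakpoints) + [length]
    return [end - start for start, end in zip(cuts, cuts[1:])]


# Hypothetical assembly: five contigs, the longest containing one misjoin.
raw = [100, 80, 60, 40, 20]
corrected = break_contig(100, [50]) + [80, 60, 40, 20]

print("raw N50:", n50(raw))
print("corrected N50:", n50(corrected))
```

A single break in the longest contig is enough to pull the N50 down here, which is why a big gap between raw N50 and cN50 suggests the assembler is joining things it should not.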
On balance, I think that the SPAdes correction and assembly is the best option, although I would be interested to see the impact of different Khmer coverage cutoffs on these stats (higher coverage might help keep the SPAdes N50 up?).