We, like many people, are interested in optimal de novo genome assembly. When you assemble a genome you want the best possible representation of the ‘true’ genome. How can we obtain the best possible representation?
In the past we have used VelvetOptimiser to do our assemblies. This assembles with a range of k-mers and returns the best assembly (edit: in terms of N50). This approach produces assemblies with high N50s, which is a handy rule of thumb for assembly quality.
However, like the old adage ‘as soon as a measure becomes a target, it ceases to be a valid measure’, we are wondering whether optimising for N50 in this way provides the ‘best’ assemblies.
So, we decided to assess a couple of different k-mer estimation tools, Kmer Genie and VelvetK, and see how assemblies with their K-mers stack up against VelvetOptimiser. The test sample was paired-end fastq data from a shiga toxin producing E. coli O157 I’m working on, representative of our samples in terms of coverage and quality.
The three different assemblies were then assessed with Quast, a reference (I used E. coli O157 Sakai) based assembly quality assessment tool.
Table 1: Characteristics of the different k-mer estimation tools, calculated by Quast. Velvet Optimiser k-mer range was 21-121, to match the Kmer Genie range. Edit: non-velvet optimiser assemblies were repeated with -exp_cov and -cov_cuttoff set as ‘auto’ with much improved results. Also did Velvet with the velvet optimiser k-mer to compare the improvement of optimiser only.
All k-mer estimation tools gave k-mers between 63-69. As expected velvet optimiser gave a Edit: slightly better NG50.
It is the misassemblies stat that really leaps out. Quast is saying, that relative to Sakai, the VelvetOptimiser assembly had 92 misassemblies, while VelvetK had 44 and Kmer Genie had 12.
So, are there genuine biological differences between the Sakai genome and the sample genome (caused by e.g. recombination or rearrangement), or, is VelvetOpt forcing misassemblies in the quest for a better N50? How to tell? Well, I think it is very difficult (impossible?) to 100% determine this using a bioinformatic approach but we can get a better feel for the answer if we use the paired end information in our dataset.
Our hypothesis is that if a contig is correctly assembled, it will contain a high proportion of properly paired reads. Velvet should take this info into account when assembling – but does it? We mapped the original reads back to the three assemblies using bwa and I wrote a script (using pysam) to calculate the percentage of reads in each 100 bp window (iterating by a single base each time) that are properly paired and produce a plot of percentage mapped along the contig length.
Figure 1: Percentage of properly paired reads in a 10 bp window along the length of the contig. This contig was not highlighted as misassembled by Quast (i.e. was a positive control). X axis is base-pairs along the contig , y axis is percentage of reads properly paired. There are bound to be dips at either end of the contig, length of this dip will be a function of length of DNA fragment sequenced (~700bp in this case).
Figure 2: Percentage of properly paired reads – two unrelated contigs, artificially stuck together by myself (negative control). The junction is at about 4500 bp
So, how does this work on contigs labelled as misassembled by Quast?
Figure 3: Percentage of properly paired reads – contig labelled as misassembled by Quast. There are two dips in the percentage of properly paired reads, one at around 20000 bp, the other a couple of thousand bps downstream.
So, figure 3 certainly looks a lot different from the the figure 2 (negative control), but is also different from figure 1 (positive control). Does this mean that this contig is misassembled? There are a couple of little validations we can do. Firstly, look at this contig in Tablet and there is a series of unpaired reads at the genomic location indicated by the script. Also, where the reads are ‘properly paired’, the two halves of the pair overlap. Secondly, we can look at how this contig assembled in the Kmer Genie assembly – see edit below.
Edit: when exp_cov and cov_cutoff were set to auto in the velvet assembly, the kmer_genie assembly resulted in a single contig which Quast then annotated as misassembled.
This all seems like along way of telling you what you already know – if you want long, questionable contigs then go with VelvetOptimiser, if you want Edit: slightly more conservative contigs, go for Kmer Genie or VelvetK. However, I feel more comfortable with this kind of information rather than just having a black box of assembly.
While chatting about this with Anthony Underwood, he said that he thinks the killer experiment in this area will be an assembly of Illumina paired end reads from a strain for which a Sanger reference is available – then you can call true misassemblies to help calibrate your tools. Either that or a shed-load of PCRs to characterise what it means when properly paired reads drop below a certain percentage.
p.s. the script is very slow (takes about a minute for a 10 kb contig) due to the way I have done the windowing (with a step of 1), there is an option in the comments of the script for changing this but it messes with the display of the graph for a reason I’m not quite sure about yet. The script is very much a work in progress and if I have missed any bugs or made a giant error then please let me know.
p.p.s the script is only in the public folder of my dropbox, as I’m not cool enough for github. yet.