Biologists often want to see annotated genomes so they can start digging into the genome using tools like Mauve and Artemis.
First off though, I think it is really important to provide the ‘whole-y genome’ disclaimer (h/t Neil Perry). Before letting the biologist loose on the genome, it is vital that they really meditate on the fact that the genome assembly is a representation of the genome rather than the ‘true’ genome. Often biologists are surprised by this, but it is better that they are surprised up front than brought down to earth with a bump after they think they have discovered something interesting.
After assembly with e.g. Velvet, we are left with some number of contigs (typically in the hundreds for STEC). If the aim is to compare a pair of strains in e.g. Mauve, I think it is sensible to align the contigs of each strain against a common reference using e.g. Mauve contig mover (if anyone has a command line alternative to this, I would like to hear about it). This initial alignment will result in a more coherent picture in Mauve, although it may be necessary to double down on the whole-y genome disclaimer and emphasise that the contigs have been ordered against a common reference and that the true genome arrangement could be very different.
For annotation we use Prokka from Torsten Seemann’s group. It is a great tool, annotating your genome and providing you with a .gbk file from Velvet’s contigs.fa output in 10-15 minutes. However, depending on your version of Prokka, the annotation might not load when you open the Prokka .gbk output in Mauve. Someone from our group got in touch with Torsten about this and he kindly provided the solution. Below is how I explained the solution to a biologist colleague…
Finally, if you want to look at your genome in Artemis it will have to be in a single contig, as Artemis only loads a single contig at a time. In this case, I would take the contigs.fa, align it to the reference, strip out all the fasta headers except the first one, and then annotate. This would obviously come with a triple whole-y genome assembly disclaimer.
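The header-stripping step can be sketched in a few lines of Python. This is only a rough illustration, assuming the contigs have already been ordered against the reference; the function name `merge_contigs` and the re-wrapping of the sequence to 60 columns are my own additions, not part of any tool mentioned here. Note that the contigs are joined directly, so the contig boundaries are lost, which is exactly why the triple disclaimer applies.

```python
def merge_contigs(fasta_text, width=60):
    """Keep only the first FASTA header and join all sequence lines under it.

    Assumes the contigs are already in the desired order (e.g. reordered
    against a reference). `width` is the output line length, a cosmetic
    choice made for this sketch.
    """
    header = None
    chunks = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is None:
                header = line  # keep the first header only
            continue  # drop every later header
        chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    sequence = "".join(chunks)
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + wrapped) + "\n"
```

To use it on a file, read the ordered contigs into a string, pass that string to `merge_contigs`, and write the result to a new fasta file before annotating.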
----- How to change Prokka .gbk files to work with Mauve -----
If the sequence loads but isn’t annotated, the following is the likely problem…
In the gbk files produced by Prokka, only the first contig has an entry for ACCESSION and VERSION. If you open the gbk file with a text editor (e.g. Sublime Text) these will look something like
If you look at the headers of the other contigs, these entries will be empty, i.e.
Do a find and replace for ACCESSION\n (this means ACCESSION followed immediately by a newline). For this to work you need to click the little .* (regular expression) option to the left of the find and replace boxes. Replace it with ACCESSION PROKKA_06052013 followed by a newline (\n), so that the line break is kept. Repeat for VERSION.
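If you have a lot of files to fix, the same find and replace can be scripted. The following is a minimal sketch in Python rather than anything provided by Prokka or Mauve: the function name `fill_empty_fields` is hypothetical, the placeholder value is the one used in the manual edit above, and the single space between the keyword and the value simply mirrors that manual replacement. It only touches ACCESSION and VERSION lines that have no value, and it assumes Unix line endings.

```python
import re

# Matches an ACCESSION or VERSION line that carries no value
# (optionally followed by trailing spaces or tabs).
EMPTY_FIELD = re.compile(r"^(ACCESSION|VERSION)[ \t]*$", flags=re.MULTILINE)


def fill_empty_fields(gbk_text, placeholder="PROKKA_06052013"):
    """Return (fixed_text, number_of_lines_changed).

    Lines that already have a value, such as those of the first contig,
    are left untouched.
    """
    return EMPTY_FIELD.subn(
        lambda match: f"{match.group(1)} {placeholder}", gbk_text
    )
```

The returned count is a quick sanity check: for a file where only the first contig has values, it should equal twice the number of remaining contigs.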