Biologists often want to see annotated genomes so they can start digging into the genome using tools like Mauve and Artemis.
First off though, I think it is really important to provide the ‘whole-y genome’ disclaimer (h/t Neil Perry). Before letting the biologist loose on the genome, it is vital that they really meditate on the fact that the genome assembly is a representation of the genome rather than the ‘true’ genome. Biologists are often surprised by this, but it is better that they are surprised up front than brought down to earth with a bump after they think they have discovered something interesting.
After assembly with e.g. Velvet, we are left with some number of contigs (typically in the hundreds for STEC). If the aim is to compare a pair of strains in e.g. Mauve, I think it is sensible to first order the contigs against a common reference using e.g. Mauve contig mover (if anyone has a command line alternative to this, I would like to hear about it). This initial alignment will result in a more coherent picture in Mauve, although it may be necessary to double down on the whole-y genome disclaimer and emphasise that the contigs have been ordered against a common reference and that the true genome arrangement could be very different.
For annotation we use Prokka from Torsten Seemann’s group. It is a great tool: it annotates your genome and gives you a .gbk file from Velvet’s contigs.fa output in 10-15 minutes. However, depending on your version of Prokka, Mauve might not load the annotation when you open the Prokka .gbk output. Someone from our group got in touch with Torsten about this and he kindly provided the solution. Below is how I explained the fix to a biologist colleague.
Finally, if you want to look at your genome in Artemis it will have to be in a single contig, as Artemis only loads a single contig at a time. In this case, I would take the contigs.fa, align it to the reference, strip out all the FASTA headers except the first one, and then annotate. This would obviously come with a triple whole-y genome assembly disclaimer.
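The header-stripping step for Artemis can be sketched in a few lines of Python. This is only a rough illustration, not part of Prokka, Mauve or Artemis: the function name merge_contigs and the 60-column line width are my own choices, and it assumes the contigs in the file have already been ordered against the reference. The sequences are simply joined end to end, so contig boundaries are lost in the output.

```python
def merge_contigs(fasta_text, width=60):
    """Join every record of a multi-FASTA string into one record.

    Hypothetical helper for illustration: keeps the first header line,
    drops all later headers, and concatenates the sequences in file order.
    """
    header = None
    chunks = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is None:
                header = line  # keep only the first header
            continue  # drop every later header
        chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    sequence = "".join(chunks)
    # Re-wrap the merged sequence to a fixed line width.
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + wrapped) + "\n"
```

To use it, read the ordered contigs.fa into a string, pass it to merge_contigs, and write the result to a new file that you then annotate.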
-------- How to change Prokka .gbk files to work with Mauve --------
If the sequence loads but isn’t annotated it is likely the following problem…
In the gbk files produced by Prokka, only the first contig has an entry for ACCESSION and VERSION. If you open the gbk file with a text editor (e.g. Sublime Text) these will look something like
If you look at the headers of the other contigs then this won’t be present i.e.
Do a find and replace for ACCESSION\n (this means ACCESSION followed immediately by a newline). For this to work, you need to click the little .* (regular expression) option to the left of the find and replace boxes. Replace it with ACCESSION PROKKA_06052013\n, keeping the \n at the end so that the following line is not pulled up onto the ACCESSION line. Repeat for VERSION.
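If you would rather not do this by hand, the same find and replace can be sketched as a small Python function. Treat it as a rough illustration under a few assumptions: the function name fill_empty_ids is made up, the placeholder PROKKA_06052013 is just the example value used above, and the keyword is padded to 12 columns as in the standard GenBank flat-file layout rather than followed by a single space. I have not checked it against every Prokka version.

```python
import re


def fill_empty_ids(gbk_text, placeholder="PROKKA_06052013"):
    """Give empty ACCESSION and VERSION lines a placeholder value.

    Hypothetical helper for illustration. Lines that already carry a
    value do not match the pattern and are left untouched.
    """
    # A keyword alone on its line, optionally followed by spaces or tabs.
    empty_field = re.compile(r"^(ACCESSION|VERSION)[ \t]*$", re.MULTILINE)

    def add_value(match):
        # Pad the keyword to 12 columns (assumed GenBank layout), then
        # append the placeholder. The newline is not part of the match,
        # so line breaks are preserved.
        return match.group(1).ljust(12) + placeholder

    return empty_field.sub(add_value, gbk_text)
```

Read the Prokka .gbk file into a string, pass it through fill_empty_ids, write the result to a new file, and load that file in Mauve.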
3 thoughts on “Whole-y genome annotation and problem with Prokka genbank files in Mauve”
I’m sorry I didn’t see this blog post earlier – I think I was still in jetlag haze after getting back from ABPHM!
1. I don’t actually recall being emailed about this problem, but that doesn’t mean it didn’t happen, and I’m glad I was helpful
2. It is very odd that the later contigs have empty ACCESSION and VERSION. The GenBank file is actually created by “tbl2asn” so either I’m feeding it something wrong, or NCBI has a bug (or feature). I’ll look into that!
3. I wonder why Mauve fails here. I think I’ll contact Aaron Darling and see what he says about it.
4. It was good to meet some of your team at ABPHM. Next time I’m in the UK I’ll try and squeeze in a visit to the lab!
5. Could you add an extra N to the end of my surname please? Thank you 🙂
That’s it for now. Thanks for blogging about it.
Thanks for the comment Torsten, you should definitely come for a visit next time!
I do this the other way.
1. First, order the assembled scaffolds with Mauve
2. Annotate the scaffolds with Prokka (I don’t concatenate at this point)
3. Concatenate the annotations into a single GenBank record using a script I wrote: https://github.com/happykhan/seqhandler