Intro to bacterial genomics

Here, in the interests of ‘if you have to email it twice, write a blog’, is my high-level overview of what a bacterial genomics pipeline looks like.

1. Quality-assess your FASTQs with e.g. FastQC, and visualise the results across your dataset with MultiQC. If the data is particularly bad, do quality trimming; if not, then don’t.
2. Do species-level identification and identification of mixed cultures with e.g. mash, kmerid or kraken.
3. Do variant calling against the appropriate reference genome (choose this using the top mash hit) with e.g. PHEnix.
4. Read the variants into a database (e.g. SnapperDB) or make a consensus genome by modifying the reference with all the variants you have identified. Most variant calling pipelines have an option to make this consensus genome, or you can do it from the VCF and the reference genome. Two important points for making a consensus genome: 1) if a position is mixed in the mapping to the reference, it should be called as an N, not as reference; 2) all the consensus genomes should be the same length, so it’s fine to have deletions as ‘-’ but insertions should not be included.
5. Gather all your consensus genomes into a single multi-FASTA file and run e.g. snp-sites to pull out the variable positions.
6. Make a phylogenetic tree with IQ-TREE, RAxML or FastTree.
7. Annotate and view the tree with FigTree, Microreact or iTOL.
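
To make the two consensus rules in step 4 (and the variable-site extraction in step 5) a bit more concrete, here is a minimal Python sketch. It is not how PHEnix, SnapperDB or snp-sites actually do it: the `(position, ref, alt, mixed)` tuples are a made-up, simplified stand-in for parsed VCF records, and the function names `build_consensus` and `variable_sites` are mine. It also assumes indels are left-anchored the way VCF usually writes them.

```python
def build_consensus(reference, variants):
    """Apply variants to a reference without ever changing its length.

    `variants` is a list of (position, ref, alt, mixed) tuples, a
    hypothetical simplification of VCF records: `position` is 1-based
    and `mixed` is True when the mapping shows a mixed position.
    """
    seq = list(reference.upper())
    for position, ref, alt, mixed in variants:
        start = position - 1  # VCF positions are 1-based
        if mixed:
            # Rule 1: a mixed position is called N, never reference.
            for i in range(start, start + len(ref)):
                seq[i] = "N"
        elif len(ref) == len(alt):
            # SNP (or same-length substitution): swap the bases in place.
            seq[start:start + len(ref)] = alt.upper()
        elif len(alt) < len(ref):
            # Deletion: keep the anchor base(s), gap the deleted bases.
            for i in range(start + len(alt), start + len(ref)):
                seq[i] = "-"
        # Rule 2: insertions (alt longer than ref) are skipped, so every
        # consensus genome stays the same length as the reference.
    return "".join(seq)


def variable_sites(alignment):
    """Keep only the columns where unambiguous bases differ.

    `alignment` maps sample name -> sequence; all sequences must be the
    same length. N and '-' are ignored when deciding if a column varies.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all consensus genomes must be the same length")
    keep = []
    for i, column in enumerate(zip(*seqs)):
        bases = {base for base in column if base in "ACGT"}
        if len(bases) > 1:
            keep.append(i)
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    alignment = {
        "ref": reference,
        # one SNP at position 2, one mixed call at position 5
        "a": build_consensus(reference, [(2, "C", "T", False), (5, "A", "G", True)]),
        # one 2 bp deletion after position 7, one insertion (ignored)
        "b": build_consensus(reference, [(7, "GTA", "G", False), (3, "G", "GTT", False)]),
    }
    for name, seq in alignment.items():
        print(name, seq)
    print(variable_sites(alignment))
```

With this toy input, sample `a` comes out as `ATGTNCGTAC` and sample `b` as `ACGTACG--C`, both still 10 bases long, and only the column holding the real SNP survives the variable-site filter. In practice you would let your variant calling pipeline and snp-sites do this for you; the sketch is only there to show why mixed positions and insertions need special handling.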

For AMR gene detection, Mykrobe and ARIBA are useful. SRST2 is a nice option for mapping-based results (if you ask yourself which is better, mapping or assembly, the answer is both!).
For GWAS, bugwas and SEER are the primary options.
Here is a paper on how to organise bioinformatics projects. I loosely follow this structure.
Plenty for you to get stuck into there!
