Validation of SKA analysis for genomic epidemiology

TL;DR: SKA is a high-quality bioinformatics tool which enables you to rapidly do genomic epidemiology of very closely related isolates. SKA shows good correlation with traditional SNP calling techniques (BWA + GATK). SKA shows a consistent trend of identifying around 10 more SNPs between a sub-set of pairs of isolates than the traditional method, I’m…

Basic options for bioinformatics data analysis

This post is written in the spirit of “if you write it in an email to more than one person, turn it into a blog”. It’s for beginners who are just starting out in bioinformatics and who need information about which path to go down and which programming language to learn. Firstly, you will need to…

Parsing treebreaker output

Treebreaker is a nice piece of software which detects changes in the distribution of a phenotype across a phylogenetic tree. Code here. Paper here. In my case, I’m using it to find clades of lineage 4 M. tuberculosis which are associated with East/Southeast Asia. It takes a phylogenetic tree and a phenotype file (the leaf labels and…

How to get to Ha Giang from Hanoi Airport (Noi Bai)

I just spent an amazing three days motorbiking round the Ha Giang loop in northern Vietnam. Fantastic thing to do, just mind-blowingly, amazingly spectacular. However, I had a bit of a stressful time making the connection between my plane arriving into Noi Bai airport and the bus to Ha Giang, but it all worked…

Alternative ways of giving credit

It’s a truth universally acknowledged, that for a research career to flourish you need to publish first author papers. However, like many things concerning scientific publishing, this norm fails the ‘explain it to a non-biologist’ test. Non-biologist: So, how do you get credit for the papers you publish? Biologist: Well, if you are listed first…

TB incidence cartogram

I saw this amazing ‘natures-heartbeat’ cartogram, and thought it would be cool to have something similar for TB incidence. There don’t seem to be any TB cartograms already available, so I thought I would make my own. I used the cartogram package in R. After spending some time in the deeper circles of R dependency…

Causal inference and the spectrum of association studies

I’m reading an interesting article which Marc Lipsitch tweeted about: “The C-Word: Scientific Euphemisms Do Not Improve Causal Inference From Observational Data” by Miguel Hernan. The main take-away messages for me are that almost all scientific studies are aiming for causal inference, but people working on non-intervention, non-randomised studies (aka association/descriptive/exploratory studies) are generally discouraged from…

Intro to bacterial genomics

Here, in the interests of ‘if you have to email it twice, write a blog’, is my high-level overview of what a bacterial genomics pipeline looks like. 1. Quality assess fastqs with e.g. fastqc, and visualise these across your dataset with MultiQC. If the data is particularly bad, do quality trimming; if not, then don’t. 2. do species…