This post is by Lauren.
Two weeks in Michigan by a lake in the springtime: bliss! PLUS I got to test out my bioinformatics skills and learn all the areas where I am severely lacking. This was really driven home on the first day, when we were asked to do a pre-learning quiz with some pretty tricky bioinformatics problem-solving questions in it. Having started a PhD six months ago that includes a lot of bioinformatics, I had been sadly deluded about my expertise in many areas, and being tested in this format made me realise that it was time to roll up my sleeves and get to work.
The biggest area of blurriness for me was RNA-seq, which I was happy to learn was gonna be one of the focal points of the course. For this section we were split into groups and asked to tackle an RNA-seq analysis in the way we thought best, by writing a pipeline to do it. Our group decided to map our reads to the reference transcriptome and BLAST the reads that did not map, but many other groups chose different approaches. Luckily we had Diego (a bioinformatics rock star) in our group, and we managed to produce a kick-ass pipeline complete with tutorial and ‘pushed’ it to GitHub.
Being an assembly enthusiast, I was particularly psyched to hear Titus Brown and Rayan Chikhi talking about digital normalisation and KmerGenie respectively. Digital normalisation is a very effective way of minimising the number of reads you have to use to achieve de novo assembly. It does this by discarding reads from regions that have already reached a specified coverage, and by discarding erroneous reads that have very low coverage. It is highly useful because it only requires a single pass over the reads, which makes it a very desirable step before running an assembler. However, if (as in my case) you are working with a very repetitive genome, you had better hope that the paired-end mate of a repetitive read is retained before the coverage cutoff is reached. KmerGenie computes a k-mer abundance histogram for a range of k values, assumes that all correct k-mers should be distributed as a Gaussian, and fits a model to each histogram. It does this very efficiently by using sampling to estimate the histograms, so it can give you a brilliantly quick estimate of the best k to use for your assembly. Both excellent tools!
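To make the digital normalisation idea a bit more concrete, here is a minimal sketch of the "discard reads once the coverage is reached" step. It is my own toy illustration, not khmer's actual code: the function names `kmers` and `digital_normalise` and the default values for `k` and `cutoff` are made up for the example, it keeps exact counts in a plain Python `Counter` (a real tool would need something far more memory efficient), and it leaves out the removal of low-coverage erroneous reads entirely. It assumes a read's coverage can be estimated as the median count of its k-mers among the reads kept so far.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every k-mer (substring of length k) in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalise(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage is still below cutoff.

    Hypothetical toy version: coverage is estimated as the median count
    of the read's k-mers among the reads kept so far. Exact counts are
    stored in a Counter, which is an assumption made for readability.
    """
    counts = Counter()
    kept = []
    for read in reads:  # a single pass over the reads
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to coverage
    return kept


if __name__ == "__main__":
    # Ten copies of one read plus a single unrelated read.
    reads = ["ACGTTGCA"] * 10 + ["GGGGCCCCAA"]
    print(digital_normalise(reads, k=4, cutoff=3))
```

Because the decision for each read only depends on the counts built up so far, the whole thing runs in one pass, and it also shows why read order matters for repetitive genomes: once a repeat has hit the cutoff, later reads from it are dropped regardless of where their mates come from.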
I also had a rude awakening about a severe lack of quality awareness in my sequencing analysis so far. I seem to have completely skipped the bit you are supposed to do before diving straight into assembly, so I will definitely be re-evaluating the genomes I had considered assembled to the best of my ability. Luckily for me, we learnt all about FastQC, khmer, Trimmomatic, the FASTX-Toolkit and more, so quality control can actually happen now.
As well as all this super relevant info I was picking up, there was a lot of fun to be had. I tried to avoid playing too many American sports and making a complete tit out of myself, but got roped in a handful of times. There was also a beautiful lake to swim in and plenty of hilarious yanks to laugh at *ahem* I mean with. Most shocking and terrifying of all, I can now say I have been in a tornado!!