This post is about reproducing the analysis from a paper, and the first step is obviously to get the data. In this case, the data is from the Sanger and the primary
I’m interested in Salmonella Typhimurium ST313, a potentially invasive strain that is possibly in the process of genome reduction leading to human host restriction. See here, here and here for more info.
The most recent big genomics paper was the Okoro et al., Nature Genetics one linked to above. So, let’s go through the process of getting this data from the ENA so we can take the first steps to replicating their analysis.
Step 1. Go to your paper of interest, and grab the SRA accessions. In the case of Okoro et al., there is no BioProject listed, so we can take a single Accession number from their supplementary info.
Step 2. Take your ENA accession (e.g. ERS033133) and search http://www.ebi.ac.uk/ena for it.
Step 3. As you don’t want just one isolate, but all of them, click on the study accession that should be listed in the result from your search of ENA.
Step 4. There should now be a large number of samples listed, which you can view as text and then download.
Step 5. Take the fastq FTP paths from the downloaded text file, e.g. ftp.sra.ebi.ac.uk/vol1/fastq/ERR023/ERR023768/ERR023768_2.fastq.gz
Step 6. Download all of these, either with a shell script or a Python script calling os.system(). wget did a nice job of this for me; if you are on a Mac, see here for how to install it.
Step 7. Now that you have a bunch of someone else's data, the fun really begins! 😉
Extra Bonus Step 8. Oh, but why are there only 48 samples here when the paper has nearly 200?! Well, in this case, the data in this project seems to be split across three ‘studies’. You can kind of tell this from the distribution of the ERS numbers, but I wouldn’t want to stake my reputation on that kind of heuristic.
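If you'd rather script Steps 4 to 6 (and cope with the data being split over several studies), here is a minimal Python sketch. It assumes the text report you downloaded is tab-separated with a header row, and that the FTP paths sit in a column called fastq_ftp with paired files separated by semicolons; check your own file, because that layout is my assumption. The function names are made up for illustration, and it calls wget through subprocess rather than os.system().

```python
import csv
import io
import subprocess


def extract_fastq_urls(report_text, column="fastq_ftp"):
    """Return FTP URLs found in a tab-separated text report.

    Assumes a header row and a column (default "fastq_ftp") in which
    paired-end files are separated by semicolons.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    urls = []
    for row in reader:
        for path in (row.get(column) or "").split(";"):
            path = path.strip()
            if path:
                # Paths in the report have no scheme, so add one for wget.
                urls.append(path if "://" in path else "ftp://" + path)
    return urls


def merge_reports(report_texts, column="fastq_ftp"):
    """Combine URLs from several study reports, dropping duplicates."""
    seen = set()
    merged = []
    for text in report_texts:
        for url in extract_fastq_urls(text, column):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def build_wget_command(url, outdir="fastq"):
    """Build a wget call: -c resumes partial files, -P sets the target directory."""
    return ["wget", "-c", "-P", outdir, url]


def download_all(urls, outdir="fastq"):
    """Run wget once per URL; check=True raises if wget exits with an error."""
    for url in urls:
        subprocess.run(build_wget_command(url, outdir), check=True)
```

Something like download_all(merge_reports([open(f).read() for f in report_files])) would then fetch everything, where report_files is a hypothetical list of the text files you saved for each study.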
P.S. You can also do much of the above programmatically via ENA’s RESTful API: http://www.ebi.ac.uk/ena/browse/search-rest#data_warehouse
e.g. the below will get all the fields listed after ‘fields’ for accession ERP000113