I’m trying to download some data from the Sequence Read Archive but it seems to be behaving quite unusually. Just writing a post to summarise this and see if anyone has seen it before. If I get an answer from the SRA helpdesk, I will post it in the comments/an edit.
`sratoolkit.2.5.2-centos_linux64/bin/fastq-dump --split-files ERR024630`
This, somewhat confusingly, results in 3 fastq files.
-rw-r--r-- 1 philip philip 444M Jan 4 17:30 ERR024630_1.fastq
-rw-r--r-- 1 philip philip 180M Jan 4 17:30 ERR024630_3.fastq
-rw-r--r-- 1 philip philip 425M Jan 4 17:30 ERR024630_4.fastq
_3 has only one base per read, but the same number of lines as _1 and _4.
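For anyone who wants to check this sort of thing on their own files, here is a minimal sketch that counts the reads and tallies the read lengths in a fastq. It assumes plain, uncompressed files with exactly four lines per record (no wrapped sequences), and `read_length_summary` is just a name I made up for illustration.

```python
from collections import Counter
from itertools import islice


def read_length_summary(lines):
    """Return (number_of_reads, Counter of read lengths) for FASTQ lines.

    Assumes strict 4-line records, so the sequence is the 2nd line of
    every group of four.
    """
    lengths = Counter()
    # Start at index 1 (the first sequence line) and step by 4.
    for seq in islice(lines, 1, None, 4):
        lengths[len(seq.rstrip("\r\n"))] += 1
    return sum(lengths.values()), lengths


if __name__ == "__main__":
    import sys

    # Usage: python read_lengths.py ERR024630_1.fastq ERR024630_3.fastq ...
    for path in sys.argv[1:]:
        with open(path) as handle:
            n_reads, lengths = read_length_summary(handle)
        print(path, n_reads, dict(lengths))
```

A file like _3 should show up as a single length of 1 with the same read count as the other files.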
So, I downloaded the same data from the ENA, just using
Ok, now the real weirdness starts. These files have the same number of lines as the ones downloaded from SRA. They have the same read headers, and the corresponding reads share a large overlap, but are not 100% identical.
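A rough sketch of how this kind of comparison can be done: walk two fastq files in parallel, check whether the read IDs match, and find the longest stretch of sequence that each pair of reads shares. This is not the exact method I used, just one way to do it with the standard library. It assumes 4-line records and that both files list the reads in the same order; the function names are made up for illustration.

```python
from difflib import SequenceMatcher


def parse_fastq(lines):
    """Yield (read_id, sequence) from 4-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator four times groups lines into records;
    # an incomplete trailing record is silently dropped.
    for header, seq, _plus, _qual in zip(it, it, it, it):
        yield header.split()[0].lstrip("@"), seq.strip()


def longest_shared_run(a, b):
    """Return (shared_substring, offset_in_a, offset_in_b)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size], m.a, m.b


def compare_fastqs(lines_a, lines_b):
    """Yield (id, ids_match, identical, shared_len, offset_a, offset_b) per read pair."""
    pairs = zip(parse_fastq(lines_a), parse_fastq(lines_b))
    for (id_a, seq_a), (id_b, seq_b) in pairs:
        shared, off_a, off_b = longest_shared_run(seq_a, seq_b)
        yield id_a, id_a == id_b, seq_a == seq_b, len(shared), off_a, off_b
```

To use it, open one file from each source and loop over `compare_fastqs(handle_a, handle_b)`. The two offsets show where the shared stretch starts in each read, which makes it easy to see on which end the extra bases sit.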
The boxed and highlighted sections are the same sequence.
What on earth is going on? Any ideas?
The data is quite old (2012), so it may be something to do with that? I’m tempted to trust the ENA data, because they got the initial deposition. Also, they return the expected number of fastqs (i.e. 2). But I really like the accession-based download syntax of the sra-toolkit, so I would like to use that if possible. Also, it’s not great if the two databases are giving different results! Hopefully there is an explanation/fix.
Apparently the data was uploaded in SRF format (which is a new one on me) as some sort of pooled experiment type, which is no longer encouraged. After some clues from the very helpful ENA helpdesk and some further investigation, it turns out that the SRA _1 file has Illumina barcodes on the 3′ end (and is 82 bp), while SRA _4 has no barcodes (and is 76 bp). ENA _1 (76 bp) has no barcodes, and ENA _2 (76 bp) has barcodes on the 5′ end! What a mess! The main thing that is bugging me now is why ENA _1 and ENA _2 are the same length, when one of them has the barcode on and the other one doesn’t. So, I think the most trustworthy data will be the SRA data with the barcodes removed? But, what a palaver!
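If you do go down the route of stripping the barcodes from the SRA _1 reads, something like the sketch below would do it. It simply chops a fixed number of bases (and the matching quality values) off the 3′ end of every read. The barcode length is an assumption: I am guessing 6 from the 82 bp vs 76 bp difference, so check a few reads by eye before trusting it. `trim_3prime` is a made-up name, and 4-line records are assumed.

```python
def trim_3prime(lines, n_bases):
    """Yield FASTQ lines with the last n_bases removed from each read.

    Sequence and quality strings are trimmed by the same amount so the
    records stay valid. Assumes strict 4-line FASTQ records.
    """
    if n_bases <= 0:
        raise ValueError("n_bases must be a positive integer")
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        yield header
        yield seq.rstrip("\r\n")[:-n_bases] + "\n"
        yield plus
        yield qual.rstrip("\r\n")[:-n_bases] + "\n"


if __name__ == "__main__":
    import sys

    # Usage: python trim.py ERR024630_1.fastq 6 > trimmed.fastq
    # The 6 is an assumed barcode length (82 bp - 76 bp), not a confirmed value.
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(trim_3prime(handle, int(sys.argv[2])))
```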
Also, here is a script to download data from the ENA ftp site https://gist.github.com/flashton2003/336e67bbc513b0cf3f07