GenomeTrakr meeting notes

Notes for GenomeTrakr 2015 meeting

Context

These notes were taken during the FDA GenomeTrakr Meeting held at the Omni Shoreham Hotel in Washington DC, USA from 23-24 September 2015. The notes are intended to be as objective as possible. Personal opinions or speculation are prefixed by the author’s initials below:

Contributors to this document

  • PA = Phil Ashton (Public Health England, UK)
  • TS = Torsten Seemann (University of Melbourne, Australia)
  • FB = Fiona Brinkman (Simon Fraser University, Canada)
  • EG = Emma Griffiths (Simon Fraser University, Canada)

Background reading

http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm

Eric Brown

Welcome, intro.

Marc Allard

  • 2015 an important year for GenomeTrakr → getting data, interpretation, ways to use data, DB grown 3x
  • 1000 isolates per month, 2 dozen clusters, 2-3 active clusters that are actionable (Errol Screening & Stats Office) (PA – how do they define actionable?)
  • Meet weekly with CDC and USDA, no longer research activity but actual testing.
  • On track to double or triple volume in the year ahead.

Ruth Timme – ORS (Office of Regulatory Science), CFSAN (Center for Food Safety and Applied Nutrition), FDA

  • Co-ordinating data flow, “state of the art network”–> paradigm shift in 2 parts (technology, open data)
  • Data is released before being analysed
  • raw data available 1-2 days after collection
  • Industry can look at the data as well – interesting to help them get on board?
  • Basic Data Flow
  • distributed network, 30 labs plus other partners
  • INSDC (International Nucleotide Sequence Database Collaboration)
  • contributors mostly FDA and state labs, expanded on PulseNet contributors (adding enviro isolates to make DB richer)
  • 65 countries, 45 states
  • >35 000 sequences in GenomeTrakr database
  • tracking Salmonella predominantly, also Listeria, E. coli, Campylobacter
  • 400 serovars sequenced so far.
  • sample collection spans 25 years
  • daily kmer trees posted to NCBI
  • kmer? evolutionary?
  • Pilot phase: build reference db @NCBI
  • data generation: MiSeq → QC→ SRA
  • upload to SRA with CLC plugin?
  • Need robust LIMS software to track sample, deployed Sept 1
  • Justin Payne developed a Python-based LIMS system for CFSAN; tracks sample → culture → wet lab → sequencing → submission → analysis. Interacts with a commercial package called “SLIMS”: http://www.genohm.com/slims/
  • Illumina Basespace SRA submission app
  • Talk of “new submission portal at NCBI” – the UI-less submission (http://anonsvn.ncbi.nlm.nih.gov/repos/v1/trunk/submit/public-docs/common/docs/UI-lessSubmissionProtocol.docx); PHE use this.
  • Successes over the last year:
    • 18 → 30 labs contributing (eg CDC, PHE); empowering labs to upload their own data and do their own quality control.
    • 500 → 35,000 genomes in database!
    • Independent uploading → NCBI improved submission (plug-in in dev/testing) w/ CLC
    • New LIMS for CFSAN
    • Switch to new submission portal (overcome hurdle of pre-reg samples)
  • Traditional response: took 52 days, WGS made it 12 days – previously unactionable, now feasible
  • Centre for Food Safety, UCD, Ireland is a submitter?
  • What are FERA contributing?
  • From UK + Eire perspective (PHE), would be interested in what FERA and Centre for Food Safety, UCD, Ireland are submitting.
  • Aren’t there quality issues with making data available before analysis? Yes, but everyone is aware of them, and they (Errol) seem quite open about them.
  • SLIMS report looks like excel file (behind FDA firewall), single page shows metadata, status, accession #’s, submitter etc, QA/QC status, can sort by run date

Palmer Orlandi – CSO, FDA Office of Foods and Veterinary Medicine

  • Goal: faster results w/ more specificity and sensitivity, provide “boots on the ground” for PH
  • Questions:
    • Why WGS?
    • What is the need for a network?
    • Where do regulatory labs fit in?
    • Where is the best place to start?
  • How does the animal feed aspect fit in?
  • begin with ORA labs→ Training (SOP dev, pilot projects, data transfer)→ Seq (field isolate inventories)→ Research
  • Deliverables eg identifying players, instrument acquisition, Salmonella strain inventory assessment
  • want to expand use of WGS beyond research, now an accepted clinical tool
  • Moving toward strain characterisation (amr, serotype, strain, virulence), as well as typing.
  • used for outbreak, compliance, surveillance → functional tool for regulatory practices
  • Sequencing all regulatory isolates: seq’d all Salmonella from 2011-14, working on 2002-10
  • gives estimation of relatedness of resistance in environment → tool for measuring impact of regulatory policies pertaining to food supply
  • National Antimicrobial Resistance Monitoring System (NARMS) – sequenced everything in retail meat (do they have any data on antibiotic usage in these herds?).
  • 99% accuracy of predicting resistance using WGS (see the concordance sketch after this list)
  • WGS Metagenomic sampling, animal husbandry
  • CFSAN → developed WGS applications to modernize typing (priority: Salmonella, L. monocytogenes etc), training program, compliance actions
  • GenomeTrakr contains 25K Salmonella sequences, 4K Listeria, 100’s of STEC, Vibrio, Cronobacter
  • Interagency partners: CDC, FDA, FSIS, Gen-FS (Genomics and Food Safety), NLM (National Library of Medicine), NIH
  • want to strengthen collaboration and co-ordination to increase efficiency
  • need proficiency testing and standards
  • LEVERAGING → is data provided by partners acceptable for regulatory purposes? Need to use ideas/technology, resources, and data to develop a next-gen “Tool Box”
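
A minimal sketch of how the genotype/phenotype concordance behind the “99% accuracy” figure above could be tallied. The data structures and drug names below are hypothetical, not the FDA/NARMS format; it only illustrates the calculation.

    # Hypothetical sketch: tally agreement between WGS-predicted and
    # phenotypic resistance calls. Keys are (isolate, drug) pairs and the
    # values "R"/"S"; this is illustrative, not the FDA/NARMS data format.
    def concordance(predicted, observed):
        shared = set(predicted) & set(observed)
        if not shared:
            return 0.0
        agree = sum(1 for key in shared if predicted[key] == observed[key])
        return agree / len(shared)

    predicted = {("iso1", "ciprofloxacin"): "R", ("iso1", "ampicillin"): "S"}
    observed  = {("iso1", "ciprofloxacin"): "R", ("iso1", "ampicillin"): "S"}
    print(f"{concordance(predicted, observed):.1%}")  # 100.0%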

Errol Strain – Analytics and Outreach, CFSAN

  • “GenomeTrakr 2015 FAQ”
  • minimal metadata for 3rd party submissions
  • Screening – do we have anything new that matches recent clinical case?
  • FDA routinely screen clinical isolates against enviro isolates in database, historical strains (do food/enviro isolates collected at different locations at diff times etc match?)
  • Quote: “K-mers are soooo 2015, SNPs are 2016.”
  • NCBI improved clustering based on SNPs
  • NCBI Genome Workbench – start day with coffee and latest k-mer tree http://www.ncbi.nlm.nih.gov/tools/gbench/
  • SNP Cluster Analysis: Ref genome (NCBI, de novo) → Run CFSAN SNP pipeline on Kmer cluster → Build tree, make refs (CFSAN SNP pipeline source code on GitHub under CFSAN)
  • separating signal from noise → very distant ref will include lots of SNPs and drag noise into analysis
  • FDA are using essentially the same ad hoc pipeline to do analysis as PHE are (and MDU).
    • Use k-mer tree to identify cluster, then use regular reference (closed or de novo) then map reads
    • Reference needs to be < 0.1% divergent (3000-5000 SNPs) – prefer same PFGE, same subtype (MLST?)
    • Important to screen for phage-introduced SNPs – trying to automate this (How to include these in a phylogenetic model?)  (MDU considering just masking all known phage from references for core SNP part)
    • Don’t want to mask repetitive regions: if the phage is stable (Enteritidis, ???) they want to include that region in the analysis. In a way that is further down the line; get the core genome stuff up and running first.
  • Hopes to automate cluster screening as current manual approach is not going to scale (<5 SNPs is actionable)
  • After notification of match: lookup IDs in PulseNet and contact, contact CDC and FDA district office if appropriate (cross state lines, FDA has jurisdiction) → recent isolate match may trigger enviro sampling and product testing → historical match may look at food transmission (check machinery for endemic contamination)
  • PNUSA = PulseNet USA identifier
  • metadata used for notification and actionable responses!
  • Minimal metadata: most important is collected_by, so you know who to contact when you identify a match!
  • Harmonizing Proficiency Testing (PulseNet and GenomeTrakr)
  • interpretation: no single threshold for all species/types but there are rough guides
  • The “Daubert Standard” for WGS analysis: https://en.wikipedia.org/wiki/Daubert_standard
  • Rough SNP guide: < 20 (identical, match), 20-100 (inconclusive), > 100 (exclude) + does it form a unique cluster with > 95% bootstrap support? Really need multiple lines of evidence, and a rich background DB (determine if cluster is distinct in tree) – see the triage sketch after this list
  • compute likelihood ratio → matrix of # of SNPs vs bootstrap % gives strength of match (table of support)
  • need enough isolates to provide statistical support, 35K genomes in DB makes these calculations more reliable eg Salmonella Montevideo
  • Salmonella Newport – 30-40 SNPs within strains from the same produce.
  • GenomeTrakr will move towards new technologies/platforms → always pilot projects and validation to contribute to decision making (adopt/not)–> metadata standardization will come as database gets bigger
  • Open to any short read data (FB: but what about MinION?) FDA are expecting to go through another cycle of research, pilot, adopt with long read tech.
  • Everyone must get their SNP pipelines published, e.g. the CFSAN pipeline.
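
The rough SNP guide above can be written down as a simple triage function. This is a sketch assuming only a pairwise SNP distance, a bootstrap support value, and a flag for whether the isolates form a distinct cluster; real interpretation combines epi evidence, serotype-specific criteria, and the background database.

    # Illustrative triage based on the rough guide quoted above; not an
    # official CFSAN rule. Thresholds: <20 SNPs match, 20-100 inconclusive,
    # >100 exclude, plus >=95% bootstrap support for a distinct cluster.
    def triage(snp_distance, bootstrap_support, distinct_cluster):
        if snp_distance > 100:
            return "exclude"
        if snp_distance >= 20:
            return "inconclusive"
        if distinct_cluster and bootstrap_support >= 0.95:
            return "match - seek supporting epi evidence"
        return "possible match - needs more evidence"

    print(triage(snp_distance=4, bootstrap_support=0.99, distinct_cluster=True))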

Darcy Hanes – Office of Applied Research and Safety Assessment

  • Historical collections include:
    • outbreak isolates
    • surveillance isolates
    • food manufacturing/processing plants isolates
  • need to modify SOPs to include new pathogens
  • Also sequencing historical isolates of Salmonella, including 6 dog farm isolates; now have 1000s of L. mono isolates from the 2014 ice cream case. Sequenced 2010 (? surely not?!) isolates from the same factory. This was aimed at answering the question: is this a new or ongoing problem?
  • Foodborne metagenomic sequencing w/ matching to GenomeTrakr isolates (done at NRL-MOD-1)
  • Cilantro (coriander) model of Salmonella source w/ FDA Vet Medicine
  • “Highlights from the CFSAN-ORA Next Generation Sequencing Network”
  • Started with Salmonella and Listeria. E.coli, Shigella, and Campylobacter being added and currently doing proficiency testing (2015 CFSAN Proficiency Testing).
  • The acronyms below are abbreviations of FDA regional labs:
    • DDL: Seq over 400 Salmonella isolates since May 2014
    • NRL: Listeria, Salmonella.  Also involved in virus genome project. Interesting sunflower seeds case (will add more here when mentioned later)
    • PRL-NW: lots done incl. seq isolates from manufacturers’ surfaces. First to have their 100 historical isolates done. Seq. 220 Salmonella, for example, and growing. Historical useful for IDing if there is a “new problem” or not.
    • MOD-1: interesting collaborations including CARTS project – metagenomics for direct speciation from environmental samples.
  • FERN (Food Emergency Response Network)
  • Checking if can match metagenomics Seq with WGS and Seq in GenomeTrakr
  • Database of Brucella melitensis and B. abortus for cheese studies.
  • High Cubes → high throughput genomic extraction → BSL3 facility (all cells are dead – can’t ship live cells!)

John Besser – CDC

  • Sharing his time slot with the Pope’s tour of Washington DC
  • Has been with PulseNet from the start, 20 years ago, but this is first GenomeTrakr meeting!
  • 87 labs in PulseNet USA, > 500,000 PFGEs done so far – 30-60 clusters per week
  • “PFGE is still a very effective method” – but molecular clock can be too fast, false positive differences
  • WGS introduction led to more clusters found, more solved, and smaller clusters (close to theoretical 2 isolates)
  • difficult to contribute to case definitions → connecting clinical/enviro isolate sources (Salmonella particularly bad, so outbreaks harder to solve)
  • Listeria Surveillance System →  most seq by CDC (developed metrics for Listeria after advent of WGS)
  • eg caramel apples outbreak → eliminated cases that were determined to be sporadic by WGS (previously might have thrown off epi investigation if included), <10SNPs/cluster
  • diff case definitions req’d for diff parts of the investigation:
  1. determine contaminated food
  2. link food to source
  3. most inclusive definition when determining who is involved (culprit)
  • wgMLST scheme lends itself to a “nomenclature” for L. mono isolates (aka allele database, automated)
  • “Any day now” is a common theme in terms of implementation of all this.
  • BioNumerics “point and click” tree interface, node click to all meta-data and case information
  • centralized computing so all state partners don’t need own infrastructure
  • BaseSpace, FDA, NCBI (external storage)
  • Nomenclature server = Scicomp
  • Automated case-case comparison of exposures (how will this work without hierarchical ontology? PA). Control for age, sex – power calculation? Do they have a food name ontology for exposure information?
    • 16S → genus
    • MLST (7 loci) →  clonal complex
    • RiboMLST → strain
    • wgMLST → clone
    • SNP → subclone
  • BioNumerics next version will include hqSNP (the CDC SNP pipeline)
  • hqSNP vs wgMLST – CDC does mask their mobile elements first for hqSNP (implicitly done with wgMLST)
  • Supervised machine learning / “disjunctive anomaly detection” for finding clusters humans would miss by eye
  • Culture independent, direct sequencing, metagenomics will lead to a LOSS of available isolate sequences (but with long reads we will be able to recover the isolates from the mixture)
  • Direct from patient metagenomics → actionable, eg diagnosis of neuroleptospirosis by WGS
  • Next frontier is single cell sorting and sequencing ($$$$)
  • Amplicon sequencing of known Virulence and Antibiotic Resistance markers, but phasing problem (aligning metagenomic genomes → pathogen may have chunks of genome found in commensals etc. → solved by Q linkage or 16S for identity or assembly)
  • noise → <0.15% genome coverage of gut microbiome from stool run on HiSeq → “Clutter Mitigation” strategies for reducing noise (improving lab procedures, technology)
  • Keen on potential of metagenomics for in situ pathogen ID. Phasing problems may be overcome by long read sequencing
  • Human faeces has DNA from everywhere – virus, bacteria, fungus, parasite, human, food, environment
  • can we use Bionumerics for standardizing metadata?
  • Pathogen detection website (NCBI, contact with a plan)
  • complication of “serving different masters”, need better way to trend track through all BioSample/BioProject umbrellas
  • also need a way of identifying quality of sequences that everyone can quickly compare
  • working group (FDA, CDC and some states) harmonizing SOPs → feedback will help make something that works for everyone
  • single largest source of error → label switching (sample names swapped), also contamination

Group Discussion – chaired by Eric Stevens – Ruth Timme, Marc Allard, Palmer Orlandi

  • Ruth Timme bringing up issue raised by Emma Griffiths (SFU) re standardizing food terms and other ontology dev for metadata. Would like to get States OK to go back and clean up the data.
  • EFSA has a standard 2500 term vocabulary for food items (there are a few food ontology initiatives – need to integrate)
  • NARMS has nice standardized food terms – i.e. chicken breast is always called that. (Another example for ref: some say “deli meat” while others say “cold cuts”.)
  • NCBI: has standardized field names but not terms within them. Need to work on this.
  • so many standard food term databases to choose from  – and none are being used in GenomeTrakr (yet)
  • Emma raising issue re people needing to talk together to get right resolution of food term dictionary – practical
  • Metadata is available on all isolates from ftp://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/

Bill Klimke – NCBI

  • NCBI Pathogen Pipeline:
    • QC
    • Kmer
    • Assembly
    • Annotation
    • Genome Placement
    • Clustering
    • SNP Analysis
    • Tree Construction
  • BioProject→ BioSample→ SRA→ GenBank
  • NCBI lists submitters/contributors; processing status currently internal, later will send a report if your seq doesn’t meet the minimum data quality standard
  • clustering based on <100 SNPs (based on pairwise distances → Cluster100), also k-mer distance – see the clustering sketch after this list
  • 100 SNPs doesn’t define an outbreak, but errs on the side of caution for inclusion → user needs to decide
  • doing retrospective analyses comparing NCBI pipeline to epi confirmed outbreaks (recruiting epi confirmed outbreak info from community)  
  • Genome Workbench can label trees with private metadata → done locally
  • Listeria pipeline, pairwise SNP distances. 4628 targets, 351 Genbank genomes
  • Most “clusters” have <= members – a cluster is <= 30 SNPs
  • Check out http://www.ncbi.nlm.nih.gov/pathogens/ for the most recent analysis
  • The vast majority of SE outbreaks have mean/max PWD of < 10 SNPs
  • New NCBI SNP detection pipeline will soon (?) be on the Pathogen FTP site: ftp://ftp.ncbi.nlm.nih.gov/pathogen/Results/
  • As part of combating AMR, NCBI wants to capture the phenotypic resistance profile of isolates being received.
  • Standardized template table for uploading antibiogram results, ontology by ASM/CLSI – these can be submitted independently of sequence data: http://www.ncbi.nlm.nih.gov/biosample/docs/antibiogram/
  • antibiogram fields capture experimentally derived break points, can compare to point mutations in DB
  • integrating CARD (NCBI has sent corrections to CARD), ResFinder, FDA Centre for Vet Medicine
  • Building a new AMR gene database: curated, ResFams HMMs, suitable naming, removal of bad sequences
  • Hierarchical naming based on similarity: KPC-2 → KPC family → class A beta-lactamase → NOT just “beta-lactamase”
  • assess accuracy of calls using finished high quality genomes, literature and “known truths”
  • NCBI will provide a AMR report back to submitters to help them correct AMR designations (and errors)
  • AMR point mutations not curated yet
  • contamination from sequencing instrument carry-over (Illumina) → take care of with statistics?
  • submitters need to label samples as potentially contaminated when there is a question
  • Sanger is biggest submitter → diff read platforms, data quality, contamination → NCBI needs to control input (curate in some way so does not allow upload of all samples)
  • is NCBI creating application for state labs to pull data to make own tree? no, mandate to create something for data integration
  • users can pull distance matrices to plug into own pipeline, but NCBI won’t build custom trees for you (they make Kmer trees and make available to public)
  • tutorials and webinars available for PH workers, SNP pipeline tutorials coming soon
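
A minimal sketch of grouping isolates at a 100-SNP cutoff from a table of pairwise distances, in the spirit of Cluster100 above. The single-linkage (union-find) rule and the toy distances are assumptions; the talk did not describe NCBI’s exact algorithm.

    # Minimal single-linkage grouping at a 100-SNP cutoff, assuming a
    # symmetric dict of pairwise SNP distances. The linkage rule is an
    # assumption for illustration; NCBI's exact clustering method may differ.
    from collections import defaultdict

    def cluster(pairwise, isolates, cutoff=100):
        # union-find over isolate pairs whose distance is below the cutoff
        parent = {i: i for i in isolates}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for (a, b), d in pairwise.items():
            if d < cutoff:
                parent[find(a)] = find(b)
        groups = defaultdict(list)
        for i in isolates:
            groups[find(i)].append(i)
        return list(groups.values())

    dists = {("A", "B"): 7, ("B", "C"): 30, ("A", "D"): 450, ("C", "D"): 500}
    print(cluster(dists, ["A", "B", "C", "D"]))  # [['A', 'B', 'C'], ['D']]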

Daniel Janies – U. North Carolina at Charlotte

  • diverse data → supraMAP → phylomapping → users
  • can ask where outbreaks originate, where they are going –>map spread
  • Visualisation of virus transmission
  • not just phylo info (evolution), but also SNPs → can include this info through use of colour → can animate
  • transmission events (denote change of city/”other shifts” through use of colours)→ compare frequency, “betweenness”
  • Betweenness Centrality: nodes with higher “betweenness” have higher influence on spread of pathogen → counts the number of times a node appears on transmission paths (see the sketch after this list)
  • PH interventions could concentrate on places with high Betweenness → longer lasting PH effect rather than local effect
  • can handle large, diverse datasets and put them in visual context → results lead to actionable conclusions → will help when contact tracing (narrative) not available
  • Camels had a role in MERS evolution but not important in ongoing spread
  • International data sharing + informatics = checks and balances for infection control
  • Strengths and weaknesses of contact tracing and genetic data
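
Toy illustration of the betweenness-centrality idea above on a hypothetical transmission network; requires the networkx package, and the node names are made up.

    # Nodes that sit on many shortest paths score highest; these would be
    # candidate sites for longer-lasting public health interventions.
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([
        ("CityA", "Hub"), ("CityB", "Hub"), ("CityC", "Hub"),
        ("Hub", "CityD"), ("CityD", "CityE"),
    ])

    for node, score in sorted(nx.betweenness_centrality(G).items(),
                              key=lambda kv: -kv[1]):
        print(f"{node:6s} {score:.2f}")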

Jeremy Peirce – Illumina

  • co-ordinates FDA sequencing efforts w/in GenomeTrakr
  • MiSeq QC = the SAV tool = Sequence Analysis Viewer: https://support.illumina.com/sequencing/sequencing_software/sequencing_analysis_viewer_sav.html
  • SAV (QA/QC metadata about the run) off the website, from the pre-alignment side
  • Flow cell:
    • data by cycle → over cycles, data gets worse at ends of reads, look at Q score distribution (“%>Q30”)
  • Anatomy and Dissection of Seq Run: Troubleshooting Journey
    • Cluster number – outcome of hybridization event:
      • upstream: prob is either hybridization or library prep
      • downstream: prob is downstream sequencing or analysis
  • Image analysis
    • goal of template generation – identify spots on map of all clusters
    • signal to noise measure
    • dim clusters → check signals in diff channels over course of run
    • clusters passing filter → measure of “cleanliness of data” → if “pure”, have high primary intensity
    • too many clusters → they will overlap and purity goes down
  • fluidics (reagents), thermals (heating and cooling), optics (imaging)
  • 16S has less diversity at beginning, more after run proceeds
  • M3 motor issues, can contribute 10% to quality eg Q30
  • NextSeq has different # tiles and diff camera
    • high phasing (behind)/pre-phasing(ahead)→ signal to noise ratio degrades over run
    • phase/pre-phase should be <0.5
    • errors scattered across lots of tiles (“rain pattern” = random) → probably optics problems
    • if there is a pattern over the course of tiles, probably fluidics
  • Library Diversity (indication of library degradation)
  • Sequence length (more useful if trimmed)
  • FastQ toolkit (see the trimming sketch after this list)
    • remove adaptors, trim reads based on quality
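
A minimal sketch of 3' quality trimming of a single read, as one piece of what the FastQ toolkit does; Phred+33 quality encoding is assumed, and this is not Illumina’s actual algorithm.

    # Trim low-quality bases from the 3' end of a read (Phred+33 assumed).
    def trim_3prime(seq, qual, min_q=20):
        phred = [ord(c) - 33 for c in qual]
        end = len(seq)
        while end > 0 and phred[end - 1] < min_q:
            end -= 1
        return seq[:end], qual[:end]

    seq  = "ACGTACGTACGT"
    qual = "IIIIIIIII###"          # '#' = Q2, 'I' = Q40
    print(trim_3prime(seq, qual))  # ('ACGTACGTA', 'IIIIIIIII')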

Fiona Brinkman – Simon Fraser Uni

  • IRIDA – http://www.irida.ca/
  • Goals: user friendly, web, access control, automatic pipelines, standards compliant, ontologies, open source
  • Federated database model + Galaxy for workflow engine w/ version controlled pipeline templates (workflows)
  • GenomeD3Plot –  https://github.com/brinkmanlab/GenomeD3Plot/
  • SNVPhyl (“Sniffle”) for variant calling – https://github.com/apetkau
  • Data sharing needs to be arranged early
  • Outbreak transmission more closely linked to flight paths than geographic distance

David Boxrud – Minnesota Dept Health

  • WGS of Salmonella Enteritidis (SE) – clonal, 4 PFGE types = 76% of PulseNet, poultry and eggs
  • SE outbreak in 1994 had 224,000 cases (Schwan’s ice cream in USA/CA)
  • No longer do PFGE due to clonality – now WGS, originally FDA kmer tree but now SNP tree/heatmap
  • Retrospective study to compare/assess WGS
    • stability
    • typability
    • discriminatory power
    • epi concordance
  • worked with epi’s to produce carefully curated examples
  • eg SE retrospective study: n=55 from 7 outbreaks, n=22 sporadic isolates, some in vivo, some outliers, mystery; compared with PFGE, MLVA
    • very few SNP diffs, 3 max
    • several isolates from same person
    • Salmonella with common PFGE pattern was resolved into several clusters, 2 most prevalent (type 4 cluster contained 138 SNPs, type 2 cluster contained 58 SNPs)
    • stable, few SNPs/person over time
  • eg Prospective Study (April 1-Dec31 2014)
    • PFGE and WGS in real time
    • each cluster, regardless of method, will be investigated as outbreak
    • cluster definition: indistinguishable PFGE (XbaI and BlnI) type within a month, <10 SNPs diff between isolates over 1 month (originally >20 but this gave non-specific results) – see the linkage sketch after this list
    • Lots of travel-to-Mexico clusters, the largest clusters they found.
    • Salmonella Enteritidis associated with snakes – or associated with food fed to snakes?
  • Collecting exposure data key to evaluating subtyping technique
  • interview all cases in surveillance ASAP, collect details on specific exposures:
    • dates
    • restaurants
    • brands
    • open-ended
  • dynamic investigation approach (re-interview if some new potential source discovered during investigation)
  • eg 159 isolates, 21 unique PFGE patterns
    • #clusters by PFGE = 12 (cases/cluster = 9.2), clusters by WGS = 25 (cases/cluster=3)
    • 13 instances of multiple isolates collected from same patient, used to study stability (#SNP diffs/time) → 0 SNP diffs over 150 days
  • eg Lab-associated infection
    • traced hospital technician to patient
    • PFGE suggested they were not related but only 1-2 SNP diffs
  • eg Pet associated infections
  • eg frozen chicken product outbreak → distinct WGS patterns
    • summer of 2014, 19 isolates with same PFGE, WGS → 8 clusters, 0 SNP diffs b/w them
    • epi can differentiate and substantiate WGS clusters→ result was recall
    • PFGE would have taken longer to solve
  • Changes based on WGS:
    • difficult to relay cluster info to epi’s, excess info they don’t need → need to streamline info sharing practices
    • cluster definition changes
  • Conclusion:
    • WGS was superior
    • communication of results challenging
    • WGS with quality exposure data key to outbreak identification
  • nomenclature for WGS patterns would improve uptake among epi’s → but difficult to create nomenclature since it depends on interpretation criteria (specific for serotypes); wgMLST to provide technical capacity for nomenclature
  • can alternatively do thresholds with SNP diffs
  • need common nomenclature system→ also depends on methods and data quality
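
A sketch of the prospective-study WGS linkage rule mentioned above (<10 SNPs between isolates collected within roughly a month); the isolate records and field names are hypothetical, not the MDH data format.

    # Illustrative check of the cluster-linkage rule described above.
    from datetime import date

    def linked(iso_a, iso_b, snp_dist, max_snps=10, max_days=30):
        days_apart = abs((iso_a["collected"] - iso_b["collected"]).days)
        return snp_dist < max_snps and days_apart <= max_days

    a = {"id": "MN-001", "collected": date(2014, 6, 2)}
    b = {"id": "MN-014", "collected": date(2014, 6, 20)}
    print(linked(a, b, snp_dist=3))   # True: 3 SNPs, 18 days apart
    print(linked(a, b, snp_dist=22))  # False: too many SNPs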

Dag Harmsen – Uni Muenster

  • Antibiotic resistance @ Uni Hospital Muenster
  • MRSA, CVRE, CRE all isolated at point of care → MiSeq Nextera XT v2 250bp PE @ 100x
  • Use cgMLST, eg N. meningitidis, 1241 loci ~ 55% of genome (see the allele-distance sketch after this list)
  • Claims “cgMLST produces more robust trees than SNPs by ignoring recombination effects with some minor loss of discriminatory power, offset by ease of use”
  • Problems with multiplexing different genome sizes, coverage of big genomes sometimes too low
  • Can save $millions on unnecessary quarantine / preemptive patient isolation
  • Transmission used to be 5%, and after WGS it went down a little bit, but $$$ saved in bed costs
  • 1 day DNA extraction from single colony – Koeser et al 2014 JAC(69):1275 < 4 euro
  • Using diluted Nextera reagents with same results -> more cost saving
  • Power of WGS is to EXCLUDE outbreaks where previously inconclusive. Saves money, but not publishable!
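
A sketch of a cgMLST-style allele distance for the approach described above: count loci where two profiles carry different allele numbers, ignoring loci that failed to call in either isolate. Illustrative only; the locus names and profiles are made up, and this is not any particular scheme’s implementation.

    # Count differing alleles across shared, callable core loci.
    def allele_distance(profile_a, profile_b):
        diffs = 0
        for locus, allele_a in profile_a.items():
            allele_b = profile_b.get(locus)
            if allele_a is None or allele_b is None:
                continue  # missing locus: excluded from the comparison
            if allele_a != allele_b:
                diffs += 1
        return diffs

    a = {"locus1": 5, "locus2": 12, "locus3": None, "locus4": 7}
    b = {"locus1": 5, "locus2": 13, "locus3": 2,    "locus4": 7}
    print(allele_distance(a, b))  # 1 (only locus2 differs among callable loci)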

Group Discussion – GenomeTrakr future needs

No notes – primarily FDA / NCBI internal issues.

James Pettengill – OAO – CFSAN – FDA

  • Bioinformatics approaches for rapidly detecting outbreaks
  • Identify epidemiologically relevant clusters from 25,000 samples
  • Salmonella 50 TB, Listeria 5 TB of data so far
  1. Obtain an assembly for each sample, done using ‘cloud bursting’
    1. download from SRA
    2. FASTX toolkit quality filters
    3. Kraken custom DB contamination filter
    4. Spades assembly
  2. Evaluate different methods for estimating relationship between samples
    1. Assembly | all vs all VS. one vs all | site based VS. kmer based
      kmer based are fast but lose information as heterogeneity increases
    2. Reads | one vs all
    3. Simulation: Hudson’s ms program 100-tip tree + SeqGen under topology + ART read sim
      http://home.uchicago.edu/rhudson1/source/mksamples.html
      http://tree.bio.ed.ac.uk/software/seqgen/
    4. Site based: Nucmer, parSNP, CFSAN SNP Pipeline
    5. Kmer based: Jaccard index, Manhattan distance, Euclidean distance (see the Jaccard sketch after this list)
    6. the relationship between k-mer distances and SNP distances breaks down below ~1000 SNPs; if using kmer distances to recruit nearby isolates you need to take this into consideration
    7. kmer distances suffer a lot from mobile elements, removing a contig, etc. (Jaccard dist)
  3. Fine grain analyses to elucidate genomic differences
    1. CFSAN github site
    2. SNP pipelines – recovered 98.9% of SNPs, FP rate = 1.04 x 10^-6 @ 100x / 98.8%, 8.34 x 10^-7 @ 20x. False negatives are due to consensus frequency (i.e. mixed bases) and not coverage, and are NOT random along the genome (although not much evidence except eyeballing!).
  4. Implement daily surveillance and report generation
    1. orphan and non-orphan samples based on a SNP threshold (what is this, how determined?)
  • 25000 genomes assembled in cloud – $8000, useful to clear backlog or use as surge capacity if need to redo all assemblies.
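
A sketch of the Jaccard k-mer distance listed under the k-mer based methods above; k and the toy sequences are arbitrary. Gaining or losing a mobile element changes the k-mer set and shifts this distance, which is one of the issues noted in the list.

    # Jaccard distance between the k-mer sets of two sequences.
    def kmers(seq, k=4):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def jaccard_distance(seq_a, seq_b, k=4):
        a, b = kmers(seq_a, k), kmers(seq_b, k)
        return 1 - len(a & b) / len(a | b)

    print(jaccard_distance("ACGTACGTGGCC", "ACGTACGTGGCA"))  # ~0.22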

Cecilie Boysen – CLC, Qiagen

  • keep instruments and reagents agnostic
  • CLC Genomics Workbench, single user, few samples
  • Genomics Server, several users, many samples
  • Build workflows much like Galaxy – has toolbox (like toolshed) – is this Server backend product? I can see “CLC Custom Solutions” at the top of the left hand tool menu.
  • Can share workflows with colleagues – export and import. Need to “unlock” them so your colleague can point the reference links to their own reference genomes?
  • many modules eg
    • SRA submission
    • Typing tool
    • OTU clustering
    • MLST
    • BLAST
    • Forensic toolkit
  • import data from MiSeq
  • Plugin data to upload data to SRA – has a filter to ensure you don’t send crap data!
  • Allows to use local FASTQ for data on BaseSpace.

Lynne Bry – Brigham & Women’s Hospital

  • Sequencing foodborne and MDR clinical isolates at BWH.
  • 1000 micro samples per day! >100 +ve cultures across kingdoms/phyla
  • 50% diagnosis, 10% therapy, 40% screening/surveillance MRSA, VRE, GrpB Strep, Gram-ve
  • Lots of meta-data: MIC, disk diffusion, E-TEST, drug resistance, zone diam, R/I/S, ESBL, D-zone CLI/ERM
  • HIPAA de-identified data – Year only, no location
  • WHONET http://whonet.org/ open source to generate antibiograms in surveillance (old, Windows software)
  • Crimson LIMS – prospective analysis of clinical samples & real time query
  • “Honest Broker” assigns new external IDs
  • SPAdes, QUAST, ResFinder, CARD, RAST, Mauve for extrachromosomal elements, BLAST for plasmid/transposon – where is the resistance gene? on a transposon, plasmid or chromosome? (Bandage and ISMapper might be useful for these types of analysis?)
  • SNPs: bowtie2, mpileup, bcftools, custom filtering (see the pipeline sketch after this list).
  • Kp CRE ST258 – found many different plasmids and transposons + point mutations – WGS revealed this detail
  • E. cloacae CRE – ampC on chrom + porin mutations , multiple mobile elements Tn4401b / Tn6901
  • S. marcescens CRE – SRT-2 Ampc_SME-4, AmpC and KPC-3 acquired, 3 year surveillance 2011-2014, 2 close events
  • Timelines: MiSeq (14 days), Bioinformatics (1-14 days), Epi (14 days)
  • Despite 3 week turnaround they are told it IS actionable
  • Rule out just as important as rule in
  • Mobile element analysis can refine relationship analysis
  • Curating new genomes and mobile elements takes most of the time
  • Desire to use more principled methods for outbreak calling – SaTScan, Bayesian, likelihoods
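
A hedged sketch of the bowtie2 / mpileup / bcftools SNP-calling steps listed above, driven from Python. File names are placeholders, and the exact flags and custom filters used at BWH were not described in the talk; check the commands against your installed tool versions before use.

    # Placeholder inputs; not the BWH pipeline, just the same general steps.
    import subprocess

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    ref, r1, r2 = "ref.fa", "reads_R1.fastq.gz", "reads_R2.fastq.gz"

    run(f"bowtie2-build {ref} ref_index")
    run(f"bowtie2 -x ref_index -1 {r1} -2 {r2} -S aln.sam")
    run("samtools sort -o aln.bam aln.sam && samtools index aln.bam")
    run(f"bcftools mpileup -f {ref} aln.bam | bcftools call -mv -o snps.vcf")
    # Custom filtering (depth, quality, proximity to indels) would follow here.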

Ole Lund – DTU, Denmark + CGE + COMPARE-EU

  • Built a web based system with tools: http://www.genomicepidemiology.org/
  • 164,000 WGS datasets submitted to the server since 2012
  • Species detection needed for MLST typing – found kmer based scheme worked best
  • k-mer trees go back to Woese 1977 – oligo based trees
  • ResFinder for acquired AMR
  • Pipeline: assembly + put through all the *Finder suite apps in one pass (check out https://cge.cbs.dtu.dk/services/cge/ for this)
  • Batch uploader tool – fill out a metadata XLS form and submit; produces an XLS summary output format
  • Compute: 56 cores / 100,000 submissions — moving to 280 cores ~ 500,000 submissions
  • SNP tree – Made error in that they used to use ref base when couldn’t call base in isolate (whoops!)
  • Did use UPGMA but assumption is that they all come from same time point, now using neighbour joining
  • Mapping – 10 mins per isolate (this seems long? does it include SNP calling as well? PA)
  • Tree building – N*2.5 + N*N*0.02 seconds ≈ 24 days for 10,000 isolates (see the worked check after this list)
  • Solution is to only analyse those isolates close (<10 SNPs?) to what you have, use precalculated mappings, efficient binary storage etc. Move to clade-specific sub-analyses over global analysis.
  • Reference genomes curated down to a smaller set, < 98% identity between any two
  • Can “place” new strains onto existing trees (aka pplacer) instead of building whole tree.
  • Problem is core genome sites CHANGE as you add new isolates. Need smarter methods?
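
A quick check of the tree-building scaling formula quoted above: the quadratic term dominates, giving roughly 23-24 days for 10,000 isolates, consistent with the figure given in the talk.

    # time = N*2.5 + N*N*0.02 seconds, as quoted above
    def tree_time_days(n):
        seconds = n * 2.5 + n * n * 0.02
        return seconds / 86400

    print(f"{tree_time_days(10_000):.1f} days")   # ~23.4 days for 10,000 isolates
    print(f"{tree_time_days(100_000):.0f} days")  # quadratic term dominates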

Wondwossen Gebreyes – The Ohio State University

Comparative Molecular Epidemiology of MDR Salmonella of Global Origin Using Whole Genome Sequencing

NOTE: Phil has started tweeting instead of entering good notes here.

  • vet
  • International Consortium on Interface between animal-human-pathogen (KOPHAI)
  • emerging zoonotic infectious diseases – outbreak and response
  • Molecular Epidemiology of Salmonella
  • eg antimicrobial free and conventional (farms)
    • goal: understand AMR across geographic and temporal space
    • correlation of AMR and co-tolerance genes (heavy metals)
    • build database for surveillance, capacity of networked labs
    • 924 isolates to CFSAN
    • studying 3 countries in Eastern Africa
    • high Zn tolerance linked w/ certain AMR phenotypes→ efflux pumps
  • looked at antimicrobial alternatives and effects on gut microbiome

Phil bringing up a good point that the African seqs being added to GenomeTrakr in advance of publication help with more global/other-country analyses. FB: Definitely need to get more countries on board.

Patrick Chain – LANL

  • tools for sample analysis and metagenomics
  • Code: https://github.com/LANL-Bioinformatics
  • PhaME – core genome matrix, phylogenetic and molecular evolution analysis
    • grab parts of tree and do analyses (pos/neg/neutral selection)
    • tree analysis with subtrees → whether contigs, reads or closed genomes are used, results cluster tightly
  • metagenomics:
    • use cgtree, map reads, record SNPs, place strains in tree with other pathogens
    • can apply to euk and viral genomes
    • identify genome signatures?
  • GOTTCHA – taxonomic profiling
  • EDGE-BOUND Bioinformatics platform
    • 6 independent modules (eg pre-processing, assembly)
    • simply requires FastQ file
    • many outputs

Reviewed: “Accurate read-based metagenome characterization using a hierarchical suite of unique signatures”

http://www.nar.oxfordjournals.org/content/43/10/e69.long

Web server: https://bioedge.lanl.gov/edge_ui/

Round Table

Why not sequence one of every PFGE type? Try and sample the population widely with WGS. We keep finding more and more diversity. Not clear how much more we need to sample! FDA is happy to consider sequencing any novel material to help with this.

Lack of contribution from Europe to SRA/GenomeTrakr – primarily USA and UK.  Public health funding should require submission to databases. Ole Lund claims Euro bureaucracy is too difficult.  GMI talks about this a lot – it’s mainly political, and it will happen eventually.

END
