GenomeTrakr meeting notes

Notes for GenomeTrakr 2015 meeting

Context

These notes were taken during the FDA GenomeTrakr Meeting held at the Omni Shoreham Hotel in Washington DC, USA from 23-24 September 2015. The notes are intended to be as objective as possible. Personal opinions or speculation are prefixed by the author’s initials below:

Contributors to this document

  • PA = Phil Ashton (Public Health England, UK)
  • TS = Torsten Seemann (University of Melbourne, Australia)
  • FB = Fiona Brinkman (Simon Fraser University, Canada)
  • EG = Emma Griffiths (Simon Fraser University, Canada)

Background reading

http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm

Eric Brown

Welcome, intro.

Marc Allard

  • 2015 an important year for GenomeTrakr → getting data, interpretation, ways to use data, DB grown 3x
  • 1000 isolates per month, 2 dozen clusters, 2-3 active clusters that are actionable (Errol Screening & Stats Office) (PA – how do they define actionable?)
  • Meet weekly with CDC and USDA, no longer research activity but actual testing.
  • On track to double or triple volume in the year ahead.

Ruth Timme – ORS (Office of Regulatory Science), CFSAN (Center for Food Safety and Applied Nutrition), FDA

  • Co-ordinating data flow, “state of the art network”–> paradigm shift in 2 parts (technology, open data)
  • Data is released before being analysed
  • raw data available 1-2 days after collection
  • Industry can look at the data as well – interesting to help them get on board?
  • Basic Data Flow
  • distributed network, 30 labs plus other partners
  • INSDC (International Nucleotide Sequence Database Collaboration)
  • contributors mostly FDA and state labs, expanded on PulseNet contributors (adding enviro isolates to make DB richer)
  • 65 countries, 45 states
  • >35 000 sequences in GenomeTrakr database
  • tracking Salmonella predominantly, also Listeria, E. coli, Campylobacter
  • 400 serovars sequenced so far.
  • sample collection spans 25 years
  • daily kmer trees posted to NCBI
  • kmer? evolutionary?
  • Pilot phase: build reference db @NCBI
  • data generation: MiSeq → QC→ SRA
  • upload to SRA with CLC plugin?
  • Need robust LIMS software to track sample, deployed Sept 1
  • Justin Payne developed a Python-based LIMS system for CFSAN; tracks sample → culture → wet lab → sequencing → submission → analysis. Interacts with a commercial package called “SLIMS”: http://www.genohm.com/slims/
  • Illumina Basespace SRA submission app
  • Talk of “new submission portal at NCBI” – the UI-less submission (http://anonsvn.ncbi.nlm.nih.gov/repos/v1/trunk/submit/public-docs/common/docs/UI-lessSubmissionProtocol.docx); PHE use this.
  • Successes over the last year:
    • 18 → 30 labs contributing (eg CDC, PHE); empowering labs to upload their own data and do their own quality control.
    • 500 → 35,000 genomes in database!
    • Independent uploading → NCBI improved submission (plug-in in dev/testing) w/ CLC
    • New LIMS for CFSAN
    • Switch to new submission portal (overcome hurdle of pre-reg samples)
  • Traditional response: took 52 days, WGS made it 12 days – previously unactionable, now feasible
  • Centre for Food Safety, UCD, Ireland is a submitter?
  • What are FERA contributing?
  • From UK + Eire perspective (PHE), would be interested in what FERA and Centre for Food Safety, UCD, Ireland are submitting.
  • Aren’t there quality issues with making data available before analysis? Yes, but everyone is aware of them, and they (Errol) seem quite open about them.
  • SLIMS report looks like excel file (behind FDA firewall), single page shows metadata, status, accession #’s, submitter etc, QA/QC status, can sort by run date

Palmer Orlandi – CSO, FDA Office of Foods and Veterinary Medicine

  • Goal: faster results w/ more specificity and sensitivity, provide “boots on the ground” for PH
  • Questions:
    • Why WGS?
    • What is the need for a network?
    • Where do regulatory labs fit in?
    • Where is the best place to start?
  • How does the animal feed aspect fit in?
  • begin with ORA labs→ Training (SOP dev, pilot projects, data transfer)→ Seq (field isolate inventories)→ Research
  • Deliverables eg identifying players, instrument acquisition, Salmonella strain inventory assessment
  • want to expand use of WGS beyond research, now an accepted clinical tool
  • Moving toward strain characterisation (amr, serotype, strain, virulence), as well as typing.
  • used for outbreak, compliance, surveillance → functional tool for regulatory practices
  • Sequencing all regulatory isolates: seq’d all Salmonella from 2011-14, working on 2002-10
  • gives estimation of relatedness of resistance in environment → tool for measuring impact of regulatory policies pertaining to food supply
  • National Antimicrobial Resistance Monitoring System (NARMS) – sequenced everything in retail meat (do they have any data on antibiotic usage in these herds?).
  • 99% accuracy of predicting resistance using WGS (see the concordance sketch after this list)
  • WGS Metagenomic sampling, animal husbandry
  • CFSAN → developed WGS applications to modernize typing (priority: Salmonella, L. monocytogenes etc), training program, compliance actions
  • GenomeTrakr contains 25K Salmonella sequences, 4K Listeria, 100’s of STEC, Vibrio, Cronobacter
  • Interagency partners: CDC, FDA, FSIS, Gen-FS (Genomics and Food Safety), NLM (National Library of Medicine), NIH
  • want to strengthen collaboration and co-ordination to increase efficiency
  • need proficiency testing and standards
  • LEVERAGING → is data provided by partners acceptable for regulatory purposes? Need to use ideas/technology, resources, and data to develop a next-gen “Tool Box”
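
A minimal sketch of how the genotype/phenotype concordance behind the “99% accuracy” figure above could be tallied. The data structures and drug names below are hypothetical, not the FDA/NARMS format; it only illustrates the calculation.

    # Hypothetical sketch: tally agreement between WGS-predicted and
    # phenotypic resistance calls. Keys are (isolate, drug) pairs and the
    # values "R"/"S"; this is illustrative, not the FDA/NARMS data format.
    def concordance(predicted, observed):
        shared = set(predicted) & set(observed)
        if not shared:
            return 0.0
        agree = sum(1 for key in shared if predicted[key] == observed[key])
        return agree / len(shared)

    predicted = {("iso1", "ciprofloxacin"): "R", ("iso1", "ampicillin"): "S"}
    observed  = {("iso1", "ciprofloxacin"): "R", ("iso1", "ampicillin"): "S"}
    print(f"{concordance(predicted, observed):.1%}")  # 100.0%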

Errol Strain – Analytics and Outreach, CFSAN

  • “GenomeTrakr 2015 FAQ”
  • minimal metadata for 3rd party submissions
  • Screening – do we have anything new that matches recent clinical case?
  • FDA routinely screen clinical isolates against enviro isolates in database, historical strains (do food/enviro isolates collected at different locations at diff times etc match?)
  • Quote: “K-mers are soooo 2015, SNPs are 2016.”
  • NCBI improved clustering based on SNPs
  • NCBI Genome Workbench – start day with coffee and latest k-mer tree http://www.ncbi.nlm.nih.gov/tools/gbench/
  • SNP Cluster Analysis: Ref genome (NCBI, de novo) → Run CFSAN SNP pipeline on Kmer cluster → Build tree, make refs (CFSAN SNP pipeline source code on GitHub under CFSAN)
  • separating signal from noise → very distant ref will include lots of SNPs and drag noise into analysis
  • FDA are using essentially the same ad hoc pipeline to do analysis as PHE are (and MDU).
    • Use k-mer tree to identify cluster, then use regular reference (closed or de novo) then map reads
    • Reference needs to be < 0.1% divergent (3000-5000 SNPs) – prefer same PFGE, same subtype (MLST?)
    • Important to screen for phage-introduced SNPs – trying to automate this (How to include these in a phylogenetic model?)  (MDU considering just masking all known phage from references for core SNP part)
    • Don’t want to mask repetitive regions: if the phage is stable (Enteritidis, ???) they want to include that region in the analysis. In a way that is further down the line; get the core genome stuff up and running first.
  • Hopes to automate cluster screening as current manual approach is not going to scale (<5 SNPs is actionable)
  • After notification of match: lookup IDs in PulseNet and contact, contact CDC and FDA district office if appropriate (cross state lines, FDA has jurisdiction) → recent isolate match may trigger enviro sampling and product testing → historical match may look at food transmission (check machinery for endemic contamination)
  • PNUSA = PulseNet USA identifier
  • metadata used for notification and actionable responses!
  • Minimal metadata: most important is collected_by, so you know who to contact when you identify a match!
  • Harmonizing Proficiency Testing (PulseNet and GenomeTrakr)
  • interpretation: no single threshold for all species/types but there are rough guides
  • The “Daubert Standard” for WGS analysis: https://en.wikipedia.org/wiki/Daubert_standard
  • Rough SNP guide: < 20 (identical, match), 20-100 (inconclusive), > 100 (exclude) + does it form a unique cluster with > 95% bootstrap support? Really need multiple lines of evidence, and a rich background DB (determine if cluster is distinct in tree) – see the triage sketch after this list
  • compute likelihood ratio → matrix of # of SNPs vs bootstrap % gives strength of match (table of support)
  • need enough isolates to provide statistical support, 35K genomes in DB makes these calculations more reliable eg Salmonella Montevideo
  • Salmonella Newport – 30-40 SNPs within strains from the same produce.
  • GenomeTrakr will move towards new technologies/platforms → always pilot projects and validation to contribute to decision making (adopt/not)–> metadata standardization will come as database gets bigger
  • Open to any short read data (FB: but what about MinION?) FDA are expecting to go through another cycle of research, pilot, adopt with long read tech.
  • Everyone must get their SNP pipelines published, e.g. the CFSAN pipeline.
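
The rough SNP guide above can be written down as a simple triage function. This is a sketch assuming only a pairwise SNP distance, a bootstrap support value, and a flag for whether the isolates form a distinct cluster; real interpretation combines epi evidence, serotype-specific criteria, and the background database.

    # Illustrative triage based on the rough guide quoted above; not an
    # official CFSAN rule. Thresholds: <20 SNPs match, 20-100 inconclusive,
    # >100 exclude, plus >=95% bootstrap support for a distinct cluster.
    def triage(snp_distance, bootstrap_support, distinct_cluster):
        if snp_distance > 100:
            return "exclude"
        if snp_distance >= 20:
            return "inconclusive"
        if distinct_cluster and bootstrap_support >= 0.95:
            return "match - seek supporting epi evidence"
        return "possible match - needs more evidence"

    print(triage(snp_distance=4, bootstrap_support=0.99, distinct_cluster=True))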

Darcy Hanes – Office of Applied Research and Safety Assessment

  • Historical collections include:
    • outbreak isolates
    • surveillance isolates
    • food manufacturing/processing plants isolates
  • need to modify SOPs to include new pathogens
  • Also sequencing historical isolates of Salmonella, including 6 dog farm isolates; now have 1000s of L. mono isolates from the 2014 ice cream case. Sequenced 2010 (? surely not?!) isolates from the same factory. This was aimed at answering the question: is this a new or ongoing problem?
  • Foodborne metagenomic sequencing w/ matching to GenomeTrakr isolates (done at NRL-MOD-1)
  • Cilantro (coriander) model of Salmonella source w/ FDA Vet Medicine
  • “Highlights from the CFSAN-ORA Next Generation Sequencing Network”
  • Started with Salmonella and Listeria. E.coli, Shigella, and Campylobacter being added and currently doing proficiency testing (2015 CFSAN Proficiency Testing).
  • The acronyms below are abbreviations of FDA regional labs:
    • DDL: Seq over 400 Salmonella isolates since May 2014
    • NRL: Listeria, Salmonella.  Also involved in virus genome project. Interesting sunflower seeds case (will add more here when mentioned later)
    • PRL-NW: lots done incl. seq isolates from manufacturers’ surfaces. First to have their 100 historical isolates done. Seq. 220 Salmonella, for example, and growing. Historical useful for IDing if there is a “new problem” or not.
    • MOD-1: interesting collaborations including CARTS project – metagenomics for direct speciation from environmental samples.
  • FERN (Food Emergency Response Network)
  • Checking if can match metagenomics Seq with WGS and Seq in GenomeTrakr
  • Database of Brucella melitensis and B. abortus for cheese studies.
  • High Cubes → high throughput genomic extraction → BSL3 facility (all cells are dead – can’t ship live cells!)

John Besser – CDC

  • Sharing his time slot with the Pope’s tour of Washington DC
  • Has been with PulseNet from the start, 20 years ago, but this is first GenomeTrakr meeting!
  • 87 labs in PulseNet USA, > 500,000 PFGEs done so far – 30-60 clusters per week
  • “PFGE is still a very effective method” – but molecular clock can be too fast, false positive differences
  • WGS introduction led to more clusters found, more solved, and smaller clusters (close to theoretical 2 isolates)
  • difficult to contribute to case definitions → connecting clinical/enviro isolate sources (Salmonella particularly bad, so outbreaks harder to solve)
  • Listeria Surveillance System →  most seq by CDC (developed metrics for Listeria after advent of WGS)
  • eg caramel apples outbreak → eliminated cases that were determined to be sporadic by WGS (previously might have thrown off epi investigation if included), <10SNPs/cluster
  • diff case definitions req’d for diff parts of the investigation:
  1. determine contaminated food
  2. link food to source
  3. most inclusive definition when determining who is involved (culprit)
  • wgMLST scheme lends itself to a “nomenclature” for L. mono isolates (aka allele database, automated)
  • “Any day now” is a common theme in terms of implementation of all this.
  • BioNumerics “point and click” tree interface, node click to all meta-data and case information
  • centralized computing so all state partners don’t need own infrastructure
  • BaseSpace, FDA, NCBI (external storage)
  • Nomenclature server = Scicomp
  • Automated case-case comparison of exposures (how will this work without hierarchical ontology? PA). Control for age, sex – power calculation? Do they have a food name ontology for exposure information?
    • 16S → genus
    • MLST (7 loci) →  clonal complex
    • RiboMLST → strain
    • wgMLST → clone
    • SNP → subclone
  • BioNumerics next version will include hqSNP (the CDC SNP pipeline)
  • hqSNP vs wgMLST – CDC does mask their mobile elements first for hqSNP (implicitly done with wgMLST)
  • Supervised machine learning / “disjunctive anomaly detection” for finding clusters humans would miss by eye
  • Culture independent, direct sequencing, metagenomics will lead to a LOSS of available isolate sequences (but with long reads we will be able to recover the isolates from the mixture)
  • Direct from patient metagenomics → actionable, eg diagnosis of neuroleptospirosis by WGS
  • Next frontier is single cell sorting and sequencing ($$$$)
  • Amplicon sequencing of known Virulence and Antibiotic Resistance markers, but phasing problem (aligning metagenomic genomes → pathogen may have chunks of genome found in commensals etc. → solved by Q linkage or 16S for identity or assembly)
  • noise → <0.15% genome coverage of gut microbiome from stool run on HiSeq → “Clutter Mitigation” strategies for reducing noise (improving lab procedures, technology)
  • Keen on potential of metagenomics for in situ pathogen ID. Phasing problems may be overcome by long read sequencing
  • Human faeces has DNA from everywhere – virus, bacteria, fungus, parasite, human, food, environment
  • can we use Bionumerics for standardizing metadata?
  • Pathogen detection website (NCBI, contact with a plan)
  • complication of “serving different masters”, need better way to trend track through all BioSample/BioProject umbrellas
  • also need a way of identifying quality of sequences that everyone can quickly compare
  • working group (FDA, CDC and some states) harmonizing SOPs → feedback will help make something that works for everyone
  • single largest source of error → label switching (sample names swapped), also contamination

Group Discussion – chaired by Eric Stevens – Ruth Timme, Marc Allard, Palmer Orlandi

  • Ruth Timme bringing up issue raised by Emma Griffiths (SFU) re standardizing food terms and other ontology dev for metadata. Would like to get States OK to go back and clean up the data.
  • EFSA has a standard 2500 term vocabulary for food items (there are a few food ontology initiatives – need to integrate)
  • NARMS has nice standardized food terms – i.e. chicken breast is always called that. (Another example for ref: some say “deli meat” while others say “cold cuts”.)
  • NCBI: has standardized field names but not terms within them. Need to work on this.
  • so many standard food term databases to choose from  – and none are being used in GenomeTrakr (yet)
  • Emma raising issue re people needing to talk together to get right resolution of food term dictionary – practical
  • Metadata is available on all isolates from ftp://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/

Bill Klimke – NCBI

  • NCBI Pathogen Pipeline:
    • QC
    • Kmer
    • Assembly
    • Annotation
    • Genome Placement
    • Clustering
    • SNP Analysis
    • Tree Construction
  • BioProject→ BioSample→ SRA→ GenBank
  • NCBI lists submitters/contributors; processing status currently internal, later will send a report if your seq doesn’t meet the minimum data quality standard
  • clustering based on <100 SNPs (based on pairwise distances → Cluster100), also k-mer distance – see the clustering sketch after this list
  • 100 SNPs doesn’t define an outbreak, but errs on the side of caution for inclusion → user needs to decide
  • doing retrospective analyses comparing NCBI pipeline to epi confirmed outbreaks (recruiting epi confirmed outbreak info from community)  
  • Genome Workbench can label trees with private metadata → done locally
  • Listeria pipeline, pairwise SNP distances. 4628 targets, 351 Genbank genomes
  • Most “clusters” have <= members – a cluster is <= 30 SNPs
  • Check out http://www.ncbi.nlm.nih.gov/pathogens/ for the most recent analysis
  • The vast majority of SE outbreaks have mean/max PWD of < 10 SNPs
  • New NCBI SNP detection pipeline will soon (?) be on the Pathogen FTP site: ftp://ftp.ncbi.nlm.nih.gov/pathogen/Results/
  • As part of combating AMR, NCBI wants to capture the phenotypic resistance profile of isolates being received.
  • Standardized template table for uploading antibiogram results, ontology by ASM/CLSI – these can be submitted independently of sequence data: http://www.ncbi.nlm.nih.gov/biosample/docs/antibiogram/
  • antibiogram fields capture experimentally derived break points, can compare to point mutations in DB
  • integrating CARD (NCBI has sent corrections to CARD), ResFinder, FDA Centre for Vet Medicine
  • Building a new AMR gene database: curated, ResFams HMMs, suitable naming, removal of bad sequences
  • Hierarchical naming based on similarity: KPC-2 → KPC family → class A beta-lactamase → NOT just “beta-lactamase”
  • assess accuracy of calls using finished high quality genomes, literature and “known truths”
  • NCBI will provide a AMR report back to submitters to help them correct AMR designations (and errors)
  • AMR point mutations not curated yet
  • contamination from sequencing instrument carry-over (Illumina) → take care of with statistics?
  • submitters need to label samples as potentially contaminated when there is a question
  • Sanger is biggest submitter → diff read platforms, data quality, contamination → NCBI needs to control input (curate in some way so does not allow upload of all samples)
  • is NCBI creating application for state labs to pull data to make own tree? no, mandate to create something for data integration
  • users can pull distance matrices to plug into own pipeline, but NCBI won’t build custom trees for you (they make Kmer trees and make available to public)
  • tutorials and webinars available for PH workers, SNP pipeline tutorials coming soon
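
A minimal sketch of grouping isolates at a 100-SNP cutoff from a table of pairwise distances, in the spirit of Cluster100 above. The single-linkage (union-find) rule and the toy distances are assumptions; the talk did not describe NCBI’s exact algorithm.

    # Minimal single-linkage grouping at a 100-SNP cutoff, assuming a
    # symmetric dict of pairwise SNP distances. The linkage rule is an
    # assumption for illustration; NCBI's exact clustering method may differ.
    from collections import defaultdict

    def cluster(pairwise, isolates, cutoff=100):
        # union-find over isolate pairs whose distance is below the cutoff
        parent = {i: i for i in isolates}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for (a, b), d in pairwise.items():
            if d < cutoff:
                parent[find(a)] = find(b)
        groups = defaultdict(list)
        for i in isolates:
            groups[find(i)].append(i)
        return list(groups.values())

    dists = {("A", "B"): 7, ("B", "C"): 30, ("A", "D"): 450, ("C", "D"): 500}
    print(cluster(dists, ["A", "B", "C", "D"]))  # [['A', 'B', 'C'], ['D']]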

Daniel Janies – U. North Carolina at Charlotte

  • diverse data → supraMAP → phylomapping → users
  • can ask where outbreaks originate, where they are going –>map spread
  • Visualisation of virus transmission
  • not just phylo info (evolution), but also SNPs → can include this info through use of colour → can animate
  • transmission events (denote change of city/”other shifts” through use of colours)→ compare frequency, “betweenness”
  • Betweenness Centrality: nodes with higher “betweenness” have higher influence on spread of pathogen → counts the number of times a node appears on transmission paths (see the sketch after this list)
  • PH interventions could concentrate on places with high Betweenness → longer lasting PH effect rather than local effect
  • can handle large, diverse datasets and put them in visual context → results lead to actionable conclusions → will help when contact tracing (narrative) not available
  • Camels had a role in MERS evolution but not important in ongoing spread
  • International data sharing + informatics = checks and balances for infection control
  • Strengths and weaknesses of contact tracing and genetic data
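
Toy illustration of the betweenness-centrality idea above on a hypothetical transmission network; requires the networkx package, and the node names are made up.

    # Nodes that sit on many shortest paths score highest; these would be
    # candidate sites for longer-lasting public health interventions.
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([
        ("CityA", "Hub"), ("CityB", "Hub"), ("CityC", "Hub"),
        ("Hub", "CityD"), ("CityD", "CityE"),
    ])

    for node, score in sorted(nx.betweenness_centrality(G).items(),
                              key=lambda kv: -kv[1]):
        print(f"{node:6s} {score:.2f}")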

Jeremy Peirce – Illumina

  • co-ordinates FDA sequencing efforts w/in GenomeTrakr
  • MiSeq QC = the SAV tool = Sequence Analysis Viewer: https://support.illumina.com/sequencing/sequencing_software/sequencing_analysis_viewer_sav.html
  • SAV (QA/QC metadata about the run) off the website, from the pre-alignment side
  • Flow cell:
    • data by cycle → over cycles, data gets worse at ends of reads, look at Q score distribution (“%>Q30”)
  • Anatomy and Dissection of Seq Run: Troubleshooting Journey
    • Cluster number – outcome of hybridization event:
      • upstream: prob is either hybridization or library prep
      • downstream: prob is downstream sequencing or analysis
  • Image analysis
    • goal of template generation – identify spots on map of all clusters
    • signal to noise measure
    • dim clusters → check signals in diff channels over course of run
    • clusters passing filter → measure of “cleanliness of data” → if “pure”, have high primary intensity
    • too many clusters → they will overlap and purity goes down
  • fluidics (reagents), thermals (heating and cooling), optics (imaging)
  • 16S has less diversity at beginning, more after run proceeds
  • M3 motor issues, can contribute 10% to quality eg Q30
  • NextSeq has different # tiles and diff camera
    • high phasing (behind)/pre-phasing(ahead)→ signal to noise ratio degrades over run
    • phase/pre-phase should be <0.5
    • errors scattered across lots of tiles (“rain pattern” = random) → probably optics problems
    • if there is a pattern over the course of tiles, probably fluidics
  • Library Diversity (indication of library degradation)
  • Sequence length (more useful if trimmed)
  • FastQ toolkit (see the trimming sketch after this list)
    • remove adaptors, trim reads based on quality
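
A minimal sketch of 3' quality trimming of a single read, as one piece of what the FastQ toolkit does; Phred+33 quality encoding is assumed, and this is not Illumina’s actual algorithm.

    # Trim low-quality bases from the 3' end of a read (Phred+33 assumed).
    def trim_3prime(seq, qual, min_q=20):
        phred = [ord(c) - 33 for c in qual]
        end = len(seq)
        while end > 0 and phred[end - 1] < min_q:
            end -= 1
        return seq[:end], qual[:end]

    seq  = "ACGTACGTACGT"
    qual = "IIIIIIIII###"          # '#' = Q2, 'I' = Q40
    print(trim_3prime(seq, qual))  # ('ACGTACGTA', 'IIIIIIIII')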

Fiona Brinkman – Simon Fraser Uni

  • IRIDA – http://www.irida.ca/
  • Goals: user friendly, web, access control, automatic pipelines, standards compliant, ontologies, open source
  • Federated database model + Galaxy for workflow engine w/ version controlled pipeline templates (workflows)
  • GenomeD3Plot –  https://github.com/brinkmanlab/GenomeD3Plot/
  • SNVPhyl (“Sniffle”) for variant calling – https://github.com/apetkau
  • Data sharing needs to be arranged early
  • Outbreak transmission more closely linked to flight paths than geographic distance

David Boxrud – Minnesota Dept Health

  • WGS of Salmonella Enteritidis (SE) – clonal, 4 PFGE types = 76% of PulseNet, poultry and eggs
  • SE outbreak in 1994 had 224,000 cases (Schwan’s ice cream in USA/CA)
  • No longer do PFGE due to clonality – now WGS, originally FDA kmer tree but now SNP tree/heatmap
  • Retrospective study to compare/assess WGS
    • stability
    • typability
    • discriminatory power
    • epi concordance
  • worked with epi’s to produce carefully curated examples
  • eg SE retrospective study: n=55 from 7 outbreaks, n=22 sporadic isolates, some in vivo, some outliers, mystery; compared with PFGE, MLVA
    • very few SNP diffs, 3 max
    • several isolates from same person
    • Salmonella with common PFGE pattern was resolved into several clusters, 2 most prevalent (type 4 cluster contained 138 SNPs, type 2 cluster contained 58 SNPs)
    • stable, few SNPs/person over time
  • eg Prospective Study (April 1-Dec31 2014)
    • PFGE and WGS in real time
    • each cluster, regardless of method, will be investigated as outbreak
    • cluster definition: indistinguishable PFGE (XbaI and BlnI) type within a month, <10 SNPs diff between isolates over 1 month (originally >20 but this gave non-specific results) – see the linkage sketch after this list
    • Lots of travel-to-Mexico clusters, the largest clusters they found.
    • Salmonella Enteritidis associated with snakes – or associated with food fed to snakes?
  • Collecting exposure data key to evaluating subtyping technique
  • interview all cases in surveillance ASAP, collect details on specific exposures:
    • dates
    • restaurants
    • brands
    • open-ended
  • dynamic investigation approach (re-interview if some new potential source discovered during investigation)
  • eg 159 isolates, 21 unique PFGE patterns
    • #clusters by PFGE = 12 (cases/cluster = 9.2), clusters by WGS = 25 (cases/cluster=3)
    • 13 instances of multiple isolates collected from same patient, used to study stability (#SNP diffs/time) → 0 SNP diffs over 150 days
  • eg Lab-associated infection
    • traced hospital technician to patient
    • PFGE suggested they were not related but only 1-2 SNP diffs
  • eg Pet associated infections
  • eg frozen chicken product outbreak → distinct WGS patterns
    • summer of 2014, 19 isolates with same PFGE, WGS → 8 clusters, 0 SNP diffs b/w them
    • epi can differentiate and substantiate WGS clusters→ result was recall
    • PFGE would have taken longer to solve
  • Changes based on WGS:
    • difficult to relay cluster info to epi’s, excess info they don’t need → need to streamline info sharing practices
    • cluster definition changes
  • Conclusion:
    • WGS was superior
    • communication of results challenging
    • WGS with quality exposure data key to outbreak identification
  • nomenclature for WGS patterns would improve uptake among epi’s → but difficult to create nomenclature since it depends on interpretation criteria (specific for serotypes); wgMLST to provide technical capacity for nomenclature
  • can alternatively do thresholds with SNP diffs
  • need common nomenclature system→ also depends on methods and data quality
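
A sketch of the prospective-study WGS linkage rule mentioned above (<10 SNPs between isolates collected within roughly a month); the isolate records and field names are hypothetical, not the MDH data format.

    # Illustrative check of the cluster-linkage rule described above.
    from datetime import date

    def linked(iso_a, iso_b, snp_dist, max_snps=10, max_days=30):
        days_apart = abs((iso_a["collected"] - iso_b["collected"]).days)
        return snp_dist < max_snps and days_apart <= max_days

    a = {"id": "MN-001", "collected": date(2014, 6, 2)}
    b = {"id": "MN-014", "collected": date(2014, 6, 20)}
    print(linked(a, b, snp_dist=3))   # True: 3 SNPs, 18 days apart
    print(linked(a, b, snp_dist=22))  # False: too many SNPs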

Dag Harmsen – Uni Muenster

  • Antibiotic resistance @ Uni Hospital Muenster
  • MRSA, CVRE, CRE all isolated at point of care → MiSeq Nextera XT v2 250bp PE @ 100x
  • Use cgMLST, eg N. meningitidis, 1241 loci ~ 55% of genome (see the allele-distance sketch after this list)
  • Claims “cgMLST produces more robust trees than SNPs by ignoring recombination effects with some minor loss of discriminatory power, offset by ease of use”
  • Problems with multiplexing different genome sizes, coverage of big genomes sometimes too low
  • Can save $millions on unnecessary quarantine / preemptive patient isolation
  • Transmission used to be 5%, and after WGS it went down a little bit, but $$$ saved in bed costs
  • 1 day DNA extraction from single colony – Koeser et al 2014 JAC(69):1275 < 4 euro
  • Using diluted Nextera reagents with same results -> more cost saving
  • Power of WGS is to EXCLUDE outbreaks where previously inconclusive. Saves money, but not publishable!
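
A sketch of a cgMLST-style allele distance for the approach described above: count loci where two profiles carry different allele numbers, ignoring loci that failed to call in either isolate. Illustrative only; the locus names and profiles are made up, and this is not any particular scheme’s implementation.

    # Count differing alleles across shared, callable core loci.
    def allele_distance(profile_a, profile_b):
        diffs = 0
        for locus, allele_a in profile_a.items():
            allele_b = profile_b.get(locus)
            if allele_a is None or allele_b is None:
                continue  # missing locus: excluded from the comparison
            if allele_a != allele_b:
                diffs += 1
        return diffs

    a = {"locus1": 5, "locus2": 12, "locus3": None, "locus4": 7}
    b = {"locus1": 5, "locus2": 13, "locus3": 2,    "locus4": 7}
    print(allele_distance(a, b))  # 1 (only locus2 differs among callable loci)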

Group Discussion – GenomeTrakr future needs

No notes – primarily FDA / NCBI internal issues.

James Pettengill – OAO – CFSAN – FDA

  • Bioinformatics approaches for rapidly detecting outbreaks
  • Identify epidemiologically relevant clusters from 25,000 samples
  • Salmonella 50 TB, Listeria 5 TB of data so far
  1. Obtain an assembly for each sample, done using ‘cloud bursting’
    1. download from SRA
    2. FASTX toolkit quality filters
    3. Kraken custom DB contamination filter
    4. Spades assembly
  2. Evaluate different methods for estimating relationship between samples
    1. Assembly | all vs all VS. one vs all | site based VS. kmer based
      kmer based are fast but lose information as heterogeneity increases
    2. Reads | one vs all
    3. Simulation: Hudson’s ms program 100-tip tree + SeqGen under topology + ART read sim
      http://home.uchicago.edu/rhudson1/source/mksamples.html
      http://tree.bio.ed.ac.uk/software/seqgen/
    4. Site based: Nucmer, parSNP, CFSAN SNP Pipeline
    5. Kmer based: Jaccard index, Manhattan distance, Euclidean distance (see the Jaccard sketch after this list)
    6. the relationship between k-mer distances and SNP distances breaks down below ~1000 SNPs; if using kmer distances to recruit nearby isolates you need to take this into consideration
    7. kmer distances suffer a lot from mobile elements, removing a contig, etc. (Jaccard dist)
  3. Fine grain analyses to elucidate genomic differences
    1. CFSAN github site
    2. SNP pipelines – recovered 98.9% of SNPs, FP rate = 1.04 x 10^-6 @ 100x / 98.8%, 8.34 x 10^-7 @ 20x. False negatives are due to consensus frequency (i.e. mixed bases) and not coverage, and are NOT random along the genome (although not much evidence except eyeballing!).
  4. Implement daily surveillance and report generation
    1. orphan and non-orphan samples based on a SNP threshold (what is this, how determined?)
  • 25000 genomes assembled in cloud – $8000, useful to clear backlog or use as surge capacity if need to redo all assemblies.
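
A sketch of the Jaccard k-mer distance listed under the k-mer based methods above; k and the toy sequences are arbitrary. Gaining or losing a mobile element changes the k-mer set and shifts this distance, which is one of the issues noted in the list.

    # Jaccard distance between the k-mer sets of two sequences.
    def kmers(seq, k=4):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def jaccard_distance(seq_a, seq_b, k=4):
        a, b = kmers(seq_a, k), kmers(seq_b, k)
        return 1 - len(a & b) / len(a | b)

    print(jaccard_distance("ACGTACGTGGCC", "ACGTACGTGGCA"))  # ~0.22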

Cecilie Boysen – CLC, Qiagen

  • keep instruments and reagents agnostic
  • CLC Genomics Workbench, single user, few samples
  • Genomics Server, several users, many samples
  • Build workflows much like Galaxy – has toolbox (like toolshed) – is this Server backend product? I can see “CLC Custom Solutions” at the top of the left hand tool menu.
  • Can share workflows with colleagues – export and import. Need to “unlock” them so your colleague can point the reference links to their own reference genomes?
  • many modules eg
    • SRA submission
    • Typing tool
    • OTU clustering
    • MLST
    • BLAST
    • Forensic toolkit
  • import data from MiSeq
  • Plugin data to upload data to SRA – has a filter to ensure you don’t send crap data!
  • Allows to use local FASTQ for data on BaseSpace.

Lynne Bry – Brigham & Women’s Hospital

  • Sequencing foodborne and MDR clinical isolates at BWH.
  • 1000 micro samples per day! >100 +ve cultures across kingdoms/phyla
  • 50% diagnosis, 10% therapy, 40% screening/surveillance MRSA, VRE, GrpB Strep, Gram-ve
  • Lots of meta-data: MIC, disk diffusion, E-TEST, drug resistance, zone diam, R/I/S, ESBL, D-zone CLI/ERM
  • HIPAA de-identified data – Year only, no location
  • WHONET http://whonet.org/ open source to generate antibiograms in surveillance (old, Windows software)
  • Crimson LIMS – prospective analysis of clinical samples & real time query
  • “Honest Broker” assigns new external IDs
  • SPAdes, QUAST, ResFinder, CARD, RAST, Mauve for extrachromosomal elements, BLAST for plasmid/transposon – where is the resistance gene? on a transposon, plasmid or chromosome? (Bandage and ISMapper might be useful for these types of analysis?)
  • SNPs: bowtie2, mpileup, bcftools, custom filtering (see the pipeline sketch after this list).
  • Kp CRE ST258 – found many different plasmids and transposons + point mutations – WGS revealed this detail
  • E. cloacae CRE – ampC on chrom + porin mutations , multiple mobile elements Tn4401b / Tn6901
  • S. marcescens CRE – SRT-2 Ampc_SME-4, AmpC and KPC-3 acquired, 3 year surveillance 2011-2014, 2 close events
  • Timelines: MiSeq (14 days), Bioinformatics (1-14 days), Epi (14 days)
  • Despite 3 week turnaround they are told it IS actionable
  • Rule out just as important as rule in
  • Mobile element analysis can refine relationship analysis
  • Curating new genomes and mobile elements takes most of the time
  • Desire to use more principled methods for outbreak calling – SaTScan, Bayesian, likelihoods
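
A hedged sketch of the bowtie2 / mpileup / bcftools SNP-calling steps listed above, driven from Python. File names are placeholders, and the exact flags and custom filters used at BWH were not described in the talk; check the commands against your installed tool versions before use.

    # Placeholder inputs; not the BWH pipeline, just the same general steps.
    import subprocess

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    ref, r1, r2 = "ref.fa", "reads_R1.fastq.gz", "reads_R2.fastq.gz"

    run(f"bowtie2-build {ref} ref_index")
    run(f"bowtie2 -x ref_index -1 {r1} -2 {r2} -S aln.sam")
    run("samtools sort -o aln.bam aln.sam && samtools index aln.bam")
    run(f"bcftools mpileup -f {ref} aln.bam | bcftools call -mv -o snps.vcf")
    # Custom filtering (depth, quality, proximity to indels) would follow here.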

Ole Lund – DTU, Denmark + CGE + COMPARE-EU

  • Built a web based system with tools: http://www.genomicepidemiology.org/
  • 164,000 WGS datasets submitted to the server since 2012
  • Species detection needed for MLST typing – found kmer based scheme worked best
  • k-mer trees go back to Woese 1977 – oligo based trees
  • ResFinder for acquired AMR
  • Pipeline: assembly + put through all the *Finder suite apps in one pass (check out https://cge.cbs.dtu.dk/services/cge/ for this)
  • Batch uploader tool – fill out a metadata XLS form and submit; produces an XLS summary output format
  • Compute: 56 cores / 100,000 submissions — moving to 280 cores ~ 500,000 submissions
  • SNP tree – Made error in that they used to use ref base when couldn’t call base in isolate (whoops!)
  • Did use UPGMA but assumption is that they all come from same time point, now using neighbour joining
  • Mapping – 10 mins per isolate (this seems long? does it include SNP calling as well? PA)
  • Tree building – N*2.5 + N*N*0.02 seconds ≈ 24 days for 10,000 isolates (see the worked check after this list)
  • Solution is to only analyse those isolates close (<10 SNPs?) to what you have, use precalculated mappings, efficient binary storage etc. Move to clade-specific sub-analyses over global analysis.
  • Reference genomes curated down to a smaller set, < 98% identity between any two
  • Can “place” new strains onto existing trees (aka pplacer) instead of building whole tree.
  • Problem is core genome sites CHANGE as you add new isolates. Need smarter methods?
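
A quick check of the tree-building scaling formula quoted above: the quadratic term dominates, giving roughly 23-24 days for 10,000 isolates, consistent with the figure given in the talk.

    # time = N*2.5 + N*N*0.02 seconds, as quoted above
    def tree_time_days(n):
        seconds = n * 2.5 + n * n * 0.02
        return seconds / 86400

    print(f"{tree_time_days(10_000):.1f} days")   # ~23.4 days for 10,000 isolates
    print(f"{tree_time_days(100_000):.0f} days")  # quadratic term dominates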

Wondwossen Gebreyes – The Ohio State University

Comparative Molecular Epidemiology of MDR Salmonella of Global Origin Using Whole Genome Sequencing

NOTE: Phil has started tweeting instead of entering good notes here.

  • vet
  • International Consortium on Interface between animal-human-pathogen (KOPHAI)
  • emerging zoonotic infectious diseases – outbreak and response
  • Molecular Epidemiology of Salmonella
  • eg antimicrobial free and conventional (farms)
    • goal: understand AMR across geographic and temporal space
    • correlation of AMR and co-tolerance genes (heavy metals)
    • build database for surveillance, capacity of networked labs
    • 924 isolates to CFSAN
    • studying 3 countries in Eastern Africa
    • high Zn tolerance linked w/ certain AMR phenotypes→ efflux pumps
  • looked at antimicrobial alternatives and effects on gut microbiome

Phil bringing up a good point that the African seqs being added to GenomeTrakr in advance of publication help with more global/other-country analyses. FB: Definitely need to get more countries on board.

Patrick Chain – LANL

  • tools for sample analysis and metagenomics
  • Code: https://github.com/LANL-Bioinformatics
  • PhaME – core genome matrix, phylogenetic and molecular evolution analysis
    • grab parts of tree and do analyses (pos/neg/neutral selection)
    • tree analysis with subtrees → whether contigs, reads or closed genomes are used, results cluster tightly
  • metagenomics:
    • use cgtree, map reads, record SNPs, place strains in tree with other pathogens
    • can apply to euk and viral genomes
    • identify genome signatures?
  • GOTTCHA – taxonomic profiling
  • EDGE-BOUND Bioinformatics platform
    • 6 independent modules (eg pre-processing, assembly)
    • simply requires FastQ file
    • many outputs

Reviewed: “Accurate read-based metagenome characterization using a hierarchical suite of unique signatures”

http://www.nar.oxfordjournals.org/content/43/10/e69.long

Web server: https://bioedge.lanl.gov/edge_ui/

Round Table

Why not sequence one of every PFGE type? Try and sample the population widely with WGS. We keep finding more and more diversity. Not clear how much more we need to sample! FDA is happy to consider sequencing any novel material to help with this.

Lack of contribution from Europe to SRA/GenomeTrakr – primarily USA and UK.  Public health funding should require submission to databases. Ole Lund claims Euro bureaucracy is too difficult.  GMI talks about this a lot – it’s mainly political, and it will happen eventually.

END
