Date of Last Update: August 2nd 2024
PHB version: v2.1.0
Introduction to Phylogenetics
Phylogenetics is an approach to understanding evolutionary relationships among organisms, primarily through analysis of gene, amino acid, or genome sequences. These evolutionary relationships are graphically represented by phylogenetic trees.
Broadly, there are two phylogenetic analysis methods:
<aside>
🌲 Phylogenetic tree construction
Creates a phylogenetic tree from a set of sequences
- Goal: Determine the evolutionary relationships among a set of sequences, often to rule out likely transmission
- Pros:
    - Can be constructed from any suitable set of samples
    - More accurate than phylogenetic placement when a high-quality dataset and appropriate methods are used
- Cons:
    - Can be comparatively slow and computationally expensive, especially for trees built from large numbers of sequences or large genomes
</aside>
<aside>
🎄 Phylogenetic placement
Places genomes onto an existing phylogenetic tree
- Goal: Determine the closest relatives of a new sequence
- Pros:
    - Avoids building a whole tree, which is comparatively slow and computationally expensive, especially for large amounts of data
- Cons:
    - Requires an existing tree to add the new sample to
    - Less accurate than building a new phylogenetic tree
</aside>
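As a rough illustration of the placement goal above (finding the closest relatives of a new sequence), the sketch below ranks samples in an existing alignment by SNP distance to a new sequence. This is a simple nearest-neighbour lookup, not true phylogenetic placement, which evaluates positions on the tree itself. The function names, sample names, and sequences are all hypothetical, and the input is assumed to be already aligned.

```python
from __future__ import annotations


def snp_distance(a: str, b: str) -> int:
    """Count positions where two aligned sequences differ, skipping gaps and Ns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    ignored = {"-", "N"}
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in ignored and y not in ignored
    )


def closest_relatives(
    query: str, reference: dict[str, str], top_n: int = 3
) -> list[tuple[str, int]]:
    """Return the top_n reference samples with the fewest SNP differences from the query."""
    distances = [(name, snp_distance(query, seq)) for name, seq in reference.items()]
    # Sort by distance first, then by name so that ties are ordered reproducibly
    return sorted(distances, key=lambda item: (item[1], item[0]))[:top_n]


# Hypothetical aligned sequences, for illustration only
reference_alignment = {
    "sample_A": "ACGTACGTAC",
    "sample_B": "ACGTACGTTC",
    "sample_C": "ACGAACCTTC",
}
new_sequence = "ACGTACGTTN"

nearest = closest_relatives(new_sequence, reference_alignment, top_n=2)
print(nearest)
```

Only the new sequence is compared against the existing samples, which is why this kind of lookup scales better than rebuilding a tree from every sequence. It also ignores tree structure entirely, so it should be read as an intuition aid rather than a substitute for a placement workflow.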
Phylogenetic tree construction approaches
Key considerations before generating a phylogenetic tree:
- Sequences should have been previously analyzed with TheiaCoV, TheiaProk, or TheiaEuk to assess sequence quality, generate assemblies or annotation files that may be required for some phylogenetic tree-building workflows, and generate any metadata that you might like to use for visualization against the tree.
- All samples included in a phylogenetic tree should pass agreed QC thresholds
- FASTA input trees are particularly reliant on a high-quality assembly
    - Repetitive regions may be incorrectly assembled (particularly for de novo assemblies as generated by TheiaProk and TheiaEuk)
    - Low-coverage regions and heterologous sites may be included in the phylogeny
- For transmission analyses, samples in the same tree should be closely related (the same lineage or ST)
Workflow recommendations for phylogenetic tree construction
<aside>
✅ Recommendations:
- Augur_Prep → Augur: For building phylogenetic trees from viral genomes
- kSNP3: For analysis of clonal sets of genomes (e.g. foodborne outbreak analyses), using a simple method
- Snippy_Streamline: For analysis of bacterial genomes that may undergo recombination or require masking of the genome
- Snippy_Variants → Snippy_Tree: Similar to Snippy_Streamline, but for when you want more control over the workflow parameters or if you want to generate the tree multiple times using different combinations of sequences aligned against the same reference
- Mashtree_FASTA: For very quick trees
- Core_Gene_SNP: For generation of a pangenome analysis, with an additional core- or pan-gene phylogeny to visualize the pangenome against
</aside>
- Full comparison of Theiagen phylogenetic construction workflows
Interpreting phylogenetic trees and SNP distances
Resources for phylogenetic tree interpretation
SNP distances
During outbreak investigations, SNP distances are sometimes used to help interpret the potential for transmission. SNP distance thresholds have been established for some pathogens, under some circumstances. Typically, SNP distance thresholds are used to:
- Identify potential transmission clusters
- Rule OUT transmission events (which may be directional, between two specified locations or people)
SNP thresholds are difficult to determine because of:
- Within-host diversity
- An unknown number of transmissions or other bottlenecks that decrease genetic diversity
- Variable mutation rates between strains, in different environments, and/or in different regions of the genome
- Imprecise removal of recombination or erroneous SNPs
- Incomplete sampling: other infected individuals who might have been the source may not have been sampled
Additional context can help with interpretation:
- Comparing SNP distances between potentially related strains and background strains is helpful for source attribution (e.g. foodborne outbreaks)
- Combining SNP distances with epi data can help identify suitable thresholds to rule out transmission
- Mutation rates can be calculated based on SNPs at different time points, allowing inference of the start of an outbreak
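To make the idea of threshold-based clusters concrete, here is a minimal sketch that groups samples whose pairwise SNP distances fall at or below a cutoff. It uses single-linkage grouping (a sample joins a cluster if it is within the threshold of any member), which is one possible choice rather than an established standard. The function name, sample names, distances, and the threshold of 5 SNPs are all hypothetical. Real thresholds are pathogen- and context-specific and subject to the caveats listed above.

```python
from __future__ import annotations


def cluster_by_snp_threshold(
    samples: list[str],
    distances: dict[tuple[str, str], int],
    threshold: int,
) -> list[list[str]]:
    """Group samples linked by chains of pairs within `threshold` SNPs (single linkage)."""
    parent = {s: s for s in samples}

    def find(s: str) -> str:
        # Follow parent pointers to the cluster representative, shortening the path as we go
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Merge the clusters of every pair that falls within the threshold
    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, list[str]] = {}
    for s in samples:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise SNP distances, for illustration only
samples = ["s1", "s2", "s3", "s4"]
distances = {
    ("s1", "s2"): 2,
    ("s1", "s3"): 4,
    ("s2", "s3"): 3,
    ("s1", "s4"): 41,
    ("s2", "s4"): 40,
    ("s3", "s4"): 43,
}
SNP_THRESHOLD = 5  # hypothetical cutoff, not a recommendation

clusters = cluster_by_snp_threshold(samples, distances, SNP_THRESHOLD)
print(clusters)
```

In this toy data, `s4` sits far from every other sample and ends up alone, which is the kind of result that might help rule out its involvement in a transmission event. Samples that do cluster together are only candidates for transmission and still need epidemiological support.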
Visualizing phylogenetic trees
<aside>
✅ Recommendations:
- Auspice for phylogenetic trees generated using the Augur workflows
- Phandango for visualizing metadata against the phylogenetic tree (e.g. presence/absence of ARGs or plasmid replicons, SNP-distance matrices, recombination gff files from gubbins, or pangenome visualizations)
- FigTree for re-rooting phylogenetic trees, visualizing trees with annotated nodes (e.g. time-dated phylogenies) and looking at branch lengths
- MicrobeTrace for visualizing phylogenetic trees with transmission networks
</aside>
- Full comparison of no-code phylogenetic tree visualization software
To learn more about MicrobeTrace, please see the following video: 📺 Using KSNP3 in Terra and Visualizing Bacterial Genomic Networks in MicrobeTrace