
Introduction to Phylogenetics

Phylogenetics is an approach to understanding evolutionary relationships among organisms, primarily through analysis of gene, amino acid, or genome sequences. These evolutionary relationships are graphically represented by phylogenetic trees.

Broadly, there are two phylogenetic analysis methods…

<aside> 🌲 Phylogenetic tree construction

Creates a phylogenetic tree from a set of sequences

Goal: Determine the evolutionary relationships among a set of sequences, often to help rule out transmission
Pros:
- Can be constructed from any suitable set of samples
- More accurate than phylogenetic placement when a high-quality dataset and appropriate methods are used
Cons:
- Can be comparatively slow and computationally expensive, especially for trees built from large numbers of sequences or large genomes </aside>

<aside> 🎄 Phylogenetic placement

Places genomes onto an existing phylogenetic tree

Goal: Determine the closest relatives to a new sequence
Pros:
- Avoids building a whole tree, which is comparatively slow and computationally expensive, especially for large amounts of data
Cons:
- Requires an existing tree to add the new sample to
- Less accurate than building a new phylogenetic tree </aside>

Phylogenetic tree construction approaches

Key considerations before generating a phylogenetic tree:

Sequences should have been previously analyzed with TheiaCoV, TheiaProk, or TheiaEuk to assess sequence quality, generate assemblies or annotation files that may be required for some phylogenetic tree-building workflows, and generate any metadata that you might like to use for visualization against the tree.
All samples included in a phylogenetic tree should pass agreed QC thresholds
- FASTA input trees are particularly reliant on a high-quality assembly
  - Repetitive regions may be incorrectly assembled (particularly for de novo assemblies as generated by TheiaProk and TheiaEuk)
  - Low-coverage regions and heterologous sites may be included in the phylogeny
For transmission analyses, samples in the same tree should be closely related (the same lineage or sequence type, ST)
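
As an illustration of checking that every sample passes agreed QC thresholds before tree building, the following is a minimal Python sketch. The field names (`genome_length`, `contig_count`, `mean_depth`), the threshold values, and the isolate names are all hypothetical and are not taken from TheiaCoV, TheiaProk, or TheiaEuk outputs; substitute the metrics and cut-offs your laboratory has agreed on.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    # Hypothetical per-sample QC metrics; the names are illustrative only.
    name: str
    genome_length: int   # total assembled bases
    contig_count: int    # number of contigs in the assembly
    mean_depth: float    # mean read depth across the assembly


@dataclass
class QCThresholds:
    # Hypothetical thresholds; real values depend on organism and protocol.
    min_genome_length: int
    max_genome_length: int
    max_contig_count: int
    min_mean_depth: float


def failed_checks(sample, thresholds):
    """Return the names of the QC checks that this sample fails."""
    failures = []
    if not thresholds.min_genome_length <= sample.genome_length <= thresholds.max_genome_length:
        failures.append("genome_length")
    if sample.contig_count > thresholds.max_contig_count:
        failures.append("contig_count")
    if sample.mean_depth < thresholds.min_mean_depth:
        failures.append("mean_depth")
    return failures


def split_by_qc(samples, thresholds):
    """Split samples into passing names and a {name: failed checks} mapping."""
    passed, failed = [], {}
    for sample in samples:
        failures = failed_checks(sample, thresholds)
        if failures:
            failed[sample.name] = failures
        else:
            passed.append(sample.name)
    return passed, failed


# Toy values for illustration only.
thresholds = QCThresholds(4_500_000, 5_500_000, 200, 40.0)
samples = [
    SampleQC("isolate_1", 5_000_000, 80, 65.0),
    SampleQC("isolate_2", 5_100_000, 450, 70.0),
    SampleQC("isolate_3", 3_900_000, 120, 25.0),
]
passed, failed = split_by_qc(samples, thresholds)
```

Only the samples in `passed` would go forward to tree building; `failed` records why each excluded sample was dropped, which is useful when reviewing the dataset.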

Workflow recommendations for phylogenetic tree construction

<aside> ✅ Recommendations:

Augur_Prep → Augur: For building phylogenetic trees from viral genomes
kSNP3: For analysis of clonal sets of genomes (e.g. foodborne outbreak analyses), using a simple method
Snippy_Streamline: For analysis of bacterial genomes that may undergo recombination or require masking of the genome
Snippy_Variants → Snippy_Tree: Similar to Snippy_Streamline, but for when you want more control over the workflow parameters, or want to generate the tree multiple times using different combinations of sequences aligned against the same reference
Mashtree_FASTA: For very quick trees
Core_Gene_SNP: For generation of a pangenome analysis, with an additional core- or pan-gene phylogeny to visualize the pangenome against </aside>
Full comparison of Theiagen phylogenetic construction workflows

Interpreting phylogenetic trees and SNP distances

Resources for phylogenetic tree interpretation

Understanding phylogenetic trees, particularly what they represent
How to read a phylogenetic tree
How to interpret phylogenetic trees in terms of transmission

SNP distances

During outbreak investigations, SNP distances are sometimes used to help interpret the potential for transmission. SNP distance thresholds have been established for some pathogens, under some circumstances. Typically, SNP distance thresholds are used to:

Identify potential transmission clusters
Rule OUT transmission events (may be directional, between two specified locations or people)

SNP thresholds are difficult to determine because of
- Within-host diversity
- An unknown number of transmissions or other bottlenecks that decrease genetic diversity
- Variable mutation rates between strains, in different environments, and/or in different regions of the genome
- Imprecise removal of recombination or erroneous SNPs
Comparing SNP distances between potentially related strains and background strains is helpful for source attribution (e.g. foodborne outbreaks)
Combining SNP distances with epidemiological data can help identify suitable thresholds to rule out transmission
Mutation rates can be calculated from SNPs at different time points, allowing the start of an outbreak to be inferred
Incomplete sampling: other infected individuals who might have been the source may not have been sampled
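
To make the idea of a SNP distance concrete, here is a minimal Python sketch that counts pairwise differences in an already-aligned set of sequences and lists the pairs at or below a threshold. The sample names, the toy alignment, and the threshold of 2 are invented for illustration, and skipping `N`, `-`, and `?` sites is an assumption of this sketch; dedicated tools may treat ambiguous or gapped positions differently.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-?"):
    """Count aligned positions where two sequences differ.

    Positions where either sequence has a character in `ignore`
    (ambiguous bases or gaps) are skipped rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    skip = set(ignore.upper())
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in skip and b not in skip
    )


def pairwise_snp_distances(alignment):
    """Return {(name_a, name_b): distance} for every pair of samples."""
    return {
        (a, b): snp_distance(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }


def pairs_within_threshold(distances, threshold):
    """Return the sample pairs whose SNP distance is at or below the threshold."""
    return sorted(pair for pair, dist in distances.items() if dist <= threshold)


# Toy alignment for illustration only.
alignment = {
    "sample_A": "ACGTACGTAC",
    "sample_B": "ACGTACGTAT",
    "sample_C": "ACGAACNTAT",
    "sample_D": "TCGAAGGTTT",
}

distances = pairwise_snp_distances(alignment)
close_pairs = pairs_within_threshold(distances, threshold=2)  # hypothetical threshold
```

Pairs above the threshold are candidates for ruling out transmission. Pairs at or below it are not thereby confirmed as transmission, for the reasons listed above, and should be interpreted alongside epidemiological data.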

Visualizing phylogenetic trees

<aside> ✅ Recommendations:

Auspice for phylogenetic trees generated using the Augur workflows
Phandango for visualizing metadata against the phylogenetic tree (e.g. presence/absence of ARGs or plasmid replicons, SNP-distance matrices, recombination GFF files from Gubbins, or pangenome visualizations)
FigTree for re-rooting phylogenetic trees, visualizing trees with annotated nodes (e.g. time-dated phylogenies) and looking at branch lengths
MicrobeTrace for visualizing phylogenetic trees with transmission networks </aside>
Full comparison of no-code phylogenetic tree visualization software

To learn more about MicrobeTrace, please see the following video: 📺 Using KSNP3 in Terra and Visualizing Bacterial Genomic Networks in MicrobeTrace
