Core_Gene_SNP Workflow

Overview

The Core_Gene_SNP workflow is intended for pangenome analysis, core gene alignment, and phylogenetic analysis. It takes gene sequence data in GFF3 format from a set of samples and first produces a pangenome summary using PIRATE, which clusters the genes in the sample set into orthologous gene families. By default, the workflow also instructs PIRATE to produce both core gene and pangenome alignments. It then generates a phylogenetic tree and an SNP distance matrix from the core gene alignment using IQ-TREE and snp-dists, respectively. Optionally, the same analysis can be run on the pangenome alignment. The workflow also features an optional module, summarize_data, that creates a presence/absence matrix for the analyzed samples from a list of indicated columns (such as AMR genes) that can be used in Phandango.
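To make the idea of an SNP distance matrix concrete, here is a minimal Python sketch that counts pairwise differences between aligned sequences. It is an illustration only, not the code the workflow runs: the function names and the toy alignment are hypothetical, and the choice to count only positions where both sequences carry A, C, G, or T is an assumption made for this example.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, valid="ACGT"):
    """Count aligned positions where two sequences differ.

    Assumption for this sketch: a position is counted only when both
    characters are unambiguous bases (A, C, G, T), so gaps and N are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in valid and b in valid
    )


def distance_matrix(alignment):
    """Build a symmetric matrix (dict of dicts) of pairwise SNP distances."""
    names = list(alignment)
    matrix = {n: {m: 0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical toy alignment; real core gene alignments are far longer.
toy_alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "AC-TTA"}
toy_matrix = distance_matrix(toy_alignment)
```

In this toy case the gap in s3 is skipped, so s1 and s3 differ at two counted positions rather than three.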

<aside> 💡 Please note that while default parameters for pangenome construction and phylogenetic tree generation are provided, these default parameters may not suit every dataset and have not been validated against known phylogenies. Users should take care to select the parameters that are most appropriate for their dataset. Please reach out to [email protected] or one of the other resources listed at the bottom of this page if you would like assistance with this task.

</aside>

Inputs

Required User Inputs

Optional User Inputs

Tasks/Actions

By default, the Core_Gene_SNP workflow begins by analyzing the input sample set with PIRATE (https://github.com/SionBayliss/PIRATE). PIRATE takes in GFF3 files and classifies the genes into gene families by sequence identity, outputting a pangenome summary file. The workflow then instructs PIRATE to create core gene and pangenome alignments from this gene family data. Setting the align input variable to false turns off this behavior, and the workflow outputs only the pangenome summary. Next, the workflow uses the core gene alignment from PIRATE to infer a phylogenetic tree with IQ-TREE and to produce an SNP distance matrix with snp-dists (https://github.com/tseemann/snp-dists). This behavior can be turned off by setting the core_tree input variable to false. The workflow does not create a pangenome tree or SNP distance matrix by default, but this behavior can be turned on by setting the pan_tree input variable to true.
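The way the align, core_tree, and pan_tree inputs interact can be sketched as a small Python function. This is a reading of the paragraph above, not the workflow's actual definition: the step labels are hypothetical names chosen for illustration, and the sketch assumes that setting align to false skips tree and matrix generation entirely, since the text says only the pangenome summary is output in that case.

```python
def planned_steps(align=True, core_tree=True, pan_tree=False):
    """Return the steps that would run for a given combination of inputs.

    Defaults mirror those described in the text: align and core_tree on,
    pan_tree off. Step labels are hypothetical, for illustration only.
    """
    steps = ["pirate_pangenome_summary"]
    if not align:
        # Without alignments, only the pangenome summary is produced.
        return steps
    steps.append("pirate_core_and_pangenome_alignments")
    if core_tree:
        steps += ["iqtree_core_tree", "snp_dists_core_matrix"]
    if pan_tree:
        steps += ["iqtree_pan_tree", "snp_dists_pan_matrix"]
    return steps


default_steps = planned_steps()
summary_only = planned_steps(align=False)
everything = planned_steps(pan_tree=True)
```

With the defaults, the sketch yields the summary, the alignments, and the core gene tree and matrix; adding pan_tree=True appends the pangenome tree and matrix.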

The optional summarize_data task performs the following only if all of the data_summary_* and sample_names optional variables are filled out:

Digests a comma-separated list of column names (for example, "amrfinderplus_virulence_genes,amrfinderplus_stress_genes") that can be found within the origin Terra data table.
Parses the contents of those columns and extracts each value; for example, if the amrfinder_amr_genes column for a sample contains the values "aph(3')-IIIa,tet(O),blaOXA-193", the task checks every sample in the set to see whether those AMR genes were also detected in it.
Outputs a .csv file that indicates presence (TRUE) or absence (empty) for each item in those columns; that is, each sample in the set is checked against every item detected in each column.
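The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the summarize_data task itself: the in-memory table layout, the function names, and the "sample" header are hypothetical, and the Phandango coloring tag is left out.

```python
import csv
import io


def presence_absence(table, columns):
    """Collect every item seen in the chosen columns and who carries it.

    `table` is a hypothetical stand-in for the Terra data table:
    {sample_name: {column_name: "comma,separated,values"}}.
    """
    items = []          # unique items in first-seen order
    per_sample = {}     # sample name -> set of items detected in it
    for sample, row in table.items():
        found = set()
        for column in columns:
            for value in row.get(column, "").split(","):
                value = value.strip()
                if not value:
                    continue
                found.add(value)
                if value not in items:
                    items.append(value)
        per_sample[sample] = found
    return items, per_sample


def to_csv(items, per_sample):
    """Write TRUE for presence and an empty cell for absence."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample"] + items)
    for sample, found in per_sample.items():
        writer.writerow([sample] + ["TRUE" if i in found else "" for i in items])
    return buffer.getvalue()


# Toy input reusing the example values from the text.
toy_table = {
    "sample1": {"amrfinder_amr_genes": "aph(3')-IIIa,tet(O),blaOXA-193"},
    "sample2": {"amrfinder_amr_genes": "tet(O)"},
}
toy_items, toy_presence = presence_absence(toy_table, ["amrfinder_amr_genes"])
toy_csv = to_csv(toy_items, toy_presence)
```

Here sample2 gets TRUE only under tet(O), and its cells for the other two genes stay empty.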

By default, this task appends a Phandango coloring tag to color all items from the same column the same; this can be turned off by setting the optional phandango_coloring variable to false.

Outputs

All outputs

References