The Core_Gene_SNP workflow is intended for pangenome analysis, core gene alignment, and phylogenetic analysis. The workflow takes in gene sequence data in GFF3 format from a set of samples. It first produces a pangenome summary using Pirate, which clusters genes within the sample set into orthologous gene families. By default, the workflow also instructs Pirate to produce both core gene and pangenome alignments. The workflow then generates a phylogenetic tree and an SNP distance matrix from the core gene alignment using iqtree and snp-dists, respectively. Optionally, the workflow will also run this analysis on the pangenome alignment. The workflow also features an optional module, summarize_data, that creates a presence/absence matrix for the analyzed samples from a list of indicated columns (such as AMR genes); this matrix can be used in Phandango.
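To make the idea of an SNP distance matrix concrete, here is a minimal Python sketch of a pairwise SNP count over an alignment. It is an illustration only, not the snp-dists implementation: it assumes that only positions where both sequences carry an unambiguous A, C, G, or T are compared, and the sample names and sequences are hypothetical toy data.

```python
from itertools import combinations

# Assumption: only unambiguous bases are compared; gaps and N are skipped.
VALID_BASES = set("ACGT")


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both bases are unambiguous and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def snp_matrix(alignment: dict[str, str]) -> dict[str, dict[str, int]]:
    """Build a symmetric sample-by-sample matrix of pairwise SNP distances."""
    names = list(alignment)
    matrix = {name: {other: 0 for other in names} for name in names}
    for first, second in combinations(names, 2):
        distance = snp_distance(alignment[first], alignment[second])
        matrix[first][second] = distance
        matrix[second][first] = distance
    return matrix


# Hypothetical toy alignment (not real workflow output).
toy_alignment = {
    "sample_A": "ACGTACGT",
    "sample_B": "ACGTACGA",
    "sample_C": "AC-TTCGA",
}
toy_matrix = snp_matrix(toy_alignment)
```

In this toy case the gap in sample_C is skipped rather than counted as a difference, which is the assumption stated above.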
<aside> 💡 Please note that while default parameters for pangenome construction and phylogenetic tree generation are provided, these default parameters may not suit every dataset and have not been validated against known phylogenies. Users should take care to select the parameters that are most appropriate for their dataset. Please reach out to [email protected] or one of the other resources listed at the bottom of this page if you would like assistance with this task.
</aside>
By default, the Core_Gene_SNP workflow begins by analyzing the input sample set with Pirate (https://github.com/SionBayliss/PIRATE). Pirate takes in GFF3 files and classifies the genes into gene families by sequence identity, outputting a pangenome summary file. The workflow then instructs Pirate to create core gene and pangenome alignments from this gene family data. Setting the "align" input variable to false turns off this behavior, and the workflow will output only the pangenome summary. Next, the workflow uses the core gene alignment from Pirate to infer a phylogenetic tree with IQ-TREE. It also produces an SNP distance matrix from this alignment using snp-dists (https://github.com/tseemann/snp-dists). This behavior can be turned off by setting the core_tree input variable to false. The workflow does not create a pangenome tree or SNP distance matrix by default, but this behavior can be turned on by setting the pan_tree input variable to true.
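The way the three boolean inputs interact can be summarized in a short Python sketch. This is only a model of the behavior described above, not the workflow's own code: the function name and the output labels are hypothetical, and the defaults (align and core_tree true, pan_tree false) are taken from the description.

```python
def planned_outputs(
    align: bool = True,
    core_tree: bool = True,
    pan_tree: bool = False,
) -> list[str]:
    """Return descriptive labels for what the workflow would produce."""
    # The pangenome summary is always produced.
    outputs = ["pangenome summary"]
    if not align:
        # Without alignments there is nothing to build trees or matrices from.
        return outputs
    outputs += ["core gene alignment", "pangenome alignment"]
    if core_tree:
        outputs += ["core gene tree", "core gene SNP distance matrix"]
    if pan_tree:
        outputs += ["pangenome tree", "pangenome SNP distance matrix"]
    return outputs


default_run = planned_outputs()
summary_only = planned_outputs(align=False)
with_pan_tree = planned_outputs(pan_tree=True)
```

With the defaults, the sketch lists the summary, both alignments, and the core gene tree and matrix; the pangenome tree and matrix appear only when pan_tree is set to true.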
The optional summarize_data task runs only if all of the data_summary_* and sample_names optional variables are filled out. It takes a comma-separated list of column names, such as "amrfinderplus_virulence_genes,amrfinderplus_stress_genes", that can be found within the origin Terra data table. For example, if the amrfinder_amr_genes column for a sample contains these values: "aph(3')-IIIa,tet(O),blaOXA-193", the summarize_data task will check each sample in the set to see whether it also has those AMR genes detected. By default, this task appends a Phandango coloring tag so that all items from the same column are colored the same; this can be turned off by setting the optional phandango_coloring variable to false.
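The presence/absence logic can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the summarize_data task itself: the function names are hypothetical, the table is a toy stand-in for a Terra data table, values from all indicated columns are pooled into one set of matrix columns, and the Phandango coloring tag is left out because its exact format is not described here.

```python
def split_values(cell: str) -> list[str]:
    """Split a comma-separated cell into trimmed, non-empty values."""
    return [value.strip() for value in (cell or "").split(",") if value.strip()]


def presence_absence(
    table: dict[str, dict[str, str]],
    columns: list[str],
) -> tuple[list[str], dict[str, dict[str, bool]]]:
    """Return the observed values and a per-sample presence/absence mapping."""
    # Collect every distinct value seen in the indicated columns,
    # in first-seen order.
    observed: list[str] = []
    for row in table.values():
        for column in columns:
            for value in split_values(row.get(column, "")):
                if value not in observed:
                    observed.append(value)

    # Mark, for each sample, which of the observed values it carries.
    matrix: dict[str, dict[str, bool]] = {}
    for sample, row in table.items():
        found = {
            value
            for column in columns
            for value in split_values(row.get(column, ""))
        }
        matrix[sample] = {value: value in found for value in observed}
    return observed, matrix


# Hypothetical toy table; only the column name and gene list come from the text.
toy_table = {
    "sample_1": {"amrfinder_amr_genes": "aph(3')-IIIa,tet(O),blaOXA-193"},
    "sample_2": {"amrfinder_amr_genes": "tet(O)"},
    "sample_3": {"amrfinder_amr_genes": ""},
}
genes, matrix = presence_absence(toy_table, ["amrfinder_amr_genes"])
```

Here sample_2 is marked present only for tet(O), and sample_3, with an empty cell, is marked absent for every gene.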
✉️ [email protected] | X (formerly Twitter) | LinkedIn | 🌐 Website