The TheiaCoV Workflow Series

The TheiaCoV Workflow Series is a suite of bioinformatics workflows designed for the assembly, quality assessment, and characterization of viral genomes. These workflows accommodate various input data types and support multiple viral organisms, facilitating comprehensive genomic analyses for public health applications.

Figure 1: The TheiaCoV workflow diagram. SARS-CoV-2 is the default organism, but compatibility with several others is directly implemented and custom viruses can be submitted for reference-based genome assembly. Depending on the organism provided, which is controlled by the organism optional input, independent and tailored genomic characterization modules are triggered. All organisms follow a consensus assembly approach computed by iVar, with the exception of flu which is assembled by IRMA.

TheiaCoV Default Organisms

These workflows currently support seven organisms, listed below. The workflows are adaptable, with parameters that can be customized for specific organisms. Input JSON files with preset configurations for each supported virus are provided, streamlining the setup process.

Except for influenza, which follows a different process in TheiaCoV, all organisms are assembled through consensus from a reference genome (Figure 2).

Figure 2: TheiaCoV viral genome assembly flowchart. FASTQ-formatted reads are binned into taxonomic groups, trimmed, QC’ed, mapped to the reference genome, and the consensus assembly is created with respect to the reference.

These workflows currently support the following organisms:

SARS-CoV-2 ("sars-cov-2", "SARS-CoV-2") - default organism input
Mpox virus ("MPXV", "mpox", "monkeypox", "Monkeypox virus", "Mpox")
Human Immunodeficiency Virus ("HIV")
West Nile Virus ("WNV", "wnv", "West Nile virus")
Influenza ("flu", "influenza", "Flu", "Influenza")
RSV-A ("rsv_a", "rsv-a", "RSV-A", "RSV_A")
RSV-B ("rsv_b", "rsv-b", "RSV-B", "RSV_B")

All non-influenza default organisms go through read quality control and consensus assembly with iVar. First, human reads are removed from the sample with NCBI's human read removal tool (HRRT), and the data are taxonomically profiled with Kraken2 (using a database with all viral data in RefSeq and human) before and after human read removal. The reads are then trimmed with Trimmomatic (default) or fastp. Sequencing adapters, if present, are removed with BBDuk, and raw and clean read quality is assessed using fastq_scan (default) or FastQC. The clean reads are then mapped to the reference genome after indexing with bwa. Primers are trimmed from the alignment with iVar, and variants are called with samtools. Finally, samtools and iVar are called to generate the consensus assembly.
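To make the mapping and consensus stages of this paragraph more concrete, the sketch below builds (but does not run) a plausible sequence of shell commands for paired-end reads. It is an illustration only: the flags follow the general usage shown in the bwa, samtools, and iVar manuals, the file names and the `consensus_commands` helper are hypothetical, and the commands and parameters TheiaCoV actually runs may differ. Human read removal, Kraken2 profiling, read trimming, and variant calling are deliberately left out.

```python
import shlex


def consensus_commands(reference, reads_1, reads_2, primer_bed, prefix="sample"):
    """Return shell command strings for a bwa -> iVar consensus sketch.

    Nothing is executed here; the caller decides how (or whether)
    to run the commands.
    """
    ref, r1, r2, bed, p = (
        shlex.quote(value)
        for value in (reference, reads_1, reads_2, primer_bed, prefix)
    )
    return [
        # Index the reference, then map the cleaned reads to it.
        f"bwa index {ref}",
        f"bwa mem {ref} {r1} {r2} | samtools sort -o {p}.sorted.bam -",
        # Soft-clip primer sequences using the primer BED file.
        f"ivar trim -i {p}.sorted.bam -b {bed} -p {p}.trimmed",
        f"samtools sort -o {p}.trimmed.sorted.bam {p}.trimmed.bam",
        # Pile up the alignment and call the consensus sequence.
        f"samtools mpileup -aa -A -d 0 -Q 0 {p}.trimmed.sorted.bam"
        f" | ivar consensus -p {p}.consensus",
    ]


if __name__ == "__main__":
    for command in consensus_commands(
        "reference.fasta", "clean_R1.fastq.gz", "clean_R2.fastq.gz", "primers.bed"
    ):
        print(command)
```

Returning the commands as strings keeps the sketch safe to inspect without any of the tools installed.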

For default organisms, we provide all the necessary files for all of these processes. To successfully generate a consensus assembly for a non-default organism, depending on the workflow configuration, the intermediary files will need to be provided by the user. We drafted a set of recommendations below to facilitate this process.

Workflow Recommendations for “Custom” Viruses

TheiaCoV can accomplish reference-based consensus genome assembly of some non-model, “custom”, viruses in accord with Figure 2’s workflow. Running TheiaCoV on custom viruses requires inputs that may be displayed as optional in Terra.Bio. These inputs are listed in Table 1. Briefly, an organism name, provided at the TheiaCoV workflow input, is required; reference genome length, assembly FASTA, and gene coordinates GFF files are required for most workflows; and a primer BED file is required for ONT data and is only needed for Illumina samples if primers are set to be trimmed.

TheiaCoV is not designed for custom viruses, so it is important to assess the validity of resulting assemblies. The custom virus approach requires a closely related reference genome as input, or else the workflow will fail due to an insufficient quantity of reads mapping to the reference. Such errors will occur at the ivar_consensus task during read alignment/extraction or during post-assembly variant calling because a consensus assembly comprising degenerate nucleotides was created. These errors primarily occur due to read mapping difficulty in small (< 20 kb), recombinant, or evolutionarily diverse lineages, such as norovirus or rhinovirus. Contamination can also cause reference mapping errors, so it is important to review the Kraken2 report to ensure the taxonomic composition of the sample sufficiently comprises the expected viral lineage.
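Reviewing the Kraken2 report can be partly automated. The sketch below assumes the standard tab-separated Kraken2 report layout, in which the first column is the percentage of reads in the clade and the last column is the (indented) taxon name. The function name is hypothetical, and what counts as a "sufficient" percentage is left to the analyst, since no threshold is given here.

```python
def percent_reads_for_taxon(report_text, taxon_name):
    """Return the clade-level percent of reads for taxon_name, or None.

    report_text is the content of a Kraken2 report file. The taxon
    name must match the report's name column exactly once leading
    indentation is stripped.
    """
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        if fields[-1].strip() == taxon_name:
            return float(fields[0])
    return None
```

A return value of `None` means the expected lineage does not appear in the report at all, which is itself a strong hint that the sample or the chosen reference should be re-examined before trusting the assembly.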

Table 1: Required and optional inputs for running custom viruses

Task	Input	Description	Custom virus requirement
theiacov_*	genome_length	Expected genome length of organism	Required
theiacov_*	organism	Name of expected organism	Required
theiacov_*	reference_gff	Reference sequence in GFF3 format	Required for SE, PE, ONT; omitted from FASTA
theiacov_*	primer_bed	Bed file with primer locations	Required for ONT; optional for SE and PE, though required if trim_primers set to True
theiacov_*	reference_gene_locations_bed	Bed file with gene location	Optional to estimate gene coverage
theiacov_*	reference_genome	Reference sequence in FASTA format	Required
theiacov_*	target_organism	Name of the expected organism in Kraken2 database	Optional to quantify percent of reads matching target organism
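The requirements in Table 1 can be expressed as a small pre-flight check. This is a sketch under stated assumptions rather than part of TheiaCoV: the workflow labels ("se", "pe", "ont", "fasta") and the function name are hypothetical, the input names are taken from Table 1, and a missing trim_primers value is treated as False here, which may not match the workflow's own default.

```python
def missing_custom_inputs(inputs, workflow):
    """List the Table 1 inputs that are required but not supplied.

    inputs   -- dict of input name -> value (names as in Table 1)
    workflow -- one of "se", "pe", "ont", "fasta" (illustrative labels)
    """
    # Required for every workflow when running a custom virus.
    required = ["genome_length", "organism", "reference_genome"]

    # The GFF file is required for read-based workflows, not for FASTA input.
    if workflow in {"se", "pe", "ont"}:
        required.append("reference_gff")

    # Primers: always for ONT; for Illumina only when primers are trimmed.
    trims_primers = bool(inputs.get("trim_primers"))  # absent -> False (assumption)
    if workflow == "ont" or (workflow in {"se", "pe"} and trims_primers):
        required.append("primer_bed")

    return [name for name in required if inputs.get(name) in (None, "")]
```

For example, calling the function for an ONT run that supplies only organism, genome_length, and reference_genome reports reference_gff and primer_bed as missing, while the same inputs pass for a FASTA run.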

Request Support for Running TheiaCoV with "Custom" Viruses

Depending on your organism of interest, the guidance above might not be sufficient. For additional support, please reach out to [email protected]! We'll be more than happy to assist you with your analyses.