Supplemental Methods: anvi’o

library(tidyverse)
library(ggtext)
library(ggh4x)
library(RColorBrewer)
library(viridis)
library(grid)
library(ggpubr)
library(gridtext)
library(kableExtra)
library(seqinr)
library(reshape2)
library(plotly)
library(htmlwidgets)
library(seqvisr)

This notebook contains analyses using anvi’o version 8 (Delmont and Eren, 2018; Eren et al., 2020) (development branch). For more in-depth information about anvi’o go to their official online site here: https://anvio.org/

1 Functional annotations

We use as input (path_i) the anvi’o contigs database files (.db) that were generated in SupplementalMethods_Prokka_Annotations.

1.1 COG annotation

First we downloaded and set up the NCBI’s Clusters of Orthologous Groups database Galperin et al. (2020) using anvi-setup-ncbi-cogs. The default is the latest version, which is COG20, meaning that anvi’o will use the NCBI’s 2020 release of COGs to setup the database.

The anvi-run-ncbi-cogs command was used to annotate the contig databases. We used DIAMOND as searching method (Buchfink et al., 2014) with the recommended “sensitive” option.

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-ncbi-cogs -T 8 --search-with diamond -c $file;
done

1.2 KEGG annotation

First we downloaded and set up the KEGG KOfam (Kanehisa, 2000; Kanehisa et al., 2022) database using anvi-setup-kegg-data. We used the snapshot of the KEGG database dated 2023-09-22 (hash a2b5bde358bb). It contains 479 modules with 15116 entries.

The program anvi-run-kegg-kofams annotates genes in a given anvi’o contigs database with KEGG Orthology (KO) numbers via hits to the KEGG KOfam database. We used -E lower than 1e-15.

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-kegg-kofams -T 8 --log-bitscores -E 1e-15 --just-do-it -c $file;
done

The option --log-bitscores records the bit scores values. They get saved to the main folder for the repo, so we moved them to a new subfolder

path_i <- "."
path_o <- "data/Anvio8/KEGG_bitscores"
dir.create(file.path("data/Anvio8/KEGG_bitscores"), recursive = TRUE, showWarnings=FALSE)
files_to_move <- list.files(path_i, pattern = "_bitscores.txt$", full.names = TRUE)

for (file in files_to_move) {
  new_path <- file.path(path_o, basename(file))
  print(new_path)
  file.rename(file, new_path)
}

1.3 PFAM annotation

First you need to download and set up the PFAM database (Bateman, 2000; Mistry et al., 2020) using anvi-setup-pfams: this is something you will do only once. We used Pfam version 36.0 (2023-07).

The program anvi-run-kegg-pfam annotates genes in a given anvi’o contigs database with the PFAM database.

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-pfams -T 8 -c $file;
done

1.4 Running HMMs

When you run the following command the occurrence of bacterial single-copy genes across your contigs is added to your contigs database (Eddy, 2011).

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-hmms -T 8 -c $file --just-do-it;
done

1.5 Running CAZyme

First you need to download and set up the CAZyme database (Drula et al., 2021) using anvi-setup-cazymes: this is something you will do only once. We used Pfam version 11.

The program anvi-run-cazymes annotates genes in your contigs-db with the dbCAN CAZyme database.

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-cazymes -T 8 -c $file;
done

1.6 Annotation Summaries

This generates basic stats on the annotations:

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"
path_o="data/Anvio8/Annotation_summaries"
mkdir -p "$path_o"

for file in $path_i/*.db; do
    FILENAME=`basename ${file%.*}`
    anvi-display-contigs-stats $file --report-as-text -o $path_o/stats_$FILENAME.txt;
done

This provides basic info on the contigs databases:

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"
path_o="data/Anvio8/Annotation_summaries"

for file in $path_i/*.db; do
    anvi-db-info $file &>> $path_o/anvi-db-info_log.txt;
done

2 Pangenome Analysis

2.1 Generating an anvi’o genomes storage database

The program anvi-gen-genomes-storage requires a .txt file with the names and path locations of the genomes to be included in the genome storage database. The storage database generated with CorPGA_Anvio_GenomeList_v01a.txt includes the 104 Corynebacterium genomes selected for pangenomic analysis (The three C. macginleyi genomes were excluded).

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"
mkdir -p "$path_i"

anvi-gen-genomes-storage -e "data/genome_lists/CorPGA_Anvio_GenomeList_v01a.txt" -o $path_i/CorPGA-All4Cor-GENOMES.db --gene-caller Prodigal

This provides basic info on the generated genomes storage database:

#conda activate anvio-dev

anvi-db-info $path_i/CorPGA-All4Cor-GENOMES.db

2.2 Running the pangenomic analyses

In order to analyze the pangenome individually for each of the four Corynebacterium species we created a .txt file including the list of genomes for each species and ran anvi-pan-genome with the -G flag:

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"
path_g="data/genome_lists"

anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
                -G $path_g/CorPGA_Anvio_GenomeList_PangenomeCpr.txt \
                -o $path_i/Only_Cpr -n PAN_Cpr \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8
                
anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
                -G $path_g/CorPGA_Anvio_GenomeList_PangenomeCps.txt \
                -o $path_i/Only_Cps -n PAN_Cps \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8

anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
                -G $path_g/CorPGA_Anvio_GenomeList_PangenomeCac.txt \
                -o $path_i/Only_Cac -n PAN_Cac \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8
                
anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
                -G $path_g/CorPGA_Anvio_GenomeList_PangenomeCtu.txt \
                -o $path_i/Only_Ctu -n PAN_Ctu \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8

2.3 Displaying the pangenomes

This will display the pangenome in your default internet browser. Anvi’o says google chrome is the most compatible with the anvi’o interactive interface. You need to pick one of the -p inputs for each PAN.db file:

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"

anvi-display-pan -p $path_i/Only_Cpr/PAN_Cpr-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db

anvi-display-pan -p $path_i/Only_Cps/PAN_Cps-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db

anvi-display-pan -p $path_i/Only_Cac/PAN_Cac-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db

anvi-display-pan -p $path_i/Only_Ctu/PAN_Ctu-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db

If you are using Windows and it cannot display the interactive page then try copy and pasting either 127.0.0.1:8080 or http://localhost:8080/ into your browser. Do not use sudo in the front of the command either.

Once it opens in the browser, you can adjust image settings and click the “DRAW” icon to view it. For help using the interactive interface see http://merenlab.org/2016/02/27/the-anvio-interactive-interface/

We manually reordered the layers to match the phylogenomic trees per species using the click and drag option on the interactive interface. The order names is inverse to the layers in the concentric diagram. When all the layers were re-ordered to match the phylogenomic tree we clicked the draw button and saved the svg file.

2.4 Creating bin collections

2.4.1 Core vs. Accessory

In order to define the Core vs Accessory pangenome for each of the four species level pangenomes we use anvi’o interactive interface. In the “Bins” tab we created the corresponding bins. Using the “Search” tab we used “Search gene clusters using filters” and “Append splits to selected bin” to create bins. The following tables summarizes how the bins in the collection named CorevsAccessory were created (last column indicates the criteria used on each search):

2.4.1.1 C. propinquum pangenome

Pangenome Bin	# of Gene Clusters/Pangenome	Percentage	Gene Cluster occurs in # of genomes
Core	1824/3108	58.7%	19
Single Copy Core	1715/3108	55.2%	19
Multicopy Core	109/3108	3.5%	19
Soft Core	39/3108	1.3%	18
Shell	689/3108	22.2%	3-17
Cloud	556/3108	17.9%	1-2

2.4.1.2 C. pseudodiphtheriticum pangenome

Pangenome Bin	# of Gene Clusters/Pangenome	Percentage	Gene Cluster occurs in # of genomes
Core	1714/3590	47.7%	43
Single Copy Core	1488/3590	41.4%	43
Multicopy Core	226/3590	6.3%	43
Soft Core	46/3590	1.3%	40-42
Shell	972/3590	27.1%	3-39
Cloud	858/3590	23.9%	1-2

2.4.1.3 C. accolens pangenome

Pangenome Bin	# of Gene Clusters/Pangenome	Percentage	Gene Cluster occurs in # of genomes
Core	1904/3427	55.6%	34
Single Copy Core	1734/3427	50.6%	34
Multicopy Core	170/3427	5%	34
Soft Core	77/3427	2.2%	32-33
Shell	676/3427	19.7%	3-31
Cloud	770/3427	22.5%	1-2

2.4.1.4 C. tuberculostearicum pangenome

Pangenome Bin	# of Gene Clusters/Pangenome	Percentage	Gene Cluster occurs in # of genomes
Core	1915/2907	65.9%	8
Single Copy Core	1860/2907	64%	8
Multicopy Core	55/2907	1.9%	8
Soft Core	47/2907	1.6%	7
Shell	270/2907	9.3%	3-6
Cloud	671/2907	23.1%	1-2

2.4.2 PPanGGOLiN Partitions

PPanGGOLiN partitions were imported as a collection of bins named PPanGGOLiNpartitions as described in PPanGGOLiN analysis.

2.5 Summarizing the pangenome files

Using the program anvi-summarize we exported summary tables for each of the four species level pangenomes.

The resulting summaries contain a gene_clusters_summary.txt.gz file that links each gene to gene clusters, genomes, functions, layers, and bins selected from the interface. We used the flag -C to selected the CorevsAccessory collection described above:

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"

anvi-summarize -p $path_i/Only_Cpr/PAN_Cpr-PAN.db \
               -g $path_i/CorPGA-All4Cor-GENOMES.db \
               -o $path_i/Only_Cpr/Summaries \
               -C CorevsAccessory

anvi-summarize -p $path_i/Only_Cps/PAN_Cps-PAN.db \
               -g $path_i/CorPGA-All4Cor-GENOMES.db \
               -o $path_i/Only_Cps/Summaries \
               -C CorevsAccessory
               
anvi-summarize -p $path_i/Only_Cac/PAN_Cac-PAN.db \
               -g $path_i/CorPGA-All4Cor-GENOMES.db \
               -o $path_i/Only_Cac/Summaries \
               -C CorevsAccessory

anvi-summarize -p $path_i/Only_Ctu/PAN_Ctu-PAN.db \
               -g $path_i/CorPGA-All4Cor-GENOMES.db \
               -o $path_i/Only_Ctu/Summaries \
               -C CorevsAccessory

The output files were used for the COG functional analysis based on PPanGGOLiN partitions.

3 KEGG Metabolic Analysis

3.1 Estimation of KEGG module completeness

Using anvi-estimate-metabolism we predicted KEGG module completeness as described in (Veseli et al., 2023) for the the organisms listed in CorPGA_MetabolicAnalysis_GenomeList_v01.txt. This list includes:

The 102 Corynebacterium genomes selected for metabolic analysis (Cac_ATCC_49726 & Cps_090104 were excluded for this analysis, as well as all three C. macginleyi genomes)
Plus 27 for the 28 Dolosigranulum pigrum genomes described in (Flores Ramos et al., 2021) (KPL3086 was not included in the analysis due to being highly similar to KPL3065).
As well as the 9 reference genomes with original NCBI annotations listed in Table S3D.

We generated outputs both in long tabular and matrix format with the KO hits and KEGG modules completeness scores.

#conda activate anvio-dev

mkdir -p "data/Anvio8/Metabolic_Analysis"
path_o="data/Anvio8/Metabolic_Analysis"

txt_file="data/genome_lists/CorPGA_MetabolicAnalysis_GenomeList_v01.txt"
prefix=CorPGA
path_o="data/Anvio8/Metabolic_Analysis"
mkdir -p "$path_o"

anvi-estimate-metabolism -e $txt_file \
                         -O $path_o/$prefix --matrix-format --include-metadata

anvi-estimate-metabolism -e $txt_file \
                         -O $path_o/$prefix --add-copy-number --output-modes modules,module_paths,module_steps,hits

3.2 Module Enrichment Analysis

The CorPGA_modules.txt file generated in the previous step is used by anvi-compute-metabolic-enrichment to calculate KEGG module enrichment (Shaiber et al., 2020) across different groups of genomes. Enrichment was explored across different genome groupings as described in the following files:

CorPGA_MetabolicAnalysis_Groups_4Cor.txtincludes 102 genomes for the 4 nasal Corynebacterium grouped by their species.
CorPGA_MetabolicAnalysis_Groups_CorDpi.txt adds to the previous file an extra group with the 27 D. pigrum genomes.
CorPGA_MetabolicAnalysis_Groups_Cps.txt includes the 41 C. pseudodiphtheriticum genomes selected to study metabolic differences across geographic locations (BWA vs. USA grouping).
CorPGA_MetabolicAnalysis_Groups_References.txt adds to the previous file the 9 NCBI reference genomes as 9 new groups.

#conda activate anvio-dev

csv_file_modules="data/Anvio8/Metabolic_Analysis/CorPGA_modules.txt"
path_o="data/Anvio8/Metabolic_Analysis/Enrichment_Modules"
mkdir -p "$path_o"

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_4Cor.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_4Cor.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_CorDpi.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_CorDpi.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_Cps.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_Cps.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0
                                  
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_References.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_References.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0

3.3 KO Enrichment Analysis

We used anvi-display-functions to perform this functional enrichment (Shaiber et al., 2020). It runs the enrichment for the desired annotations (in this case KOfam). It saves the output to a system temp file that was recovered, renamed and saved into the same path_o along with the database file.

The same files where used to indicate grouping options equivalent to the ones used for the enrichment at the module level:

#conda activate anvio-dev

txt_file="data/genome_lists/CorPGA_MetabolicAnalysis_GenomeList_v01.txt"
path_o="data/Anvio8/Metabolic_Analysis/Enrichment_KOs"
mkdir -p "$path_o"

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_4Cor.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_4Cor.db
                       
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_CorDpi.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_CorDpi.db    
                       
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_Cps.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_Cps.db

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_References.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_References.db

This generates a list of all enriched KOs, including KOs that belong to already identified enriched modules. We are interested in identifying other KOs, other than the ones identified by studying the modules output. The next function filters out KOs that are part of the definition of modules listed in the enriched modules output.

get_KOs_not_in_enrich_modules <- function(modules, hits, enrich_modules, enrich_KOs, q_value_threshold) {

  # Get list of KOs that correspond with the modules on the enrichment file
  modules_subset <- modules %>% filter(module %in% enrich_modules$accession)
  module_definition_subset <- paste(modules_subset$module_definition, collapse = " ")
  module_definition_subset <- str_extract_all(module_definition_subset, "\\bK\\d+\\b")
  module_definition_subset <- unique(unlist(module_definition_subset))

  # Add a column to enrich_KOs indicating if the KO is the previous list
  enrich_KOs$inModule <- sapply(strsplit(as.character(enrich_KOs$accession), ","), function(x) all((x %in% module_definition_subset)))

  # Filter Enrich_KO based on inModule and q_value_threshold
  enrich_KOs_filtered <- enrich_KOs %>% 
    filter(inModule == FALSE) %>% 
    filter(adjusted_q_value < q_value_threshold)

  return(enrich_KOs_filtered)
}

Modules <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA_modules.txt", show_col_types = FALSE, col_types = cols(unique_enzymes_hit_counts = col_character(), gene_caller_ids_in_module = col_character()))
KOHits <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA_hits.txt", show_col_types = FALSE)

Enrich_Mod_4Cor <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_4Cor.txt", show_col_types = FALSE)
Enrich_KO_4Cor <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_4Cor.txt", show_col_types = FALSE)
Enrich_KO_4Cor_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_4Cor, Enrich_KO_4Cor, 0.1)
write_csv(Enrich_KO_4Cor_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_4Cor_Filtered.csv")

Enrich_Mod_CorDpi <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_CorDpi.txt", show_col_types = FALSE)
Enrich_KO_CorDpi <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_CorDpi.txt", show_col_types = FALSE)
Enrich_KO_CorDpi_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_CorDpi, Enrich_KO_CorDpi, 0.1)
write_csv(Enrich_KO_CorDpi_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_CorDpi_Filtered.csv")

Enrich_Mod_Cps <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_Cps.txt", show_col_types = FALSE)
Enrich_KO_Cps <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_Cps.txt", show_col_types = FALSE)
Enrich_KO_Cps_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_Cps, Enrich_KO_Cps, 0.1)
write_csv(Enrich_KO_Cps_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_Cps_Filtered.csv")

Enrich_Mod_References <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_References.txt", show_col_types = FALSE)
Enrich_KO_References <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_References.txt", show_col_types = FALSE)
Enrich_KO_References_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_References, Enrich_KO_References, 0.1)
write_csv(Enrich_KO_References_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_References_Filtered.csv")

3.4 Figures and Tables

3.4.1 Tables S3 and S4

From the output files generated by anvi-estimate-metabolism we include as Supplemental data the following tables:

TableS3A. KO hits: Corresponds to output file CorPGA_hits.txt and contains individual KO annotations for all genomes.
TableS4A. KEEG modules: Corresponds to output file CorPGA_modules.txt and contains KEGG module estimations for all genomes in a long format dataframe.
TableS4B. KEGG module stepwise matrix: Corresponds to output file CorPGA-module_stepwise_completeness-MATRIX.txt and contains KEGG module stepwise completion score for all genomes in a wide format dataframe (rows are modules, numeric columns are strains).
TableS4C. KEGG modules by Species: The following code was used to read the previous table and generate a version with stepwise module completeness averaged by species (rows are modules, numeric columns are species):

Modules_MatrixbyGenome <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA-module_stepwise_completeness-MATRIX.txt")

# Identify numerical and non-numerical columns
numeric_cols <- sapply(Modules_MatrixbyGenome, is.numeric)
non_numeric_cols <- names(Modules_MatrixbyGenome)[!numeric_cols]

# Extract unique prefixes (based on strain name including species name) from numerical column names
prefixes <- unique(sub("_.*", "", names(Modules_MatrixbyGenome)[numeric_cols]))

# Define function to calculate row-wise means for columns with the same prefix
calculate_prefix_means <- function(df, prefix) {
  df %>%
    select(starts_with(prefix)) %>%
    rowMeans(na.rm = TRUE) %>%
    round(2)
}

# Calculate the row-wise means for each prefix and bind them into a new data frame
averaged_df <- map_dfc(prefixes, ~ {
  tibble(!! .x := calculate_prefix_means(Modules_MatrixbyGenome, .x))
})

# Combine the non-numerical columns and the averaged columns by species
Modules_MatrixbySpecies <- bind_cols(Modules_MatrixbyGenome %>% select(all_of(non_numeric_cols)), averaged_df)

write.csv(Modules_MatrixbySpecies, "data/Anvio8/Metabolic_Analysis/Fig_Tables/CorPGA-module_stepwise_completeness-MATRIXbySpecies.csv", row.names = F)

As described above, metabolic enrichment was explored across different genome groupings, however only the analysis comparing the 4 nasal Corynebacterium as defined in CorPGA_MetabolicAnalysis_Groups_4Cor.txt was included in the final manuscript. We include the following tables:

TableS4D. Enriched_Modules_4Cor: Corresponds to the output file from anvi-compute-metabolic-enrichment
TableS3B. OtherEnriched_KOs_4Cor: Corresponds to the output file CorPGA_enriched_KOs_4Cor_Filtered.csv. Includes only the enriched KOs that are not part of any module already described in tab A.

3.4.2 Figure 4

# Read modules data and select only the genomes for the four nasal Corynebacterium species and Dolosigranulum pigrum. Make appropriate labeling columns
Modules <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA_modules.txt", show_col_types = FALSE, col_types = cols(unique_enzymes_hit_counts = col_character(), gene_caller_ids_in_module = col_character())) %>% 
  filter(!grepl("^GCF", genome_name)) %>%
  mutate(module_class_short = paste0("(", substr(module_class, 1, 1), ")")) %>%
  mutate(species_name = str_extract(genome_name, "^[^_]+")) %>%
  unite(modules_label, c(module_class_short, module_category, module_subcategory), sep = "_", remove = FALSE) %>% 
  mutate_at(c("species_name", "genome_name", "module", "module_name", "module_category", "module_subcategory", "module_class", "modules_label"), factor)

# Summary tally table of fully stepwise complete modules by species
ModulesTable <- Modules %>%
  filter(stepwise_module_completeness == 1) %>%
  group_by(module_category, module_subcategory, module, module_name, species_name) %>%
  tally() 

# Table converted to wide and percentages calculated based on number of genomes per species
ModulesTableWide <- ModulesTable %>%
  pivot_wider(names_from = species_name, values_from = n, values_fill = 0)

ModulesTableWide$Cpr = round(100*(ModulesTableWide$Cpr/19), 1)
ModulesTableWide$Cps = round(100*(ModulesTableWide$Cps/42), 1)
ModulesTableWide$Cac = round(100*(ModulesTableWide$Cac/33), 1)
ModulesTableWide$Ctu = round(100*(ModulesTableWide$Ctu/8), 1)
ModulesTableWide$Dpi = round(100*(ModulesTableWide$Dpi/27), 1)

# Modules with less than 12.5% across all species eliminated
ModulesTableWide <- ModulesTableWide %>%
  select(module_category, module_subcategory, module, module_name, Cpr, Cps, Cac, Ctu, Dpi) %>%
  filter(rowSums(across(c(Cpr, Cps, Cac, Ctu, Dpi)) > 12.5) > 0)

# Version of the table to calculate the percentage of complete modules across the four Corynebacterium species:
ModulesData <- ModulesTable %>%
  filter(species_name != "Dpi") %>%
  group_by(module_category, module_subcategory, module, module_name) %>%
  summarise(total = sum(n)) %>%
  mutate(percent = 100*total/102)

write_csv(ModulesTableWide, "data/Anvio8/Metabolic_Analysis/Fig_Tables/TableKEGG.csv")
kable(ModulesTableWide)

module_category	module_subcategory	module	module_name	Cpr	Cps	Cac	Ctu	Dpi
Amino acid metabolism	Arginine and proline metabolism	M00015	Proline biosynthesis, glutamate => proline	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Arginine and proline metabolism	M00028	Ornithine biosynthesis, glutamate => ornithine	100.0	100.0	100.0	50.0	0.0
Amino acid metabolism	Arginine and proline metabolism	M00844	Arginine biosynthesis, ornithine => arginine	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Arginine and proline metabolism	M00970	Proline degradation, proline => glutamate	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Aromatic amino acid metabolism	M00022	Shikimate pathway, phosphoenolpyruvate + erythrose-4P => chorismate	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Aromatic amino acid metabolism	M00023	Tryptophan biosynthesis, chorismate => tryptophan	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Branched-chain amino acid metabolism	M00019	Valine/isoleucine biosynthesis, pyruvate => valine / 2-oxobutanoate => isoleucine	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Branched-chain amino acid metabolism	M00432	Leucine biosynthesis, 2-oxoisovalerate => 2-oxoisocaproate	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Branched-chain amino acid metabolism	M00570	Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Cysteine and methionine metabolism	M00017	Methionine biosynthesis, aspartate => homoserine => methionine	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Cysteine and methionine metabolism	M00021	Cysteine biosynthesis, serine => cysteine	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Histidine metabolism	M00026	Histidine biosynthesis, PRPP => histidine	100.0	100.0	90.9	0.0	0.0
Amino acid metabolism	Histidine metabolism	M00045	Histidine degradation, histidine => N-formiminoglutamate => glutamate	100.0	100.0	100.0	100.0	0.0
Amino acid metabolism	Lysine metabolism	M00016	Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine	100.0	100.0	100.0	50.0	0.0
Amino acid metabolism	Serine and threonine metabolism	M00018	Threonine biosynthesis, aspartate => homoserine => threonine	100.0	100.0	100.0	100.0	81.5
Amino acid metabolism	Serine and threonine metabolism	M00020	Serine biosynthesis, glycerate-3P => serine	100.0	100.0	100.0	87.5	0.0
Amino acid metabolism	Serine and threonine metabolism	M00621	Glycine cleavage system	0.0	0.0	100.0	100.0	100.0
Biosynthesis of terpenoids and polyketides	Polyketide sugar unit biosynthesis	M00793	dTDP-L-rhamnose biosynthesis	100.0	100.0	100.0	100.0	0.0
Biosynthesis of terpenoids and polyketides	Terpenoid backbone biosynthesis	M00096	C5 isoprenoid biosynthesis, non-mevalonate pathway	0.0	0.0	100.0	100.0	0.0
Biosynthesis of terpenoids and polyketides	Terpenoid backbone biosynthesis	M00364	C10-C20 isoprenoid biosynthesis, bacteria	0.0	0.0	100.0	100.0	0.0
Biosynthesis of terpenoids and polyketides	Terpenoid backbone biosynthesis	M00365	C10-C20 isoprenoid biosynthesis, archaea	0.0	0.0	100.0	100.0	0.0
Carbohydrate metabolism	Central carbohydrate metabolism	M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	100.0	100.0	100.0	100.0	100.0
Carbohydrate metabolism	Central carbohydrate metabolism	M00002	Glycolysis, core module involving three-carbon compounds	100.0	100.0	100.0	100.0	100.0
Carbohydrate metabolism	Central carbohydrate metabolism	M00003	Gluconeogenesis, oxaloacetate => fructose-6P	100.0	100.0	100.0	100.0	0.0
Carbohydrate metabolism	Central carbohydrate metabolism	M00004	Pentose phosphate pathway (Pentose phosphate cycle)	100.0	100.0	100.0	100.0	77.8
Carbohydrate metabolism	Central carbohydrate metabolism	M00005	PRPP biosynthesis, ribose 5P => PRPP	100.0	100.0	100.0	100.0	100.0
Carbohydrate metabolism	Central carbohydrate metabolism	M00006	Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P	100.0	100.0	100.0	100.0	100.0
Carbohydrate metabolism	Central carbohydrate metabolism	M00007	Pentose phosphate pathway, non-oxidative phase, fructose 6P => ribose 5P	100.0	100.0	100.0	100.0	77.8
Carbohydrate metabolism	Central carbohydrate metabolism	M00009	Citrate cycle (TCA cycle, Krebs cycle)	100.0	100.0	100.0	100.0	0.0
Carbohydrate metabolism	Central carbohydrate metabolism	M00010	Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate	100.0	100.0	100.0	100.0	0.0
Carbohydrate metabolism	Central carbohydrate metabolism	M00011	Citrate cycle, second carbon oxidation, 2-oxoglutarate => oxaloacetate	100.0	100.0	100.0	100.0	0.0
Carbohydrate metabolism	Central carbohydrate metabolism	M00307	Pyruvate oxidation, pyruvate => acetyl-CoA	100.0	69.0	100.0	100.0	0.0
Carbohydrate metabolism	Other carbohydrate metabolism	M00549	Nucleotide sugar biosynthesis, glucose => UDP-glucose	0.0	0.0	0.0	100.0	0.0
Carbohydrate metabolism	Other carbohydrate metabolism	M00550	Ascorbate degradation, ascorbate => D-xylulose-5P	0.0	0.0	0.0	0.0	37.0
Carbohydrate metabolism	Other carbohydrate metabolism	M00554	Nucleotide sugar biosynthesis, galactose => UDP-galactose	100.0	100.0	100.0	100.0	0.0
Carbohydrate metabolism	Other carbohydrate metabolism	M00632	Galactose degradation, Leloir pathway, galactose => alpha-D-glucose-1P	100.0	100.0	100.0	100.0	0.0
Carbohydrate metabolism	Other carbohydrate metabolism	M00854	Glycogen biosynthesis, glucose-1P => glycogen/starch	0.0	0.0	0.0	0.0	96.3
Carbohydrate metabolism	Other carbohydrate metabolism	M00909	UDP-N-acetyl-D-glucosamine biosynthesis, prokaryotes, glucose => UDP-GlcNAc	100.0	100.0	100.0	100.0	100.0
Energy metabolism	ATP synthesis	M00151	Cytochrome bc1 complex respiratory unit	100.0	100.0	100.0	100.0	0.0
Energy metabolism	ATP synthesis	M00155	Cytochrome c oxidase, prokaryotes	100.0	100.0	100.0	100.0	0.0
Energy metabolism	ATP synthesis	M00157	F-type ATPase, prokaryotes and chloroplasts	100.0	100.0	100.0	100.0	0.0
Energy metabolism	Carbon fixation	M00579	Phosphate acetyltransferase-acetate kinase pathway, acetyl-CoA => acetate	0.0	0.0	100.0	100.0	100.0
Lipid metabolism	Fatty acid metabolism	M00083	Fatty acid biosynthesis, elongation	100.0	100.0	0.0	0.0	96.3
Lipid metabolism	Fatty acid metabolism	M00086	beta-Oxidation, acyl-CoA synthesis	100.0	100.0	100.0	100.0	0.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00120	Coenzyme A biosynthesis, pantothenate => CoA	100.0	100.0	100.0	100.0	100.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00121	Heme biosynthesis, plants and bacteria, glutamate => heme	100.0	100.0	100.0	100.0	0.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00123	Biotin biosynthesis, pimeloyl-ACP/CoA => biotin	21.1	92.9	100.0	0.0	0.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00125	Riboflavin biosynthesis, plants and bacteria, GTP => riboflavin/FMN/FAD	100.0	100.0	100.0	100.0	14.8
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00126	Tetrahydrofolate biosynthesis, GTP => THF	0.0	0.0	100.0	100.0	0.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00140	C1-unit interconversion, prokaryotes	100.0	97.6	100.0	100.0	100.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00577	Biotin biosynthesis, BioW pathway, pimelate => pimeloyl-CoA => biotin	21.1	92.9	100.0	0.0	0.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00881	Lipoic acid biosynthesis, plants and bacteria, octanoyl-ACP => dihydrolipoyl-E2/H	100.0	100.0	100.0	100.0	0.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00899	Thiamine salvage pathway, HMP/HET => TMP	0.0	0.0	100.0	100.0	100.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00916	Pyridoxal-P biosynthesis, R5P + glyceraldehyde-3P + glutamine => pyridoxal-P	0.0	0.0	100.0	100.0	0.0
Metabolism of cofactors and vitamins	Cofactor and vitamin metabolism	M00926	Heme biosynthesis, bacteria, glutamyl-tRNA => coproporphyrin III => heme	100.0	100.0	100.0	100.0	0.0
Nucleotide metabolism	Purine metabolism	M00048	De novo purine biosynthesis, PRPP + glutamine => IMP	100.0	100.0	100.0	100.0	0.0
Nucleotide metabolism	Purine metabolism	M00049	Adenine ribonucleotide biosynthesis, IMP => ADP,ATP	100.0	100.0	100.0	100.0	0.0
Nucleotide metabolism	Purine metabolism	M00050	Guanine ribonucleotide biosynthesis, IMP => GDP,GTP	100.0	100.0	100.0	100.0	0.0
Nucleotide metabolism	Purine metabolism	M00053	Deoxyribonucleotide biosynthesis, ADP/GDP/CDP/UDP => dATP/dGTP/dCTP/dUTP	100.0	100.0	100.0	100.0	0.0
Nucleotide metabolism	Pyrimidine metabolism	M00938	Pyrimidine deoxyribonucleotide biosynthesis, UDP => dTTP	100.0	100.0	100.0	100.0	0.0

3.4.3 Figure 9

# Reading strain order and label formatting 
StrainOrder <- read_csv("data/genome_lists/CorPGA_MetabolicAnalysis_PlotOrder.csv", show_col_types = FALSE)

# When using stepwise_module_completeness some modules can be = 0 for all genomes but still have pathwise_module_completeness > 0. We are removing those modules.
Modules_cleaned <- Modules %>%
  group_by(module) %>%
  filter(!all(stepwise_module_completeness == 0)) %>%
  ungroup() %>%
  mutate(across(where(is.factor), droplevels))

all_zero_modules <- setdiff(Modules$module, Modules_cleaned$module)
all_zero_subcategory <- setdiff(Modules$module_subcategory, Modules_cleaned$module_subcategory)


# Calculating mean_stepwise_module_completeness and preparing plot labels 
Modules_Means = Modules_cleaned %>% 
  group_by(modules_label, genome_name, .drop = FALSE) %>%
  summarise(mean_stepwise_module_completeness = mean(stepwise_module_completeness)) %>%
  mutate(mean_stepwise_module_completeness = coalesce(mean_stepwise_module_completeness, 0)) %>%
  mutate(species_name = str_extract(genome_name, "^[^_]+")) %>%
  mutate(module_category = str_extract(modules_label, "^[^_]+_[^_]+")) %>%
  mutate(module_category = str_replace_all(module_category, "_", " ")) %>%
  mutate(modules_final_label = str_replace(modules_label, "^[^_]+_[^_]+_", "")) %>%
  mutate_at(c("module_category", "modules_final_label"), factor) 

Modules_Means <- left_join(Modules_Means, StrainOrder, by = "genome_name")
Modules_Means$species_name <- factor(Modules_Means$species_name, levels = c("Cpr","Cps","Cac","Ctu","Dpi"))
Modules_Means$genome_name <- factor(Modules_Means$genome_name, levels = StrainOrder$genome_name)
Modules_Means$styled_genome_name <- factor(Modules_Means$styled_genome_name, levels = StrainOrder$styled_genome_name)

# Colors for plot
getPalette <- colorRampPalette(brewer.pal(8, "Set1"))
colorsSpecies <- c("#FF8C00", "#FF0000", "#6A329F", "#3B54A4", "#DAE9F8")

# Heatmap plot
plot <- ggplot(Modules_Means, aes(x = modules_final_label, y = styled_genome_name, fill = mean_stepwise_module_completeness)) + 
  geom_tile(color = "white") +
  facet_grid2(species_name ~ module_category, scales = "free", space = "free", switch = "both",
              strip = strip_themed(background_x = elem_list_rect(fill = getPalette(12)),
                                   background_y = elem_list_rect(fill = colorsSpecies))) +
  labs(x = "", y = "") +
  scale_y_discrete(expand = c(0,0), position = "right", limits = rev) +
  scale_x_discrete(expand = c(0,0)) +
  scale_fill_viridis(direction = -1, option = "A", begin = 0.2) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.background = element_rect(colour = "black", fill = NA),
        strip.text.y.left = element_text(angle = 0, size = 12, face = "bold.italic"),
        strip.text.x = element_text(size = 0),
        axis.text.y.right = element_markdown(size = 4, color = "black"),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 6, color = "black"),
        panel.spacing = unit(0, 'pt'),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 7),
        legend.direction = "vertical",
        legend.position = c(0.97, -0.19),
        legend.justification = c(0.4, 0),
        legend.background = element_blank()) +
  guides(fill = guide_colourbar(title = "\nAverage\nstepwise\nmodule\ncompleteness", 
                                title.position = "left", title.hjust = 0.5, barwidth = 0.8, barheight = 3))

# Module categories legend
legend <- ggplot(Modules_Means, aes(x = module_category, y = mean_stepwise_module_completeness, fill = module_category)) +
  geom_bar(stat = "identity") + 
  scale_fill_manual(values = getPalette(12)) + 
  theme(legend.title = element_blank(),
        legend.position = "bottom",
        legend.text = element_text(size = 6),
        legend.key.size = unit(0.3,"line")) +
  guides(fill = guide_legend(ncol = 3))

# Strains origen legend
text <- "<span style='font-size:9px'> USA isolate:</span> <span style='color:#ed1c24; font-size:13px;'>▲</span><br><span style='font-size:9px'> BWA isolate:</span> <span style='color:#6ca9d4; font-size:17px;'>●</span>"
text.p <- textbox_grob(text)

# Final figure
Figure <- ggarrange(plot + theme(plot.margin = unit(c(0.5,0,1.5,3), "lines")), 
                    ggarrange(
                      as_ggplot(get_legend(legend)) + theme(plot.margin = unit(c(-1,0,1,2), "lines")),
                      as_ggplot(text.p) + theme(plot.margin = unit(c(0,0,2,1.5), "lines")),
                      nrow = 1, widths = c(0.8,0.2)),
                    nrow = 2, heights = c(1,0.02))
Figure

ggsave(plot = Figure, filename = "data/Anvio8/Metabolic_Analysis/Fig_Tables/CorPGA_Heatmap.png", height = 9, width = 6, device = 'png', dpi = 600)
ggsave(plot = Figure, filename = "data/Anvio8/Metabolic_Analysis/Fig_Tables/CorPGA_Heatmap.svg", height = 9, width = 6, device = 'svg', dpi = 600)

3.4.3.1 Percentages in Figure 9

Modules_MeansSummary = Modules_Means %>% 
  group_by(modules_final_label, species_name, .drop = TRUE) %>%
  summarise(mean_bySpecies = mean(mean_stepwise_module_completeness))

summaryPlot <- ggplot(Modules_MeansSummary, aes(x = mean_bySpecies, y = modules_final_label, fill = species_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = colorsSpecies) +
  scale_x_continuous(expand = c(0,0)) +
  labs(x = "", y = "") +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.y = element_text(hjust = 1, size = 10, color = "black"))
summaryPlot

saveWidget(ggplotly(summaryPlot), "data/Anvio8/Metabolic_Analysis/Fig_Tables/Heatmap_Percents_Interactive.html", selfcontained = TRUE)

3.4.4 Figure 6A

The Assimilatory sulfate reduction, sulfate => H2S module (M00176) had a stepwise completion score = 0.5 in most of the Corynebacterium genomes. However, this module was completely absent in all the C. pseudodiphtheriticum strains from the U.S. and 4 from Botswana. We look for KO assignments for the enzymes encoded in the fpr2cysIXHDNYZ operon (Rückert et al., 2005) that are not part of the M00176 module definition. Genes annotated as K00528 (fpr) were identified in all the Corynebacterium genomes. A manual exploration of the gene annotations in the genomic region around the fpr in all C. pseudodiphtheriticum strains, allowed us to identified hchA as a common gene downstream the fpr2cysIXHDNYZ operon.

This code uses BEDTools (Quinlan, 2014; Quinlan and Hall, 2010) to extract annotations and sequences for the fpr to hchA region for all C. pseudodiphtheriticum genomes:

path_i="data/Anvio8/Prokka_out"
path_o="data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA"
mkdir -p $path_o

gene_1="hchA"
gene_2="fprA"
species="Cps"
label="ASR_Cps_region"

fasta_files=()

for file in $path_i/$species*.gff; do
    filename=$(basename "$file")
    filename_noext="${filename%.*}"
    fasta_output="$path_o/${filename_noext}_region_sequence.fasta"

    awk '/^##FASTA/{flag=1;next} {if(!flag) print}' "$file" > "$path_o/${filename_noext}_annotations.gff"
    awk '/^##FASTA/{flag=1} {if(flag) print}' "$file" > "$path_o/${filename_noext}_sequence_temp.fasta"
    grep -v '^##FASTA' "$path_o/${filename_noext}_sequence_temp.fasta" > "$path_o/${filename_noext}_sequence.fasta"
    rm "$path_o/${filename_noext}_sequence_temp.fasta"

    grep -E 'Name='$gene_1'|Name='$gene_2 "$path_o/${filename_noext}_annotations.gff" | sort -k4,4n | awk 'NR==1 {start=$4} NR==2 {end=$5} END {print $1 "\t" start "\t" end}' > "$path_o/${filename_noext}_region_coordinates.bed"
    bedtools intersect -a "$path_o/${filename_noext}_annotations.gff" -b "$path_o/${filename_noext}_region_coordinates.bed" -wa > "$path_o/${filename_noext}_region_annotations.gff"
    bedtools getfasta -fi "$path_o/${filename_noext}_sequence.fasta" -bed "$path_o/${filename_noext}_region_coordinates.bed" | awk -v filename="$filename_noext" '/^>/ {print ">" filename "|" substr($0, 2)} !/^>/' > "$fasta_output"

    fasta_files+=("$fasta_output")
done

We use MAFFT (Katoh, 2002; Katoh and Standley, 2013) to align the extracted sequences.

#conda activate mafft 

cat "${fasta_files[@]}" > "$path_o/$label.fasta"
mafft --adjustdirection --reorder "$path_o/$label.fasta" > "$path_o/${label}_align.fasta"
sed 's/|.*//;s/_R_//' "$path_o/${label}_align.fasta" > "$path_o/${label}_align_clean.fasta" #Clean up header names

We used AliView (Larsson, 2014) to inspect the output file and to reverse complement the alignment to match the operon order. We saved the reversed complemented version of the alignment as ASR_Cps_region_align_clean_R.fasta.

Next we count the alignment length and create a reference sequence with all Ns and header named “Genes”. This will be used as a base for gene labeling and as a reference so visualize only the gaps to the consensus.

count=$(awk 'NR==2 {print length($0)}' "$path_o/${label}_align_clean_R.fasta") 
filename="$path_o/${label}_reference.fasta"

echo ">Genes" >> "$filename"
echo "$(printf 'N%.0s' $(seq 1 $count))" >> "$filename"
cat "$path_o/${label}_reference.fasta" "$path_o/${label}_align_clean_R.fasta" > "$path_o/${label}_align_clean_R_Ref.fasta"

echo ">-----------Genes" > "$filename"
echo "$(printf 'N%.0s' $(seq 1 $count))" >> "$filename"
echo ">" >> "$filename"
echo "$(printf '-%.0s' $(seq 1 $count))" >> "$filename"

The version of the alignment file with the reference sequence is the input for the visualization with the msavisr package (Vragh, 2022).

plot_ASR <- msavisr("data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region_align_clean_R_Ref.fasta", 
                myref = "Genes",
                refontop = TRUE,
                basecolors = c("black","black","white"))

plot_ASR <- plot_ASR + 
  theme(legend.position = "none",
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 14, color = "black"))

saveRDS(plot_ASR, "data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_plot.rds")

This code creates a reference sequence with gene locations and length proportional to the corresponding gene coordinates to use as a label.

#Manual list of gene coordinates generated by searching individual genes in the aligned fasta
genes <- list(c("-----------Genes", 11579:10901, "8_hchA"), 
              c("-----------Genes", 8816:7815, "7_cysZ"), 
              c("-----------Genes", 7806:6988, "6_cysY"), 
              c("-----------Genes", 6971:5595, "5_cysN"),
              c("-----------Genes", 5595:4513, "4_cysD"),
              c("-----------Genes", 4516:3458, "3_cysH"),
              c("-----------Genes", 3461:1788, "2_cysI"),
              c("-----------Genes", 1424:1, "1_fpr2")
)

plot_ASRlabels <- msavisr("data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region_reference.fasta", 
                myref = "-----------Genes",
                refontop = FALSE,
                basecolors = c("black","black","white"),
                myroi = genes, hroi = 1,
                roicolors = c("#9fd122","#00a84a","#f88502","#00a9f2","#6f88c9","#a94488","#c8c3e3", "gray80"))

# Values manually calculated to place the labels in the middle of each gene 
plot_ASRlabels <- plot_ASRlabels + 
  annotate("text", x = 678, y = 1, label = "fpr2", size = 7, fontface = 2) +
  annotate("text", x = 2624, y = 1, label = "cysI", size = 7, fontface = 2) +
  annotate("text", x = 3987, y = 1, label = "cysH", size = 7, fontface = 2) +
  annotate("text", x = 5054, y = 1, label = "cysD", size = 7, fontface = 2) +
  annotate("text", x = 6283, y = 1, label = "cysN", size = 7, fontface = 2) +
  annotate("text", x = 7397, y = 1, label = "cysY", size = 7, fontface = 2) +
  annotate("text", x = 8315, y = 1, label = "cysZ", size = 7, fontface = 2) +
  annotate("text", x = 11240, y = 1, label = "hchA", size = 7, fontface = 2) +
  theme(legend.position = "none",
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 12))

This generates the combined figure that was edited in Ilustrator for final edits, including adding symbols for the strains geographic location.

plot_ASR <- readRDS("data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_plot.rds")
FinalFigure <- ggarrange(plot_ASRlabels, plot_ASR, ncol = 1, heights = c(0.20, 1))
FinalFigure

ggsave(FinalFigure, file = "data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region.svg", height = 10, width = 20, dpi = 150)
ggsave(FinalFigure, file = "data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region.png", height = 10, width = 20, dpi = 150)

3.5 GC level analysis

We performed pangenomic level analysis in 107 genomes: the 102 nasal Corynebacterium genomes selected for metabolic analysis (Cac_ATCC_49726 & Cps_090104 were excluded for this analysis) plus the 5 Corynebacterium listed in Table S1D:

Corynebacterium glutamicum ATCC 13032 (Kyowa Hakko) (GCF_000011325.1)
Corynebacterium diphtheriae NCTC11397 (GCF_001457455.1)
Corynebacterium simulans PES1 (GCF_001586215.1)
Corynebacterium kroppenstedtii DSM 44385 (GCF_000023145.1)
Corynebacterium amycolatum FDAARGOS_1108 (GCF_016728725.1)

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"

anvi-gen-genomes-storage -e "data/genome_lists/CorPGA_Anvio_GenomeList_v01c.txt" -o $path_i/CorPGA-All4CorRefs-GENOMES.db --gene-caller Prodigal

anvi-db-info $path_i/CorPGA-All4CorRefs-GENOMES.db

anvi-pan-genome -g $path_i/CorPGA-All4CorRefs-GENOMES.db \
                -o $path_i/All4CorRefs -n PAN_All4CorRefs \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8 --I-know-this-is-not-a-good-idea

Create a bin collection named All4CorRefs with all Core GCs for all the Corynebacterium species using the visual interface. We do this because in order to use anvi-summarize we need to have at least one bin.

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"

anvi-display-pan -p $path_i/All4CorRefs/PAN_All4CorRefs-PAN.db -g $path_i/CorPGA-All4CorRefs-GENOMES.db         

anvi-summarize -p $path_i/All4CorRefs/PAN_All4CorRefs-PAN.db \
               -g $path_i/CorPGA-All4CorRefs-GENOMES.db \
               -o $path_i/All4CorRefs/Summaries \
               -C All4CorRefs

References

Bateman, A. (2000). The pfam protein families database. Nucleic Acids Research 28, 263–266.

Buchfink, B., Xie, C. and Huson, D. H. (2014). Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60.

Delmont, T. O. and Eren, A. M. (2018). Linking pangenomes and metagenomes: the Prochlorococcus metapangenome. PeerJ 6, e4320.

Drula, E., Garron, M.-L., Dogan, S., Lombard, V., Henrissat, B. and Terrapon, N. (2021). The carbohydrate-active enzyme database: functions and literature. Nucleic Acids Research 50, D571–D577.

Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLoS Computational Biology 7, e1002195.

Eren, A. M., Kiefl, E., Shaiber, A., Veseli, I., Miller, S. E., Schechter, M. S., Fink, I., Pan, J. N., Yousef, M., Fogarty, E. C., et al. (2020). Community-led, integrated, reproducible multi-omics with anvi’o. Nature Microbiology 6, 3–6.

Flores Ramos, S., Brugger, S. D., Escapa, I. F., Skeete, C. A., Cotton, S. L., Eslami, S. M., Gao, W., Bomar, L., Tran, T. H., Jones, D. S., et al. (2021). Genomic Stability and Genetic Defense Systems inDolosigranulum pigrum, a Candidate Beneficial Bacterium from the Human Microbiome. mSystems 6,.

Galperin, M. Y., Wolf, Y. I., Makarova, K. S., Vera Alvarez, R., Landsman, D. and Koonin, E. V. (2020). COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Research 49, D274–D281.

Kanehisa, M. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28, 27–30.

Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. and Ishiguro-Watanabe, M. (2022). KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Research 51, D587–D592.

Katoh, K. (2002). MAFFT: A novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic Acids Research 30, 3059–3066.

Katoh, K. and Standley, D. M. (2013). MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution 30, 772–780.

Larsson, A. (2014). AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 30, 3276–3278.

Mistry, J., Chuguransky, S., Williams, L., Qureshi, M., Salazar, G., Sonnhammer, E. L. L., Tosatto, S. C. E., Paladin, L., Raj, S., Richardson, L. J., et al. (2020). Pfam: The protein families database in 2021. Nucleic Acids Research 49, D412–D419.

Quinlan, A. R. (2014). BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Current Protocols in Bioinformatics 47,.

Quinlan, A. R. and Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842.

Rückert, C., Koch, D. J., Rey, D. A., Albersmeier, A., Mormann, S., Pühler, A. and Kalinowski, J. (2005). Functional genomics and expression analysis of the Corynebacterium glutamicum fpr2-cysIXHDNYZ gene cluster involved in assimilatory sulphate reduction. BMC Genomics 6,.

Shaiber, A., Willis, A. D., Delmont, T. O., Roux, S., Chen, L.-X., Schmid, A. C., Yousef, M., Watson, A. R., Lolans, K., Esen, Ö. C., et al. (2020). Functional and genetic markers of niche partitioning among enigmatic members of the human oral microbiome. Genome Biology 21,.

Tatusov, R. L. (2000). The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28, 33–36.

Veseli, I., Chen, Y. T., Schechter, M. S., Vanni, C., Fogarty, E. C., Watson, A. R., Jabri, B., Blekhman, R., Willis, A. D., Yu, M. K., et al. (2023). Microbes with higher metabolic independence are enriched in human gut microbiomes under stress. biorxiv.

Vragh (2022). Vragh/seqvisr: v0.2.7. Zenodo.