Supplemental Methods: anvi’o

library(tidyverse)
library(ggtext)
library(ggh4x)
library(RColorBrewer)
library(viridis)
library(grid)
library(ggpubr)
library(gridtext)
library(kableExtra)
library(seqinr)
library(reshape2)
library(plotly)
library(htmlwidgets)
library(seqvisr)

This notebook contains analyses using anvi’o version 8 (Delmont and Eren, 2018; Eren et al., 2020) (development branch). For more in-depth information about anvi’o go to their official online site here: https://anvio.org/

1 Functional annotations

We use as input (path_i) the anvi’o contigs database files (.db) that were generated in SupplementalMethods_Prokka_Annotations.

1.1 COG annotation

First we downloaded and set up the NCBI’s Clusters of Orthologous Groups database Galperin et al. (2020) using anvi-setup-ncbi-cogs. The default is the latest version, which is COG20, meaning that anvi’o will use the NCBI’s 2020 release of COGs to setup the database.

The anvi-run-ncbi-cogs command was used to annotate the contig databases. We used DIAMOND as searching method (Buchfink et al., 2014) with the recommended “sensitive” option.

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-ncbi-cogs -T 8 --search-with diamond -c $file;
done

1.2 KEGG annotation

First we downloaded and set up the KEGG KOfam (Kanehisa, 2000; Kanehisa et al., 2022) database using anvi-setup-kegg-data. We used the snapshot of the KEGG database dated 2023-09-22 (hash a2b5bde358bb). It contains 479 modules with 15116 entries.

The program anvi-run-kegg-kofams annotates genes in a given anvi’o contigs database with KEGG Orthology (KO) numbers via hits to the KEGG KOfam database. We used -E lower than 1e-15.

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-kegg-kofams -T 8 --log-bitscores -E 1e-15 --just-do-it -c $file;
done

The option --log-bitscores records the bit scores values. They get saved to the main folder for the repo, so we moved them to a new subfolder

path_i <- "."
path_o <- "data/Anvio8/KEGG_bitscores"
dir.create(file.path("data/Anvio8/KEGG_bitscores"), recursive = TRUE, showWarnings=FALSE)
files_to_move <- list.files(path_i, pattern = "_bitscores.txt$", full.names = TRUE)

for (file in files_to_move) {
  new_path <- file.path(path_o, basename(file))
  print(new_path)
  file.rename(file, new_path)
}

1.3 PFAM annotation

First you need to download and set up the PFAM database (Bateman, 2000; Mistry et al., 2020) using anvi-setup-pfams: this is something you will do only once. We used Pfam version 36.0 (2023-07).

The program anvi-run-kegg-pfam annotates genes in a given anvi’o contigs database with the PFAM database.

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-pfams -T 8 -c $file;
done

1.4 Running HMMs

When you run the following command the occurrence of bacterial single-copy genes across your contigs is added to your contigs database (Eddy, 2011).

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-hmms -T 8 -c $file --just-do-it;
done

1.5 Running CAZyme

First you need to download and set up the CAZyme database (Drula et al., 2021) using anvi-setup-cazymes: this is something you will do only once. We used Pfam version 11.

The program anvi-run-cazymes annotates genes in your contigs-db with the dbCAN CAZyme database.

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"

for file in $path_i/*.db; do
    anvi-run-cazymes -T 8 -c $file;
done

1.6 Annotation Summaries

This generates basic stats on the annotations:

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"
path_o="data/Anvio8/Annotation_summaries"
mkdir -p "$path_o"

for file in $path_i/*.db; do
    FILENAME=`basename ${file%.*}`
    anvi-display-contigs-stats $file --report-as-text -o $path_o/stats_$FILENAME.txt;
done

This provides basic info on the contigs databases:

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"
path_o="data/Anvio8/Annotation_summaries"

for file in $path_i/*.db; do
    anvi-db-info $file &>> $path_o/anvi-db-info_log.txt;
done

2 Pangenome Analysis

2.1 Generating an anvi’o genomes storage database

The program anvi-gen-genomes-storage requires a .txt file with the names and path locations of the genomes to be included in the genome storage database. The storage database generated with CorPGA_Anvio_GenomeList_v01a.txt includes the 104 Corynebacterium genomes selected for pangenomic analysis (The three C. macginleyi genomes were excluded).

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"
mkdir -p "$path_i"

anvi-gen-genomes-storage -e "data/genome_lists/CorPGA_Anvio_GenomeList_v01a.txt" -o $path_i/CorPGA-All4Cor-GENOMES.db --gene-caller Prodigal

This provides basic info on the generated genomes storage database:

#conda activate anvio-dev

anvi-db-info $path_i/CorPGA-All4Cor-GENOMES.db

2.2 Running the pangenomic analyses

In order to analyze the pangenome individually for each of the four Corynebacterium species we created a .txt file including the list of genomes for each species and ran anvi-pan-genome with the -G flag:

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"
path_g="data/genome_lists"

anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
                -G $path_g/CorPGA_Anvio_GenomeList_PangenomeCpr.txt \
                -o $path_i/Only_Cpr -n PAN_Cpr \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8
                
anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
                -G $path_g/CorPGA_Anvio_GenomeList_PangenomeCps.txt \
                -o $path_i/Only_Cps -n PAN_Cps \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8

anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
                -G $path_g/CorPGA_Anvio_GenomeList_PangenomeCac.txt \
                -o $path_i/Only_Cac -n PAN_Cac \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8
                
anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
                -G $path_g/CorPGA_Anvio_GenomeList_PangenomeCtu.txt \
                -o $path_i/Only_Ctu -n PAN_Ctu \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8                

2.3 Displaying the pangenomes

This will display the pangenome in your default internet browser. Anvi’o says google chrome is the most compatible with the anvi’o interactive interface. You need to pick one of the -p inputs for each PAN.db file:

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"

anvi-display-pan -p $path_i/Only_Cpr/PAN_Cpr-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db

anvi-display-pan -p $path_i/Only_Cps/PAN_Cps-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db

anvi-display-pan -p $path_i/Only_Cac/PAN_Cac-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db

anvi-display-pan -p $path_i/Only_Ctu/PAN_Ctu-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db

If you are using Windows and it cannot display the interactive page then try copy and pasting either 127.0.0.1:8080 or http://localhost:8080/ into your browser. Do not use sudo in the front of the command either.

Once it opens in the browser, you can adjust image settings and click the “DRAW” icon to view it. For help using the interactive interface see http://merenlab.org/2016/02/27/the-anvio-interactive-interface/

We manually reordered the layers to match the phylogenomic trees per species using the click and drag option on the interactive interface. The order names is inverse to the layers in the concentric diagram. When all the layers were re-ordered to match the phylogenomic tree we clicked the draw button and saved the svg file.

2.4 Creating bin collections

2.4.1 Core vs. Accessory

In order to define the Core vs Accessory pangenome for each of the four species level pangenomes we use anvi’o interactive interface. In the “Bins” tab we created the corresponding bins. Using the “Search” tab we used “Search gene clusters using filters” and “Append splits to selected bin” to create bins. The following tables summarizes how the bins in the collection named CorevsAccessory were created (last column indicates the criteria used on each search):

2.4.1.1 C. propinquum pangenome

Pangenome Bin # of Gene Clusters/Pangenome Percentage Gene Cluster occurs in # of genomes
Core 1824/3108 58.7% 19
Single Copy Core 1715/3108 55.2% 19
Multicopy Core 109/3108 3.5% 19
Soft Core 39/3108 1.3% 18
Shell 689/3108 22.2% 3-17
Cloud 556/3108 17.9% 1-2

2.4.1.2 C. pseudodiphtheriticum pangenome

Pangenome Bin # of Gene Clusters/Pangenome Percentage Gene Cluster occurs in # of genomes
Core 1714/3590 47.7% 43
Single Copy Core 1488/3590 41.4% 43
Multicopy Core 226/3590 6.3% 43
Soft Core 46/3590 1.3% 40-42
Shell 972/3590 27.1% 3-39
Cloud 858/3590 23.9% 1-2

2.4.1.3 C. accolens pangenome

Pangenome Bin # of Gene Clusters/Pangenome Percentage Gene Cluster occurs in # of genomes
Core 1904/3427 55.6% 34
Single Copy Core 1734/3427 50.6% 34
Multicopy Core 170/3427 5% 34
Soft Core 77/3427 2.2% 32-33
Shell 676/3427 19.7% 3-31
Cloud 770/3427 22.5% 1-2

2.4.1.4 C. tuberculostearicum pangenome

Pangenome Bin # of Gene Clusters/Pangenome Percentage Gene Cluster occurs in # of genomes
Core 1915/2907 65.9% 8
Single Copy Core 1860/2907 64% 8
Multicopy Core 55/2907 1.9% 8
Soft Core 47/2907 1.6% 7
Shell 270/2907 9.3% 3-6
Cloud 671/2907 23.1% 1-2

2.4.2 PPanGGOLiN Partitions

PPanGGOLiN partitions were imported as a collection of bins named PPanGGOLiNpartitions as described in PPanGGOLiN analysis.

2.5 Summarizing the pangenome files

Using the program anvi-summarize we exported summary tables for each of the four species level pangenomes.

The resulting summaries contain a gene_clusters_summary.txt.gz file that links each gene to gene clusters, genomes, functions, layers, and bins selected from the interface. We used the flag -C to selected the CorevsAccessory collection described above:

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"

anvi-summarize -p $path_i/Only_Cpr/PAN_Cpr-PAN.db \
               -g $path_i/CorPGA-All4Cor-GENOMES.db \
               -o $path_i/Only_Cpr/Summaries \
               -C CorevsAccessory

anvi-summarize -p $path_i/Only_Cps/PAN_Cps-PAN.db \
               -g $path_i/CorPGA-All4Cor-GENOMES.db \
               -o $path_i/Only_Cps/Summaries \
               -C CorevsAccessory
               
anvi-summarize -p $path_i/Only_Cac/PAN_Cac-PAN.db \
               -g $path_i/CorPGA-All4Cor-GENOMES.db \
               -o $path_i/Only_Cac/Summaries \
               -C CorevsAccessory

anvi-summarize -p $path_i/Only_Ctu/PAN_Ctu-PAN.db \
               -g $path_i/CorPGA-All4Cor-GENOMES.db \
               -o $path_i/Only_Ctu/Summaries \
               -C CorevsAccessory

The output files were used for the COG functional analysis based on PPanGGOLiN partitions.

3 KEGG Metabolic Analysis

3.1 Estimation of KEGG module completeness

Using anvi-estimate-metabolism we predicted KEGG module completeness as described in (Veseli et al., 2023) for the the organisms listed in CorPGA_MetabolicAnalysis_GenomeList_v01.txt. This list includes:

  • The 102 Corynebacterium genomes selected for metabolic analysis (Cac_ATCC_49726 & Cps_090104 were excluded for this analysis, as well as all three C. macginleyi genomes)
  • Plus 27 for the 28 Dolosigranulum pigrum genomes described in (Flores Ramos et al., 2021) (KPL3086 was not included in the analysis due to being highly similar to KPL3065).
  • As well as the 9 reference genomes with original NCBI annotations listed in Table S3D.

We generated outputs both in long tabular and matrix format with the KO hits and KEGG modules completeness scores.

#conda activate anvio-dev

mkdir -p "data/Anvio8/Metabolic_Analysis"
path_o="data/Anvio8/Metabolic_Analysis"

txt_file="data/genome_lists/CorPGA_MetabolicAnalysis_GenomeList_v01.txt"
prefix=CorPGA
path_o="data/Anvio8/Metabolic_Analysis"
mkdir -p "$path_o"

anvi-estimate-metabolism -e $txt_file \
                         -O $path_o/$prefix --matrix-format --include-metadata

anvi-estimate-metabolism -e $txt_file \
                         -O $path_o/$prefix --add-copy-number --output-modes modules,module_paths,module_steps,hits 

3.2 Module Enrichment Analysis

The CorPGA_modules.txt file generated in the previous step is used by anvi-compute-metabolic-enrichment to calculate KEGG module enrichment (Shaiber et al., 2020) across different groups of genomes. Enrichment was explored across different genome groupings as described in the following files:

  • CorPGA_MetabolicAnalysis_Groups_4Cor.txtincludes 102 genomes for the 4 nasal Corynebacterium grouped by their species.
  • CorPGA_MetabolicAnalysis_Groups_CorDpi.txt adds to the previous file an extra group with the 27 D. pigrum genomes.
  • CorPGA_MetabolicAnalysis_Groups_Cps.txt includes the 41 C. pseudodiphtheriticum genomes selected to study metabolic differences across geographic locations (BWA vs. USA grouping).
  • CorPGA_MetabolicAnalysis_Groups_References.txt adds to the previous file the 9 NCBI reference genomes as 9 new groups.
#conda activate anvio-dev

csv_file_modules="data/Anvio8/Metabolic_Analysis/CorPGA_modules.txt"
path_o="data/Anvio8/Metabolic_Analysis/Enrichment_Modules"
mkdir -p "$path_o"

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_4Cor.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_4Cor.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_CorDpi.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_CorDpi.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_Cps.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_Cps.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0
                                  
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_References.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_References.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0

3.3 KO Enrichment Analysis

We used anvi-display-functions to perform this functional enrichment (Shaiber et al., 2020). It runs the enrichment for the desired annotations (in this case KOfam). It saves the output to a system temp file that was recovered, renamed and saved into the same path_o along with the database file.

The same files where used to indicate grouping options equivalent to the ones used for the enrichment at the module level:

#conda activate anvio-dev

txt_file="data/genome_lists/CorPGA_MetabolicAnalysis_GenomeList_v01.txt"
path_o="data/Anvio8/Metabolic_Analysis/Enrichment_KOs"
mkdir -p "$path_o"

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_4Cor.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_4Cor.db
                       
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_CorDpi.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_CorDpi.db    
                       
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_Cps.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_Cps.db

txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_References.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_References.db  

This generates a list of all enriched KOs, including KOs that belong to already identified enriched modules. We are interested in identifying other KOs, other than the ones identified by studying the modules output. The next function filters out KOs that are part of the definition of modules listed in the enriched modules output.

get_KOs_not_in_enrich_modules <- function(modules, hits, enrich_modules, enrich_KOs, q_value_threshold) {

  # Get list of KOs that correspond with the modules on the enrichment file
  modules_subset <- modules %>% filter(module %in% enrich_modules$accession)
  module_definition_subset <- paste(modules_subset$module_definition, collapse = " ")
  module_definition_subset <- str_extract_all(module_definition_subset, "\\bK\\d+\\b")
  module_definition_subset <- unique(unlist(module_definition_subset))

  # Add a column to enrich_KOs indicating if the KO is the previous list
  enrich_KOs$inModule <- sapply(strsplit(as.character(enrich_KOs$accession), ","), function(x) all((x %in% module_definition_subset)))

  # Filter Enrich_KO based on inModule and q_value_threshold
  enrich_KOs_filtered <- enrich_KOs %>% 
    filter(inModule == FALSE) %>% 
    filter(adjusted_q_value < q_value_threshold)

  return(enrich_KOs_filtered)
}

Modules <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA_modules.txt", show_col_types = FALSE, col_types = cols(unique_enzymes_hit_counts = col_character(), gene_caller_ids_in_module = col_character()))
KOHits <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA_hits.txt", show_col_types = FALSE)

Enrich_Mod_4Cor <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_4Cor.txt", show_col_types = FALSE)
Enrich_KO_4Cor <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_4Cor.txt", show_col_types = FALSE)
Enrich_KO_4Cor_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_4Cor, Enrich_KO_4Cor, 0.1)
write_csv(Enrich_KO_4Cor_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_4Cor_Filtered.csv")

Enrich_Mod_CorDpi <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_CorDpi.txt", show_col_types = FALSE)
Enrich_KO_CorDpi <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_CorDpi.txt", show_col_types = FALSE)
Enrich_KO_CorDpi_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_CorDpi, Enrich_KO_CorDpi, 0.1)
write_csv(Enrich_KO_CorDpi_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_CorDpi_Filtered.csv")

Enrich_Mod_Cps <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_Cps.txt", show_col_types = FALSE)
Enrich_KO_Cps <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_Cps.txt", show_col_types = FALSE)
Enrich_KO_Cps_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_Cps, Enrich_KO_Cps, 0.1)
write_csv(Enrich_KO_Cps_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_Cps_Filtered.csv")

Enrich_Mod_References <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_References.txt", show_col_types = FALSE)
Enrich_KO_References <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_References.txt", show_col_types = FALSE)
Enrich_KO_References_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_References, Enrich_KO_References, 0.1)
write_csv(Enrich_KO_References_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_References_Filtered.csv")

3.4 Figures and Tables

3.4.1 Tables S3 and S4

From the output files generated by anvi-estimate-metabolism we include as Supplemental data the following tables:

  • TableS3A. KO hits: Corresponds to output file CorPGA_hits.txt and contains individual KO annotations for all genomes.
  • TableS4A. KEEG modules: Corresponds to output file CorPGA_modules.txt and contains KEGG module estimations for all genomes in a long format dataframe.
  • TableS4B. KEGG module stepwise matrix: Corresponds to output file CorPGA-module_stepwise_completeness-MATRIX.txt and contains KEGG module stepwise completion score for all genomes in a wide format dataframe (rows are modules, numeric columns are strains).
  • TableS4C. KEGG modules by Species: The following code was used to read the previous table and generate a version with stepwise module completeness averaged by species (rows are modules, numeric columns are species):
Modules_MatrixbyGenome <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA-module_stepwise_completeness-MATRIX.txt")

# Identify numerical and non-numerical columns
numeric_cols <- sapply(Modules_MatrixbyGenome, is.numeric)
non_numeric_cols <- names(Modules_MatrixbyGenome)[!numeric_cols]

# Extract unique prefixes (based on strain name including species name) from numerical column names
prefixes <- unique(sub("_.*", "", names(Modules_MatrixbyGenome)[numeric_cols]))

# Define function to calculate row-wise means for columns with the same prefix
calculate_prefix_means <- function(df, prefix) {
  df %>%
    select(starts_with(prefix)) %>%
    rowMeans(na.rm = TRUE) %>%
    round(2)
}

# Calculate the row-wise means for each prefix and bind them into a new data frame
averaged_df <- map_dfc(prefixes, ~ {
  tibble(!! .x := calculate_prefix_means(Modules_MatrixbyGenome, .x))
})

# Combine the non-numerical columns and the averaged columns by species
Modules_MatrixbySpecies <- bind_cols(Modules_MatrixbyGenome %>% select(all_of(non_numeric_cols)), averaged_df)

write.csv(Modules_MatrixbySpecies, "data/Anvio8/Metabolic_Analysis/Fig_Tables/CorPGA-module_stepwise_completeness-MATRIXbySpecies.csv", row.names = F)

As described above, metabolic enrichment was explored across different genome groupings, however only the analysis comparing the 4 nasal Corynebacterium as defined in CorPGA_MetabolicAnalysis_Groups_4Cor.txt was included in the final manuscript. We include the following tables:

  • TableS4D. Enriched_Modules_4Cor: Corresponds to the output file from anvi-compute-metabolic-enrichment
  • TableS3B. OtherEnriched_KOs_4Cor: Corresponds to the output file CorPGA_enriched_KOs_4Cor_Filtered.csv. Includes only the enriched KOs that are not part of any module already described in tab A.

3.4.2 Figure 4

# Read modules data and select only the genomes for the four nasal Corynebacterium species and Dolosigranulum pigrum. Make appropriate labeling columns
Modules <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA_modules.txt", show_col_types = FALSE, col_types = cols(unique_enzymes_hit_counts = col_character(), gene_caller_ids_in_module = col_character())) %>% 
  filter(!grepl("^GCF", genome_name)) %>%
  mutate(module_class_short = paste0("(", substr(module_class, 1, 1), ")")) %>%
  mutate(species_name = str_extract(genome_name, "^[^_]+")) %>%
  unite(modules_label, c(module_class_short, module_category, module_subcategory), sep = "_", remove = FALSE) %>% 
  mutate_at(c("species_name", "genome_name", "module", "module_name", "module_category", "module_subcategory", "module_class", "modules_label"), factor)

# Summary tally table of fully stepwise complete modules by species
ModulesTable <- Modules %>%
  filter(stepwise_module_completeness == 1) %>%
  group_by(module_category, module_subcategory, module, module_name, species_name) %>%
  tally() 

# Table converted to wide and percentages calculated based on number of genomes per species
ModulesTableWide <- ModulesTable %>%
  pivot_wider(names_from = species_name, values_from = n, values_fill = 0)

ModulesTableWide$Cpr = round(100*(ModulesTableWide$Cpr/19), 1)
ModulesTableWide$Cps = round(100*(ModulesTableWide$Cps/42), 1)
ModulesTableWide$Cac = round(100*(ModulesTableWide$Cac/33), 1)
ModulesTableWide$Ctu = round(100*(ModulesTableWide$Ctu/8), 1)
ModulesTableWide$Dpi = round(100*(ModulesTableWide$Dpi/27), 1)

# Modules with less than 12.5% across all species eliminated
ModulesTableWide <- ModulesTableWide %>%
  select(module_category, module_subcategory, module, module_name, Cpr, Cps, Cac, Ctu, Dpi) %>%
  filter(rowSums(across(c(Cpr, Cps, Cac, Ctu, Dpi)) > 12.5) > 0)

# Version of the table to calculate the percentage of complete modules across the four Corynebacterium species:
ModulesData <- ModulesTable %>%
  filter(species_name != "Dpi") %>%
  group_by(module_category, module_subcategory, module, module_name) %>%
  summarise(total = sum(n)) %>%
  mutate(percent = 100*total/102)

write_csv(ModulesTableWide, "data/Anvio8/Metabolic_Analysis/Fig_Tables/TableKEGG.csv")
kable(ModulesTableWide)
module_category module_subcategory module module_name Cpr Cps Cac Ctu Dpi
Amino acid metabolism Arginine and proline metabolism M00015 Proline biosynthesis, glutamate => proline 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Arginine and proline metabolism M00028 Ornithine biosynthesis, glutamate => ornithine 100.0 100.0 100.0 50.0 0.0
Amino acid metabolism Arginine and proline metabolism M00844 Arginine biosynthesis, ornithine => arginine 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Arginine and proline metabolism M00970 Proline degradation, proline => glutamate 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Aromatic amino acid metabolism M00022 Shikimate pathway, phosphoenolpyruvate + erythrose-4P => chorismate 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Aromatic amino acid metabolism M00023 Tryptophan biosynthesis, chorismate => tryptophan 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Branched-chain amino acid metabolism M00019 Valine/isoleucine biosynthesis, pyruvate => valine / 2-oxobutanoate => isoleucine 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Branched-chain amino acid metabolism M00432 Leucine biosynthesis, 2-oxoisovalerate => 2-oxoisocaproate 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Branched-chain amino acid metabolism M00570 Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Cysteine and methionine metabolism M00017 Methionine biosynthesis, aspartate => homoserine => methionine 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Cysteine and methionine metabolism M00021 Cysteine biosynthesis, serine => cysteine 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Histidine metabolism M00026 Histidine biosynthesis, PRPP => histidine 100.0 100.0 90.9 0.0 0.0
Amino acid metabolism Histidine metabolism M00045 Histidine degradation, histidine => N-formiminoglutamate => glutamate 100.0 100.0 100.0 100.0 0.0
Amino acid metabolism Lysine metabolism M00016 Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine 100.0 100.0 100.0 50.0 0.0
Amino acid metabolism Serine and threonine metabolism M00018 Threonine biosynthesis, aspartate => homoserine => threonine 100.0 100.0 100.0 100.0 81.5
Amino acid metabolism Serine and threonine metabolism M00020 Serine biosynthesis, glycerate-3P => serine 100.0 100.0 100.0 87.5 0.0
Amino acid metabolism Serine and threonine metabolism M00621 Glycine cleavage system 0.0 0.0 100.0 100.0 100.0
Biosynthesis of terpenoids and polyketides Polyketide sugar unit biosynthesis M00793 dTDP-L-rhamnose biosynthesis 100.0 100.0 100.0 100.0 0.0
Biosynthesis of terpenoids and polyketides Terpenoid backbone biosynthesis M00096 C5 isoprenoid biosynthesis, non-mevalonate pathway 0.0 0.0 100.0 100.0 0.0
Biosynthesis of terpenoids and polyketides Terpenoid backbone biosynthesis M00364 C10-C20 isoprenoid biosynthesis, bacteria 0.0 0.0 100.0 100.0 0.0
Biosynthesis of terpenoids and polyketides Terpenoid backbone biosynthesis M00365 C10-C20 isoprenoid biosynthesis, archaea 0.0 0.0 100.0 100.0 0.0
Carbohydrate metabolism Central carbohydrate metabolism M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate 100.0 100.0 100.0 100.0 100.0
Carbohydrate metabolism Central carbohydrate metabolism M00002 Glycolysis, core module involving three-carbon compounds 100.0 100.0 100.0 100.0 100.0
Carbohydrate metabolism Central carbohydrate metabolism M00003 Gluconeogenesis, oxaloacetate => fructose-6P 100.0 100.0 100.0 100.0 0.0
Carbohydrate metabolism Central carbohydrate metabolism M00004 Pentose phosphate pathway (Pentose phosphate cycle) 100.0 100.0 100.0 100.0 77.8
Carbohydrate metabolism Central carbohydrate metabolism M00005 PRPP biosynthesis, ribose 5P => PRPP 100.0 100.0 100.0 100.0 100.0
Carbohydrate metabolism Central carbohydrate metabolism M00006 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P 100.0 100.0 100.0 100.0 100.0
Carbohydrate metabolism Central carbohydrate metabolism M00007 Pentose phosphate pathway, non-oxidative phase, fructose 6P => ribose 5P 100.0 100.0 100.0 100.0 77.8
Carbohydrate metabolism Central carbohydrate metabolism M00009 Citrate cycle (TCA cycle, Krebs cycle) 100.0 100.0 100.0 100.0 0.0
Carbohydrate metabolism Central carbohydrate metabolism M00010 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate 100.0 100.0 100.0 100.0 0.0
Carbohydrate metabolism Central carbohydrate metabolism M00011 Citrate cycle, second carbon oxidation, 2-oxoglutarate => oxaloacetate 100.0 100.0 100.0 100.0 0.0
Carbohydrate metabolism Central carbohydrate metabolism M00307 Pyruvate oxidation, pyruvate => acetyl-CoA 100.0 69.0 100.0 100.0 0.0
Carbohydrate metabolism Other carbohydrate metabolism M00549 Nucleotide sugar biosynthesis, glucose => UDP-glucose 0.0 0.0 0.0 100.0 0.0
Carbohydrate metabolism Other carbohydrate metabolism M00550 Ascorbate degradation, ascorbate => D-xylulose-5P 0.0 0.0 0.0 0.0 37.0
Carbohydrate metabolism Other carbohydrate metabolism M00554 Nucleotide sugar biosynthesis, galactose => UDP-galactose 100.0 100.0 100.0 100.0 0.0
Carbohydrate metabolism Other carbohydrate metabolism M00632 Galactose degradation, Leloir pathway, galactose => alpha-D-glucose-1P 100.0 100.0 100.0 100.0 0.0
Carbohydrate metabolism Other carbohydrate metabolism M00854 Glycogen biosynthesis, glucose-1P => glycogen/starch 0.0 0.0 0.0 0.0 96.3
Carbohydrate metabolism Other carbohydrate metabolism M00909 UDP-N-acetyl-D-glucosamine biosynthesis, prokaryotes, glucose => UDP-GlcNAc 100.0 100.0 100.0 100.0 100.0
Energy metabolism ATP synthesis M00151 Cytochrome bc1 complex respiratory unit 100.0 100.0 100.0 100.0 0.0
Energy metabolism ATP synthesis M00155 Cytochrome c oxidase, prokaryotes 100.0 100.0 100.0 100.0 0.0
Energy metabolism ATP synthesis M00157 F-type ATPase, prokaryotes and chloroplasts 100.0 100.0 100.0 100.0 0.0
Energy metabolism Carbon fixation M00579 Phosphate acetyltransferase-acetate kinase pathway, acetyl-CoA => acetate 0.0 0.0 100.0 100.0 100.0
Lipid metabolism Fatty acid metabolism M00083 Fatty acid biosynthesis, elongation 100.0 100.0 0.0 0.0 96.3
Lipid metabolism Fatty acid metabolism M00086 beta-Oxidation, acyl-CoA synthesis 100.0 100.0 100.0 100.0 0.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00120 Coenzyme A biosynthesis, pantothenate => CoA 100.0 100.0 100.0 100.0 100.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00121 Heme biosynthesis, plants and bacteria, glutamate => heme 100.0 100.0 100.0 100.0 0.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00123 Biotin biosynthesis, pimeloyl-ACP/CoA => biotin 21.1 92.9 100.0 0.0 0.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00125 Riboflavin biosynthesis, plants and bacteria, GTP => riboflavin/FMN/FAD 100.0 100.0 100.0 100.0 14.8
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00126 Tetrahydrofolate biosynthesis, GTP => THF 0.0 0.0 100.0 100.0 0.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00140 C1-unit interconversion, prokaryotes 100.0 97.6 100.0 100.0 100.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00577 Biotin biosynthesis, BioW pathway, pimelate => pimeloyl-CoA => biotin 21.1 92.9 100.0 0.0 0.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00881 Lipoic acid biosynthesis, plants and bacteria, octanoyl-ACP => dihydrolipoyl-E2/H 100.0 100.0 100.0 100.0 0.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00899 Thiamine salvage pathway, HMP/HET => TMP 0.0 0.0 100.0 100.0 100.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00916 Pyridoxal-P biosynthesis, R5P + glyceraldehyde-3P + glutamine => pyridoxal-P 0.0 0.0 100.0 100.0 0.0
Metabolism of cofactors and vitamins Cofactor and vitamin metabolism M00926 Heme biosynthesis, bacteria, glutamyl-tRNA => coproporphyrin III => heme 100.0 100.0 100.0 100.0 0.0
Nucleotide metabolism Purine metabolism M00048 De novo purine biosynthesis, PRPP + glutamine => IMP 100.0 100.0 100.0 100.0 0.0
Nucleotide metabolism Purine metabolism M00049 Adenine ribonucleotide biosynthesis, IMP => ADP,ATP 100.0 100.0 100.0 100.0 0.0
Nucleotide metabolism Purine metabolism M00050 Guanine ribonucleotide biosynthesis, IMP => GDP,GTP 100.0 100.0 100.0 100.0 0.0
Nucleotide metabolism Purine metabolism M00053 Deoxyribonucleotide biosynthesis, ADP/GDP/CDP/UDP => dATP/dGTP/dCTP/dUTP 100.0 100.0 100.0 100.0 0.0
Nucleotide metabolism Pyrimidine metabolism M00938 Pyrimidine deoxyribonucleotide biosynthesis, UDP => dTTP 100.0 100.0 100.0 100.0 0.0

3.4.3 Figure 9

# Reading strain order and label formatting 
StrainOrder <- read_csv("data/genome_lists/CorPGA_MetabolicAnalysis_PlotOrder.csv", show_col_types = FALSE)

# When using stepwise_module_completeness some modules can be = 0 for all genomes but still have pathwise_module_completeness > 0. We are removing those modules.
Modules_cleaned <- Modules %>%
  group_by(module) %>%
  filter(!all(stepwise_module_completeness == 0)) %>%
  ungroup() %>%
  mutate(across(where(is.factor), droplevels))

all_zero_modules <- setdiff(Modules$module, Modules_cleaned$module)
all_zero_subcategory <- setdiff(Modules$module_subcategory, Modules_cleaned$module_subcategory)


# Calculating mean_stepwise_module_completeness and preparing plot labels 
Modules_Means = Modules_cleaned %>% 
  group_by(modules_label, genome_name, .drop = FALSE) %>%
  summarise(mean_stepwise_module_completeness = mean(stepwise_module_completeness)) %>%
  mutate(mean_stepwise_module_completeness = coalesce(mean_stepwise_module_completeness, 0)) %>%
  mutate(species_name = str_extract(genome_name, "^[^_]+")) %>%
  mutate(module_category = str_extract(modules_label, "^[^_]+_[^_]+")) %>%
  mutate(module_category = str_replace_all(module_category, "_", " ")) %>%
  mutate(modules_final_label = str_replace(modules_label, "^[^_]+_[^_]+_", "")) %>%
  mutate_at(c("module_category", "modules_final_label"), factor) 

Modules_Means <- left_join(Modules_Means, StrainOrder, by = "genome_name")
Modules_Means$species_name <- factor(Modules_Means$species_name, levels = c("Cpr","Cps","Cac","Ctu","Dpi"))
Modules_Means$genome_name <- factor(Modules_Means$genome_name, levels = StrainOrder$genome_name)
Modules_Means$styled_genome_name <- factor(Modules_Means$styled_genome_name, levels = StrainOrder$styled_genome_name)
# Colors for plot
getPalette <- colorRampPalette(brewer.pal(8, "Set1"))
colorsSpecies <- c("#FF8C00", "#FF0000", "#6A329F", "#3B54A4", "#DAE9F8")

# Heatmap plot
plot <- ggplot(Modules_Means, aes(x = modules_final_label, y = styled_genome_name, fill = mean_stepwise_module_completeness)) + 
  geom_tile(color = "white") +
  facet_grid2(species_name ~ module_category, scales = "free", space = "free", switch = "both",
              strip = strip_themed(background_x = elem_list_rect(fill = getPalette(12)),
                                   background_y = elem_list_rect(fill = colorsSpecies))) +
  labs(x = "", y = "") +
  scale_y_discrete(expand = c(0,0), position = "right", limits = rev) +
  scale_x_discrete(expand = c(0,0)) +
  scale_fill_viridis(direction = -1, option = "A", begin = 0.2) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.background = element_rect(colour = "black", fill = NA),
        strip.text.y.left = element_text(angle = 0, size = 12, face = "bold.italic"),
        strip.text.x = element_text(size = 0),
        axis.text.y.right = element_markdown(size = 4, color = "black"),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 6, color = "black"),
        panel.spacing = unit(0, 'pt'),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 7),
        legend.direction = "vertical",
        legend.position = c(0.97, -0.19),
        legend.justification = c(0.4, 0),
        legend.background = element_blank()) +
  guides(fill = guide_colourbar(title = "\nAverage\nstepwise\nmodule\ncompleteness", 
                                title.position = "left", title.hjust = 0.5, barwidth = 0.8, barheight = 3))

# Module categories legend
legend <- ggplot(Modules_Means, aes(x = module_category, y = mean_stepwise_module_completeness, fill = module_category)) +
  geom_bar(stat = "identity") + 
  scale_fill_manual(values = getPalette(12)) + 
  theme(legend.title = element_blank(),
        legend.position = "bottom",
        legend.text = element_text(size = 6),
        legend.key.size = unit(0.3,"line")) +
  guides(fill = guide_legend(ncol = 3))

# Strains origen legend
text <- "<span style='font-size:9px'> USA isolate:</span> <span style='color:#ed1c24; font-size:13px;'>▲</span><br><span style='font-size:9px'> BWA isolate:</span> <span style='color:#6ca9d4; font-size:17px;'>●</span>"
text.p <- textbox_grob(text)
# Final figure
Figure <- ggarrange(plot + theme(plot.margin = unit(c(0.5,0,1.5,3), "lines")), 
                    ggarrange(
                      as_ggplot(get_legend(legend)) + theme(plot.margin = unit(c(-1,0,1,2), "lines")),
                      as_ggplot(text.p) + theme(plot.margin = unit(c(0,0,2,1.5), "lines")),
                      nrow = 1, widths = c(0.8,0.2)),
                    nrow = 2, heights = c(1,0.02))
Figure

ggsave(plot = Figure, filename = "data/Anvio8/Metabolic_Analysis/Fig_Tables/CorPGA_Heatmap.png", height = 9, width = 6, device = 'png', dpi = 600)
ggsave(plot = Figure, filename = "data/Anvio8/Metabolic_Analysis/Fig_Tables/CorPGA_Heatmap.svg", height = 9, width = 6, device = 'svg', dpi = 600)

3.4.3.1 Percentages in Figure 9

Modules_MeansSummary = Modules_Means %>% 
  group_by(modules_final_label, species_name, .drop = TRUE) %>%
  summarise(mean_bySpecies = mean(mean_stepwise_module_completeness))
summaryPlot <- ggplot(Modules_MeansSummary, aes(x = mean_bySpecies, y = modules_final_label, fill = species_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = colorsSpecies) +
  scale_x_continuous(expand = c(0,0)) +
  labs(x = "", y = "") +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.y = element_text(hjust = 1, size = 10, color = "black"))
summaryPlot

saveWidget(ggplotly(summaryPlot), "data/Anvio8/Metabolic_Analysis/Fig_Tables/Heatmap_Percents_Interactive.html", selfcontained = TRUE)

3.4.4 Figure 6A

The Assimilatory sulfate reduction, sulfate => H2S module (M00176) had a stepwise completion score = 0.5 in most of the Corynebacterium genomes. However, this module was completely absent in all the C. pseudodiphtheriticum strains from the U.S. and 4 from Botswana. We look for KO assignments for the enzymes encoded in the fpr2cysIXHDNYZ operon (Rückert et al., 2005) that are not part of the M00176 module definition. Genes annotated as K00528 (fpr) were identified in all the Corynebacterium genomes. A manual exploration of the gene annotations in the genomic region around the fpr in all C. pseudodiphtheriticum strains, allowed us to identified hchA as a common gene downstream the fpr2cysIXHDNYZ operon.

This code uses BEDTools (Quinlan, 2014; Quinlan and Hall, 2010) to extract annotations and sequences for the fpr to hchA region for all C. pseudodiphtheriticum genomes:

path_i="data/Anvio8/Prokka_out"
path_o="data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA"
mkdir -p $path_o

gene_1="hchA"
gene_2="fprA"
species="Cps"
label="ASR_Cps_region"

fasta_files=()

for file in $path_i/$species*.gff; do
    filename=$(basename "$file")
    filename_noext="${filename%.*}"
    fasta_output="$path_o/${filename_noext}_region_sequence.fasta"

    awk '/^##FASTA/{flag=1;next} {if(!flag) print}' "$file" > "$path_o/${filename_noext}_annotations.gff"
    awk '/^##FASTA/{flag=1} {if(flag) print}' "$file" > "$path_o/${filename_noext}_sequence_temp.fasta"
    grep -v '^##FASTA' "$path_o/${filename_noext}_sequence_temp.fasta" > "$path_o/${filename_noext}_sequence.fasta"
    rm "$path_o/${filename_noext}_sequence_temp.fasta"

    grep -E 'Name='$gene_1'|Name='$gene_2 "$path_o/${filename_noext}_annotations.gff" | sort -k4,4n | awk 'NR==1 {start=$4} NR==2 {end=$5} END {print $1 "\t" start "\t" end}' > "$path_o/${filename_noext}_region_coordinates.bed"
    bedtools intersect -a "$path_o/${filename_noext}_annotations.gff" -b "$path_o/${filename_noext}_region_coordinates.bed" -wa > "$path_o/${filename_noext}_region_annotations.gff"
    bedtools getfasta -fi "$path_o/${filename_noext}_sequence.fasta" -bed "$path_o/${filename_noext}_region_coordinates.bed" | awk -v filename="$filename_noext" '/^>/ {print ">" filename "|" substr($0, 2)} !/^>/' > "$fasta_output"

    fasta_files+=("$fasta_output")
done

We use MAFFT (Katoh, 2002; Katoh and Standley, 2013) to align the extracted sequences.

#conda activate mafft 

cat "${fasta_files[@]}" > "$path_o/$label.fasta"
mafft --adjustdirection --reorder "$path_o/$label.fasta" > "$path_o/${label}_align.fasta"
sed 's/|.*//;s/_R_//' "$path_o/${label}_align.fasta" > "$path_o/${label}_align_clean.fasta" #Clean up header names

We used AliView (Larsson, 2014) to inspect the output file and to reverse complement the alignment to match the operon order. We saved the reversed complemented version of the alignment as ASR_Cps_region_align_clean_R.fasta.

Next we count the alignment length and create a reference sequence with all Ns and header named “Genes”. This will be used as a base for gene labeling and as a reference so visualize only the gaps to the consensus.

count=$(awk 'NR==2 {print length($0)}' "$path_o/${label}_align_clean_R.fasta") 
filename="$path_o/${label}_reference.fasta"

echo ">Genes" >> "$filename"
echo "$(printf 'N%.0s' $(seq 1 $count))" >> "$filename"
cat "$path_o/${label}_reference.fasta" "$path_o/${label}_align_clean_R.fasta" > "$path_o/${label}_align_clean_R_Ref.fasta"

echo ">-----------Genes" > "$filename"
echo "$(printf 'N%.0s' $(seq 1 $count))" >> "$filename"
echo ">" >> "$filename"
echo "$(printf '-%.0s' $(seq 1 $count))" >> "$filename"

The version of the alignment file with the reference sequence is the input for the visualization with the msavisr package (Vragh, 2022).

plot_ASR <- msavisr("data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region_align_clean_R_Ref.fasta", 
                myref = "Genes",
                refontop = TRUE,
                basecolors = c("black","black","white"))

plot_ASR <- plot_ASR + 
  theme(legend.position = "none",
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 14, color = "black"))

saveRDS(plot_ASR, "data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_plot.rds")

This code creates a reference sequence with gene locations and length proportional to the corresponding gene coordinates to use as a label.

#Manual list of gene coordinates generated by searching individual genes in the aligned fasta
genes <- list(c("-----------Genes", 11579:10901, "8_hchA"), 
              c("-----------Genes", 8816:7815, "7_cysZ"), 
              c("-----------Genes", 7806:6988, "6_cysY"), 
              c("-----------Genes", 6971:5595, "5_cysN"),
              c("-----------Genes", 5595:4513, "4_cysD"),
              c("-----------Genes", 4516:3458, "3_cysH"),
              c("-----------Genes", 3461:1788, "2_cysI"),
              c("-----------Genes", 1424:1, "1_fpr2")
)

plot_ASRlabels <- msavisr("data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region_reference.fasta", 
                myref = "-----------Genes",
                refontop = FALSE,
                basecolors = c("black","black","white"),
                myroi = genes, hroi = 1,
                roicolors = c("#9fd122","#00a84a","#f88502","#00a9f2","#6f88c9","#a94488","#c8c3e3", "gray80"))

# Values manually calculated to place the labels in the middle of each gene 
plot_ASRlabels <- plot_ASRlabels + 
  annotate("text", x = 678, y = 1, label = "fpr2", size = 7, fontface = 2) +
  annotate("text", x = 2624, y = 1, label = "cysI", size = 7, fontface = 2) +
  annotate("text", x = 3987, y = 1, label = "cysH", size = 7, fontface = 2) +
  annotate("text", x = 5054, y = 1, label = "cysD", size = 7, fontface = 2) +
  annotate("text", x = 6283, y = 1, label = "cysN", size = 7, fontface = 2) +
  annotate("text", x = 7397, y = 1, label = "cysY", size = 7, fontface = 2) +
  annotate("text", x = 8315, y = 1, label = "cysZ", size = 7, fontface = 2) +
  annotate("text", x = 11240, y = 1, label = "hchA", size = 7, fontface = 2) +
  theme(legend.position = "none",
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 12))

This generates the combined figure that was edited in Ilustrator for final edits, including adding symbols for the strains geographic location.

plot_ASR <- readRDS("data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_plot.rds")
FinalFigure <- ggarrange(plot_ASRlabels, plot_ASR, ncol = 1, heights = c(0.20, 1))
FinalFigure

ggsave(FinalFigure, file = "data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region.svg", height = 10, width = 20, dpi = 150)
ggsave(FinalFigure, file = "data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region.png", height = 10, width = 20, dpi = 150)

3.5 GC level analysis

We performed pangenomic level analysis in 107 genomes: the 102 nasal Corynebacterium genomes selected for metabolic analysis (Cac_ATCC_49726 & Cps_090104 were excluded for this analysis) plus the 5 Corynebacterium listed in Table S1D:

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"

anvi-gen-genomes-storage -e "data/genome_lists/CorPGA_Anvio_GenomeList_v01c.txt" -o $path_i/CorPGA-All4CorRefs-GENOMES.db --gene-caller Prodigal

anvi-db-info $path_i/CorPGA-All4CorRefs-GENOMES.db

anvi-pan-genome -g $path_i/CorPGA-All4CorRefs-GENOMES.db \
                -o $path_i/All4CorRefs -n PAN_All4CorRefs \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8 --I-know-this-is-not-a-good-idea

Create a bin collection named All4CorRefs with all Core GCs for all the Corynebacterium species using the visual interface. We do this because in order to use anvi-summarize we need to have at least one bin.

#conda activate anvio-dev

path_i="data/Anvio8/Pangenomic_Analysis"

anvi-display-pan -p $path_i/All4CorRefs/PAN_All4CorRefs-PAN.db -g $path_i/CorPGA-All4CorRefs-GENOMES.db         

anvi-summarize -p $path_i/All4CorRefs/PAN_All4CorRefs-PAN.db \
               -g $path_i/CorPGA-All4CorRefs-GENOMES.db \
               -o $path_i/All4CorRefs/Summaries \
               -C All4CorRefs

References

Bateman, A. (2000). The pfam protein families database. Nucleic Acids Research 28, 263–266.
Buchfink, B., Xie, C. and Huson, D. H. (2014). Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60.
Delmont, T. O. and Eren, A. M. (2018). Linking pangenomes and metagenomes: the Prochlorococcus metapangenome. PeerJ 6, e4320.
Drula, E., Garron, M.-L., Dogan, S., Lombard, V., Henrissat, B. and Terrapon, N. (2021). The carbohydrate-active enzyme database: functions and literature. Nucleic Acids Research 50, D571–D577.
Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLoS Computational Biology 7, e1002195.
Eren, A. M., Kiefl, E., Shaiber, A., Veseli, I., Miller, S. E., Schechter, M. S., Fink, I., Pan, J. N., Yousef, M., Fogarty, E. C., et al. (2020). Community-led, integrated, reproducible multi-omics with anvio. Nature Microbiology 6, 3–6.
Flores Ramos, S., Brugger, S. D., Escapa, I. F., Skeete, C. A., Cotton, S. L., Eslami, S. M., Gao, W., Bomar, L., Tran, T. H., Jones, D. S., et al. (2021). Genomic Stability and Genetic Defense Systems inDolosigranulum pigrum, a Candidate Beneficial Bacterium from the Human Microbiome. mSystems 6,.
Galperin, M. Y., Wolf, Y. I., Makarova, K. S., Vera Alvarez, R., Landsman, D. and Koonin, E. V. (2020). COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Research 49, D274–D281.
Kanehisa, M. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28, 27–30.
Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. and Ishiguro-Watanabe, M. (2022). KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Research 51, D587–D592.
Katoh, K. (2002). MAFFT: A novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic Acids Research 30, 3059–3066.
Katoh, K. and Standley, D. M. (2013). MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution 30, 772–780.
Larsson, A. (2014). AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 30, 3276–3278.
Mistry, J., Chuguransky, S., Williams, L., Qureshi, M., Salazar, G., Sonnhammer, E. L. L., Tosatto, S. C. E., Paladin, L., Raj, S., Richardson, L. J., et al. (2020). Pfam: The protein families database in 2021. Nucleic Acids Research 49, D412–D419.
Quinlan, A. R. (2014). BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Current Protocols in Bioinformatics 47,.
Quinlan, A. R. and Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842.
Rückert, C., Koch, D. J., Rey, D. A., Albersmeier, A., Mormann, S., Pühler, A. and Kalinowski, J. (2005). Functional genomics and expression analysis of the Corynebacterium glutamicum fpr2-cysIXHDNYZ gene cluster involved in assimilatory sulphate reduction. BMC Genomics 6,.
Shaiber, A., Willis, A. D., Delmont, T. O., Roux, S., Chen, L.-X., Schmid, A. C., Yousef, M., Watson, A. R., Lolans, K., Esen, Ö. C., et al. (2020). Functional and genetic markers of niche partitioning among enigmatic members of the human oral microbiome. Genome Biology 21,.
Tatusov, R. L. (2000). The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28, 33–36.
Veseli, I., Chen, Y. T., Schechter, M. S., Vanni, C., Fogarty, E. C., Watson, A. R., Jabri, B., Blekhman, R., Willis, A. D., Yu, M. K., et al. (2023). Microbes with higher metabolic independence are enriched in human gut microbiomes under stress. biorxiv.
Vragh (2022). Vragh/seqvisr: v0.2.7. Zenodo.