Supplemental Methods: anvi’o
library(tidyverse)
library(ggtext)
library(ggh4x)
library(RColorBrewer)
library(viridis)
library(grid)
library(ggpubr)
library(gridtext)
library(kableExtra)
library(seqinr)
library(reshape2)
library(plotly)
library(htmlwidgets)
library(seqvisr)
This notebook contains analyses using anvi’o version 8 (Delmont and Eren, 2018; Eren et al., 2020) (development branch). For more in-depth information about anvi’o go to their official online site here: https://anvio.org/
1 Functional annotations
We use as input (path_i
) the anvi’o contigs database files (.db
) that were generated in SupplementalMethods_Prokka_Annotations.
1.1 COG annotation
First we downloaded and set up the NCBI’s Clusters of Orthologous Groups database Galperin et al. (2020) using anvi-setup-ncbi-cogs
. The default is the latest version, which is COG20, meaning that anvi’o will use the NCBI’s 2020 release of COGs to setup the database.
The anvi-run-ncbi-cogs
command was used to annotate the contig databases. We used DIAMOND as searching method (Buchfink et al., 2014) with the recommended “sensitive” option.
1.2 KEGG annotation
First we downloaded and set up the KEGG KOfam (Kanehisa, 2000; Kanehisa et al., 2022) database using anvi-setup-kegg-data
. We used the snapshot of the KEGG database dated 2023-09-22 (hash a2b5bde358bb). It contains 479 modules with 15116 entries.
The program anvi-run-kegg-kofams
annotates genes in a given anvi’o contigs database with KEGG Orthology (KO) numbers via hits to the KEGG KOfam database. We used -E
lower than 1e-15.
#conda activate anvio-dev
path_i="data/Anvio8/Contigs_db"
for file in $path_i/*.db; do
anvi-run-kegg-kofams -T 8 --log-bitscores -E 1e-15 --just-do-it -c $file;
done
The option --log-bitscores
records the bit scores values. They get saved to the main folder for the repo, so we moved them to a new subfolder
path_i <- "."
path_o <- "data/Anvio8/KEGG_bitscores"
dir.create(file.path("data/Anvio8/KEGG_bitscores"), recursive = TRUE, showWarnings=FALSE)
files_to_move <- list.files(path_i, pattern = "_bitscores.txt$", full.names = TRUE)
for (file in files_to_move) {
new_path <- file.path(path_o, basename(file))
print(new_path)
file.rename(file, new_path)
}
1.3 PFAM annotation
First you need to download and set up the PFAM database (Bateman, 2000; Mistry et al., 2020) using anvi-setup-pfams
: this is something you will do only once. We used Pfam version 36.0 (2023-07).
The program anvi-run-kegg-pfam
annotates genes in a given anvi’o contigs database with the PFAM database.
1.4 Running HMMs
When you run the following command the occurrence of bacterial single-copy genes across your contigs is added to your contigs database (Eddy, 2011).
1.5 Running CAZyme
First you need to download and set up the CAZyme database (Drula et al., 2021) using anvi-setup-cazymes
: this is something you will do only once. We used Pfam version 11.
The program anvi-run-cazymes
annotates genes in your contigs-db with the dbCAN CAZyme database.
1.6 Annotation Summaries
This generates basic stats on the annotations:
#conda activate anvio-dev
path_i="data/Anvio8/Contigs_db"
path_o="data/Anvio8/Annotation_summaries"
mkdir -p "$path_o"
for file in $path_i/*.db; do
FILENAME=`basename ${file%.*}`
anvi-display-contigs-stats $file --report-as-text -o $path_o/stats_$FILENAME.txt;
done
This provides basic info on the contigs databases:
2 Pangenome Analysis
2.1 Generating an anvi’o genomes storage database
The program anvi-gen-genomes-storage
requires a .txt
file with the names and path locations of the genomes to be included in the genome storage database. The storage database generated with CorPGA_Anvio_GenomeList_v01a.txt
includes the 104 Corynebacterium genomes selected for pangenomic analysis (The three C. macginleyi genomes were excluded).
#conda activate anvio-dev
path_i="data/Anvio8/Pangenomic_Analysis"
mkdir -p "$path_i"
anvi-gen-genomes-storage -e "data/genome_lists/CorPGA_Anvio_GenomeList_v01a.txt" -o $path_i/CorPGA-All4Cor-GENOMES.db --gene-caller Prodigal
This provides basic info on the generated genomes storage database:
2.2 Running the pangenomic analyses
In order to analyze the pangenome individually for each of the four Corynebacterium species we created a .txt
file including the list of genomes for each species and ran anvi-pan-genome
with the -G
flag:
#conda activate anvio-dev
path_i="data/Anvio8/Pangenomic_Analysis"
path_g="data/genome_lists"
anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
-G $path_g/CorPGA_Anvio_GenomeList_PangenomeCpr.txt \
-o $path_i/Only_Cpr -n PAN_Cpr \
--use-ncbi-blast --mcl-inflation 10 --num-threads 8
anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
-G $path_g/CorPGA_Anvio_GenomeList_PangenomeCps.txt \
-o $path_i/Only_Cps -n PAN_Cps \
--use-ncbi-blast --mcl-inflation 10 --num-threads 8
anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
-G $path_g/CorPGA_Anvio_GenomeList_PangenomeCac.txt \
-o $path_i/Only_Cac -n PAN_Cac \
--use-ncbi-blast --mcl-inflation 10 --num-threads 8
anvi-pan-genome -g $path_i/CorPGA-All4Cor-GENOMES.db \
-G $path_g/CorPGA_Anvio_GenomeList_PangenomeCtu.txt \
-o $path_i/Only_Ctu -n PAN_Ctu \
--use-ncbi-blast --mcl-inflation 10 --num-threads 8
2.3 Displaying the pangenomes
This will display the pangenome in your default internet browser. Anvi’o says google chrome is the most compatible with the anvi’o interactive interface. You need to pick one of the -p
inputs for each PAN.db file:
#conda activate anvio-dev
path_i="data/Anvio8/Pangenomic_Analysis"
anvi-display-pan -p $path_i/Only_Cpr/PAN_Cpr-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db
anvi-display-pan -p $path_i/Only_Cps/PAN_Cps-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db
anvi-display-pan -p $path_i/Only_Cac/PAN_Cac-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db
anvi-display-pan -p $path_i/Only_Ctu/PAN_Ctu-PAN.db -g $path_i/CorPGA-All4Cor-GENOMES.db
If you are using Windows and it cannot display the interactive page then try copy and pasting either 127.0.0.1:8080 or http://localhost:8080/ into your browser. Do not use sudo in the front of the command either.
Once it opens in the browser, you can adjust image settings and click the “DRAW” icon to view it. For help using the interactive interface see http://merenlab.org/2016/02/27/the-anvio-interactive-interface/
We manually reordered the layers to match the phylogenomic trees per species using the click and drag option on the interactive interface. The order names is inverse to the layers in the concentric diagram. When all the layers were re-ordered to match the phylogenomic tree we clicked the draw button and saved the svg file.
2.4 Creating bin collections
2.4.1 Core vs. Accessory
In order to define the Core vs Accessory pangenome for each of the four species level pangenomes we use anvi’o interactive interface. In the “Bins” tab we created the corresponding bins. Using the “Search” tab we used “Search gene clusters using filters” and “Append splits to selected bin” to create bins. The following tables summarizes how the bins in the collection named CorevsAccessory were created (last column indicates the criteria used on each search):
2.4.1.1 C. propinquum pangenome
Pangenome Bin | # of Gene Clusters/Pangenome | Percentage | Gene Cluster occurs in # of genomes |
---|---|---|---|
Core | 1824/3108 | 58.7% | 19 |
Single Copy Core | 1715/3108 | 55.2% | 19 |
Multicopy Core | 109/3108 | 3.5% | 19 |
Soft Core | 39/3108 | 1.3% | 18 |
Shell | 689/3108 | 22.2% | 3-17 |
Cloud | 556/3108 | 17.9% | 1-2 |
2.4.1.2 C. pseudodiphtheriticum pangenome
Pangenome Bin | # of Gene Clusters/Pangenome | Percentage | Gene Cluster occurs in # of genomes |
---|---|---|---|
Core | 1714/3590 | 47.7% | 43 |
Single Copy Core | 1488/3590 | 41.4% | 43 |
Multicopy Core | 226/3590 | 6.3% | 43 |
Soft Core | 46/3590 | 1.3% | 40-42 |
Shell | 972/3590 | 27.1% | 3-39 |
Cloud | 858/3590 | 23.9% | 1-2 |
2.4.1.3 C. accolens pangenome
Pangenome Bin | # of Gene Clusters/Pangenome | Percentage | Gene Cluster occurs in # of genomes |
---|---|---|---|
Core | 1904/3427 | 55.6% | 34 |
Single Copy Core | 1734/3427 | 50.6% | 34 |
Multicopy Core | 170/3427 | 5% | 34 |
Soft Core | 77/3427 | 2.2% | 32-33 |
Shell | 676/3427 | 19.7% | 3-31 |
Cloud | 770/3427 | 22.5% | 1-2 |
2.4.1.4 C. tuberculostearicum pangenome
Pangenome Bin | # of Gene Clusters/Pangenome | Percentage | Gene Cluster occurs in # of genomes |
---|---|---|---|
Core | 1915/2907 | 65.9% | 8 |
Single Copy Core | 1860/2907 | 64% | 8 |
Multicopy Core | 55/2907 | 1.9% | 8 |
Soft Core | 47/2907 | 1.6% | 7 |
Shell | 270/2907 | 9.3% | 3-6 |
Cloud | 671/2907 | 23.1% | 1-2 |
2.4.2 PPanGGOLiN Partitions
PPanGGOLiN partitions were imported as a collection of bins named PPanGGOLiNpartitions as described in PPanGGOLiN analysis.
2.5 Summarizing the pangenome files
Using the program anvi-summarize
we exported summary tables for each of the four species level pangenomes.
The resulting summaries contain a gene_clusters_summary.txt.gz
file that links each gene to gene clusters, genomes, functions, layers, and bins selected from the interface. We used the flag -C
to selected the CorevsAccessory collection described above:
#conda activate anvio-dev
path_i="data/Anvio8/Pangenomic_Analysis"
anvi-summarize -p $path_i/Only_Cpr/PAN_Cpr-PAN.db \
-g $path_i/CorPGA-All4Cor-GENOMES.db \
-o $path_i/Only_Cpr/Summaries \
-C CorevsAccessory
anvi-summarize -p $path_i/Only_Cps/PAN_Cps-PAN.db \
-g $path_i/CorPGA-All4Cor-GENOMES.db \
-o $path_i/Only_Cps/Summaries \
-C CorevsAccessory
anvi-summarize -p $path_i/Only_Cac/PAN_Cac-PAN.db \
-g $path_i/CorPGA-All4Cor-GENOMES.db \
-o $path_i/Only_Cac/Summaries \
-C CorevsAccessory
anvi-summarize -p $path_i/Only_Ctu/PAN_Ctu-PAN.db \
-g $path_i/CorPGA-All4Cor-GENOMES.db \
-o $path_i/Only_Ctu/Summaries \
-C CorevsAccessory
The output files were used for the COG functional analysis based on PPanGGOLiN partitions.
3 KEGG Metabolic Analysis
3.1 Estimation of KEGG module completeness
Using anvi-estimate-metabolism
we predicted KEGG module completeness as described in (Veseli et al., 2023) for the the organisms listed in CorPGA_MetabolicAnalysis_GenomeList_v01.txt
. This list includes:
- The 102 Corynebacterium genomes selected for metabolic analysis (Cac_ATCC_49726 & Cps_090104 were excluded for this analysis, as well as all three C. macginleyi genomes)
- Plus 27 for the 28 Dolosigranulum pigrum genomes described in (Flores Ramos et al., 2021) (KPL3086 was not included in the analysis due to being highly similar to KPL3065).
- As well as the 9 reference genomes with original NCBI annotations listed in Table S3D.
We generated outputs both in long tabular and matrix format with the KO hits and KEGG modules completeness scores.
#conda activate anvio-dev
mkdir -p "data/Anvio8/Metabolic_Analysis"
path_o="data/Anvio8/Metabolic_Analysis"
txt_file="data/genome_lists/CorPGA_MetabolicAnalysis_GenomeList_v01.txt"
prefix=CorPGA
path_o="data/Anvio8/Metabolic_Analysis"
mkdir -p "$path_o"
anvi-estimate-metabolism -e $txt_file \
-O $path_o/$prefix --matrix-format --include-metadata
anvi-estimate-metabolism -e $txt_file \
-O $path_o/$prefix --add-copy-number --output-modes modules,module_paths,module_steps,hits
3.2 Module Enrichment Analysis
The CorPGA_modules.txt
file generated in the previous step is used by anvi-compute-metabolic-enrichment
to calculate KEGG module enrichment (Shaiber et al., 2020) across different groups of genomes. Enrichment was explored across different genome groupings as described in the following files:
CorPGA_MetabolicAnalysis_Groups_4Cor.txt
includes 102 genomes for the 4 nasal Corynebacterium grouped by their species.CorPGA_MetabolicAnalysis_Groups_CorDpi.txt
adds to the previous file an extra group with the 27 D. pigrum genomes.CorPGA_MetabolicAnalysis_Groups_Cps.txt
includes the 41 C. pseudodiphtheriticum genomes selected to study metabolic differences across geographic locations (BWA vs. USA grouping).CorPGA_MetabolicAnalysis_Groups_References.txt
adds to the previous file the 9 NCBI reference genomes as 9 new groups.
#conda activate anvio-dev
csv_file_modules="data/Anvio8/Metabolic_Analysis/CorPGA_modules.txt"
path_o="data/Anvio8/Metabolic_Analysis/Enrichment_Modules"
mkdir -p "$path_o"
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_4Cor.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_4Cor.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
-G "$txt_file_groups" \
-o "$output_file" \
--use-stepwise-completeness \
--module-completion-threshold 1.0
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_CorDpi.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_CorDpi.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
-G "$txt_file_groups" \
-o "$output_file" \
--use-stepwise-completeness \
--module-completion-threshold 1.0
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_Cps.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_Cps.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
-G "$txt_file_groups" \
-o "$output_file" \
--use-stepwise-completeness \
--module-completion-threshold 1.0
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_References.txt"
output_file="$path_o/CorPGA_enriched_modules_stepwise_1_References.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
-G "$txt_file_groups" \
-o "$output_file" \
--use-stepwise-completeness \
--module-completion-threshold 1.0
3.3 KO Enrichment Analysis
We used anvi-display-functions
to perform this functional enrichment (Shaiber et al., 2020). It runs the enrichment for the desired annotations (in this case KOfam). It saves the output to a system temp file that was recovered, renamed and saved into the same path_o
along with the database file.
The same files where used to indicate grouping options equivalent to the ones used for the enrichment at the module level:
#conda activate anvio-dev
txt_file="data/genome_lists/CorPGA_MetabolicAnalysis_GenomeList_v01.txt"
path_o="data/Anvio8/Metabolic_Analysis/Enrichment_KOs"
mkdir -p "$path_o"
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_4Cor.txt"
anvi-display-functions -e $txt_file \
--groups-txt $txt_file_groups \
--annotation-source KOfam \
--profile-db $path_o/KOfam_4Cor.db
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_CorDpi.txt"
anvi-display-functions -e $txt_file \
--groups-txt $txt_file_groups \
--annotation-source KOfam \
--profile-db $path_o/KOfam_CorDpi.db
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_Cps.txt"
anvi-display-functions -e $txt_file \
--groups-txt $txt_file_groups \
--annotation-source KOfam \
--profile-db $path_o/KOfam_Cps.db
txt_file_groups="data/genome_lists/CorPGA_MetabolicAnalysis_Groups_References.txt"
anvi-display-functions -e $txt_file \
--groups-txt $txt_file_groups \
--annotation-source KOfam \
--profile-db $path_o/KOfam_References.db
This generates a list of all enriched KOs, including KOs that belong to already identified enriched modules. We are interested in identifying other KOs, other than the ones identified by studying the modules output. The next function filters out KOs that are part of the definition of modules listed in the enriched modules output.
get_KOs_not_in_enrich_modules <- function(modules, hits, enrich_modules, enrich_KOs, q_value_threshold) {
# Get list of KOs that correspond with the modules on the enrichment file
modules_subset <- modules %>% filter(module %in% enrich_modules$accession)
module_definition_subset <- paste(modules_subset$module_definition, collapse = " ")
module_definition_subset <- str_extract_all(module_definition_subset, "\\bK\\d+\\b")
module_definition_subset <- unique(unlist(module_definition_subset))
# Add a column to enrich_KOs indicating if the KO is the previous list
enrich_KOs$inModule <- sapply(strsplit(as.character(enrich_KOs$accession), ","), function(x) all((x %in% module_definition_subset)))
# Filter Enrich_KO based on inModule and q_value_threshold
enrich_KOs_filtered <- enrich_KOs %>%
filter(inModule == FALSE) %>%
filter(adjusted_q_value < q_value_threshold)
return(enrich_KOs_filtered)
}
Modules <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA_modules.txt", show_col_types = FALSE, col_types = cols(unique_enzymes_hit_counts = col_character(), gene_caller_ids_in_module = col_character()))
KOHits <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA_hits.txt", show_col_types = FALSE)
Enrich_Mod_4Cor <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_4Cor.txt", show_col_types = FALSE)
Enrich_KO_4Cor <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_4Cor.txt", show_col_types = FALSE)
Enrich_KO_4Cor_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_4Cor, Enrich_KO_4Cor, 0.1)
write_csv(Enrich_KO_4Cor_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_4Cor_Filtered.csv")
Enrich_Mod_CorDpi <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_CorDpi.txt", show_col_types = FALSE)
Enrich_KO_CorDpi <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_CorDpi.txt", show_col_types = FALSE)
Enrich_KO_CorDpi_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_CorDpi, Enrich_KO_CorDpi, 0.1)
write_csv(Enrich_KO_CorDpi_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_CorDpi_Filtered.csv")
Enrich_Mod_Cps <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_Cps.txt", show_col_types = FALSE)
Enrich_KO_Cps <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_Cps.txt", show_col_types = FALSE)
Enrich_KO_Cps_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_Cps, Enrich_KO_Cps, 0.1)
write_csv(Enrich_KO_Cps_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_Cps_Filtered.csv")
Enrich_Mod_References <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_Modules/CorPGA_enriched_modules_stepwise_1_References.txt", show_col_types = FALSE)
Enrich_KO_References <- read_delim("data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_References.txt", show_col_types = FALSE)
Enrich_KO_References_Filtered <- get_KOs_not_in_enrich_modules(Modules, KOHits, Enrich_Mod_References, Enrich_KO_References, 0.1)
write_csv(Enrich_KO_References_Filtered, "data/Anvio8/Metabolic_Analysis/Enrichment_KOs/CorPGA_enriched_KOs_References_Filtered.csv")
3.4 Figures and Tables
3.4.1 Tables S3 and S4
From the output files generated by anvi-estimate-metabolism
we include as Supplemental data the following tables:
- TableS3A. KO hits: Corresponds to output file
CorPGA_hits.txt
and contains individual KO annotations for all genomes. - TableS4A. KEEG modules: Corresponds to output file
CorPGA_modules.txt
and contains KEGG module estimations for all genomes in a long format dataframe. - TableS4B. KEGG module stepwise matrix: Corresponds to output file
CorPGA-module_stepwise_completeness-MATRIX.txt
and contains KEGG module stepwise completion score for all genomes in a wide format dataframe (rows are modules, numeric columns are strains). - TableS4C. KEGG modules by Species: The following code was used to read the previous table and generate a version with stepwise module completeness averaged by species (rows are modules, numeric columns are species):
Modules_MatrixbyGenome <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA-module_stepwise_completeness-MATRIX.txt")
# Identify numerical and non-numerical columns
numeric_cols <- sapply(Modules_MatrixbyGenome, is.numeric)
non_numeric_cols <- names(Modules_MatrixbyGenome)[!numeric_cols]
# Extract unique prefixes (based on strain name including species name) from numerical column names
prefixes <- unique(sub("_.*", "", names(Modules_MatrixbyGenome)[numeric_cols]))
# Define function to calculate row-wise means for columns with the same prefix
calculate_prefix_means <- function(df, prefix) {
df %>%
select(starts_with(prefix)) %>%
rowMeans(na.rm = TRUE) %>%
round(2)
}
# Calculate the row-wise means for each prefix and bind them into a new data frame
averaged_df <- map_dfc(prefixes, ~ {
tibble(!! .x := calculate_prefix_means(Modules_MatrixbyGenome, .x))
})
# Combine the non-numerical columns and the averaged columns by species
Modules_MatrixbySpecies <- bind_cols(Modules_MatrixbyGenome %>% select(all_of(non_numeric_cols)), averaged_df)
write.csv(Modules_MatrixbySpecies, "data/Anvio8/Metabolic_Analysis/Fig_Tables/CorPGA-module_stepwise_completeness-MATRIXbySpecies.csv", row.names = F)
As described above, metabolic enrichment was explored across different genome groupings, however only the analysis comparing the 4 nasal Corynebacterium as defined in CorPGA_MetabolicAnalysis_Groups_4Cor.txt
was included in the final manuscript. We include the following tables:
- TableS4D. Enriched_Modules_4Cor: Corresponds to the output file from
anvi-compute-metabolic-enrichment
- TableS3B. OtherEnriched_KOs_4Cor: Corresponds to the output file
CorPGA_enriched_KOs_4Cor_Filtered.csv
. Includes only the enriched KOs that are not part of any module already described in tab A.
3.4.2 Figure 4
# Read modules data and select only the genomes for the four nasal Corynebacterium species and Dolosigranulum pigrum. Make appropriate labeling columns
Modules <- read_delim("data/Anvio8/Metabolic_Analysis/CorPGA_modules.txt", show_col_types = FALSE, col_types = cols(unique_enzymes_hit_counts = col_character(), gene_caller_ids_in_module = col_character())) %>%
filter(!grepl("^GCF", genome_name)) %>%
mutate(module_class_short = paste0("(", substr(module_class, 1, 1), ")")) %>%
mutate(species_name = str_extract(genome_name, "^[^_]+")) %>%
unite(modules_label, c(module_class_short, module_category, module_subcategory), sep = "_", remove = FALSE) %>%
mutate_at(c("species_name", "genome_name", "module", "module_name", "module_category", "module_subcategory", "module_class", "modules_label"), factor)
# Summary tally table of fully stepwise complete modules by species
ModulesTable <- Modules %>%
filter(stepwise_module_completeness == 1) %>%
group_by(module_category, module_subcategory, module, module_name, species_name) %>%
tally()
# Table converted to wide and percentages calculated based on number of genomes per species
ModulesTableWide <- ModulesTable %>%
pivot_wider(names_from = species_name, values_from = n, values_fill = 0)
ModulesTableWide$Cpr = round(100*(ModulesTableWide$Cpr/19), 1)
ModulesTableWide$Cps = round(100*(ModulesTableWide$Cps/42), 1)
ModulesTableWide$Cac = round(100*(ModulesTableWide$Cac/33), 1)
ModulesTableWide$Ctu = round(100*(ModulesTableWide$Ctu/8), 1)
ModulesTableWide$Dpi = round(100*(ModulesTableWide$Dpi/27), 1)
# Modules with less than 12.5% across all species eliminated
ModulesTableWide <- ModulesTableWide %>%
select(module_category, module_subcategory, module, module_name, Cpr, Cps, Cac, Ctu, Dpi) %>%
filter(rowSums(across(c(Cpr, Cps, Cac, Ctu, Dpi)) > 12.5) > 0)
# Version of the table to calculate the percentage of complete modules across the four Corynebacterium species:
ModulesData <- ModulesTable %>%
filter(species_name != "Dpi") %>%
group_by(module_category, module_subcategory, module, module_name) %>%
summarise(total = sum(n)) %>%
mutate(percent = 100*total/102)
write_csv(ModulesTableWide, "data/Anvio8/Metabolic_Analysis/Fig_Tables/TableKEGG.csv")
kable(ModulesTableWide)
module_category | module_subcategory | module | module_name | Cpr | Cps | Cac | Ctu | Dpi |
---|---|---|---|---|---|---|---|---|
Amino acid metabolism | Arginine and proline metabolism | M00015 | Proline biosynthesis, glutamate => proline | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Arginine and proline metabolism | M00028 | Ornithine biosynthesis, glutamate => ornithine | 100.0 | 100.0 | 100.0 | 50.0 | 0.0 |
Amino acid metabolism | Arginine and proline metabolism | M00844 | Arginine biosynthesis, ornithine => arginine | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Arginine and proline metabolism | M00970 | Proline degradation, proline => glutamate | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Aromatic amino acid metabolism | M00022 | Shikimate pathway, phosphoenolpyruvate + erythrose-4P => chorismate | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Aromatic amino acid metabolism | M00023 | Tryptophan biosynthesis, chorismate => tryptophan | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Branched-chain amino acid metabolism | M00019 | Valine/isoleucine biosynthesis, pyruvate => valine / 2-oxobutanoate => isoleucine | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Branched-chain amino acid metabolism | M00432 | Leucine biosynthesis, 2-oxoisovalerate => 2-oxoisocaproate | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Branched-chain amino acid metabolism | M00570 | Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Cysteine and methionine metabolism | M00017 | Methionine biosynthesis, aspartate => homoserine => methionine | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Cysteine and methionine metabolism | M00021 | Cysteine biosynthesis, serine => cysteine | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Histidine metabolism | M00026 | Histidine biosynthesis, PRPP => histidine | 100.0 | 100.0 | 90.9 | 0.0 | 0.0 |
Amino acid metabolism | Histidine metabolism | M00045 | Histidine degradation, histidine => N-formiminoglutamate => glutamate | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Amino acid metabolism | Lysine metabolism | M00016 | Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine | 100.0 | 100.0 | 100.0 | 50.0 | 0.0 |
Amino acid metabolism | Serine and threonine metabolism | M00018 | Threonine biosynthesis, aspartate => homoserine => threonine | 100.0 | 100.0 | 100.0 | 100.0 | 81.5 |
Amino acid metabolism | Serine and threonine metabolism | M00020 | Serine biosynthesis, glycerate-3P => serine | 100.0 | 100.0 | 100.0 | 87.5 | 0.0 |
Amino acid metabolism | Serine and threonine metabolism | M00621 | Glycine cleavage system | 0.0 | 0.0 | 100.0 | 100.0 | 100.0 |
Biosynthesis of terpenoids and polyketides | Polyketide sugar unit biosynthesis | M00793 | dTDP-L-rhamnose biosynthesis | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Biosynthesis of terpenoids and polyketides | Terpenoid backbone biosynthesis | M00096 | C5 isoprenoid biosynthesis, non-mevalonate pathway | 0.0 | 0.0 | 100.0 | 100.0 | 0.0 |
Biosynthesis of terpenoids and polyketides | Terpenoid backbone biosynthesis | M00364 | C10-C20 isoprenoid biosynthesis, bacteria | 0.0 | 0.0 | 100.0 | 100.0 | 0.0 |
Biosynthesis of terpenoids and polyketides | Terpenoid backbone biosynthesis | M00365 | C10-C20 isoprenoid biosynthesis, archaea | 0.0 | 0.0 | 100.0 | 100.0 | 0.0 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00001 | Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00002 | Glycolysis, core module involving three-carbon compounds | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00003 | Gluconeogenesis, oxaloacetate => fructose-6P | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00004 | Pentose phosphate pathway (Pentose phosphate cycle) | 100.0 | 100.0 | 100.0 | 100.0 | 77.8 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00005 | PRPP biosynthesis, ribose 5P => PRPP | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00006 | Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00007 | Pentose phosphate pathway, non-oxidative phase, fructose 6P => ribose 5P | 100.0 | 100.0 | 100.0 | 100.0 | 77.8 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00009 | Citrate cycle (TCA cycle, Krebs cycle) | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00010 | Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00011 | Citrate cycle, second carbon oxidation, 2-oxoglutarate => oxaloacetate | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Carbohydrate metabolism | Central carbohydrate metabolism | M00307 | Pyruvate oxidation, pyruvate => acetyl-CoA | 100.0 | 69.0 | 100.0 | 100.0 | 0.0 |
Carbohydrate metabolism | Other carbohydrate metabolism | M00549 | Nucleotide sugar biosynthesis, glucose => UDP-glucose | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 |
Carbohydrate metabolism | Other carbohydrate metabolism | M00550 | Ascorbate degradation, ascorbate => D-xylulose-5P | 0.0 | 0.0 | 0.0 | 0.0 | 37.0 |
Carbohydrate metabolism | Other carbohydrate metabolism | M00554 | Nucleotide sugar biosynthesis, galactose => UDP-galactose | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Carbohydrate metabolism | Other carbohydrate metabolism | M00632 | Galactose degradation, Leloir pathway, galactose => alpha-D-glucose-1P | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Carbohydrate metabolism | Other carbohydrate metabolism | M00854 | Glycogen biosynthesis, glucose-1P => glycogen/starch | 0.0 | 0.0 | 0.0 | 0.0 | 96.3 |
Carbohydrate metabolism | Other carbohydrate metabolism | M00909 | UDP-N-acetyl-D-glucosamine biosynthesis, prokaryotes, glucose => UDP-GlcNAc | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Energy metabolism | ATP synthesis | M00151 | Cytochrome bc1 complex respiratory unit | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Energy metabolism | ATP synthesis | M00155 | Cytochrome c oxidase, prokaryotes | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Energy metabolism | ATP synthesis | M00157 | F-type ATPase, prokaryotes and chloroplasts | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Energy metabolism | Carbon fixation | M00579 | Phosphate acetyltransferase-acetate kinase pathway, acetyl-CoA => acetate | 0.0 | 0.0 | 100.0 | 100.0 | 100.0 |
Lipid metabolism | Fatty acid metabolism | M00083 | Fatty acid biosynthesis, elongation | 100.0 | 100.0 | 0.0 | 0.0 | 96.3 |
Lipid metabolism | Fatty acid metabolism | M00086 | beta-Oxidation, acyl-CoA synthesis | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00120 | Coenzyme A biosynthesis, pantothenate => CoA | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00121 | Heme biosynthesis, plants and bacteria, glutamate => heme | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00123 | Biotin biosynthesis, pimeloyl-ACP/CoA => biotin | 21.1 | 92.9 | 100.0 | 0.0 | 0.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00125 | Riboflavin biosynthesis, plants and bacteria, GTP => riboflavin/FMN/FAD | 100.0 | 100.0 | 100.0 | 100.0 | 14.8 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00126 | Tetrahydrofolate biosynthesis, GTP => THF | 0.0 | 0.0 | 100.0 | 100.0 | 0.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00140 | C1-unit interconversion, prokaryotes | 100.0 | 97.6 | 100.0 | 100.0 | 100.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00577 | Biotin biosynthesis, BioW pathway, pimelate => pimeloyl-CoA => biotin | 21.1 | 92.9 | 100.0 | 0.0 | 0.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00881 | Lipoic acid biosynthesis, plants and bacteria, octanoyl-ACP => dihydrolipoyl-E2/H | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00899 | Thiamine salvage pathway, HMP/HET => TMP | 0.0 | 0.0 | 100.0 | 100.0 | 100.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00916 | Pyridoxal-P biosynthesis, R5P + glyceraldehyde-3P + glutamine => pyridoxal-P | 0.0 | 0.0 | 100.0 | 100.0 | 0.0 |
Metabolism of cofactors and vitamins | Cofactor and vitamin metabolism | M00926 | Heme biosynthesis, bacteria, glutamyl-tRNA => coproporphyrin III => heme | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Nucleotide metabolism | Purine metabolism | M00048 | De novo purine biosynthesis, PRPP + glutamine => IMP | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Nucleotide metabolism | Purine metabolism | M00049 | Adenine ribonucleotide biosynthesis, IMP => ADP,ATP | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Nucleotide metabolism | Purine metabolism | M00050 | Guanine ribonucleotide biosynthesis, IMP => GDP,GTP | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Nucleotide metabolism | Purine metabolism | M00053 | Deoxyribonucleotide biosynthesis, ADP/GDP/CDP/UDP => dATP/dGTP/dCTP/dUTP | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
Nucleotide metabolism | Pyrimidine metabolism | M00938 | Pyrimidine deoxyribonucleotide biosynthesis, UDP => dTTP | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 |
3.4.3 Figure 9
# Reading strain order and label formatting
StrainOrder <- read_csv("data/genome_lists/CorPGA_MetabolicAnalysis_PlotOrder.csv", show_col_types = FALSE)
# When using stepwise_module_completeness some modules can be = 0 for all genomes but still have pathwise_module_completeness > 0. We are removing those modules.
Modules_cleaned <- Modules %>%
group_by(module) %>%
filter(!all(stepwise_module_completeness == 0)) %>%
ungroup() %>%
mutate(across(where(is.factor), droplevels))
all_zero_modules <- setdiff(Modules$module, Modules_cleaned$module)
all_zero_subcategory <- setdiff(Modules$module_subcategory, Modules_cleaned$module_subcategory)
# Calculating mean_stepwise_module_completeness and preparing plot labels
Modules_Means = Modules_cleaned %>%
group_by(modules_label, genome_name, .drop = FALSE) %>%
summarise(mean_stepwise_module_completeness = mean(stepwise_module_completeness)) %>%
mutate(mean_stepwise_module_completeness = coalesce(mean_stepwise_module_completeness, 0)) %>%
mutate(species_name = str_extract(genome_name, "^[^_]+")) %>%
mutate(module_category = str_extract(modules_label, "^[^_]+_[^_]+")) %>%
mutate(module_category = str_replace_all(module_category, "_", " ")) %>%
mutate(modules_final_label = str_replace(modules_label, "^[^_]+_[^_]+_", "")) %>%
mutate_at(c("module_category", "modules_final_label"), factor)
Modules_Means <- left_join(Modules_Means, StrainOrder, by = "genome_name")
Modules_Means$species_name <- factor(Modules_Means$species_name, levels = c("Cpr","Cps","Cac","Ctu","Dpi"))
Modules_Means$genome_name <- factor(Modules_Means$genome_name, levels = StrainOrder$genome_name)
Modules_Means$styled_genome_name <- factor(Modules_Means$styled_genome_name, levels = StrainOrder$styled_genome_name)
# Colors for plot
getPalette <- colorRampPalette(brewer.pal(8, "Set1"))
colorsSpecies <- c("#FF8C00", "#FF0000", "#6A329F", "#3B54A4", "#DAE9F8")
# Heatmap plot
plot <- ggplot(Modules_Means, aes(x = modules_final_label, y = styled_genome_name, fill = mean_stepwise_module_completeness)) +
geom_tile(color = "white") +
facet_grid2(species_name ~ module_category, scales = "free", space = "free", switch = "both",
strip = strip_themed(background_x = elem_list_rect(fill = getPalette(12)),
background_y = elem_list_rect(fill = colorsSpecies))) +
labs(x = "", y = "") +
scale_y_discrete(expand = c(0,0), position = "right", limits = rev) +
scale_x_discrete(expand = c(0,0)) +
scale_fill_viridis(direction = -1, option = "A", begin = 0.2) +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
strip.background = element_rect(colour = "black", fill = NA),
strip.text.y.left = element_text(angle = 0, size = 12, face = "bold.italic"),
strip.text.x = element_text(size = 0),
axis.text.y.right = element_markdown(size = 4, color = "black"),
axis.text.x = element_text(angle = 45, hjust = 1, size = 6, color = "black"),
panel.spacing = unit(0, 'pt'),
legend.text = element_text(size = 7),
legend.title = element_text(size = 7),
legend.direction = "vertical",
legend.position = c(0.97, -0.19),
legend.justification = c(0.4, 0),
legend.background = element_blank()) +
guides(fill = guide_colourbar(title = "\nAverage\nstepwise\nmodule\ncompleteness",
title.position = "left", title.hjust = 0.5, barwidth = 0.8, barheight = 3))
# Module categories legend
legend <- ggplot(Modules_Means, aes(x = module_category, y = mean_stepwise_module_completeness, fill = module_category)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = getPalette(12)) +
theme(legend.title = element_blank(),
legend.position = "bottom",
legend.text = element_text(size = 6),
legend.key.size = unit(0.3,"line")) +
guides(fill = guide_legend(ncol = 3))
# Strains origen legend
text <- "<span style='font-size:9px'> USA isolate:</span> <span style='color:#ed1c24; font-size:13px;'>▲</span><br><span style='font-size:9px'> BWA isolate:</span> <span style='color:#6ca9d4; font-size:17px;'>●</span>"
text.p <- textbox_grob(text)
# Final figure
Figure <- ggarrange(plot + theme(plot.margin = unit(c(0.5,0,1.5,3), "lines")),
ggarrange(
as_ggplot(get_legend(legend)) + theme(plot.margin = unit(c(-1,0,1,2), "lines")),
as_ggplot(text.p) + theme(plot.margin = unit(c(0,0,2,1.5), "lines")),
nrow = 1, widths = c(0.8,0.2)),
nrow = 2, heights = c(1,0.02))
Figure
ggsave(plot = Figure, filename = "data/Anvio8/Metabolic_Analysis/Fig_Tables/CorPGA_Heatmap.png", height = 9, width = 6, device = 'png', dpi = 600)
ggsave(plot = Figure, filename = "data/Anvio8/Metabolic_Analysis/Fig_Tables/CorPGA_Heatmap.svg", height = 9, width = 6, device = 'svg', dpi = 600)
3.4.3.1 Percentages in Figure 9
Modules_MeansSummary = Modules_Means %>%
group_by(modules_final_label, species_name, .drop = TRUE) %>%
summarise(mean_bySpecies = mean(mean_stepwise_module_completeness))
summaryPlot <- ggplot(Modules_MeansSummary, aes(x = mean_bySpecies, y = modules_final_label, fill = species_name)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = colorsSpecies) +
scale_x_continuous(expand = c(0,0)) +
labs(x = "", y = "") +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.y = element_text(hjust = 1, size = 10, color = "black"))
summaryPlot
3.4.4 Figure 6A
The Assimilatory sulfate reduction, sulfate => H2S
module (M00176) had a stepwise completion score = 0.5 in most of the Corynebacterium genomes. However, this module was completely absent in all the C. pseudodiphtheriticum strains from the U.S. and 4 from Botswana. We look for KO assignments for the enzymes encoded in the fpr2cysIXHDNYZ operon (Rückert et al., 2005) that are not part of the M00176 module definition. Genes annotated as K00528 (fpr) were identified in all the Corynebacterium genomes. A manual exploration of the gene annotations in the genomic region around the fpr in all C. pseudodiphtheriticum strains, allowed us to identified hchA as a common gene downstream the fpr2cysIXHDNYZ operon.
This code uses BEDTools
(Quinlan, 2014; Quinlan and Hall, 2010) to extract annotations and sequences for the fpr to hchA region for all C. pseudodiphtheriticum genomes:
path_i="data/Anvio8/Prokka_out"
path_o="data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA"
mkdir -p $path_o
gene_1="hchA"
gene_2="fprA"
species="Cps"
label="ASR_Cps_region"
fasta_files=()
for file in $path_i/$species*.gff; do
filename=$(basename "$file")
filename_noext="${filename%.*}"
fasta_output="$path_o/${filename_noext}_region_sequence.fasta"
awk '/^##FASTA/{flag=1;next} {if(!flag) print}' "$file" > "$path_o/${filename_noext}_annotations.gff"
awk '/^##FASTA/{flag=1} {if(flag) print}' "$file" > "$path_o/${filename_noext}_sequence_temp.fasta"
grep -v '^##FASTA' "$path_o/${filename_noext}_sequence_temp.fasta" > "$path_o/${filename_noext}_sequence.fasta"
rm "$path_o/${filename_noext}_sequence_temp.fasta"
grep -E 'Name='$gene_1'|Name='$gene_2 "$path_o/${filename_noext}_annotations.gff" | sort -k4,4n | awk 'NR==1 {start=$4} NR==2 {end=$5} END {print $1 "\t" start "\t" end}' > "$path_o/${filename_noext}_region_coordinates.bed"
bedtools intersect -a "$path_o/${filename_noext}_annotations.gff" -b "$path_o/${filename_noext}_region_coordinates.bed" -wa > "$path_o/${filename_noext}_region_annotations.gff"
bedtools getfasta -fi "$path_o/${filename_noext}_sequence.fasta" -bed "$path_o/${filename_noext}_region_coordinates.bed" | awk -v filename="$filename_noext" '/^>/ {print ">" filename "|" substr($0, 2)} !/^>/' > "$fasta_output"
fasta_files+=("$fasta_output")
done
We use MAFFT (Katoh, 2002; Katoh and Standley, 2013) to align the extracted sequences.
#conda activate mafft
cat "${fasta_files[@]}" > "$path_o/$label.fasta"
mafft --adjustdirection --reorder "$path_o/$label.fasta" > "$path_o/${label}_align.fasta"
sed 's/|.*//;s/_R_//' "$path_o/${label}_align.fasta" > "$path_o/${label}_align_clean.fasta" #Clean up header names
We used AliView (Larsson, 2014) to inspect the output file and to reverse complement the alignment to match the operon order. We saved the reversed complemented version of the alignment as ASR_Cps_region_align_clean_R.fasta
.
Next we count the alignment length and create a reference sequence with all Ns and header named “Genes”. This will be used as a base for gene labeling and as a reference so visualize only the gaps to the consensus.
count=$(awk 'NR==2 {print length($0)}' "$path_o/${label}_align_clean_R.fasta")
filename="$path_o/${label}_reference.fasta"
echo ">Genes" >> "$filename"
echo "$(printf 'N%.0s' $(seq 1 $count))" >> "$filename"
cat "$path_o/${label}_reference.fasta" "$path_o/${label}_align_clean_R.fasta" > "$path_o/${label}_align_clean_R_Ref.fasta"
echo ">-----------Genes" > "$filename"
echo "$(printf 'N%.0s' $(seq 1 $count))" >> "$filename"
echo ">" >> "$filename"
echo "$(printf '-%.0s' $(seq 1 $count))" >> "$filename"
The version of the alignment file with the reference sequence is the input for the visualization with the msavisr
package (Vragh, 2022).
plot_ASR <- msavisr("data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region_align_clean_R_Ref.fasta",
myref = "Genes",
refontop = TRUE,
basecolors = c("black","black","white"))
plot_ASR <- plot_ASR +
theme(legend.position = "none",
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 14, color = "black"))
saveRDS(plot_ASR, "data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_plot.rds")
This code creates a reference sequence with gene locations and length proportional to the corresponding gene coordinates to use as a label.
#Manual list of gene coordinates generated by searching individual genes in the aligned fasta
genes <- list(c("-----------Genes", 11579:10901, "8_hchA"),
c("-----------Genes", 8816:7815, "7_cysZ"),
c("-----------Genes", 7806:6988, "6_cysY"),
c("-----------Genes", 6971:5595, "5_cysN"),
c("-----------Genes", 5595:4513, "4_cysD"),
c("-----------Genes", 4516:3458, "3_cysH"),
c("-----------Genes", 3461:1788, "2_cysI"),
c("-----------Genes", 1424:1, "1_fpr2")
)
plot_ASRlabels <- msavisr("data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_Cps_region_reference.fasta",
myref = "-----------Genes",
refontop = FALSE,
basecolors = c("black","black","white"),
myroi = genes, hroi = 1,
roicolors = c("#9fd122","#00a84a","#f88502","#00a9f2","#6f88c9","#a94488","#c8c3e3", "gray80"))
# Values manually calculated to place the labels in the middle of each gene
plot_ASRlabels <- plot_ASRlabels +
annotate("text", x = 678, y = 1, label = "fpr2", size = 7, fontface = 2) +
annotate("text", x = 2624, y = 1, label = "cysI", size = 7, fontface = 2) +
annotate("text", x = 3987, y = 1, label = "cysH", size = 7, fontface = 2) +
annotate("text", x = 5054, y = 1, label = "cysD", size = 7, fontface = 2) +
annotate("text", x = 6283, y = 1, label = "cysN", size = 7, fontface = 2) +
annotate("text", x = 7397, y = 1, label = "cysY", size = 7, fontface = 2) +
annotate("text", x = 8315, y = 1, label = "cysZ", size = 7, fontface = 2) +
annotate("text", x = 11240, y = 1, label = "hchA", size = 7, fontface = 2) +
theme(legend.position = "none",
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 12))
This generates the combined figure that was edited in Ilustrator for final edits, including adding symbols for the strains geographic location.
plot_ASR <- readRDS("data/Anvio8/Metabolic_Analysis/Fig_Tables/Alignment_hchA-fprA/ASR_plot.rds")
FinalFigure <- ggarrange(plot_ASRlabels, plot_ASR, ncol = 1, heights = c(0.20, 1))
FinalFigure
3.5 GC level analysis
We performed pangenomic level analysis in 107 genomes: the 102 nasal Corynebacterium genomes selected for metabolic analysis (Cac_ATCC_49726 & Cps_090104 were excluded for this analysis) plus the 5 Corynebacterium listed in Table S1D:
Corynebacterium glutamicum ATCC 13032 (Kyowa Hakko) (GCF_000011325.1)
Corynebacterium diphtheriae NCTC11397 (GCF_001457455.1)
Corynebacterium simulans PES1 (GCF_001586215.1)
Corynebacterium kroppenstedtii DSM 44385 (GCF_000023145.1)
Corynebacterium amycolatum FDAARGOS_1108 (GCF_016728725.1)
#conda activate anvio-dev
path_i="data/Anvio8/Pangenomic_Analysis"
anvi-gen-genomes-storage -e "data/genome_lists/CorPGA_Anvio_GenomeList_v01c.txt" -o $path_i/CorPGA-All4CorRefs-GENOMES.db --gene-caller Prodigal
anvi-db-info $path_i/CorPGA-All4CorRefs-GENOMES.db
anvi-pan-genome -g $path_i/CorPGA-All4CorRefs-GENOMES.db \
-o $path_i/All4CorRefs -n PAN_All4CorRefs \
--use-ncbi-blast --mcl-inflation 10 --num-threads 8 --I-know-this-is-not-a-good-idea
Create a bin collection named All4CorRefs with all Core GCs for all the Corynebacterium species using the visual interface. We do this because in order to use anvi-summarize
we need to have at least one bin.
#conda activate anvio-dev
path_i="data/Anvio8/Pangenomic_Analysis"
anvi-display-pan -p $path_i/All4CorRefs/PAN_All4CorRefs-PAN.db -g $path_i/CorPGA-All4CorRefs-GENOMES.db
anvi-summarize -p $path_i/All4CorRefs/PAN_All4CorRefs-PAN.db \
-g $path_i/CorPGA-All4CorRefs-GENOMES.db \
-o $path_i/All4CorRefs/Summaries \
-C All4CorRefs