Description of Corynebacterium nasorum sp. nov. and Corynebacterium hallucis sp. nov. isolated from human nasal passages and skin
  1. Methods
  2. Anvio
  • Introduction
  • Methods
    • Average Nucleotide Identity (ANI)
    • Digital DNA:DNA hybridization (dDDH)
    • Prokka Annotations
    • Anvio
    • Figures
  • References
  • R Session Info

Table of contents

  • Set up paths
  • Functional annotations
    • COG annotation
    • KEGG annotation
    • PFAM annotation
    • CAZyme Annotation
    • Getting HMMs
    • Getting Metabolic Networks
    • Annotation Summaries
  • Metabolic Analysis
    • Estimation of KEGG module completeness
    • Module Enrichment Analysis
    • KO Enrichment Analysis
    • Reading results into R
  • Pangenomic level analysis
    • Reading results into R

Anvio

library(tidyverse)
library(ggtext)
library(viridis)
library(RColorBrewer)
library(knitr)
library(ggh4x)
library(reshape2)
library(seqinr)

Set up paths

#This will open the .Renviron file for the project. You need to add a DATA_PATH environment variable with the path to your data folder.
#usethis::edit_r_environ(scope = "project") 

# This will create and open the .bash_env file for editing. You need to add a DATA_PATH environment variable with the path to your data folder.
#file.create(".bash_env")
#file.edit(".bash_env") 

FOLDER_PATH <- Sys.getenv("DATA_PATH") 
genomes_list_folder <- file.path(FOLDER_PATH, "genome_lists")
anvio_folder <- file.path(FOLDER_PATH, "anvio-9")

This notebook contains analyses using anvi’o version 9 (Delmont and Eren, 2018; Eren et al., 2020). For more in-depth information about anvi’o go to their official online site here: https://anvio.org/

We first source the .bash_env file to set up the environment variables.

source .bash_env

Functional annotations

We use as input (path_i) the anvi’o contigs database files (.db) that were generated in Methods_Prokka_Annotations.

COG annotation

First we downloaded and set up the NCBI’s Clusters of Orthologous Groups databaseGalperin et al. (2024) using anvi-setup-ncbi-cogs. The default is the latest version, which is COG24, meaning that anvi’o will use the NCBI’s 2024 release of COGs to setup the database.

The anvi-run-ncbi-cogs command was used to annotate the contig databases. We used DIAMOND as searching method(Buchfink et al., 2014) with the recommended “sensitive” option.

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-ncbi-cogs -T 8 --search-with diamond -c $file;
done

KEGG annotation

First we downloaded and set up the KEGG KOfam(Kanehisa, 2000; Kanehisa et al., 2022) database using anvi-setup-kegg-data. We used the snapshot of the KEGG database dated 2025-11-07 (hash 68221bd12b3).

The program anvi-run-kegg-kofams annotates genes in a given anvi’o contigs database with KEGG Orthology (KO) numbers via hits to the KEGG KOfam database (Aramaki et al., 2019). We used -E lower than 1e-15.

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-kegg-kofams -T 8 --log-bitscores -E 1e-15 --just-do-it -c $file;
done

The option --log-bitscores records the bit scores values. They get saved to the main folder for the repo, so we moved them to a new subfolder.

path_i <- "."
path_o <- file.path(anvio_folder, "KEGG_bitscores")
dir.create(path_o, recursive = TRUE, showWarnings = FALSE)
files_to_move <- list.files(path_i, pattern = "_bitscores.txt$", full.names = TRUE)

for (file in files_to_move) {
  new_path <- file.path(path_o, basename(file))
  print(new_path)
  file.rename(file, new_path)
}

PFAM annotation

First you need to download and set up the PFAM database(Bateman, 2000; Mistry et al., 2020) using anvi-setup-pfams: this is something you will do only once. We used Pfam version 38.1 (2025-09)

The program anvi-run-kegg-pfam annotates genes in a given anvi’o contigs database with the PFAM database.

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-pfams -T 8 -c $file;
done

CAZyme Annotation

First you need to download and set up the CAZyme database (Drula et al., 2021) using anvi-setup-cazymes: this is something you will do only once. We used CAZyme version 12.

The program anvi-run-cazymes annotates genes in your contigs-db with the dbCAN CAZyme database.

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-cazymes -T 8 -c $file;
done

Getting HMMs

When you run the anvi-run-hmms command the occurrence of bacterial single-copy genes across your contigs is added to your contigs database (Eddy, 2011).

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-hmms -T 8 -c $file --just-do-it;
done

Getting Metabolic Networks

The program anvi-reaction-network is used to store a metabolic reaction-network in each contigs-db. The network consists of data on biochemical reactions predicted to be encoded by each genome, referencing the KEGG Orthology (KO) and ModelSEED Biochemistry databases.

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-reaction-network -c $file;
done

Annotation Summaries

This generates basic stats on the annotations:

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"
path_o="$DATA_PATH/anvio-9/annotation_summaries"
mkdir -p "$path_o"

for file in $path_i/*.db; do
    FILENAME=`basename ${file%.*}`
    anvi-display-contigs-stats $file --report-as-text -o $path_o/stats_$FILENAME.txt;
done

This provides basic info on the contigs databases:

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"
path_o="$DATA_PATH/anvio-9/annotation_summaries"

for file in $path_i/*.db; do
    anvi-db-info $file &>> $path_o/anvi-db-info_log.txt;
done

Metabolic Analysis

Estimation of KEGG module completeness

Using anvi-estimate-metabolism we predicted KEGG module completeness as described in (Veseli et al., 2023) for the the organisms listed in anvio_KEGG_30_CtuComplex.txt. This list includes the name of the 30 Corynebacterium genomes selected for metabolic analysis and the path to the corresponding anvi’o .db files.

conda deactivate
conda activate anvio-9

txt_file="$DATA_PATH/genome_lists/anvio_KEGG_30_CtuComplex.txt"
prefix=NovCor
path_o="$DATA_PATH/anvio-9/metabolic_analysis"
mkdir -p "$path_o"

anvi-estimate-metabolism -e $txt_file \
                         -O $path_o/$prefix --add-copy-number --output-modes hits,modules

The generated NovCor_modules.txt is included as Table S4 in the manuscript. It contains the KEGG module completeness values for each genome, which are used to generate the KEGG module summary Table S5, and the heatmap in Figure 4.

Code for KEGG Manuscript Table S5

# Read modules data and make appropriate labeling columns
Modules <- read_delim(file.path(anvio_folder, "metabolic_analysis/NovCor_modules.txt"), show_col_types = FALSE, col_types = cols(unique_enzymes_hit_counts = col_character(), gene_caller_ids_in_module = col_character())) %>% 
  mutate(module_class_short = paste0("(", substr(module_class, 1, 1), ")")) %>%
  unite(modules_label, c(module_class_short, module_category, module_subcategory), sep = "_", remove = FALSE) %>% 
  mutate_at(c("genome_name", "module", "module_name", "module_category", "module_subcategory", "modules_label"), factor)

# Summary tally table of fully stepwise complete modules by species. Filter to modules present in at least 20 genomes
ModulesTable <- Modules %>%
  filter(stepwise_module_completeness == 1) %>%
  group_by(module_class, module_category, module_subcategory, module, module_name) %>%
  tally() %>%
  ungroup() %>%
  select(module, module_name, module_subcategory, n, -module_class) %>%  
  arrange(desc(n)) %>%
  filter(n > 20)

write_csv(ModulesTable, file.path(anvio_folder, "metabolic_analysis/NovCor_KEEG_Table.csv"))
kable(ModulesTable)
module module_name module_subcategory n
M00015 Proline biosynthesis, glutamate => proline Arginine and proline metabolism 30
M00844 Arginine biosynthesis, ornithine => arginine Arginine and proline metabolism 30
M00970 Proline degradation, proline => glutamate Arginine and proline metabolism 30
M00022 Shikimate pathway, phosphoenolpyruvate + erythrose-4P => chorismate Aromatic amino acid metabolism 30
M00019 Valine/isoleucine biosynthesis, pyruvate => valine / 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism 30
M00432 Leucine biosynthesis, 2-oxoisovalerate => 2-oxoisocaproate Branched-chain amino acid metabolism 30
M00570 Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism 30
M00021 Cysteine biosynthesis, serine => cysteine Cysteine and methionine metabolism 30
M00045 Histidine degradation, histidine => N-formiminoglutamate => glutamate Histidine metabolism 30
M00018 Threonine biosynthesis, aspartate => homoserine => threonine Serine and threonine metabolism 30
M00020 Serine biosynthesis, glycerate-3P => serine Serine and threonine metabolism 30
M00621 Glycine cleavage system Serine and threonine metabolism 30
M00096 C5 isoprenoid biosynthesis, non-mevalonate pathway Terpenoid backbone biosynthesis 30
M00364 C10-C20 isoprenoid biosynthesis, bacteria Terpenoid backbone biosynthesis 30
M00365 C10-C20 isoprenoid biosynthesis, archaea Terpenoid backbone biosynthesis 30
M00097 Lycopene biosynthesis, geranylgeranyl-PP => lycopene Terpenoid biosynthesis 30
M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate Central carbohydrate metabolism 30
M00002 Glycolysis, core module involving three-carbon compounds Central carbohydrate metabolism 30
M00003 Gluconeogenesis, oxaloacetate => fructose-6P Central carbohydrate metabolism 30
M00004 Pentose phosphate pathway (Pentose phosphate cycle) Central carbohydrate metabolism 30
M00005 PRPP biosynthesis, ribose 5P => PRPP Central carbohydrate metabolism 30
M00006 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P Central carbohydrate metabolism 30
M00007 Pentose phosphate pathway, non-oxidative phase, fructose 6P => ribose 5P Central carbohydrate metabolism 30
M00009 Citrate cycle (TCA cycle, Krebs cycle) Central carbohydrate metabolism 30
M00010 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate Central carbohydrate metabolism 30
M00011 Citrate cycle, second carbon oxidation, 2-oxoglutarate => oxaloacetate Central carbohydrate metabolism 30
M00632 Galactose degradation, Leloir pathway, galactose => alpha-D-glucose-1P Other carbohydrate metabolism 30
M00855 Glycogen degradation, glycogen => glucose-6P Other carbohydrate metabolism 30
M00151 Cytochrome bc1 complex respiratory unit ATP synthesis 30
M00155 Cytochrome c oxidase, prokaryotes ATP synthesis 30
M00157 F-type ATPase, prokaryotes and chloroplasts ATP synthesis 30
M00579 Phosphate acetyltransferase-acetate kinase pathway, acetyl-CoA => acetate Carbon fixation 30
M00549 UDP-Glc biosynthesis, Glc => UDP-Glc Nucleotide sugar biosynthesis 30
M00554 UDP-Gal biosynthesis, Gal => UDP-Gal Nucleotide sugar biosynthesis 30
M00793 dTDP-L-Rha biosynthesis, Glc-1P => dTDP-L-Rha Nucleotide sugar biosynthesis 30
M00909 UDP-GlcNAc biosynthesis, prokaryotes, Fru-6P => UDP-GlcNAc Nucleotide sugar biosynthesis 30
M00995 UDP-MurNAc biosynthesis, Fru-6P => UDP-MurNAc Nucleotide sugar biosynthesis 30
M01004 UDP-Galf biosynthesis, UDP-Glc => UDP-Galf Nucleotide sugar biosynthesis 30
M00888 Galactofuranan biosynthesis, decaprenyl phosphate + UDP-GlcNAc (+ dTDP-Rha/UDP-Galf) => GL-5 Other polysaccharide metabolism 30
M00086 beta-Oxidation, acyl-CoA synthesis Fatty acid metabolism 30
M00887 Mycolic acid biosynthesis, meromycolic acid + alpha-carboxyacyl-CoA + trehalose => TMM => TDM/mAGP/GMM Fatty acid metabolism 30
M00120 Coenzyme A biosynthesis, pantothenate => CoA Cofactor and vitamin metabolism 30
M00121 Heme biosynthesis, plants and bacteria, glutamate => heme Cofactor and vitamin metabolism 30
M00125 Riboflavin biosynthesis, plants and bacteria, GTP => riboflavin/FMN/FAD Cofactor and vitamin metabolism 30
M00126 Tetrahydrofolate biosynthesis, GTP => THF Cofactor and vitamin metabolism 30
M00881 Lipoic acid biosynthesis, plants and bacteria, octanoyl-ACP => dihydrolipoyl-E2/H Cofactor and vitamin metabolism 30
M00899 Thiamine salvage pathway, HMP/HET => TMP Cofactor and vitamin metabolism 30
M00916 Pyridoxal-P biosynthesis, R5P + glyceraldehyde-3P + glutamine => pyridoxal-P Cofactor and vitamin metabolism 30
M00926 Heme biosynthesis, bacteria, glutamyl-tRNA => coproporphyrin III => heme Cofactor and vitamin metabolism 30
M00048 De novo purine biosynthesis, PRPP + glutamine => IMP Purine metabolism 30
M00049 Adenine ribonucleotide biosynthesis, IMP => ADP,ATP Purine metabolism 30
M00050 Guanine ribonucleotide biosynthesis, IMP => GDP,GTP Purine metabolism 30
M00053 Deoxyribonucleotide biosynthesis, ADP/GDP/CDP/UDP => dATP/dGTP/dCTP/dUTP Purine metabolism 30
M00938 Pyrimidine deoxyribonucleotide biosynthesis, UDP => dTTP Pyrimidine metabolism 30
M00023 Tryptophan biosynthesis, chorismate => tryptophan Aromatic amino acid metabolism 29
M00555 Betaine biosynthesis, choline => betaine Serine and threonine metabolism 29
M00140 C1-unit interconversion, prokaryotes Cofactor and vitamin metabolism 29
M00028 Ornithine biosynthesis, glutamate => ornithine Arginine and proline metabolism 26
M00016 Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine Lysine metabolism 26
M00017 Methionine biosynthesis, aspartate => homoserine => methionine Cysteine and methionine metabolism 25
M00307 Pyruvate oxidation, pyruvate => acetyl-CoA Central carbohydrate metabolism 25
M01014 UDP-ManNAcA biosynthesis, UDP-GlcNAc => UDP-ManNAcA Nucleotide sugar biosynthesis 25
M00982 Methylcitrate cycle Other carbohydrate metabolism 24

Code for KEGG Manuscript Figure 4

# When using stepwise_module_completeness some modules can be = 0 for all genomes but still have pathwise_module_completeness > 0. We are removing those modules.
Modules_cleaned <- Modules %>%
  group_by(module) %>%
  filter(!all(stepwise_module_completeness == 0)) %>%
  ungroup() %>%
  mutate(across(where(is.factor), droplevels))

all_zero_modules <- setdiff(Modules$module, Modules_cleaned$module)
all_zero_subcategory <- setdiff(Modules$module_subcategory, Modules_cleaned$module_subcategory)

# Calculating mean_stepwise_module_completeness and preparing plot labels 
Modules_Means = Modules_cleaned %>% 
  group_by(modules_label, genome_name, .drop = FALSE) %>%
  summarise(mean_stepwise_module_completeness = mean(stepwise_module_completeness)) %>%
  mutate(mean_stepwise_module_completeness = coalesce(mean_stepwise_module_completeness, 0)) %>%
  mutate(species_name = str_extract(genome_name, "^[^_]+")) %>%
  mutate(module_category = str_extract(modules_label, "^[^_]+_[^_]+")) %>%
  mutate(module_category = str_replace_all(module_category, "_", " ")) %>%
  mutate(modules_final_label = str_replace(modules_label, "^[^_]+_[^_]+_", "")) %>%
  mutate_at(c("module_category", "modules_final_label"), factor) 

# Reorder based on ANI results and add styled genome names for plotting
StrainOrder <- read_csv(file.path(genomes_list_folder, "ANI_PlotOrder_30_CtuComplex_styled.csv"))
Modules_Means <- left_join(Modules_Means, StrainOrder, by = c("genome_name" = "Genome"))

Modules_Means$species_name <- sub("Cke", "Cna", Modules_Means$species_name)
Modules_Means$species_name <- sub("Cau", "Cmq", Modules_Means$species_name)
Modules_Means$species_name <- factor(Modules_Means$species_name, levels = c("Cna","Cyo","Cmq","Ctu","Ccu","Cha"))
Modules_Means$genome_name <- factor(Modules_Means$genome_name, levels = StrainOrder$Genome)
Modules_Means$styled_Genome <- factor(Modules_Means$styled_Genome, levels = rev(StrainOrder$styled_Genome))
Modules_Means$modules_final_label  <- factor(Modules_Means$modules_final_label, rev(levels(Modules_Means$modules_final_label)))
# Colors for plot
colorsSpecies <- c("#8200E7","#0A6003","#39D9E5","#002BFF","#FF00FF","#0ABF00")

# Heatmap plot
plot <- ggplot(Modules_Means, aes(x = styled_Genome, y = modules_final_label, fill = mean_stepwise_module_completeness)) + 
  geom_tile(color = "white") +
  facet_grid2(module_category ~ species_name, scales = "free", space = "free", switch = "y", labeller = label_wrap_gen(width = 31),
              #strip = strip_themed(background_x = elem_list_rect(fill = colorsSpecies))) +
              strip = strip_themed(text_x = elem_list_text(colour = colorsSpecies))) +
  labs(x = "", y = "") +
  scale_y_discrete(expand = c(0,0), position = "right") +
  scale_x_discrete(expand = c(0,0)) +
  scale_fill_viridis(direction = -1, option = "A", begin = 0.2) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.background = element_rect(colour = "black", fill = NA),
        strip.text.y.left = element_text(angle = 0, hjust = 0.5, size = 11),
        strip.text.x = element_text(size = 12, face = "bold"),
        axis.text.y = element_text(size = 11, color = "black"),
        axis.text.x = element_markdown(angle = 45, hjust = 1, size = 11, color = "black"),
        panel.spacing = unit(0, 'pt'),
        legend.position = c(1.30, -0.11), 
        legend.direction = "horizontal",
        legend.justification = c(1, 0)) +
  guides(fill = guide_colourbar(title = "Average stepwise\nmodule completeness", title.position = "top", title.hjust = 0.5, barwidth = 8))


ggsave(plot = plot, filename = file.path(anvio_folder, "metabolic_analysis/NovCor_KEEG_Figure.png"), height = 13, width = 12, device = 'png', dpi = 600)
saveRDS(plot, file = file.path(anvio_folder, "metabolic_analysis/NovCor_KEEG_Figure.rds"))
plot

Module Enrichment Analysis

The NovCor_modules.txt file generated in the previous step is used by anvi-compute-metabolic-enrichment to calculate KEGG module enrichment(Shaiber et al., 2020) across different groups of genomes. We used the -G flag to indicate the following grouping options:

  • Enriched KEGG modules in Corynebacterium nasorum versus other Corynebacterium tuberculostearicum species complex genomes.
  • Enriched KEGG modules in Corynebacterium hallucis versus other Corynebacterium tuberculostearicum species complex genomes.
conda deactivate
conda activate anvio-9

csv_file_modules="$DATA_PATH/anvio-9/metabolic_analysis/NovCor_modules.txt"
path_o="$DATA_PATH/anvio-9/metabolic_analysis/enrichment_modules"
mkdir -p "$path_o"
                                  
txt_file_groups="$DATA_PATH/genome_lists/anvio_MetabolicEnrichment_Groups_Cna.txt"
output_file="$path_o/NovCor_enriched_modules_stepwise_1_Cna.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0      

txt_file_groups="$DATA_PATH/genome_lists/anvio_MetabolicEnrichment_Groups_Cha.txt"
output_file="$path_o/NovCor_enriched_modules_stepwise_1_Cha.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0 

KO Enrichment Analysis

We use anvi-display-functions(Shaiber et al., 2020). It runs the enrichment for the desired annotations (in this case KOfam). It saves the output to a temp file that can be recovered, renamed and saved into the same path_o as the database file.

  • Table S6. Enriched KOs in Corynebacterium nasorum versus other Corynebacterium tuberculostearicum species complex genomes.
  • Table S7. Enriched KOs in Corynebacterium hallucis versus other Corynebacterium tuberculostearicum species complex genomes.
conda deactivate
conda activate anvio-9

txt_file="$DATA_PATH/genome_lists/anvio_KEGG_30_CtuComplex.txt"
path_o="$DATA_PATH/anvio-9/metabolic_analysis/enrichment_KOs"
mkdir -p "$path_o"

txt_file_groups="$DATA_PATH/genome_lists/anvio_MetabolicEnrichment_Groups_Cna.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_Cna.db
                       
txt_file_groups="$DATA_PATH/genome_lists/anvio_MetabolicEnrichment_Groups_Cha.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_Cha.db

Reading results into R

This allows to read the generated enrichment results for KEGG modules and KOs into R for further analysis and plotting.

Enrich_Mod_Cna <- read_delim(file.path(anvio_folder, "metabolic_analysis/enrichment_modules/NovCor_enriched_modules_stepwise_1_Cna.txt"), show_col_types = FALSE)
Enrich_Mod_Cha <- read_delim(file.path(anvio_folder, "metabolic_analysis/enrichment_modules/NovCor_enriched_modules_stepwise_1_Cna.txt"), show_col_types = FALSE)

Enrich_KO_Cna <- read_delim(file.path(anvio_folder, "metabolic_analysis/enrichment_KOs/NovCor_enriched_KOs_Cna.txt"), show_col_types = FALSE)
Enrich_KO_Cha <- read_delim(file.path(anvio_folder, "metabolic_analysis/enrichment_KOs/NovCor_enriched_KOs_Cha.txt"), show_col_types = FALSE)

Pangenomic level analysis

This analysis using anvio_KEGG_30_CtuComplex.txt includes the same 30 Corynebacterium genomes selected for metabolic analysis.

conda deactivate
conda activate anvio-9

txt_file="$DATA_PATH/genome_lists/anvio_KEGG_30_CtuComplex.txt"
path_i="$DATA_PATH/anvio-9/pangenomic_analysis"
mkdir -p "$path_i"

anvi-gen-genomes-storage -e $txt_file -o $path_i/NovCor-GENOMES.db --gene-caller Prodigal

anvi-db-info $path_i/NovCor-GENOMES.db

anvi-pan-genome -g $path_i/NovCor-GENOMES.db \
                -o $path_i/pangenome -n NovCor \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8

#Create a bin collection named CORE with all Core GCs for all the included genomes using the visual interface:       

anvi-display-pan -p $path_i/pangenome/NovCor-PAN.db -g $path_i/NovCor-GENOMES.db         

anvi-summarize -p $path_i/pangenome/NovCor-PAN.db \
               -g $path_i/NovCor-GENOMES.db \
               -o $path_i/summaries \
               -C CORE

Reading results into R

The NovCor_gene_clusters_summary.txt.gz file contains the summary of gene clusters generated in the pangenomic analysis, which can be used to explore specific gene clusters of interest based on the enrichment results.

NovCor_PAN <- read_delim(file.path(anvio_folder, "pangenomic_analysis/summaries/NovCor_gene_clusters_summary.txt.gz"), show_col_types = FALSE)

This function allows to explored the generated summary dataset for any GC of interest. It generates a subset of the dataset, a fasta file with the protein alignment and a identity percent distance matrix foe any selected GC:

info_for_GC <- function(Summary, output_path, GC, gene) {
  data <- Summary %>% 
    filter(gene_cluster_id == GC) %>% 
    mutate(fasta_lines = paste0(">", gene_cluster_id, "|", genome_name, "|callers_id:", gene_callers_id, "\n", aa_sequence))
  
  if (!file.exists(output_path)) {
    dir.create(output_path, recursive = TRUE)
  }
  output_fasta <- file.path(output_path, paste0(GC, "_", gene, ".fasta"))
  output_csv <- file.path(output_path, paste0(GC, "_", gene, ".csv"))
  
  writeLines(data$fasta_lines, con = output_fasta)
  write_csv(data, file = output_csv)
  
  dist = dist.alignment(read.alignment(output_fasta,"fasta"), matrix = "identity")
  mat <- 100*(1 - dist^2) # convert to identity percent since the dist matrix contains the squared root of the pairwise distances
  mat <- as.matrix(mat)
  df <- melt(mat)
  df$value <- as.numeric(df$value)
  df$Var1 <- sub(".*\\|(.*?)\\|.*", "\\1", df$Var1)
  df$Var2 <- sub(".*\\|(.*?)\\|.*", "\\1", df$Var2)
  
  if (nrow(df) > 1) {
    plot <- ggplot(df, aes(x = Var1, y = Var2, fill = value)) + 
      geom_tile() +
      scale_fill_viridis(direction = -1, option = "A", begin = 0.2, limits = c(40, 100), na.value = 'white') +
      labs(x = "", y = "") +
      scale_y_discrete(expand = c(0,0), position = "right") +
      scale_x_discrete(expand = c(0,0)) +
      theme_bw() +
      theme(axis.text.x = element_markdown(angle = 45, hjust = 1, size = 6),
            axis.text.y = element_markdown(size = 6))
    
    output_plot <- file.path(output_path, paste0(GC, "_", gene, ".jpg"))
    ggsave(plot, filename = output_plot, height = 20, width = 20, device = 'jpg', dpi = 600)
  }

  num_seqs <- nrow(data)
  num_genomes <- data$num_genomes_gene_cluster_has_hits[1]
  protein_id <- ifelse("Cgl_ATCC_13032" %in% data$genome_name, data$accession[data$genome_name == "Cgl_ATCC_13032"][1], NA)
  KO <- ifelse("Cgl_ATCC_13032" %in% data$genome_name, data$KOfam_ACC[data$genome_name == "Cgl_ATCC_13032"][1], NA)
  index <- data$combined_homogeneity_index[1]
  
  cat(protein_id, gene, GC, KO, "has", num_seqs, "aa sequences across", num_genomes, "genomes with", index ,"combined homogeneity. \n")
}
output_path <- file.path(anvio_folder, "pangenomic_analysis/individual_GCs")

info_for_GC(NovCor_PAN, output_path, "GC_00002114", "YncB")
NA YncB GC_00002114 NA has 16 aa sequences across 16 genomes with 0.9477969 combined homogeneity. 
info_for_GC(NovCor_PAN, output_path, "GC_00002607", "nagB")
NA nagB GC_00002607 NA has 4 aa sequences across 4 genomes with 0.9976981 combined homogeneity. 
Aramaki, T., Blanc-Mathieu, R., Endo, H., Ohkubo, K., Kanehisa, M., Goto, S. and Ogata, H. (2019). KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252.
Bateman, A. (2000). The pfam protein families database. Nucleic Acids Research 28, 263–266.
Buchfink, B., Xie, C. and Huson, D. H. (2014). Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60.
Delmont, T. O. and Eren, A. M. (2018). Linking pangenomes and metagenomes: the Prochlorococcus metapangenome. PeerJ 6, e4320.
Drula, E., Garron, M.-L., Dogan, S., Lombard, V., Henrissat, B. and Terrapon, N. (2021). The carbohydrate-active enzyme database: functions and literature. Nucleic Acids Research 50, D571–D577.
Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLoS Computational Biology 7, e1002195.
Eren, A. M., Kiefl, E., Shaiber, A., Veseli, I., Miller, S. E., Schechter, M. S., Fink, I., Pan, J. N., Yousef, M., Fogarty, E. C., et al. (2020). Community-led, integrated, reproducible multi-omics with anvi’o. Nature Microbiology 6, 3–6.
Galperin, M. Y., Wolf, Y. I., Makarova, K. S., Vera Alvarez, R., Landsman, D. and Koonin, E. V. (2020). COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Research 49, D274–D281.
Galperin, M. Y., Vera Alvarez, R., Karamycheva, S., Makarova, K. S., Wolf, Y., Landsman, D. and Koonin, E. V. (2024). COG database update 2024. Nucleic Acids Research 53, D356–D363.
Kanehisa, M. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28, 27–30.
Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. and Ishiguro-Watanabe, M. (2022). KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Research 51, D587–D592.
Mistry, J., Chuguransky, S., Williams, L., Qureshi, M., Salazar, G., Sonnhammer, E. L. L., Tosatto, S. C. E., Paladin, L., Raj, S., Richardson, L. J., et al. (2020). Pfam: The protein families database in 2021. Nucleic Acids Research 49, D412–D419.
Shaiber, A., Willis, A. D., Delmont, T. O., Roux, S., Chen, L.-X., Schmid, A. C., Yousef, M., Watson, A. R., Lolans, K., Esen, Ö. C., et al. (2020). Functional and genetic markers of niche partitioning among enigmatic members of the human oral microbiome. Genome Biology 21,.
Tatusov, R. L. (2000). The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28, 33–36.
Veseli, I., Chen, Y. T., Schechter, M. S., Vanni, C., Fogarty, E. C., Watson, A. R., Jabri, B., Blekhman, R., Willis, A. D., Yu, M. K., et al. (2023). Microbes with higher metabolic independence are enriched in human gut microbiomes under stress. biorxiv.
Prokka Annotations
Figures
Source Code
---
execute:
  eval: TRUE
  message: FALSE
  warning: FALSE
---

# Anvio {.unnumbered}

```{r}
library(tidyverse)
library(ggtext)
library(viridis)
library(RColorBrewer)
library(knitr)
library(ggh4x)
library(reshape2)
library(seqinr)
```

## Set up paths

```{r}
#This will open the .Renviron file for the project. You need to add a DATA_PATH environment variable with the path to your data folder.
#usethis::edit_r_environ(scope = "project") 

# This will create and open the .bash_env file for editing. You need to add a DATA_PATH environment variable with the path to your data folder.
#file.create(".bash_env")
#file.edit(".bash_env") 

FOLDER_PATH <- Sys.getenv("DATA_PATH") 
genomes_list_folder <- file.path(FOLDER_PATH, "genome_lists")
anvio_folder <- file.path(FOLDER_PATH, "anvio-9")
```

This notebook contains analyses using anvi'o version 9 [@eren2020; @delmont2018]. For more in-depth information about anvi'o go to their official online site here: <https://anvio.org/>

We first source the `.bash_env` file to set up the environment variables.

```{bash, eval=FALSE}
source .bash_env
```

## Functional annotations

We use as input (`path_i`) the anvi'o contigs database files (`.db`) that were generated in [[Methods_Prokka_Annotations]{.underline}](Methods_Prokka_Annotations.html).

### COG annotation

First we downloaded and set up the NCBI's Clusters of Orthologous Groups database[@Tatusov2000, @Galperin2021, @galperin2024] using `anvi-setup-ncbi-cogs`. The default is the latest version, which is COG24, meaning that anvi'o will use the NCBI's 2024 release of COGs to setup the database.

The `anvi-run-ncbi-cogs` command was used to annotate the contig databases. We used DIAMOND as searching method[@buchfink2014] with the recommended "sensitive" option.

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-ncbi-cogs -T 8 --search-with diamond -c $file;
done
```

### KEGG annotation

First we downloaded and set up the KEGG KOfam[@Kanehisa2000; @Kanehisa2023] database using `anvi-setup-kegg-data`. We used the snapshot of the KEGG database dated 2025-11-07 (hash 68221bd12b3).

The program `anvi-run-kegg-kofams` annotates genes in a given anvi'o contigs database with KEGG Orthology (KO) numbers via hits to the KEGG KOfam database [@aramaki2019]. We used `-E` lower than 1e-15.

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-kegg-kofams -T 8 --log-bitscores -E 1e-15 --just-do-it -c $file;
done
```

The option `--log-bitscores` records the bit scores values. They get saved to the main folder for the repo, so we moved them to a new subfolder.

```{r}
path_i <- "."
path_o <- file.path(anvio_folder, "KEGG_bitscores")
dir.create(path_o, recursive = TRUE, showWarnings = FALSE)
files_to_move <- list.files(path_i, pattern = "_bitscores.txt$", full.names = TRUE)

for (file in files_to_move) {
  new_path <- file.path(path_o, basename(file))
  print(new_path)
  file.rename(file, new_path)
}
```

### PFAM annotation

First you need to download and set up the PFAM database[@Bateman2000; @Mistry2021] using `anvi-setup-pfams`: this is something you will do only once. We used Pfam version 38.1 (2025-09)

The program `anvi-run-kegg-pfam` annotates genes in a given anvi'o contigs database with the PFAM database.

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-pfams -T 8 -c $file;
done
```

### CAZyme Annotation

First you need to download and set up the CAZyme database [@drula2021] using `anvi-setup-cazymes`: this is something you will do only once. We used CAZyme version 12.

The program `anvi-run-cazymes` annotates genes in your contigs-db with the dbCAN CAZyme database.

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-cazymes -T 8 -c $file;
done
```

### Getting HMMs

When you run the `anvi-run-hmms` command the occurrence of bacterial single-copy genes across your contigs is added to your contigs database [@Eddy2011].

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-run-hmms -T 8 -c $file --just-do-it;
done
```

### Getting Metabolic Networks

The program `anvi-reaction-network` is used to store a metabolic reaction-network in each contigs-db. The network consists of data on biochemical reactions predicted to be encoded by each genome, referencing the KEGG Orthology (KO) and ModelSEED Biochemistry databases.

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"

for file in $path_i/*.db; do
    anvi-reaction-network -c $file;
done
```

### Annotation Summaries

This generates basic stats on the annotations:

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"
path_o="$DATA_PATH/anvio-9/annotation_summaries"
mkdir -p "$path_o"

for file in $path_i/*.db; do
    FILENAME=`basename ${file%.*}`
    anvi-display-contigs-stats $file --report-as-text -o $path_o/stats_$FILENAME.txt;
done
```

This provides basic info on the contigs databases:

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"
path_o="$DATA_PATH/anvio-9/annotation_summaries"

for file in $path_i/*.db; do
    anvi-db-info $file &>> $path_o/anvi-db-info_log.txt;
done
```

## Metabolic Analysis

### Estimation of KEGG module completeness

Using `anvi-estimate-metabolism` we predicted KEGG module completeness as described in [@veseli2023] for the the organisms listed in `anvio_KEGG_30_CtuComplex.txt`. This list includes the name of the 30 *Corynebacterium* genomes selected for metabolic analysis and the path to the corresponding anvi'o .db files.

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

txt_file="$DATA_PATH/genome_lists/anvio_KEGG_30_CtuComplex.txt"
prefix=NovCor
path_o="$DATA_PATH/anvio-9/metabolic_analysis"
mkdir -p "$path_o"

anvi-estimate-metabolism -e $txt_file \
                         -O $path_o/$prefix --add-copy-number --output-modes hits,modules
```

The generated `NovCor_modules.txt` is included as **Table S4** in the manuscript. It contains the KEGG module completeness values for each genome, which are used to generate the KEGG module summary **Table S5**, and the heatmap in **Figure 4**.

#### Code for KEGG Manuscript Table S5

```{r}
# Read modules data and make appropriate labeling columns
Modules <- read_delim(file.path(anvio_folder, "metabolic_analysis/NovCor_modules.txt"), show_col_types = FALSE, col_types = cols(unique_enzymes_hit_counts = col_character(), gene_caller_ids_in_module = col_character())) %>% 
  mutate(module_class_short = paste0("(", substr(module_class, 1, 1), ")")) %>%
  unite(modules_label, c(module_class_short, module_category, module_subcategory), sep = "_", remove = FALSE) %>% 
  mutate_at(c("genome_name", "module", "module_name", "module_category", "module_subcategory", "modules_label"), factor)

# Summary tally table of fully stepwise complete modules by species. Filter to modules present in at least 20 genomes
ModulesTable <- Modules %>%
  filter(stepwise_module_completeness == 1) %>%
  group_by(module_class, module_category, module_subcategory, module, module_name) %>%
  tally() %>%
  ungroup() %>%
  select(module, module_name, module_subcategory, n, -module_class) %>%  
  arrange(desc(n)) %>%
  filter(n > 20)

write_csv(ModulesTable, file.path(anvio_folder, "metabolic_analysis/NovCor_KEEG_Table.csv"))
kable(ModulesTable)
```

#### Code for KEGG Manuscript Figure 4

```{r}
# When using stepwise_module_completeness some modules can be = 0 for all genomes but still have pathwise_module_completeness > 0. We are removing those modules.
Modules_cleaned <- Modules %>%
  group_by(module) %>%
  filter(!all(stepwise_module_completeness == 0)) %>%
  ungroup() %>%
  mutate(across(where(is.factor), droplevels))

all_zero_modules <- setdiff(Modules$module, Modules_cleaned$module)
all_zero_subcategory <- setdiff(Modules$module_subcategory, Modules_cleaned$module_subcategory)

# Calculating mean_stepwise_module_completeness and preparing plot labels 
Modules_Means = Modules_cleaned %>% 
  group_by(modules_label, genome_name, .drop = FALSE) %>%
  summarise(mean_stepwise_module_completeness = mean(stepwise_module_completeness)) %>%
  mutate(mean_stepwise_module_completeness = coalesce(mean_stepwise_module_completeness, 0)) %>%
  mutate(species_name = str_extract(genome_name, "^[^_]+")) %>%
  mutate(module_category = str_extract(modules_label, "^[^_]+_[^_]+")) %>%
  mutate(module_category = str_replace_all(module_category, "_", " ")) %>%
  mutate(modules_final_label = str_replace(modules_label, "^[^_]+_[^_]+_", "")) %>%
  mutate_at(c("module_category", "modules_final_label"), factor) 

# Reorder based on ANI results and add styled genome names for plotting
StrainOrder <- read_csv(file.path(genomes_list_folder, "ANI_PlotOrder_30_CtuComplex_styled.csv"))
Modules_Means <- left_join(Modules_Means, StrainOrder, by = c("genome_name" = "Genome"))

Modules_Means$species_name <- sub("Cke", "Cna", Modules_Means$species_name)
Modules_Means$species_name <- sub("Cau", "Cmq", Modules_Means$species_name)
Modules_Means$species_name <- factor(Modules_Means$species_name, levels = c("Cna","Cyo","Cmq","Ctu","Ccu","Cha"))
Modules_Means$genome_name <- factor(Modules_Means$genome_name, levels = StrainOrder$Genome)
Modules_Means$styled_Genome <- factor(Modules_Means$styled_Genome, levels = rev(StrainOrder$styled_Genome))
Modules_Means$modules_final_label  <- factor(Modules_Means$modules_final_label, rev(levels(Modules_Means$modules_final_label)))
```

```{r, fig.align="center",fig.width=12, fig.height=13}
# Colors for plot
colorsSpecies <- c("#8200E7","#0A6003","#39D9E5","#002BFF","#FF00FF","#0ABF00")

# Heatmap plot
plot <- ggplot(Modules_Means, aes(x = styled_Genome, y = modules_final_label, fill = mean_stepwise_module_completeness)) + 
  geom_tile(color = "white") +
  facet_grid2(module_category ~ species_name, scales = "free", space = "free", switch = "y", labeller = label_wrap_gen(width = 31),
              #strip = strip_themed(background_x = elem_list_rect(fill = colorsSpecies))) +
              strip = strip_themed(text_x = elem_list_text(colour = colorsSpecies))) +
  labs(x = "", y = "") +
  scale_y_discrete(expand = c(0,0), position = "right") +
  scale_x_discrete(expand = c(0,0)) +
  scale_fill_viridis(direction = -1, option = "A", begin = 0.2) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.background = element_rect(colour = "black", fill = NA),
        strip.text.y.left = element_text(angle = 0, hjust = 0.5, size = 11),
        strip.text.x = element_text(size = 12, face = "bold"),
        axis.text.y = element_text(size = 11, color = "black"),
        axis.text.x = element_markdown(angle = 45, hjust = 1, size = 11, color = "black"),
        panel.spacing = unit(0, 'pt'),
        legend.position = c(1.30, -0.11), 
        legend.direction = "horizontal",
        legend.justification = c(1, 0)) +
  guides(fill = guide_colourbar(title = "Average stepwise\nmodule completeness", title.position = "top", title.hjust = 0.5, barwidth = 8))


ggsave(plot = plot, filename = file.path(anvio_folder, "metabolic_analysis/NovCor_KEEG_Figure.png"), height = 13, width = 12, device = 'png', dpi = 600)
saveRDS(plot, file = file.path(anvio_folder, "metabolic_analysis/NovCor_KEEG_Figure.rds"))
plot
```

### Module Enrichment Analysis

The `NovCor_modules.txt` file generated in the previous step is used by `anvi-compute-metabolic-enrichment` to calculate KEGG module enrichment[@shaiber2020] across different groups of genomes. We used the `-G` flag to indicate the following grouping options:

-   Enriched KEGG modules in *Corynebacterium nasorum* versus other *Corynebacterium tuberculostearicum* species complex genomes.
-   Enriched KEGG modules in *Corynebacterium hallucis* versus other *Corynebacterium tuberculostearicum* species complex genomes.

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

csv_file_modules="$DATA_PATH/anvio-9/metabolic_analysis/NovCor_modules.txt"
path_o="$DATA_PATH/anvio-9/metabolic_analysis/enrichment_modules"
mkdir -p "$path_o"
                                  
txt_file_groups="$DATA_PATH/genome_lists/anvio_MetabolicEnrichment_Groups_Cna.txt"
output_file="$path_o/NovCor_enriched_modules_stepwise_1_Cna.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0      

txt_file_groups="$DATA_PATH/genome_lists/anvio_MetabolicEnrichment_Groups_Cha.txt"
output_file="$path_o/NovCor_enriched_modules_stepwise_1_Cha.txt"
anvi-compute-metabolic-enrichment -M "$csv_file_modules" \
                                  -G "$txt_file_groups" \
                                  -o "$output_file" \
                                  --use-stepwise-completeness \
                                  --module-completion-threshold 1.0 
```

### KO Enrichment Analysis

We use `anvi-display-functions`[@shaiber2020]. It runs the enrichment for the desired annotations (in this case KOfam). It saves the output to a temp file that can be recovered, renamed and saved into the same `path_o` as the database file.

-   **Table S6.** Enriched KOs in *Corynebacterium nasorum* versus other *Corynebacterium tuberculostearicum* species complex genomes.
-   **Table S7.** Enriched KOs in *Corynebacterium hallucis*  versus other *Corynebacterium tuberculostearicum* species complex genomes.

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

txt_file="$DATA_PATH/genome_lists/anvio_KEGG_30_CtuComplex.txt"
path_o="$DATA_PATH/anvio-9/metabolic_analysis/enrichment_KOs"
mkdir -p "$path_o"

txt_file_groups="$DATA_PATH/genome_lists/anvio_MetabolicEnrichment_Groups_Cna.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_Cna.db
                       
txt_file_groups="$DATA_PATH/genome_lists/anvio_MetabolicEnrichment_Groups_Cha.txt"
anvi-display-functions -e $txt_file \
                       --groups-txt $txt_file_groups \
                       --annotation-source KOfam \
                       --profile-db $path_o/KOfam_Cha.db
```

### Reading results into R

This allows to read the generated enrichment results for KEGG modules and KOs into R for further analysis and plotting.
```{r}
Enrich_Mod_Cna <- read_delim(file.path(anvio_folder, "metabolic_analysis/enrichment_modules/NovCor_enriched_modules_stepwise_1_Cna.txt"), show_col_types = FALSE)
Enrich_Mod_Cha <- read_delim(file.path(anvio_folder, "metabolic_analysis/enrichment_modules/NovCor_enriched_modules_stepwise_1_Cna.txt"), show_col_types = FALSE)

Enrich_KO_Cna <- read_delim(file.path(anvio_folder, "metabolic_analysis/enrichment_KOs/NovCor_enriched_KOs_Cna.txt"), show_col_types = FALSE)
Enrich_KO_Cha <- read_delim(file.path(anvio_folder, "metabolic_analysis/enrichment_KOs/NovCor_enriched_KOs_Cha.txt"), show_col_types = FALSE)
```

## Pangenomic level analysis

This analysis using `anvio_KEGG_30_CtuComplex.txt` includes the same 30 *Corynebacterium* genomes selected for metabolic analysis.

```{bash, eval=FALSE}
conda deactivate
conda activate anvio-9

txt_file="$DATA_PATH/genome_lists/anvio_KEGG_30_CtuComplex.txt"
path_i="$DATA_PATH/anvio-9/pangenomic_analysis"
mkdir -p "$path_i"

anvi-gen-genomes-storage -e $txt_file -o $path_i/NovCor-GENOMES.db --gene-caller Prodigal

anvi-db-info $path_i/NovCor-GENOMES.db

anvi-pan-genome -g $path_i/NovCor-GENOMES.db \
                -o $path_i/pangenome -n NovCor \
                --use-ncbi-blast --mcl-inflation 10  --num-threads 8

#Create a bin collection named CORE with all Core GCs for all the included genomes using the visual interface:       

anvi-display-pan -p $path_i/pangenome/NovCor-PAN.db -g $path_i/NovCor-GENOMES.db         

anvi-summarize -p $path_i/pangenome/NovCor-PAN.db \
               -g $path_i/NovCor-GENOMES.db \
               -o $path_i/summaries \
               -C CORE
```

### Reading results into R

The `NovCor_gene_clusters_summary.txt.gz` file contains the summary of gene clusters generated in the pangenomic analysis, which can be used to explore specific gene clusters of interest based on the enrichment results.

```{r}
NovCor_PAN <- read_delim(file.path(anvio_folder, "pangenomic_analysis/summaries/NovCor_gene_clusters_summary.txt.gz"), show_col_types = FALSE)
```

This function allows to explored the generated summary dataset for any GC of interest. It generates a subset of the dataset, a fasta file with the protein alignment and a identity percent distance matrix foe any selected GC:
```{r}
info_for_GC <- function(Summary, output_path, GC, gene) {
  data <- Summary %>% 
    filter(gene_cluster_id == GC) %>% 
    mutate(fasta_lines = paste0(">", gene_cluster_id, "|", genome_name, "|callers_id:", gene_callers_id, "\n", aa_sequence))
  
  if (!file.exists(output_path)) {
    dir.create(output_path, recursive = TRUE)
  }
  output_fasta <- file.path(output_path, paste0(GC, "_", gene, ".fasta"))
  output_csv <- file.path(output_path, paste0(GC, "_", gene, ".csv"))
  
  writeLines(data$fasta_lines, con = output_fasta)
  write_csv(data, file = output_csv)
  
  dist = dist.alignment(read.alignment(output_fasta,"fasta"), matrix = "identity")
  mat <- 100*(1 - dist^2) # convert to identity percent since the dist matrix contains the squared root of the pairwise distances
  mat <- as.matrix(mat)
  df <- melt(mat)
  df$value <- as.numeric(df$value)
  df$Var1 <- sub(".*\\|(.*?)\\|.*", "\\1", df$Var1)
  df$Var2 <- sub(".*\\|(.*?)\\|.*", "\\1", df$Var2)
  
  if (nrow(df) > 1) {
    plot <- ggplot(df, aes(x = Var1, y = Var2, fill = value)) + 
      geom_tile() +
      scale_fill_viridis(direction = -1, option = "A", begin = 0.2, limits = c(40, 100), na.value = 'white') +
      labs(x = "", y = "") +
      scale_y_discrete(expand = c(0,0), position = "right") +
      scale_x_discrete(expand = c(0,0)) +
      theme_bw() +
      theme(axis.text.x = element_markdown(angle = 45, hjust = 1, size = 6),
            axis.text.y = element_markdown(size = 6))
    
    output_plot <- file.path(output_path, paste0(GC, "_", gene, ".jpg"))
    ggsave(plot, filename = output_plot, height = 20, width = 20, device = 'jpg', dpi = 600)
  }

  num_seqs <- nrow(data)
  num_genomes <- data$num_genomes_gene_cluster_has_hits[1]
  protein_id <- ifelse("Cgl_ATCC_13032" %in% data$genome_name, data$accession[data$genome_name == "Cgl_ATCC_13032"][1], NA)
  KO <- ifelse("Cgl_ATCC_13032" %in% data$genome_name, data$KOfam_ACC[data$genome_name == "Cgl_ATCC_13032"][1], NA)
  index <- data$combined_homogeneity_index[1]
  
  cat(protein_id, gene, GC, KO, "has", num_seqs, "aa sequences across", num_genomes, "genomes with", index ,"combined homogeneity. \n")
}
```

```{r}
output_path <- file.path(anvio_folder, "pangenomic_analysis/individual_GCs")

info_for_GC(NovCor_PAN, output_path, "GC_00002114", "YncB")
info_for_GC(NovCor_PAN, output_path, "GC_00002607", "nagB")
```