Supplemental Methods: PPanGGOLiN

PPanGGOLiN v1.1.141 (Gautreau et al., 2020) was used for this analysis.

1 Import Prokka annotations and anvi’o 7 clusters into PPanGGOLiN

In order to import the anvi’o clustering into PPanGGOLiN we need:

A .tsv file listing in the first column the gene family names, and in the second column the gene ID that is used in the annotation files.
The annotated genomes with gene IDs that match the ones listed in the previous .tsv file.

We read here the files generated using anvi-summarize that link each gene to gene clusters, genomes and functions:

AnvioSummary_Cps <- read.delim("/analysis_Anvio7/PAN_Cps_gene_clusters_summary.txt.gz")

AnvioSummary_Cpr <- read.delim("/analysis_Anvio7/PAN_Cpr_gene_clusters_summary.txt.gz")

AnvioSummary_Ctu <- read.delim("/analysis_Anvio7/PAN_Ctu_gene_clusters_summary.txt.gz")

AnvioSummary_Cac <- read.delim("/analysis_Anvio7/PAN_Cac_gene_clusters_summary.txt.gz")

Here we subset the info needed to generate the .tsv files. We clean up the “RefSeq_” prefix that was added to the genome names to avoid anvi’o complains about genome names starting with a number. We need to remove the prefix to match the info inside the new .gffs we are about to export:

AnvioClusters_Cps <- AnvioSummary_Cps %>% unite(new_id, genome_name:gene_callers_id, sep = "___", remove = FALSE) %>% mutate(new_id = gsub("RefSeq_", "", new_id))
AnvioClusters_Cps <- select(AnvioClusters_Cps, c(gene_cluster_id, new_id))

AnvioClusters_Cpr <- AnvioSummary_Cpr %>% unite(new_id, genome_name:gene_callers_id, sep = "___", remove = FALSE) %>% mutate(new_id = gsub("RefSeq_", "", new_id))
AnvioClusters_Cpr <- select(AnvioClusters_Cpr, c(gene_cluster_id, new_id))

AnvioClusters_Ctu <- AnvioSummary_Ctu %>% unite(new_id, genome_name:gene_callers_id, sep = "___", remove = FALSE) %>% mutate(new_id = gsub("RefSeq_", "", new_id))
AnvioClusters_Ctu <- select(AnvioClusters_Ctu, c(gene_cluster_id, new_id))

AnvioClusters_Cac <- AnvioSummary_Cac %>% unite(new_id, genome_name:gene_callers_id, sep = "___", remove = FALSE) %>% mutate(new_id = gsub("RefSeq_", "", new_id))
AnvioClusters_Cac <- select(AnvioClusters_Cac, c(gene_cluster_id, new_id))

Now we write those as .tsv files:

write_delim(AnvioClusters_Cps, "/analysis_PPanGGOLiN/Cps_PAN_SUMMARY.tsv", col_names=FALSE)

write_delim(AnvioClusters_Cpr, "/analysis_PPanGGOLiN/Cpr_PAN_SUMMARY.tsv", col_names=FALSE)

write_delim(AnvioClusters_Ctu, "/analysis_PPanGGOLiN/Ctu_PAN_SUMMARY.tsv", col_names=FALSE)

write_delim(AnvioClusters_Cac, "/analysis_PPanGGOLiN/Cac_PAN_SUMMARY.tsv", col_names=FALSE)

IMPORTANT: The gene_callers_id provided by anvi’o don’t match the original ones in the prokka annotation. The prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py. By default prokka annotates also tRNAs, rRNAs and CRISPR regions. However, gff_parser.py will only utilize open reading frames reported by Prodigal in the prokka output in order to be compatible with the pangenomic anvi’o pipeline. While parsing new gene_callers_id are generated only for the ORFs that will be imported into anvi’o. Fortunately, anvi-get-sequences-for-gene-calls can be used to export new .gff files with only the ORFs and these can be used for PPanGGOLiN, instead of the original prokka ones:

#conda activate anvio-7.1
mkdir -p "/analysis_Anvio7/Pangenomic_Results/Exported_gffs"

path_f="/analysis_Anvio7/Contigs_db"
path_o="/analysis_Anvio7/Pangenomic_Results/Exported_gffs"

for file in $path_f/*.db; do
    FILENAME=`basename ${file%.*}`
    anvi-get-sequences-for-gene-calls -c $file --export-gff3 -o $path_o/$FILENAME.gff
      
done

We create two tab delimited text files per species. Both of the text files will have a two column structure. The first column is the name to be utilized by PPanGGOLiNn for each genome per line, and the second column contains the paths to the corresponding file. One text file is for the anvi’o exported .gff and the other is for the reformatted .fasta files. The extension by default was .fa for the reformatted anvi’o files, so we had to convert them all to .fasta for PPanGGOLiN to accept them. We need to create the text files to run the annotate subcommand.

#conda activate PPanGGOLiN
ppanggolin annotate --anno "/analysis_PPanGGOLiN/Cac_AnvioExported_gff.txt" --fasta "/analysis_PPanGGOLiN/Cac_AnvioReformatted_fasta.txt" -o "/analysis_PPanGGOLiN/Cac_anno_output" --basename Cac -f 

ppanggolin annotate --anno "/analysis_PPanGGOLiN/Cpr_AnvioExported_gff.txt" --fasta "/analysis_PPanGGOLiN/Cpr_AnvioReformatted_fasta.txt" -o "/analysis_PPanGGOLiN/Cpr_anno_output" --basename Cpr -f 

ppanggolin annotate --anno "/analysis_PPanGGOLiN/Cps_AnvioExported_gff.txt" --fasta "/analysis_PPanGGOLiN/Cps_AnvioReformatted_fasta.txt" -o "/analysis_PPanGGOLiN/Cps_anno_output" --basename Cps -f 

ppanggolin annotate --anno "/analysis_PPanGGOLiN/Ctu_AnvioExported_gff.txt" --fasta "/analysis_PPanGGOLiN/Ctu_AnvioReformatted_fasta.txt" -o "/analysis_PPanGGOLiN/Ctu_anno_output" --basename Ctu -f

For the cluster subcommand we use the .tsv files created before from the anvi-summarize output.

#conda activate PPanGGOLiN
ppanggolin cluster -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" --clusters "/analysis_PPanGGOLiN/Cac_PAN_SUMMARY.tsv" --infer_singletons

ppanggolin cluster -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" --clusters "/analysis_PPanGGOLiN/Cps_PAN_SUMMARY.tsv" --infer_singletons

ppanggolin cluster -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" --clusters "/analysis_PPanGGOLiN/Ctu_PAN_SUMMARY.tsv" --infer_singletons

ppanggolin cluster -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" --clusters "/analysis_PPanGGOLiN/Cpr_PAN_SUMMARY.tsv" --infer_singletons

1.1 Graphing and Partitioning

The graph subcommand has only a single other option, which is ‘-r’ or ‘–remove_high_copy_number’. If used, it will remove the gene families that are too duplicated in your genomes. This is useful if you want to visualize your pangenome afterward and want to remove the biggest hubs to have a clearer view. It can also be used to limit the influence of very duplicated genes such as transposases or ABC transporters in the partition step.

#conda activate PPanGGOLiN
ppanggolin graph -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5"
ppanggolin graph -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5"
ppanggolin graph -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5"
ppanggolin graph -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5"

We let the partition subcommand statistical criterion find the optimal number of partitions.

#conda activate PPanGGOLiN
ppanggolin partition -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5"
ppanggolin partition -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5"
ppanggolin partition -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5"
ppanggolin partition -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5"

1.2 Writing outputs

For details on PPanGGOLiN outputs see: https://github.com/labgem/PPanGGOLiN/wiki/Outputs

#conda activate PPanGGOLiN
ppanggolin write -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" -o "/analysis_PPanGGOLiN/Output_Cac" --light_gexf --gexf  --csv --Rtab --stats --partitions --projection --families_tsv -f

ppanggolin write -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" -o "/analysis_PPanGGOLiN/Output_Cps" --light_gexf --gexf  --csv --Rtab --stats --partitions --projection --families_tsv -f

ppanggolin write -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" -o "/analysis_PPanGGOLiN/Output_Ctu" --light_gexf --gexf  --csv --Rtab --stats --partitions --projection --families_tsv -f

ppanggolin write -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" -o "/analysis_PPanGGOLiN/Output_Cpr" --light_gexf --gexf  --csv --Rtab --stats --partitions --projection --families_tsv -f

1.3 Rarefaction curves

#conda activate PPanGGOLiN
ppanggolin rarefaction -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" --force -o "/analysis_PPanGGOLiN/Output_Cac_rarefaction"

ppanggolin rarefaction -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" --force -o "/analysis_PPanGGOLiN/Output_Cps_rarefaction"

ppanggolin rarefaction -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" --force -o "/analysis_PPanGGOLiN/Output_Ctu_rarefaction"

ppanggolin rarefaction -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" --force -o "/analysis_PPanGGOLiN/Output_Cpr_rarefaction"

1.4 Finding Regions of Genome Plasticity

For details on RGPs see: https://github.com/labgem/PPanGGOLiN/wiki/Regions-of-Genome-Plasticity

The spots can be labeled by:

#conda activate PPanGGOLiN
ppanggolin rgp -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" 
ppanggolin rgp -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" 
ppanggolin rgp -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" 
ppanggolin rgp -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5"

ppanggolin spot -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" --draw_hotspots -o "/analysis_PPanGGOLiN/Output_Cac_spots/spots_ID" -f
ppanggolin spot -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" --draw_hotspots -o "/analysis_PPanGGOLiN/Output_Cps_spots/spots_ID" -f
ppanggolin spot -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" --draw_hotspots -o "/analysis_PPanGGOLiN/Output_Ctu_spots/spots_ID" -f
ppanggolin spot -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" --draw_hotspots -o "/analysis_PPanGGOLiN/Output_Cpr_spots/spots_ID" -f

ppanggolin write -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" -o "/analysis_PPanGGOLiN/Output_Cac_spots" --regions --spots -f
ppanggolin write -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" -o "/analysis_PPanGGOLiN/Output_Cps_spots" --regions --spots -f
ppanggolin write -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" -o "/analysis_PPanGGOLiN/Output_Ctu_spots" --regions --spots -f
ppanggolin write -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" -o "/analysis_PPanGGOLiN/Output_Cpr_spots" --regions --spots -f

1.5 Summary of used parameters

The info subcommand indicates, for each steps of the analysis, the PPanGGOLiN parameters that were used and the source of the data if appropriate.

#conda activate PPanGGOLiN
ppanggolin info -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" --parameters --content --status

ppanggolin info -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" --parameters --content --status

ppanggolin info -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" --parameters --content --status

ppanggolin info -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" --parameters --content --status

2 Import PPanGGOLiN Partitions into anvi’o

The pangenomeGraph.gexf file can be opened with Gephi, and the Data Table exported as a gephi_table.csv. This can be used to easily explore the PPanGGOLiN output with the same gene IDs anvi’o uses. The matrix.csv should be used with caution, since records PPanGGOLiN internal gene IDs, that are different from the anvi’o ones.

Gephi <-  read.delim("analysis_PPanGGOLiN/Output_Cps/Cps_gephi_table.csv", ",", escape_double = FALSE, trim_ws = TRUE)

Gephi <-  read.delim("analysis_PPanGGOLiN/Output_Cpr/Cpr_gephi_table.csv", ",", escape_double = FALSE, trim_ws = TRUE)

Gephi <-  read.delim("analysis_PPanGGOLiN/Output_Ctu/Ctu_gephi_table.csv", ",", escape_double = FALSE, trim_ws = TRUE)

Gephi <-  read.delim("analysis_PPanGGOLiN/Output_Cac/Cac_gephi_table.csv", ",", escape_double = FALSE, trim_ws = TRUE)

2.1 Persistent vs. Accessory based on PPanGGOLiN

PPanGGOLiN generates a “partitions” folder as part of its output. We read the persistent, cloud and shell files from this output and add a column labeling them. We combine the three in one partitions.txt file that can be imported into anvi’o:

Cps_persistent <- read.delim("/analysis_PPanGGOLiN/Output_Cps/partitions/persistent.txt", col.names = "GC")
Cps_persistent$PPanGGOLiN <- "persistent"
Cps_cloud <- read.delim("/analysis_PPanGGOLiN/Output_Cps/partitions/cloud.txt", col.names = "GC")
Cps_cloud$PPanGGOLiN <- "cloud"
Cps_shell <- read.delim("/analysis_PPanGGOLiN/Output_Cps/partitions/shell.txt", col.names = "GC")
Cps_shell$PPanGGOLiN <- "shell"
Cps_partitions <- rbind(Cps_persistent, Cps_cloud, Cps_shell)
Cps_partitions <- Cps_partitions[grep("GC_00", Cps_partitions$GC),]
write_delim(Cps_partitions, "/analysis_PPanGGOLiN/Output_Cps/partitions/partitions.txt", delim = "\t")

Cpr_persistent <- read.delim("/analysis_PPanGGOLiN/Output_Cpr/partitions/persistent.txt", col.names = "GC")
Cpr_persistent$PPanGGOLiN <- "persistent"
Cpr_cloud <- read.delim("/analysis_PPanGGOLiN/Output_Cpr/partitions/cloud.txt", col.names = "GC")
Cpr_cloud$PPanGGOLiN <- "cloud"
Cpr_shell <- read.delim("/analysis_PPanGGOLiN/Output_Cpr/partitions/shell.txt", col.names = "GC")
Cpr_shell$PPanGGOLiN <- "shell"
Cpr_partitions <- rbind(Cpr_persistent, Cpr_cloud, Cpr_shell)
Cpr_partitions <- Cpr_partitions[grep("GC_", Cpr_partitions$GC),]
write_delim(Cpr_partitions, "/analysis_PPanGGOLiN/Output_Cpr/partitions/partitions.txt", delim = "\t")

Ctu_persistent <- read.delim("/analysis_PPanGGOLiN/Output_Ctu/partitions/persistent.txt", col.names = "GC")
Ctu_persistent$PPanGGOLiN <- "persistent"
Ctu_cloud <- read.delim("/analysis_PPanGGOLiN/Output_Ctu/partitions/cloud.txt", col.names = "GC")
Ctu_cloud$PPanGGOLiN <- "cloud"
Ctu_shell <- read.delim("/analysis_PPanGGOLiN/Output_Ctu/partitions/shell.txt", col.names = "GC")
Ctu_shell$PPanGGOLiN <- "shell"
Ctu_partitions <- rbind(Ctu_persistent, Ctu_cloud, Ctu_shell)
Ctu_partitions <- Ctu_partitions[grep("GC_", Ctu_partitions$GC),]
write_delim(Ctu_partitions, "/analysis_PPanGGOLiN/Output_Ctu/partitions/partitions.txt", delim = "\t")

Cac_persistent <- read.delim("/analysis_PPanGGOLiN/Output_Cac/partitions/persistent.txt", col.names = "GC")
Cac_persistent$PPanGGOLiN <- "persistent"
Cac_cloud <- read.delim("/analysis_PPanGGOLiN/Output_Cac/partitions/cloud.txt", col.names = "GC")
Cac_cloud$PPanGGOLiN <- "cloud"
Cac_shell <- read.delim("/analysis_PPanGGOLiN/Output_Cac/partitions/shell.txt", col.names = "GC")
Cac_shell$PPanGGOLiN <- "shell"
Cac_partitions <- rbind(Cac_persistent, Cac_cloud, Cac_shell)
Cac_partitions <- Cac_partitions[grep("GC_", Cac_partitions$GC),]
write_delim(Cac_partitions, "/analysis_PPanGGOLiN/Output_Cac/partitions/partitions.txt", delim = "\t")

This will import the PPanGGOLiN partitions (from partitions.txt) as a collection of bins named PPanGGOLiNpartitions in each profile-db:

#conda activate anvio-7.1
anvi-import-collection "/analysis_Anvio7/Pangenomic_Results/Cps/partitions_Cps.txt" \
                       -p "/analysis_Anvio7/Pangenomic_Results/Cps/PAN_Cps-PAN.db" \
                       -C "PPanGGOLiNpartitions" --bins-info "/analysis_Anvio7/main_outputs/PPanGGOLiN_bin_colors.txt"
anvi-import-collection "/analysis_Anvio7/Pangenomic_Results/Cpr/partitions_Cpr.txt" \
                       -p "/analysis_Anvio7/Pangenomic_Results/Cpr/PAN_Cpr-PAN.db" \
                       -C "PPanGGOLiNpartitions" --bins-info "/analysis_Anvio7/main_outputs/PPanGGOLiN_bin_colors.txt"
anvi-import-collection "/analysis_Anvio7/Pangenomic_Results/Ctu/partitions_Ctu.txt" \
                       -p "/analysis_Anvio7/Pangenomic_Results/Ctu/PAN_Ctu-PAN.db" \
                       -C "PPanGGOLiNpartitions" --bins-info "/analysis_Anvio7/main_outputs/PPanGGOLiN_bin_colors.txt"
anvi-import-collection "/analysis_Anvio7/Pangenomic_Results/Cac/partitions_Cac.txt" \
                       -p "/analysis_Anvio7/Pangenomic_Results/Cac/PAN_Cac-PAN.db" \
                       -C "PPanGGOLiNpartitions" --bins-info "/analysis_Anvio7/main_outputs/PPanGGOLiN_bin_colors.txt"

We can also import this data (from partitions.txt) as a PPanGGOLiN layer into in each profile-db:

#conda activate anvio-7.1
anvi-import-misc-data "/analysis_Anvio7/Pangenomic_Results/Cps/partitions_Cps.txt" \
                       -p "/analysis_Anvio7/Pangenomic_Results/Cps/PAN_Cps-PAN.db" -t items
anvi-import-misc-data "/analysis_Anvio7/Pangenomic_Results/Cpr/partitions_Cpr.txt" \
                       -p "/analysis_Anvio7/Pangenomic_Results/Cpr/PAN_Cpr-PAN.db" -t items
anvi-import-misc-data "/analysis_Anvio7/Pangenomic_Results/Ctu/partitions_Ctu.txt" \
                       -p "/analysis_Anvio7/Pangenomic_Results/Ctu/PAN_Ctu-PAN.db" -t items
anvi-import-misc-data "/analysis_Anvio7/Pangenomic_Results/Cac/partitions_Cac.txt" \
                       -p "/analysis_Anvio7/Pangenomic_Results/Cac/PAN_Cac-PAN.db" -t items

2.2 Displaying the generated bin collections

We can see the collections and bins created in a pangenomic database using the command anvi-show-collections-and-bins:

#conda activate anvio-7.1
anvi-show-collections-and-bins -p "/analysis_Anvio7/Pangenomic_Results/Cps/PAN_Cps-PAN.db" --debug 

anvi-show-collections-and-bins -p "/analysis_Anvio7/Pangenomic_Results/Cpr/PAN_Cpr-PAN.db" --debug 

anvi-show-collections-and-bins -p "/analysis_Anvio7/Pangenomic_Results/Ctu/PAN_Ctu-PAN.db" --debug 

anvi-show-collections-and-bins -p "/analysis_Anvio7/Pangenomic_Results/Cac/PAN_Cac-PAN.db" --debug

2.3 PPanGGOLiN Pangenome Compartments

2.3.1 C. propinquum pangenome

Pangenome Matrix	# of Gene Clusters/Pangenome	Percentage
Persistent	1945/3108	62.6%
Shell	323/3108	10.4%
Cloud	837/3108	26.9%
None	3/3108	0.1%

2.3.2 C. pseudodiphtheriticum pangenome

Pangenome Matrix	# of Gene Clusters/Pangenome	Percentage
Persistent	1771/3590	49.3%
Shell	392/3590	10.9%
Cloud	1424/3590	39.7%
None	3/3590	0.1%

2.3.3 C. accolens pangenome

Pangenome Matrix	# of Gene Clusters/Pangenome	Percentage
Persistent	2011/3427	58.7%
Shell	301/3427	8.8%
Cloud	1113/3427	32.5%
None	2/3427	0.1%

2.3.4 C. tuberculostearicum pangenome

Pangenome Matrix	# of Gene Clusters/Pangenome	Percentage
Persistent	1991/2907	68.5%
Shell	147/2907	5.1%
Cloud	766/2907	26.4%
None	3/2907	0.1%

References

Gautreau, G., Bazin, A., Gachet, M., Planel, R., Burlot, L., Dubois, M., Perrin, A., Médigue, C., Calteau, A., Cruveiller, S., et al. (2020). PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16, e1007732.