Supplemental Methods: PPanGGOLiN
PPanGGOLiN v1.1.141 (Gautreau et al., 2020) was used for this analysis.
1 Import Prokka annotations and anvi’o 7 clusters into PPanGGOLiN
In order to import the anvi’o clustering into PPanGGOLiN we need:
- A .tsv file listing in the first column the gene family names, and in the second column the gene ID that is used in the annotation files.
- The annotated genomes with gene IDs that match the ones listed in the previous .tsv file.
We read here the files generated using anvi-summarize
that link each gene to gene clusters, genomes and functions:
AnvioSummary_Cps <- read.delim("/analysis_Anvio7/PAN_Cps_gene_clusters_summary.txt.gz")
AnvioSummary_Cpr <- read.delim("/analysis_Anvio7/PAN_Cpr_gene_clusters_summary.txt.gz")
AnvioSummary_Ctu <- read.delim("/analysis_Anvio7/PAN_Ctu_gene_clusters_summary.txt.gz")
AnvioSummary_Cac <- read.delim("/analysis_Anvio7/PAN_Cac_gene_clusters_summary.txt.gz")
Here we subset the info needed to generate the .tsv files. We clean up the “RefSeq_” prefix that was added to the genome names to avoid anvi’o complains about genome names starting with a number. We need to remove the prefix to match the info inside the new .gffs we are about to export:
AnvioClusters_Cps <- AnvioSummary_Cps %>% unite(new_id, genome_name:gene_callers_id, sep = "___", remove = FALSE) %>% mutate(new_id = gsub("RefSeq_", "", new_id))
AnvioClusters_Cps <- select(AnvioClusters_Cps, c(gene_cluster_id, new_id))
AnvioClusters_Cpr <- AnvioSummary_Cpr %>% unite(new_id, genome_name:gene_callers_id, sep = "___", remove = FALSE) %>% mutate(new_id = gsub("RefSeq_", "", new_id))
AnvioClusters_Cpr <- select(AnvioClusters_Cpr, c(gene_cluster_id, new_id))
AnvioClusters_Ctu <- AnvioSummary_Ctu %>% unite(new_id, genome_name:gene_callers_id, sep = "___", remove = FALSE) %>% mutate(new_id = gsub("RefSeq_", "", new_id))
AnvioClusters_Ctu <- select(AnvioClusters_Ctu, c(gene_cluster_id, new_id))
AnvioClusters_Cac <- AnvioSummary_Cac %>% unite(new_id, genome_name:gene_callers_id, sep = "___", remove = FALSE) %>% mutate(new_id = gsub("RefSeq_", "", new_id))
AnvioClusters_Cac <- select(AnvioClusters_Cac, c(gene_cluster_id, new_id))
Now we write those as .tsv files:
write_delim(AnvioClusters_Cps, "/analysis_PPanGGOLiN/Cps_PAN_SUMMARY.tsv", col_names=FALSE)
write_delim(AnvioClusters_Cpr, "/analysis_PPanGGOLiN/Cpr_PAN_SUMMARY.tsv", col_names=FALSE)
write_delim(AnvioClusters_Ctu, "/analysis_PPanGGOLiN/Ctu_PAN_SUMMARY.tsv", col_names=FALSE)
write_delim(AnvioClusters_Cac, "/analysis_PPanGGOLiN/Cac_PAN_SUMMARY.tsv", col_names=FALSE)
IMPORTANT: The gene_callers_id provided by anvi’o don’t match the original ones in the prokka annotation. The prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py
. By default prokka annotates also tRNAs, rRNAs and CRISPR regions. However, gff_parser.py
will only utilize open reading frames reported by Prodigal in the prokka output in order to be compatible with the pangenomic anvi’o pipeline. While parsing new gene_callers_id are generated only for the ORFs that will be imported into anvi’o. Fortunately, anvi-get-sequences-for-gene-calls
can be used to export new .gff files with only the ORFs and these can be used for PPanGGOLiN, instead of the original prokka ones:
#conda activate anvio-7.1
mkdir -p "/analysis_Anvio7/Pangenomic_Results/Exported_gffs"
path_f="/analysis_Anvio7/Contigs_db"
path_o="/analysis_Anvio7/Pangenomic_Results/Exported_gffs"
for file in $path_f/*.db; do
FILENAME=`basename ${file%.*}`
anvi-get-sequences-for-gene-calls -c $file --export-gff3 -o $path_o/$FILENAME.gff
done
We create two tab delimited text files per species. Both of the text files will have a two column structure. The first column is the name to be utilized by PPanGGOLiNn for each genome per line, and the second column contains the paths to the corresponding file. One text file is for the anvi’o exported .gff and the other is for the reformatted .fasta files. The extension by default was .fa for the reformatted anvi’o files, so we had to convert them all to .fasta for PPanGGOLiN to accept them. We need to create the text files to run the annotate
subcommand.
#conda activate PPanGGOLiN
ppanggolin annotate --anno "/analysis_PPanGGOLiN/Cac_AnvioExported_gff.txt" --fasta "/analysis_PPanGGOLiN/Cac_AnvioReformatted_fasta.txt" -o "/analysis_PPanGGOLiN/Cac_anno_output" --basename Cac -f
ppanggolin annotate --anno "/analysis_PPanGGOLiN/Cpr_AnvioExported_gff.txt" --fasta "/analysis_PPanGGOLiN/Cpr_AnvioReformatted_fasta.txt" -o "/analysis_PPanGGOLiN/Cpr_anno_output" --basename Cpr -f
ppanggolin annotate --anno "/analysis_PPanGGOLiN/Cps_AnvioExported_gff.txt" --fasta "/analysis_PPanGGOLiN/Cps_AnvioReformatted_fasta.txt" -o "/analysis_PPanGGOLiN/Cps_anno_output" --basename Cps -f
ppanggolin annotate --anno "/analysis_PPanGGOLiN/Ctu_AnvioExported_gff.txt" --fasta "/analysis_PPanGGOLiN/Ctu_AnvioReformatted_fasta.txt" -o "/analysis_PPanGGOLiN/Ctu_anno_output" --basename Ctu -f
For the cluster
subcommand we use the .tsv files created before from the anvi-summarize
output.
#conda activate PPanGGOLiN
ppanggolin cluster -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" --clusters "/analysis_PPanGGOLiN/Cac_PAN_SUMMARY.tsv" --infer_singletons
ppanggolin cluster -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" --clusters "/analysis_PPanGGOLiN/Cps_PAN_SUMMARY.tsv" --infer_singletons
ppanggolin cluster -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" --clusters "/analysis_PPanGGOLiN/Ctu_PAN_SUMMARY.tsv" --infer_singletons
ppanggolin cluster -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" --clusters "/analysis_PPanGGOLiN/Cpr_PAN_SUMMARY.tsv" --infer_singletons
1.1 Graphing and Partitioning
The graph
subcommand has only a single other option, which is ‘-r’ or ‘–remove_high_copy_number’. If used, it will remove the gene families that are too duplicated in your genomes. This is useful if you want to visualize your pangenome afterward and want to remove the biggest hubs to have a clearer view. It can also be used to limit the influence of very duplicated genes such as transposases or ABC transporters in the partition step.
#conda activate PPanGGOLiN
ppanggolin graph -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5"
ppanggolin graph -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5"
ppanggolin graph -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5"
ppanggolin graph -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5"
We let the partition
subcommand statistical criterion find the optimal number of partitions.
#conda activate PPanGGOLiN
ppanggolin partition -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5"
ppanggolin partition -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5"
ppanggolin partition -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5"
ppanggolin partition -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5"
1.2 Writing outputs
For details on PPanGGOLiN outputs see: https://github.com/labgem/PPanGGOLiN/wiki/Outputs
#conda activate PPanGGOLiN
ppanggolin write -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" -o "/analysis_PPanGGOLiN/Output_Cac" --light_gexf --gexf --csv --Rtab --stats --partitions --projection --families_tsv -f
ppanggolin write -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" -o "/analysis_PPanGGOLiN/Output_Cps" --light_gexf --gexf --csv --Rtab --stats --partitions --projection --families_tsv -f
ppanggolin write -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" -o "/analysis_PPanGGOLiN/Output_Ctu" --light_gexf --gexf --csv --Rtab --stats --partitions --projection --families_tsv -f
ppanggolin write -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" -o "/analysis_PPanGGOLiN/Output_Cpr" --light_gexf --gexf --csv --Rtab --stats --partitions --projection --families_tsv -f
1.3 Rarefaction curves
#conda activate PPanGGOLiN
ppanggolin rarefaction -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" --force -o "/analysis_PPanGGOLiN/Output_Cac_rarefaction"
ppanggolin rarefaction -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" --force -o "/analysis_PPanGGOLiN/Output_Cps_rarefaction"
ppanggolin rarefaction -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" --force -o "/analysis_PPanGGOLiN/Output_Ctu_rarefaction"
ppanggolin rarefaction -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" --force -o "/analysis_PPanGGOLiN/Output_Cpr_rarefaction"
1.4 Finding Regions of Genome Plasticity
For details on RGPs see: https://github.com/labgem/PPanGGOLiN/wiki/Regions-of-Genome-Plasticity
The spots can be labeled by:
#conda activate PPanGGOLiN
ppanggolin rgp -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5"
ppanggolin rgp -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5"
ppanggolin rgp -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5"
ppanggolin rgp -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5"
ppanggolin spot -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" --draw_hotspots -o "/analysis_PPanGGOLiN/Output_Cac_spots/spots_ID" -f
ppanggolin spot -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" --draw_hotspots -o "/analysis_PPanGGOLiN/Output_Cps_spots/spots_ID" -f
ppanggolin spot -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" --draw_hotspots -o "/analysis_PPanGGOLiN/Output_Ctu_spots/spots_ID" -f
ppanggolin spot -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" --draw_hotspots -o "/analysis_PPanGGOLiN/Output_Cpr_spots/spots_ID" -f
ppanggolin write -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" -o "/analysis_PPanGGOLiN/Output_Cac_spots" --regions --spots -f
ppanggolin write -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" -o "/analysis_PPanGGOLiN/Output_Cps_spots" --regions --spots -f
ppanggolin write -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" -o "/analysis_PPanGGOLiN/Output_Ctu_spots" --regions --spots -f
ppanggolin write -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" -o "/analysis_PPanGGOLiN/Output_Cpr_spots" --regions --spots -f
1.5 Summary of used parameters
The info
subcommand indicates, for each steps of the analysis, the PPanGGOLiN parameters that were used and the source of the data if appropriate.
#conda activate PPanGGOLiN
ppanggolin info -p "/analysis_PPanGGOLiN/Cac_anno_output/Cac.h5" --parameters --content --status
ppanggolin info -p "/analysis_PPanGGOLiN/Cps_anno_output/Cps.h5" --parameters --content --status
ppanggolin info -p "/analysis_PPanGGOLiN/Ctu_anno_output/Ctu.h5" --parameters --content --status
ppanggolin info -p "/analysis_PPanGGOLiN/Cpr_anno_output/Cpr.h5" --parameters --content --status
2 Import PPanGGOLiN Partitions into anvi’o
The pangenomeGraph.gexf
file can be opened with Gephi, and the Data Table exported as a gephi_table.csv
. This can be used to easily explore the PPanGGOLiN output with the same gene IDs anvi’o uses. The matrix.csv
should be used with caution, since records PPanGGOLiN internal gene IDs, that are different from the anvi’o ones.
Gephi <- read.delim("analysis_PPanGGOLiN/Output_Cps/Cps_gephi_table.csv", ",", escape_double = FALSE, trim_ws = TRUE)
Gephi <- read.delim("analysis_PPanGGOLiN/Output_Cpr/Cpr_gephi_table.csv", ",", escape_double = FALSE, trim_ws = TRUE)
Gephi <- read.delim("analysis_PPanGGOLiN/Output_Ctu/Ctu_gephi_table.csv", ",", escape_double = FALSE, trim_ws = TRUE)
Gephi <- read.delim("analysis_PPanGGOLiN/Output_Cac/Cac_gephi_table.csv", ",", escape_double = FALSE, trim_ws = TRUE)
2.1 Persistent vs. Accessory based on PPanGGOLiN
PPanGGOLiN generates a “partitions” folder as part of its output. We read the persistent, cloud and shell files from this output and add a column labeling them. We combine the three in one partitions.txt
file that can be imported into anvi’o:
Cps_persistent <- read.delim("/analysis_PPanGGOLiN/Output_Cps/partitions/persistent.txt", col.names = "GC")
Cps_persistent$PPanGGOLiN <- "persistent"
Cps_cloud <- read.delim("/analysis_PPanGGOLiN/Output_Cps/partitions/cloud.txt", col.names = "GC")
Cps_cloud$PPanGGOLiN <- "cloud"
Cps_shell <- read.delim("/analysis_PPanGGOLiN/Output_Cps/partitions/shell.txt", col.names = "GC")
Cps_shell$PPanGGOLiN <- "shell"
Cps_partitions <- rbind(Cps_persistent, Cps_cloud, Cps_shell)
Cps_partitions <- Cps_partitions[grep("GC_00", Cps_partitions$GC),]
write_delim(Cps_partitions, "/analysis_PPanGGOLiN/Output_Cps/partitions/partitions.txt", delim = "\t")
Cpr_persistent <- read.delim("/analysis_PPanGGOLiN/Output_Cpr/partitions/persistent.txt", col.names = "GC")
Cpr_persistent$PPanGGOLiN <- "persistent"
Cpr_cloud <- read.delim("/analysis_PPanGGOLiN/Output_Cpr/partitions/cloud.txt", col.names = "GC")
Cpr_cloud$PPanGGOLiN <- "cloud"
Cpr_shell <- read.delim("/analysis_PPanGGOLiN/Output_Cpr/partitions/shell.txt", col.names = "GC")
Cpr_shell$PPanGGOLiN <- "shell"
Cpr_partitions <- rbind(Cpr_persistent, Cpr_cloud, Cpr_shell)
Cpr_partitions <- Cpr_partitions[grep("GC_", Cpr_partitions$GC),]
write_delim(Cpr_partitions, "/analysis_PPanGGOLiN/Output_Cpr/partitions/partitions.txt", delim = "\t")
Ctu_persistent <- read.delim("/analysis_PPanGGOLiN/Output_Ctu/partitions/persistent.txt", col.names = "GC")
Ctu_persistent$PPanGGOLiN <- "persistent"
Ctu_cloud <- read.delim("/analysis_PPanGGOLiN/Output_Ctu/partitions/cloud.txt", col.names = "GC")
Ctu_cloud$PPanGGOLiN <- "cloud"
Ctu_shell <- read.delim("/analysis_PPanGGOLiN/Output_Ctu/partitions/shell.txt", col.names = "GC")
Ctu_shell$PPanGGOLiN <- "shell"
Ctu_partitions <- rbind(Ctu_persistent, Ctu_cloud, Ctu_shell)
Ctu_partitions <- Ctu_partitions[grep("GC_", Ctu_partitions$GC),]
write_delim(Ctu_partitions, "/analysis_PPanGGOLiN/Output_Ctu/partitions/partitions.txt", delim = "\t")
Cac_persistent <- read.delim("/analysis_PPanGGOLiN/Output_Cac/partitions/persistent.txt", col.names = "GC")
Cac_persistent$PPanGGOLiN <- "persistent"
Cac_cloud <- read.delim("/analysis_PPanGGOLiN/Output_Cac/partitions/cloud.txt", col.names = "GC")
Cac_cloud$PPanGGOLiN <- "cloud"
Cac_shell <- read.delim("/analysis_PPanGGOLiN/Output_Cac/partitions/shell.txt", col.names = "GC")
Cac_shell$PPanGGOLiN <- "shell"
Cac_partitions <- rbind(Cac_persistent, Cac_cloud, Cac_shell)
Cac_partitions <- Cac_partitions[grep("GC_", Cac_partitions$GC),]
write_delim(Cac_partitions, "/analysis_PPanGGOLiN/Output_Cac/partitions/partitions.txt", delim = "\t")
This will import the PPanGGOLiN partitions (from partitions.txt
) as a collection of bins named PPanGGOLiNpartitions in each profile-db:
#conda activate anvio-7.1
anvi-import-collection "/analysis_Anvio7/Pangenomic_Results/Cps/partitions_Cps.txt" \
-p "/analysis_Anvio7/Pangenomic_Results/Cps/PAN_Cps-PAN.db" \
-C "PPanGGOLiNpartitions" --bins-info "/analysis_Anvio7/main_outputs/PPanGGOLiN_bin_colors.txt"
anvi-import-collection "/analysis_Anvio7/Pangenomic_Results/Cpr/partitions_Cpr.txt" \
-p "/analysis_Anvio7/Pangenomic_Results/Cpr/PAN_Cpr-PAN.db" \
-C "PPanGGOLiNpartitions" --bins-info "/analysis_Anvio7/main_outputs/PPanGGOLiN_bin_colors.txt"
anvi-import-collection "/analysis_Anvio7/Pangenomic_Results/Ctu/partitions_Ctu.txt" \
-p "/analysis_Anvio7/Pangenomic_Results/Ctu/PAN_Ctu-PAN.db" \
-C "PPanGGOLiNpartitions" --bins-info "/analysis_Anvio7/main_outputs/PPanGGOLiN_bin_colors.txt"
anvi-import-collection "/analysis_Anvio7/Pangenomic_Results/Cac/partitions_Cac.txt" \
-p "/analysis_Anvio7/Pangenomic_Results/Cac/PAN_Cac-PAN.db" \
-C "PPanGGOLiNpartitions" --bins-info "/analysis_Anvio7/main_outputs/PPanGGOLiN_bin_colors.txt"
We can also import this data (from partitions.txt
) as a PPanGGOLiN layer into in each profile-db:
#conda activate anvio-7.1
anvi-import-misc-data "/analysis_Anvio7/Pangenomic_Results/Cps/partitions_Cps.txt" \
-p "/analysis_Anvio7/Pangenomic_Results/Cps/PAN_Cps-PAN.db" -t items
anvi-import-misc-data "/analysis_Anvio7/Pangenomic_Results/Cpr/partitions_Cpr.txt" \
-p "/analysis_Anvio7/Pangenomic_Results/Cpr/PAN_Cpr-PAN.db" -t items
anvi-import-misc-data "/analysis_Anvio7/Pangenomic_Results/Ctu/partitions_Ctu.txt" \
-p "/analysis_Anvio7/Pangenomic_Results/Ctu/PAN_Ctu-PAN.db" -t items
anvi-import-misc-data "/analysis_Anvio7/Pangenomic_Results/Cac/partitions_Cac.txt" \
-p "/analysis_Anvio7/Pangenomic_Results/Cac/PAN_Cac-PAN.db" -t items
2.2 Displaying the generated bin collections
We can see the collections and bins created in a pangenomic database using the command anvi-show-collections-and-bins
:
#conda activate anvio-7.1
anvi-show-collections-and-bins -p "/analysis_Anvio7/Pangenomic_Results/Cps/PAN_Cps-PAN.db" --debug
anvi-show-collections-and-bins -p "/analysis_Anvio7/Pangenomic_Results/Cpr/PAN_Cpr-PAN.db" --debug
anvi-show-collections-and-bins -p "/analysis_Anvio7/Pangenomic_Results/Ctu/PAN_Ctu-PAN.db" --debug
anvi-show-collections-and-bins -p "/analysis_Anvio7/Pangenomic_Results/Cac/PAN_Cac-PAN.db" --debug
2.3 PPanGGOLiN Pangenome Compartments
2.3.1 C. propinquum pangenome
Pangenome Matrix | # of Gene Clusters/Pangenome | Percentage |
---|---|---|
Persistent | 1945/3108 | 62.6% |
Shell | 323/3108 | 10.4% |
Cloud | 837/3108 | 26.9% |
None | 3/3108 | 0.1% |
2.3.2 C. pseudodiphtheriticum pangenome
Pangenome Matrix | # of Gene Clusters/Pangenome | Percentage |
---|---|---|
Persistent | 1771/3590 | 49.3% |
Shell | 392/3590 | 10.9% |
Cloud | 1424/3590 | 39.7% |
None | 3/3590 | 0.1% |
2.3.3 C. accolens pangenome
Pangenome Matrix | # of Gene Clusters/Pangenome | Percentage |
---|---|---|
Persistent | 2011/3427 | 58.7% |
Shell | 301/3427 | 8.8% |
Cloud | 1113/3427 | 32.5% |
None | 2/3427 | 0.1% |
2.3.4 C. tuberculostearicum pangenome
Pangenome Matrix | # of Gene Clusters/Pangenome | Percentage |
---|---|---|
Persistent | 1991/2907 | 68.5% |
Shell | 147/2907 | 5.1% |
Cloud | 766/2907 | 26.4% |
None | 3/2907 | 0.1% |