Supplemental Methods: GET_HOMOLOGUES, GET_PHYLOMARKERS and IQ-TREE2

This notebook contains analyses using the genomic software GET_HOMOLOGUES version 24082022 (Contreras-Moreira and Vinuesa, 2013). For more in-depth information about GET_HOMOLOGUES, see the official online manual: http://eead-csic-compbio.github.io/get_homologues/manual/

We will use the .gbk (GenBank) file for each Corynebacterium strain genome from the Prokka annotation output. We copy the .gbk files into a new folder for each of the species, plus a combined directory with all 107 genomes.

cd "/analysis_GH/Prokka_out" 

# Create five new directories
mkdir -p "Prokka_All_gbk" # gbk files of all 107 Corynebacterium genomes
mkdir -p "Prokka_Cpr_gbk" # gbk files of 19 C. propinquum
mkdir -p "Prokka_Cps_gbk" # gbk files of 43 C. pseudodiphtheriticum
mkdir -p "Prokka_Cac_gbk" # gbk files of 34 C. accolens
mkdir -p "Prokka_Ctu_gbk" # gbk files of 8 C. tuberculostearicum

# Copy .gbk files to those directories
cp *.gbk "Prokka_All_gbk/"
cp Cpr_*.gbk "Prokka_Cpr_gbk/"
cp Cps_*.gbk "Prokka_Cps_gbk/"
cp Cac_*.gbk "Prokka_Cac_gbk/"
cp Ctu_*.gbk "Prokka_Ctu_gbk/"
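
Because every later -t value depends on the number of genomes in each directory, it can help to confirm the counts after copying. The following is a minimal sketch, not part of the original workflow; the helper names count_gbk and check_gbk_count are hypothetical.

```shell
# Hypothetical helper: count the .gbk files directly inside a directory.
count_gbk() {
    find "$1" -maxdepth 1 -type f -name '*.gbk' | wc -l | tr -d ' '
}

# Hypothetical helper: compare the observed count with the expected one.
# $1 = directory, $2 = expected number of .gbk files
check_gbk_count() {
    dir=$1
    expected=$2
    found=$(count_gbk "$dir")
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $dir contains $found .gbk files"
    else
        echo "MISMATCH: $dir contains $found .gbk files, expected $expected" >&2
        return 1
    fi
}
```

For example, check_gbk_count "Prokka_Cpr_gbk" 19 should report OK if the copy step above worked as intended.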

1 ANI (Average Nucleotide Identity)

We generate two ANI matrices at the nucleotide level for each of the four species, using clusters from OMCL (OrthoMCL) (Li et al., 2003). The first calculates ANI percentages from the core CDS (coding sequences); the second (-t 0) calculates ANI values from all shared CDS regions.

1.1 Generate ANI `.tab` Files

C. propinquum core and all shared CDS regions ANI .tab file:

cd get_homologues

# Generate core CDS ANI .tab file
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk" -A -t 19 -M -n 8 -C 90 -a 'CDS'

# Generate all shared CDS regions ANI .tab file
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk" -A -t 0 -M -n 8 -C 90 -a 'CDS'

C. pseudodiphtheriticum core and all shared CDS regions ANI .tab file:

cd get_homologues

# Generate core CDS ANI .tab file
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk" -A -t 43 -M -n 8 -C 90 -a 'CDS'

# Generate all shared CDS regions ANI .tab file
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk" -A -t 0 -M -n 8 -C 90 -a 'CDS'

C. accolens core and all shared CDS regions ANI .tab file:

cd get_homologues

# Generate core CDS ANI .tab file
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk" -A -t 34 -M -n 8 -C 90 -a 'CDS'

# Generate all shared CDS regions ANI .tab file
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk" -A -t 0 -M -n 8 -C 90 -a 'CDS'

C. tuberculostearicum core and all shared CDS regions ANI .tab file:

cd get_homologues

# Generate core CDS ANI .tab file
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk" -A -t 8 -M -n 8 -C 90 -a 'CDS'

# Generate all shared CDS regions ANI .tab file
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk" -A -t 0 -M -n 8 -C 90 -a 'CDS'
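
The eight commands above differ only in the species prefix and in the -t value. As a sketch of how they could be generated from a single list, the loop below prints (rather than runs) each command so it can be reviewed first. The function name ani_commands is hypothetical; the flags and paths are those used above.

```shell
# Print the two ANI commands for one species.
# $1 = species prefix, $2 = number of genomes (used as -t for the core run)
ani_commands() {
    gbk_dir="/analysis_GH/Prokka_out/Prokka_${1}_gbk"
    # First the core CDS run (-t = number of genomes), then all shared CDS (-t 0)
    for t in "$2" 0; do
        echo "./get_homologues.pl -d \"$gbk_dir\" -A -t $t -M -n 8 -C 90 -a 'CDS'"
    done
}

# Species prefix and genome count pairs taken from the directory setup above
for pair in Cpr:19 Cps:43 Cac:34 Ctu:8; do
    ani_commands "${pair%%:*}" "${pair##*:}"
done
```

Printing the commands keeps the sketch safe to run anywhere; piping the output to sh from inside the get_homologues directory would execute them.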

1.2 Use the ANI `.tab` Files to Create Editable `.svg` Files

C. propinquum core and all shared CDS regions ANI .tab file plot to .svg:

cd get_homologues

# Generate editable .svg file from core .tab file
./plot_matrix_heatmap.sh -i "/Users/username/get_homologues/Cpr/Cpr*_f0_19taxa_algOMCL_e0_C90_Avg_identity.tab" \
                         -d 1 \
                         -t "Cpr Core CDS ANI" \
                         -H 9 \
                         -W 13

# Generate editable .svg file from all shared CDS regions .tab file
./plot_matrix_heatmap.sh -i "/Users/username/get_homologues/Cpr/Cpr*_f0_0taxa_algOMCL_e0_C90_Avg_identity.tab" \
                         -d 1 \
                         -t "Cpr All Shared CDS Regions ANI" \
                         -H 9 \
                         -W 13

C. pseudodiphtheriticum core and all shared CDS regions ANI .tab file plot to .svg:

cd get_homologues

# Generate editable .svg file from core .tab file
./plot_matrix_heatmap.sh -i "/Users/username/get_homologues/Cps/Cps*_f0_43taxa_algOMCL_e0_C90_Avg_identity.tab" \
                         -d 1 \
                         -t "Cps Core CDS ANI" \
                         -H 16 \
                         -W 23 

# Generate editable .svg file from all shared CDS regions .tab file
./plot_matrix_heatmap.sh -i "/Users/username/get_homologues/Cps/Cps*_f0_0taxa_algOMCL_e0_C90_Avg_identity.tab" \
                         -d 1 \
                         -t "Cps All Shared CDS Regions ANI" \
                         -H 16 \
                         -W 23

C. accolens core and all shared CDS regions ANI .tab file plot to .svg:

cd get_homologues

# Generate editable .svg file from core .tab file
./plot_matrix_heatmap.sh -i "/Users/username/get_homologues/Cac/Cac*_f0_34taxa_algOMCL_e0_C90_Avg_identity.tab" \
                         -d 1 \
                         -t "Cac Core CDS ANI" \
                         -H 15 \
                         -W 19

# Generate editable .svg file from all shared CDS regions .tab file
./plot_matrix_heatmap.sh -i "/Users/username/get_homologues/Cac/Cac*_f0_0taxa_algOMCL_e0_C90_Avg_identity.tab" \
                         -d 1 \
                         -t "Cac All Shared CDS Regions ANI" \
                         -H 15 \
                         -W 19

C. tuberculostearicum core and all shared CDS regions ANI .tab file plot to .svg:

cd get_homologues

# Generate editable .svg file from core .tab file
./plot_matrix_heatmap.sh -i "/Users/username/get_homologues/Ctu/Ctu*_f0_8taxa_algOMCL_e0_C90_Avg_identity.tab" \
                         -d 1 \
                         -t "Ctu Core CDS ANI" \
                         -H 7 \
                         -W 8

# Generate editable .svg file from all shared CDS regions .tab file
./plot_matrix_heatmap.sh -i "/Users/username/get_homologues/Ctu/Ctu*_f0_0taxa_algOMCL_e0_C90_Avg_identity.tab" \
                         -d 1 \
                         -t "Ctu All Shared CDS Regions ANI" \
                         -H 7 \
                         -W 8

After the .svg files were created they were further edited in Adobe Illustrator.

2 Conservative Core and Phylogeny of All 107 Corynebacterium Genomes

The core will be determined by three separate clustering algorithms: BDBH (bidirectional best hits) (Contreras-Moreira and Vinuesa, 2013), COGS (Clusters of Orthologous Groups) (Kristensen et al., 2010), and OMCL (OrthoMCL) (Li et al., 2003). Once the core is determined for each algorithm, we generate a Venn diagram of the gene clusters shared between them. The conservative core is the number in the middle of the Venn diagram, where all three algorithms intersect.
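
To make the idea of a three-way intersection concrete, here is a toy illustration. It is not how compare_clusters.pl works internally; it only assumes three plain-text files that each list one cluster identifier per line, and the function name core_of_three is hypothetical.

```shell
# Toy illustration: print the identifiers that occur in all three input files.
# Assumes one identifier (without spaces) per line in each file.
core_of_three() {
    for f in "$1" "$2" "$3"; do
        sort -u "$f"          # de-duplicate within each file
    done | sort | uniq -c | awk '$1 == 3 { print $2 }'
}
```

An identifier is kept only if it was counted three times, that is, once per algorithm, which corresponds to the central region of the Venn diagram.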

2.1 Running BDBH, COGS, & OMCL with 107 Corynebacterium Genomes

107 Corynebacterium genomes clustered with BDBH, COGS, and OMCL:

cd get_homologues

# Cluster with BDBH
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_All_gbk" -n 8 -t 107 -C 90

# cluster with COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_All_gbk" -n 8 -t 0 -C 90 -G

# cluster with OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_All_gbk" -n 8 -t 0 -C 90 -M

2.2 Creating a Conservative Core Venn Diagram of 107 Corynebacterium genomes

In our experience, you will run into fewer issues if you move the three cluster folders and cluster list files into a new directory (here we manually moved them to All_gbk_cluster_files). Then pass the paths to those cluster folders after the -d flag, separated by commas.

Create conservative core Venn diagram intersection to output .faa and .fna files:

cd get_homologues

# Creates a conservative core with the output of faa files
./compare_clusters.pl -o "/analysis_GH/All_gbk_venn" \
                      -d "/analysis_GH/All_gbk_cluster_files/C*_f0_107taxa_algBDBH_e0_C90_","/analysis_GH/All_gbk_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_","/analysis_GH/All_gbk_cluster_files/C*_f0_0taxa_algCOG_e0_C90_" \
                      -t 107
                      
# Creates a conservative core with the output of fna files from the additional `-n` flag
./compare_clusters.pl -o "/analysis_GH/All_gbk_venn" \
                      -d "/analysis_GH/All_gbk_cluster_files/C*_f0_107taxa_algBDBH_e0_C90_","/analysis_GH/All_gbk_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_","/analysis_GH/All_gbk_cluster_files/C*_f0_0taxa_algCOG_e0_C90_" \
                      -t 107 \
                      -n

2.3 Generate a Phylogenomic Tree for all Corynebacterium genomes from the Conservative Core

At this point we have created .faa and .fna files of the core gene clusters from 107 Corynebacterium genomes. To generate a phylogenomic tree, we first isolate the .faa and .fna files in a new folder.

Create a new folder and move all files with the extension .faa and .fna into a new destination folder:

# Create a new directory to hold the faa and fna files
mkdir -p "/analysis_GH/All_gbk_faa_fna"

# Go to the output folder from creating the Venn diagram intersection
cd "/analysis_GH/All_gbk_venn"

# Move all .faa files to the new directory
mv *.faa "/analysis_GH/All_gbk_faa_fna"

# Move all .fna files to the new directory
mv *.fna "/analysis_GH/All_gbk_faa_fna"
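
Codon alignments need both the protein (.faa) and the nucleotide (.fna) version of each cluster, so it may be worth checking that every file has its partner before continuing. The sketch below is an optional check under that assumption; the function name unpaired_clusters is hypothetical.

```shell
# Hypothetical helper: report clusters that have a .faa but no .fna, or vice versa.
# $1 = directory holding the .faa and .fna files
unpaired_clusters() {
    dir=$1
    for f in "$dir"/*.faa; do
        [ -e "$f" ] || continue     # skip the unexpanded glob if no .faa exists
        [ -e "${f%.faa}.fna" ] || echo "missing fna: $(basename "${f%.faa}")"
    done
    for f in "$dir"/*.fna; do
        [ -e "$f" ] || continue     # skip the unexpanded glob if no .fna exists
        [ -e "${f%.fna}.faa" ] || echo "missing faa: $(basename "${f%.fna}")"
    done
}
```

Running unpaired_clusters "/analysis_GH/All_gbk_faa_fna" prints nothing when every cluster is present in both formats.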

Now we use GET_PHYLOMARKERS version 2.2.9.1 (Vinuesa et al., 2018) to generate a concatenated alignment file for each single copy core gene cluster.

Concatenate and align the .faa and .fna files to generate codon alignment files:

# Go to directory holding all the .fna and .faa files only
cd "/analysis_GH/All_gbk_faa_fna" 

# Run the get_phylomarkers master script
/Users/username/get_phylomarkers/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7

Run IQ-TREE v2.1.3 (Minh et al., 2020) with the codon alignment folder containing .fasta files:

# Go to IQTREE2
cd IQ-TREE2

# Run iqtree2
bin/iqtree2 -p "/analysis_GH/All_gbk_faa_fna/All_gbk_codon_alignments" --prefix All_107_C_sp -alrt 1000 -B 1000 -T 8

The -p flag applies an edge-linked proportional partition model (Chernomor et al., 2016) across the individual gene clusters. Fast model selection for each cluster was performed by ModelFinder (Kalyaanamoorthy et al., 2017). The flags -alrt 1000 and -B 1000 request 1000 replicates each of the SH-aLRT test and the ultrafast bootstrap (UFBoot). -T 8 runs the program on 8 threads.

This will create a .treefile file with the best-fit maximum likelihood (ML) tree. This .treefile can be viewed and edited with the iTOL annotation editor (Letunic and Bork, 2021) in the Google Chrome browser. In iTOL the tree can be scaled and assigned strain names from the tree_labels.list file created by GET_PHYLOMARKERS. After scaling and assigning strain names, the .svg version can be exported from iTOL and further edited in Adobe Illustrator.

3 Corynebacterium Species Specific Phylogenomic Trees with Outgroup

The goal of this analysis is to be able to create phylogenetic trees for each of the four Corynebacterium species plus an outgroup using all the gene clusters common to each species, including those not shared by each corresponding outgroup genome.

3.1 Setting Up Two Data Sets for Each of the Four Species

We have already created four directories containing .gbk (GenBank) files for each of the four species.

# gbk files of 19 Cpr
"/analysis_GH/Prokka_out/Prokka_Cpr_gbk"

# gbk files of 43 Cps
"/analysis_GH/Prokka_out/Prokka_Cps_gbk" 

# gbk files of 34 Cac
"/analysis_GH/Prokka_out/Prokka_Cac_gbk" 

# gbk files of 8 Ctu
"/analysis_GH/Prokka_out/Prokka_Ctu_gbk"

Secondly, we created a copy of the same four folders with the addition of an outgroup we chose for each species (see manuscript methods section).

# gbk files of 19 Cpr + 1 = 20
"/analysis_GH/Prokka_out/Prokka_Cpr_gbk_outgroup" 

# gbk files of 43 Cps + 1 = 44 
"/analysis_GH/Prokka_out/Prokka_Cps_gbk_outgroup" 

# gbk files of 34 Cac + 1 = 35
"/analysis_GH/Prokka_out/Prokka_Cac_gbk_outgroup" 

# gbk files of 8 Ctu + 1 = 9
"/analysis_GH/Prokka_out/Prokka_Ctu_gbk_outgroup"

There are eight directories in total. The first four each contain only the n genomes of one species. The latter four contain n+1 genomes because of the added outgroup.
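
One way to build the n+1 directories is sketched below. It assumes the outgroup genome is available as a single .gbk file; the function name make_outgroup_dir and any outgroup file path passed to it are hypothetical, since the outgroup files are not named in this notebook.

```shell
# Copy a species directory and add one outgroup .gbk file to the copy.
# $1 = species gbk directory, $2 = outgroup .gbk file (hypothetical path)
make_outgroup_dir() {
    src=${1%/}                    # drop a trailing slash, if any
    dest="${src}_outgroup"        # e.g. Prokka_Cpr_gbk -> Prokka_Cpr_gbk_outgroup
    mkdir -p "$dest"
    cp "$src"/*.gbk "$dest"/
    cp "$2" "$dest"/
    echo "$dest"                  # print the new directory name
}
```

The resulting directory name follows the _outgroup suffix used in the listing above.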

3.2 Running BDBH, COGS, and OMCL on Each Directory

Running BDBH, COGS, and OMCL on only C. propinquum and with the addition of an outgroup:

cd get_homologues

# Run BDBH, COGS, and OMCL for Cpr
## Runs BDBH
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk" -n 8 -t 19 -C 90 

## Runs COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk" -n 8 -t 0 -C 90 -G 

## Runs OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk" -n 8 -t 0 -C 90 -M 

---
# Run BDBH, COGS, and OMCL for (Cpr + outgroup)
## Runs BDBH
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk_outgroup" -n 8 -t 20 -C 90 

## Runs COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk_outgroup" -n 8 -t 0 -C 90 -G 

## Runs OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk_outgroup" -n 8 -t 0 -C 90 -M

Running BDBH, COGS, and OMCL on only C. pseudodiphtheriticum and with the addition of an outgroup:

cd get_homologues

# Run BDBH, COGS, and OMCL for Cps
## Runs BDBH
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk" -n 8 -t 43 -C 90 

## Runs COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk" -n 8 -t 0 -C 90 -G 

## Runs OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk" -n 8 -t 0 -C 90 -M 

---
# Run BDBH, COGS, and OMCL for (Cps + outgroup)
## Runs BDBH
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk_outgroup" -n 8 -t 44 -C 90 

## Runs COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk_outgroup" -n 8 -t 0 -C 90 -G

## Runs OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk_outgroup" -n 8 -t 0 -C 90 -M

Running BDBH, COGS, and OMCL on only C. accolens and with the addition of an outgroup:

cd get_homologues

# Run BDBH, COGS, and OMCL for Cac
## Runs BDBH
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk" -n 8 -t 34 -C 90 

## Runs COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk" -n 8 -t 0 -C 90 -G 

## Runs OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk" -n 8 -t 0 -C 90 -M 

---
# Run BDBH, COGS, and OMCL for (Cac + outgroup)
## Runs BDBH
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk_outgroup" -n 8 -t 35 -C 90 

## Runs COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk_outgroup" -n 8 -t 0 -C 90 -G 

## Runs OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk_outgroup" -n 8 -t 0 -C 90 -M

Running BDBH, COGS, and OMCL on only C. tuberculostearicum and with the addition of an outgroup:

cd get_homologues

# Run BDBH, COGS, and OMCL for Ctu
## Runs BDBH
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk" -n 8 -t 8 -C 90 

## Runs COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk" -n 8 -t 0 -C 90 -G 

## Runs OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk" -n 8 -t 0 -C 90 -M 

---
# Run BDBH, COGS, and OMCL for (Ctu + outgroup)
## Runs BDBH
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk_outgroup" -n 8 -t 9 -C 90 

## Runs COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk_outgroup" -n 8 -t 0 -C 90 -G 

## Runs OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk_outgroup" -n 8 -t 0 -C 90 -M
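
The 24 clustering commands in this section follow one pattern: BDBH with -t set to the number of genomes, then COGS (-G) and OMCL (-M) with -t 0. As a sketch, the loop below prints them for all eight directories so the -t values and flags can be checked at a glance. The function name cluster_commands is hypothetical; nothing is executed.

```shell
# Print the BDBH, COGS and OMCL commands for one input directory.
# $1 = directory of .gbk files, $2 = number of genomes in it
cluster_commands() {
    echo "./get_homologues.pl -d \"$1\" -n 8 -t $2 -C 90"       # BDBH (default algorithm)
    echo "./get_homologues.pl -d \"$1\" -n 8 -t 0 -C 90 -G"     # COGS
    echo "./get_homologues.pl -d \"$1\" -n 8 -t 0 -C 90 -M"     # OMCL
}

base="/analysis_GH/Prokka_out"
for pair in Cpr:19 Cps:43 Cac:34 Ctu:8; do
    sp=${pair%%:*}
    n=${pair##*:}
    cluster_commands "$base/Prokka_${sp}_gbk" "$n"
    cluster_commands "$base/Prokka_${sp}_gbk_outgroup" "$((n + 1))"   # species + outgroup
done
```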

3.3 Calculating the Conservative Core GCs of Only the Species of Interest, and with the Additional Outgroup per Species

In our experience, you will run into fewer issues here if you move the three cluster folders and cluster list files into a new directory. Then pass the paths to those cluster folders after the -d flag, separated by commas.

Create two conservative core Venn diagrams and output .faa and .fna files for C. propinquum and (C. propinquum + outgroup):

cd get_homologues

# Create conservative core Venn diagram for (Cpr)
## Creates a conservative core with the output of faa files
./compare_clusters.pl -o "/analysis_GH/Cpr_gbk_venn" \
                      -d "/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_19taxa_algBDBH_e0_C90_","/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 19

## Creates a conservative core with the output of fna files from the additional `-n` flag
./compare_clusters.pl -o "/analysis_GH/Cpr_gbk_venn" \
                      -d "/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_19taxa_algBDBH_e0_C90_","/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 19 \
                      -n
---   
# Create conservative core Venn diagram for (Cpr + outgroup)                
## Creates a conservative core with the output of faa files
./compare_clusters.pl -o "/analysis_GH/Cpr_gbk_outgroup_venn" \
                      -d "/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_20taxa_algBDBH_e0_C90_","/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 20

## Creates a conservative core with the output of fna files from the additional `-n` flag
./compare_clusters.pl -o "/analysis_GH/Cpr_gbk_outgroup_venn" \
                      -d "/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_20taxa_algBDBH_e0_C90_","/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cpr_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 20 \
                      -n

Create two conservative core Venn diagrams and output .faa and .fna files for C. pseudodiphtheriticum and (C. pseudodiphtheriticum + outgroup):

cd get_homologues

# Create conservative core Venn diagram for (Cps)
## Creates a conservative core with the output of faa files
./compare_clusters.pl -o "/analysis_GH/Cps_gbk_venn" \
                      -d "/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_43taxa_algBDBH_e0_C90_","/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 43

## Creates a conservative core with the output of fna files from the additional `-n` flag
./compare_clusters.pl -o "/analysis_GH/Cps_gbk_venn" \
                      -d "/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_43taxa_algBDBH_e0_C90_","/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 43 \
                      -n
---                      
# Create conservative core Venn diagram for (Cps + outgroup)                         
## Creates a conservative core with the output of faa files
./compare_clusters.pl -o "/analysis_GH/Cps_gbk_outgroup_venn" \
                      -d "/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_44taxa_algBDBH_e0_C90_","/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 44

## Creates a conservative core with the output of fna files from the additional `-n` flag
./compare_clusters.pl -o "/analysis_GH/Cps_gbk_outgroup_venn" \
                      -d "/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_44taxa_algBDBH_e0_C90_","/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cps_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 44 \
                      -n

Create two conservative core Venn diagrams and output .faa and .fna files for C. accolens and (C. accolens + outgroup):

cd get_homologues

# Create conservative core Venn diagram for (Cac)
## Creates a conservative core with the output of faa files
./compare_clusters.pl -o "/analysis_GH/Cac_gbk_venn" \
                      -d "/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_34taxa_algBDBH_e0_C90_","/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 34

## Creates a conservative core with the output of fna files from the additional `-n` flag
./compare_clusters.pl -o "/analysis_GH/Cac_gbk_venn" \
                      -d "/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_34taxa_algBDBH_e0_C90_","/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 34 \
                      -n
---                      
# Create conservative core Venn diagram for (Cac + outgroup)                        
## Creates a conservative core with the output of faa files
./compare_clusters.pl -o "/analysis_GH/Cac_gbk_outgroup_venn" \
                      -d "/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_35taxa_algBDBH_e0_C90_","/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 35

## Creates a conservative core with the output of fna files from the additional `-n` flag
./compare_clusters.pl -o "/analysis_GH/Cac_gbk_outgroup_venn" \
                      -d "/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_35taxa_algBDBH_e0_C90_","/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Cac_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 35 \
                      -n

Create two conservative core Venn diagrams and output .faa and .fna files for C. tuberculostearicum and (C. tuberculostearicum + outgroup):

cd get_homologues

# Create conservative core Venn diagram for (Ctu)
## Creates a conservative core with the output of faa files
./compare_clusters.pl -o "/analysis_GH/Ctu_gbk_venn" \
                      -d "/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_8taxa_algBDBH_e0_C90_","/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 8

## Creates a conservative core with the output of fna files from the additional `-n` flag
./compare_clusters.pl -o "/analysis_GH/Ctu_gbk_venn" \
                      -d "/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_8taxa_algBDBH_e0_C90_","/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 8 \
                      -n
---     
# Create conservative core Venn diagram for (Ctu + outgroup)      
## Creates a conservative core with the output of faa files
./compare_clusters.pl -o "/analysis_GH/Ctu_gbk_outgroup_venn" \
                      -d "/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_9taxa_algBDBH_e0_C90_","/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 9

## Creates a conservative core with the output of fna files from the additional `-n` flag
./compare_clusters.pl -o "/analysis_GH/Ctu_gbk_outgroup_venn" \
                      -d "/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_9taxa_algBDBH_e0_C90_","/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_0taxa_algCOG_e0_C90_","/analysis_GH/Ctu_gbk_outgroup_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_" \
                      -t 9 \
                      -n

3.4 Isolate `.fna` and `.faa` Files Together into a New Directory

Isolate the .faa and .fna files that make up the Conservative Core GCs for C. propinquum and (C. propinquum + outgroup):

# Isolate the faa and fna files for (Cpr)
## Create directory to hold the (Cpr) faa and fna files
mkdir -p "/analysis_GH/Cpr_faa_fna"

## Go to the directory holding the faa and fna files
cd "/analysis_GH/Cpr_gbk_venn"

## Moves all faa files into the new folder holding only faa and fna files
mv *.faa "/analysis_GH/Cpr_faa_fna"

## Moves all fna files into the new folder holding only faa and fna files
mv *.fna "/analysis_GH/Cpr_faa_fna"

---
# Isolate the faa and fna files for (Cpr + outgroup)
## Create directory to hold the (Cpr + outgroup) faa and fna files
mkdir -p "/analysis_GH/Cpr_outgroup_faa_fna"

## Go to the directory holding the faa and fna files
cd "/analysis_GH/Cpr_gbk_outgroup_venn"

## Moves all faa files into the new folder holding only faa and fna files
mv *.faa "/analysis_GH/Cpr_outgroup_faa_fna"

## Moves all fna files into the new folder holding only faa and fna files
mv *.fna "/analysis_GH/Cpr_outgroup_faa_fna"

Isolate the .faa and .fna files that make up the Conservative Core GCs for C. pseudodiphtheriticum and (C. pseudodiphtheriticum + outgroup):

# Isolate the faa and fna files for (Cps)
## Create directory to hold the (Cps) faa and fna files
mkdir -p "/analysis_GH/Cps_faa_fna"

## Go to the directory holding the faa and fna files
cd "/analysis_GH/Cps_gbk_venn"

## Moves all faa files into the new folder holding only faa and fna files
mv *.faa "/analysis_GH/Cps_faa_fna"

## Moves all fna files into the new folder holding only faa and fna files
mv *.fna "/analysis_GH/Cps_faa_fna"

---
# Isolate the faa and fna files for (Cps + outgroup)
## Create directory to hold the (Cps + outgroup) faa and fna files
mkdir -p "/analysis_GH/Cps_outgroup_faa_fna"

## Go to the directory holding the faa and fna files
cd "/analysis_GH/Cps_gbk_outgroup_venn"

## Moves all faa files into the new folder holding only faa and fna files
mv *.faa "/analysis_GH/Cps_outgroup_faa_fna"

## Moves all fna files into the new folder holding only faa and fna files
mv *.fna "/analysis_GH/Cps_outgroup_faa_fna"

Isolate the .faa and .fna files that make up the Conservative Core GCs for C. accolens and (C. accolens + outgroup):

# Isolate the faa and fna files for (Cac)
## Create directory to hold the (Cac) faa and fna files
mkdir -p "/analysis_GH/Cac_faa_fna"

## Go to the directory holding the faa and fna files
cd "/analysis_GH/Cac_gbk_venn"

## Moves all faa files into the new folder holding only faa and fna files
mv *.faa "/analysis_GH/Cac_faa_fna"

## Moves all fna files into the new folder holding only faa and fna files
mv *.fna "/analysis_GH/Cac_faa_fna"

---
# Isolate the faa and fna files for (Cac + outgroup)
## Create directory to hold the (Cac + outgroup) faa and fna files
mkdir -p "/analysis_GH/Cac_outgroup_faa_fna"

## Go to the directory holding the faa and fna files
cd "/analysis_GH/Cac_gbk_outgroup_venn"

## Moves all faa files into the new folder holding only faa and fna files
mv *.faa "/analysis_GH/Cac_outgroup_faa_fna"

## Moves all fna files into the new folder holding only faa and fna files
mv *.fna "/analysis_GH/Cac_outgroup_faa_fna"

Isolate the .faa and .fna files that make up the Conservative Core GCs for C. tuberculostearicum and (C. tuberculostearicum + outgroup):

# Isolate the faa and fna files for (Ctu)
## Create directory to hold the (Ctu) faa and fna files
mkdir -p "/analysis_GH/Ctu_faa_fna"

## Go to the directory holding the faa and fna files
cd "/analysis_GH/Ctu_gbk_venn"

## Moves all faa files into the new folder holding only faa and fna files
mv *.faa "/analysis_GH/Ctu_faa_fna"

## Moves all fna files into the new folder holding only faa and fna files
mv *.fna "/analysis_GH/Ctu_faa_fna"

---
# Isolate the faa and fna files for (Ctu + outgroup)
## Create directory to hold the (Ctu + outgroup) faa and fna files
mkdir -p "/analysis_GH/Ctu_outgroup_faa_fna"

## Go to the directory holding the faa and fna files
cd "/analysis_GH/Ctu_gbk_outgroup_venn"

## Moves all faa files into the new folder holding only faa and fna files
mv *.faa "/analysis_GH/Ctu_outgroup_faa_fna"

## Moves all fna files into the new folder holding only faa and fna files
mv *.fna "/analysis_GH/Ctu_outgroup_faa_fna"

3.5 Comparing the Conservative Core of the Species of Interest to the Conservative Core with an Additional Outgroup

This step compares the conservative core GCs of the species of interest alone with those of the species of interest plus an outgroup. Adding an outgroup decreases the number of GCs in the conservative core because fewer GCs are shared with the outgroup. From the comparison we obtain a list of the gene clusters that are not shared with the outgroup, and we can use this list to recover the lost gene clusters for our analysis when making phylogenies. The -r flag takes the first cluster directory as the reference set.
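
As a rough cross-check of the size of that decrease, the number of .faa cluster files in the two directories can be compared before any clusters are recovered. This is only a sketch; the helper names count_faa and core_size_difference are hypothetical.

```shell
# Hypothetical helper: count the .faa cluster files directly inside a directory.
count_faa() {
    find "$1" -maxdepth 1 -type f -name '*.faa' | wc -l | tr -d ' '
}

# Print how many more clusters the first directory holds than the second.
# $1 = species-only directory, $2 = species + outgroup directory
core_size_difference() {
    echo $(( $(count_faa "$1") - $(count_faa "$2") ))
}
```

For C. propinquum this would be called as core_size_difference "/analysis_GH/Cpr_faa_fna" "/analysis_GH/Cpr_outgroup_faa_fna".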

C. propinquum Conservative Core vs C. propinquum Conservative Core + outgroup:

cd get_homologues

./compare_clusters.pl -o "/analysis_GH/Cpr_vs_Cpr_outgroup_gbk_venn" \
                      -d "/analysis_GH/Cpr_faa_fna","/analysis_GH/Cpr_outgroup_faa_fna" \
                      -r

C. pseudodiphtheriticum Conservative Core vs C. pseudodiphtheriticum Conservative Core + outgroup:

cd get_homologues

./compare_clusters.pl -o "/analysis_GH/Cps_vs_Cps_outgroup_gbk_venn" \
                      -d "/analysis_GH/Cps_faa_fna","/analysis_GH/Cps_outgroup_faa_fna" \
                      -r

C. accolens Conservative Core vs C. accolens Conservative Core + outgroup:

cd get_homologues

./compare_clusters.pl -o "/analysis_GH/Cac_vs_Cac_outgroup_gbk_venn" \
                      -d "/analysis_GH/Cac_faa_fna","/analysis_GH/Cac_outgroup_faa_fna" \
                      -r

C. tuberculostearicum Conservative Core vs C. tuberculostearicum Conservative Core + outgroup:

cd get_homologues

./compare_clusters.pl -o "/analysis_GH/Ctu_vs_Ctu_outgroup_gbk_venn" \
                      -d "/analysis_GH/Ctu_faa_fna","/analysis_GH/Ctu_outgroup_faa_fna" \
                      -r

3.6 Recover Lost Gene Clusters Due to the Addition of an Outgroup

Now in the output folder from the section above there will be a .txt file listing the gene clusters that the outgroup did NOT share with the species of interest. Find and move the .txt file to the directory holding only the .faa and .fna files of the conservative core of the species of interest. We will use this .txt file to isolate and add these gene clusters into the directory holding the .faa and .fna files of the conservative core from the species of interest + outgroup.

Recover lost gene clusters due to addition of an outgroup C. propinquum:

# Go to the folder of the Venn diagram comparing the faa and fna files of Cpr and (Cpr + outgroup)
cd "/analysis_GH/Cpr_vs_Cpr_outgroup_gbk_venn"

# Move the txt file to the Cpr faa and fna folder (the glob is left unquoted so the shell expands it)
mv unique_Cpr*.venn_t0.txt "/analysis_GH/Cpr_faa_fna"

# Go to the Cpr faa and fna folder
cd "/analysis_GH/Cpr_faa_fna"

# Make a directory to copy the lost GCs to
mkdir -p Recovered_GCs_Cpr

# Read the txt file and copy the lost GCs to the Recovered_GCs folder (-J is the BSD/macOS xargs option)
cat "unique_Cpr_faa_fna_19.venn_t0.txt" | xargs -J % cp % "Recovered_GCs_Cpr"

# We also need the fna files of these gene clusters. Open the .txt file with a text editor, such as BBEdit, search for ".faa" and replace all with ".fna". Then re-run the cat and xargs pipe.
# Second run for fna files
cat "unique_Cpr_faa_fna_19.venn_t0.txt" | xargs -J % cp % "Recovered_GCs_Cpr"

# Go to the parent folder
cd "/analysis_GH"

# Moves all recovered GCs faa and fna files into the (Cpr + outgroup) faa and fna directory
mv Cpr_faa_fna/Recovered_GCs_Cpr/*.f* "./Cpr_outgroup_faa_fna"
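
The manual search-and-replace step can be avoided by deriving each .fna name from its .faa name in the shell. The sketch below is an alternative under the assumption that the .txt file lists one .faa filename per line; it also avoids xargs -J, which is specific to BSD/macOS. The function name recover_clusters is hypothetical.

```shell
# Copy every cluster listed in a .txt file, together with its .fna counterpart,
# into a destination directory.
# $1 = list file (one .faa filename per line, an assumption)
# $2 = directory holding the cluster files
# $3 = destination directory
recover_clusters() {
    mkdir -p "$3"
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while IFS= read -r faa || [ -n "$faa" ]; do
        [ -n "$faa" ] || continue          # skip empty lines
        cp "$2/$faa" "$3"/
        cp "$2/${faa%.faa}.fna" "$3"/      # same basename, .fna extension
    done < "$1"
}
```

From inside the Cpr faa and fna folder this would be called as recover_clusters "unique_Cpr_faa_fna_19.venn_t0.txt" . "Recovered_GCs_Cpr".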

Recover gene clusters lost due to the addition of an outgroup for C. pseudodiphtheriticum:

# Go to the folder of the Venn diagram comparing the faa and fna files of Cps and (Cps + outgroup)
cd "/analysis_GH/Cps_vs_Cps_outgroup_gbk_venn"

# Move the txt file to the Cps faa and fna folder
mv unique_Cps*.venn_t0.txt "analysis_GH/Cps_faa_fna"

# Go to the Cps faa and fna folder
cd "analysis_GH/Cps_faa_fna"

# Make a directory to copy lost GCs to
mkdir -p Recovered_GCs_Cps

# Read the txt file and copy the lost GCs to the Recovered_GCs folder
cat "unique_Cps_faa_fna_19.venn_t0.txt" | xargs -J % cp % "Recovered_GCs_Cps"

# We also need the .fna files of these gene clusters. Open the .txt file in a text editor (we use BBEdit), replace every ".faa" with ".fna", save, then re-run the cat and xargs pipe.
# Second run, for the .fna files
cat "unique_Cps_faa_fna_19.venn_t0.txt" | xargs -J % cp % "Recovered_GCs_Cps" 

# Go inside the parent folder
cd "analysis_GH" 

# Move all recovered GCs faa and fna files into the (Cps + outgroup) faa and fna directory
mv Cps_faa_fna/Recovered_GCs_Cps/*.f* "./Cps_outgroup_faa_fna"

Recover gene clusters lost due to the addition of an outgroup for C. accolens:

# Go to the folder of the Venn diagram comparing the faa and fna files of Cac and (Cac + outgroup)
cd "/analysis_GH/Cac_vs_Cac_outgroup_gbk_venn"

# Move the txt file to the Cac faa and fna folder
mv unique_Cac*.venn_t0.txt "analysis_GH/Cac_faa_fna"

# Go to the Cac faa and fna folder
cd "analysis_GH/Cac_faa_fna"

# Make a directory to copy lost GCs to
mkdir -p Recovered_GCs_Cac

# Read the txt file and copy the lost GCs to the Recovered_GCs folder
cat "unique_Cac_faa_fna_19.venn_t0.txt" | xargs -J % cp % "Recovered_GCs_Cac"

# We also need the .fna files of these gene clusters. Open the .txt file in a text editor (we use BBEdit), replace every ".faa" with ".fna", save, then re-run the cat and xargs pipe.
# Second run, for the .fna files
cat "unique_Cac_faa_fna_19.venn_t0.txt" | xargs -J % cp % "Recovered_GCs_Cac" 

# Go inside the parent folder
cd "analysis_GH" 

# Move all recovered GCs faa and fna files into the (Cac + outgroup) faa and fna directory
mv Cac_faa_fna/Recovered_GCs_Cac/*.f* "./Cac_outgroup_faa_fna"

Recover gene clusters lost due to the addition of an outgroup for C. tuberculostearicum:

# Go to the folder of the Venn diagram comparing the faa and fna files of Ctu and (Ctu + outgroup)
cd "/analysis_GH/Ctu_vs_Ctu_outgroup_gbk_venn"

# Move the txt file to the Ctu faa and fna folder
mv unique_Ctu*.venn_t0.txt "analysis_GH/Ctu_faa_fna"

# Go to the Ctu faa and fna folder
cd "analysis_GH/Ctu_faa_fna"

# Make a directory to copy lost GCs to
mkdir -p Recovered_GCs_Ctu

# Read the txt file and copy the lost GCs to the Recovered_GCs folder
cat "unique_Ctu_faa_fna_19.venn_t0.txt" | xargs -J % cp % "Recovered_GCs_Ctu"

# We also need the .fna files of these gene clusters. Open the .txt file in a text editor (we use BBEdit), replace every ".faa" with ".fna", save, then re-run the cat and xargs pipe.
# Second run, for the .fna files
cat "unique_Ctu_faa_fna_19.venn_t0.txt" | xargs -J % cp % "Recovered_GCs_Ctu" 

# Go inside the parent folder
cd "analysis_GH" 

# Move all recovered GCs faa and fna files into the (Ctu + outgroup) faa and fna directory
mv Ctu_faa_fna/Recovered_GCs_Ctu/*.f* "./Ctu_outgroup_faa_fna"
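
Before moving on to alignment, it can be worth confirming that every recovered cluster arrived with both of its files. The following is a minimal sketch, assuming each gene cluster is represented by one .faa and one .fna file sharing a base name; check_faa_fna_pairs is a hypothetical helper, not part of GET_HOMOLOGUES or GET_PHYLOMARKERS.

```shell
# Hypothetical helper: report .faa/.fna files in a directory that lack their partner.
# Returns success (0) only when every file is paired.
check_faa_fna_pairs() {
  dir=$1
  unpaired=0
  for f in "$dir"/*.faa "$dir"/*.fna; do
    [ -e "$f" ] || continue            # pattern matched nothing, skip it
    case $f in
      *.faa) twin="${f%.faa}.fna" ;;   # partner of a protein file
      *)     twin="${f%.fna}.faa" ;;   # partner of a nucleotide file
    esac
    if [ ! -e "$twin" ]; then
      echo "missing partner: $twin"
      unpaired=$((unpaired + 1))
    fi
  done
  echo "unpaired files: $unpaired"
  [ "$unpaired" -eq 0 ]
}
```

For example, check_faa_fna_pairs "analysis_GH/Ctu_outgroup_faa_fna" prints any missing partner files followed by a count, and its exit status can gate the next step.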

3.7 Concatenate and Align Files using GET_PHYLOMARKERS for Each of the Four Corynebacterium species

Before running GET_PHYLOMARKERS, we opened the file run_get_phylomarkers_pipeline.sh and commented out lines 925-932. This removes a block that halts the pipeline when the GCs do not all contain the same number of concatenated taxa.
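
Commenting out a fixed range of lines can also be done from the shell rather than in a text editor. The following is a minimal sketch with a hypothetical helper, comment_out_lines, which keeps a .bak copy of the untouched script. The range 925-932 refers to the copy of the script used here, so inspect those lines first to confirm they hold the intended block.

```shell
# Hypothetical helper: prefix lines FIRST..LAST of a script with '#'.
# Usage: comment_out_lines SCRIPT FIRST LAST
comment_out_lines() {
  script=$1
  first=$2
  last=$3
  cp "$script" "$script.bak"                              # keep the original for reference
  sed "${first},${last}s/^/#/" "$script.bak" > "$script"  # rewrite the script in place
}
```

A call such as comment_out_lines /Users/username/get_phylomarkers/run_get_phylomarkers_pipeline.sh 925 932 would apply the edit described above; sed -n '925,932p' on the .bak file shows what was disabled.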

(C. propinquum + outgroup) concatenate and align the .faa and .fna files to generate codon alignment files:

# Go to the (Cpr + outgroup) faa and fna directory
cd "analysis_GH/Cpr_outgroup_faa_fna" 

# Run get_phylomarkers
/Users/username/get_phylomarkers/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7

(C. pseudodiphtheriticum + outgroup) concatenate and align the .faa and .fna files to generate codon alignment files:

# Go to the (Cps + outgroup) faa and fna directory
cd "analysis_GH/Cps_outgroup_faa_fna"

# Run get_phylomarkers
/Users/username/get_phylomarkers/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7

(C. accolens + outgroup) concatenate and align the .faa and .fna files to generate codon alignment files:

# Go to the (Cac + outgroup) faa and fna directory
cd "analysis_GH/Cac_outgroup_faa_fna" 

# Run get_phylomarkers
/Users/username/get_phylomarkers/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7

(C. tuberculostearicum + outgroup) concatenate and align the .faa and .fna files to generate codon alignment files:

# Go to the (Ctu + outgroup) faa and fna directory
cd "analysis_GH/Ctu_outgroup_faa_fna" 

# Run get_phylomarkers
/Users/username/get_phylomarkers/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7

3.8 Generate a Phylogeny treefile using IQ-TREE v2.1.3 for Four Corynebacterium species with Their Corresponding Outgroups

(C. propinquum + outgroup) generate treefile:

# Go to IQ-TREE2
cd IQ-TREE2

# run IQ-TREE2 on the codon alignments
bin/IQ-TREE2 -p "analysis_GH/Cpr_outgroup_faa_fna/get_phylomarkers*/codon_alignments" --prefix Cpr_outgroup -alrt 1000 -B 1000 -T 8

(C. pseudodiphtheriticum + outgroup) generate treefile:

# Go to IQ-TREE2
cd IQ-TREE2

# run IQ-TREE2 on the codon alignments
bin/IQ-TREE2 -p "analysis_GH/Cps_outgroup_faa_fna/get_phylomarkers*/codon_alignments" --prefix Cps_outgroup -alrt 1000 -B 1000 -T 8

(C. accolens + outgroup) generate treefile:

# Go to IQ-TREE2
cd IQ-TREE2

# run IQ-TREE2 on the codon alignments
bin/IQ-TREE2 -p "analysis_GH/Cac_outgroup_faa_fna/get_phylomarkers*/codon_alignments" --prefix Cac_outgroup -alrt 1000 -B 1000 -T 8

(C. tuberculostearicum + outgroup) generate treefile:

# Go to IQ-TREE2
cd IQ-TREE2

# run IQ-TREE2 on the codon alignments
bin/IQ-TREE2 -p "analysis_GH/Ctu_outgroup_faa_fna/get_phylomarkers*/codon_alignments" --prefix Ctu_outgroup -alrt 1000 -B 1000 -T 8

This creates a .treefile file with the best-fit maximum likelihood (ML) tree for each species plus outgroup. These .treefile files can be viewed and edited in the iTOL annotation editor (Letunic and Bork, 2021) in the Google Chrome browser. In iTOL, the trees can be scaled and assigned strain names from the tree_labels.list files created by GET_PHYLOMARKERS. After scaling and assigning strain names, the .svg versions can be exported from iTOL and further edited in Adobe Illustrator.

4 Rarefaction Curves for Each of the Four Corynebacterium Species

Here we will be using ./get_homologues.pl with the flag -c to report genome composition analysis using OMCL.

First we located and opened the file lib/marfil_homology.pm, and updated some of the default parameters identified in lines 136-137:

# From:
$MIN_PERSEQID_HOM = 0;
$MIN_COVERAGE_HOM = 20;

# To:
$MIN_PERSEQID_HOM = 50;
$MIN_COVERAGE_HOM = 50;
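
The same edit can be applied with sed. The following is a minimal sketch, assuming the two assignments appear in the file in the form shown above (variable name, equals sign, integer, semicolon); set_homology_thresholds is a hypothetical helper, and it keeps a .bak copy of the original module.

```shell
# Hypothetical helper: set both homology thresholds to 50 in a copy-safe way.
# Usage: set_homology_thresholds PATH_TO_marfil_homology.pm
set_homology_thresholds() {
  pm_file=$1
  cp "$pm_file" "$pm_file.bak"     # keep the original defaults
  # Replace whatever integer is currently assigned with 50
  sed -e 's/\$MIN_PERSEQID_HOM *= *[0-9][0-9]*;/$MIN_PERSEQID_HOM = 50;/' \
      -e 's/\$MIN_COVERAGE_HOM *= *[0-9][0-9]*;/$MIN_COVERAGE_HOM = 50;/' \
      "$pm_file.bak" > "$pm_file"
}
```

Running grep -n 'MIN_.*_HOM' on the module before and after the call confirms that only the two intended lines changed.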

4.1 Create Core-genome and Pangenome `.tab` Files

Create C. propinquum rarefaction core-genome and pangenome .tab file:

cd get_homologues
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk" -c -n 8 -M -C 90

Create C. pseudodiphtheriticum rarefaction core-genome and pangenome .tab file:

cd get_homologues
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk" -c -n 8 -M -C 90

Create C. accolens rarefaction core-genome and pangenome .tab file:

cd get_homologues
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk" -c -n 8 -M -C 90

Create C. tuberculostearicum rarefaction core-genome and pangenome .tab file:

cd get_homologues
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk" -c -n 8 -M -C 90
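
The four commands differ only in the species prefix, so they can be generated by a loop. This sketch only prints the commands (a dry run) so they can be checked before execution; print_rarefaction_cmds is a hypothetical helper name, while the paths and flags are the ones used above.

```shell
# Hypothetical helper: print one rarefaction command per species prefix.
print_rarefaction_cmds() {
  for sp in Cpr Cps Cac Ctu; do
    # Same flags as the commands above; only the input directory changes
    printf '%s\n' "./get_homologues.pl -d \"/analysis_GH/Prokka_out/Prokka_${sp}_gbk\" -c -n 8 -M -C 90"
  done
}

print_rarefaction_cmds
```

Once the printed lines look right, they can be run one after another from inside the get_homologues folder, for example by piping the output to sh.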

4.2 Plot Core-genome and Pangenome `.tab` Files

Plot C. propinquum rarefaction core genome and pangenome curve:

cd get_homologues

# Plot the core genome size estimate with the Tettelin (red) and Willenbrock (blue) best fit lines
./plot_pancore_matrix.pl -i "/Users/username/get_homologues/Prokka_Cpr_gbk/core_genome_algOMCL_C90.tab" -f core_both

# Plot the pangenome size estimate with the Tettelin (red) best fit line
./plot_pancore_matrix.pl -i "/Users/username/get_homologues/Prokka_Cpr_gbk/pan_genome_algOMCL_C90.tab" -f pan

Plot C. pseudodiphtheriticum rarefaction core genome and pangenome curve:

cd get_homologues

# Plot the core genome size estimate with the Tettelin (red) and Willenbrock (blue) best fit lines
./plot_pancore_matrix.pl -i "/Users/username/get_homologues/Prokka_Cps_gbk/core_genome_algOMCL_C90.tab" -f core_both

# Plot the pangenome size estimate with the Tettelin (red) best fit line
./plot_pancore_matrix.pl -i "/Users/username/get_homologues/Prokka_Cps_gbk/pan_genome_algOMCL_C90.tab" -f pan

Plot C. accolens rarefaction core genome and pangenome curve:

cd get_homologues

# Plot the core genome size estimate with the Tettelin (red) and Willenbrock (blue) best fit lines
./plot_pancore_matrix.pl -i "/Users/username/get_homologues/Prokka_Cac_gbk/core_genome_algOMCL_C90.tab" -f core_both

# Plot the pangenome size estimate with the Tettelin (red) best fit line
./plot_pancore_matrix.pl -i "/Users/username/get_homologues/Prokka_Cac_gbk/pan_genome_algOMCL_C90.tab" -f pan

Plot C. tuberculostearicum rarefaction core genome and pangenome curve:

cd get_homologues

# Plot the core genome size estimate with the Tettelin (red) and Willenbrock (blue) best fit lines
./plot_pancore_matrix.pl -i "/Users/username/get_homologues/Prokka_Ctu_gbk/core_genome_algOMCL_C90.tab" -f core_both

# Plot the pangenome size estimate with the Tettelin (red) best fit line
./plot_pancore_matrix.pl -i "/Users/username/get_homologues/Prokka_Ctu_gbk/pan_genome_algOMCL_C90.tab" -f pan

After the .svg files were created they were further edited in Adobe Illustrator.

4.3 C. pseudodiphtheriticum and C. accolens troubleshooting

We noticed that the core-genome rarefaction curves for C. pseudodiphtheriticum and C. accolens were splitting, which gave a false lower bound. We fixed this by dropping Cps_090104 and Cac_ATCC_49726 from the analysis. A separate, more detailed analysis helped pinpoint these two specific genomes.
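
One way to leave a genome out without moving files is the -I flag of get_homologues.pl (see section 6.1), which restricts a run to the files named in a list. The following is a minimal sketch of building such a list, not necessarily how the genomes were dropped here; make_include_list is a hypothetical helper, and it assumes the .gbk file names begin with the strain identifier (for example Cps_090104).

```shell
# Hypothetical helper: list every .gbk file name in a directory except one strain.
# Usage: make_include_list GBK_DIR STRAIN_PREFIX > include_list.txt
make_include_list() {
  gbk_dir=$1
  dropped=$2
  for gbk in "$gbk_dir"/*.gbk; do
    [ -e "$gbk" ] || continue          # directory holds no .gbk files
    name=$(basename "$gbk")
    case $name in
      "$dropped"*) ;;                  # leave the dropped strain out of the list
      *) printf '%s\n' "$name" ;;
    esac
  done
}
```

The resulting file could then be passed to the rarefaction run, for example make_include_list "/analysis_GH/Prokka_out/Prokka_Cps_gbk" Cps_090104 > Cps_include.txt followed by ./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk" -I Cps_include.txt -c -n 8 -M -C 90.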

5 Pangenome Matrix for Each of the Four Corynebacterium Species

This step again uses the command ./get_homologues.pl, this time with the flag -t 0 to report all possible clusters. This option is only available with the OMCL and COGS clustering algorithms.

We have already run these previously, so you can reuse the cluster output directories that were produced and move on to the next step.

C. propinquum find all the possible clusters with the COGS and OMCL algorithms:

cd get_homologues

# Predicts Pangenome GCs with COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk" -n 8 -G -t 0 -C 90

# Predicts Pangenome GCs with OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cpr_gbk" -n 8 -M -t 0 -C 90

C. pseudodiphtheriticum find all the possible clusters with the COGS and OMCL algorithms:

cd get_homologues

# Predicts Pangenome GCs with COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk" -n 8 -G -t 0 -C 90

# Predicts Pangenome GCs with OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cps_gbk" -n 8 -M -t 0 -C 90

C. accolens find all the possible clusters with the COGS and OMCL algorithms:

cd get_homologues

# Predicts Pangenome GCs with COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk" -n 8 -G -t 0 -C 90

# Predicts Pangenome GCs with OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Cac_gbk" -n 8 -M -t 0 -C 90

C. tuberculostearicum find all the possible clusters with the COGS and OMCL algorithms:

cd get_homologues

# Predicts Pangenome GCs with COGS
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk" -n 8 -G -t 0 -C 90

# Predicts Pangenome GCs with OMCL
./get_homologues.pl -d "/analysis_GH/Prokka_out/Prokka_Ctu_gbk" -n 8 -M -t 0 -C 90

5.1 Compile and Create Pangenome Matrix Tab File

This step compiles the OMCL and COGS cluster directories into a pangenome matrix. We add the flag -m to create the intersection pangenome matrix. This produces the pangenome_matrix_t0.tab file used in the next step.

C. propinquum intersection pangenomes of OMCL & COGS:

cd get_homologues

./compare_clusters.pl -o "/analysis_GH/Cpr_gbk_pangenome_venn" \
                      -m \
                      -d "/analysis_GH/Cpr_gbk_pangenome_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_","/analysis_GH/Cpr_gbk_pangenome_cluster_files/C*_f0_0taxa_algCOG_e0_C90_"

C. pseudodiphtheriticum intersection pangenomes of OMCL & COGS:

cd get_homologues

./compare_clusters.pl -o "/analysis_GH/Cps_gbk_pangenome_venn" \
                      -m \
                      -d "/analysis_GH/Cps_gbk_pangenome_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_","/analysis_GH/Cps_gbk_pangenome_cluster_files/C*_f0_0taxa_algCOG_e0_C90_"

C. accolens intersection pangenomes of OMCL & COGS:

cd get_homologues

./compare_clusters.pl -o "/analysis_GH/Cac_gbk_pangenome_venn" \
                      -m \
                      -d "/analysis_GH/Cac_gbk_pangenome_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_","/analysis_GH/Cac_gbk_pangenome_cluster_files/C*_f0_0taxa_algCOG_e0_C90_"

C. tuberculostearicum intersection pangenomes of OMCL & COGS:

cd get_homologues

./compare_clusters.pl -o "/analysis_GH/Ctu_gbk_pangenome_venn" \
                      -m \
                      -d "/analysis_GH/Ctu_gbk_pangenome_cluster_files/C*_f0_0taxa_algOMCL_e0_C90_","/analysis_GH/Ctu_gbk_pangenome_cluster_files/C*_f0_0taxa_algCOG_e0_C90_"

5.2 Plot the Pangenome Matrix Tab File

C. propinquum plot pangenome:

./parse_pangenome_matrix.pl -m "/analysis_GH/Cpr_gbk_pangenome_venn/pangenome_matrix_t0.tab" -s -x

C. pseudodiphtheriticum plot pangenome:

./parse_pangenome_matrix.pl -m "/analysis_GH/Cps_gbk_pangenome_venn/pangenome_matrix_t0.tab" -s -x

C. accolens plot pangenome:

./parse_pangenome_matrix.pl -m "/analysis_GH/Cac_gbk_pangenome_venn/pangenome_matrix_t0.tab" -s -x

C. tuberculostearicum plot pangenome:

./parse_pangenome_matrix.pl -m "/analysis_GH/Ctu_gbk_pangenome_venn/pangenome_matrix_t0.tab" -s -x

After the .svg files were created they were further edited in Adobe Illustrator.
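
The matrix can also be inspected directly from the shell. The following is a minimal sketch, assuming pangenome_matrix_t0.tab is tab-separated with cluster names in the first row and one row per genome holding the number of sequences that genome contributes to each cluster; cluster_occupancy is a hypothetical helper that counts, for each cluster, how many genomes contain it.

```shell
# Hypothetical helper: print "cluster<TAB>number of genomes containing it".
# Assumed layout: row 1 = label + cluster names; later rows = genome + counts.
cluster_occupancy() {
  awk -F '\t' '
    NR == 1 {                                   # header row: remember cluster names
      for (i = 2; i <= NF; i++) name[i] = $i
      ncol = NF
      next
    }
    {                                           # genome row: count presence per cluster
      for (i = 2; i <= ncol; i++) if ($i + 0 > 0) occ[i]++
    }
    END {
      for (i = 2; i <= ncol; i++)
        if (name[i] != "") printf "%s\t%d\n", name[i], occ[i]
    }
  ' "$1"
}
```

Clusters whose count equals the number of genomes are present in every genome, and clusters with a count of 1 are strain-specific, which gives a quick cross-check against the plotted categories.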

6 Main Commands and Flags

Flags that are bolded are the ones that we used or modified.

6.1 `./get_homologues.pl`

Flags	Meanings
-h	help message
-v	print version, credits and checks installation
-d	directory with input .gbk files
-i	input amino acid FASTA file with [taxon names] in headers
-o	only run BLAST/Pfam searches and exit
-c	report genome composition analysis
-R	set random seed for genome composition analysis
-s	save memory by using BerkeleyDB; default parsing stores sequence hits in RAM
-m	runmode [local|cluster|dryrun] (default local)
-n	number of threads for BLAST/HMMER/MCL in ‘local’ runmode (default=2)
-I	file with .faa/.gbk files in -d to be included (takes all by default, requires -d)

Clustering Algorithms (default is Bidirectional Best Hits (BDBH)):

-G	use COGtriangle algorithm (COGS, PubMed=20439257), (requires 3+ genomes|taxa)
-M	use orthoMCL algorithm (OMCL, PubMed=12952885)

Options that control sequence similarity searches:

-X	use diamond instead of blastp
-C	min %coverage in BLAST pairwise alignments (range [1-100],default=75)
-E	max E-value (default=1e-05,max=0.01)
-D	require equal Pfam domain composition when defining similarity-based orthology
-S	min %sequence identity in BLAST query/subj pairs (range [1-100],default=1 [BDBH|OMCL])
-N	min BLAST neighborhood correlation PubMed=18475320 (range [0,1],default=0 [BDBH|OMCL])
-b	compile core-genome with minimum BLAST searches (ignores -c [BDBH])

Options that control clustering:

-t	report sequence clusters including at least t taxa (default t=numberOfTaxa,t=0 reports all clusters [OMCL|COGS])
-a	report clusters of sequence features in GenBank files instead of default ‘CDS’ GenBank features
-g	report clusters of intergenic sequences flanked by ORFs in addition to default ‘CDS’ clusters
-f	filter by %length difference within clusters (range [1-100], by default sequence length is not checked)
-r	reference proteome .faa/.gbk file
-e	exclude clusters with inparalogues
-x	allow sequences in multiple COG clusters
-F	orthoMCL inflation value
-A	calculate average identity of clustered sequences, by default uses blastp results but can use blastn with -a
-P	calculate percentage of conserved proteins (POCP), by default uses blastp results but can use blastn with -a
-z	add soft-core to genome composition analysis

6.2 `./compare_clusters.pl`

Flags	Meanings
-h	help message
-d	comma-separated names of cluster directories
-o	output directory
-n	use nucleotide sequence .fna clusters
-r	take first cluster dir as reference set, which might contain a single representative sequence per cluster
-s	use only clusters with syntenic genes
-t	use only clusters with single-copy orthologues from taxa >= t
-I	produce clusters with single-copy seqs from ALL taxa in file
-m	produce intersection pangenome matrices
-x	produce cluster report in OrthoXML format
-T	produce parsimony-based pangenomic tree

6.3 `./parse_pangenome_matrix.pl`

Flags	Meanings
-h	help message
-m	input pangenome matrix .tab
-s	report cloud,shell,soft core and core clusters
-l	list taxa names present in clusters reported in -m matrix
-x	produce matrix of intersection pangenome clusters
-I	use only taxon names included in file
-A	file with taxon names (.faa,.gbk,.nucl files) of group A
-B	file with taxon names (.faa,.gbk,.nucl files) of group B
-a	find genes/clusters which are absent in B
-g	find genes/clusters present in A which are absent in B
-e	find gene family expansions in A with respect to B
-S	skip clusters with occupancy <S
-P	percentage of genomes that must comply presence/absence

References

Chernomor, O., Haeseler, A. von and Minh, B. Q. (2016). Terrace Aware Data Structure for Phylogenomic Inference from Supermatrices. Systematic Biology 65, 997–1008.

Contreras-Moreira, B. and Vinuesa, P. (2013). GET_HOMOLOGUES, a Versatile Software Package for Scalable and Robust Microbial Pangenome Analysis. Applied and Environmental Microbiology 79, 7696–7701.

Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., Haeseler, A. von and Jermiin, L. S. (2017). ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods 14, 587–589.

Kristensen, D. M., Kannan, L., Coleman, M. K., Wolf, Y. I., Sorokin, A., Koonin, E. V. and Mushegian, A. (2010). A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics 26, 1481–1487.

Letunic, I. and Bork, P. (2021). Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Research 49, W293–W296.

Li, L., Stoeckert, C. J. and Roos, D. S. (2003). OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Research 13, 2178–2189.

Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., Haeseler, A. von and Lanfear, R. (2020). IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution 37, 1530–1534.

Vinuesa, P., Ochoa-Sánchez, L. E. and Contreras-Moreira, B. (2018). GET_PHYLOMARKERS, a software package to select optimal orthologous clusters for phylogenomics and inferring pan-genome phylogenies, used for a critical geno-taxonomic revision of the genus Stenotrophomonas. Frontiers in Microbiology 9.

