# conda activate get_homologues
path_i="data/GET_HOMOLOGUES/Prokka_gbk_TreeCTSC_gbk"
# Cluster with BDBH
get_homologues.pl -d "$path_i" -n 16 -t 30 -C 90
# Cluster with COGS
get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -G
# Cluster with OMCL
get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -M
Phylogenies
Corynebacterium tuberculostearicum Species Complex Conservative Core Phylogeny
Corynebacterium genomes listed in NovCor_CTSC_Tree_GenomeList.csv
with Mycobacterium_tuberculosis_H37Rv as an outgroup were used in the next 3 phylogenies.
Clustered with BDBH, COGS, and OMCL:
From our experience you will run into less issues if you move the three cluster folders and cluster list files into a new directory we manually moved them to data/GET_HOMOLOGUES/CTSC_Clusters
. Then use the paths to those cluster folders after the -d flag separated by commas.
Create conservative core Venn diagram intersection to output .faa and .fna files:
# conda activate get_homologues
path_i="data/GET_HOMOLOGUES/CTSC_clusters"
path_o="data/GET_HOMOLOGUES/CTSC_intersection"
# Creates a conservative core with the output of faa files
compare_clusters.pl -o "$path_o" \
-d "$path_i/CTSC_f0_30taxa_algBDBH_e0_C90_,$path_i/CTSC_f0_0taxa_algCOG_e0_C90_,$path_i/CTSC_f0_0taxa_algOMCL_e0_C90_" \
-t 30
# Creates a conservative core with the output of fna files from the additional `-n` flag
compare_clusters.pl -o "$path_o" \
-d "$path_i/CTSC_f0_30taxa_algBDBH_e0_C90_,$path_i/CTSC_f0_0taxa_algCOG_e0_C90_,$path_i/CTSC_f0_0taxa_algOMCL_e0_C90_" \
-t 30 \
-n
We had a total of 309 conservative core gene clusters. At this point we have created .faa
and .fna
files of the core gene clusters from 30 Corynebacterium genomes. Next, to begin generating a phylogenomic tree we isolated only the .faa and .fna files into a new folder.
Create a new folder and move all files with the extension .faa and .fna into a new destination folder:
cd "data/GET_HOMOLOGUES/CTSC_intersection"
mkdir -p "data/GET_HOMOLOGUES/CTSC_faa_fna"
path_o="data/GET_HOMOLOGUES/CTSC_faa_fna"
mv *.faa "$path_o"
mv *.fna "$path_o"
Now we use GET_PHYLOMARKERS version 2.2.9.1 (Vinuesa, Ochoa-Sánchez, and Contreras-Moreira 2018) to generate a concatenated alignment file for each single copy core gene cluster.
Concatenate and align the files inside data/GET_HOMOLOGUES/CTSC_faa_fna
to generate codon alignment files:
# Go to directory holding all the .fna and .faa files only
cd "data/GET_HOMOLOGUES/CTSC_faa_fna"
# Run the get_phylomarkers master script using the path from where it is installed on your system
$path/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7
GET_PHYLOMARKERS will generate codon alignment .fasta
files which we moved to data/GET_HOMOLOGUES/CTSC_aligned_concatenated_core_GCs
:
cd "data/GET_HOMOLOGUES/CTSC_faa_fna/get_phylomarkers"
mv *.fasta "data/GET_HOMOLOGUES/CTSC_aligned_concatenated_core_GCs"
Run IQ-TREE v2.2.2.6 (Minh et al. 2020) with the codon alignment folder containing codon .fasta
files:
# Go to where IQTREE2 is installed on your system
cd $path/IQTREE2
# Run iqtree2
path_i="data/GET_HOMOLOGUES/CTSC_aligned_concatenated_core_GCs"
bin/iqtree2 -p "$path_i" --prefix CTSC_30 -alrt 1000 -B 1000 -T 16
The -p
flag performs edge-linked proportional partition model (Chernomor, Haeseler, and Minh 2016) for each of the individual gene clusters. Fast model selection for each cluster was determined by ModelFinder (Kalyaanamoorthy et al. 2017). The flags -alrt 1000
and -B 1000
represent 1000 replicates of sH-aLRT and UFbootstraps. -T 16
runs the program at 16 CPU threads.
This will create a .treefile
file with the best fit maximum likelyhood (ML) tree. This .treefile
can be viewed and edited through iTol annotation editor (Letunic and Bork 2021) on Google Chrome browser. In iTol the tree can be scaled and assigned strain names from the tree_labels.list
file created from GET_PHYLOMARKERS. After scaling and assigning strain names an .svg
was exported from iTol and further edited in Adobe Illustrator.
Corynebacterium tuberculostearicum Species 16S Phylogeny
Run barrnap Corynebacterium tuberculostearicum species complex genomes plus Mycobacterium_tuberculosis_H37Rv
# Make directories
mkdir -p "data/genomes/CTSC_fasta"
mkdir -p "data/genomes/CTSC_barrnap"
# Set paths
path_i="data/genomes/CTSC_fasta"
path_o="data/genomes/CTSC_barrnap"
# Run barrnap
for file in $path_i/*.fa*; do
FILENAME=`basename ${file%.*}`
echo $FILENAME
barrnap $file --outseq $path_o/$FILENAME.allrRNA.fasta --threads 16;
done
Filter out no copies, single copy, and multiple 16S rRNA copies
# Move strains with no 16s identified sequence into a folder
cd "data/genomes/CTSC_barrnap"
# Create folder to parse out different identified copy numbers
mkdir -p "nocopies"
mkdir -p "singlecopy"
mkdir -p "multiplecopies"
# Set paths
path_o="data/genomes/CTSC_barrnap"
nocopies="$path_o/nocopies"
singlecopy="$path_o/singlecopy"
multiplecopies="$path_o/multiplecopies"
# Move strains with no 16s identified sequence into a folder
for file in "$path_o"/*; do
if [ -f "$file" ]; then
count=$(grep -c "16S_rRNA" "$file")
if [ "$count" -eq 0 ]; then
echo "Extracting $file"
mv "$file" "$nocopies"
fi
fi
done
# Move strains with only one 16s copy into a folder
for file in "$path_o"/*; do
if [ -f "$file" ]; then
count=$(grep -c "16S_rRNA" "$file")
if [ "$count" -eq 1 ]; then
echo "Extracting $file"
mv "$file" "$singlecopy"
fi
fi
done
# Move strains with two or more 16s copies into a folder
for file in "$path_o"/*; do
if [ -f "$file" ]; then
count=$(grep -c "16S_rRNA" "$file")
if [ "$count" -ge 2 ]; then
echo "Extracting $file"
mv "$file" "$multiplecopies"
fi
fi
done
Strains that had NO identified 16S sequences are not used any further, in data/16_phylogenies/CTSC_barrnap/nocopies
. There were no strains in the CTSC group that no copies.
Single copy versus multiple copies instructions
Strains with a single 16S rRNA sequence
Subset the 16S rRNA sequences:
for file in $singlecopy/*.allrRNA.fasta; do
FILENAME=`basename ${file%.*.*}`
seqkit grep -r -n -p '16S_rRNA*' $file -o $singlecopy/$FILENAME.16S.fasta;
done
Rename Headers for the 16S.fasta
files:
for file in $singlecopy/*.16S.fasta; do
FILENAME=`basename ${file%.*.*}`
seqkit replace --f-by-name -p '16S_rRNA.*' -r $FILENAME $file -o $single_copy/$FILENAME.16Sclean.fasta;
done
Concatenate the single 16S rRNA copies:
cat $singlecopy/*.16Sclean.fasta > $singlecopy/All16S_singlecopy.fasta
Strains with multiple 16S rRNA copies
Subset the 16S rRNA sequences:
for file in $multiplecopies/*.allrRNA.fasta; do
FILENAME=`basename ${file%.*.*}`
seqkit grep -r -n -p '16S_rRNA*' $file -o $multiplecopies/$FILENAME.16S.fasta;
done
The command em_cons
is used for Ubuntu and cons
for MacOS after installing the EMBOSS package. Make consensus per genome:
for file in $multiplecopies/*.16S.fasta; do
FILENAME=`basename ${file%.*.*}`
em_cons $file -outseq $multiplecopies/$FILENAME.16Sconsensus.fasta;
done
Rename Headers for the 16sconsensus.fasta
files:
for file in $multiplecopies/*.16Sconsensus.fasta; do
FILENAME=`basename ${file%.*.*}`
seqkit replace --f-by-name -p 'EMBOSS_001' -r $FILENAME $file -o $multiplecopies/$FILENAME.16Sconsensusclean.fasta;
done
Concatenate:
cat $multiplecopies/*.16Sconsensusclean.fasta > $multiplecopies/All16Sconsensus.fasta
Concatenate and align single copy 16S sequences with consensus sequences
# Set paths and make directory
path_o="data/16s_phylogenies/CTSC_barrnap"
mkdir -p "$path_o/16S_CTSC_full_concatenated_alignment"
path_x="$path_o/16S_CTSC_full_concatenated_alignment"
# Concatenate the single and multiple copy .fasta files into one
cat $singlecopy/All16S_singlecopy.fasta $multiplecopies/All16Sconsensus.fasta > $path_x/All16S_concatenated.fasta
# Use muscle to align the sequences in the combined concatenated .fasta
muscle -in $path_x/All16S_concatenated.fasta -out $path_x/All16S_concatenated_and_aligned.fasta
Run IQTREE2
cd IQTREE2
bin/iqtree2 -s "$path_x/All16S_concatenated_and_aligned.fasta" --prefix 16s_Ctu_complex_test -alrt 1000 -B 1000 -T 16 -m MFP
Corynebacterium tuberculostearicum species complex genomes plus Mycobacterium_tuberculosis_H37Rv full length rpoB gene phylogeny
We located and used the concatenated and aligned rpoB gene generated from GET_PHYLOMARKERS:
cd IQTREE2
path_i="data/GET_HOMOLOGUES/rpoB/CTSC_rpoB_cdnAln.fasta"
bin/iqtree2 -s "$path_i" --prefix CTSC_rpoB -alrt 1000 -B 1000 -T 16
Use the .treefile
in iTol and Adobe illustrator
Overall Corynebacterium genus Conservative Core Phylogeny
The genomes used for the next 3 phylogenies are listed in the file NovCor_FigS1_Phylogenies_GenomeList.csv
.
72 Corynebacterium genomes clustered with BDBH, COGS, and OMCL:
# conda activate get_homologues
path_i="data/GET_HOMOLOGUES/Prokka_gbk_TreeCoryneGenus_gbk"
# Cluster with BDBH
get_homologues.pl -d "$path_i" -n 16 -t 72 -C 90
# Cluster with COGS
get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -G
# Cluster with OMCL
get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -M
Create conservative core Venn diagram intersection to output .faa and .fna files:
# conda activate get_homologues
path_i="data/GET_HOMOLOGUES/TreeCoryneGenus_clusters"
path_o="data/GET_HOMOLOGUES/TreeCoryneGenus_intersection"
# Creates a conservative core with the output of faa files
compare_clusters.pl -o "$path_o" \
-d "$path_i/CoryneGenus_f0_72taxa_algBDBH_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algCOG_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algOMCL_e0_C90_" \
-t 72
# Creates a conservative core with the output of fna files from the additional `-n` flag
compare_clusters.pl -o "$path_o" \
-d "$path_i/CoryneGenus_f0_72taxa_algBDBH_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algCOG_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algOMCL_e0_C90_" \
-t 72 \
-n
We had a total of 193 conservative core gene clusters.
Create a new folder and move all files with the extension .faa and .fna into a new destination folder:
cd "data/GET_HOMOLOGUES/TreeCoryneGenus_intersection"
mkdir -p "data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna"
path_o="data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna"
mv *.faa "$path_o"
mv *.fna "$path_o"
Concatenate and align the files inside data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna
to generate codon alignment files:
# Go to directory holding all the .fna and .faa files only
cd "data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna"
# Run the get_phylomarkers master script using the path from where it is installed on your system
$path/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7
Run IQ-TREE v2.2.2.6 (Minh et al. 2020) with the codon alignment folder containing codon .fasta
files:
# Go to where IQTREE2 is installed on your system
cd $path/IQTREE2
# Run iqtree2
path_i="data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna/get_phylomarkers/condon_fastas
bin/iqtree2 -p "$path_i" --prefix TreeCoryneGenus_72 -alrt 1000 -B 1000 -T 16
This will create a .treefile
file with the best fit maximum likelyhood (ML) tree. This .treefile
can be viewed and edited through iTol annotation editor (Letunic and Bork 2021) on Google Chrome browser. In iTol the tree can be scaled and assigned strain names from the tree_labels.list
file created from GET_PHYLOMARKERS. After scaling and assigning strain names an .svg
was exported from iTol and further edited in Adobe Illustrator.
Overall Corynebacterium genus 16s rRNA Phylogeny
Run barrnap for all genomes
# Set up paths
path_i="data/16S_phylogenies/CoryneGenus_fasta"
path_o="data/16S_phylogenies/CoryneGenus_barrnap"
# Run barrnap
for file in $path_i/*.fa*; do
FILENAME=`basename ${file%.*}`
echo $FILENAME
barrnap $file --outseq $path_o/$FILENAME.allrRNA.fasta --threads 8;
done
Subset the 16S rRNA sequences:
# conda activate seqkit
for file in $path_o/*.allrRNA.fasta; do
FILENAME=`basename ${file%.*.*}`
seqkit grep -r -n -p '16S_rRNA*' $file -o $path_o/$FILENAME.16S.fasta;
done
Move into different folders for no copy, single copy, and multiple 16S rRNA copies
# Move strains with no 16s identified sequence into a folder
cd "data/genomes/CoryneGenus_barrnap"
# Create folder to parse out different identified copy numbers
mkdir -p "nocopies"
mkdir -p "singlecopy"
mkdir -p "multiplecopies"
# Set paths
path_o="data/genomes/CoryneGenus_barrnap"
nocopies="$path_o/nocopies"
singlecopy="$path_o/singlecopy"
multiplecopies="$path_o/multiplecopies"
# Move strains with no 16s identified sequence into a folder
for file in "$path_o"/*; do
if [ -f "$file" ]; then
count=$(grep -c "16S_rRNA" "$file")
if [ "$count" -eq 0 ]; then
echo "Extracting $file"
mv "$file" "$nocopies"
fi
fi
done
# Move strains with only one 16s copy into a folder
for file in "$path_o"/*; do
if [ -f "$file" ]; then
count=$(grep -c "16S_rRNA" "$file")
if [ "$count" -eq 1 ]; then
echo "Extracting $file"
mv "$file" "$singlecopy"
fi
fi
done
# Move strains with two or more 16s copies into a folder
for file in "$path_o"/*; do
if [ -f "$file" ]; then
count=$(grep -c "16S_rRNA" "$file")
if [ "$count" -ge 2 ]; then
echo "Extracting $file"
mv "$file" "$multiplecopies"
fi
fi
done
Strains that had NO identified 16S sequences are not used any further, in data/16_phylognies/CoryneGenus_barrnap/nocopies
. C_mastitidis_DSM_44356 had no identified 16S rRNA sequence and was excluded from the final 16S rRNA phylogeny.
Instructions for single and multiple 16S rRNA copies
Strains with a single 16S rRNA sequence
Rename Headers for the 16S.fasta
files:
path_o="data/16S_phylogenies/CoryneGenus_barrnap"
single_copy="$path_o/single_copy"
for file in $single_copy/*.16S.fasta; do
FILENAME=`basename ${file%.*.*}`
seqkit replace --f-by-name -p '16S_rRNA.*' -r $FILENAME $file -o $single_copy/$FILENAME.16Sclean.fasta;
done
Concatenate the single 16S rRNA copies:
cat $single_copy/*.16Sclean.fasta > $single_copy/All16S_singlecopy.fasta
Strains with multiple 16S rRNA copies
The command em_cons
is used for Ubuntu and cons
for MacOS after installing the EMBOSS package. Make consensus per genome:
for file in $multiplecopies/*.16S.fasta; do
FILENAME=`basename ${file%.*.*}`
em_cons $file -outseq $multiplecopies/$FILENAME.16Sconsensus.fasta;
done
Rename Headers for the 16sconsensus.fasta
files:
for file in $multiplecopies/*.16Sconsensus.fasta; do
FILENAME=`basename ${file%.*.*}`
seqkit replace --f-by-name -p 'EMBOSS_001' -r $FILENAME $file -o $multiplecopies/$FILENAME.16Sconsensusclean.fasta;
done
Concatenate:
cat $multiplecopies/*.16Sconsensusclean.fasta > $multiplecopies/All16Sconsensus.fasta
Concatenate and align single copy 16S sequences with consensus sequences
# Set paths
path_o="data/16s_phylogenies/CoryneGenus_barrnap"
mkdir -p "$path_o/16S_CoryneGenus_full_concatenated_alignment"
path_x="$path_o/16S_CoryneGenus_full_concatenated_alignment"
# Concatenate the single and multiple copy .fasta files into one
cat $singlecopy/All16S_singlecopy.fasta $multiplecopies/All16Sconsensus.fasta > $path_x/All16S_concatenated.fasta
# Use muscle to align the sequences in the combined concatenated .fasta
muscle -in $path_x/All16S_concatenated.fasta -out $path_x/All16S_concatenated_and_aligned.fasta
Run iqtree2
cd /Users/klemonlab/iqtree-2.2.2.4-MacOSX
bin/iqtree2 -s "$path_o/16S_full_alignment/All16S_concatenated_and_aligned.fasta" --prefix Overall_Corye_NovCor_NO_MAGS -alrt 1000 -B 1000 -T 16 -m MFP
Overall Corynebacterium rpoB tree
cd IQTREE2
path_i="data/GET_HOMOLOGUES/rpoB/CoryneGenus_rpoB_cdnAln.fasta"
bin/iqtree2 -s "$path_i" --prefix CoryneGenus_rpoB -alrt 1000 -B 1000 -T 16
Use .treefile
in iTol and Adobe illustrator
The 3 phylogenies in Figure S3 were generated similarly and the genome list is NovCor_FigS3_Phylogenies_GenomeList_v01.csv.