Phylogenies

Corynebacterium tuberculostearicum Species Complex Conservative Core Phylogeny

Corynebacterium genomes listed in NovCor_CTSC_Tree_GenomeList.csv with Mycobacterium_tuberculosis_H37Rv as an outgroup were used in the next 3 phylogenies.

Clustered with BDBH, COGS, and OMCL:

# conda activate get_homologues

path_i="data/GET_HOMOLOGUES/Prokka_gbk_TreeCTSC_gbk"

# Cluster with BDBH
get_homologues.pl -d "$path_i" -n 16 -t 30 -C 90

# Cluster with COGS
get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -G

# Cluster with OMCL
get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -M

From our experience you will run into less issues if you move the three cluster folders and cluster list files into a new directory we manually moved them to data/GET_HOMOLOGUES/CTSC_Clusters. Then use the paths to those cluster folders after the -d flag separated by commas.

Create conservative core Venn diagram intersection to output .faa and .fna files:

# conda activate get_homologues

path_i="data/GET_HOMOLOGUES/CTSC_clusters"
path_o="data/GET_HOMOLOGUES/CTSC_intersection"

# Creates a conservative core with the output of faa files
compare_clusters.pl -o "$path_o" \
                      -d "$path_i/CTSC_f0_30taxa_algBDBH_e0_C90_,$path_i/CTSC_f0_0taxa_algCOG_e0_C90_,$path_i/CTSC_f0_0taxa_algOMCL_e0_C90_" \
                      -t 30
                        
# Creates a conservative core with the output of fna files from the additional `-n` flag
compare_clusters.pl -o "$path_o" \
                      -d "$path_i/CTSC_f0_30taxa_algBDBH_e0_C90_,$path_i/CTSC_f0_0taxa_algCOG_e0_C90_,$path_i/CTSC_f0_0taxa_algOMCL_e0_C90_" \
                      -t 30 \
                      -n

We had a total of 309 conservative core gene clusters. At this point we have created .faa and .fna files of the core gene clusters from 30 Corynebacterium genomes. Next, to begin generating a phylogenomic tree we isolated only the .faa and .fna files into a new folder.

Create a new folder and move all files with the extension .faa and .fna into a new destination folder:

cd "data/GET_HOMOLOGUES/CTSC_intersection"
mkdir -p "data/GET_HOMOLOGUES/CTSC_faa_fna"
path_o="data/GET_HOMOLOGUES/CTSC_faa_fna"
mv *.faa "$path_o"
mv *.fna "$path_o"

Now we use GET_PHYLOMARKERS version 2.2.9.1 (Vinuesa, Ochoa-Sánchez, and Contreras-Moreira 2018) to generate a concatenated alignment file for each single copy core gene cluster.

Concatenate and align the files inside data/GET_HOMOLOGUES/CTSC_faa_fna to generate codon alignment files:

# Go to directory holding all the .fna and .faa files only
cd "data/GET_HOMOLOGUES/CTSC_faa_fna" 

# Run the get_phylomarkers master script using the path from where it is installed on your system
$path/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7

GET_PHYLOMARKERS will generate codon alignment .fasta files which we moved to data/GET_HOMOLOGUES/CTSC_aligned_concatenated_core_GCs:

cd "data/GET_HOMOLOGUES/CTSC_faa_fna/get_phylomarkers"

mv *.fasta "data/GET_HOMOLOGUES/CTSC_aligned_concatenated_core_GCs"

Run IQ-TREE v2.2.2.6 (Minh et al. 2020) with the codon alignment folder containing codon .fasta files:

# Go to where IQTREE2 is installed on your system
cd $path/IQTREE2

# Run iqtree2
path_i="data/GET_HOMOLOGUES/CTSC_aligned_concatenated_core_GCs"

bin/iqtree2 -p "$path_i" --prefix CTSC_30 -alrt 1000 -B 1000 -T 16

The -p flag performs edge-linked proportional partition model (Chernomor, Haeseler, and Minh 2016) for each of the individual gene clusters. Fast model selection for each cluster was determined by ModelFinder (Kalyaanamoorthy et al. 2017). The flags -alrt 1000 and -B 1000 represent 1000 replicates of sH-aLRT and UFbootstraps. -T 16 runs the program at 16 CPU threads.

This will create a .treefile file with the best fit maximum likelyhood (ML) tree. This .treefile can be viewed and edited through iTol annotation editor (Letunic and Bork 2021) on Google Chrome browser. In iTol the tree can be scaled and assigned strain names from the tree_labels.list file created from GET_PHYLOMARKERS. After scaling and assigning strain names an .svg was exported from iTol and further edited in Adobe Illustrator.

Corynebacterium tuberculostearicum Species 16S Phylogeny

Run barrnap Corynebacterium tuberculostearicum species complex genomes plus Mycobacterium_tuberculosis_H37Rv

# Make directories
mkdir -p "data/genomes/CTSC_fasta"
mkdir -p "data/genomes/CTSC_barrnap"

# Set paths
path_i="data/genomes/CTSC_fasta"
path_o="data/genomes/CTSC_barrnap"

# Run barrnap
for file in $path_i/*.fa*; do
    FILENAME=`basename ${file%.*}`
    echo $FILENAME
    barrnap $file --outseq $path_o/$FILENAME.allrRNA.fasta --threads 16; 
done

Filter out no copies, single copy, and multiple 16S rRNA copies

# Move strains with no 16s identified sequence into a folder
cd "data/genomes/CTSC_barrnap"

# Create folder to parse out different identified copy numbers
mkdir -p "nocopies"
mkdir -p "singlecopy"
mkdir -p "multiplecopies"

# Set paths
path_o="data/genomes/CTSC_barrnap"
nocopies="$path_o/nocopies"
singlecopy="$path_o/singlecopy"
multiplecopies="$path_o/multiplecopies"

# Move strains with no 16s identified sequence into a folder
for file in "$path_o"/*; do
    if [ -f "$file" ]; then 
        count=$(grep -c "16S_rRNA" "$file")

        if [ "$count" -eq 0 ]; then
            echo "Extracting $file"
            mv "$file" "$nocopies"
        fi
    fi
done

# Move strains with only one 16s copy into a folder
for file in "$path_o"/*; do
    if [ -f "$file" ]; then  
        count=$(grep -c "16S_rRNA" "$file")
        if [ "$count" -eq 1 ]; then
            echo "Extracting $file"
            mv "$file" "$singlecopy"
        fi
    fi
done

# Move strains with two or more 16s copies into a folder
for file in "$path_o"/*; do
    if [ -f "$file" ]; then
        count=$(grep -c "16S_rRNA" "$file")
        if [ "$count" -ge 2 ]; then
            echo "Extracting $file"
            mv "$file" "$multiplecopies"
        fi
    fi
done

Strains that had NO identified 16S sequences are not used any further, in data/16_phylogenies/CTSC_barrnap/nocopies. There were no strains in the CTSC group that no copies.

Single copy versus multiple copies instructions

Strains with a single 16S rRNA sequence

Subset the 16S rRNA sequences:

for file in $singlecopy/*.allrRNA.fasta; do
    FILENAME=`basename ${file%.*.*}`
    seqkit grep -r -n -p '16S_rRNA*' $file -o $singlecopy/$FILENAME.16S.fasta;
done

Rename Headers for the 16S.fasta files:

for file in $singlecopy/*.16S.fasta; do
    FILENAME=`basename ${file%.*.*}`
    seqkit replace --f-by-name -p '16S_rRNA.*' -r $FILENAME $file -o $single_copy/$FILENAME.16Sclean.fasta;
done

Concatenate the single 16S rRNA copies:

cat $singlecopy/*.16Sclean.fasta > $singlecopy/All16S_singlecopy.fasta

Strains with multiple 16S rRNA copies

Subset the 16S rRNA sequences:

for file in $multiplecopies/*.allrRNA.fasta; do
    FILENAME=`basename ${file%.*.*}`
    seqkit grep -r -n -p '16S_rRNA*' $file -o $multiplecopies/$FILENAME.16S.fasta;
done

The command em_cons is used for Ubuntu and cons for MacOS after installing the EMBOSS package. Make consensus per genome:

for file in $multiplecopies/*.16S.fasta; do
    FILENAME=`basename ${file%.*.*}`
    em_cons $file -outseq $multiplecopies/$FILENAME.16Sconsensus.fasta;
done

Rename Headers for the 16sconsensus.fasta files:

for file in $multiplecopies/*.16Sconsensus.fasta; do
    FILENAME=`basename ${file%.*.*}`
    seqkit replace --f-by-name -p 'EMBOSS_001' -r $FILENAME $file -o $multiplecopies/$FILENAME.16Sconsensusclean.fasta;
done

Concatenate:

cat $multiplecopies/*.16Sconsensusclean.fasta > $multiplecopies/All16Sconsensus.fasta

Concatenate and align single copy 16S sequences with consensus sequences

# Set paths and make directory
path_o="data/16s_phylogenies/CTSC_barrnap"
mkdir -p "$path_o/16S_CTSC_full_concatenated_alignment"
path_x="$path_o/16S_CTSC_full_concatenated_alignment"

# Concatenate the single and multiple copy .fasta files into one
cat $singlecopy/All16S_singlecopy.fasta $multiplecopies/All16Sconsensus.fasta > $path_x/All16S_concatenated.fasta

# Use muscle to align the sequences in the combined concatenated .fasta
muscle -in $path_x/All16S_concatenated.fasta -out $path_x/All16S_concatenated_and_aligned.fasta

Run IQTREE2

cd IQTREE2

bin/iqtree2 -s "$path_x/All16S_concatenated_and_aligned.fasta" --prefix 16s_Ctu_complex_test -alrt 1000 -B 1000 -T 16 -m MFP

Corynebacterium tuberculostearicum species complex genomes plus Mycobacterium_tuberculosis_H37Rv full length rpoB gene phylogeny

We located and used the concatenated and aligned rpoB gene generated from GET_PHYLOMARKERS:

cd IQTREE2
path_i="data/GET_HOMOLOGUES/rpoB/CTSC_rpoB_cdnAln.fasta"

bin/iqtree2 -s "$path_i" --prefix CTSC_rpoB -alrt 1000 -B 1000 -T 16

Use the .treefile in iTol and Adobe illustrator

Overall Corynebacterium genus Conservative Core Phylogeny

The genomes used for the next 3 phylogenies are listed in the file NovCor_FigS1_Phylogenies_GenomeList.csv.

72 Corynebacterium genomes clustered with BDBH, COGS, and OMCL:

# conda activate get_homologues

path_i="data/GET_HOMOLOGUES/Prokka_gbk_TreeCoryneGenus_gbk"

# Cluster with BDBH
get_homologues.pl -d "$path_i" -n 16 -t 72 -C 90

# Cluster with COGS
get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -G

# Cluster with OMCL
get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -M

Create conservative core Venn diagram intersection to output .faa and .fna files:

# conda activate get_homologues

path_i="data/GET_HOMOLOGUES/TreeCoryneGenus_clusters"
path_o="data/GET_HOMOLOGUES/TreeCoryneGenus_intersection"

# Creates a conservative core with the output of faa files
compare_clusters.pl   -o "$path_o" \
                      -d "$path_i/CoryneGenus_f0_72taxa_algBDBH_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algCOG_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algOMCL_e0_C90_" \
                      -t 72
                        
# Creates a conservative core with the output of fna files from the additional `-n` flag
compare_clusters.pl   -o "$path_o" \
                      -d "$path_i/CoryneGenus_f0_72taxa_algBDBH_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algCOG_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algOMCL_e0_C90_" \
                      -t 72 \
                      -n

We had a total of 193 conservative core gene clusters.

Create a new folder and move all files with the extension .faa and .fna into a new destination folder:

cd "data/GET_HOMOLOGUES/TreeCoryneGenus_intersection"
mkdir -p "data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna"
path_o="data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna"
mv *.faa "$path_o"
mv *.fna "$path_o"

Concatenate and align the files inside data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna to generate codon alignment files:

# Go to directory holding all the .fna and .faa files only
cd "data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna" 

# Run the get_phylomarkers master script using the path from where it is installed on your system
$path/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7

Run IQ-TREE v2.2.2.6 (Minh et al. 2020) with the codon alignment folder containing codon .fasta files:

# Go to where IQTREE2 is installed on your system
cd $path/IQTREE2

# Run iqtree2
path_i="data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna/get_phylomarkers/condon_fastas

bin/iqtree2 -p "$path_i" --prefix TreeCoryneGenus_72 -alrt 1000 -B 1000 -T 16

Overall Corynebacterium genus 16s rRNA Phylogeny

Run barrnap for all genomes

# Set up paths
path_i="data/16S_phylogenies/CoryneGenus_fasta"
path_o="data/16S_phylogenies/CoryneGenus_barrnap"

# Run barrnap
for file in $path_i/*.fa*; do
    FILENAME=`basename ${file%.*}`
    echo $FILENAME
    barrnap $file --outseq $path_o/$FILENAME.allrRNA.fasta --threads 8; 
done

Subset the 16S rRNA sequences:

# conda activate seqkit

for file in $path_o/*.allrRNA.fasta; do
    FILENAME=`basename ${file%.*.*}`
    seqkit grep -r -n -p '16S_rRNA*' $file -o $path_o/$FILENAME.16S.fasta;
done

Move into different folders for no copy, single copy, and multiple 16S rRNA copies

# Move strains with no 16s identified sequence into a folder
cd "data/genomes/CoryneGenus_barrnap"

# Create folder to parse out different identified copy numbers
mkdir -p "nocopies"
mkdir -p "singlecopy"
mkdir -p "multiplecopies"

# Set paths
path_o="data/genomes/CoryneGenus_barrnap"
nocopies="$path_o/nocopies"
singlecopy="$path_o/singlecopy"
multiplecopies="$path_o/multiplecopies"

# Move strains with no 16s identified sequence into a folder
for file in "$path_o"/*; do
    if [ -f "$file" ]; then 
        count=$(grep -c "16S_rRNA" "$file")
        if [ "$count" -eq 0 ]; then
            echo "Extracting $file"
            mv "$file" "$nocopies"
        fi
    fi
done

# Move strains with only one 16s copy into a folder
for file in "$path_o"/*; do
    if [ -f "$file" ]; then  
        count=$(grep -c "16S_rRNA" "$file")
        if [ "$count" -eq 1 ]; then
            echo "Extracting $file"
            mv "$file" "$singlecopy"
        fi
    fi
done

# Move strains with two or more 16s copies into a folder
for file in "$path_o"/*; do
    if [ -f "$file" ]; then
        count=$(grep -c "16S_rRNA" "$file")
        if [ "$count" -ge 2 ]; then
            echo "Extracting $file"
            mv "$file" "$multiplecopies"
        fi
    fi
done

Strains that had NO identified 16S sequences are not used any further, in data/16_phylognies/CoryneGenus_barrnap/nocopies. C_mastitidis_DSM_44356 had no identified 16S rRNA sequence and was excluded from the final 16S rRNA phylogeny.

Instructions for single and multiple 16S rRNA copies

Strains with a single 16S rRNA sequence

Rename Headers for the 16S.fasta files:

path_o="data/16S_phylogenies/CoryneGenus_barrnap"
single_copy="$path_o/single_copy"

for file in $single_copy/*.16S.fasta; do
    FILENAME=`basename ${file%.*.*}`
    seqkit replace --f-by-name -p '16S_rRNA.*' -r $FILENAME $file -o $single_copy/$FILENAME.16Sclean.fasta;
done

Concatenate the single 16S rRNA copies:

cat $single_copy/*.16Sclean.fasta > $single_copy/All16S_singlecopy.fasta

Strains with multiple 16S rRNA copies

The command em_cons is used for Ubuntu and cons for MacOS after installing the EMBOSS package. Make consensus per genome:

for file in $multiplecopies/*.16S.fasta; do
    FILENAME=`basename ${file%.*.*}`
    em_cons $file -outseq $multiplecopies/$FILENAME.16Sconsensus.fasta;
done

Rename Headers for the 16sconsensus.fasta files:

for file in $multiplecopies/*.16Sconsensus.fasta; do
    FILENAME=`basename ${file%.*.*}`
    seqkit replace --f-by-name -p 'EMBOSS_001' -r $FILENAME $file -o $multiplecopies/$FILENAME.16Sconsensusclean.fasta;
done

Concatenate:

cat $multiplecopies/*.16Sconsensusclean.fasta > $multiplecopies/All16Sconsensus.fasta

Concatenate and align single copy 16S sequences with consensus sequences

# Set paths
path_o="data/16s_phylogenies/CoryneGenus_barrnap"
mkdir -p "$path_o/16S_CoryneGenus_full_concatenated_alignment"
path_x="$path_o/16S_CoryneGenus_full_concatenated_alignment"

# Concatenate the single and multiple copy .fasta files into one
cat $singlecopy/All16S_singlecopy.fasta $multiplecopies/All16Sconsensus.fasta > $path_x/All16S_concatenated.fasta

# Use muscle to align the sequences in the combined concatenated .fasta
muscle -in $path_x/All16S_concatenated.fasta -out $path_x/All16S_concatenated_and_aligned.fasta

Run iqtree2

cd /Users/klemonlab/iqtree-2.2.2.4-MacOSX

bin/iqtree2 -s "$path_o/16S_full_alignment/All16S_concatenated_and_aligned.fasta" --prefix Overall_Corye_NovCor_NO_MAGS -alrt 1000 -B 1000 -T 16 -m MFP

Overall Corynebacterium rpoB tree

cd IQTREE2
path_i="data/GET_HOMOLOGUES/rpoB/CoryneGenus_rpoB_cdnAln.fasta"


bin/iqtree2 -s "$path_i" --prefix CoryneGenus_rpoB -alrt 1000 -B 1000 -T 16

Use .treefile in iTol and Adobe illustrator

The 3 phylogenies in Figure S3 were generated similarly and the genome list is NovCor_FigS3_Phylogenies_GenomeList_v01.csv.

# Phylogenies {.unnumbered} ```{r} #| echo: FALSE #| message: FALSE ``` ## *Corynebacterium tuberculostearicum* Species Complex Conservative Core Phylogeny *Corynebacterium* genomes listed in `NovCor_CTSC_Tree_GenomeList.csv` with *Mycobacterium_tuberculosis*_H37Rv as an outgroup were used in the next 3 phylogenies. Clustered with BDBH, COGS, and OMCL: ```{bash, eval=FALSE} # conda activate get_homologues path_i="data/GET_HOMOLOGUES/Prokka_gbk_TreeCTSC_gbk" # Cluster with BDBH get_homologues.pl -d "$path_i" -n 16 -t 30 -C 90 # Cluster with COGS get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -G # Cluster with OMCL get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -M ``` From our experience you will run into less issues if you move the three cluster folders and cluster list files into a new directory we manually moved them to `data/GET_HOMOLOGUES/CTSC_Clusters`. Then use the paths to those cluster folders after the -d flag separated by commas. Create conservative core Venn diagram intersection to output .faa and .fna files: ```{bash, eval=FALSE} # conda activate get_homologues path_i="data/GET_HOMOLOGUES/CTSC_clusters" path_o="data/GET_HOMOLOGUES/CTSC_intersection" # Creates a conservative core with the output of faa files compare_clusters.pl -o "$path_o" \ -d "$path_i/CTSC_f0_30taxa_algBDBH_e0_C90_,$path_i/CTSC_f0_0taxa_algCOG_e0_C90_,$path_i/CTSC_f0_0taxa_algOMCL_e0_C90_" \ -t 30 # Creates a conservative core with the output of fna files from the additional `-n` flag compare_clusters.pl -o "$path_o" \ -d "$path_i/CTSC_f0_30taxa_algBDBH_e0_C90_,$path_i/CTSC_f0_0taxa_algCOG_e0_C90_,$path_i/CTSC_f0_0taxa_algOMCL_e0_C90_" \ -t 30 \ -n ``` We had a total of 309 conservative core gene clusters. At this point we have created `.faa` and `.fna` files of the core gene clusters from 30 *Corynebacterium* genomes. Next, to begin generating a phylogenomic tree we isolated only the .faa and .fna files into a new folder. Create a new folder and move all files with the extension .faa and .fna into a new destination folder: ```{bash, eval=FALSE} cd "data/GET_HOMOLOGUES/CTSC_intersection" mkdir -p "data/GET_HOMOLOGUES/CTSC_faa_fna" path_o="data/GET_HOMOLOGUES/CTSC_faa_fna" mv *.faa "$path_o" mv *.fna "$path_o" ``` Now we use GET_PHYLOMARKERS version 2.2.9.1 (Vinuesa, Ochoa-Sánchez, and Contreras-Moreira 2018) to generate a concatenated alignment file for each single copy core gene cluster. Concatenate and align the files inside `data/GET_HOMOLOGUES/CTSC_faa_fna` to generate codon alignment files: ```{bash, eval=FALSE} # Go to directory holding all the .fna and .faa files only cd "data/GET_HOMOLOGUES/CTSC_faa_fna" # Run the get_phylomarkers master script using the path from where it is installed on your system $path/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7 ``` GET_PHYLOMARKERS will generate codon alignment `.fasta` files which we moved to `data/GET_HOMOLOGUES/CTSC_aligned_concatenated_core_GCs`: ```{bash, eval=FALSE} cd "data/GET_HOMOLOGUES/CTSC_faa_fna/get_phylomarkers" mv *.fasta "data/GET_HOMOLOGUES/CTSC_aligned_concatenated_core_GCs" ``` Run IQ-TREE v2.2.2.6 (Minh et al. 2020) with the codon alignment folder containing codon `.fasta` files: ```{bash, eval=FALSE} # Go to where IQTREE2 is installed on your system cd $path/IQTREE2 # Run iqtree2 path_i="data/GET_HOMOLOGUES/CTSC_aligned_concatenated_core_GCs" bin/iqtree2 -p "$path_i" --prefix CTSC_30 -alrt 1000 -B 1000 -T 16 ``` The `-p` flag performs edge-linked proportional partition model (Chernomor, Haeseler, and Minh 2016) for each of the individual gene clusters. Fast model selection for each cluster was determined by ModelFinder (Kalyaanamoorthy et al. 2017). The flags `-alrt 1000` and `-B 1000` represent 1000 replicates of sH-aLRT and UFbootstraps. `-T 16` runs the program at 16 CPU threads. This will create a `.treefile` file with the best fit maximum likelyhood (ML) tree. This `.treefile` can be viewed and edited through iTol annotation editor (Letunic and Bork 2021) on Google Chrome browser. In iTol the tree can be scaled and assigned strain names from the `tree_labels.list` file created from GET_PHYLOMARKERS. After scaling and assigning strain names an `.svg` was exported from iTol and further edited in Adobe Illustrator. ## *Corynebacterium tuberculostearicum* Species 16S Phylogeny #### Run barrnap *Corynebacterium tuberculostearicum* species complex genomes plus *Mycobacterium_tuberculosis*_H37Rv ```{bash, eval=FALSE} # Make directories mkdir -p "data/genomes/CTSC_fasta" mkdir -p "data/genomes/CTSC_barrnap" # Set paths path_i="data/genomes/CTSC_fasta" path_o="data/genomes/CTSC_barrnap" # Run barrnap for file in $path_i/*.fa*; do FILENAME=`basename ${file%.*}` echo $FILENAME barrnap $file --outseq $path_o/$FILENAME.allrRNA.fasta --threads 16; done ``` #### Filter out no copies, single copy, and multiple 16S rRNA copies ```{bash, eval=FALSE} # Move strains with no 16s identified sequence into a folder cd "data/genomes/CTSC_barrnap" # Create folder to parse out different identified copy numbers mkdir -p "nocopies" mkdir -p "singlecopy" mkdir -p "multiplecopies" # Set paths path_o="data/genomes/CTSC_barrnap" nocopies="$path_o/nocopies" singlecopy="$path_o/singlecopy" multiplecopies="$path_o/multiplecopies" # Move strains with no 16s identified sequence into a folder for file in "$path_o"/*; do if [ -f "$file" ]; then count=$(grep -c "16S_rRNA" "$file") if [ "$count" -eq 0 ]; then echo "Extracting $file" mv "$file" "$nocopies" fi fi done # Move strains with only one 16s copy into a folder for file in "$path_o"/*; do if [ -f "$file" ]; then count=$(grep -c "16S_rRNA" "$file") if [ "$count" -eq 1 ]; then echo "Extracting $file" mv "$file" "$singlecopy" fi fi done # Move strains with two or more 16s copies into a folder for file in "$path_o"/*; do if [ -f "$file" ]; then count=$(grep -c "16S_rRNA" "$file") if [ "$count" -ge 2 ]; then echo "Extracting $file" mv "$file" "$multiplecopies" fi fi done ``` Strains that had NO identified 16S sequences are not used any further, in `data/16_phylogenies/CTSC_barrnap/nocopies`. There were no strains in the CTSC group that no copies. ## Single copy versus multiple copies instructions {.tabset .tabset-pills .tabset-fade} ### Strains with a single 16S rRNA sequence Subset the 16S rRNA sequences: ```{bash, eval=FALSE} for file in $singlecopy/*.allrRNA.fasta; do FILENAME=`basename ${file%.*.*}` seqkit grep -r -n -p '16S_rRNA*' $file -o $singlecopy/$FILENAME.16S.fasta; done ``` Rename Headers for the `16S.fasta` files: ```{bash, eval=FALSE} for file in $singlecopy/*.16S.fasta; do FILENAME=`basename ${file%.*.*}` seqkit replace --f-by-name -p '16S_rRNA.*' -r $FILENAME $file -o $single_copy/$FILENAME.16Sclean.fasta; done ``` Concatenate the single 16S rRNA copies: ```{bash, eval=FALSE} cat $singlecopy/*.16Sclean.fasta > $singlecopy/All16S_singlecopy.fasta ``` ### Strains with multiple 16S rRNA copies Subset the 16S rRNA sequences: ```{bash, eval=FALSE} for file in $multiplecopies/*.allrRNA.fasta; do FILENAME=`basename ${file%.*.*}` seqkit grep -r -n -p '16S_rRNA*' $file -o $multiplecopies/$FILENAME.16S.fasta; done ``` The command `em_cons` is used for Ubuntu and `cons` for MacOS after installing the EMBOSS package. Make consensus per genome: ```{bash, eval=FALSE} for file in $multiplecopies/*.16S.fasta; do FILENAME=`basename ${file%.*.*}` em_cons $file -outseq $multiplecopies/$FILENAME.16Sconsensus.fasta; done ``` Rename Headers for the `16sconsensus.fasta` files: ```{bash, eval=FALSE} for file in $multiplecopies/*.16Sconsensus.fasta; do FILENAME=`basename ${file%.*.*}` seqkit replace --f-by-name -p 'EMBOSS_001' -r $FILENAME $file -o $multiplecopies/$FILENAME.16Sconsensusclean.fasta; done ``` Concatenate: ```{bash, eval=FALSE} cat $multiplecopies/*.16Sconsensusclean.fasta > $multiplecopies/All16Sconsensus.fasta ``` ## {-} ### Concatenate and align single copy 16S sequences with consensus sequences ```{bash, eval=FALSE} # Set paths and make directory path_o="data/16s_phylogenies/CTSC_barrnap" mkdir -p "$path_o/16S_CTSC_full_concatenated_alignment" path_x="$path_o/16S_CTSC_full_concatenated_alignment" # Concatenate the single and multiple copy .fasta files into one cat $singlecopy/All16S_singlecopy.fasta $multiplecopies/All16Sconsensus.fasta > $path_x/All16S_concatenated.fasta # Use muscle to align the sequences in the combined concatenated .fasta muscle -in $path_x/All16S_concatenated.fasta -out $path_x/All16S_concatenated_and_aligned.fasta ``` ### Run IQTREE2 ```{bash, eval=FALSE} cd IQTREE2 bin/iqtree2 -s "$path_x/All16S_concatenated_and_aligned.fasta" --prefix 16s_Ctu_complex_test -alrt 1000 -B 1000 -T 16 -m MFP ``` ## *Corynebacterium tuberculostearicum* species complex genomes plus *Mycobacterium_tuberculosis*_H37Rv full length rpoB gene phylogeny We located and used the concatenated and aligned rpoB gene generated from GET_PHYLOMARKERS: ```{bash, eval=FALSE} cd IQTREE2 path_i="data/GET_HOMOLOGUES/rpoB/CTSC_rpoB_cdnAln.fasta" bin/iqtree2 -s "$path_i" --prefix CTSC_rpoB -alrt 1000 -B 1000 -T 16 ``` Use the `.treefile` in iTol and Adobe illustrator ## Overall *Corynebacterium* genus Conservative Core Phylogeny The genomes used for the next 3 phylogenies are listed in the file `NovCor_FigS1_Phylogenies_GenomeList.csv`. 72 *Corynebacterium* genomes clustered with BDBH, COGS, and OMCL: ```{bash, eval=FALSE} # conda activate get_homologues path_i="data/GET_HOMOLOGUES/Prokka_gbk_TreeCoryneGenus_gbk" # Cluster with BDBH get_homologues.pl -d "$path_i" -n 16 -t 72 -C 90 # Cluster with COGS get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -G # Cluster with OMCL get_homologues.pl -d "$path_i" -n 16 -t 0 -C 90 -M ``` Create conservative core Venn diagram intersection to output .faa and .fna files: ```{bash, eval=FALSE} # conda activate get_homologues path_i="data/GET_HOMOLOGUES/TreeCoryneGenus_clusters" path_o="data/GET_HOMOLOGUES/TreeCoryneGenus_intersection" # Creates a conservative core with the output of faa files compare_clusters.pl -o "$path_o" \ -d "$path_i/CoryneGenus_f0_72taxa_algBDBH_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algCOG_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algOMCL_e0_C90_" \ -t 72 # Creates a conservative core with the output of fna files from the additional `-n` flag compare_clusters.pl -o "$path_o" \ -d "$path_i/CoryneGenus_f0_72taxa_algBDBH_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algCOG_e0_C90_,$path_i/CoryneGenus_f0_0taxa_algOMCL_e0_C90_" \ -t 72 \ -n ``` We had a total of 193 conservative core gene clusters. Create a new folder and move all files with the extension .faa and .fna into a new destination folder: ```{bash, eval=FALSE} cd "data/GET_HOMOLOGUES/TreeCoryneGenus_intersection" mkdir -p "data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna" path_o="data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna" mv *.faa "$path_o" mv *.fna "$path_o" ``` Concatenate and align the files inside `data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna` to generate codon alignment files: ```{bash, eval=FALSE} # Go to directory holding all the .fna and .faa files only cd "data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna" # Run the get_phylomarkers master script using the path from where it is installed on your system $path/run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 0.7 -m 0.7 ``` Run IQ-TREE v2.2.2.6 (Minh et al. 2020) with the codon alignment folder containing codon `.fasta` files: ```{bash, eval=FALSE} # Go to where IQTREE2 is installed on your system cd $path/IQTREE2 # Run iqtree2 path_i="data/GET_HOMOLOGUES/TreeCoryneGenus_faa_fna/get_phylomarkers/condon_fastas bin/iqtree2 -p "$path_i" --prefix TreeCoryneGenus_72 -alrt 1000 -B 1000 -T 16 ``` This will create a `.treefile` file with the best fit maximum likelyhood (ML) tree. This `.treefile` can be viewed and edited through iTol annotation editor (Letunic and Bork 2021) on Google Chrome browser. In iTol the tree can be scaled and assigned strain names from the `tree_labels.list` file created from GET_PHYLOMARKERS. After scaling and assigning strain names an `.svg` was exported from iTol and further edited in Adobe Illustrator. ## Overall *Corynebacterium* genus 16s rRNA Phylogeny ### Run barrnap for all genomes ```{bash, eval=FALSE} # Set up paths path_i="data/16S_phylogenies/CoryneGenus_fasta" path_o="data/16S_phylogenies/CoryneGenus_barrnap" # Run barrnap for file in $path_i/*.fa*; do FILENAME=`basename ${file%.*}` echo $FILENAME barrnap $file --outseq $path_o/$FILENAME.allrRNA.fasta --threads 8; done ``` ### Subset the 16S rRNA sequences: ```{bash, eval=FALSE} # conda activate seqkit for file in $path_o/*.allrRNA.fasta; do FILENAME=`basename ${file%.*.*}` seqkit grep -r -n -p '16S_rRNA*' $file -o $path_o/$FILENAME.16S.fasta; done ``` ### Move into different folders for no copy, single copy, and multiple 16S rRNA copies ```{bash, eval=FALSE} # Move strains with no 16s identified sequence into a folder cd "data/genomes/CoryneGenus_barrnap" # Create folder to parse out different identified copy numbers mkdir -p "nocopies" mkdir -p "singlecopy" mkdir -p "multiplecopies" # Set paths path_o="data/genomes/CoryneGenus_barrnap" nocopies="$path_o/nocopies" singlecopy="$path_o/singlecopy" multiplecopies="$path_o/multiplecopies" # Move strains with no 16s identified sequence into a folder for file in "$path_o"/*; do if [ -f "$file" ]; then count=$(grep -c "16S_rRNA" "$file") if [ "$count" -eq 0 ]; then echo "Extracting $file" mv "$file" "$nocopies" fi fi done # Move strains with only one 16s copy into a folder for file in "$path_o"/*; do if [ -f "$file" ]; then count=$(grep -c "16S_rRNA" "$file") if [ "$count" -eq 1 ]; then echo "Extracting $file" mv "$file" "$singlecopy" fi fi done # Move strains with two or more 16s copies into a folder for file in "$path_o"/*; do if [ -f "$file" ]; then count=$(grep -c "16S_rRNA" "$file") if [ "$count" -ge 2 ]; then echo "Extracting $file" mv "$file" "$multiplecopies" fi fi done ``` Strains that had NO identified 16S sequences are not used any further, in `data/16_phylognies/CoryneGenus_barrnap/nocopies`. C_mastitidis_DSM_44356 had no identified 16S rRNA sequence and was excluded from the final 16S rRNA phylogeny. ## Instructions for single and multiple 16S rRNA copies {.tabset .tabset-pills .tabset-fade} ### Strains with a single 16S rRNA sequence Rename Headers for the `16S.fasta` files: ```{bash, eval=FALSE} path_o="data/16S_phylogenies/CoryneGenus_barrnap" single_copy="$path_o/single_copy" for file in $single_copy/*.16S.fasta; do FILENAME=`basename ${file%.*.*}` seqkit replace --f-by-name -p '16S_rRNA.*' -r $FILENAME $file -o $single_copy/$FILENAME.16Sclean.fasta; done ``` Concatenate the single 16S rRNA copies: ```{bash, eval=FALSE} cat $single_copy/*.16Sclean.fasta > $single_copy/All16S_singlecopy.fasta ``` ### Strains with multiple 16S rRNA copies The command `em_cons` is used for Ubuntu and `cons` for MacOS after installing the EMBOSS package. Make consensus per genome: ```{bash, eval=FALSE} for file in $multiplecopies/*.16S.fasta; do FILENAME=`basename ${file%.*.*}` em_cons $file -outseq $multiplecopies/$FILENAME.16Sconsensus.fasta; done ``` Rename Headers for the `16sconsensus.fasta` files: ```{bash, eval=FALSE} for file in $multiplecopies/*.16Sconsensus.fasta; do FILENAME=`basename ${file%.*.*}` seqkit replace --f-by-name -p 'EMBOSS_001' -r $FILENAME $file -o $multiplecopies/$FILENAME.16Sconsensusclean.fasta; done ``` Concatenate: ```{bash, eval=FALSE} cat $multiplecopies/*.16Sconsensusclean.fasta > $multiplecopies/All16Sconsensus.fasta ``` ## {-} ### Concatenate and align single copy 16S sequences with consensus sequences ```{bash, eval=FALSE} # Set paths path_o="data/16s_phylogenies/CoryneGenus_barrnap" mkdir -p "$path_o/16S_CoryneGenus_full_concatenated_alignment" path_x="$path_o/16S_CoryneGenus_full_concatenated_alignment" # Concatenate the single and multiple copy .fasta files into one cat $singlecopy/All16S_singlecopy.fasta $multiplecopies/All16Sconsensus.fasta > $path_x/All16S_concatenated.fasta # Use muscle to align the sequences in the combined concatenated .fasta muscle -in $path_x/All16S_concatenated.fasta -out $path_x/All16S_concatenated_and_aligned.fasta ``` ### Run iqtree2 ```{bash, eval=FALSE} cd /Users/klemonlab/iqtree-2.2.2.4-MacOSX bin/iqtree2 -s "$path_o/16S_full_alignment/All16S_concatenated_and_aligned.fasta" --prefix Overall_Corye_NovCor_NO_MAGS -alrt 1000 -B 1000 -T 16 -m MFP ``` ## Overall *Corynebacterium* rpoB tree ```{bash, eval=FALSE} cd IQTREE2 path_i="data/GET_HOMOLOGUES/rpoB/CoryneGenus_rpoB_cdnAln.fasta" bin/iqtree2 -s "$path_i" --prefix CoryneGenus_rpoB -alrt 1000 -B 1000 -T 16 ``` Use `.treefile` in iTol and Adobe illustrator The 3 phylogenies in **Figure S3** were generated similarly and the genome list is `NovCor_FigS3_Phylogenies_GenomeList_v01.csv.`