Description of Corynebacterium nasorum sp. nov. and Corynebacterium hallucis sp. nov. isolated from human nasal passages and skin

Prokka Annotations

Set up paths

#This will open the .Renviron file for the project. You need to add a DATA_PATH environment variable with the path to your data folder.
#usethis::edit_r_environ(scope = "project") 

# This will create and open the .bash_env file for editing. You need to add a DATA_PATH environment variable with the path to your data folder.
#file.create(".bash_env")
#file.edit(".bash_env") 

Custom Prokka Annotations

We annotated the genomes with Prokka v1.14.6 (https://github.com/tseemann/prokka; Seemann, 2014) in two different ways to ensure compatibility and correct strain labeling with both GET_HOMOLOGUES and anvi’o v9.

Prokka annotations for GET_HOMOLOGUES (NEEDS UPDATE!!!)

This step annotates every .fasta file in the selected input folder (path_i) and writes the annotated output files to the output folder (path_o). Output file headers are updated with --genus ‘Corynebacterium’, --species ‘sp’, and a --strain derived from the file name. We used default parameters, including gene recognition and translation initiation site identification with Prodigal (Hyatt et al., 2010).

conda deactivate
conda activate Prokka

path_i="data/genomes/fasta_51_MainFigs"
path_o="data/phylogenies/51_MainFigs/annot_prokka_51"

#path_i="data/genomes/fasta_72_SuppFigure" #Run for both folders
#path_o="data/phylogenies/72_SuppFigure/annot_prokka_72"

mkdir -p "$path_o"

for file in "$path_i"/*.f*; do
    FILENAME=$(basename "${file%.*}")
    prokka --prefix "$FILENAME" --outdir "$path_o" --strain "$FILENAME" --centre X --compliant --cpus 0 --force "$file"
done
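
The strain label passed to --prefix and --strain comes from stripping the last extension and the directory from each file path. A minimal sketch of that derivation follows; the helper name strain_from_path and the example file names are made up for illustration.

```shell
# Hypothetical helper mirroring the FILENAME assignment in the loop above:
# drop the last extension, then drop the directory part.
strain_from_path() {
    basename "${1%.*}"
}

# Made-up example paths, for illustration only.
strain_from_path "data/genomes/fasta_51_MainFigs/StrainA.fasta"
strain_from_path "data/genomes/fasta_51_MainFigs/StrainB.v2.fna"
```

Only the last extension is removed, so a file named StrainB.v2.fna would be labeled StrainB.v2.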

Prokka annotation for anvi’o

We selected 30 Corynebacterium strains for analysis of KEGG metabolic capabilities using anvi’o (Eren et al., 2020). These genomes are listed in genome_list_30_CtuComplex.csv.

We first source the .bash_env file to set up the environment variables.

source .bash_env
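
Every later step builds its paths from DATA_PATH, so it can help to confirm the variable right after sourcing. The sketch below assumes .bash_env contains a line of the form export DATA_PATH="/path/to/your/data"; the helper name check_data_path is hypothetical.

```shell
# Hypothetical helper: confirm that DATA_PATH is set and points to a directory.
# Assumes .bash_env defines it, e.g.  export DATA_PATH="/path/to/your/data"
check_data_path() {
    if [ -z "${DATA_PATH:-}" ]; then
        echo "DATA_PATH is not set; define it in .bash_env" >&2
        return 1
    fi
    if [ ! -d "$DATA_PATH" ]; then
        echo "DATA_PATH is not a directory: $DATA_PATH" >&2
        return 1
    fi
    echo "Using data folder: $DATA_PATH"
}
```

Running check_data_path before the loops below would stop a run early instead of letting it write to paths that begin with an empty string.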

Fasta reformatting

Before annotating with Prokka, we reformat the .fasta files with anvi-script-reformat-fasta. The script writes .fasta files with simplified deflines and, with --seq-type NT, prevents downstream errors about “characters that are not any of A, C, T, G, N, a, c, t, g, n.”

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/genomes/fasta_30_CtuComplex"
path_o="$DATA_PATH/anvio-9/reformatted"

mkdir -p "$path_o"

for file in "$path_i"/*.f*; do
    FILENAME=$(basename "${file%.*}")
    anvi-script-reformat-fasta -o "$path_o/$FILENAME.fa" --min-len 0 --simplify-names "$file" --seq-type NT
done
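
To see which input files contain characters other than A, C, T, G, and N before reformatting, a quick count can be made with standard tools. This is a sketch of an optional pre-check, not part of anvi’o; the helper name count_noncanonical is hypothetical.

```shell
# Hypothetical helper: count sequence characters other than A, C, G, T, N
# (either case) in a FASTA file. Header lines starting with ">" are skipped.
count_noncanonical() {
    grep -v '^>' "$1" | tr -d 'ACGTNacgtn\n\r' | wc -c | tr -d ' '
}

# Possible use over the same input folder (assumes path_i is set as above):
# for file in "$path_i"/*.f*; do
#     printf '%s\t%s\n' "$file" "$(count_noncanonical "$file")"
# done
```

A count above zero marks a file with IUPAC ambiguity codes or other symbols of the kind the quoted error message refers to.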

Prokka annotation

This step repeats the Prokka annotation using the anvi’o-reformatted .fasta files.

Output file headers are updated with --genus, --species, and --strain, taken from the genome list .csv file.

conda deactivate
conda activate Prokka

csv_file="$DATA_PATH/genome_lists/genome_list_30_CtuComplex.csv"
path_i="$DATA_PATH/anvio-9/reformatted"
path_o="$DATA_PATH/anvio-9/prokka_out"
mkdir -p "$path_o"

while IFS=',' read -r name genus species; do
    if [[ "$name" != "name" ]]; then  # Skip the header
        prokka --prefix "$name" --outdir "$path_o" --genus "$genus" --species "$species" --strain "$name" --cpus 0 --force "$path_i/$name.fa"
    fi
done < "$csv_file"
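
Before launching Prokka on all genomes, the values read from the genome list can be previewed. The sketch below assumes a three-column CSV whose header row starts with name (the header check in the loop above implies this; the other two column names are an assumption). The helper name list_prokka_args is hypothetical.

```shell
# Hypothetical dry-run helper: print the name, genus, and species that the
# loop above would pass to Prokka, one genome per line, without running Prokka.
# Assumed CSV layout:
#   name,genus,species
#   <strain name>,<genus>,<species>
list_prokka_args() {
    # tr removes Windows carriage returns; the "|| [ -n ... ]" part keeps a
    # final line that has no trailing newline.
    tr -d '\r' < "$1" | while IFS=',' read -r name genus species || [ -n "$name" ]; do
        if [ -n "$name" ] && [ "$name" != "name" ]; then   # skip blanks and the header
            printf '%s\t%s\t%s\n' "$name" "$genus" "$species"
        fi
    done
}
```

Calling list_prokka_args "$csv_file" prints one tab-separated line per genome, which makes stray carriage returns or shifted columns easy to spot.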

Parsing .gff files

This step parses the Prokka-annotated genomes so that the external Prodigal gene calls and the functional annotations can be imported into anvi’o v9 separately. The input (path_i) is the Prokka annotation in GFF3 format; the outputs (path_o) are two tab-delimited text files per genome, one for gene calls (calls_*.txt) and one for annotations (annot_*.txt).

This is done with the script gff_parser.py, described in the anvi’o tutorial on working with Prokka (https://merenlab.org/2017/05/18/working-with-prokka/).

conda deactivate
conda activate gffutils

path_i="$DATA_PATH/anvio-9/prokka_out"
path_o="$DATA_PATH/anvio-9/parsed_prokka"
mkdir -p "$path_o"

for file in "$path_i"/*.gff; do
    FILENAME=$(basename "${file%.*}")
    python scripts/gff_parser.py "$file" \
    --gene-calls "$path_o/calls_$FILENAME.txt" \
    --annotation "$path_o/annot_$FILENAME.txt"
done
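
Since the next two steps expect one calls_*.txt and one annot_*.txt file per genome, a quick check for missing or empty outputs can catch a failed parse. This is a sketch under the file-naming scheme used above; the helper name missing_parsed_outputs is hypothetical.

```shell
# Hypothetical helper: list parsed output files that are missing or empty.
# Arguments: the Prokka output folder and the parsed-output folder.
missing_parsed_outputs() {
    for gff in "$1"/*.gff; do
        [ -e "$gff" ] || continue            # no .gff files matched the glob
        name=$(basename "${gff%.*}")
        for prefix in calls annot; do
            if [ ! -s "$2/${prefix}_${name}.txt" ]; then
                echo "${prefix}_${name}.txt"
            fi
        done
    done
}
```

Calling missing_parsed_outputs "$path_i" "$path_o" after the loop prints nothing when every genome has both files.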

Generating contigs databases

In this step, the reformatted .fa files (path_i) and the external gene calls (calls_*.txt) from Prokka (path_e) are used to generate the anvi’o contigs databases (path_o). Initial runs produced many errors about internal stop codons, so we added the --ignore-internal-stop-codons flag.

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/reformatted"
path_e="$DATA_PATH/anvio-9/parsed_prokka"
path_o="$DATA_PATH/anvio-9/contigs_db"
mkdir -p "$path_o"

for file in "$path_i"/*.fa; do
    FILENAME=$(basename "${file%.*}")
    anvi-gen-contigs-database -f "$file" \
                              -o "$path_o/$FILENAME.db" \
                              --external-gene-calls "$path_e/calls_$FILENAME.txt" \
                              --ignore-internal-stop-codons \
                              -n "$FILENAME"
done
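
The external gene calls refer to contigs by name, so the names in calls_*.txt should match the deflines of the reformatted .fa files. The sketch below lists any contig names that do not; it assumes a tab-delimited calls file with a header row and the contig name in the second column, and the helper name contigs_missing_from_fasta is hypothetical.

```shell
# Hypothetical helper: print contig names that appear in a gene-calls file
# but not among the deflines of the FASTA file.
# Arguments: the FASTA file, then the calls file.
# Assumes a header row and the contig name in column 2 of the calls file.
contigs_missing_from_fasta() {
    awk -F '\t' '
        NR == FNR {                       # first file: the FASTA
            if (/^>/) {
                sub(/^>/, "")             # drop the ">" marker
                sub(/[ \t].*$/, "")       # keep only the first word
                seen[$0] = 1
            }
            next
        }
        FNR > 1 && !($2 in seen) && !reported[$2]++ { print $2 }
    ' "$1" "$2"
}
```

An empty result for contigs_missing_from_fasta "$file" "$path_e/calls_$FILENAME.txt" suggests the calls file and the FASTA describe the same set of contigs.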

Importing Prokka functional annotation

Finally, the external functional annotations (annot_*.txt) from Prokka (path_e) are imported into the anvi’o contigs databases (path_i).

conda deactivate
conda activate anvio-9

path_i="$DATA_PATH/anvio-9/contigs_db"
path_e="$DATA_PATH/anvio-9/parsed_prokka"

for file in "$path_i"/*.db; do
    FILENAME=$(basename "${file%.*}")
    anvi-import-functions -c "$file" \
                          -i "$path_e/annot_$FILENAME.txt"
done

References

Eren, A. M., Kiefl, E., Shaiber, A., Veseli, I., Miller, S. E., Schechter, M. S., Fink, I., Pan, J. N., Yousef, M., Fogarty, E. C., et al. (2020). Community-led, integrated, reproducible multi-omics with anvi’o. Nature Microbiology 6, 3–6.
Hyatt, D., Chen, G.-L., LoCascio, P. F., Land, M. L., Larimer, F. W. and Hauser, L. J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119.
Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069.