Supplemental Methods: Prokka Annotations

1 Custom Prokka Annotations

We used Prokka v1.14.6 (Seemann, 2014) to annotate the 107 Corynebacterium strain genomes described in Table S1-A and the 28 Dolosigranulum pigrum genomes from our previous manuscript (Flores Ramos et al., 2021). All 135 genomes are listed in CorPGA_AnnotationProkka_GenomeList_v01a.csv and included in data/genomes as .fasta files.

We annotated the genomes in two ways so that file formats and strain labels are compatible with both GET_HOMOLOGUES and anvi’o.

1.1 Prokka annotations for GET_HOMOLOGUES

This step annotates all the .fasta files in the input folder (path_i) and writes all the annotated output files to the output folder (path_o). Output file headers are updated with --genus ‘Corynebacterium’, --species ‘sp’ and a --strain value taken from the file name. We used default parameters, including gene recognition and translation initiation site identification with Prodigal (Hyatt et al., 2010).

#conda activate Prokka

path_i="data/genomes"
path_o="data/GET_HOMOLOGUES/Prokka_out"
mkdir -p "$path_o"

for file in "$path_i"/*.f*; do
    # Strain label = file name without directory and final extension
    FILENAME=$(basename "${file%.*}")
    prokka --prefix "$FILENAME" --outdir "$path_o" --genus 'Corynebacterium' --species 'sp' --strain "$FILENAME" --centre X --compliant --cpus 0 --force "$file"
done
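
The strain label used for --prefix and --strain is whatever remains of the file name after the directory and the final extension are removed. As a minimal sketch of that derivation (the helper name strain_label and the example paths are hypothetical, not part of the workflow):

```shell
# Derive the label that the loop above passes to --prefix and --strain.
# Only the last extension is removed, so dots inside the name are kept.
strain_label() {
    basename "${1%.*}"
}

# Hypothetical file names, for illustration only
strain_label "data/genomes/StrainA.fasta"     # prints StrainA
strain_label "data/genomes/StrainB.v2.fna"    # prints StrainB.v2
```

Because the label comes straight from the file name, any renaming of the .fasta files changes the strain labels in the Prokka output.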

1.2 Prokka annotation for anvi’o

More information about importing Prokka annotations into anvi’o can be found here: https://merenlab.org/2017/05/18/working-with-prokka/#note-for-the-pangenomics-workflow

1.2.1 Fasta reformatting

Before annotating with Prokka, we reformatted the .fasta files with anvi-script-reformat-fasta. The script writes .fa files with simplified deflines, and the --seq-type NT option prevents downstream errors about “characters that are not any of A, C, T, G, N, a, c, t, g, n.”

#conda activate anvio-dev

path_i="data/genomes"
path_o="data/Anvio8/Reformatted"
mkdir -p "$path_o"

for file in "$path_i"/*.f*; do
    FILENAME=$(basename "${file%.*}")
    anvi-script-reformat-fasta -o "$path_o/$FILENAME.fa" --min-len 0 --simplify-names "$file" --seq-type NT
done
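
To see which genomes contain the characters that trigger the error quoted above, a quick pre-check can count the affected sequence lines. This is a hedged sketch and an addition to the workflow: the helper name count_nonstandard_lines and the sample sequences are hypothetical.

```shell
# Count sequence lines (defline rows starting with ">" are skipped) that
# contain any character other than A, C, G, T, N in either case.
count_nonstandard_lines() {
    grep -v '^>' "$1" | grep -c '[^ACGTNacgtn]'
}

# Hypothetical two-line contig: the second sequence line carries R and Y
tmp=$(mktemp)
printf '>contig1\nACGTNNacgt\nACGRYTTA\n' > "$tmp"
count_nonstandard_lines "$tmp"   # prints 1
rm -f "$tmp"
```

Files saved with Windows line endings would report every sequence line, because the carriage return is also outside the allowed set.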

1.2.2 Prokka annotation

This step repeats the Prokka annotation using the anvi’o reformatted .fasta files.

Output file headers are updated with the --genus, --species and --strain values listed for each genome in the genome list .csv file.

#conda activate Prokka

csv_file="data/genome_lists/CorPGA_AnnotationProkka_GenomeList_v01a.csv"
path_i="data/Anvio8/Reformatted"
path_o="data/Anvio8/Prokka_out"
mkdir -p "$path_o"

while IFS=',' read -r name genus species; do
    if [[ "$name" != "name" ]]; then  # Skip the header
        prokka --prefix "$name" --outdir "$path_o" --genus "$genus" --species "$species" --strain "$name" --cpus 0 --force "$path_i/$name.fa"
    fi
done < "$csv_file"
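
Since the loop above builds each input path from the name column, a row without a matching reformatted file makes Prokka fail for that genome. The following sketch lists such rows before annotation starts. It assumes the same three-column layout (name,genus,species) that the loop reads; the helper name missing_genomes and the demo rows are hypothetical.

```shell
# Print the name of every CSV row that has no <name>.fa file in fasta_dir.
missing_genomes() (
    csv_file="$1"
    fasta_dir="$2"
    # tr drops carriage returns in case the CSV has Windows line endings;
    # the "|| [ -n ... ]" part keeps a last row that lacks a final newline.
    tr -d '\r' < "$csv_file" | while IFS=',' read -r name genus species || [ -n "$name" ]; do
        [ "$name" = "name" ] && continue   # skip the header row
        [ -n "$name" ] || continue         # skip blank lines
        [ -f "$fasta_dir/$name.fa" ] || echo "$name"
    done
)

# Demo with hypothetical rows: only StrainA has a reformatted file
demo=$(mktemp -d)
printf 'name,genus,species\nStrainA,Corynebacterium,sp\nStrainB,Dolosigranulum,pigrum\n' > "$demo/list.csv"
touch "$demo/StrainA.fa"
missing_genomes "$demo/list.csv" "$demo"   # prints StrainB
rm -rf "$demo"
```

In the workflow itself the call would be missing_genomes "$csv_file" "$path_i", and an empty result means every listed genome has an input file.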

1.2.3 Parsing .gff files

This step parses the Prokka-annotated genomes so that the external Prodigal gene calls and the functional annotations can be imported into anvi’o separately. The input (path_i) is the Prokka annotation in GFF3 format; the outputs (path_o) are two tab-delimited text files per genome, one for gene calls (calls_*.txt) and one for annotations (annot_*.txt).

We used the script gff_parser.py, described in the anvi’o tutorial linked in section 1.2.

#conda activate gffutils

path_i="data/Anvio8/Prokka_out"
path_o="data/Anvio8/Parsed_prokka"
mkdir -p "$path_o"

for file in "$path_i"/*.gff; do
    FILENAME=$(basename "${file%.*}")
    python scripts/gff_parser.py "$file" \
    --gene-calls "$path_o/calls_$FILENAME.txt" \
    --annotation "$path_o/annot_$FILENAME.txt"
done
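
The next two steps expect one calls_*.txt and one annot_*.txt file per genome. As a hedged sketch (an optional check, with the hypothetical helper name check_parsed_outputs), the parsed folder can be compared against the .gff inputs:

```shell
# For every .gff in gff_dir, report a calls_/annot_ file that is missing
# or empty in parsed_dir. Prints nothing when all outputs are present.
check_parsed_outputs() (
    gff_dir="$1"
    parsed_dir="$2"
    for gff in "$gff_dir"/*.gff; do
        [ -e "$gff" ] || continue   # the glob matched no .gff files
        name=$(basename "${gff%.*}")
        for prefix in calls annot; do
            [ -s "$parsed_dir/${prefix}_$name.txt" ] || echo "missing or empty: ${prefix}_$name.txt"
        done
    done
)

check_parsed_outputs "data/Anvio8/Prokka_out" "data/Anvio8/Parsed_prokka"
```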

1.2.4 Generating contigs databases

In this step the reformatted .fa files (path_i) and the external gene calls (calls_*.txt) from Prokka (path_e) are imported to generate anvi’o contigs databases (path_o). Initial runs stopped with many errors about internal stop codons, so we added the --ignore-internal-stop-codons flag.

#conda activate anvio-dev

path_i="data/Anvio8/Reformatted"
path_e="data/Anvio8/Parsed_prokka"
path_o="data/Anvio8/Contigs_db"
mkdir -p "$path_o"

for file in "$path_i"/*.fa; do
    FILENAME=$(basename "${file%.*}")
    anvi-gen-contigs-database -f "$file" \
                              -o "$path_o/$FILENAME.db" \
                              --external-gene-calls "$path_e/calls_$FILENAME.txt" \
                              --ignore-internal-stop-codons \
                              -n "$FILENAME"
done

1.2.5 Importing Prokka functional annotation

Finally, the external functional annotations (annot_*.txt) from Prokka (path_e) are imported into the anvi’o contigs databases (path_i).

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"
path_e="data/Anvio8/Parsed_prokka"

for file in "$path_i"/*.db; do
    FILENAME=$(basename "${file%.*}")
    anvi-import-functions -c "$file" \
                          -i "$path_e/annot_$FILENAME.txt"
done

2 NCBI annotations

The 9 genome assemblies listed in Table S3D (CorPGA_AnnotationNCBI_GenomeList_v01a.csv) and included in data/genomes_NCBIAnnotation were used as references for comparison in the KEGG metabolic analysis only. These genomes were not part of the phylogenetic/pangenomic analysis, and we wanted to keep their original NCBI annotations in the anvi’o contigs databases.

They were downloaded from NCBI in .gbff format and processed as follows:

2.1 Parsing .gbff files downloaded from NCBI

First, we used anvi-script-process-genbank to parse the .gbff files (path_i). For each genome the script writes (to path_o) a properly formatted .fa file plus two tab-delimited text files, one for gene calls (calls_*.txt) and one for annotations (annot_*.txt).

#conda activate anvio-dev

path_i="data/genomes_NCBIAnnotation"
path_o="data/Anvio8/Parsed_NCBI"
mkdir -p "$path_o"

for file in "$path_i"/*.gbff; do
    FILENAME=$(basename "${file%.*}")
    anvi-script-process-genbank -i "$file" \
                            --output-gene-calls "$path_o/calls_$FILENAME.txt" \
                            --output-functions "$path_o/annot_$FILENAME.txt" \
                            --output-fasta "$path_o/$FILENAME.fa" \
                            --annotation-source prodigal
done

2.2 Fixing annotation files

The annotation files parsed from the original NCBI annotations give the annotation source as “prodigal”, whereas the files for the genomes that we annotated with Prokka give it as “Prodigal”. Because anvi’o cannot later cluster together genomes with different annotation sources, the name must be consistent. Here we iterate through all *.txt files in the folder and replace “prodigal” with “Prodigal”.

path_i="data/Anvio8/Parsed_NCBI"

for file in "$path_i"/*.txt; do
    if [ -f "$file" ]; then
        # Create a temporary file for the updated content
        tmp_file=$(mktemp)

        # Perform the text replacement and save it to the temporary file
        sed 's/prodigal/Prodigal/g' "$file" > "$tmp_file"

        # Replace the original file with the temporary file
        mv "$tmp_file" "$file"

        echo "Text replacement completed for $file"
    fi
done
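
A quick way to confirm the replacement is to count the lines that still carry the lowercase name. This is a hedged sketch with a hypothetical helper name (remaining_lowercase) and made-up sample rows; real parsed files have more columns.

```shell
# Count lines that still contain the lowercase source name.
remaining_lowercase() {
    grep -c 'prodigal' "$1"
}

sample=$(mktemp)
# Hypothetical tab-delimited rows, for illustration only
printf '1\tcontig_1\tprodigal\n2\tcontig_1\tprodigal\n' > "$sample"
remaining_lowercase "$sample"   # prints 2
sed 's/prodigal/Prodigal/g' "$sample" > "$sample.new" && mv "$sample.new" "$sample"
remaining_lowercase "$sample"   # prints 0
rm -f "$sample" "$sample.new"
```

Note that the sed expression is not restricted to the source column, so the word “prodigal” would also be capitalized if it appeared anywhere else in a file, for example in a product description.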

2.3 Generating contigs databases

In this step the .fa files and the gene calls parsed from the NCBI annotations (path_i) are imported to generate anvi’o contigs databases (path_o).

#conda activate anvio-dev

path_i="data/Anvio8/Parsed_NCBI"
path_o="data/Anvio8/Contigs_db"
mkdir -p "$path_o"

for file in "$path_i"/*.fa; do
    FILENAME=$(basename "${file%.*}")
    anvi-gen-contigs-database -f "$file" \
                              -o "$path_o/$FILENAME.db" \
                              --external-gene-calls "$path_i/calls_$FILENAME.txt" \
                              -n "$FILENAME"
done

2.4 Importing NCBI functional annotation

Then the external NCBI annotations (path_e) are imported into the anvi’o contigs databases (path_i).

#conda activate anvio-dev

path_i="data/Anvio8/Contigs_db"
path_e="data/Anvio8/Parsed_NCBI"

for file in "$path_i"/*.db; do
    FILENAME=$(basename "${file%.*}")
    anvi-import-functions -c "$file" \
                          -i "$path_e/annot_$FILENAME.txt"
done

References

Flores Ramos, S., Brugger, S. D., Escapa, I. F., Skeete, C. A., Cotton, S. L., Eslami, S. M., Gao, W., Bomar, L., Tran, T. H., Jones, D. S., et al. (2021). Genomic Stability and Genetic Defense Systems in Dolosigranulum pigrum, a Candidate Beneficial Bacterium from the Human Microbiome. mSystems 6.

Hyatt, D., Chen, G.-L., LoCascio, P. F., Land, M. L., Larimer, F. W. and Hauser, L. J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11.

Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069.