AI bioinformatics : converting shell scripts to snakemake/nextflow workflows
During my teaching / consulting work with scientists, one of the most common situations I find myself in is looking over somebody’s shoulder at a bash script and saying, “you know, this should really be implemented in a proper workflow engine.” By which I generally mean snakemake or nextflow.
Continuing in our series of investigations about good uses of agentic AI in bioinformatics, today we’re going to try using various agents to convert shell scripts to proper workflow templates automatically. This should be a good use case for AI in bioinformatics for a number of reasons.
One is that coding agents tend to be very good at porting code from one language to another, which is kind of what’s going on here. Perhaps not between languages, but it certainly involves transforming one type of structured data into another, which is exactly the kind of thing that large language models have proven themselves to be good at.
Secondly, there’s a nice objective test criteria that we can use to see if the agent has done a good job. We simply rerun the same workflow in our workflow engine, compare the outputs to the originals in bash, and we’ll very quickly get some feedback as to whether it’s been correctly re-implemented or not.
These two points combined mean that I think we should be less worried about inaccuracies and hallucinations when using AI for this kind of task since the agent will have an objective starting point, and we have an easy way to identify any inaccuracies.
A simple linear script
We will start with a pretty straightforward challenge. Here’s a very typical bash script from a bacterial genomics pipeline. The original bash script comes from the bacterial-upstream-analysis workflow and represents the read-mapping stage of a bacterial genomics pipeline. Its purpose is to take paired, trimmed sequencing reads, index a reference genome with BWA and samtools, align the reads to that reference with bwa mem, and convert the resulting SAM file into sorted and indexed BAM files for downstream analysis such as variant calling.
bash
Here’s the starting script in full:
#!/bin/bash
set -euo pipefail
# Activate your conda environment for mapping sequence
CONDA_SH="${CONDA_SH:-$HOME/miniconda3/etc/profile.d/conda.sh}"
if [ ! -f "$CONDA_SH" ]; then
CONDA_SH="$HOME/miniforge3/etc/profile.d/conda.sh"
fi
source "$CONDA_SH"
conda activate bwa-sam
# define sra_id of your read
SRA_ID=$(head -n 1 data/accession.txt)
#define necessery path
REF="data/reference/reference.fasta"
READ_F="results/trimming/${SRA_ID}_1_paired.fastq"
READ_R="results/trimming/${SRA_ID}_2_paired.fastq"
OUTDIR="results/mapping/"
if [ ! -s "$REF" ]; then
echo "Reference genome not found: $REF"
exit 1
fi
#create the output directory
mkdir -p $OUTDIR
#index the refrence sequence
bwa index $REF
# Generate .fai index file for the reference
samtools faidx $REF
#map the sample against reference
bwa mem $REF $READ_F $READ_R > $OUTDIR/${SRA_ID}.sam
#converting .sam to .bam
samtools view -@ 20 -S -b $OUTDIR/${SRA_ID}.sam > $OUTDIR/${SRA_ID}.bam
#sort the bam file upon map(consider ram )
samtools sort -@ 2 -m 500M -o $OUTDIR/${SRA_ID}.sorted.bam $OUTDIR/${SRA_ID}.bam
#index the sorted bam file
samtools index $OUTDIR/${SRA_ID}.sorted.bam
echo "mapping complete , check out - $OUTDIR/${SRA_ID}.sorted.bam"It makes a fair few assumptions about the structure of our dataset including folder names etc. but is otherwise pretty linear. The only two bits of logic are the two tests to check if we are in the right conda environment and if the correct reference genome exists.
Converting to snakemake
After prompting codex cli (GPT5.5-medium) with a fairly terse prompt to convert this to snakemake, we are left with the following snakemake file:
import os
from pathlib import Path
from snakemake.shell import shell
CONDA_SH = os.environ.get("CONDA_SH")
if not CONDA_SH:
for candidate in (
Path.home() / "miniconda3/etc/profile.d/conda.sh",
Path.home() / "miniforge3/etc/profile.d/conda.sh",
):
if candidate.exists():
CONDA_SH = str(candidate)
break
shell.executable("/bin/bash")
shell.prefix(
f"source {CONDA_SH} && conda activate bwa-sam && "
)
SRA_ID = open("data/accession.txt").read().strip()
REF = "data/reference/reference.fasta"
READ_F = f"results/trimming/{SRA_ID}_1_paired.fastq"
READ_R = f"results/trimming/{SRA_ID}_2_paired.fastq"
OUTDIR = "results/mapping_snakemake"
rule all:
input:
f"{OUTDIR}/{SRA_ID}.sam",
f"{OUTDIR}/{SRA_ID}.bam",
f"{OUTDIR}/{SRA_ID}.sorted.bam",
f"{OUTDIR}/{SRA_ID}.sorted.bam.bai",
rule index_reference:
input:
ref=REF,
output:
amb=f"{REF}.amb",
ann=f"{REF}.ann",
bwt=f"{REF}.bwt",
pac=f"{REF}.pac",
sa=f"{REF}.sa",
fai=f"{REF}.fai",
shell:
"""
bwa index {input.ref}
samtools faidx {input.ref}
"""
rule map_reads:
input:
ref=REF,
read_f=READ_F,
read_r=READ_R,
indexes=rules.index_reference.output,
output:
sam=f"{OUTDIR}/{SRA_ID}.sam",
shell:
"bwa mem {input.ref} {input.read_f} {input.read_r} > {output.sam}"
rule sam_to_bam:
input:
sam=f"{OUTDIR}/{SRA_ID}.sam",
output:
bam=f"{OUTDIR}/{SRA_ID}.bam",
threads: 20
shell:
"samtools view -@ {threads} -S -b {input.sam} > {output.bam}"
rule sort_bam:
input:
bam=f"{OUTDIR}/{SRA_ID}.bam",
output:
sorted_bam=f"{OUTDIR}/{SRA_ID}.sorted.bam",
threads: 2
resources:
mem_mb=1000,
shell:
"samtools sort -@ {threads} -m 500M -o {output.sorted_bam} {input.bam}"
rule index_bam:
input:
sorted_bam=f"{OUTDIR}/{SRA_ID}.sorted.bam",
output:
bai=f"{OUTDIR}/{SRA_ID}.sorted.bam.bai",
shell:
"samtools index {input.sorted_bam}"Unsurprisingly, the file is quite a bit longer since the declarative style of Snakemake generally occupies more space than a bash script. But the steps are much more nicely laid out as rules, and it even handles very nicely the shell commands to get us into the right conda environment in the first place.
Similar to the bash script, there are variables to store the various locations for files that we need, but I think most people would agree they’ll be easier to find and edit in the Snakefile. Whether you find this easier or harder to read than the bash, of course, depends on what you’re used to.
Converting to nextflow
Next we’ll try a very similar prompt but asking the AI for a NextFlow workflow. I think it’s fair to say that between the two of them, Snakemake and NextFlow encompass most of the workflow engines that are in practical use for scientific research.
Take a look at the NextFlow file:
nextflow.enable.dsl = 2
params.sra_id = file('data/accession.txt').text.trim()
params.ref = 'data/reference/reference.fasta'
params.read_f = "results/trimming/${params.sra_id}_1_paired.fastq"
params.read_r = "results/trimming/${params.sra_id}_2_paired.fastq"
process mapping {
tag params.sra_id
cpus 8
publishDir 'results/mapping_nextflow', mode: 'copy'
output:
path "${params.sra_id}.sam"
path "${params.sra_id}.bam"
path "${params.sra_id}.sorted.bam"
path "${params.sra_id}.sorted.bam.bai"
script:
def condaInit = System.getenv('CONDA_SH') ?: "${System.getenv('HOME')}/miniconda3/etc/profile.d/conda.sh"
"""
set -euo pipefail
CONDA_SH="${condaInit}"
if [ ! -f "\$CONDA_SH" ]; then
CONDA_SH="\$HOME/miniforge3/etc/profile.d/conda.sh"
fi
source "\$CONDA_SH"
conda activate bwa-sam
mkdir -p data/reference results/trimming
ln -sf "${projectDir}/${params.ref}" "${params.ref}"
ln -sf "${projectDir}/${params.read_f}" "${params.read_f}"
ln -sf "${projectDir}/${params.read_r}" "${params.read_r}"
if [ ! -s "${params.ref}" ]; then
echo "Reference genome not found: ${params.ref}"
exit 1
fi
bwa index "${params.ref}"
samtools faidx "${params.ref}"
bwa mem "${params.ref}" "${params.read_f}" "${params.read_r}" > "${params.sra_id}.sam"
samtools view -@ 20 -S -b "${params.sra_id}.sam" > "${params.sra_id}.bam"
samtools sort -@ 2 -m 500M -o "${params.sra_id}.sorted.bam" "${params.sra_id}.bam"
samtools index "${params.sra_id}.sorted.bam"
"""
}
workflow {
mapping()
}As a side effect of this little project, comparing the NextFlow and SnakeMake versions of this workflow is actually quite instructive to see how the two workflow engines handle things differently and which things are similar.
The NextFlow file resembles the bash script a bit more in the sense that it’s got the command lines written more similar to how they would be run on the terminal, with less metadata and decoration as you see in the snake link. But the overall structure is pretty similar. We’ve got various different declarations of paths and file names at the start. We’ve got a little bit of logic to handle the environment. And then we’ve got the actual commands run in a linear fashion.
Execution
As expected for such a simple starting script, both of the workflow engines execute the workflow with no issues and produce exactly the same output as the original bash script. So definitely a good easy win for any project that currently has things handled by ad hoc bash scripts.
In a future article we’ll try the same technique on something with a bit more complicated structure. I keep referring to this example as linear because it has very little logic involved and nothing in the way of loops. Obviously many real-world scripts will have lots more of those two things along with various complications to do with parallelism and so on.