CodonTransformer: a multispecies codon optimizer using context-aware neural networks

Data

A total of 1,001,197 DNA sequences were collected from NCBI resources from 164 organisms, including human (Homo sapiens), thale cress (A. thaliana), C. elegans, C. reinhardtii and its chloroplast, zebrafish (D. rerio), fruit fly (D. melanogaster), house mouse (M. musculus), tobacco (N. tabacum) and its chloroplast, P. putida, and baker’s yeast (S. cerevisiae) from Eukaryotes, along with all species of the Enterobacterales order, such as Escherichia coli, and selected species of Archaea, such as T. barophilus and Sulfolobus solfataricus. Depending on the organism, these DNA sequences came in gene or CDS format and were translated to protein sequences using NCBI Codon Tables52. During preprocessing, only DNA sequences with a length divisible by three, starting with a start codon, ending with a stop codon, and containing a single stop codon were retained. We made this dataset available at https://zenodo.org/records/12509224.

Model input

The input for our Transformer model is a tokenized sequence created using both the DNA and protein sequences. Representing the DNA as a sequence of codons and the protein as a sequence of amino acids allows us to define tokens that integrate both the codon and the amino acid. For example, a hypothetical protein sequence of “M A L W _”, where “_” represents the end of the sequence, and the corresponding DNA sequence of “ATG GCC CTG TGG TAA” are tokenized as a sequence of 5 tokens: “[M_ATG] [A_GCC] [L_CTG] [W_TGG] [__TAA]”.
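As an illustration, this pairing can be written as a minimal Python sketch (the helper name tokenize is ours, not the released package API):

```python
# Pair each amino acid with its codon into one [aminoacid_codon] token.
# Illustrative sketch only; names are not the released CodonTransformer API.
def tokenize(protein: str, dna: str) -> list[str]:
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    assert len(codons) == len(protein), "protein and DNA must be aligned"
    return [f"[{aa}_{codon}]" for aa, codon in zip(protein, codons)]

print(tokenize("MALW_", "ATGGCCCTGTGGTAA"))
# ['[M_ATG]', '[A_GCC]', '[L_CTG]', '[W_TGG]', '[__TAA]']
```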

During Masked Language Modeling (MLM), a given token with a known codon and amino acid is changed to an alternative token with a known amino acid but an unknown codon. For example, masking [A_GCC] yields the token [A_UNK]. During inference, since only the protein sequence is known, all input tokens have the form [aminoacid_UNK]. In both scenarios, the model’s objective is to predict the correct [aminoacid_codon] token for a given [aminoacid_UNK] token.

This approach introduces a training scheme we call STREAM (Shared Token Representation and Encoding with Aligned Multi-masking). Unlike traditional MLM methods that employ a single generic mask token, STREAM leverages the inherent alignment between DNA and protein sequences to create multiple, specialized mask tokens. Using tokens of the form [aminoacid_UNK], we maintain partial information during masking, allowing the model to simultaneously learn a shared representation of both DNA and protein sequences. This multi-masking strategy, combined with the natural alignment of the data, enables a more nuanced and context-aware learning process.
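A minimal sketch of this amino-acid-aware masking (our illustration, assuming the bracketed token format above):

```python
# Replace a known [aminoacid_codon] token with the matching [aminoacid_UNK]
# token, hiding the codon while preserving the amino acid identity.
def mask_token(token: str) -> str:
    amino_acid, _codon = token.strip("[]").rsplit("_", 1)
    return f"[{amino_acid}_UNK]"

print(mask_token("[A_GCC]"))  # [A_UNK]
```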

As a result, our vocabulary includes 64 tokens with both a known codon and a known amino acid, 21 tokens with an unknown codon and a known amino acid, and 5 special tokens: the general unknown token ([UNK]), the start-of-sequence token ([CLS]), the end-of-sequence token ([SEP]), the padding token ([PAD]), and a general masking token ([MASK]). This brings the total vocabulary size to 90 tokens.

The final input to the model is a truncated and padded sequence with a maximum length of 2048 tokens that starts with a [CLS] token, includes the sequence tokens, ends with a [SEP] token, and is followed by several [PAD] tokens. A simple example could be “[CLS] [M_ATG] [A_GCC] [L_CTG] [W_TGG] [__TAA] [SEP] [PAD] … [PAD]”. In addition to the input sequence, the model also receives a taxonomy ID, a unique number identifying the target organism. This organism ID is then used as the token type ID of every token in the sequence.
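Putting these pieces together, the final input could be assembled as follows (a sketch under the assumptions above; build_input and MAX_LEN are our names):

```python
MAX_LEN = 2048  # maximum sequence length reported in the text

def build_input(tokens: list[str], organism_id: int):
    # Wrap the token sequence in [CLS]/[SEP] and pad to the maximum length.
    seq = ["[CLS]"] + tokens[: MAX_LEN - 2] + ["[SEP]"]
    seq += ["[PAD]"] * (MAX_LEN - len(seq))
    # Every position carries the organism ID as its token type ID.
    token_type_ids = [organism_id] * MAX_LEN
    return seq, token_type_ids
```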

Embedding

The model learns an embedding vector for each token in the vocabulary, as well as for each organism ID, which is provided as a token type ID. During training, these token type embeddings are added to the learned embeddings for each token in the sequence53 to integrate organism-specific information into the token representations. Additionally, the model incorporates positional embeddings that represent the absolute position of each token in the sequence, enabling the model to capture and leverage sequential dependencies within the input data.
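In code, this corresponds to a BERT-style embedding layer that sums the three embeddings; the following PyTorch sketch is our reconstruction (dimensions follow the text, and mapping the 164 organisms to consecutive token type indices is our assumption):

```python
import torch
import torch.nn as nn

class CodonEmbedding(nn.Module):
    def __init__(self, vocab_size=90, num_organisms=164, hidden=768, max_len=2048):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)        # token embeddings
        self.organism = nn.Embedding(num_organisms, hidden)  # token type (organism) embeddings
        self.position = nn.Embedding(max_len, hidden)        # absolute position embeddings

    def forward(self, input_ids, token_type_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        # Sum of token, organism, and positional embeddings, as described above.
        return (self.token(input_ids)
                + self.organism(token_type_ids)
                + self.position(positions))
```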

Model structure

Our model architecture is based on BigBird29, a variant of the BERT Transformer30, with a block sparse attention mechanism, enabling it to efficiently process much longer sequences than a standard BERT Transformer. Hence, CodonTransformer can optimize the DNA sequence for a protein sequence with a maximum length of 2048 tokens. CodonTransformer has 12 hidden layers, 12 attention heads, a hidden size of 768, an intermediate size of 3072, and an attention type of block sparse with a block size of 64, bringing the total number of parameters to 89.6 million. The model is curated using the open-source Transformers package from Hugging Face54.
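The configuration can be reproduced with the Transformers package roughly as follows (a sketch, not the released training code; type_vocab_size=164 for the organism token type IDs is our assumption):

```python
from transformers import BigBirdConfig, BigBirdForMaskedLM

config = BigBirdConfig(
    vocab_size=90,                   # combined [aminoacid_codon] vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=2048,
    attention_type="block_sparse",
    block_size=64,
    type_vocab_size=164,             # one token type ID per organism (assumption)
)
model = BigBirdForMaskedLM(config)
```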

Training

There are two stages in model training: pretraining and fine-tuning. The goal of pretraining is to teach the model what a general input sequence looks like, while fine-tuning adapts the model to predict DNA sequences that are highly optimized for the target organism.

Both stages share the same MLM objective, in which 15% of tokens are randomly selected; of these, 80% are masked (e.g., an amino acid token with a known codon, like [M_ATG], is swapped with the corresponding amino acid token with an unknown codon, [M_UNK]), 10% are swapped with a random token, and 10% remain unchanged. The model then has to predict the correct token that was masked.
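A sketch of this corruption rule (our implementation; to_unk maps an [aminoacid_codon] token to the matching [aminoacid_UNK] token):

```python
import random

def corrupt(tokens, to_unk, vocab, mask_prob=0.15):
    labels = list(tokens)                 # targets for the MLM loss
    out = list(tokens)
    for i in range(len(tokens)):
        if random.random() < mask_prob:   # 15% of tokens are selected
            r = random.random()
            if r < 0.8:                   # 80%: amino-acid-aware mask
                out[i] = to_unk(tokens[i])
            elif r < 0.9:                 # 10%: random token
                out[i] = random.choice(vocab)
            # remaining 10%: token left unchanged
    return out, labels
```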

Pretraining uses all sequences in the dataset, with a batch size of 6 on 16 NVIDIA V100 GPUs for 5 epochs. The learning rate starts at 5e-7, increases linearly to 5e-5 over the first 10% of training steps, and then decreases linearly back to 5e-7 over the remaining steps.
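Expressed as a PyTorch scheduler, this schedule might look like the following (a sketch of how we read the description; the authors’ exact scheduler is not shown):

```python
from torch.optim.lr_scheduler import LambdaLR

def lr_lambda(step, total_steps, base=5e-5, floor=5e-7):
    warmup = int(0.10 * total_steps)      # first 10% of training steps
    lo = floor / base                     # 5e-7 as a fraction of the 5e-5 peak
    if step < warmup:                     # linear increase to the peak
        return lo + (1 - lo) * step / warmup
    # linear decrease back to the floor over the remaining steps
    return 1 - (1 - lo) * (step - warmup) / (total_steps - warmup)

# with the optimizer's base learning rate set to the 5e-5 peak:
# scheduler = LambdaLR(optimizer, lambda s: lr_lambda(s, total_steps))
```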

Fine-tuning follows a similar process to pretraining but uses a subset of the dataset. To ensure high optimization, we select the top 10% of genes with the highest codon similarity index (CSI) from various organisms: E. coli, B. subtilis, P. putida, T. barophilus, S. cerevisiae, C. reinhardtii (and its chloroplast), N. tabacum (and its chloroplast), A. thaliana, C. elegans, D. rerio, D. melanogaster, M. musculus, and H. sapiens. The batch size and learning rate are the same as in pretraining, and we use 4 NVIDIA V100 GPUs for 15 epochs. This fine-tuning step allows the model to learn the codon distribution patterns of highly optimized genes.

Inference

Inference was conducted using two primary methods:

Protein sequence only: In this method, the model is given only the protein sequence. Each amino acid is represented by tokens in the format [aminoacid_UNK]. The model then processes these tokens to unmask and convert them into the [aminoacid_codon] format, thereby generating the corresponding DNA sequence.

Protein with DNA sequence: In this method, the model is provided with both the protein sequence and a plausible DNA sequence. Since both are provided, the sequence is initially encoded using [aminoacid_codon] tokens. The model optimizes these tokens by replacing them with more suitable [aminoacid_codon] tokens for the given protein sequence.

In both methods, the codon parts of all tokens are then concatenated to produce the predicted DNA sequence.
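A hedged sketch of the protein-only decoding step (names such as decode and allowed are ours; restricting predictions to tokens whose amino acid matches follows from the token format above):

```python
import torch

@torch.no_grad()
def decode(model, input_ids, token_type_ids, allowed):
    """allowed[i]: ids of the [aminoacid_codon] tokens valid at position i."""
    logits = model(input_ids=input_ids, token_type_ids=token_type_ids).logits
    predicted = []
    for i, ids in enumerate(allowed):
        scores = logits[0, i, ids]        # score only the synonymous-codon tokens
        predicted.append(ids[int(scores.argmax())])
    return predicted  # codon parts are then concatenated into the DNA sequence
```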

Custom fine-tuning

Fine-tuning customizes the base CodonTransformer model to a specific set of genes. This process, exemplified by our use of sequences with high (top 10%) codon similarity index (CSI), enhances the model’s performance for optimizing DNA sequences for various species and conditions. The CodonTransformer repository provides a guide on fine-tuning the base model (“Code availability”).

Evaluation metrics

Several metrics are used to assess codon-optimized sequences, providing a comprehensive view of codon usage and sequence properties:

Codon similarity index (CSI)

CSI25 is an application of CAI8 to the codon usage table of an organism (representing the codon frequency across the entire genome). It has the advantage that it does not rely on the arbitrary selection of a reference gene set. The rationale is that codon optimization based on the most frequent codons (as determined from highly expressed genes) may be detrimental to the host and impair the correct protein folding. Therefore, CSI may provide a softer estimator of codon usage for a specific organism. For the computational implementation of the CSI calculations, we utilized the CAI and CAI-PyPI Python libraries55, for which we used the overall codon usage tables56.

The relative adaptiveness wij of a codon is calculated as the ratio of its frequency xij to that of the most frequently used codon, ximax, for the same amino acid:

$${w}_{{ij}}=\frac{{x}_{{ij}}}{{x}_{{{\rm{imax}}}}}$$

(1)

The CSI for a gene is the geometric mean of these relative adaptiveness values across all codons in a DNA sequence (gene) with a length of L codons:

$${{\rm{CSI}}}=\exp \left(\frac{1}{L}{\sum }_{k=1}^{L}{\mathrm{ln}}\,{w}_{k}\right)$$

(2)
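For illustration, Eqs. (1) and (2) translate directly into the following sketch (the reported values were computed with the CAI/CAI-PyPI libraries; usage, synonyms, and codon_to_aa are inputs we assume):

```python
import math

def csi(dna, usage, synonyms, codon_to_aa):
    # Eq. (1): relative adaptiveness of each codon from the usage table.
    w = {c: usage[c] / max(usage[s] for s in synonyms[codon_to_aa[c]])
         for c in usage}
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    # Eq. (2): geometric mean of w over the L codons of the gene.
    return math.exp(sum(math.log(w[c]) for c in codons) / len(codons))
```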

Codon frequency distribution (CFD)

The CFD quantifies the percentage of rare codons (those used less than 30% as often as the most frequent codon) in a gene. A weight wij is assigned to each codon, where wij = 1 if the codon is rare and 0 otherwise:

$${w}_{{ij}}=\begin{cases}1 & {{\rm{if}}}\;\frac{{x}_{{ij}}}{{x}_{{{\rm{imax}}}}} < 0.30\\ 0 & {{\rm{otherwise}}}\end{cases}$$

(3)

The CFD is the mean of these weights:

$${{\rm{CFD}}}=\frac{1}{L}{\sum }_{k=1}^{L}{w}_{k}$$

(4)
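A corresponding sketch of Eqs. (3) and (4), reusing the relative adaptiveness values w from the CSI computation above:

```python
def cfd(dna, w):
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    # A codon is rare when its relative adaptiveness is below 0.30.
    return sum(1 for c in codons if w[c] < 0.30) / len(codons)
```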

GC content (%GC)

GC content measures the proportion of guanine (G) and cytosine (C) bases in a DNA sequence:

$${{\rm{GC}}}=\frac{{{\rm{G}}}+{{\rm{C}}}}{{{\rm{A}}}+{{\rm{T}}}+{{\rm{G}}}+{{\rm{C}}}}$$

(5)
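Eq. (5) as a one-line sketch (assuming the sequence contains only A, T, G, and C):

```python
def gc_content(dna: str) -> float:
    return (dna.count("G") + dna.count("C")) / len(dna)
```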

Negative cis-regulatory elements

The number of negative cis-regulatory elements was determined using the Genscript codon analysis tool (https://www.genscript.com/tools/rare-codon-analysis).

%MinMax

The %MinMax metric41 evaluates the balance between high- and low-frequency codons within a fixed-length window sliding along the sequence.

For a given window size of w (w = 18 as previously reported41), the %MinMax for each window i is calculated as:

$${\%{{\rm{MinMax}}}}_{i}=\frac{{x}_{\max,i}-{x}_{\min,i}}{{x}_{\max,i}+{x}_{\min,i}}\times 100$$

(6)

where xmax,i and xmin,i are the maximum and minimum codon usage frequencies within the i-th window, respectively. The overall %MinMax for the gene is represented as an array of %MinMax values for each position of the window:

$$\%{{\rm{MinMax}}}=\left[{\%{{\rm{MinMax}}}}_{1}\,,{\%{{\rm{MinMax}}}}_{2}\,,\,…\,,\,{\%{{\rm{MinMax}}}}_{n}\,\right]$$

(7)

where n is the number of windows in the sequence.
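As an illustration, Eqs. (6) and (7) can be computed as follows (a sketch assuming a codon usage table usage mapping each codon to its frequency; the profiles reported here were computed with the R package noted below):

```python
def minmax_profile(dna, usage, w=18):
    freqs = [usage[dna[i:i + 3]] for i in range(0, len(dna), 3)]
    profile = []
    for i in range(len(freqs) - w + 1):
        window = freqs[i:i + w]
        hi, lo = max(window), min(window)
        profile.append((hi - lo) / (hi + lo) * 100)  # Eq. (6)
    return profile                                   # Eq. (7)
```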

To compute %MinMax profiles, we used the R package kodonz with its built-in codon usage tables for the different organisms.

Dynamic time warping (DTW)

DTW is an algorithm for measuring the similarity between two temporal sequences or shape matching57,58. We employ DTW to quantify distances in %MinMax of codon usage patterns. DTW calculates the optimal alignment between two time series X and Y, as:

$${{\rm{DTW}}}(X,\,Y)=\sqrt{{\sum }_{(i,j)\in \pi }{\left\Vert {{{\boldsymbol{X}}}}_{i}-{{{\boldsymbol{Y}}}}_{j}\right\Vert }^{2}}$$

(8)

where X and Y are the %MinMax arrays, π is the optimal alignment path, Xi and Yj are aligned points in the sequences, and ‖Xi − Yj‖ is the Euclidean distance between these points. This metric enables quantitative assessment of how closely one sequence’s codon usage pattern aligns with that of a target sequence.

To compute the DTW distances, we used the main dtw function from the R package dtw59 with default parameters and retrieved the distance or normalizedDistance items from the result object.
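For reference, the standard dynamic-programming form of Eq. (8) can be sketched in Python as follows (the analysis itself used the R package, as noted above):

```python
import math

def dtw_distance(x, y):
    n, m = len(x), len(y)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2          # squared pointwise distance
            d[i][j] = cost + min(d[i - 1][j],          # insertion
                                 d[i][j - 1],          # deletion
                                 d[i - 1][j - 1])      # match
    return math.sqrt(d[n][m])
```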

Jaccard index

We used the Jaccard index to evaluate the propensity of models to use the same codons in the generated sequences. The Jaccard index is a set-based similarity measure that quantifies the similarity between two sets as the ratio between the size of their intersection and the size of their union.

For two DNA sequences X and Y composed of codon sets A and B, the Jaccard index is as follows:

$$J(X,\,Y)=\frac{|A\,\cap \,B|}{|A\,\cup \,B|}$$

(9)
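A sketch of Eq. (9), reading the codon sets A and B as the sets of distinct codons in each sequence:

```python
def jaccard(x: str, y: str) -> float:
    a = {x[i:i + 3] for i in range(0, len(x), 3)}  # codon set A
    b = {y[i:i + 3] for i in range(0, len(y), 3)}  # codon set B
    return len(a & b) / len(a | b)
```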

Sequence similarity analysis

Sequence similarity measures the percentage of matching codons between two sequences. This metric is crucial for assessing the functional and evolutionary relationships between genes.

For sequences A and B, the sequence similarity is calculated as:

$${{\rm{Similarity}}}(A,\,B)=\frac{1}{L}{\sum }_{i=1}^{L}{\delta }(a_{i}\,,{b}_{i})\times 100$$

(10)

where L is the length of the sequences, ai and bi are the codons at position i in sequences A and B, respectively, and δ is the Kronecker delta function, which equals 1 if ai = bi and 0 otherwise. This metric ranges from 0% to 100%, with 100% indicating identical sequences and 0% indicating completely different sequences.
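A sketch of Eq. (10) for two aligned, equal-length sequences:

```python
def similarity(a: str, b: str) -> float:
    codons_a = [a[i:i + 3] for i in range(0, len(a), 3)]
    codons_b = [b[i:i + 3] for i in range(0, len(b), 3)]
    # Kronecker delta: count positions where the codons are identical.
    matches = sum(ca == cb for ca, cb in zip(codons_a, codons_b))
    return matches / len(codons_a) * 100
```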

Minimum free energy of RNA structures

To compute the minimum free energy of the sequences generated by the models, we translated those sequences to RNA using the R package XNAString60 with default sugar and backbone structures. This package then calls the RNAfold_MFE function from the ViennaRNA package42 to compute the minimum folding energy of the RNA structure. The R code to calculate %MinMax, DTW, Jaccard index, sequence similarity, and RNA minimum free energy has been provided with the Source Data.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
