A machine learning model that predicts ATG and near-cognate codon translation initiation sites (TIS) in nucleotide sequences

TIS Predictor is machine learning software that predicts translation initiation sites in a given nucleotide sequence [1]. It was trained on instances of both ATG and near-cognate codon translation initiation. It outperforms TITER [2], a recently developed model built for the same purpose, by more than 18% in accuracy.

Please cite https://www.biorxiv.org/content/10.1101/2021.08.17.456657v1 if predictions are used.

ATG translation initiation codons are predicted with about 87.79% accuracy, and near-cognate codons (CTG, GTG, and TTG) with 85.00% accuracy. Predictions for less common near-cognate codons (AAG, AGG, ACG, ATC, ATT, and ATA) are extrapolated from patterns recognized in CTG, GTG, and TTG instances. Predictions for codons that the software was trained on appear in bold. Predictions for extrapolated codons appear without bold formatting and should be treated with less confidence.
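To make the two codon categories concrete, the sketch below enumerates candidate start codons in a sequence and labels each one as trained or extrapolated. It is an illustration only and not the TIS Predictor model: it does no scoring, and the function name `candidate_sites`, the 0-based positions, and the handling of RNA input are assumptions made for this example. Only the two codon sets come from the description above.

```python
# Codon sets taken from the description above; everything else is illustrative.
TRAINED = {"ATG", "CTG", "GTG", "TTG"}
EXTRAPOLATED = {"AAG", "AGG", "ACG", "ATC", "ATT", "ATA"}


def candidate_sites(sequence):
    """Return (position, codon, category) for every candidate start codon.

    Positions are 0-based offsets of the codon's first base (an assumption
    of this sketch). Every offset is checked, so all three reading frames
    are covered and overlapping codons are reported.
    """
    seq = sequence.upper().replace("U", "T")  # accept RNA input as well
    sites = []
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon in TRAINED:
            sites.append((i, codon, "trained"))
        elif codon in EXTRAPOLATED:
            sites.append((i, codon, "extrapolated"))
    return sites


if __name__ == "__main__":
    for pos, codon, category in candidate_sites("GGCTGACGATGA"):
        print(pos, codon, category)
```

A "trained" label here corresponds to a prediction that would be shown in bold, and an "extrapolated" label to one shown without bold formatting.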

Previously identified translation initiation sites upstream of repeat expansions for the C9ORF72, FMR1, DM1, and HDL2 genes – associated with neurologic disease – were compiled and used as a reference for software performance.

Table 1. Previously identified translation initiation sites from publications.

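| Gene | Codon | Number of Bases Upstream of Repeat | Peptide Repeat Translated | Kozak Similarity Score |
|---|---|---|---|---|
| C9orf72 (Sense) [3] | AGG | 1 | Poly-GR | 0.66 |
| C9orf72 (Sense) [3] | CTG | 24 | Poly-GA | 0.69 |
| C9orf72 (Antisense) [3] | ATG | 194 | Poly-PG | 0.61 |
| FMR1 (Sense) [4, 5] | GTG | 11 | Poly-G | 0.70 |
| FMR1 (Sense) [4, 5] | ACG | 35 | Poly-G | 0.80 |
| FMR1 (Sense) [4, 5] | ACG | 60 | Poly-G | 0.71 |
| DM1 (Antisense) with slightly modified sequence [6] | ATC | 7 | Poly-A | 0.61 |
| DM1 (Antisense) with slightly modified sequence [6] | ATG | 17 | Poly-S | 0.66 |
| DM1 (Antisense) with slightly modified sequence [6] | ATT | 23 | Poly-S | 0.74 |
| HDL2 (Antisense) [6] | ATC | 6 | Poly-Q | 0.74 |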

All translation initiation sites were correctly identified with one exception: the ATC codon seven bases upstream of the DM1 repeat [6].
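The reference comparison can be tallied with a short check. The sketch below encodes the rows of Table 1 and the single missed site reported above, then counts how many reference sites were recovered. The tuple layout and the names `REFERENCE_SITES`, `MISSED`, and `summarize` are assumptions made for this example and are not part of the software.

```python
# Rows of Table 1 as (gene and strand, codon, bases upstream of repeat).
REFERENCE_SITES = [
    ("C9orf72 sense", "AGG", 1),
    ("C9orf72 sense", "CTG", 24),
    ("C9orf72 antisense", "ATG", 194),
    ("FMR1 sense", "GTG", 11),
    ("FMR1 sense", "ACG", 35),
    ("FMR1 sense", "ACG", 60),
    ("DM1 antisense", "ATC", 7),
    ("DM1 antisense", "ATG", 17),
    ("DM1 antisense", "ATT", 23),
    ("HDL2 antisense", "ATC", 6),
]

# The one site the text reports as not identified.
MISSED = {("DM1 antisense", "ATC", 7)}


def summarize(sites, missed):
    """Return (number identified, total, fraction identified)."""
    found = [site for site in sites if site not in missed]
    return len(found), len(sites), len(found) / len(sites)


if __name__ == "__main__":
    found, total, fraction = summarize(REFERENCE_SITES, MISSED)
    print(f"{found} of {total} reference sites identified ({fraction:.0%})")
```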

1. Gleason AC, Ghadge G, Chen J, Sonobe Y, Roos RP. Machine learning predicts translation initiation sites in neurologic diseases with expanded repeats. bioRxiv. 2021:2021.08.17.456657.

2. Zhang S, Hu H, Jiang T, Zhang L, Zeng J. TITER: predicting translation initiation sites by deep learning. Bioinformatics. 2017;33:i234–i42.

3. Boivin M, Pfister V, Gaucherot A, Ruffenach F, Luc N, Sellier C, et al. Reduced autophagy upon C9ORF72 loss synergizes with dipeptide repeat protein toxicity in G4C2 repeat expansion disorders. EMBO J. 2020;39(4):e100574.

4. Rodriguez CM, Wright SE, Kearse MG, Haenfler JM, Flores BN, Liu Y, et al. A native function for RAN translation and CGG repeats in regulating fragile X protein synthesis. Nature Neuroscience. 2020;23:386–97.

5. Kearse MG, Green KM, Krans A, Rodriguez CM, Linsalata AE, Goldstrohm AC, et al. CGG repeat-associated non-AUG translation utilizes a cap-dependent, scanning mechanism of initiation to produce toxic proteins. Mol Cell. 2016;62(2):314–22.

6. Zu T, Gibbens B, Doty NS, Gomes-Pereira M, Huguet A, Stone MD, et al. Non-ATG-initiated translation directed by microsatellite expansions. Proc Natl Acad Sci U S A. 2010;108(1):260–5.