A text-guided protein design framework

Freschlin, C. R., Fahlberg, S. A. & Romero, P. A. Machine learning to navigate fitness landscapes for protein engineering. Curr. Opin. Biotechnol. 75, 102713 (2022).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Zhong, E. D., Lerer, A., Davis, J. H. & Berger, B. CryoDRGN2: ab initio neural reconstruction of 3D protein structures from real cryo-EM images. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 4046–4055 (IEEE, 2021).
Hsu, C. et al. Learning inverse folding from millions of predicted structures. Proc. Mach. Learning Res. 162, 8946–8970 (2022).
Rao, R. M. et al. MSA Transformer. Proc. Mach. Learning Res. 139, 8844–8856 (2021).
Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).
Li, M. et al. SESNet: sequence–structure feature-integrated deep learning method for data-efficient protein engineering. J. Cheminformatics 15, 12 (2023).
Jing, B., Eismann, S., Suriana, P., Townshend, R. J. L. & Dror, R. Learning from protein structure with geometric vector perceptrons. In International Conference on Learning Representations (2021).
Wang, L., Liu, H., Liu, Y., Kurtin, J. & Ji, S. Learning protein representations via complete 3D graph networks. In The Eleventh International Conference on Learning Representations (2023).
Radford, A. et al. Learning transferable visual models from natural language supervision. Proc. Mach. Learning Res. 139, 8748–8763 (2021).
Nichol, A. Q. et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. Proc. Mach. Learning Res. 162, 16784–16804 (2022).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://arxiv.org/abs/2204.06125 (2022).
Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D. & Lischinski, D. StyleCLIP: text-driven manipulation of StyleGAN imagery. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 2065–2074 (IEEE, 2021).
Liu, S., Qu, M., Zhang, Z., Cai, H. & Tang, J. Structured multi-task learning for molecular property prediction. Proc. Mach. Learning Res. 151, 8906–8920 (2022).
Edwards, C., Zhai, C. & Ji, H. Text2Mol: cross-modal molecule retrieval with natural language queries. In Proc. 2021 Conference on Empirical Methods in Natural Language Processing (eds Moens, M.-F. et al.) 595–607 (Association for Computational Linguistics, 2021).
Zeng, Z., Yao, Y., Liu, Z. & Sun, M. A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nat. Commun. 13, 862 (2022).
Liu, S. et al. Multi-modal molecule structure–text model for text-based retrieval and editing. Nat. Mach. Intell. 5, 1447–1457 (2023).
Liu, S. et al. Conversational drug editing using retrieval and domain feedback. In The Twelfth International Conference on Learning Representations (2024).
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Res. 36, D190–D195 (2007).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
UniProt. UniProtKB/Swiss-Prot (2023).
Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M. & Bairoch, A. in Plant Bioinformatics (ed. Edwards, D.) 89–112 (Springer, 2007).
Branden, C. I. & Tooze, J. Introduction to Protein Structure (Garland, 2012).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Steinegger, M., Mirdita, M. & Söding, J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 16, 603–606 (2019).
Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 2542 (2018).
Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (eds Inui, K. et al.) 3615–3620 (Association for Computational Linguistics, 2019).
Fricke, S. Semantic Scholar. J. Med. Libr. Assoc. 106, 145–147 (2018).
Taylor, R. et al. Galactica: a large language model for science. Preprint at https://arxiv.org/abs/2211.09085 (2022).
Li, Y., Xu, H., Zhao, H., Guo, H. & Liu, S. ChatPathway: conversational large language models for biology pathway detection. In NeurIPS 2023 AI for Science Workshop (2023).
Savage, N. Drug discovery companies are customizing ChatGPT: here’s how. Nat. Biotechnol. 41, 585–586 (2023).
Gao, Z. et al. Empowering diffusion models on the embedding space for text generation. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K. et al.) 4664–4683 (Association for Computational Linguistics, 2024).
Lin, Z. et al. Text generation with diffusion language models: a pre-training approach with continuous paragraph denoise. Proc. Mach. Learning Res. 202, 21051–21064 (2023).
Bar-Tal, O. et al. Lumiere: a space–time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers 1–11 (Association for Computing Machinery, 2024).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10684–10695 (IEEE Computer Society, 2022).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Binder, J. L. et al. AlphaFold illuminates half of the dark human proteins. Curr. Opin. Struct. Biol. 74, 102372 (2022).
Rocklin, G. J. et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).
Rohl, C. A., Strauss, C. E., Misura, K. M. & Baker, D. in Methods in Enzymology Vol. 383 (eds Brand, L. & Johnson, M. L.) 66–93 (Elsevier, 2004).
Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).
Park, H. et al. Simultaneous optimization of biomolecular energy functions on features from small molecules and macromolecules. J. Chem. Theory Comput. 12, 6201–6212 (2016).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
Liu, S. et al. A multi-grained symmetric differential equation model for learning protein–ligand binding dynamics. Preprint at (2024).
McNutt, A. T. et al. gnina 1.0: molecular docking with deep learning. J. Cheminformatics 13, 43 (2021).
Salsi, E. et al. Design of O-acetylserine sulfhydrylase inhibitors by mimicking nature. J. Med. Chem. 53, 345–356 (2010).
Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32 (2019).
Klausen, M. S. et al. NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins 87, 520–527 (2019).
Hou, J., Adhikari, B. & Cheng, J. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics 34, 1295–1303 (2018).
Fox, N. K., Brenner, S. E. & Chandonia, J.-M. SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, D304–D309 (2013).
AlQuraishi, M. ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinform. 20, 311 (2019).
Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T. & Tramontano, A. Critical assessment of methods of protein structure prediction (CASP)—Round XII. Proteins 86, 7–15 (2018).
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Zhang, N. et al. OntoProtein: protein pretraining with Gene Ontology embedding. In International Conference on Learning Representations (2022).
Ingraham, J. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078 (2023).
Wei, C.-H., Allot, A., Leaman, R. & Lu, Z. PubTator Central: automated concept annotation for biomedical full text articles. Nucleic Acids Res. 47, W587–W593 (2019).
Angermueller, C. et al. Model-based reinforcement learning for biological sequence design. In International Conference on Learning Representations (2020).
Gelman, S., Fahlberg, S. A., Heinzelman, P., Romero, P. A. & Gitter, A. Neural networks to learn protein sequence–function relationships from deep mutational scanning data. Proc. Natl Acad. Sci. USA 118, e2104878118 (2021).
Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).
Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. Proc. Mach. Learning Res. 162, 16990–17017 (2022).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019).
Lewis, M. et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 7871–7880 (Association for Computational Linguistics, 2020).
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learning Res. 21, 1–67 (2020).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput. 23, 1661–1674 (2011).
Song, Y. & Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 32 (2019).
Song, Y. et al. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (2021).
Hjelm, R. D. et al. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations (2019).
Bachman, P., Hjelm, R. D. & Buchwalter, W. Learning representations by maximizing mutual information across views. Adv. Neural Inf. Process. Syst. 32 (2019).
Oord, A. v. d., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 9729–9738 (IEEE, 2020).
Liu, S. et al. Pre-training molecular graph representation with 3D geometry. In International Conference on Learning Representations (2022).
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. & Huang, F. in Predicting Structured Data Vol. 1 (eds Bakir, G. et al.) (MIT Press, 2006).
Khosla, P. et al. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 33, 18661–18673 (2020).
Liu, S., Guo, H. & Tang, J. Molecular geometry pretraining with SE(3)-invariant denoising distance matching. In International Conference on Learning Representations (2023).
Huang, W., Hayashi, T., Wu, Y., Kameoka, H. & Toda, T. Voice transformer network: sequence-to-sequence voice conversion using Transformer with text-to-speech pretraining. In Interspeech 2020 (eds Meng, H. et al.) 4676–4680 (ISCA, 2020).
Karita, S. et al. A comparative study on Transformer vs RNN in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 449–456 (IEEE, 2019).
Chang, H. et al. Muse: text-to-image generation via masked generative transformers. Proc. Mach. Learning Res. 202, 4055–4075 (2023).
Song, Y. & Kingma, D. P. How to train your energy-based models. Preprint at https://arxiv.org/abs/2101.03288 (2021).
Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P. & Welling, M. Argmax flows and multinomial diffusion: learning categorical distributions. Adv. Neural Inf. Process. Syst. 34, 12454–12465 (2021).
Austin, J., Johnson, D. D., Ho, J., Tarlow, D. & van den Berg, R. Structured denoising diffusion models in discrete state-spaces. Adv. Neural Inf. Process. Syst. 34, 17981–17993 (2021).
Li, X., Thickstun, J., Gulrajani, I., Liang, P. S. & Hashimoto, T. B. Diffusion-LM improves controllable text generation. Adv. Neural Inf. Process. Syst. 35, 4328–4343 (2022).
Bond-Taylor, S., Hessey, P., Sasaki, H., Breckon, T. P. & Willcocks, C. G. Unleashing Transformers: parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. In Computer Vision – ECCV 2022, Part XXIII (eds Avidan, S. et al.) 170–188 (Springer, 2022).
Liu, S. et al. A text-guided protein design framework. Zenodo (2025).