A text-guided protein design framework

References

  • Freschlin, C. R., Fahlberg, S. A. & Romero, P. A. Machine learning to navigate fitness landscapes for protein engineering. Curr. Opin. Biotechnol. 75, 102713 (2022).

  • Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  • Zhong, E. D., Lerer, A., Davis, J. H. & Berger, B. CryoDRGN2: ab initio neural reconstruction of 3D protein structures from real cryo-EM images. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 4046–4055 (IEEE, 2021).

  • Hsu, C. et al. Learning inverse folding from millions of predicted structures. Proc. Mach. Learning Res. 162, 8946–8970 (2022).

  • Rao, R. M. et al. MSA Transformer. Proc. Mach. Learning Res. 139, 8844–8856 (2021).

  • Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).

  • Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).

  • Li, M. et al. SESNet: sequence–structure feature-integrated deep learning method for data-efficient protein engineering. J. Cheminformatics 15, 12 (2023).

  • Jing, B., Eismann, S., Suriana, P., Townshend, R. J. L. & Dror, R. Learning from protein structure with geometric vector perceptrons. In International Conference on Learning Representations (2021).

  • Wang, L., Liu, H., Liu, Y., Kurtin, J. & Ji, S. Learning protein representations via complete 3D graph networks. In The Eleventh International Conference on Learning Representations (2023).

  • Radford, A. et al. Learning transferable visual models from natural language supervision. Proc. Mach. Learning Res. 139, 8748–8763 (2021).

  • Nichol, A. Q. et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. Proc. Mach. Learning Res. 162, 16784–16804 (2022).

  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://arxiv.org/abs/2204.06125 (2022).

  • Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D. & Lischinski, D. StyleCLIP: text-driven manipulation of StyleGAN imagery. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 2065–2074 (IEEE, 2021).

  • Liu, S., Qu, M., Zhang, Z., Cai, H. & Tang, J. Structured multi-task learning for molecular property prediction. Proc. Mach. Learning Res. 151, 8906–8920 (2022).

  • Edwards, C., Zhai, C. & Ji, H. Text2mol: cross-modal molecule retrieval with natural language queries. In Proc. 2021 Conference on Empirical Methods in Natural Language Processing (eds Moens, M.-F. et al.) 595–607 (Association for Computational Linguistics, 2021).

  • Zeng, Z., Yao, Y., Liu, Z. & Sun, M. A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nat. Commun. 13, 862 (2022).

  • Liu, S. et al. Multi-modal molecule structure–text model for text-based retrieval and editing. Nat. Mach. Intell. 5, 1447–1457 (2023).

  • Liu, S. et al. Conversational drug editing using retrieval and domain feedback. In The Twelfth International Conference on Learning Representations (2024).

  • The UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Res. 36, D190–D195 (2007).

  • Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

  • UniProt. UniProtKB/Swiss-Prot (2023).

  • Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M. & Bairoch, A. in Plant Bioinformatics (ed. Edwards, D.) 89–112 (Springer, 2007).

  • Branden, C. I. & Tooze, J. Introduction to Protein Structure (Garland, 2012).

  • Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).

  • Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).

  • Steinegger, M., Mirdita, M. & Söding, J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 16, 603–606 (2019).

  • Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 2542 (2018).

  • Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (eds Inui, K. et al.) 3615–3620 (Association for Computational Linguistics, 2019).

  • Fricke, S. Semantic Scholar. J. Med. Libr. Assoc. 106, 145–147 (2018).

  • Taylor, R. et al. Galactica: a large language model for science. Preprint at https://arxiv.org/abs/2211.09085 (2022).

  • Li, Y., Xu, H., Zhao, H., Guo, H. & Liu, S. ChatPathway: conversational large language models for biology pathway detection. In NeurIPS 2023 AI for Science Workshop (2023).

  • Savage, N. Drug discovery companies are customizing ChatGPT: here’s how. Nat. Biotechnol. 41, 585–586 (2023).

  • Gao, Z. et al. Empowering diffusion models on the embedding space for text generation. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K. et al.) 4664–4683 (Association for Computational Linguistics, 2024).

  • Lin, Z. et al. Text generation with diffusion language models: a pre-training approach with continuous paragraph denoise. Proc. Mach. Learning Res. 202, 21051–21064 (2023).

  • Bar-Tal, O. et al. Lumiere: a space–time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers 1–11 (Association for Computing Machinery, 2024).

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10684–10695 (IEEE Computer Society, 2022).

  • Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).

  • Binder, J. L. et al. AlphaFold illuminates half of the dark human proteins. Curr. Opin. Struct. Biol. 74, 102372 (2022).

  • Rocklin, G. J. et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).

  • Rohl, C. A., Strauss, C. E., Misura, K. M. & Baker, D. in Methods in Enzymology Vol. 383 (eds Brand, L. & Johnson, M. L.) 66–93 (Elsevier, 2004).

  • Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).

  • Park, H. et al. Simultaneous optimization of biomolecular energy functions on features from small molecules and macromolecules. J. Chem. Theory Comput. 12, 6201–6212 (2016).

  • Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).

  • Liu, S. et al. A multi-grained symmetric differential equation model for learning protein–ligand binding dynamics. Preprint at (2024).

  • McNutt, A. T. et al. gnina 1.0: molecular docking with deep learning. J. Cheminformatics 13, 43 (2021).

  • Salsi, E. et al. Design of O-acetylserine sulfhydrylase inhibitors by mimicking nature. J. Med. Chem. 53, 345–356 (2010).

  • Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32 (2019).

  • Klausen, M. S. et al. NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins 87, 520–527 (2019).

  • Hou, J., Adhikari, B. & Cheng, J. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics 34, 1295–1303 (2018).

  • Fox, N. K., Brenner, S. E. & Chandonia, J.-M. SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, D304–D309 (2013).

  • AlQuraishi, M. ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinform. 20, 311 (2019).

  • Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T. & Tramontano, A. Critical assessment of methods of protein structure prediction (CASP)—Round XII. Proteins 86, 7–15 (2018).

  • Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).

  • Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  • He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  • Zhang, N. et al. OntoProtein: protein pretraining with Gene Ontology embedding. In International Conference on Learning Representations (2022).

  • Ingraham, J. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078 (2023).

  • Wei, C.-H., Allot, A., Leaman, R. & Lu, Z. PubTator Central: automated concept annotation for biomedical full text articles. Nucleic Acids Res. 47, W587–W593 (2019).

  • Angermueller, C. et al. Model-based reinforcement learning for biological sequence design. In International Conference on Learning Representations (2020).

  • Gelman, S., Fahlberg, S. A., Heinzelman, P., Romero, P. A. & Gitter, A. Neural networks to learn protein sequence–function relationships from deep mutational scanning data. Proc. Natl Acad. Sci. USA 118, e2104878118 (2021).

  • Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).

  • Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).

  • Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. Proc. Mach. Learning Res. 162, 16990–17017 (2022).

  • Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019).

  • Lewis, M. et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 7871–7880 (Association for Computational Linguistics, 2020).

  • Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learning Res. 21, 1–67 (2020).

  • Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).

  • Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput. 23, 1661–1674 (2011).

  • Song, Y. & Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 32 (2019).

  • Song, Y. et al. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (2021).

  • Hjelm, R. D. et al. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations (2019).

  • Bachman, P., Hjelm, R. D. & Buchwalter, W. Learning representations by maximizing mutual information across views. Adv. Neural Inf. Process. Syst. 32 (2019).

  • Oord, A. v. d., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).

  • He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 9729–9738 (IEEE, 2020).

  • Liu, S. et al. Pre-training molecular graph representation with 3D geometry. In International Conference on Learning Representations (2022).

  • LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. & Huang, F. in Predicting Structured Data Vol. 1 (eds Bakir, G. et al.) (MIT Press, 2006).

  • Khosla, P. et al. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 33, 18661–18673 (2020).

  • Liu, S., Guo, H. & Tang, J. Molecular geometry pretraining with SE(3)-invariant denoising distance matching. In International Conference on Learning Representations (2023).

  • Huang, W., Hayashi, T., Wu, Y., Kameoka, H. & Toda, T. Voice transformer network: sequence-to-sequence voice conversion using Transformer with text-to-speech pretraining. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020 (eds Meng, H. et al.) 4676–4680 (ISCA, 2020).

  • Karita, S. et al. A comparative study on Transformer vs RNN in speech applications. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019, Singapore, December 14–18, 2019 449–456 (IEEE, 2019).

  • Chang, H. et al. Muse: text-to-image generation via masked generative transformers. Proc. Mach. Learning Res. 202, 4055–4075 (2023).

  • Song, Y. & Kingma, D. P. How to train your energy-based models. Preprint at https://arxiv.org/abs/2101.03288 (2021).

  • Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P. & Welling, M. Argmax flows and multinomial diffusion: learning categorical distributions. Adv. Neural Inf. Process. Syst. 34, 12454–12465 (2021).

  • Austin, J., Johnson, D. D., Ho, J., Tarlow, D. & van den Berg, R. Structured denoising diffusion models in discrete state-spaces. Adv. Neural Inf. Process. Syst. 34, 17981–17993 (2021).

  • Li, X., Thickstun, J., Gulrajani, I., Liang, P. S. & Hashimoto, T. B. Diffusion-LM improves controllable text generation. Adv. Neural Inf. Process. Syst. 35, 4328–4343 (2022).

  • Bond-Taylor, S., Hessey, P., Sasaki, H., Breckon, T. P. & Willcocks, C. G. Unleashing Transformers: parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proc., Part XXIII (eds Avidan, S. et al.) 170–188 (Springer, 2022).

  • Liu, S. et al. A text-guided protein design framework. Zenodo (2025).

  • 2025-03-27 00:00:00

    Related Articles

    Back to top button