Large language models for scientific discovery in molecular property prediction

  • Bloom, N., Jones, C. I., Van Reenen, J. & Webb, M. Are ideas getting harder to find? Am. Econ. Rev. 110, 1104–1144 (2020).

  • Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).

  • Frank, M. Baby steps in evaluating the capacities of large language models. Nat. Rev. Psychol. 2, 451–452 (2023).

  • Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

  • Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

  • Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).

  • Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  • Mirza, A. et al. Are large language models superhuman chemists? Preprint at https://arxiv.org/abs/2404.01475 (2024).

  • Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).

  • Taylor, R. et al. Galactica: a large language model for science. Preprint at https://arxiv.org/abs/2211.09085 (2022).

  • Almazrouei, E. et al. The Falcon series of open language models. Preprint at https://arxiv.org/abs/2311.16867 (2023).

  • Lipinski, C. A., Lombardo, F., Dominy, B. W. & Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 64, 4–17 (2012).

  • Landrum, G. et al. rdkit/rdkit: 2024_09_5 (Q3 2024) Release (Release_2024_09_5). Zenodo https://doi.org/10.5281/zenodo.14779836 (2025).

  • Hu, W. et al. Strategies for pre-training graph neural networks. In Proc. International Conference on Learning Representations (2020).

  • You, Y. et al. Graph contrastive learning with augmentations. Adv. Neural Inf. Process. Syst. 33, 5812–5823 (2020).

  • Wang, Y., Wang, J., Cao, Z. & Farimani, A. B. Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell. 4, 279–287 (2022).

  • Stärk, H. et al. 3D Infomax improves GNNs for molecular property prediction. In Proc. International Conference on Machine Learning (eds Chaudhuri, K. et al.) 20479–20502 (PMLR, 2022).

  • Liu, S. et al. Pre-training molecular graph representation with 3D geometry. In Proc. 10th International Conference on Learning Representations (2022).

  • Xia, J. et al. Mole-BERT: rethinking pre-training graph neural networks for molecules. In Proc. 11th International Conference on Learning Representations (2023).

  • Rong, Y. et al. Self-supervised graph transformer on large-scale molecular data. Adv. Neural Inf. Process. Syst. 33, 12559–12571 (2020).

  • Zhou, G. et al. Uni-Mol: a universal 3D molecular representation learning framework. In Proc. 11th International Conference on Learning Representations (2023).

  • Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  • Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702.e13 (2020).

  • Wong, F. et al. Discovery of a structural class of antibiotics with explainable deep learning. Nature 626, 177–185 (2024).

  • Zhang, D. et al. ChemLLM: a chemical large language model. Preprint at https://arxiv.org/abs/2402.06852 (2024).

  • Zhao, Z. et al. ChemDFM: a large language foundation model for chemistry. In 38th Conference on Neural Information Processing Systems, Foundation Models for Science: Progress, Opportunities, and Challenges (NeurIPS, 2024).

  • Cai, Z. et al. InternLM2 technical report. Preprint at https://arxiv.org/abs/2403.17297 (2024).

  • Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).

  • Haque, M. & Li, S. Exploring ChatGPT and its impact on society. AI Ethics https://doi.org/10.1007/s43681-024-00435-4 (2024).

  • Wei, J. et al. Emergent abilities of large language models. Trans. Mach. Learn. Res. https://openreview.net/pdf?id=yzkSU5zdwD (2022).

  • McKnight, P. E. & Najab, J. in The Corsini Encyclopedia of Psychology (eds Weiner, I. B. & Craighead, W. E.) (Wiley, 2010).

  • Subramanian, G., Ramsundar, B., Pande, V. & Denny, R. A. Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. J. Chem. Inf. Model. 56, 1936–1949 (2016).

  • Wager, T. Defining desirable central nervous system drug space through the alignment of molecular properties, in vitro ADME, and safety attributes. ACS Chem. Neurosci. 1, 420–434 (2010).

  • Wager, T., Hou, X., Verhoest, P. & Villalobos, A. Moving beyond rules: the development of a central nervous system multiparameter optimization (CNS MPO) approach to enable alignment of druglike properties. ACS Chem. Neurosci. 1, 435–449 (2010).

  • Geldenhuys, W., Mohammad, A., Adkins, C. & Lockman, P. Molecular determinants of blood–brain barrier permeation. Ther. Deliv. 6, 961–971 (2015).

  • Liu, N. F. et al. Lost in the middle: how language models use long contexts. Trans. Assoc. Comput. Linguist. 12, 157–173 (2024).

  • Qin, G., Feng, Y. & Van Durme, B. The NLP task effectiveness of long-range transformers. In Proc. 17th Conference of the European Chapter of the Association for Computational Linguistics (eds Vlachos, A. & Augenstein, I.) 3774–3790 (ACL, 2023).

  • The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).

  • Benson, D. A. et al. GenBank. Nucleic Acids Res. 41, D36–D42 (2013).

  • Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

  • Park, Y. J. et al. Can ChatGPT be used to generate scientific hypotheses? J. Materiomics 10, 578–584 (2024).

  • Honda, S., Shi, S. & Ueda, H. R. SMILES transformer: pre-trained molecular fingerprint for low data drug discovery. Preprint at https://arxiv.org/abs/1911.04738 (2019).

  • zyzisastudyreallyhardguy & Ju, J. Code repository LLM4SD: release v.1.0. Zenodo https://doi.org/10.5281/zenodo.13986921 (2024).

  • Student. The probable error of a mean. Biometrika 6, 1–25 (1908).
