High-level visual representations in the human brain are aligned with large language models

  • Kanwisher, N. Functional specificity in the human brain: a window into the functional architecture of the mind. Proc. Natl Acad. Sci. USA 107, 11163–11170 (2010).

  • Konkle, T. & Oliva, A. A real-world size organization of object responses in occipitotemporal cortex. Neuron 74, 1114–1124 (2012).

  • Bao, P., She, L., McGill, M. & Tsao, D. Y. A map of object space in primate inferotemporal cortex. Nature 583, 103–108 (2020).

  • Kriegeskorte, N. et al. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron 60, 1126–1141 (2008).

  • Cichy, R. M., Kriegeskorte, N., Jozwik, K. M., van den Bosch, J. J. F. & Charest, I. The spatiotemporal neural dynamics underlying perceived similarity for real-world objects. NeuroImage 194, 12–24 (2019).

  • Kriegeskorte, N. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annu. Rev. Vis. Sci. 1, 417–446 (2015).

  • Kriegeskorte, N. & Douglas, P. K. Cognitive computational neuroscience. Nat. Neurosci. 21, 1148–1160 (2018).

  • DiCarlo, J. J., Zoccolan, D. & Rust, N. C. How does the brain solve visual object recognition? Neuron 73, 415–434 (2012).

  • Bracci, S. & Op de Beeck, H. P. Understanding human object vision: a picture is worth a thousand representations. Annu. Rev. Psychol. 74, 113–135 (2023).

  • Doerig, A. et al. The neuroconnectionist research programme. Nat. Rev. Neurosci. 24, 431–450 (2023).

  • Richards, B. A. et al. A deep learning framework for neuroscience. Nat. Neurosci. 22, 1761–1770 (2019).

  • Yamins, D. L. K. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).

  • Khaligh-Razavi, S.-M. & Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput. Biol. 10, e1003915 (2014).

  • Güçlü, U. & van Gerven, M. A. J. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. J. Neurosci. 35, 10005–10014 (2015).

  • Brandman, T. & Peelen, M. V. Interaction between scene and object processing revealed by human fMRI and MEG decoding. J. Neurosci. 37, 7700–7710 (2017).

  • Sadeghi, Z., McClelland, J. L. & Hoffman, P. You shall know an object by the company it keeps: an investigation of semantic representations derived from object co-occurrence in visual scenes. Neuropsychologia 76, 52–61 (2015).

  • Bonner, M. F. & Epstein, R. A. Object representations in the human brain reflect the co-occurrence statistics of vision and language. Nat. Commun. 12, 4081 (2021).

  • Ackerman, C. M. & Courtney, S. M. Spatial relations and spatial locations are dissociated within prefrontal and parietal cortex. J. Neurophysiol. 108, 2419–2429 (2012).

  • Chafee, M. V., Averbeck, B. B. & Crowe, D. A. Representing spatial relationships in posterior parietal cortex: single neurons code object-referenced position. Cereb. Cortex 17, 2914–2932 (2007).

  • Graumann, M., Ciuffi, C., Dwivedi, K., Roig, G. & Cichy, R. M. The spatiotemporal neural dynamics of object location representations in the human brain. Nat. Hum. Behav. 6, 796–811 (2022).

  • Zhang, B. & Naya, Y. Medial prefrontal cortex represents the object-based cognitive map when remembering an egocentric target location. Cereb. Cortex 30, 5356–5371 (2020).

  • Bar, M. Visual objects in context. Nat. Rev. Neurosci. 5, 617–629 (2004).

  • Russell, B., Torralba, A., Liu, C., Fergus, R. & Freeman, W. Object recognition by scene alignment. Adv. Neural Inf. Process. Syst. 20 (2007).

  • Võ, M. L.-H., Boettcher, S. E. & Draschkow, D. Reading scenes: how scene grammar guides attention and aids perception in real-world environments. Curr. Opin. Psychol. 29, 205–210 (2019).

  • Kaiser, D., Quek, G. L., Cichy, R. M. & Peelen, M. V. Object vision in a structured world. Trends Cogn. Sci. 23, 672–685 (2019).

  • Võ, M. L.-H. The meaning and structure of scenes. Vis. Res. 181, 10–20 (2021).

  • Epstein, R. A. & Baker, C. I. Scene perception in the human brain. Annu. Rev. Vis. Sci. 5, 373–397 (2019).

  • Bartnik, C. G. & Groen, I. I. A. Visual perception in the human brain: how the brain perceives and understands real-world scenes. In Oxford Research Encyclopedia of Neuroscience (2023).

  • Epstein, R. A. & Kanwisher, N. A cortical representation of the local visual environment. Nature 392, 598–601 (1998).

  • Epstein, R., Harris, A., Stanley, D. & Kanwisher, N. The parahippocampal place area: recognition, navigation, or encoding? Neuron 23, 115–125 (1999).

  • Epstein, R. A. Parahippocampal and retrosplenial contributions to human spatial navigation. Trends Cogn. Sci. 12, 388–396 (2008).

  • Groen, I. I. A., Ghebreab, S., Prins, H., Lamme, V. A. F. & Scholte, H. S. From image statistics to scene gist: evoked neural activity reveals transition from low-level natural image structure to scene category. J. Neurosci. 33, 18814–18824 (2013).

  • Stansbury, D. E., Naselaris, T. & Gallant, J. L. Natural scene statistics account for the representation of scene categories in human visual cortex. Neuron 79, 1025–1034 (2013).

  • Groen, I. I. A. et al. Distinct contributions of functional and deep neural network features to representational similarity of scenes in human brain and behavior. eLife 7, e32962 (2018).

  • Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

  • Cer, D. et al. Universal sentence encoder for English. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations 169–174 (Association for Computational Linguistics, 2018).

  • Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2018).

  • Arora, S., Liang, Y. & Ma, T. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (2017).

  • Song, K., Tan, X., Qin, T., Lu, J. & Liu, T.-Y. MPNet: masked and permuted pre-training for language understanding. Adv. Neural Inf. Process. Syst. 33, 16857–16867 (2020).

  • Lu, J., Batra, D., Parikh, D. & Lee, S. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 32 (2019).

  • Tan, H. & Bansal, M. LXMERT: learning cross-modality encoder representations from transformers. Preprint at https://doi.org/10.48550/arXiv.1908.07490 (2019).

  • Pramanick, S. et al. VoLTA: vision-language transformer with weakly-supervised local-feature alignment. Preprint at https://doi.org/10.48550/arXiv.2210.04135 (2022).

  • Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning 8748–8763 (PMLR, 2021).

  • Du, Y., Liu, Z., Li, J. & Zhao, W. X. A survey of vision-language pre-trained models. Preprint at https://doi.org/10.48550/arXiv.2202.10936 (2022).

  • Chen, F.-L. et al. VLP: a survey on vision-language pre-training. Mach. Intell. Res. 20, 38–56 (2023).

  • Allen, E. J. et al. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nat. Neurosci. 25, 116–126 (2022).

  • Lin, T.-Y. et al. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014: 13th European Conference 740–755 (Springer, 2014).

  • Chen, X. et al. Microsoft COCO captions: data collection and evaluation server. Preprint at https://doi.org/10.48550/arXiv.1504.00325 (2015).

  • Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).

  • Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I. & Specia, L. SemEval-2017 Task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proc. 11th International Workshop on Semantic Evaluation (SemEval-2017) 1–14 (Association for Computational Linguistics, 2017).

  • Kriegeskorte, N., Mur, M. & Bandettini, P. Representational similarity analysis—connecting the branches of systems neuroscience. Front. Syst. Neurosci. 2, 4 (2008).

  • Kriegeskorte, N. & Kievit, R. A. Representational geometry: integrating cognition, computation, and the brain. Trends Cogn. Sci. 17, 401–412 (2013).

  • Nili, H. et al. A toolbox for representational similarity analysis. PLoS Comput. Biol. 10, e1003553 (2014).

  • Rokem, A. & Kay, K. Fractional ridge regression: a fast, interpretable reparameterization of ridge regression. Gigascience 9, giaa133 (2020).

  • Pennock, I. M. L. et al. Color-biased regions in the ventral visual pathway are food selective. Curr. Biol. 33, 134–146.e4 (2023).

  • Kay, K. N., Naselaris, T., Prenger, R. J. & Gallant, J. L. Identifying natural images from human brain activity. Nature 452, 352–355 (2008).

  • Sharma, P., Ding, N., Goodman, S. & Soricut, R. Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Gurevych, I. & Miyao, Y.) 2556–2565 (Association for Computational Linguistics, 2018).

  • Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017).

  • Joulin, A., Grave, E., Bojanowski, P. & Mikolov, T. Bag of tricks for efficient text classification. Preprint at https://doi.org/10.48550/arXiv.1607.01759 (2016).

  • Pennington, J., Socher, R. & Manning, C. GloVe: global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Moschitti, A., Pang, B. & Daelemans, W.) 1532–1543 (Association for Computational Linguistics, 2014).

  • Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://doi.org/10.48550/arXiv.2001.08361 (2020).

  • Hernandez, D., Kaplan, J., Henighan, T. & McCandlish, S. Scaling laws for transfer. Preprint at https://doi.org/10.48550/arXiv.2102.01293 (2021).

  • Mehrer, J., Spoerer, C. J., Jones, E. C., Kriegeskorte, N. & Kietzmann, T. C. An ecologically motivated image dataset for deep learning yields better models of human vision. Proc. Natl Acad. Sci. USA 118, e2011417118 (2021).

  • Kietzmann, T. C., McClure, P. & Kriegeskorte, N. Deep neural networks in computational neuroscience. In Oxford Research Encyclopedia of Neuroscience (2019).

  • Konkle, T. & Alvarez, G. A. A self-supervised domain-general learning framework for human ventral stream representation. Nat. Commun. 13, 491 (2022).

  • Zhuang, C. et al. Unsupervised neural network models of the ventral visual stream. Proc. Natl Acad. Sci. USA 118, e2014196118 (2021).

  • Spoerer, C. J., Kietzmann, T. C., Mehrer, J., Charest, I. & Kriegeskorte, N. Recurrent neural networks can explain flexible trading of speed and accuracy in biological vision. PLoS Comput. Biol. 16, e1008215 (2020).

  • Mehrer, J., Spoerer, C. J., Kriegeskorte, N. & Kietzmann, T. C. Individual differences among deep neural network models. Nat. Commun. 11, 5725 (2020).

  • Hong, H., Yamins, D. L. K., Majaj, N. J. & DiCarlo, J. J. Explicit information for category-orthogonal object properties increases along the ventral stream. Nat. Neurosci. 19, 613–622 (2016).

  • Conwell, C., Prince, J. S., Kay, K. N., Alvarez, G. A. & Konkle, T. A large-scale examination of inductive biases shaping high-level visual representation in brains and machines. Nat. Commun. 15, 9383 (2024).

  • Han, Y., Poggio, T. & Cheung, B. System identification of neural systems: if we got it right, would we know? In International Conference on Machine Learning 12430–12444 (PMLR, 2023).

  • Storrs, K. R., Kietzmann, T. C., Walther, A., Mehrer, J. & Kriegeskorte, N. Diverse deep neural networks all predict human inferior temporal cortex well, after training and fitting. J. Cogn. Neurosci. 33, 2044–2064 (2021).

  • Bo, Y., Soni, A., Srivastava, S. & Khosla, M. Evaluating representational similarity measures from the lens of functional correspondence. Preprint at https://doi.org/10.48550/arXiv.2411.14633 (2024).

  • He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  • Ungerleider, L. G. & Mishkin, M. in Analysis of Visual Behavior (eds Goodale, M. et al.) Ch. 18, 549 (MIT Press, 1982).

  • Goodale, M. A. & Milner, A. D. Separate visual pathways for perception and action. Trends Neurosci. 15, 20–25 (1992).

  • Tanaka, K. Inferotemporal cortex and object vision. Annu. Rev. Neurosci. 19, 109–139 (1996).

  • Ishai, A., Ungerleider, L. G., Martin, A., Schouten, J. L. & Haxby, J. V. Distributed representation of objects in the human ventral visual pathway. Proc. Natl Acad. Sci. USA 96, 9379–9384 (1999).

  • Schrimpf, M. et al. Brain-Score: which artificial neural network for object recognition is most brain-like? Preprint at bioRxiv https://doi.org/10.1101/407007 (2018).

  • Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).

  • Zhou, B., Lapedriza, A., Khosla, A., Oliva, A. & Torralba, A. Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1452–1464 (2018).

  • Zamir, A. et al. Taskonomy: disentangling task transfer learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3712–3722 (IEEE, 2018).

  • Mahajan, D. et al. Exploring the limits of weakly supervised pretraining. In Proc. European Conference on Computer Vision 181–196 (2018).

  • Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning 1597–1607 (PMLR, 2020).

  • Ratan Murty, N. A., Bashivan, P., Abate, A., DiCarlo, J. J. & Kanwisher, N. Computational models of category-selective brain regions enable high-throughput tests of selectivity. Nat. Commun. 12, 5540 (2021).

  • Güçlü, U. & van Gerven, M. A. J. Semantic vector space models predict neural responses to complex visual stimuli. Preprint at https://doi.org/10.48550/arXiv.1510.04738 (2015).

  • Frisby, S. L., Halai, A. D., Cox, C. R., Lambon Ralph, M. A. & Rogers, T. T. Decoding semantic representations in mind and brain. Trends Cogn. Sci. 27, 258–281 (2023).

  • Greene, M. R., Baldassano, C., Esteva, A., Beck, D. M. & Fei-Fei, L. Visual scenes are categorized by function. J. Exp. Psychol. Gen. 145, 82–94 (2016).

  • Greene, M. R. Statistics of high-level scene context. Front. Psychol. 4, 777 (2013).

  • Henderson, J. M. & Ferreira, F. in The Interface of Language, Vision, and Action: Eye Movements and the Visual World (ed. Henderson, J. M.) Vol. 399, 1–58 (Psychology Press, 2004).

  • Greene, M. R. & Oliva, A. The briefest of glances: the time course of natural scene understanding. Psychol. Sci. 20, 464–472 (2009).

  • Malcolm, G. L. & Shomstein, S. Object-based attention in real-world scenes. J. Exp. Psychol. Gen. 144, 257–263 (2015).

  • Biederman, I. Perceiving real-world scenes. Science 177, 77–80 (1972).

  • Greene, M. R. Scene perception and understanding. In Oxford Research Encyclopedia of Psychology (2023).

  • Potter, M. C. Meaning in visual search. Science 187, 965–966 (1975).

  • Carlson, T. A., Simmons, R. A., Kriegeskorte, N. & Slevc, L. R. The emergence of semantic meaning in the ventral temporal pathway. J. Cogn. Neurosci. 26, 120–131 (2014).

  • Contier, O., Baker, C. I. & Hebart, M. N. Distributed representations of behaviorally-relevant object dimensions in the human visual system. Nat. Hum. Behav. 8, 2179–2193 (2024).

  • Marblestone, A. H., Wayne, G. & Kording, K. P. Toward an integration of deep learning and neuroscience. Front. Comput. Neurosci. 10, 94 (2016).

  • Golan, T. et al. Deep neural networks are not a single hypothesis but a language for expressing computational hypotheses. Behav. Brain Sci. 46, e392 (2023).

  • Conwell, C. et al. Monkey see, model knew: large language models accurately predict human and macaque visual brain activity. In UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models (2024).

  • Popham, S. F. et al. Visual and linguistic semantic representations are aligned at the border of human visual cortex. Nat. Neurosci. 24, 1628–1636 (2021).

  • Wang, A. Y., Kay, K., Naselaris, T., Tarr, M. J. & Wehbe, L. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nat. Mach. Intell. 5, 1415–1426 (2023).

  • Tang, J., Du, M., Vo, V. A., Lal, V. & Huth, A. G. Brain encoding models based on multimodal transformers can transfer across language and vision. Adv. Neural Inf. Process. Syst. 36, 29654–29666 (2023).

  • Kay, K., Bonnen, K., Denison, R. N., Arcaro, M. J. & Barack, D. L. Tasks and their role in visual neuroscience. Neuron 111, 1697–1713 (2023).

  • Çukur, T., Nishimoto, S., Huth, A. G. & Gallant, J. L. Attention during natural vision warps semantic representation across the human brain. Nat. Neurosci. 16, 763–770 (2013).

  • Goldstein, A. et al. Alignment of brain embeddings and artificial contextual embeddings in natural language points to common geometric patterns. Nat. Commun. 15, 2768 (2024).

  • Schrimpf, M. et al. The neural architecture of language: integrative modeling converges on predictive processing. Proc. Natl Acad. Sci. USA 118, e2105646118 (2021).

  • Zada, Z. et al. A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations. Neuron 112, 3211–3222 (2024).

  • Cadieu, C. F. et al. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Comput. Biol. 10, e1003963 (2014).

  • Bird, S., Klein, E. & Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (O'Reilly Media, 2009).

  • Kriegeskorte, N., Goebel, R. & Bandettini, P. A. Information-based functional brain mapping. Proc. Natl Acad. Sci. USA 103, 3863–3868 (2006).

  • Haynes, J. D. & Rees, G. Predicting the stream of consciousness from activity in human visual cortex. Curr. Biol. 15, 1301–1307 (2005).

  • Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).

  • Kietzmann, T. C. et al. Recurrence is required to capture the representational dynamics of the human visual system. Proc. Natl Acad. Sci. USA 116, 21854–21863 (2019).

  • Kubilius, J. et al. CORnet: modeling the neural mechanisms of core object recognition. Preprint at bioRxiv https://doi.org/10.1101/408385 (2018).

  • Muttenthaler, L. & Hebart, M. N. THINGSvision: a Python toolbox for streamlining the extraction of activations from deep neural networks. Front. Neuroinform. 15, 679838 (2021).

  • Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (eds Pereira, F. et al.) Vol. 25 (Curran Associates, Inc., 2012).

  • timmdocs: documentation for Ross Wightman’s timm image model library. GitHub https://github.com/fastai/timmdocs (2025).

  • Doerig, A. Visuo_llm (v1.0). Zenodo https://doi.org/10.5281/ZENODO.15282176 (2025).
