MMTEB: Massive Multilingual Text Embedding Benchmark

Authors: Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqaya Bensafi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Zhenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, and Niklas Muennighoff.

View a PDF of the paper titled MMTEB: Massive Multilingual Text Embedding Benchmark, by Kenneth Enevoldsen and 84 other authors

Abstract: Text embeddings are typically evaluated on a narrow set of tasks, limited in language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) – a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across more than 250 languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct, with only 560 million parameters. To improve accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, which ensures a diverse selection while preserving relative model rankings. We further optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, the newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version, but at a fraction of the computational cost.
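The correlation-based downsampling described above can be illustrated with a short sketch: greedily drop tasks as long as the model ranking induced by the remaining subset stays close to the ranking on the full task set. The greedy criterion and the synthetic scores below are illustrative assumptions, not the paper's exact algorithm or data.

```python
# Sketch of correlation-based task downsampling: keep a subset of tasks
# whose aggregate scores preserve the relative ranking of models on the
# full task set. Scores are synthetic; the greedy rule is an assumption.
import random
from statistics import mean


def rank(values):
    """Rank of each entry (0 = best), where a higher value is better."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks


def spearman(a, b):
    """Spearman rank correlation between two score vectors (ignores ties)."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


def downsample_tasks(scores, keep):
    """Greedily drop tasks while keeping model rankings close to the full set.

    scores: dict task -> list of per-model scores (same model order per task).
    keep:   number of tasks to retain.
    """
    tasks = list(scores)
    # Mean score per model over the full task set defines the reference ranking.
    full = [mean(col) for col in zip(*scores.values())]
    while len(tasks) > keep:
        def corr_without(t):
            rest = [scores[u] for u in tasks if u != t]
            sub = [mean(col) for col in zip(*rest)]
            return spearman(full, sub)
        # Drop the task whose removal best preserves the reference ranking.
        tasks.remove(max(tasks, key=corr_without))
    return tasks


random.seed(0)
scores = {f"task{i}": [random.random() for _ in range(5)] for i in range(10)}
subset = downsample_tasks(scores, keep=4)
```

In this toy setup, `subset` retains the four tasks whose mean scores best reproduce the Spearman ranking of the five hypothetical models on all ten tasks.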

Submission history

From: Kenneth Enevoldsen
[v1] Wed, 19 Feb 2025 10:13:43 UTC (5,089 KB)
[v2] Tue, 8 Apr 2025 08:57:22 UTC (6,870 KB)
[v3] Sun, 8 Jun 2025 15:10:54 UTC (6,872 KB)
[v4] Thu, 13 Nov 2025 08:51:24 UTC (6,899 KB)
