MMTEB: Massive Multilingual Text Embedding Benchmark
by Kenneth Enevoldsen and 84 other authors
Abstract: Text embeddings are typically evaluated on a narrow set of tasks, limited in language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) – a large-scale, community-driven expansion of MTEB covering more than 500 quality-controlled evaluation tasks across more than 250 languages. MMTEB includes a diverse set of challenging novel tasks such as instruction following, long-document retrieval, and code retrieval, and represents the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct, with only 560 million parameters. To improve accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, which ensures a diverse task selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version, but at a fraction of the computational cost.
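The correlation-based downsampling the abstract mentions can be illustrated with a minimal greedy sketch: given a matrix of model scores per task, repeatedly pick the task least correlated with the tasks already kept, so the subset stays diverse. This is a hypothetical illustration on synthetic data, not the paper's exact algorithm; the function name, shapes, and selection heuristic are assumptions.

```python
# Hedged sketch of correlation-based task downsampling (synthetic data;
# not the paper's actual method). Idea: tasks whose scores correlate
# strongly with already-selected tasks add little information, so we
# greedily keep the least-redundant task at each step.
import numpy as np

def downsample_tasks(scores: np.ndarray, k: int) -> list[int]:
    """Pick k task indices from a (n_models, n_tasks) score matrix."""
    n_tasks = scores.shape[1]
    corr = np.corrcoef(scores.T)  # (n_tasks, n_tasks) task correlation
    # Seed with the task most representative of the benchmark overall.
    selected = [int(np.argmax(np.abs(corr).mean(axis=1)))]
    while len(selected) < k:
        remaining = [t for t in range(n_tasks) if t not in selected]
        # Redundancy = strongest correlation with any selected task.
        redundancy = [np.abs(corr[t, selected]).max() for t in remaining]
        selected.append(remaining[int(np.argmin(redundancy))])
    return sorted(selected)

rng = np.random.default_rng(0)
scores = rng.random((30, 12))  # 30 models x 12 tasks (synthetic)
subset = downsample_tasks(scores, k=4)
print(subset)
```

Preserving relative model rankings could then be checked by comparing, e.g., the Spearman correlation between mean model scores on the subset versus on the full task set.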
Submission history
From: Kenneth Enevoldsen
[v1] Wed, 19 Feb 2025 10:13:43 UTC (5,089 KB)
[v2] Tue, 8 Apr 2025 08:57:22 UTC (6,870 KB)
[v3] Sun, 8 Jun 2025 15:10:54 UTC (6,872 KB)
[v4] Thu, 13 Nov 2025 08:51:24 UTC (6,899 KB)


