
MMTEB inconsistency #2026

Closed
Muennighoff opened this issue Feb 10, 2025 · 10 comments

@Muennighoff
Contributor

From @jhyuklee

I have one quick question regarding the paper vs. the leaderboard, and it would be nice to get your answer on this.

The list of datasets in the paper (https://openreview.net/pdf?id=zl3pfz4VCV) doesn't seem to match the ones on the leaderboard (https://huggingface.co/spaces/mteb/leaderboard). For instance, the Task information tab on the leaderboard does not show ClimateFEVERHardNegatives, which you seem to have included in the paper appendix (Table 16). Should we just follow what is shown on the leaderboard?

Also, it was a bit confusing that "task" and "dataset" are used interchangeably on the leaderboard (e.g. BUCC should be a dataset name, not a task name, and "performance per dataset" would be more accurate below), which I think you can easily fix.

@KennethEnevoldsen KennethEnevoldsen self-assigned this Feb 10, 2025
@KennethEnevoldsen
Contributor

Assuming we are talking about MTEB(eng)? "ClimateFEVERHardNegatives" is included in the paper's MTEB(eng), which is also the case on the leaderboard:

[Image: screenshot of the leaderboard's Performance per Task table, including ClimateFEVERHardNegatives]

I will, however, double-check the tables regardless.

For the leaderboard, I will update most things to refer to tasks.

@jhyuklee

Hi, where is this table from?

I was mostly referring to the datasets in https://huggingface.co/spaces/mteb/leaderboard => Task information tab.

[Image: screenshot of the Task information tab on the leaderboard]

@jhyuklee

Oh, I didn't know that you can choose different benchmarks at the top. Do you provide any aggregate scores for all these different MTEBs? (one single score for all datasets in all MTEBs).

@Muennighoff
Contributor Author

Maybe we should change the UI to make that more prominent

@jhyuklee

Just leaving other discrepancies between the paper and the leaderboard:

Table 1 and L427 say MTEB(eng) has 40 tasks, and L259 says 26 tasks, but it actually has 41 tasks on the leaderboard and in Table 16.

MTEB(multilingual) also has 132 tasks on the leaderboard, but 131 tasks in the paper (Table 1 and L242).
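For reference, these counts can be cross-checked against the library itself. A minimal sketch, assuming the `mteb` Python package's `get_benchmark` helper and that the benchmark names below match the registered identifiers:

import mteb

# Benchmark names here are assumptions; mteb.get_benchmarks() lists the exact registered ones.
for name in ["MTEB(eng, v2)", "MTEB(Multilingual)"]:
    benchmark = mteb.get_benchmark(name)
    task_names = sorted(task.metadata.name for task in benchmark.tasks)
    print(f"{name}: {len(task_names)} tasks")
    print("ClimateFEVERHardNegatives included:", "ClimateFEVERHardNegatives" in task_names)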

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Feb 11, 2025

Hi, where is this table from?

This is from the "Performance per Task" section

Do you provide any aggregate scores for all these different MTEBs? (one single score for all datasets in all MTEBs).

Not currently, though we could create an Overview leaderboard (probably not for all of them, but for a selection).

Table 1 and L427 say MTEB(eng) has 40 tasks, and L259 says 26 tasks, but it actually has 41 tasks on the leaderboard and in Table 16.

Hmm, for the Table 1 case you might have had an early version (at least in the latest version the 26 has been corrected). Fixed the 131 though (we added MiraclRetrievalHardNegatives, as it was added after the initial paper submission).

For the overview tables (e.g. Table 16), it seems like the current tables were recreated from the wrong branch. I have recreated them to match the latest update. (@imenelydiaker, do you mind reviewing these? I believe you might have used an earlier version of v2.)

@jhyuklee, if you want the updated version of the paper, I will gladly forward you a copy.

@jhyuklee

jhyuklee commented Feb 11, 2025

That'd be great! I can review the updated version if you want (email: [email protected]).

I also wonder whether you intended MMTEB to refer only to MTEB(multilingual) or to the set of all MTEBs, including MTEB(eng), MTEB(Code), and so on. Since the leaderboard shows only MTEB(multilingual) by default, people tend to believe that MTEB(multilingual) is MMTEB, but after reading the paper I thought the latter was the case.

@KennethEnevoldsen
Contributor

Probably more the latter; making an overview page would be quite decent (see the sketch after the list below).

Something like:

MTEB(eng)
MTEB(Multilingual)
MTEB(Europe)
MTEB(Indic)
MTEB(Code)
MTEB(Law)
MTEB(Medical)
FollowIR
LongEmbed

You could argue for Chinese as well.
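A rough sketch of how such an overview could be assembled programmatically, assuming `mteb.get_benchmarks()` returns the registered Benchmark objects with `name` and `tasks` attributes:

import mteb

# Enumerate every registered benchmark so an overview page can pick a subset to display.
for benchmark in mteb.get_benchmarks():
    print(f"{benchmark.name}: {len(benchmark.tasks)} tasks")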

@danielcer

On the overview page, would it be possible to have some organization/grouping of the benchmarks?

For example, as one potential ordering: lead with the most general MTEBs, followed by specific domains (Code, Law, Medical), then specific regions, and then individual languages?

# Most general
MTEB(Multimodal)  # when released
MTEB(Multilingual)
...

# Specialization by Domain
MTEB(Code)
MTEB(Law)
MTEB(Medical)
FollowIR
LongEmbed

# Regions
MTEB(Europe)
MTEB(Indic)
...

# Languages
MTEB(Chinese)
MTEB(eng)
...

x-tabdeveloping pushed a commit that referenced this issue Feb 12, 2025
* Add SONAR metadata

Add SONAR metadata, but without an implementation

Fixes #1981

* fix: Add SONAR metadata

Fixes #1981

* minor edits

* reduced logging severity for missing model_meta.json

* resolve missing models

by ensuring that "Unknown" number of parameters is not filtered.

Should resolve:
#1979
#1976

This seems to have been caused by the restructuring of calls on the leaderboard.

* format

* resolve missing models

by ensuring that "Unknown" number of parameters is not filtered.

Should resolve:
#1979
#1976

This seems to have been caused by the `MAX_MODEL_SIZE/MIN_MODEL_SIZE` args.

* format

* format

* added memory usage

* fixed None checks

* consistently refer to tasks as tasks not as datasets

addresses #2026

* minor

* removed unused arg

* revert fix of not allowing None in model name
@KennethEnevoldsen
Contributor

I have added a new issue on this and will then close this one.
