
MMTEB inconsistency #2026

Closed
Muennighoff opened this issue Feb 10, 2025 · 10 comments

@Muennighoff
Contributor

From @jhyuklee

I have one quick question regarding the paper vs. the leaderboard, and it would be nice to get your answer on this.

The list of datasets in the paper (https://openreview.net/pdf?id=zl3pfz4VCV) doesn't seem to match the ones on the leaderboard (https://huggingface.co/spaces/mteb/leaderboard). For instance, the Task information tab on the leaderboard does not show ClimateFEVERHardNegatives, which you seem to have included in the paper appendix (Table 16). Should we just follow what is shown on the leaderboard?

Also, it was a bit confusing that "task" and "dataset" are used interchangeably on the leaderboard (e.g. BUCC should be a dataset name, not a task name, and "performance per dataset" would be more accurate below), which I think you can easily fix.

@KennethEnevoldsen KennethEnevoldsen self-assigned this Feb 10, 2025
@KennethEnevoldsen
Contributor

Assuming we are talking about MTEB(eng)? "ClimateFEVERHardNegatives" is included in the paper's MTEB(eng), which is also the case on the leaderboard:

[Image: screenshot of the leaderboard's Performance per Task table, including ClimateFEVERHardNegatives]

I will, however, double-check the tables regardless.

For the leaderboard, I will update most things to refer to tasks.

@jhyuklee

Hi, where is this table from?

I was mostly referring to the datasets in https://huggingface.co/spaces/mteb/leaderboard => Task information tab.

[Image: screenshot of the Task information tab on the leaderboard]

@jhyuklee

Oh, I didn't know that you can choose different benchmarks at the top. Do you provide any aggregate scores for all these different MTEBs? (one single score for all datasets in all MTEBs).

@Muennighoff
Contributor Author

Maybe we should change the UI to make that more prominent

@jhyuklee

Just leaving other discrepancies between the paper and the leaderboard:

Table 1 and L427 say MTEB(eng) has 40 tasks, and L259 says 26 tasks, but it actually has 41 tasks on the leaderboard and in Table 16.

MTEB(multilingual) also has 132 tasks on the leaderboard, but 131 tasks in the paper (Table 1 and L242).
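For reference, these counts can be cross-checked against the library itself. A minimal sketch, assuming the `mteb` Python package's `get_benchmark` helper and that the benchmark names below match the registered identifiers:

import mteb

# Benchmark names here are assumptions; mteb.get_benchmarks() lists the exact registered ones.
for name in ["MTEB(eng, v2)", "MTEB(Multilingual)"]:
    benchmark = mteb.get_benchmark(name)
    task_names = sorted(task.metadata.name for task in benchmark.tasks)
    print(f"{name}: {len(task_names)} tasks")
    print("ClimateFEVERHardNegatives included:", "ClimateFEVERHardNegatives" in task_names)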

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Feb 11, 2025

Hi, where is this table from?

This is from the "Performance per Task" section

Do you provide any aggregate scores for all these different MTEBs? (one single score for all datasets in all MTEBs).

Not currently, though we could create an Overview leaderboard (probably not for all of them, but for a selection).

Table 1 and L427 say MTEB(eng) has 40 tasks, and L259 says 26 tasks, but it actually has 41 tasks on the leaderboard and in Table 16.

Hmm, for the Table 1 case you might have had an early version (at least in the latest version the 26 has been corrected). Fixed the 131 though (we added MiraclRetrievalHardNegatives, as it was added after the initial paper submission).

For the overview tables (e.g. Table 16), it seems like the current tables were recreated from the wrong branch. I have recreated them to match the latest update. (@imenelydiaker, do you mind reviewing these? I believe you might have used an earlier version of v2.)

@jhyuklee, if you want the updated version of the paper, I will gladly forward you a copy.

@jhyuklee

jhyuklee commented Feb 11, 2025

That'd be great! I can review the updated version if you want (email: [email protected]).

I also wonder whether you intended MMTEB to refer only to MTEB(multilingual) or to the set of all MTEBs, including MTEB(eng), MTEB(Code), and so on. Since the leaderboard shows only MTEB(multilingual) by default, people tend to believe that MTEB(multilingual) is MMTEB, but after reading the paper I thought the latter was the case.

@KennethEnevoldsen
Contributor

Probably more the latter; making an overview page would be quite decent (see the sketch after the list below).

Something like:

MTEB(eng)
MTEB(Multilingual)
MTEB(Europe)
MTEB(Indic)
MTEB(Code)
MTEB(Law)
MTEB(Medical)
FollowIR
LongEmbed

You could argue for Chinese as well.
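A rough sketch of how such an overview could be assembled programmatically, assuming `mteb.get_benchmarks()` returns the registered Benchmark objects with `name` and `tasks` attributes:

import mteb

# Enumerate every registered benchmark so an overview page can pick a subset to display.
for benchmark in mteb.get_benchmarks():
    print(f"{benchmark.name}: {len(benchmark.tasks)} tasks")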

@danielcer

On the overview page, would it be possible to have some organization/grouping of the benchmarks?

For example, as one potential ordering: lead with the most general MTEBs, followed by specific domains (Code, Law, Medical), then specific regions, and then individual languages?

# Most general
MTEB(Multimodal)  # when released
MTEB(Multilingual)
...

# Specialization by Domain
MTEB(Code)
MTEB(Law)
MTEB(Medical)
FollowIR
LongEmbed

# Regions
MTEB(Europe)
MTEB(Indic)
...

# Languages
MTEB(Chinese)
MTEB(eng)
...

x-tabdeveloping pushed a commit that referenced this issue Feb 12, 2025
* Add SONAR metadata

Add SONAR metadata, but without an implementation

Fixes #1981

* fix: Add SONAR metadata

Fixes #1981

* minor edits

* reduced logging severity for missing model_meta.json

* resolve missing models

by ensuring that "Unknown" number of parameters is not filtered.

Should resolve:
#1979
#1976

This seems to have been caused by the restructuring of calls on the leaderboard.

* format

* resolve missing models

by ensuring that "Unknown" number of parameters is not filtered.

Should resolve:
#1979
#1976

This seems to have been caused by the `MAX_MODEL_SIZE/MIN_MODEL_SIZE` args.

* format

* format

* added memory usage

* fixed None checks

* consistently refer to tasks as tasks not as datasets

addresses #2026

* minor

* removed unused arg

* revert fix of not allowing None in model name
@KennethEnevoldsen
Contributor

I have added a new issue on this and will then close this one.
