GitHub - yuval/wiki-sem-500: Release of the WikiSem500 dataset

This repository contains the WikiSem500 dataset described in "Automated Generation of Multilingual Clusters for Word Embedding Evaluation" by Philip Blair, Yuval Merhav, and Joel Barry.

The test groups themselves can be found in wiki-sem-500.tar.gz (wiki-sem-500-tokenized.tar.gz is pre-tokenized). The structure of the archive is as follows:

wiki-sem-500
├── de
│   ├── Q101352.txt
│   ├── Q105000.txt
│   ├── Q1061151.txt
│   ├── Q1065118.txt
│   ...
├── en
│   ├── Q101352.txt
│   ...
├── es
│   ├── Q101352.txt
│   ...
├── ja
│   ├── Q101352.txt
│   ...
├── zh
│   ├── Q101352.txt
│   ...

Note that while many classes are available in multiple languages, there are many that are not.

Each file contains a cluster, followed by a sequence of one or more outliers:

$ cat en/Q1060829.txt

Madison_Square_Garden
Walt_Disney_Concert_Hall
Olympia
Kodak_Theatre
Carnegie_Hall
Auditorio_de_Tenerife
Royal_Albert_Hall
Palau_de_la_Música_Catalana

CBGB
Buena_Vista_Social_Club
Arena_di_Verona
Barbican_Centre
RMS
HMHS

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md
wiki-sem-500-tokenized.tar.gz		wiki-sem-500-tokenized.tar.gz
wiki-sem-500.tar.gz		wiki-sem-500.tar.gz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

yuval/wiki-sem-500

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages