yuval/wiki-sem-500

This repository contains the WikiSem500 dataset described in "Automated Generation of Multilingual Clusters for Word Embedding Evaluation" by Philip Blair, Yuval Merhav, and Joel Barry.

The test groups themselves can be found in wiki-sem-500.tar.gz (wiki-sem-500-tokenized.tar.gz is pre-tokenized). The structure of the archive is as follows:

wiki-sem-500
├── de
│   ├── Q101352.txt
│   ├── Q105000.txt
│   ├── Q1061151.txt
│   ├── Q1065118.txt
│   ...
├── en
│   ├── Q101352.txt
│   ...
├── es
│   ├── Q101352.txt
│   ...
├── ja
│   ├── Q101352.txt
│   ...
├── zh
│   ├── Q101352.txt
│   ...

Note that although many classes are available in multiple languages, many others are not.
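To see which languages cover a given class, one option is to index the archive members by class ID. The sketch below assumes member paths follow the `wiki-sem-500/<language>/<class>.txt` layout shown above; `languages_by_class` and `languages_by_class_in_archive` are hypothetical helper names, not part of the dataset release.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def languages_by_class(member_names):
    """Map each class ID (e.g. 'Q101352') to the set of language codes with a file for it.

    Assumes paths shaped like wiki-sem-500/<language>/<class>.txt.
    """
    index = defaultdict(set)
    for name in member_names:
        path = PurePosixPath(name)
        # Skip directory entries and anything that is not a .txt test group.
        if path.suffix != ".txt" or len(path.parts) < 2:
            continue
        # The file stem is the class ID; the parent directory is the language.
        index[path.stem].add(path.parent.name)
    return dict(index)


def languages_by_class_in_archive(archive_path):
    """Build the same index directly from a .tar.gz archive on disk."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return languages_by_class(m.name for m in tar.getmembers() if m.isfile())
```

For example, `languages_by_class_in_archive("wiki-sem-500.tar.gz")` would return a dictionary whose values can be filtered to find the classes present in every language of interest.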

Each file contains a cluster, one entry per line, followed by a blank line and then a sequence of one or more outliers:

$ cat en/Q1060829.txt

Madison_Square_Garden
Walt_Disney_Concert_Hall
Olympia
Kodak_Theatre
Carnegie_Hall
Auditorio_de_Tenerife
Royal_Albert_Hall
Palau_de_la_Música_Catalana

CBGB
Buena_Vista_Social_Club
Arena_di_Verona
Barbican_Centre
RMS
HMHS
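A minimal sketch for reading one test group, assuming (as the example above suggests) that the first blank line inside the file separates the cluster from the outliers. `parse_test_group` is a hypothetical helper name, not something shipped with the dataset.

```python
def parse_test_group(text):
    """Split the contents of one test group file into (cluster, outliers).

    Assumes one entry per line, with the first blank line separating
    the cluster from the outliers.
    """
    # Drop leading/trailing blank lines, then normalise each line.
    lines = [line.strip() for line in text.strip().splitlines()]
    if "" not in lines:
        # No separator found: treat everything as the cluster.
        return lines, []
    split = lines.index("")
    cluster = lines[:split]
    # Ignore any additional blank lines after the separator.
    outliers = [line for line in lines[split + 1:] if line]
    return cluster, outliers
```

Entries can contain non-ASCII characters (for example `Palau_de_la_Música_Catalana`), so when reading from disk it is safest to pass an explicit encoding. UTF-8 is assumed here:

```python
from pathlib import Path

cluster, outliers = parse_test_group(
    Path("wiki-sem-500/en/Q1060829.txt").read_text(encoding="utf-8")
)
```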
