Sign language translation dataset using SignWriting

huijelee/signbank-plus

 
 


SignBank+ - Cleaning and Extending the SignBank Dataset

Paper

The SignBank dataset is a community-contributed collection of SignWriting examples. It is a great resource for SignWriting, but it is not immediately fit for machine translation: many SignWriting entries are paired with text that is not parallel, or with multiple terms of which only some are parallel. For example, an entry's text may give a chapter and page number for a book but not the text itself, or a word together with its definition.

Data

Files

This repository includes the data directory, which includes:

  • raw.csv - The raw SignBank dataset (until June 2023).
  • manually-cleaned.csv - A manually cleaned subset of the raw dataset.
  • bible.csv - Automatically aligned data from the Bible for puddles 151 and 152.
  • gpt-3.5-cleaned.csv - A cleaned subset of the raw dataset, using GPT-3.5 Turbo (June 13).
  • gpt-3.5-expanded.csv - Expansion of the cleaned dataset, filtered to the source language.
  • gpt-3.5-expanded.en.csv - Expansion of the cleaned dataset, with English terms.
  • benchmark.csv - Small subset of data manually annotated and automatically cleaned in various ways.
  • fingerspelling/*.txt - Includes fingerspellings for various languages. fingerspelling.py is used to generate fingerspelling from words (see fingerspelling.csv).
  • signsuisse.csv - Automatically aligned data from the French Sign Language of Switzerland dictionary and SignBank.
  • sign2mint.csv - Extra German Sign Language SignWriting data from Sign2Mint.

Notes:

  • Terms are separated by the ᛫ (U+16EB) character.
  • Newline characters (\n) are escaped as \\n.
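
The two conventions above can be handled with a small helper when reading the CSV files. The following is a minimal sketch, not code from this repository: the function names `decode_terms` and `encode_terms` are hypothetical, and it assumes a text cell holds zero or more terms joined by U+16EB with newlines escaped as a literal backslash followed by `n`.

```python
# Hypothetical helpers (not part of this repository) for the two notes above.
TERM_SEPARATOR = "\u16eb"  # U+16EB, RUNIC SINGLE PUNCTUATION


def decode_terms(cell: str) -> list[str]:
    """Split a text cell into terms and restore escaped newlines."""
    terms = []
    for raw in cell.split(TERM_SEPARATOR):
        raw = raw.strip()
        if raw:  # skip empty terms, e.g. from a trailing separator
            terms.append(raw.replace("\\n", "\n"))
    return terms


def encode_terms(terms: list[str]) -> str:
    """Inverse of decode_terms: escape newlines, then join with U+16EB."""
    return TERM_SEPARATOR.join(t.replace("\n", "\\n") for t in terms)


if __name__ == "__main__":
    cell = "hello\u16ebline one\\nline two"
    print(decode_terms(cell))
```

Note that this naive unescaping cannot distinguish an escaped newline from a literal backslash followed by `n` in the original text; whether that case occurs in the data is not stated here.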

Fingerspelling

Using the fingerspelling script, you can generate SignWriting fingerspelling from words. This is useful for generating data for fingerspelling translation.

The fingerspelling_faker script generates synthetic fingerspelling data for machine translation.
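
To illustrate the general idea of turning a word into a fingerspelled sequence, here is a minimal sketch. It is not the repository's fingerspelling script: the function name `fingerspell`, the chart format (a dict from grapheme to sign string), and the greedy longest-match strategy are all assumptions, and the chart values are placeholders rather than real SignWriting.

```python
# Illustrative sketch only; the chart format and matching strategy are assumed.
def fingerspell(word: str, chart: dict[str, str]) -> list[str]:
    """Map a word to a list of signs using greedy longest-match lookup."""
    # Try longer graphemes first so that e.g. "ch" wins over "c" + "h".
    keys_by_length = sorted((k for k in chart if k), key=len, reverse=True)
    text = word.lower()
    signs = []
    i = 0
    while i < len(text):
        for key in keys_by_length:
            if text.startswith(key, i):
                signs.append(chart[key])
                i += len(key)
                break
        else:
            # No chart entry covers the character at position i.
            raise KeyError(f"no fingerspelling entry for {text[i]!r}")
    return signs


# Placeholder values; a real chart would hold SignWriting strings.
toy_chart = {"a": "<sign:a>", "c": "<sign:c>", "h": "<sign:h>", "ch": "<sign:ch>"}

if __name__ == "__main__":
    print(fingerspell("Cha", toy_chart))
```

Longest-match lookup is chosen here because some manual alphabets have signs for multi-letter graphemes; a strictly per-character lookup would be simpler if that is not needed.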

Machine Translation

Using the signbank_plus/prep_nmt.py script, we can prepare the data for machine translation training, in the data/parallel directory.

The signbank_plus/nmt directory includes scripts for training and evaluating machine translation systems with toolkits and models such as Fairseq, Sockeye, OpenNMT, and mT5.

Benchmarking Cleaning

Using the benchmark.csv file, we run signbank_plus/score_benchmark.py to measure how well various automatic cleaning methods perform.
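
As a rough illustration of an IoU (intersection over union) score for cleaning, the sketch below compares the set of terms a method predicts against the manually annotated terms. This is an assumption about how such a score can be computed, not a copy of score_benchmark.py; the names `term_iou` and `mean_iou` are hypothetical, and the actual script may normalize or match terms differently.

```python
# Hypothetical sketch of a term-level IoU; not taken from score_benchmark.py.
def term_iou(predicted: list[str], gold: list[str]) -> float:
    """Intersection over union of two term sets."""
    pred_set, gold_set = set(predicted), set(gold)
    union = pred_set | gold_set
    if not union:
        # Both sides empty: treat as perfect agreement (a design choice).
        return 1.0
    return len(pred_set & gold_set) / len(union)


def mean_iou(pairs: list[tuple[list[str], list[str]]]) -> float:
    """Average IoU over (predicted, gold) pairs; pairs must be non-empty."""
    scores = [term_iou(pred, gold) for pred, gold in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    example = [(["hello", "hi"], ["hello"]), (["book"], ["book"])]
    print(mean_iou(example))
```

Under this definition, a method that keeps a non-parallel term alongside the correct one is penalized through the larger union, and a method that drops a correct term is penalized through the smaller intersection.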

The results at the moment are:

| Method | IoU | Average Tokens |
|---|---|---|
| E0: texts | 0.497 | 0.0 |
| E1: pred_rules | 0.533 | 0.0 |
| E2: pred_general | 0.627 | 541.4 |
| E3: pred_specific_5 | 0.712 | 521.4 |
| E4: pred_general_specific_5 | 0.735 | 714.6 |
| E5: pred_general_specific_5_gpt_4 | 0.801 | 713.2 |

GPT-4 (E5, IoU 0.801) is better than GPT-3.5 Turbo (E4, IoU 0.735), but it is also much more expensive.

The cleaning could likely be improved further, for example by using a better model or a better prompt.
