language_translation has a typo that makes the loaded tgt tensor invalid #1355

Closed
@zwzmzd

In the _yield_tokens implementation in src/data.py, the third argument, src, is expected to be True or False:

# Turns an iterable into a generator
def _yield_tokens(iterable_data, tokenizer, src):

    # Iterable data stores the samples as (src, tgt) so this will help us select just one language or the other
    index = 0 if src else 1

    for data in iterable_data:
        yield tokenizer(data[index])

But the argument actually passed is a str (e.g. 'de' or 'en'). Any non-empty string is truthy, so index is always 0 and _yield_tokens always builds the tgt vocab from src tokens. As a result, the loaded tgt tensor is wrong:

    tgt_vocab = build_vocab_from_iterator(
        _yield_tokens(train_iterator, tgt_tokenizer, tgt_lang),  # <-- tgt_lang is 'de' or 'en', not a bool
        min_freq=1,
        specials=list(special_symbols.keys()),
        special_first=True
    )

Example of a wrong tgt tensor, with too many 0 values (0 is the unknown-token index):

tensor([[   2,    2,    2,    2,    2,    2,    2,    2],
        [   0,    0,    0,    0,    0,    0,    0,    0],
        [   0,    0,    0,    0,    0,    0,    0,    0],
        [   0,    0,    0,    7,    0,    7,    0,    0],
        [   0,    0,    0,    0, 3425,    0,    0,    0],
        [   0,    0,    7,    0,    0,    0,    0,    0],
        [   0,    0,    0,    0,    0,    0,    0,    0],
        [   0,    0,    0,    0,    0,    0,   28,    0],
        [   7,    5,    0,    0,    0,   15,    5,    0],
        [   0,    3,    0,    0,    0,    0,    3,    0],
        [   0,    1,    5,    0,    5,    0,    1,    0],
        [   0,    1,    3,    0,    3,    0,    1, 5315],
        [   0,    1,    1,    0,    1,    0,    1,    0],
        [   5,    1,    1,    0,    1,    0,    1,    0],
        [   3,    1,    1,    0,    1,    0,    1,    5],
        [   1,    1,    1,    5,    1,    0,    1,    3],
        [   1,    1,    1,    3,    1,    5,    1,    1],
        [   1,    1,    1,    1,    1,    3,    1,    1]], device='cuda:0')
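
The pattern above can be imitated with a small pure-Python sketch: target sentences encoded against a vocab built from source tokens mostly fall back to the unknown index. The special-symbol order (<unk>=0, <pad>=1, <bos>=2, <eos>=3) is an assumption that matches the tensor shown; build_vocab and encode are hypothetical stand-ins, not torchtext functions.

```python
# Sketch only: a dict-based vocab standing in for the torchtext vocab.

SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]  # assumed order: 0, 1, 2, 3


def build_vocab(token_lists):
    # Specials first, then tokens in order of first appearance.
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    # Out-of-vocabulary tokens map to the <unk> index.
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokens]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]


src_sentences = [["ein", "Hund", "."], ["eine", "Katze", "."]]
tgt_sentences = [["a", "dog", "."], ["a", "cat", "."]]

wrong_vocab = build_vocab(src_sentences)  # what the bug effectively builds
right_vocab = build_vocab(tgt_sentences)  # what was intended

wrong_ids = encode(tgt_sentences[0], wrong_vocab)
right_ids = encode(tgt_sentences[0], right_vocab)
```

Here wrong_ids is [2, 0, 0, 6, 3]: only the shared "." token gets a real index. That may also explain the few non-zero values in the tensor above, which could be punctuation or words that happen to occur in both languages.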
