Closed
Description
In the `_yield_tokens` implementation in src/data.py, the third argument `src` is expected to be True or False:
```python
# Turns an iterable into a generator
def _yield_tokens(iterable_data, tokenizer, src):
    # iterable_data stores the samples as (src, tgt), so this selects one language or the other
    index = 0 if src else 1
    for data in iterable_data:
        yield tokenizer(data[index])
```
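The problem can be reproduced in isolation (a minimal sketch; `pick_index` is a hypothetical helper mirroring the index selection above):

```python
# Any non-empty string is truthy, so `0 if src else 1` always selects
# index 0 when a language code is passed instead of a boolean.
def pick_index(src):
    return 0 if src else 1

# Intended usage: a boolean selects the correct side of the (src, tgt) pair.
print(pick_index(True))    # 0 -> source side, as intended
print(pick_index(False))   # 1 -> target side, as intended

# Actual usage: a language string such as 'de' or 'en' is always truthy,
# so the target side is never selected.
print(pick_index('de'))    # 0
print(pick_index('en'))    # 0 -- should have been 1 for the target language
```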
But the argument actually passed is a str (e.g. 'de' or 'en'), which is always truthy, so `_yield_tokens` always builds the tgt vocab from src tokens, and the loaded tgt tensors are wrong:
```python
tgt_vocab = build_vocab_from_iterator(
    _yield_tokens(train_iterator, tgt_tokenizer, tgt_lang),  # <-- tgt_lang is 'de' or 'en'
    min_freq=1,
    specials=list(special_symbols.keys()),
    special_first=True,
)
```
Example of a wrong tgt tensor, with far too many 0 values (0 means unknown):
```python
tensor([[ 2, 2, 2, 2, 2, 2, 2, 2],
        [ 0, 0, 0, 0, 0, 0, 0, 0],
        [ 0, 0, 0, 0, 0, 0, 0, 0],
        [ 0, 0, 0, 7, 0, 7, 0, 0],
        [ 0, 0, 0, 0, 3425, 0, 0, 0],
        [ 0, 0, 7, 0, 0, 0, 0, 0],
        [ 0, 0, 0, 0, 0, 0, 0, 0],
        [ 0, 0, 0, 0, 0, 0, 28, 0],
        [ 7, 5, 0, 0, 0, 15, 5, 0],
        [ 0, 3, 0, 0, 0, 0, 3, 0],
        [ 0, 1, 5, 0, 5, 0, 1, 0],
        [ 0, 1, 3, 0, 3, 0, 1, 5315],
        [ 0, 1, 1, 0, 1, 0, 1, 0],
        [ 5, 1, 1, 0, 1, 0, 1, 0],
        [ 3, 1, 1, 0, 1, 0, 1, 5],
        [ 1, 1, 1, 5, 1, 0, 1, 3],
        [ 1, 1, 1, 3, 1, 5, 1, 1],
        [ 1, 1, 1, 1, 1, 3, 1, 1]], device='cuda:0')
```