Merge pull request huggingface#512 from nuass/main
docs(zh-cn): Reviewed No. 23 - What is dynamic padding?
xianbaoqian authored Feb 27, 2023
2 parents b03d1b1 + 593070d commit ace04ee
Showing 1 changed file with 33 additions and 31 deletions.
64 changes: 33 additions & 31 deletions subtitles/zh-CN/23_what-is-dynamic-padding.srt
@@ -15,12 +15,12 @@ In the "Batching Inputs together" video,

4
00:00:10,890 --> 00:00:12,720
我们已经看到能够对输入进行分组
我们已经看到为了能够对(不同长度的同批次)输入进行分组
we have seen that to be able to group inputs

5
00:00:12,720 --> 00:00:15,300
同一批不同长度的,
同一批不同长度的
of different lengths in the same batch,

6
@@ -40,12 +40,12 @@ Here, for instance, the longest sentence is the third one,

9
00:00:24,600 --> 00:00:27,270
我们需要添加五个、两个或七个填充令牌
我们需要添加五个、两个或七个填充标记
and we need to add five, two, or seven pad tokens

10
00:00:27,270 --> 00:00:30,090
到其他句子有四个句子
到其他句子使得四个句子具有
to the other sentences to have four sentences

11
@@ -65,12 +65,12 @@ there are various padding strategies we can apply.

14
00:00:37,560 --> 00:00:39,540
最明显的一种是填充所有元素
最明显的一种是填充整个数据集所有的样本
The most obvious one is to pad all the elements

15
00:00:39,540 --> 00:00:40,923
数据集的相同长度
达到相同的长度
of the dataset to the same length:

16
@@ -80,67 +80,67 @@ the length of the longest sample.

17
00:00:44,070 --> 00:00:45,330
这会给我们批次
我们得到具有相同形状的批次
This will then give us batches

18
00:00:45,330 --> 00:00:46,890
都具有相同的形状

that all have the same shape

19
00:00:46,890 --> 00:00:49,800
由最大序列长度决定。
(其长度)由最大序列长度决定。
determined by the maximum sequence length.

20
00:00:49,800 --> 00:00:52,893
缺点是批次由短句组成
缺点是(如果)批次样本由短句组成
The downside is that batches composed from short sentences

21
00:00:52,893 --> 00:00:54,960
会有很多填充令牌
将带来很多填充符号
will have a lot of padding tokens

22
00:00:54,960 --> 00:00:57,660
这将在模型中引入更多计算
并且在模型中引入更多不必要的计算。
which will introduce more computations in the model

23
00:00:57,660 --> 00:00:58,910
我们最终不需要。

we ultimately don't need.

24
00:01:00,060 --> 00:01:03,300
为了避免这种情况,另一种策略是填充元素
为了避免这种情况,另一种策略是填充(较短的)样本
To avoid this, another strategy is to pad the elements

25
00:01:03,300 --> 00:01:05,280
当我们把它们批在一起时
当把它们放在一批时
when we batch them together,

26
00:01:05,280 --> 00:01:08,190
到批次中最长的句子
达到本批次中最长句子的长度
to the longest sentence inside the batch.

27
00:01:08,190 --> 00:01:12,000
这样,由短输入组成的批次会更小
这样,由短样本输入组成的批次
This way, batches composed of short inputs will be smaller

28
00:01:12,000 --> 00:01:13,920
比包含最长句子的批次
会比按整个数据集最长句子的长度(补齐)的批次更小
than the batch containing the longest sentence

29
00:01:13,920 --> 00:01:15,510
在数据集中。

in the dataset.

30
@@ -155,7 +155,7 @@ The downside is that all batches

32
00:01:20,490 --> 00:01:22,140
然后会有不同的形状
会有不同的形状
will then have different shapes,

33
@@ -170,7 +170,7 @@ Let's see how to apply both strategies in practice.

35
00:01:29,370 --> 00:01:31,280
我们实际上已经看到了如何应用固定填充
我们实际上已经知道了如何使用固定填充
We have actually seen how to apply fixed padding

36
@@ -190,22 +190,22 @@ after loading the dataset and tokenizer,

39
00:01:38,250 --> 00:01:40,680
我们将标记化应用于所有数据集
我们将符号化应用于所有数据集
we applied the tokenization to all the dataset

40
00:01:40,680 --> 00:01:42,480
带填充和截断
包括填充和截断
with padding and truncation

41
00:01:42,480 --> 00:01:45,273
制作所有长度为 128 的样本
保证所有样本的长度为 128 。
to make all samples of length 128.

42
00:01:46,530 --> 00:01:48,360
结果,如果我们传递这个数据集
最后,如果我们传递这个数据集
As a result, if we pass this dataset

43
@@ -215,7 +215,8 @@ to a PyTorch DataLoader,

44
00:01:50,520 --> 00:01:55,503
我们得到形状批量大小的批次,这里是 16,乘以 128。
我们得到形状为 batch_size(这里是 16)乘以 128 的批次。

we get batches of shape batch size, here 16, by 128.
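
In code, the fixed-padding setup these subtitles describe looks roughly like the sketch below; the GLUE MRPC dataset and the bert-base-uncased checkpoint are illustrative assumptions, not names quoted from the video.

```python
# Minimal sketch of fixed padding: every sample is padded/truncated to length 128,
# so every batch coming out of the DataLoader has the same shape.
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")                      # assumed dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint

def tokenize_function(examples):
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
batch = next(iter(train_dataloader))
print(batch["input_ids"].shape)  # torch.Size([16, 128]) for every batch
```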

45
@@ -230,7 +231,7 @@ we must defer the padding to the batch preparation,

47
00:02:01,440 --> 00:02:04,740
所以我们从标记化函数中删除了那部分
所以我们从标记函数中删除了那部分
so we remove that part from our tokenize function.

48
@@ -295,7 +296,7 @@ We pass it to the PyTorch DataLoader as a collate function,

60
00:02:35,310 --> 00:02:37,620
然后观察生成的批次
然后观察到生成的批次
then observe that the batches generated

61
@@ -310,12 +311,13 @@ all way below the 128 from before.

63
00:02:42,660 --> 00:02:44,820
动态批处理几乎总是更快
动态批处理在 CPU 和 GPU 上几乎总是更快,

Dynamic batching will almost always be faster

64
00:02:44,820 --> 00:02:47,913
在 CPU 和 GPU 上,所以如果可以的话你应该应用它。
所以如果可以的话你应该应用它。
on CPUs and GPUs, so you should apply it if you can.
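
A corresponding sketch of the dynamic-padding setup: padding is dropped from the tokenize function and deferred to a DataCollatorWithPadding passed to the DataLoader as collate_fn, so each batch is padded only to its own longest sample (same assumed dataset and checkpoint as above).

```python
# Minimal sketch of dynamic padding with DataCollatorWithPadding.
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")                      # assumed dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint

def tokenize_function(examples):
    # No padding here: each sample keeps its own length.
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# The collator pads each batch to its longest sample and returns PyTorch tensors.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,
)

for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)  # second dimension varies per batch, usually well below 128
    if step == 4:
        break
```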

65
@@ -330,7 +332,7 @@ if you run your training script on TPU

67
00:02:53,490 --> 00:02:55,293
或者需要成批的固定形状
或者需要固定形状的批次输入
or need batches of fixed shapes.

68