43_why-are-fast-tokenizers-called-fast.srt
1
00:00:00,418 --> 00:00:03,251
(dramatic whoosh)
2
00:00:05,340 --> 00:00:08,460
- Why are fast tokenizers called fast?
3
00:00:08,460 --> 00:00:10,950
In this video, we'll see
exactly how much faster
4
00:00:10,950 --> 00:00:13,800
the so-called fast
tokenizers are compared
5
00:00:13,800 --> 00:00:15,153
to their slow counterparts.
6
00:00:16,200 --> 00:00:19,260
For this benchmark, we'll
use the GLUE MNLI dataset
7
00:00:19,260 --> 00:00:23,160
which contains 432,000 pairs of text.
8
00:00:23,160 --> 00:00:25,890
We'll see how long it takes
for the fast and slow versions
9
00:00:25,890 --> 00:00:28,143
of a BERT tokenizer to process them all.
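(For reference, a minimal sketch of loading that dataset with the datasets library; printing the splits is an illustrative addition, not part of the video.)

from datasets import load_dataset

# GLUE MNLI: premise/hypothesis pairs used for the benchmark
raw_datasets = load_dataset("glue", "mnli")
print(raw_datasets)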
10
00:00:29,670 --> 00:00:31,380
We define our fast and
slow tokenizers
11
00:00:31,380 --> 00:00:33,717
using the AutoTokenizer API.
12
00:00:33,717 --> 00:00:37,110
The fast tokenizer is the
default when available.
13
00:00:37,110 --> 00:00:40,443
So we pass along use_fast=False
to define the slow one.
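(A possible way to define both tokenizers; the bert-base-cased checkpoint is an assumed example, any BERT checkpoint behaves the same way.)

from transformers import AutoTokenizer

# Fast (Rust-backed) tokenizer: the default when one is available
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Slow (pure-Python) tokenizer: request it explicitly with use_fast=False
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)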
14
00:00:41,430 --> 00:00:43,530
In a notebook, we can time the execution
15
00:00:43,530 --> 00:00:46,800
of a cell with the %time
magic command, like this.
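(Continuing the sketch above, one way to reproduce this timing in a notebook; the function name and the truncation flag are illustrative assumptions.)

def tokenize_one_by_one(example):
    # Called once per example: each premise/hypothesis pair is tokenized on its own
    return fast_tokenizer(example["premise"], example["hypothesis"], truncation=True)

# %time is an IPython magic that reports how long the statement takes
%time tokenized = raw_datasets.map(tokenize_one_by_one)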
16
00:00:46,800 --> 00:00:49,350
Processing the whole dataset
is four times faster
17
00:00:49,350 --> 00:00:50,970
with a fast tokenizer.
18
00:00:50,970 --> 00:00:54,000
That's quicker indeed,
but not very impressive.
19
00:00:54,000 --> 00:00:55,380
This is because we passed along the texts
20
00:00:55,380 --> 00:00:57,240
to the tokenizer one at a time.
21
00:00:57,240 --> 00:00:59,730
This is a common mistake
with fast tokenizers
22
00:00:59,730 --> 00:01:02,550
which are backed by Rust,
and thus able to parallelize
23
00:01:02,550 --> 00:01:05,370
the tokenization of multiple texts.
24
00:01:05,370 --> 00:01:07,290
Passing them only one text at a time
25
00:01:07,290 --> 00:01:09,720
is like sending a cargo
ship between two continents
26
00:01:09,720 --> 00:01:13,140
with just one container;
it's very inefficient.
27
00:01:13,140 --> 00:01:15,810
To unleash the full speed
of our fast tokenizers,
28
00:01:15,810 --> 00:01:18,840
we need to send them batches
of texts, which we can do
29
00:01:18,840 --> 00:01:21,423
with the batched=True
argument of the map method.
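(Again as a sketch continuing from above: the same tokenizer call works on batches, since it accepts lists of texts.)

def tokenize_function(examples):
    # With batched=True, examples["premise"] is a list of texts, so the Rust
    # backend can tokenize the whole batch in parallel
    return fast_tokenizer(examples["premise"], examples["hypothesis"], truncation=True)

%time tokenized = raw_datasets.map(tokenize_function, batched=True)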
30
00:01:22,620 --> 00:01:25,950
Now those are impressive
results, as the fast tokenizer
31
00:01:25,950 --> 00:01:28,410
takes 12 seconds to process
the dataset that takes four
32
00:01:28,410 --> 00:01:30,093
minutes with the slow tokenizer.
33
00:01:31,440 --> 00:01:33,510
Summarizing the results in this table,
34
00:01:33,510 --> 00:01:36,630
you can see why we have
called those tokenizers fast.
35
00:01:36,630 --> 00:01:38,760
And this is only for tokenizing texts.
36
00:01:38,760 --> 00:01:40,710
If you ever need to train a new tokenizer,
37
00:01:40,710 --> 00:01:42,523
they do this very quickly too.