43_why-are-fast-tokenizers-called-fast.srt
1
00:00:00,418 --> 00:00:03,251
(dramatic whoosh)
2
00:00:05,340 --> 00:00:08,460
- Why are fast tokenizers called fast?
3
00:00:08,460 --> 00:00:10,950
In this video, we'll see
exactly how much faster
4
00:00:10,950 --> 00:00:13,800
the so-called fast
tokenizers are compared
5
00:00:13,800 --> 00:00:15,153
to their slow counterparts.
6
00:00:16,200 --> 00:00:19,260
For this benchmark, we'll
use the GLUE MNLI dataset
7
00:00:19,260 --> 00:00:23,160
which contains 432,000 pairs of text.
8
00:00:23,160 --> 00:00:25,890
We'll see how long it takes
for the fast and slow versions
9
00:00:25,890 --> 00:00:28,143
of a BERT tokenizer to process them all.
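(For reference, a minimal sketch of loading that dataset with the datasets library; printing the splits is an illustrative addition, not part of the video.)

from datasets import load_dataset

# GLUE MNLI: premise/hypothesis pairs used for the benchmark
raw_datasets = load_dataset("glue", "mnli")
print(raw_datasets)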
10
00:00:29,670 --> 00:00:31,380
We define our fast and
slow tokenizers
11
00:00:31,380 --> 00:00:33,717
using the AutoTokenizer API.
12
00:00:33,717 --> 00:00:37,110
The fast tokenizer is the
default when available.
13
00:00:37,110 --> 00:00:40,443
So we pass along use_fast=False
to define the slow one.
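(A possible way to define both tokenizers; the bert-base-cased checkpoint is an assumed example, any BERT checkpoint behaves the same way.)

from transformers import AutoTokenizer

# Fast (Rust-backed) tokenizer: the default when one is available
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Slow (pure-Python) tokenizer: request it explicitly with use_fast=False
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)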
14
00:00:41,430 --> 00:00:43,530
In a notebook, we can time the execution
15
00:00:43,530 --> 00:00:46,800
of a cell with the %time
magic command, like this.
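(Continuing the sketch above, one way to reproduce this timing in a notebook; the function name and the truncation flag are illustrative assumptions.)

def tokenize_one_by_one(example):
    # Called once per example: each premise/hypothesis pair is tokenized on its own
    return fast_tokenizer(example["premise"], example["hypothesis"], truncation=True)

# %time is an IPython magic that reports how long the statement takes
%time tokenized = raw_datasets.map(tokenize_one_by_one)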
16
00:00:46,800 --> 00:00:49,350
Processing the whole dataset
is four times faster
17
00:00:49,350 --> 00:00:50,970
with a fast tokenizer.
18
00:00:50,970 --> 00:00:54,000
That's quicker indeed,
but not very impressive.
19
00:00:54,000 --> 00:00:55,380
This is because we passed along the texts
20
00:00:55,380 --> 00:00:57,240
to the tokenizer one at a time.
21
00:00:57,240 --> 00:00:59,730
This is a common mistake
with fast tokenizers
22
00:00:59,730 --> 00:01:02,550
which are backed by Rust,
and thus able to parallelize
23
00:01:02,550 --> 00:01:05,370
the tokenization of multiple texts.
24
00:01:05,370 --> 00:01:07,290
Passing them only one text at a time
25
00:01:07,290 --> 00:01:09,720
is like sending a cargo
ship between two continents
26
00:01:09,720 --> 00:01:13,140
with just one container;
it's very inefficient.
27
00:01:13,140 --> 00:01:15,810
To unleash the full speed
of our fast tokenizers,
28
00:01:15,810 --> 00:01:18,840
we need to send them batches
of texts, which we can do
29
00:01:18,840 --> 00:01:21,423
with the batched=True
argument of the map method.
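(Again as a sketch continuing from above: the same tokenizer call works on batches, since it accepts lists of texts.)

def tokenize_function(examples):
    # With batched=True, examples["premise"] is a list of texts, so the Rust
    # backend can tokenize the whole batch in parallel
    return fast_tokenizer(examples["premise"], examples["hypothesis"], truncation=True)

%time tokenized = raw_datasets.map(tokenize_function, batched=True)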
30
00:01:22,620 --> 00:01:25,950
Now those are impressive
results, as the fast tokenizer
31
00:01:25,950 --> 00:01:28,410
takes 12 seconds to process
the dataset that takes four
32
00:01:28,410 --> 00:01:30,093
minutes with the slow tokenizer.
33
00:01:31,440 --> 00:01:33,510
Summarizing the results in this table,
34
00:01:33,510 --> 00:01:36,630
you can see why we have
called those tokenizers fast.
35
00:01:36,630 --> 00:01:38,760
And this is only for tokenizing texts.
36
00:01:38,760 --> 00:01:40,710
If you ever need to train a new tokenizer,
37
00:01:40,710 --> 00:01:42,523
they do this very quickly too.