From 1d92a9025e7e387ba270cbd7374e090fa21e205d Mon Sep 17 00:00:00 2001
From: Pavel Nesterov
Date: Fri, 10 Mar 2023 16:33:20 +0100
Subject: [PATCH] Explain why there are more tokens, than reviews (#476)

* Explain why there are more tokens, than reviews

* Update chapters/en/chapter5/3.mdx

---------

Co-authored-by: lewtun
---
 chapters/en/chapter5/3.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/chapter5/3.mdx b/chapters/en/chapter5/3.mdx
index 3f9c37dc2..4a3ddc7b5 100644
--- a/chapters/en/chapter5/3.mdx
+++ b/chapters/en/chapter5/3.mdx
@@ -387,7 +387,7 @@ ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
 
 Oh no! That didn't work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the `Dataset.map()` [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map), you may recall that it's the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.
 
-The problem is that we're trying to mix two different datasets of different sizes: the `drug_dataset` columns will have a certain number of examples (the 1,000 in our error), but the `tokenized_dataset` we are building will have more (the 1,463 in the error message). That doesn't work for a `Dataset`, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the `remove_columns` argument:
+The problem is that we're trying to mix two different datasets of different sizes: the `drug_dataset` columns will have a certain number of examples (the 1,000 in our error), but the `tokenized_dataset` we are building will have more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using `return_overflowing_tokens=True`). That doesn't work for a `Dataset`, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the `remove_columns` argument:
 
 ```py
 tokenized_dataset = drug_dataset.map(
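
Below is a minimal sketch (not part of the patch) of the behavior the added sentence describes: with `return_overflowing_tokens=True`, a long review is split into several tokenized chunks, so `Dataset.map()` produces more rows than it received, and passing `remove_columns` drops the old, shorter columns so the new dataset can be built. The toy column name `review`, the checkpoint `bert-base-cased`, and the small `max_length` are illustrative assumptions, not taken from the patched chapter.

```py
# Sketch only: toy data and hyperparameters are assumptions for illustration.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Two toy "reviews"; the second is long enough to overflow a small max_length.
toy_dataset = Dataset.from_dict(
    {"review": ["Great drug, no side effects.", "This review is long. " * 50]}
)


def tokenize_and_split(examples):
    # Each overflowing chunk becomes its own row in the mapped output.
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=32,
        return_overflowing_tokens=True,
    )


# Removing the original columns avoids the ArrowInvalid length-mismatch error,
# because the tokenized output has more rows than the old columns provide.
tokenized = toy_dataset.map(
    tokenize_and_split,
    batched=True,
    remove_columns=toy_dataset.column_names,
)

print(len(toy_dataset), len(tokenized))  # 2 rows in, more rows out
```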