Merge pull request huggingface#1 from chenglu99/main
running python utils/code_formatter.py
yaoqih authored Feb 18, 2023
2 parents cc83c68 + 0a0a179 commit 97b3123
Showing 70 changed files with 337 additions and 106 deletions.
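The script being run is not included in this diff. As a rough sketch only, assuming `utils/code_formatter.py` does nothing more than pass the fenced Python blocks of each `.mdx` chapter file through Black, a comparable utility could look like this (the repository's actual script may well differ):

```python
# Hypothetical sketch of a course code formatter (not the repository's actual script).
# Assumes Black is installed; it rewrites every fenced Python block in-place.
import re
from pathlib import Path

import black

FENCE = "`" * 3  # triple backtick, spelled out to keep this example self-contained
CODE_BLOCK = re.compile(FENCE + r"(python|py)\n(.*?)" + FENCE, re.DOTALL)


def format_code_blocks(path: Path) -> None:
    """Rewrite every fenced Python block in `path` with Black's default style."""
    text = path.read_text(encoding="utf-8")

    def reformat(match: re.Match) -> str:
        lang, code = match.group(1), match.group(2)
        try:
            code = black.format_str(code, mode=black.Mode())
        except Exception:  # leave blocks Black cannot parse untouched
            pass
        return f"{FENCE}{lang}\n{code}{FENCE}"

    path.write_text(CODE_BLOCK.sub(reformat, text), encoding="utf-8")


for mdx_file in sorted(Path("chapters").rglob("*.mdx")):
    format_code_blocks(mdx_file)
```

Every hunk below is consistent with a pass of that kind: no behavioral changes, only layout.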
4 changes: 3 additions & 1 deletion chapters/de/chapter1/3.mdx
@@ -150,7 +150,9 @@ from transformers import pipeline
 
 generator = pipeline("text-generation", model="distilgpt2")
 generator(
-    "In this course, we will teach you how to", max_length=30, num_return_sequences=2,
+    "In this course, we will teach you how to",
+    max_length=30,
+    num_return_sequences=2,
 )
 ```

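This first hunk is representative of most of the changes in this commit: the call already ended with a trailing comma, and Black's "magic trailing comma" rule treats that as a request to keep the brackets exploded, one argument (or list element) per line. A small illustration, assuming Black (22.x or later) is installed:

```python
# Black's "magic trailing comma": a trailing comma inside brackets makes Black
# keep one element per line instead of packing everything onto a single line.
import black

before = (
    "generator(\n"
    '    "In this course, we will teach you how to", max_length=30, num_return_sequences=2,\n'
    ")\n"
)
print(black.format_str(before, mode=black.Mode()))
# generator(
#     "In this course, we will teach you how to",
#     max_length=30,
#     num_return_sequences=2,
# )
```

The same rule accounts for the exploded list in `chapters/en/chapter2/2.mdx` and for the one-argument-per-line `DataLoader` and `prepare_tf_dataset()` calls further down.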
3 changes: 2 additions & 1 deletion chapters/de/chapter3/3_tf.mdx
@@ -85,7 +85,8 @@ model.compile(
     metrics=["accuracy"],
 )
 model.fit(
-    tf_train_dataset, validation_data=tf_validation_dataset,
+    tf_train_dataset,
+    validation_data=tf_validation_dataset,
 )
 ```

4 changes: 3 additions & 1 deletion chapters/en/chapter1/3.mdx
@@ -150,7 +150,9 @@ from transformers import pipeline
 
 generator = pipeline("text-generation", model="distilgpt2")
 generator(
-    "In this course, we will teach you how to", max_length=30, num_return_sequences=2,
+    "In this course, we will teach you how to",
+    max_length=30,
+    num_return_sequences=2,
 )
 ```

5 changes: 4 additions & 1 deletion chapters/en/chapter2/2.mdx
@@ -39,7 +39,10 @@ from transformers import pipeline
 
 classifier = pipeline("sentiment-analysis")
 classifier(
-    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!",]
+    [
+        "I've been waiting for a HuggingFace course my whole life.",
+        "I hate this so much!",
+    ]
 )
 ```

3 changes: 2 additions & 1 deletion chapters/en/chapter3/3_tf.mdx
@@ -85,7 +85,8 @@ model.compile(
     metrics=["accuracy"],
 )
 model.fit(
-    tf_train_dataset, validation_data=tf_validation_dataset,
+    tf_train_dataset,
+    validation_data=tf_validation_dataset,
 )
 ```

2 changes: 1 addition & 1 deletion chapters/en/chapter5/4.mdx
@@ -88,7 +88,7 @@ Here the `rss` attribute refers to the _resident set size_, which is the fractio
 
 ```py
 print(f"Number of files in dataset : {pubmed_dataset.dataset_size}")
-size_gb = pubmed_dataset.dataset_size / (1024 ** 3)
+size_gb = pubmed_dataset.dataset_size / (1024**3)
 print(f"Dataset size (cache file) : {size_gb:.2f} GB")
 ```

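The hunk above shows the commit's other recurring change: recent Black releases write the power operator without surrounding spaces when both operands are simple, so `1024 ** 3` becomes `1024**3`. A self-contained version of the same computation, with a made-up size rather than the real PubMed figure:

```python
# 1024**3 bytes per gibibyte; Black hugs the ** operator when both operands
# are simple names or literals, which is the only change in the hunk above.
dataset_size_bytes = 50_000_000_000  # hypothetical size, not the real dataset_size
size_gb = dataset_size_bytes / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")  # -> 46.57 GB
```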
4 changes: 3 additions & 1 deletion chapters/en/chapter6/8.mdx
@@ -404,7 +404,9 @@ Great! Now that we're done, we can save the tokenizer like before, and wrap it i
 from transformers import PreTrainedTokenizerFast
 
 wrapped_tokenizer = PreTrainedTokenizerFast(
-    tokenizer_object=tokenizer, bos_token="<|endoftext|>", eos_token="<|endoftext|>",
+    tokenizer_object=tokenizer,
+    bos_token="<|endoftext|>",
+    eos_token="<|endoftext|>",
 )
 ```

17 changes: 13 additions & 4 deletions chapters/en/chapter7/2.mdx
@@ -413,7 +413,9 @@ Now we can just pass them to the `TFAutoModelForTokenClassification.from_pretrai
 from transformers import TFAutoModelForTokenClassification
 
 model = TFAutoModelForTokenClassification.from_pretrained(
-    model_checkpoint, id2label=id2label, label2id=label2id,
+    model_checkpoint,
+    id2label=id2label,
+    label2id=label2id,
 )
 ```

@@ -661,7 +663,9 @@ Now we can just pass them to the `AutoModelForTokenClassification.from_pretraine
 from transformers import AutoModelForTokenClassification
 
 model = AutoModelForTokenClassification.from_pretrained(
-    model_checkpoint, id2label=id2label, label2id=label2id,
+    model_checkpoint,
+    id2label=id2label,
+    label2id=label2id,
 )
 ```

@@ -770,7 +774,10 @@ First we need to build the `DataLoader`s from our datasets. We'll reuse our `dat
 from torch.utils.data import DataLoader
 
 train_dataloader = DataLoader(
-    tokenized_datasets["train"], shuffle=True, collate_fn=data_collator, batch_size=8,
+    tokenized_datasets["train"],
+    shuffle=True,
+    collate_fn=data_collator,
+    batch_size=8,
 )
 eval_dataloader = DataLoader(
     tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
@@ -781,7 +788,9 @@ Next we reinstantiate our model, to make sure we're not continuing the fine-tuni
 
 ```py
 model = AutoModelForTokenClassification.from_pretrained(
-    model_checkpoint, id2label=id2label, label2id=label2id,
+    model_checkpoint,
+    id2label=id2label,
+    label2id=label2id,
 )
 ```

10 changes: 8 additions & 2 deletions chapters/en/chapter7/3.mdx
@@ -639,11 +639,17 @@ Once we're logged in, we can create our `tf.data` datasets. To do so, we'll use
 
 ```python
 tf_train_dataset = model.prepare_tf_dataset(
-    downsampled_dataset["train"], collate_fn=data_collator, shuffle=True, batch_size=32,
+    downsampled_dataset["train"],
+    collate_fn=data_collator,
+    shuffle=True,
+    batch_size=32,
 )
 
 tf_eval_dataset = model.prepare_tf_dataset(
-    downsampled_dataset["test"], collate_fn=data_collator, shuffle=False, batch_size=32,
+    downsampled_dataset["test"],
+    collate_fn=data_collator,
+    shuffle=False,
+    batch_size=32,
 )
 ```

10 changes: 8 additions & 2 deletions chapters/en/chapter7/4.mdx
@@ -379,7 +379,10 @@ We can now use this `data_collator` to convert each of our datasets to a `tf.dat
 
 ```python
 tf_train_dataset = model.prepare_tf_dataset(
-    tokenized_datasets["train"], collate_fn=data_collator, shuffle=True, batch_size=32,
+    tokenized_datasets["train"],
+    collate_fn=data_collator,
+    shuffle=True,
+    batch_size=32,
 )
 tf_eval_dataset = model.prepare_tf_dataset(
     tokenized_datasets["validation"],
@@ -793,7 +796,10 @@ from torch.utils.data import DataLoader
 
 tokenized_datasets.set_format("torch")
 train_dataloader = DataLoader(
-    tokenized_datasets["train"], shuffle=True, collate_fn=data_collator, batch_size=8,
+    tokenized_datasets["train"],
+    shuffle=True,
+    collate_fn=data_collator,
+    batch_size=8,
 )
 eval_dataloader = DataLoader(
     tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
12 changes: 9 additions & 3 deletions chapters/en/chapter7/5.mdx
@@ -285,7 +285,9 @@ max_target_length = 30
 
 def preprocess_function(examples):
     model_inputs = tokenizer(
-        examples["review_body"], max_length=max_input_length, truncation=True,
+        examples["review_body"],
+        max_length=max_input_length,
+        truncation=True,
     )
     labels = tokenizer(
         examples["review_title"], max_length=max_target_length, truncation=True
@@ -673,7 +675,10 @@ We're almost ready to train! We just need to convert our datasets to `tf.data.Da
 
 ```python
 tf_train_dataset = model.prepare_tf_dataset(
-    tokenized_datasets["train"], collate_fn=data_collator, shuffle=True, batch_size=8,
+    tokenized_datasets["train"],
+    collate_fn=data_collator,
+    shuffle=True,
+    batch_size=8,
 )
 tf_eval_dataset = model.prepare_tf_dataset(
     tokenized_datasets["validation"],
@@ -944,7 +949,8 @@ for epoch in range(num_train_epochs):
     for step, batch in enumerate(eval_dataloader):
         with torch.no_grad():
             generated_tokens = accelerator.unwrap_model(model).generate(
-                batch["input_ids"], attention_mask=batch["attention_mask"],
+                batch["input_ids"],
+                attention_mask=batch["attention_mask"],
             )
 
         generated_tokens = accelerator.pad_across_processes(
10 changes: 8 additions & 2 deletions chapters/en/chapter7/6.mdx
@@ -383,10 +383,16 @@ Now we can use the `prepare_tf_dataset()` method to convert our datasets to Tens
 
 ```python
 tf_train_dataset = model.prepare_tf_dataset(
-    tokenized_dataset["train"], collate_fn=data_collator, shuffle=True, batch_size=32,
+    tokenized_dataset["train"],
+    collate_fn=data_collator,
+    shuffle=True,
+    batch_size=32,
 )
 tf_eval_dataset = model.prepare_tf_dataset(
-    tokenized_dataset["valid"], collate_fn=data_collator, shuffle=False, batch_size=32,
+    tokenized_dataset["valid"],
+    collate_fn=data_collator,
+    shuffle=False,
+    batch_size=32,
 )
 ```

15 changes: 12 additions & 3 deletions chapters/en/chapter7/7.mdx
@@ -863,10 +863,16 @@ And now we create the datasets as usual.
 
 ```python
 tf_train_dataset = model.prepare_tf_dataset(
-    train_dataset, collate_fn=data_collator, shuffle=True, batch_size=16,
+    train_dataset,
+    collate_fn=data_collator,
+    shuffle=True,
+    batch_size=16,
 )
 tf_eval_dataset = model.prepare_tf_dataset(
-    validation_dataset, collate_fn=data_collator, shuffle=False, batch_size=16,
+    validation_dataset,
+    collate_fn=data_collator,
+    shuffle=False,
+    batch_size=16,
 )
 ```

@@ -1017,7 +1023,10 @@ validation_set = validation_dataset.remove_columns(["example_id", "offset_mappin
 validation_set.set_format("torch")
 
 train_dataloader = DataLoader(
-    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=8,
+    train_dataset,
+    shuffle=True,
+    collate_fn=default_data_collator,
+    batch_size=8,
 )
 eval_dataloader = DataLoader(
     validation_set, collate_fn=default_data_collator, batch_size=8
4 changes: 3 additions & 1 deletion chapters/es/chapter1/3.mdx
@@ -153,7 +153,9 @@ from transformers import pipeline
 
 generator = pipeline("text-generation", model="distilgpt2")
 generator(
-    "In this course, we will teach you how to", max_length=30, num_return_sequences=2,
+    "In this course, we will teach you how to",
+    max_length=30,
+    num_return_sequences=2,
 )
 ```

2 changes: 1 addition & 1 deletion chapters/es/chapter5/4.mdx
@@ -87,7 +87,7 @@ El atributo `rss` se refiere al _resident set size_, que es la fracción de memo
 
 ```py
 print(f"Number of files in dataset : {pubmed_dataset.dataset_size}")
-size_gb = pubmed_dataset.dataset_size / (1024 ** 3)
+size_gb = pubmed_dataset.dataset_size / (1024**3)
 print(f"Dataset size (cache file) : {size_gb:.2f} GB")
 ```

5 changes: 4 additions & 1 deletion chapters/fa/chapter2/2.mdx
@@ -43,7 +43,10 @@ from transformers import pipeline
 
 classifier = pipeline("sentiment-analysis")
 classifier(
-    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!",]
+    [
+        "I've been waiting for a HuggingFace course my whole life.",
+        "I hate this so much!",
+    ]
 )
 ```

3 changes: 2 additions & 1 deletion chapters/fr/chapter3/3_tf.mdx
@@ -87,7 +87,8 @@ model.compile(
     metrics=["accuracy"],
 )
 model.fit(
-    tf_train_dataset, validation_data=tf_validation_dataset,
+    tf_train_dataset,
+    validation_data=tf_validation_dataset,
 )
 ```

2 changes: 1 addition & 1 deletion chapters/fr/chapter5/4.mdx
@@ -92,7 +92,7 @@ Ici, l'attribut `rss` fait référence à la _taille de l'ensemble résident_, q
 
 ```py
 print(f"Number of files in dataset : {pubmed_dataset.dataset_size}")
-size_gb = pubmed_dataset.dataset_size / (1024 ** 3)
+size_gb = pubmed_dataset.dataset_size / (1024**3)
 print(f"Dataset size (cache file) : {size_gb:.2f} GB")
 ```

4 changes: 3 additions & 1 deletion chapters/fr/chapter6/8.mdx
@@ -408,7 +408,9 @@ Super ! Maintenant que nous avons terminé, nous pouvons sauvegarder le tokenize
 from transformers import PreTrainedTokenizerFast
 
 wrapped_tokenizer = PreTrainedTokenizerFast(
-    tokenizer_object=tokenizer, bos_token="<|endoftext|>", eos_token="<|endoftext|>",
+    tokenizer_object=tokenizer,
+    bos_token="<|endoftext|>",
+    eos_token="<|endoftext|>",
 )
 ```

17 changes: 13 additions & 4 deletions chapters/fr/chapter7/2.mdx
@@ -416,7 +416,9 @@ Maintenant, nous pouvons simplement les passer à la méthode `TFAutoModelForTok
 from transformers import TFAutoModelForTokenClassification
 
 model = TFAutoModelForTokenClassification.from_pretrained(
-    model_checkpoint, id2label=id2label, label2id=label2id,
+    model_checkpoint,
+    id2label=id2label,
+    label2id=label2id,
 )
 ```

@@ -664,7 +666,9 @@ Maintenant nous pouvons simplement les passer à la méthode `AutoModelForTokenC
 from transformers import AutoModelForTokenClassification
 
 model = AutoModelForTokenClassification.from_pretrained(
-    model_checkpoint, id2label=id2label, label2id=label2id,
+    model_checkpoint,
+    id2label=id2label,
+    label2id=label2id,
 )
 ```

@@ -773,7 +777,10 @@ D'abord nous devons construire le `DataLoader`s à partir de nos jeux de donnée
 from torch.utils.data import DataLoader
 
 train_dataloader = DataLoader(
-    tokenized_datasets["train"], shuffle=True, collate_fn=data_collator, batch_size=8,
+    tokenized_datasets["train"],
+    shuffle=True,
+    collate_fn=data_collator,
+    batch_size=8,
 )
 eval_dataloader = DataLoader(
     tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
@@ -784,7 +791,9 @@ Ensuite, nous réinstantifions notre modèle pour nous assurer que nous ne conti
 
 ```py
 model = AutoModelForTokenClassification.from_pretrained(
-    model_checkpoint, id2label=id2label, label2id=label2id,
+    model_checkpoint,
+    id2label=id2label,
+    label2id=label2id,
 )
 ```

10 changes: 8 additions & 2 deletions chapters/fr/chapter7/3.mdx
@@ -644,11 +644,17 @@ Une fois que nous sommes connectés, nous pouvons créer nos jeux de données `t
 
 ```python
 tf_train_dataset = model.prepare_tf_dataset(
-    downsampled_dataset["train"], collate_fn=data_collator, shuffle=True, batch_size=32,
+    downsampled_dataset["train"],
+    collate_fn=data_collator,
+    shuffle=True,
+    batch_size=32,
 )
 
 tf_eval_dataset = model.prepare_tf_dataset(
-    downsampled_dataset["test"], collate_fn=data_collator, shuffle=False, batch_size=32,
+    downsampled_dataset["test"],
+    collate_fn=data_collator,
+    shuffle=False,
+    batch_size=32,
 )
 ```

10 changes: 8 additions & 2 deletions chapters/fr/chapter7/4.mdx
@@ -392,7 +392,10 @@ Nous pouvons maintenant utiliser ce `data_collator` pour convertir chacun de nos
 
 ```python
 model.prepare_tf_dataset(
-    tokenized_datasets["train"], collate_fn=data_collator, shuffle=True, batch_size=32,
+    tokenized_datasets["train"],
+    collate_fn=data_collator,
+    shuffle=True,
+    batch_size=32,
 )
 tf_eval_dataset = model.prepare_tf_dataset(
     tokenized_datasets["validation"],
@@ -805,7 +808,10 @@ from torch.utils.data import DataLoader
 
 tokenized_datasets.set_format("torch")
 train_dataloader = DataLoader(
-    tokenized_datasets["train"], shuffle=True, collate_fn=data_collator, batch_size=8,
+    tokenized_datasets["train"],
+    shuffle=True,
+    collate_fn=data_collator,
+    batch_size=8,
 )
 eval_dataloader = DataLoader(
     tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8