
Commit

Merge pull request google-research#16 from soskek:patch-1
PiperOrigin-RevId: 285882182
albert-copybara committed Dec 17, 2019
2 parents c2eedd2 + 48a4dde commit cf62c25
Showing 1 changed file with 3 additions and 3 deletions.
README.md (6 changes: 3 additions & 3 deletions)
@@ -3,7 +3,7 @@ ALBERT

***************New October 31, 2019 ***************

-Version 2 of ALBERT models is relased. TF-Hub modules are available:
+Version 2 of ALBERT models is released. TF-Hub modules are available:

- https://tfhub.dev/google/albert_base/2
- https://tfhub.dev/google/albert_large/2
@@ -12,7 +12,7 @@ Version 2 of ALBERT models is relased. TF-Hub modules are available:

In this version, we apply 'no dropout', 'additional training data' and 'long training time' strategies to all models. We train ALBERT-base for 10M steps and other models for 3M steps.

-The result comparsion to the v1 models are as followings:
+The result comparison to the v1 models is as followings:

| | Average | SQuAD1.1 | SQuAD2.0 | MNLI | SST-2 | RACE |
|----------------|----------|----------|----------|----------|----------|----------|
@@ -27,7 +27,7 @@ The result comparsion to the v1 models are as followings:
|ALBERT-xlarge |85.5 |92.5/86.1 | 86.1/83.1|86.4 |92.4 | 74.8 |
|ALBERT-xxlarge |91.0 |94.8/89.3 | 90.2/87.4|90.8 |96.9 | 86.5 |

-The comparison shows that for ALBERT-base, ALBERT-large, and ALBERT-xlarge, v2 is much better than v1, indicating the importance of applying the above three strategies. On average, ALBERT-xxlarge is slightly worse than the v1, because of the following two reasons: 1) Training additional 1.5 M steps (the only difference between these two models are training for 1.5M steps and 3M steps) did not lead to significant performance improvement. 2) For v1, we did a little bit hyperparameter search among the parameters sets given by BERT, Roberta, and XLnet. For v2, we simply adopt the parameters from v1 except for RACE, where we use a learning rate of 1e-5 and 0 [ALBERT DR](https://arxiv.org/pdf/1909.11942.pdf) (droput rate for ALBERT in finetuning). The original (v1) RACE hyperpamter will cause model divergence for v2 models. Given that the downstream tasks are sensitive to the fine-tuning hyperparameters, we should be careful about so called slight improvements.
+The comparison shows that for ALBERT-base, ALBERT-large, and ALBERT-xlarge, v2 is much better than v1, indicating the importance of applying the above three strategies. On average, ALBERT-xxlarge is slightly worse than the v1, because of the following two reasons: 1) Training additional 1.5 M steps (the only difference between these two models is training for 1.5M steps and 3M steps) did not lead to significant performance improvement. 2) For v1, we did a little bit hyperparameter search among the parameters sets given by BERT, Roberta, and XLnet. For v2, we simply adopt the parameters from v1 except for RACE, where we use a learning rate of 1e-5 and 0 [ALBERT DR](https://arxiv.org/pdf/1909.11942.pdf) (dropout rate for ALBERT in finetuning). The original (v1) RACE hyperparameter will cause model divergence for v2 models. Given that the downstream tasks are sensitive to the fine-tuning hyperparameters, we should be careful about so called slight improvements.
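
As a rough illustration of the RACE fine-tuning settings quoted above (a learning rate of 1e-5 and an ALBERT dropout rate of 0), the sketch below shows how those two values could be wired into a fine-tuning setup. The dictionary keys and the `build_race_optimizer` helper are illustrative assumptions, not flags or functions from this repository.

```python
# Minimal sketch of the v2 RACE fine-tuning hyperparameters quoted above.
# The dictionary keys and the helper below are illustrative assumptions,
# not the actual ALBERT repository flags or API.
import tensorflow.compat.v1 as tf

RACE_FINETUNE_HPARAMS = {
    "learning_rate": 1e-5,               # v2 RACE fine-tuning learning rate
    "hidden_dropout_prob": 0.0,          # "ALBERT DR" = 0: dropout disabled
    "attention_probs_dropout_prob": 0.0,
}

def build_race_optimizer(hparams):
    """Hypothetical helper: an Adam optimizer at the quoted learning rate."""
    return tf.train.AdamOptimizer(learning_rate=hparams["learning_rate"])

optimizer = build_race_optimizer(RACE_FINETUNE_HPARAMS)
```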

ALBERT is "A Lite" version of BERT, a popular unsupervised language
representation learning algorithm. ALBERT uses parameter-reduction techniques
