MiVOLO: Multi-input Transformer for Age and Gender Estimation, Maksim Kuprashevich, Irina Tolstykh, 2023 arXiv 2307.04616
[Paper] [Demo] [BibTex] [Data]
Gender & Age recognition performance.
| Model | Type | Dataset | Age MAE | Age CS@5 | Gender Accuracy | download |
|---|---|---|---|---|---|---|
| volo_d1 | face_only, age | IMDB-cleaned | 4.29 | 67.71 | - | checkpoint |
| volo_d1 | face_only, age, gender | IMDB-cleaned | 4.22 | 68.68 | 99.38 | checkpoint |
| mivolo_d1 | face_body, age, gender | IMDB-cleaned | 4.24 [face+body], 6.87 [body] | 68.32 [face+body], 46.32 [body] | 99.46 [face+body], 96.48 [body] | checkpoint |
| volo_d1 | face_only, age | UTKFace | 4.23 | 69.72 | - | checkpoint |
| volo_d1 | face_only, age, gender | UTKFace | 4.23 | 69.78 | 97.69 | checkpoint |
| mivolo_d1 | face_body, age, gender | Lagenda | 3.99 [face+body] | 71.27 [face+body] | 97.36 [face+body] | demo |
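The age columns above can be reproduced from per-sample predictions. A minimal sketch of the two metrics (the function name and implementation are ours, not the repo's evaluation code): MAE is the mean absolute error in years, and CS@5 is the percentage of samples predicted within 5 years of the true age.

```python
def age_metrics(pred_ages, true_ages, k=5):
    """Age MAE and CS@k for paired prediction/ground-truth lists.

    MAE: mean absolute error in years.
    CS@k: cumulative score, i.e. percentage of samples with
    absolute error <= k years.
    """
    errors = [abs(p - t) for p, t in zip(pred_ages, true_ages)]
    mae = sum(errors) / len(errors)
    cs_k = 100.0 * sum(e <= k for e in errors) / len(errors)
    return mae, cs_k
```

For example, predictions `[30, 27, 51]` against ground truth `[32, 27, 44]` give errors of 2, 0, and 7 years, so MAE is 3.0 and CS@5 is 66.67.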
Please cite our paper if you use any of this data!
- Lagenda dataset: images and annotation.
- IMDB-clean: follow these instructions to get images and download our annotations.
- UTK dataset: original full images and our annotation: split from the article, random full split.
- Adience dataset: follow these instructions to get images and download our annotations.
After downloading them, your `data` directory should look something like this:

```
data
└── Adience
    ├── annotations (folder with our annotations)
    ├── aligned (will not be used)
    ├── faces
    ├── fold_0_data.txt
    ├── fold_1_data.txt
    ├── fold_2_data.txt
    ├── fold_3_data.txt
    └── fold_4_data.txt
```
We use coarse aligned images from the `faces/` dir. Using our detector, we found a face bbox for each image (see `tools/prepare_adience.py`).
This dataset has five folds. The performance metric is accuracy on five-fold cross validation.
| images before removal | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 |
|---|---|---|---|---|---|
| 19,370 | 4,484 | 3,730 | 3,894 | 3,446 | 3,816 |

Not complete data:

| only age not found | only gender not found | SUM |
|---|---|---|
| 40 | 1,170 | 1,210 (6.2 %) |

Removed data:

| failed to process image | age and gender not found | SUM |
|---|---|---|
| 0 | 708 | 708 (3.6 %) |

Genders:

| female | male |
|---|---|
| 9,372 | 8,120 |

Ages (8 classes), after mapping to non-intersecting age intervals:

| 0-2 | 4-6 | 8-12 | 15-20 | 25-32 | 38-43 | 48-53 | 60-100 |
|---|---|---|---|---|---|---|---|
| 2,509 | 2,140 | 2,293 | 1,791 | 5,589 | 2,490 | 909 | 901 |
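The 8-class interval mapping above can be sketched as follows. This is an illustration only; the actual rule is in `tools/prepare_adience.py`, and the nearest-midpoint fallback for ages that fall between intervals is our own assumption.

```python
# The 8 non-intersecting Adience age intervals (inclusive bounds).
ADIENCE_INTERVALS = [(0, 2), (4, 6), (8, 12), (15, 20),
                     (25, 32), (38, 43), (48, 53), (60, 100)]

def to_adience_class(age):
    """Map a raw age to one of the 8 Adience interval classes.

    Ages inside an interval map directly; ages between intervals
    fall back to the interval with the nearest midpoint
    (our assumption, not necessarily the repo's rule).
    """
    for i, (lo, hi) in enumerate(ADIENCE_INTERVALS):
        if lo <= age <= hi:
            return i
    mids = [(lo + hi) / 2 for lo, hi in ADIENCE_INTERVALS]
    return min(range(len(mids)), key=lambda i: abs(age - mids[i]))
```

For example, age 30 maps to class 4 (25-32), while age 7 sits between intervals and falls back to class 1 (4-6), whose midpoint is nearest.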
- FairFace dataset: follow these instructions to get images and download our annotations.
After downloading them, your `data` directory should look something like this:

```
data
└── FairFace
    ├── annotations (folder with our annotations)
    ├── fairface-img-margin025-trainval (will not be used)
    │   ├── train
    │   └── val
    ├── fairface-img-margin125-trainval
    │   ├── train
    │   └── val
    ├── fairface_label_train.csv
    └── fairface_label_val.csv
```
We use aligned images from the `fairface-img-margin125-trainval/` dir. Using our detector, we found a face bbox for each image and added a person bbox where possible (see `tools/prepare_fairface.py`).
This dataset has 2 splits: train and val. The performance metric is accuracy on validation.
| images train | images val |
|---|---|
| 86,744 | 10,954 |

Genders for validation:

| female | male |
|---|---|
| 5,162 | 5,792 |

Ages for validation (9 classes):

| 0-2 | 3-9 | 10-19 | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70+ |
|---|---|---|---|---|---|---|---|---|
| 199 | 1,356 | 1,181 | 3,300 | 2,330 | 1,353 | 796 | 321 | 118 |
Install PyTorch 1.13+ and other requirements:

```
pip install -r requirements.txt
pip install .
```
- Download the body + face detector model to `models/yolov8x_person_face.pt`
- Download the MiVOLO checkpoint to `models/mivolo_imbd.pth.tar`
```
wget https://variety.com/wp-content/uploads/2023/04/MCDNOHA_SP001.jpg -O jennifer_lawrence.jpg

python3 demo.py \
    --input "jennifer_lawrence.jpg" \
    --output "output" \
    --detector-weights "models/yolov8x_person_face.pt" \
    --checkpoint "models/mivolo_imbd.pth.tar" \
    --device "cuda:0" \
    --with-persons \
    --draw
```
To run the demo on a YouTube video:
```
python3 demo.py \
    --input "https://www.youtube.com/shorts/pVh32k0hGEI" \
    --output "output" \
    --detector-weights "models/yolov8x_person_face.pt" \
    --checkpoint "models/mivolo_imbd.pth.tar" \
    --device "cuda:0" \
    --draw \
    --with-persons
```
To reproduce validation metrics:
- Download prepared annotations for imdb-clean / utk / adience / lagenda / fairface.
- Download a checkpoint.
- Run validation:
```
python3 eval_pretrained.py \
    --dataset_images /path/to/dataset/utk/images \
    --dataset_annotations /path/to/dataset/utk/annotation \
    --dataset_name utk \
    --split valid \
    --batch-size 512 \
    --checkpoint models/mivolo_imbd.pth.tar \
    --half \
    --with-persons \
    --device "cuda:0"
```
Supported dataset names: "utk", "imdb", "lagenda", "fairface", "adience".
15.08.2023 - 0.4.1dev
- Support for video streams, including YouTube URLs
- Instructions and explanations for various export types.
- Removed the CutOff operation from inference: it proved ineffective there while being quite costly. It is now used only during training.
As of now (11.08.2023), while ONNX export is technically feasible, it is not advisable because the resulting model performs poorly with batch processing. TensorRT and OpenVINO export is impossible due to their lack of support for the col2im operation.
If you are still committed to ONNX export, you can refer to these instructions.
The most recommended export method at present is TorchScript. You can achieve this with a single line of code:

```
torch.jit.trace(model, example_input)
```

This approach gives you a model that maintains its original speed and requires only a single file for usage, eliminating the need for additional code.
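A slightly fuller sketch of the TorchScript route, using a placeholder model and input shape rather than MiVOLO's real ones (the file name is also ours):

```python
import torch

# Placeholder model; substitute the loaded MiVOLO model here.
model = torch.nn.Sequential(torch.nn.Linear(8, 2)).eval()
example_input = torch.randn(1, 8)

# trace() records the ops executed on the example input and
# returns a self-contained ScriptModule.
traced = torch.jit.trace(model, example_input)
traced.save("model_traced.pt")

# The saved file can be loaded without the original Python class.
loaded = torch.jit.load("model_traced.pt")
with torch.no_grad():
    assert torch.allclose(loaded(example_input), model(example_input))
```

Note that `torch.jit.trace` records only the control-flow path taken for the given example input, so the example should match the shapes you will use at inference.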
Please see here.
If you use our models, code or dataset, we kindly ask you to cite the following paper and give the repository a ⭐
```
@article{mivolo2023,
   Author = {Maksim Kuprashevich and Irina Tolstykh},
   Title = {MiVOLO: Multi-input Transformer for Age and Gender Estimation},
   Year = {2023},
   Eprint = {arXiv:2307.04616},
}
```