Commit 6ac5612

authored Dec 10, 2022
Improve dataset manifest installation (cvat-ai#5447)
Extracted from cvat-ai#5083
Related cvat-ai#5096
- Improved dataset manifest docs
- Dataset manifest requirements are now installed in the server image
- Package dependencies are aligned with the server
1 parent c9f214a commit 6ac5612

5 files changed: +148 -47 lines changed


‎Dockerfile

+8-2
@@ -47,8 +47,14 @@ RUN python3 -m venv /opt/venv
4747
ENV PATH="/opt/venv/bin:${PATH}"
4848
RUN python3 -m pip install --no-cache-dir -U pip==22.0.2 setuptools==60.6.0 wheel==0.37.1
4949
COPY cvat/requirements/ /tmp/requirements/
50-
RUN DATUMARO_HEADLESS=1 python3 -m pip install --no-cache-dir -r /tmp/requirements/${DJANGO_CONFIGURATION}.txt
51-
50+
COPY utils/dataset_manifest/ /tmp/dataset_manifest/
51+
52+
# The server implementation depends on the dataset_manifest utility
53+
# so we need to install its dependencies too
54+
# https://github.com/opencv/cvat/issues/5096
55+
RUN DATUMARO_HEADLESS=1 python3 -m pip install --no-cache-dir \
56+
-r /tmp/requirements/${DJANGO_CONFIGURATION}.txt \
57+
-r /tmp/dataset_manifest/requirements.txt
5258

5359
FROM ubuntu:20.04
5460

‎site/content/en/docs/contributing/development-environment.md

+1-1
@@ -78,7 +78,7 @@ description: 'Installing a development environment for different operating syste
7878
python3 -m venv .env
7979
. .env/bin/activate
8080
pip install -U pip wheel setuptools
81-
pip install -r cvat/requirements/development.txt
81+
pip install -r cvat/requirements/development.txt -r utils/dataset_manifest/requirements.txt
8282
python manage.py migrate
8383
python manage.py collectstatic
8484
```

‎site/content/en/docs/manual/advanced/dataset_manifest.md

+130-40
@@ -2,20 +2,80 @@
22

33
---
44

5-
title: 'Simple command line to prepare dataset manifest file'
5+
title: 'Dataset Manifest'
66
linkTitle: 'Dataset manifest'
77
weight: 30
8-
description: This section on [GitHub](https://github.com/cvat-ai/cvat/tree/develop/utils/dataset_manifest)
8+
description:
99

1010
---
1111

1212
<!--lint disable heading-style-->
1313

14-
### Steps before use
14+
## Overview
1515

16-
When used separately from Computer Vision Annotation Tool(CVAT), the required dependencies must be installed
16+
When we create a new task in CVAT, we need to specify where to get the input data from.
17+
CVAT supports several data sources, including local file uploads, a mounted
18+
file share on the server, cloud storages and remote URLs. In some cases CVAT
19+
needs to have extra information about the input data. This information can be provided
20+
in dataset manifest files. They are mainly used when working with cloud storages to
21+
reduce the amount of network traffic used and speed up the task creation process.
22+
However, they can also be used in other cases, which will be explained below.
1723

18-
#### Ubuntu:20.04
24+
A dataset manifest file is a text file in the JSONL format. These files can be created
25+
automatically with [the special command-line tool](https://github.com/opencv/cvat/tree/develop/utils/dataset_manifest),
26+
or manually, following [the manifest file format specification](#file-format).
27+
28+
## How and when to use manifest files
29+
30+
Manifest files can be used in the following cases:
31+
- A video file or a set of images is used as the data source and
32+
the caching mode is enabled. [Read more](/docs/manual/advanced/data_on_fly/)
33+
- The data is located in a cloud storage. [Read more](/docs/manual/basics/cloud-storages/)
34+
35+
## How to generate manifest files
36+
37+
CVAT provides a dedicated Python tool to generate manifest files.
38+
The source code can be found [here](https://github.com/opencv/cvat/tree/develop/utils/dataset_manifest).
39+
40+
Using the tool is the recommended way to create manifest files for your data. The data must be
41+
available locally for the tool to generate a manifest.
42+
43+
### Usage
44+
45+
```bash
46+
usage: create.py [-h] [--force] [--output-dir .] source
47+
48+
positional arguments:
49+
source Source paths
50+
51+
optional arguments:
52+
-h, --help show this help message and exit
53+
--force Use this flag to prepare the manifest file for video data
54+
if by default the video does not meet the requirements
55+
and a manifest file is not prepared
56+
--output-dir OUTPUT_DIR
57+
Directory where the manifest file will be saved
58+
```
59+
60+
### Use the script from a Docker image
61+
62+
This is the recommended way to use the tool.
63+
64+
The script can be used from the `cvat/server` image:
65+
66+
```bash
67+
docker run -it --rm -u "$(id -u)":"$(id -g)" \
68+
-v "${PWD}":"/local" \
69+
--entrypoint python3 \
70+
cvat/server \
71+
utils/dataset_manifest/create.py --output-dir /local /local/<path/to/sources>
72+
```
73+
74+
Make sure to adapt the command to your file locations.
75+
76+
### Use the script directly
77+
78+
#### Ubuntu 20.04
1979
2080
Install dependencies:
2181
@@ -38,72 +98,102 @@ Create an environment and install the necessary python modules:
3898
python3 -m venv .env
3999
. .env/bin/activate
40100
pip install -U pip
41-
pip install -r requirements.txt
42-
```
43-
44-
### Using
45-
46-
```bash
47-
usage: python create.py [-h] [--force] [--output-dir .] source
48-
49-
positional arguments:
50-
source Source paths
51-
52-
optional arguments:
53-
-h, --help show this help message and exit
54-
--force Use this flag to prepare the manifest file for video data if by default the video does not meet the requirements
55-
and a manifest file is not prepared
56-
--output-dir OUTPUT_DIR
57-
Directory where the manifest file will be saved
101+
pip install -r utils/dataset_manifest/requirements.txt
58102
```
59103
60-
### Alternative way to use with cvat/server
61-
62-
```bash
63-
docker run -it -u root --entrypoint bash -v /path/to/host/data/:/path/inside/container/:rw cvat/server -c "pip3 install -r utils/dataset_manifest/requirements.txt && python3 utils/dataset_manifest/create.py --output-dir /path/to/manifest/directory/ /path/to/data/"
64-
```
104+
> Please note that if the tool is used with video this way, the results may differ from what
105+
the server would decode. This is related to the ffmpeg library version. For this reason,
106+
using the Docker-based version of the tool is recommended.
65107
66-
### Examples of using
108+
### Examples
67109
68110
Create a dataset manifest in the current directory with video which contains enough keyframes:
69111
70112
```bash
71-
python create.py ~/Documents/video.mp4
113+
python utils/dataset_manifest/create.py ~/Documents/video.mp4
72114
```
73115
74116
Create a dataset manifest with video which does not contain enough keyframes:
75117
76118
```bash
77-
python create.py --force --output-dir ~/Documents ~/Documents/video.mp4
119+
python utils/dataset_manifest/create.py --force --output-dir ~/Documents ~/Documents/video.mp4
78120
```
79121
80122
Create a dataset manifest with images:
81123
82124
```bash
83-
python create.py --output-dir ~/Documents ~/Documents/images/
125+
python utils/dataset_manifest/create.py --output-dir ~/Documents ~/Documents/images/
84126
```
85127
86128
Create a dataset manifest from a glob pattern (`*`, `?` and `[]` may be used):
87129
88130
```bash
89-
python create.py --output-dir ~/Documents "/home/${USER}/Documents/**/image*.jpeg"
131+
python utils/dataset_manifest/create.py --output-dir ~/Documents "/home/${USER}/Documents/**/image*.jpeg"
90132
```
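
The pattern example above can be explored without running the tool. The sketch below assumes that the tool's patterns behave like Python's `glob` module with `recursive=True`, which is an assumption and is not confirmed by this commit. The file names are hypothetical and are created in a temporary directory.

```python
import glob
import os
import tempfile

# Hypothetical file layout, created in a temporary directory for the demo.
FILES = (
    "image1.jpeg",
    "a/image2.jpeg",
    "a/b/image3.jpeg",
    "a/notes.txt",
    "a/photo.jpeg",
)

with tempfile.TemporaryDirectory() as root:
    for relative in FILES:
        path = os.path.join(root, *relative.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    # "**" matches zero or more directory levels when recursive=True.
    pattern = os.path.join(root, "**", "image*.jpeg")
    matches = sorted(
        os.path.relpath(found, root)
        for found in glob.glob(pattern, recursive=True)
    )

# Only files whose basename starts with "image" are listed.
print(matches)
```

Quoting the pattern on the command line, as in the example above, keeps the shell from expanding it before the script receives it.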
91133
92-
Create a dataset manifest with `cvat/server`:
134+
Create a dataset manifest using the Docker image:
93135
94136
```bash
95-
docker run -it --entrypoint python3 -v ~/Documents/data/:${HOME}/manifest/:rw cvat/server
96-
utils/dataset_manifest/create.py --output-dir ~/manifest/ ~/manifest/images/
137+
docker run -it --rm -u "$(id -u)":"$(id -g)" \
138+
-v ~/Documents/data/:${HOME}/manifest/:rw \
139+
--entrypoint python3 \
140+
cvat/server \
141+
utils/dataset_manifest/create.py --output-dir ~/manifest/ ~/manifest/images/
97142
```
98143
99-
### Examples of generated `manifest.jsonl` files
144+
### File format
145+
146+
The dataset manifest files are text files in JSONL format. These files have 2 sub-formats:
147+
_for video_ and _for images and 3d data_.
148+
149+
> Each top-level entry enclosed in curly braces must occupy a single line; empty lines are not allowed.
150+
> The formatting in the descriptions below is only for demonstration.
151+
152+
#### Dataset manifest for video
100153
101-
A manifest file contains some intuitive information and some specific like:
154+
The file describes a single video.
102155
103156
`pts` - time at which the frame should be shown to the user
104-
`checksum` - `md5` hash sum for the specific image/frame
157+
`checksum` - `md5` hash of the specific decoded image/frame
158+
159+
```json
160+
{ "version": <string, version id> }
161+
{ "type": "video" }
162+
{ "properties": {
163+
"name": <string, filename>,
164+
"resolution": [<int, width>, <int, height>],
165+
"length": <int, frame count>
166+
}}
167+
{
168+
"number": <int, frame number>,
169+
"pts": <int, frame pts>,
170+
"checksum": <string, md5 frame hash>
171+
} (repeatable)
172+
```
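
As an illustration of the layout above, here is a minimal sketch of a reader for a video manifest. It is not part of the CVAT tool: the function name `parse_video_manifest` and the sample values (file name, resolution, checksums) are hypothetical, and the sketch assumes the first three lines are always the `version`, `type` and `properties` entries in that order.

```python
import json
from typing import Any, Dict, List, Tuple


def parse_video_manifest(text: str) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Split a video manifest into its header fields and per-frame entries."""
    lines = text.splitlines()
    if any(not line.strip() for line in lines):
        raise ValueError("empty lines are not allowed in a manifest")

    # JSONL: every line is an independent JSON document.
    entries = [json.loads(line) for line in lines]
    if len(entries) < 3:
        raise ValueError("expected version, type and properties entries")

    # Assumption: the three header entries come first, one per line.
    header: Dict[str, Any] = {}
    for entry in entries[:3]:
        header.update(entry)
    if header.get("type") != "video":
        raise ValueError("not a video manifest")

    frames = entries[3:]
    for frame in frames:
        missing = {"number", "pts", "checksum"} - frame.keys()
        if missing:
            raise ValueError(f"frame entry lacks {sorted(missing)}")
    return header, frames


# Hypothetical sample data, made up for this sketch.
sample = "\n".join([
    '{"version":"1.0"}',
    '{"type":"video"}',
    '{"properties":{"name":"video.mp4","resolution":[1280,720],"length":2}}',
    '{"number":0,"pts":0,"checksum":"00000000000000000000000000000000"}',
    '{"number":1,"pts":3600,"checksum":"11111111111111111111111111111111"}',
])
header, frames = parse_video_manifest(sample)
print(header["properties"]["name"], len(frames))
```

Parsing line by line, instead of loading the whole file as one JSON document, is what the single-line-per-entry rule makes possible.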
173+
174+
#### Dataset manifest for images and other data types
175+
176+
The file describes an ordered set of images and 3d point clouds.
177+
178+
`name` - file basename and leading directories from the dataset root
179+
`checksum` - `md5` hash of the specific decoded image/frame
180+
181+
```json
182+
{ "version": <string, version id> }
183+
{ "type": "images" }
184+
{
185+
"name": <string, image filename>,
186+
"extension": <string, . + file extension>,
187+
"width": <int, width>,
188+
"height": <int, height>,
189+
"meta": <dict, optional>,
190+
"checksum": <string, md5 hash, optional>
191+
} (repeatable)
192+
```
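
To make the field layout concrete, here is a minimal sketch that assembles an images manifest by hand. It is not the CVAT tool, which remains the recommended way to generate manifests. The helper names are hypothetical, the sketch assumes that `name` excludes the file extension, and it omits the optional `meta` and `checksum` fields.

```python
import json
from pathlib import PurePosixPath
from typing import Dict, Iterable, Tuple, Union


def image_entry(relative_path: str, width: int, height: int) -> Dict[str, Union[str, int]]:
    """Build one image entry from a path relative to the dataset root."""
    path = PurePosixPath(relative_path)
    return {
        # Assumption: "name" keeps leading directories but drops the extension.
        "name": str(path.with_suffix("")),
        "extension": path.suffix,
        "width": width,
        "height": height,
    }


def build_images_manifest(images: Iterable[Tuple[str, int, int]], version: str = "1.0") -> str:
    """Serialize the header and the image entries, one JSON object per line."""
    lines = [json.dumps({"version": version}), json.dumps({"type": "images"})]
    lines += [json.dumps(image_entry(*image)) for image in images]
    return "\n".join(lines) + "\n"


# Hypothetical image path and dimensions.
print(build_images_manifest([("train/image1.jpg", 720, 405)]), end="")
```

Because `json.dumps` never emits line breaks by default, each entry ends up on exactly one line.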
193+
194+
### Example files
105195
106-
#### For a video
196+
#### Manifest for a video
107197
108198
```json
109199
{"version":"1.0"}
@@ -117,7 +207,7 @@ A manifest file contains some intuitive information and some specific like:
117207
{"number":675,"pts":2430000,"checksum":"0e72faf67e5218c70b506445ac91cdd7"}
118208
```
119209
120-
#### For a dataset with images
210+
#### Manifest for a dataset with images
121211
122212
```json
123213
{"version":"1.0"}

‎utils/dataset_manifest/create.py

100644 → 100755
+5
@@ -1,6 +1,10 @@
1+
#!/usr/bin/env python3
2+
13
# Copyright (C) 2021-2022 Intel Corporation
4+
# Copyright (C) 2022 CVAT.ai Corporation
25
#
36
# SPDX-License-Identifier: MIT
7+
48
import argparse
59
import os
610
import sys
@@ -89,6 +93,7 @@ def main():
8993
sys.exit(str(ex))
9094

9195
print('The manifest file has been prepared')
96+
9297
if __name__ == "__main__":
9398
base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
9499
sys.path.append(base_dir)

utils/dataset_manifest/requirements.txt

+4-4
@@ -1,5 +1,5 @@
1-
av==9.2.0 --no-binary=av
2-
opencv-python-headless==4.4.0.42
1+
av==9.2.0 --no-binary=av # Pinned for the whole CVAT
2+
opencv-python-headless>=4.4.0.42
33
Pillow==9.3.0
4-
tqdm==4.58.0
5-
natsort==8.0.0
4+
tqdm>=4.58.0
5+
natsort>=8.0.0
