remove tensorflow and pytorch code for simplicity #47

Merged
merged 1 commit into from Feb 24, 2021
2 changes: 1 addition & 1 deletion .github/workflows/test_unittests.yml
@@ -24,7 +24,7 @@ jobs:
run: |
python -m pip install --upgrade --user pip --quiet
python -m pip install coverage codecov --upgrade-strategy only-if-needed --quiet
python -m pip install -e .['tests, torch']
python -m pip install -e .['tests']
python --version
pip --version
python -m pip list
34 changes: 21 additions & 13 deletions README.md
@@ -9,24 +9,20 @@
this package makes it simpler to obtain media data from the GBIF database for training __machine learning classification__ tasks. It wraps the [GBIF API](https://www.gbif.org/developer/summary) and supports querying the API directly to obtain and download a list of media URLs.
Existing saved queries can also be retrieved via the GBIF download API simply by providing a GBIF DOI key.
The package provides an efficient downloader that uses Python's asyncio module to speed up the download of the many small files such datasets typically consist of.
Ultimately `gbif-dl` can also directly return [pytorch]() or [tensorflow]() data loaders.

## Installation

Installation can be done via pip.
`
pip install gbif-dl
`

If pytorch or tensorflow dataset shall be returned, additional dependencies can be installed e.g. using `pip install gbif-dl['pytorch']`.

```
pip install gbif-dl
```
## Usage

The usage of `gbif-dl` helps users to create their own GBIF based media pipeline for training machine learning models. The package provides three core functionalities as followed:
The usage of `gbif-dl` helps users to create their own GBIF-based media pipeline for training machine learning models. The package provides two core functionalities, as follows:

1. `gbif-dl.generators`: Generators provide image urls from the GBIF database given queries or a pre-defined URL.
2. `gbif-dl.io`: Provides efficient media downloading to write the data to a storage device.
3. `gbif-dl.dataloaders`: Provide simple dataloaders for `PyTorch` and `Tensorflow` to access the downloaded data.

### 1. Retrieve media urls from GBIF

@@ -153,18 +149,30 @@ gbif_dl.io.download(data_generator, root="my_dataset")

The downloader provides very fast download speeds by using an async queue. Some fail-safe functionality is provided by setting the number of `retries`, which defaults to 3.
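The actual `gbif-dl` implementation is not shown in this diff; below is a minimal stdlib-only sketch of the pattern the README describes — an asyncio queue drained by concurrent workers, with per-file retries. All names are hypothetical and the network I/O is simulated:

```python
import asyncio
import random

async def download_one(url: str, retries: int = 3) -> str:
    """Simulated download of one small file, retried on transient failure."""
    for attempt in range(1, retries + 1):
        try:
            await asyncio.sleep(0)  # placeholder for the real network I/O
            if random.random() < 0.3 and attempt < retries:
                raise ConnectionError("transient failure")
            return f"saved:{url}"
        except ConnectionError:
            continue  # try again until the retry budget is exhausted
    return f"failed:{url}"

async def download_all(urls, workers: int = 4, retries: int = 3):
    """Drain a shared queue with several concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []

    async def worker():
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results.append(await download_one(url, retries=retries))

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results

results = asyncio.run(
    download_all([f"https://example.org/img{i}.jpg" for i in range(10)])
)
```

The queue-plus-workers shape is what makes many small downloads fast: the concurrency level is bounded by the worker count rather than by the number of files.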

### Training Datasets/Dataloaders
### Training Datasets

#### PyTorch

`gbif-dl` makes it simple to train a PyTorch image classification model by providing a standard `torch.dataset`. Users can directly pass a query or dwca generator to the dataset and enable downloading, to simplify the code.
`gbif-dl` makes it simple to train a PyTorch image classification model, e.g. by using `torchvision.datasets.ImageFolder`. Each item yielded by the `data_generator` can be randomly assigned to a `train` or `test` subset using `random_subsets`, so users can consume the resulting subsets directly.

```python
from gbif_dl.dataloaders.torch import GBIFImageDataset
dataset = GBIFImageDataset(root='my_dataset', generator=data_generator, download=True)
import torchvision
gbif_dl.io.download(data_generator, root="my_dataset", random_subsets={'train': 0.9, 'test': 0.1})
train_dataset = torchvision.datasets.ImageFolder(root='my_dataset/train', ...)
test_dataset = torchvision.datasets.ImageFolder(root='my_dataset/test', ...)
```

> ⚠️ Note that we do not provide train/validation/test splits of the dataset, as these are best designed specifically for the downstream task.
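The diff does not show how `random_subsets` performs the assignment internally; purely as an illustration, a ratio-weighted random assignment could be sketched with the standard library (the function name and signature are hypothetical, not part of the `gbif-dl` API):

```python
import random

def assign_subset(random_subsets: dict, rng: random.Random) -> str:
    """Pick a subset name with probability proportional to its configured ratio."""
    names = list(random_subsets)
    weights = list(random_subsets.values())
    return rng.choices(names, weights=weights, k=1)[0]

# With {'train': 0.9, 'test': 0.1}, roughly 90% of items land in "train".
rng = random.Random(0)  # seeded for reproducibility
labels = [assign_subset({"train": 0.9, "test": 0.1}, rng) for _ in range(1000)]
```

Each downloaded file would then be written under `my_dataset/<subset>/...`, which is exactly the layout `ImageFolder` expects.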
#### Tensorflow

The simplest way to build a `tf.data.Dataset` pipeline from a data generator is to use `tf.keras.preprocessing.image_dataset_from_directory`.
Similarly to the PyTorch example, users just need to provide the root paths of the downloaded datasets.

```python
import tensorflow as tf
gbif_dl.io.download(data_generator, root="my_dataset", random_subsets={'train': 0.9, 'test': 0.1})
tf.keras.preprocessing.image_dataset_from_directory('my_dataset/train', labels="inferred", label_mode="categorical")
tf.keras.preprocessing.image_dataset_from_directory('my_dataset/test', labels="inferred", label_mode="categorical")
```

## FAQ

Empty file removed gbif_dl/dataloaders/__init__.py
Empty file.
31 changes: 0 additions & 31 deletions gbif_dl/dataloaders/tensorflow.py

This file was deleted.

43 changes: 0 additions & 43 deletions gbif_dl/dataloaders/torch.py

This file was deleted.

6 changes: 1 addition & 5 deletions setup.py
@@ -28,11 +28,7 @@
"tqdm",
"typing-extensions; python_version < '3.8'",
],
extras_require={
"tests": ["pytest"],
"torch": ["torch>=1.7.0", "torchvision"],
"tensorflow": ["tensorflow>=2.4.0"],
},
extras_require={"tests": ["pytest"]},
# entry_points={"console_scripts": ["gbif_dl=gbif_dl.cli:download"]},
packages=find_packages(),
include_package_data=True,
34 changes: 0 additions & 34 deletions tests/test_tf.py

This file was deleted.

34 changes: 0 additions & 34 deletions tests/test_torch.py

This file was deleted.