This extension contains harvester plugins for harvesting from sources used by NextGEOSS as well as a metaharvester plugin for adding additional tags and metadata retrieved from an iTag instance.
- What's in the repository
- Basic usage
- Harvesting Sentinel products
- Harvesting CMEMS products
- Harvesting GOME-2 products
- Harvesting PROBA-V products
- Harvesting Static EBVs
- Harvesting GLASS LAI products
- Harvesting Plan4All products
- Harvesting DEIMOS-2 products
- Harvesting EBAS-NILU products
- Harvesting SIMOcean products
- Harvesting EPOS-Sat products
- Harvesting MODIS products
- Harvesting GDACS Average Flood products
- Harvesting Food Security pilot outputs
- Harvesting SAEON products
- Harvesting Landsat-8 outputs
- Harvesting VITO CGS S1 products
- Harvesting Cold Regions pilot outputs
- Harvesting NOA Groundsegment products
- Harvesting MELOA products
- Harvesting FSSCAT products
- Harvesting NOA GeObservatory products
- Harvesting Energy Data products
- Developing new harvesters
- iTag
- Testing testing testing
- Suggested cron jobs
- Logs
The repository contains the following plugins:
- `nextgeossharvest`, the base CKAN plugin
- `esa`, a harvester plugin for harvesting Sentinel products from SciHub, NOA, and CODE-DE via their DHuS interfaces
- `cmems`, a harvester plugin for harvesting the following types of CMEMS products:
  - Arctic Ocean Physics Analysis and Forecast (OCN)
  - Global Observed Sea Surface Temperature (SST)
  - Antarctic Ocean Observed Sea Ice Concentration (SIC South)
  - Arctic Ocean Observed Sea Ice Concentration (SIC North)
- `gome2`, a harvester plugin for harvesting the following types of GOME-2 coverage products:
  - GOME2_O3
  - GOME2_NO2
  - GOME2_TropNO2
  - GOME2_SO2
  - GOME2_SO2mass
- `itag`, a harvester plugin for adding additional tags and metadata to datasets that have already been harvested (more on this later)
- Run `python setup.py develop` in the `ckanext-nextgeossharvest` directory.
- Run `pip install -r requirements.txt` in the `ckanext-nextgeossharvest` directory.
- You will also need the following CKAN extensions: `ckanext-harvest` and `ckanext-spatial`.
- You will want to configure `ckanext-spatial` to use `solr-spatial-field` for the spatial search backend. Instructions can be found here: http://docs.ckan.org/projects/ckanext-spatial/en/latest/spatial-search.html. You cannot use `solr` as the spatial search backend because `solr` only supports footprints that are effectively bounding boxes (polygons composed of five points), while the footprints of the datasets harvested by these plugins can be considerably more complex. Using `postgis` as the spatial search backend is strongly discouraged, as it will choke on the large numbers of datasets that these harvesters will pull down.
- Add the harvester and spatial plugins to the list of plugins in your `.ini` file, as well as `nextgeossharvest` and any of the NextGEOSS harvester plugins that you want to use.
- If you will be harvesting from SciHub, NOA or CODE-DE, add your username and password to `ckanext.nextgeossharvest.nextgeoss_username=` and `ckanext.nextgeossharvest.nextgeoss_password=` in your `.ini` file. The credentials are stored here rather than in the source config partly for security reasons and partly because of the way the extension is deployed. (It may make sense to move them to the source config in the future.)
- If you want to log the response times and status codes of requests to harvest sources, you must include `ckanext.nextgeossharvest.provider_log_dir=/path/to/your/logs` in your `.ini` file. The log entries will look like this: `INFO | esa_scihub | 2018-03-08 14:17:04.474262 | 200 | 2.885231s` (the second field will always be 12 characters and will be padded if necessary).
- Create a cron job like the following so that your harvest jobs will be marked Finished when complete: `0 * * * * paster --plugin=ckanext-harvest harvester run -c /srv/app/production.ini >> /var/log/cron.log 2>&1`
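For reference, a provider-log entry like the one above can be split back into its fields with a few lines of Python. This is just an illustrative sketch based on the pipe-delimited format shown above, not part of the extension:

```python
# Illustrative only: split a provider-log entry into its five fields,
# assuming the " | "-delimited format described above.
line = "INFO | esa_scihub   | 2018-03-08 14:17:04.474262 | 200 | 2.885231s"
level, provider, timestamp, status, elapsed = [field.strip() for field in line.split("|")]
print(provider, status, elapsed)  # -> esa_scihub 200 2.885231s
```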
To harvest Sentinel products, activate the `esa` plugin, which you will use to create a harvester that harvests from SciHub, NOA or CODE-DE. To harvest from more than one of those sources, just create more than one harvester and point each one at a different source.
Note: The configuration object is required for all of these harvesters.
Create a new harvest source and select `ESA Sentinel Harvester New`. The URL does not matter—the harvester only harvests from SciHub, NOA, or CODE-DE, depending on the configuration below.
To harvest from SciHub, `source` must be set to `"esa_scihub"` in the configuration. See Sentinel settings (SciHub, NOA & CODE-DE) for a complete description of the settings.
Note: you must place your username and password in the `.ini` file as described above.
After saving the configuration, you can click Reharvest and the job will begin (assuming you have a cron job like the one described above). Alternatively, you can use the paster command `run_test` described in the `ckanext-harvest` documentation to run the harvester without setting up the gather consumer, etc.
Create a new harvest source and select `ESA Sentinel Harvester New`. The URL does not matter—the harvester only harvests from SciHub, NOA, or CODE-DE, depending on the configuration below.
To harvest from NOA, `source` must be set to `"esa_noa"` in the configuration. See Sentinel settings (SciHub, NOA & CODE-DE) for a complete description of the settings.
Note: you must place your username and password in the `.ini` file as described above.
After saving the configuration, you can click Reharvest and the job will begin (assuming you have a cron job like the one described above). Alternatively, you can use the paster command `run_test` described in the `ckanext-harvest` documentation to run the harvester without setting up the gather consumer, etc.
Create a new harvest source and select `ESA Sentinel Harvester New`. The URL does not matter—the harvester only harvests from SciHub, NOA, or CODE-DE, depending on the configuration below.
To harvest from CODE-DE, `source` must be set to `"esa_code"` in the configuration. See Sentinel settings (SciHub, NOA & CODE-DE) for a complete description of the settings.
Note: you must place your username and password in the `.ini` file as described above.
After saving the configuration, you can click Reharvest and the job will begin (assuming you have a cron job like the one described above). Alternatively, you can use the paster command `run_test` described in the `ckanext-harvest` documentation to run the harvester without setting up the gather consumer, etc.
- `source`: (required, string) determines whether the harvester harvests from SciHub, NOA, or CODE-DE. To harvest from SciHub, use `"source": "esa_scihub"`. To harvest from NOA, use `"source": "esa_noa"`. To harvest from CODE-DE, use `"source": "esa_code"`.
- `update_all`: (optional, boolean, default is `false`) determines whether or not the harvester updates datasets that already have metadata from this source. For example: if we have `"update_all": true`, and dataset Foo has already been created or updated by harvesting from SciHub, then it will be updated again when the harvester runs. If we have `"update_all": false` and Foo has already been created or updated by harvesting from SciHub, then the dataset will not be updated when the harvester runs. And regardless of whether `update_all` is `true` or `false`, if a dataset has not been created or updated with metadata from SciHub (it's new, or it was created via NOA or CODE-DE and has no SciHub metadata), then it will be updated with the additional SciHub metadata.
- `start_date`: (optional, datetime string; the default is "any", i.e. "from the earliest date onwards", if the harvester is new, or the ingestion date of the most recently harvested product if it has been run before) determines the start of the date range for harvester queries. Example: `"start_date": "2018-01-16T10:30:00.000Z"`. Note that the entire datetime string is required; `2018-01-01` is not valid. Using full datetimes is especially useful when testing, as it is possible to restrict the number of possible results by searching only within a small time span, like 20 minutes.
- `end_date`: (optional, datetime string, default is "now" or "to the latest possible date") determines the end of the date range for harvester queries. Example: `"end_date": "2018-01-16T11:00:00.000Z"`. Note that the entire datetime string is required; `2018-01-01` is not valid. Using full datetimes is especially useful when testing, as it is possible to restrict the number of possible results by searching only within a small time span, like 20 minutes.
- `product_type`: (optional, string) determines the Sentinel collection (product type) considered by the harvester when querying the data provider interface. The possible values are `SLC`, `GRD`, `OCN`, `S2MSI1C`, `S2MSI2A`, `S2MSI2Ap`, `OL_1_EFR___`, `OL_1_ERR___`, `OL_2_LFR___`, `OL_2_LRR___`, `SR_1_SRA___`, `SR_1_SRA_A_`, `SR_1_SRA_BS`, `SR_2_LAN___`, `SL_1_RBT___`, `SL_2_LST___`, `SY_2_SYN___`, `SY_2_V10___`, `SY_2_VG1___` or `SY_2_VGP___`. If no product_type is provided, the harvester behaves as usual and considers all product types.
- `aoi`: (optional, string containing a POLYGON) determines the Area of Interest considered by the harvester when querying the data provider interface. The AOI must be provided in the following format: `POLYGON((-180 -90,-180 90,180 90,180 -90,-180 -90))`. More points can be added to the polygon. If no aoi is provided, the harvester searches globally.
- `datasets_per_job`: (optional, integer, defaults to 1000) determines the maximum number of products that will be harvested during each job. If a query returns 2,501 results, only the first 1000 will be harvested if you're using the default. This is useful for running the harvester via recurring jobs intended to harvest products incrementally (i.e., you want to start from the beginning and harvest all available products). The harvester will harvest products in groups of 1000, rather than attempting to harvest all x-hundred-thousand at once. You'll get feedback after each job, so you'll know if there are errors without waiting for the whole job to run. And the harvester will automatically resume from the most recently harvested dataset if you're running it via a recurring cron job.
- `timeout`: (optional, integer, defaults to 4) determines the number of seconds to wait before timing out a request.
- `skip_raw`: (optional, boolean, defaults to false) determines whether RAW products are skipped or included in the harvest.
- `make_private`: (optional, boolean, defaults to `false`) If `true`, the datasets created by the harvester will be marked private. This setting is not retroactive. It only applies to datasets created by the harvester while the setting is `true`.
Example configuration with all variables present:
{
"source": "esa_scihub",
"update_all": false,
"start_date": "2018-01-16T10:30:00.000Z",
"end_date": "2018-01-16T11:00:00.000Z",
"datasets_per_job": 100,
"timeout": 4,
"skip_raw": true,
"make_private": false
}
{
"source": "esa_scihub",
"update_all": false,
"start_date": "2019-01-01T00:00:00.000Z",
"aoi": "POLYGON((2.0524444097380456 51.60572085265915,5.184653052425238 51.67771256185287,7.138937077349725 50.43826001622307,5.612989277066222 49.25292867929642,1.9721313676178616 50.83443942461676,2.0524444097380456 51.60572085265915,2.0524444097380456 51.60572085265915))",
"product_type": "S2MSI2A",
"datasets_per_job": 100,
"timeout": 20,
"skip_raw": true,
"make_private": false
}
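If you generate these datetime strings programmatically (for example when testing with a small time window), something like the following helper produces the full format the harvester expects. This is an illustrative snippet, not part of the extension:

```python
from datetime import datetime, timedelta

def sentinel_datetime(dt):
    # Full datetime string, e.g. "2018-01-16T10:30:00.000Z".
    # Date-only values such as "2018-01-01" are not accepted.
    return dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")

# Example: a 30-minute window ending now (UTC), handy for test runs.
end = datetime.utcnow()
start = end - timedelta(minutes=30)
print(sentinel_datetime(start), sentinel_datetime(end))
```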
Note: you must place your username and password in the .ini
file as described above.
To harvest from more than one Sentinel source, just create a harvester source for each Sentinel source. For example, to harvest from all three sources:
- Create a harvest source called (just a suggestion) "SciHub Harvester", select `ESA Sentinel Harvester New` and make sure that the configuration contains `"source": "esa_scihub"`.
- Create a harvest source called (just a suggestion) "NOA Harvester", select `ESA Sentinel Harvester New` and make sure that the configuration contains `"source": "esa_noa"`.
- Create a harvest source called (just a suggestion) "CODE-DE Harvester", select `ESA Sentinel Harvester New` and make sure that the configuration contains `"source": "esa_code"`.
You'll probably want to specify start and end times as well as the number of datasets per job for each harvester. If you don't, don't worry—the default number of datasets per job is 1000, so you won't be flooded with datasets.
Then just run each of the harvesters. You can run them all at the same time. If a product has already been harvested by another harvester, then the other harvesters will only update the existing dataset and add additional resources and metadata. They will not overwrite the resources and metadata that already exist (e.g., the SciHub harvester won't replace resources from CODE-DE with resources from SciHub, it will just add SciHub resources to the dataset alongside the existing CODE-DE resources).
The three (really, two) Sentinel harvesters all inherit from the same base harvester classes. As mentioned above, the SciHub, NOA and CODE-DE "harvesters" are all the same harvester with different configurations. The "source"
configuration is a switch that 1) causes the harvester to use a different base URL for querying the OpenSearch service and 2) changes the labels added to the resources. In all cases, the same methods are used for creating/updating the datasets.
The workflow for all the harvesters is:
- Gather: query the OpenSearch service for all products within the specified date range, then page through the results, creating harvest objects for each entry. Each harvest object contains the content of the entry, which will be parsed later, as well as a preliminary test of whether the product already exists in CKAN or not.
- Fetch: just returns true, as the OpenSearch service already provides all the content in the gather stage.
- Import: parse the content of each harvest object and then either create a new dataset for the respective product, or update an existing dataset. It's possible that another harvester may have created a dataset for the product before the current import phase began, so if creating a dataset for a "new" product fails because the dataset already exists, the harvester catches the exception and performs an update instead. For the sake of simplicity, the create and update pipelines are the same. The only difference is the API call at the end. All three harvesters can run at the same time, harvesting from the same date range, without conflicts.
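As a rough sketch of the create-or-update fallback described in the import step (assuming a `package_dict` has already been parsed from the harvest object's content; the surrounding harvester plumbing is omitted and this is not the extension's exact code):

```python
import ckan.plugins.toolkit as toolkit

def create_or_update(context, package_dict):
    """Try to create the dataset; if it already exists, update it instead."""
    try:
        return toolkit.get_action("package_create")(context, package_dict)
    except toolkit.ValidationError:
        # Another harvester may have created the dataset for this product
        # after the gather stage ran, so fall back to an update.
        return toolkit.get_action("package_update")(context, package_dict)
```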
The created/updated counts for each harvester job will be accurate. The count that appears in the sidebar on each harvester's page, however, will not be accurate. Besides issues with how Solr updates the harvest_source_id
associated with each dataset, the fact that up to three harvesters may be creating or updating a single dataset means that only one harvest source can "own" a dataset at any given time. If you need to evaluate the performance of a harvester, use the job reports.
To harvest CMEMS products, activate the cmems
plugin, which you will use to create a harvester that harvests one of the following types of CMEMS product:
- Global Observed Sea Surface Temperature (sst) from ftp://nrt.cmems-du.eu/Core/SST_GLO_SST_L4_NRT_OBSERVATIONS_010_001/METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2/
- Arctic Ocean Observed Sea Ice Concentration (sic_north) from ftp://mftp.cmems.met.no/Core/SEAICE_GLO_SEAICE_L4_NRT_OBSERVATIONS_011_001/METNO-GLO-SEAICE_CONC-NORTH-L4-NRT-OBS/
- Antarctic Ocean Observed Sea Ice Concentration (sic_south) from ftp://mftp.cmems.met.no/Core/SEAICE_GLO_SEAICE_L4_NRT_OBSERVATIONS_011_001/METNO-GLO-SEAICE_CONC-SOUTH-L4-NRT-OBS/
- Arctic Ocean Physics Analysis and Forecast (ocn) from ftp://mftp.cmems.met.no/Core/ARCTIC_ANALYSIS_FORECAST_PHYS_002_001_a/dataset-topaz4-arc-myoceanv2-be/
- Global Ocean Gridded L4 Sea Surface Heights and Derived Variables NRT (slv) from ftp://nrt.cmems-du.eu/Core/SEALEVEL_GLO_PHY_L4_NRT_OBSERVATIONS_008_046/dataset-duacs-nrt-global-merged-allsat-phy-l4
- Global Ocean Physics Analysis and Forecast - Hourly (gpaf) from ftp://nrt.cmems-du.eu/Core/GLOBAL_ANALYSIS_FORECAST_PHY_001_024/global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh
- Global Total Surface and 15m Current - Hourly (mog) from ftp://nrt.cmems-du.eu/Core/MULTIOBS_GLO_PHY_NRT_015_003/dataset-uv-nrt-hourly
To harvest more than one of those types of product, just create more than one harvester and configure a different harvester_type
.
The URL you enter in the harvester GUI does not matter--the plugin determines the correct URL based on the harvester_type
.
The different products are hosted on different services, so separate harvesters are necessary for ensuring that the harvesting of one is not affected by errors or outages on the others.
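To make the relationship between `harvester_type` and the sources listed above concrete, the mapping conceptually looks like the following. This is an illustrative sketch using the FTP locations listed above, not the plugin's actual code:

```python
# Illustrative mapping from harvester_type to the FTP locations listed above.
CMEMS_SOURCES = {
    "sst": "ftp://nrt.cmems-du.eu/Core/SST_GLO_SST_L4_NRT_OBSERVATIONS_010_001/"
           "METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2/",
    "sic_north": "ftp://mftp.cmems.met.no/Core/SEAICE_GLO_SEAICE_L4_NRT_OBSERVATIONS_011_001/"
                 "METNO-GLO-SEAICE_CONC-NORTH-L4-NRT-OBS/",
    "sic_south": "ftp://mftp.cmems.met.no/Core/SEAICE_GLO_SEAICE_L4_NRT_OBSERVATIONS_011_001/"
                 "METNO-GLO-SEAICE_CONC-SOUTH-L4-NRT-OBS/",
    "ocn": "ftp://mftp.cmems.met.no/Core/ARCTIC_ANALYSIS_FORECAST_PHYS_002_001_a/"
           "dataset-topaz4-arc-myoceanv2-be/",
    "slv": "ftp://nrt.cmems-du.eu/Core/SEALEVEL_GLO_PHY_L4_NRT_OBSERVATIONS_008_046/"
           "dataset-duacs-nrt-global-merged-allsat-phy-l4",
    "gpaf": "ftp://nrt.cmems-du.eu/Core/GLOBAL_ANALYSIS_FORECAST_PHY_001_024/"
            "global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh",
    "mog": "ftp://nrt.cmems-du.eu/Core/MULTIOBS_GLO_PHY_NRT_015_003/dataset-uv-nrt-hourly",
}

def base_url(harvester_type):
    # The URL entered in the harvester GUI is ignored; the type picks the source.
    return CMEMS_SOURCES[harvester_type]
```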
- `harvester_type` determines which type of product will be harvested. It must be one of the following seven strings: `sst`, `sic_north`, `sic_south`, `ocn`, `gpaf`, `slv` or `mog`.
- `start_date` determines the start date for the harvester job. It must be the string `YESTERDAY` or a string describing a date in the format `YYYY-MM-DD`, like `2017-01-01`.
- `end_date` determines the end date for the harvester job. It must be the string `TODAY` or a string describing a date in the format `YYYY-MM-DD`, like `2017-01-01`. The end_date is not mandatory; if it is not included, the harvester will keep running until it catches up with the current day.

The harvester will harvest all the products available on the start date and on every date up to but not including the end date. If the start and end dates are `YESTERDAY` and `TODAY`, respectively, then the harvester will harvest all the products available yesterday but not any of the products available today. If the start and end dates are `2018-01-01` and `2018-02-01`, respectively, then the harvester will harvest all the products available in the month of January (and none from the month of February).

- `timeout` determines how long the harvester will wait for a response from a server before cancelling the attempt. It must be a positive integer. Not mandatory.
- `username` and `password` are your username and password for accessing the CMEMS products at the source for the harvester type you selected above.
- `make_private` is optional and defaults to `false`. If `true`, the datasets created by the harvester will be marked private. This setting is not retroactive. It only applies to datasets created by the harvester while the setting is `true`.
Examples of config:
{
"harvester_type":"slv",
"start_date":"2017-01-01",
"username":"your_username",
"password":"your_password",
"make_private":false
}
{
"harvester_type": "sic_south",
"start_date": "2017-01-01",
"end_date": "TODAY",
"timeout": 10,
"username": "your_username",
"password": "your_password",
"make_private": false
}
You can run the harvester on a Daily update frequency with `YESTERDAY` and `TODAY` as the start and end dates. Since requests may time out, you can also run the harvester more than once a day using the Manual update frequency and a cron job. There's no way to recover from outages at the moment; the CMEMS harvester could be more robust.
The GOME-2 harvester harvests products from the following GOME-2 coverages:
- GOME2_O3
- GOME2_NO2
- GOME2_TropNO2
- GOME2_SO2
- GOME2_SO2mass
Unlike other harvesters, the GOME-2 harvester only makes requests to verify that a product exists. It programmatically creates datasets and resources for products that do exist within the specified date range.
The GOME-2 harvester has two required and one optional setting.
- `start_date` (required) determines the date on which the harvesting begins. It must be in the format `YYYY-MM-DD` or the string `"YESTERDAY"`. If you want to harvest from the earliest product onwards, use `2007-01-04`. If you will be harvesting on a daily basis, use `"YESTERDAY"`.
- `end_date` (required) determines the date on which the harvesting ends. It must be in the format `YYYY-MM-DD` or the string `"TODAY"`. It is exclusive, i.e., if the end date is `2017-03-02`, then products will be harvested up to and including 2017-03-01 and no products from 2017-03-02 will be included. For daily harvesting use `"TODAY"`.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"start_date": "2017-03-01",
"end_date": "2017-03-02",
"make_private": false
}
or
{
"start_date": "YESTERDAY",
"end_date": "TODAY",
"make_private": false
}
- Add `gome2` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- The URL you enter does not matter--the GOME-2 harvester only makes requests to a predetermined set of URLs. Select `GOME2` from the list of harvesters.
- Add a config as described above.
- Select a frequency from the frequency options. If you want to use a cron job (recommended) to run the harvester, select `Manual`.
The PROBA-V harvester harvests products from the following collections:
- On time collections:
- PROBAV_L2A_1KM_V001
- Proba-V Level-1C
- Proba-V S1-TOC (1KM)
- Proba-V S1-TOA (1KM)
- Proba-V S10-TOC (1KM)
- Proba-V S10-TOC NDVI (1KM)
- One month delayed collections with 333M resolution:
- Proba-V Level-2A (333M)
- Proba-V S1-TOA (333M)
- Proba-V S1-TOC (333M)
- Proba-V S10-TOC (333M)
- Proba-V S10-TOC NDVI (333M)
- One month delayed collections with 100M resolution:
- Proba-V Level-2A (100M)
- Proba-V S1-TOA (100M)
- Proba-V S1-TOC (100M)
- Proba-V S1-TOC NDVI (100M)
- Proba-V S5-TOA (100M)
- Proba-V S5-TOC (100M)
- Proba-V S5-TOC NDVI (100M)
The products from the on time collections are created and published on the same day. The products from the delayed collections are published with a one-month delay after being created.
The collections were also split according to the resolution to avoid a huge number of datasets being harvested. L1C, L2A and S1 products are published daily. S5 products are published every 5 days. S10 products are published every 10 days. S1, S5 and S10 products are tiles covering almost the entire world. Each dataset corresponds to a single tile.
The PROBA-V harvester has the following configuration options:
- `start_date` (required) determines the date on which the harvesting begins. It must be in the format `YYYY-MM-DD`. If you want to harvest from the earliest product onwards, use `2018-01-01`.
- `end_date` (optional) determines the end date for the harvester job. It must be a string describing a date in the format `YYYY-MM-DD`, like 2018-01-31. The end_date is not mandatory; if it is not included, the harvester will keep running until it catches up with the current day. To limit the number of datasets per job, each job will harvest a maximum of 2 days of data.
- `username` and `password` are your username and password for accessing the PROBA-V products at the source.
- `collection` (required) defines the collection that will be collected. It can be `PROBAV_P_V001`, `PROBAV_S1-TOA_1KM_V001`, `PROBAV_S1-TOC_1KM_V001`, `PROBAV_S10-TOC_1KM_V001`, `PROBAV_S10-TOC-NDVI_1KM_V001`, `PROBAV_S1-TOA_100M_V001`, `PROBAV_S1-TOC-NDVI_100M_V001`, `PROBAV_S5-TOC-NDVI_100M_V001`, `PROBAV_S5-TOA_100M_V001`, `PROBAV_S5-TOC_100M_V001`, `PROBAV_S1-TOC_100M_V001`, `PROBAV_S1-TOA_333M_V001`, `PROBAV_S1-TOC_333M_V001`, `PROBAV_S10-TOC_333M_V001`, `PROBAV_S10-TOC-NDVI_333M_V001`, `PROBAV_L2A_1KM_V001`, `PROBAV_L2A_100M_V001` or `PROBAV_L2A_333M_V001`.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"start_date":"2018-08-01",
"collection":"PROBAV_S1-TOC_1KM_V001",
"username":"nextgeoss",
"password":"nextgeoss",
"make_private":false
}
{
"start_date":"2018-08-01",
"collection":"PROBAV_L2A_1KM_V001",
"username":"nextgeoss",
"password":"nextgeoss",
"make_private":false
}
{
"start_date":"2018-08-01",
"collection":"PROBAV_P_V001",
"username":"nextgeoss",
"password":"nextgeoss",
"make_private":false
}
The start_date for the delayed collections can be any date up to one month before the current day. For the on time collections the start_date can be any date.
- Add `probav` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `Proba-V Harvester` from the list of harvesters.
- Add a config as described above.
- Select a frequency from the frequency options. If you want to use a cron job (recommended) to run the harvester, select `Manual`.
- Run the harvester. It will programmatically create datasets.
The GLASS LAI harvester harvests products from the following collections:
- LAI_1KM_AVHRR_8DAYS_GL (from 1982 to 2015)
- LAI_1KM_MODIS_8DAYS_GL (from 2001 to 2015)
The GLASS LAI harvester has the following configuration options:
- `sensor` defines whether the harvester will collect products based on AVHRR (`avhrr`) or MODIS (`modis`).
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"sensor":"avhrr",
"make_private":false
}
{
"sensor":"modis",
"make_private":false
}
- Add `glass_lai` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `GLASS LAI Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options. The harvester only needs to run twice (with two different configurations).
- Run the harvester. It will programmatically create datasets.
The static EBVs harvester harvests products from the following collections:
- TREE_SPECIES_DISTRIBUTION_HABITAT_SUITABILITY
- FLOOD_HAZARD_EU_GL
- RSP_AVHRR_1KM_ANNUAL_USA
- EMODIS_PHENOLOGY_250M_ANNUAL_USA
The Static EBVs harvester has the following configuration option:
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"make_private":false
}
- Add `ebvs` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `EBVs` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options. The harvester only needs to run once because the datasets are static.
The Plan4All harvester harvests products from the following collections:
- Open Land Use Map (from the European Project: Plan4All)
The Plan4All harvester has the following configuration options:
- `datasets_per_job` (optional, integer, defaults to 100) determines the maximum number of products that will be harvested during each job. If a query returns 2,501 results, only the first 100 will be harvested if you're using the default. This is useful for running the harvester via recurring jobs intended to harvest products incrementally (i.e., you want to start from the beginning and harvest all available products). The harvester will harvest products in groups of 100, rather than attempting to harvest all x-hundred-thousand at once. You'll get feedback after each job, so you'll know if there are errors without waiting for the whole job to run. And the harvester will automatically resume from the most recently harvested dataset if you're running it via a recurring cron job.
- `timeout` (optional, integer, defaults to 60) determines the number of seconds to wait before timing out a request.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"datasets_per_job": 10,
"timeout": 60,
"make_private": false
}
- Add `plan4all` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `Plan4All Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The MODIS harvester harvests products from the following collections, which can be divided by time resolution:
- 8 days:
- MOD17A2H (currently 249848 datasets, starting at 2000-02-18T00:00:00Z)
- MYD15A2H (currently 218456 datasets, starting at 2002-07-04T00:00:00Z)
- MOD15A2H (currently 248647 datasets, starting at 2000-02-18T00:00:00Z)
- MOD14A2 (currently 255161 datasets, starting at 2000-02-18T00:00:00Z)
- MYD14A2 (currently 223424 datasets, starting at 2002-07-04T00:00:00Z)
- 16 days:
- MYD13Q1 (currently 110551 datasets, starting at 2002-07-04T00:00:00Z)
- MYD13A1 (currently 110551 datasets, starting at 2002-07-04T00:00:00Z)
- MYD13A2 (currently 110540 datasets, starting at 2002-07-04T00:00:00Z)
- MOD13Q1 (currently 126268 datasets, starting at 2000-02-18T00:00:00Z)
- MOD13A1 (currently 126268 datasets, starting at 2000-02-18T00:00:00Z)
- MOD13A2 (currently 126268 datasets, starting at 2000-02-18T00:00:00Z)
- Yearly:
- MOD17A3H (currently 4110 datasets, starting at 2000-12-26T00:00:00Z)
All collections, with the exception of collection MOD17A3H, are updated on a weekly / biweekly basis. Collection MOD17A3H is the only collection that is static; its last dataset refers to 2015-01-03.
Because granule queries now require collection identifiers, each collection has to be harvested with a separate harvester.
The MODIS harvester has the following configuration options:
- `collection` (required) defines the collection that will be collected. It can be `MYD13Q1`, `MYD13A1`, `MYD13A2`, `MOD13Q1`, `MOD13A1`, `MOD13A2`, `MOD17A3H`, `MOD17A2H`, `MYD15A2H`, `MOD15A2H`, `MOD14A2` or `MYD14A2`.
- `start_date` (required) determines the date on which the harvesting begins. It must be in the format `YYYY-MM-DDTHH:MM:SSZ`. If you want to harvest from the earliest product onwards, use the starting dates presented in "Harvesting MODIS products".
- `timeout` (optional, integer, defaults to 10) determines the number of seconds to wait before timing out a request.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"collection": "MYD13Q1",
"start_date": "2002-07-04T00:00:00Z",
"make_private": false
}
- Add `modis` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `MODIS Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
The GDACS harvester harvests products from the following collections:
- AVERAGE_FLOOD_SIGNAL (from 1997/12 to present - 1 dataset per day)
- AVERAGE_FLOOD_MAGNITUDE (from 1997/12 to present - 1 dataset per day)
The GDACS harvester has the following configuration options:
- `data_type` determines which collection will be harvested. It must be one of the following two strings: `signal` or `magnitude`.
- `request_check` determines whether the URL of each harvested dataset will be tested. It must be one of the following two strings: `yes` or `no`.
- `start_date` determines the start date for the harvester job. It must be the string `YESTERDAY` or a string describing a date in the format `YYYY-MM-DD`, like `1997-12-01`.
- `end_date` determines the end date for the harvester job. It must be the string `TODAY` or a string describing a date in the format `YYYY-MM-DD`, like `1997-12-01`. The end_date is not mandatory; if it is not included, the harvester will keep running until it catches up with the current day.
- `timeout` determines how long the harvester will wait for a response from a server before cancelling the attempt. It must be a positive integer. Not mandatory.
- `make_private` is optional and defaults to `false`. If `true`, the datasets created by the harvester will be marked private. This setting is not retroactive. It only applies to datasets created by the harvester while the setting is `true`.
{
"data_type":"signal",
"request_check":"yes",
"start_date":"1997-12-01",
"make_private":false
}
or
{
"data_type":"magnitude",
"request_check":"yes",
"start_date":"1997-12-01",
"make_private":false
}
- Add `gdacs` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `GDACS` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The DEIMOS-2 harvester harvests products from the following collections:
- DEIMOS-2 PM4 Level-1B
- DEIMOS-2 PSH Level-1B
- DEIMOS-2 PSH Level-1C
The number of products is static, and thus the harvester only needs to be run once.
The DEIMOS-2 harvester has the following configuration options:
- `harvester_type` determines the FTP domain, as well as the directories in said domain.
- `username` and `password` are your username and password for accessing the DEIMOS-2 products at the source for the harvester type you selected above.
- `timeout` (optional, integer, defaults to 60) determines the number of seconds to wait before timing out a request.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"harvester_type":"deimos_imaging",
"username":"your_username",
"password":"your_password",
"make_private":false
}
- Add `deimosimg` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `DEIMOS Imaging` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The EBAS-NILU harvester collects products from the following collections:
- EBAS NILU Data Archive
The EBAS-NILU harvester has the following configuration options:
- `start_date`: (optional, datetime string; if the harvester has been run before, harvesting resumes from the ingestion date of the most recently harvested product) determines the start of the date range for harvester queries. Example: "start_date": "2018-01-16T10:30:00Z". Note that the entire datetime string is required; `2018-01-01` is not valid.
- `end_date`: (optional, datetime string, default is "NOW") determines the end of the date range for harvester queries. Example: "end_date": "2018-01-16T11:00:00Z". Note that the entire datetime string is required; `2018-01-01` is not valid.
- `timeout`: (optional, integer, defaults to 10) determines the number of seconds to wait before timing out a request.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"start_date": "2017-01-01T00:00:00Z",
"timeout": 4,
"make_private": false
}
- Add `ebas` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `EBAS Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
The SIMOcean harvester harvests products from the following collections:
- SIMOcean Mean Sea Level Pressure Forecast
- SIMOcean Port Sea State Forecast From SMARTWAVE
- SIMOcean Tidal Data
- SIMOcean Sea Surface Temperature Forecast
- SIMOcean Data From Multiparametric Buoys
- SIMOcean Surface Wind Forecast From AROME
- SIMOcean Cloudiness Forecast From AROME
- SIMOcean Air Surface Temperature Forecast
- SIMOcean Surface Currents From HF Radar
- SIMOcean Sea Wave Period Forecast
- SIMOcean Nearshore Sea State Forecast From SWAN
- SIMOcean Sea Surface Wind Forecast
- SIMOcean Precipitation Forecast From AROME
- SIMOcean Significant Wave Height Forecast
- SIMOcean Sea Wave Direction Forecast
- SIMOcean Surface Forecast From HYCOM
New products of these collections are created and published daily.
The SIMOcean harvester has the following configuration options:
- `start_date`: (required, datetime string; if the harvester has been run before, harvesting resumes from the ingestion date of the most recently harvested product) determines the start of the date range for harvester queries. Example: "start_date": "2018-01-16T10:30:00Z". Note that the entire datetime string is required; `2018-01-01` is not valid.
- `end_date`: (optional, datetime string, default is "NOW") determines the end of the date range for harvester queries. Example: "end_date": "2018-01-16T11:00:00Z". Note that the entire datetime string is required; `2018-01-01` is not valid.
- `datasets_per_job`: (optional, integer, defaults to 100) determines the maximum number of products that will be harvested during each job.
- `timeout`: (optional, integer, defaults to 10) determines the number of seconds to wait before timing out a request.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"start_date": "2017-01-01T00:00:00Z",
"timeout": 4,
"datasets_per_job": 100,
"make_private": false
}
- Add `simocean` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `SIMOcean Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
The EPOS-Sat harvester harvests products from the following collections:
- Unwrapped Interferogram
- Wrapped Interferogram
- LOS Displacement Timeseries
- Spatial Coherence
- Interferogram APS Global Model
- Map of LOS Vector
The number of products is low because currently only sample data are available. A large quantity of data is expected to start being injected in September 2019.
The EPOS-Sat harvester has the following configuration options:
- `collection` (required) defines the collection that will be collected. It can be `inu`, `inw`, `dts`, `coh`, `aps` or `cosneu`.
- `start_date` (required) determines the date on which the harvesting begins. It must be in the format `YYYY-MM-DDTHH:MM:SSZ`. If you want to harvest from the earliest product onwards, use `2010-01-01T00:00:00Z`.
- `end_date` (optional) determines the date on which the harvesting ends. It must be in the format `YYYY-MM-DDTHH:MM:SSZ`; it defaults to `TODAY`.
- `datasets_per_job` (optional, integer, defaults to 100) determines the maximum number of products that will be harvested during each job.
- `timeout` (optional, integer, defaults to 4) determines the number of seconds to wait before timing out a request.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"collection": "inw",
"start_date": "2010-01-16T10:30:00Z",
"timeout": 4,
"make_private": false
}
- Add `epos` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `EPOS Sat Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The Food Security harvester harvests the VITO pilot outputs for the following collections:
1. NextGEOSS Sentinel-2 FAPAR
2. NextGEOSS Sentinel-2 FCOVER
3. NextGEOSS Sentinel-2 LAI
4. NextGEOSS Sentinel-2 NDVI
The date of the pilot outputs can differ from the current date, since the pilot processes old Sentinel data.
The Food Security harvester has the following configuration options:
- `start_date` (required) determines the date on which the harvesting begins. It must be in the format `YYYY-MM-DD`. If you want to harvest from the earliest product onwards, use `2017-01-01`.
- `end_date` (optional) determines the end date for the harvester job. It must be a string describing a date in the format `YYYY-MM-DD`, like 2018-01-31. The end_date is not mandatory; if it is not included, the harvester will keep running until it catches up with the current day. To limit the number of datasets per job, each job will harvest a maximum of 2 days of data.
- `username` and `password` are your username and password for accessing the PROBA-V products at the source.
- `collection` (required) defines the collection that will be collected. It can be `FAPAR`, `FCOVER`, `LAI` or `NDVI`.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"start_date":"2017-01-01",
"collection":"FAPAR",
"username":"nextgeoss",
"password":"nextgeoss"
}
- Add `foodsecurity` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `Food Security Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The VITO CGS S1 harvester collects the products of an external VITO project for the following collections:
1. VITO CGS S1
2. CGS S1 GRD L1
3. CGS S1 GRD SIGMA0 L1 (NOT AVAILABLE YET)
The VITO CGS S1 harvester has the following configuration options:
- `start_date` (required) determines the date on which the harvesting begins. It must be in the format `YYYY-MM-DD`. If you want to harvest from the earliest product onwards, use `2017-01-01`.
- `timeout` (optional, integer, defaults to 4) determines the number of seconds to wait before timing out a request.
- `username` and `password` are your username and password for accessing the products at the source.
- `collection` (required) defines the collection that will be collected. It can be `SLC_L1` or `GRD_L1`.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"start_date":"2018-01-01",
"collection":"SLC_L1",
"username":"username",
"password":"password",
"timeout":1,
"make_private":false
}
- Add `cgss1` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `VITO CGS S1 Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The Cold Regions harvester harvests the NERSC pilot outputs for the following collections:
1. Sentinel-1 HH/HV based ice/water classification
2. Sea ice and water classification in the Arctic for INTAROS 2018 field experiment
3. Sea ice and water classification in the Arctic for CAATEX/INTAROS 2019 field experiment
4. Average sea ice drift in the Arctic for INTAROS 2018 field experiment
5. Average sea ice drift in the Arctic for CAATEX 2019 field experiment
The Cold Regions harvester runs once per collection and collects all the Cold Regions datasets within the given collection (static data). In the command line, run:
$ python ./ckanext/nextgeossharvest/harvesters/coldregions.py <destination_ckan_URL> <destination_ckan_apikey> "nersc" <collection_id>
The following collection IDs are available:
- S1_ARCTIC_SEAICEEDGE_CLASSIFICATION
- S1_ARCTIC_SEAICEEDGE_CLASSIFICATION_INTAROS_2018
- S1_ARCTIC_SEAICEEDGE_CLASSIFICATION_CAATEX_INTAROS_2019
- S1_ARCTIC_SEAICEDRIFT_AVERAGE_INTAROS_2018
- S1_ARCTIC_SEAICEDRIFT_AVERAGE_CAATEX_2019
The Landsat-8 harvester collects the Level-1 data products generated from Landsat 8 Operational Land Imager (OLI)/Thermal Infrared Sensor (TIRS). The following collection 1 Tiers are harvested:
1. Landsat-8 Real-Time (RT)
2. Landsat-8 Tier 1 (T1)
3. Landsat-8 Tier 2 (T2)
The pre-processed products are not harvested because they are deleted within an interval of six months in favor of the calibrated products.
The Landsat-8 harvester has the following configuration options:
- `path` (optional) determines the WRS path at which the product collection will start.
- `row` (optional) determines the WRS row at which the product collection will start.
- `access_key` and `secret_key` (required) are your AWS account access and secret key.
- `bucket` (required) defines the AWS S3 bucket to harvest; for Landsat-8 use `landsat-pds`.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"path":1,
"row":1,
"access_key":"your_access_key",
"secret_key": "your_secret_key",
"bucket": "landsat-pds",
"make_private": false
}
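As background, the `path` and `row` settings correspond to the WRS path/row prefixes used by the public `landsat-pds` bucket. The sketch below shows how scene keys could be listed with boto3; it assumes anonymous access and the usual `c1/L8/{path}/{row}/` key layout, and it is not the harvester's own code:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client; the harvester itself uses the access_key/secret_key
# from its configuration instead.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Assumed key layout: Collection 1 scenes under c1/L8/{path}/{row}/,
# with path and row zero-padded to three digits.
response = s3.list_objects_v2(Bucket="landsat-pds", Prefix="c1/L8/001/001/", MaxKeys=10)
for item in response.get("Contents", []):
    print(item["Key"])
```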
- Add `landsat8` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `Landsat-8 Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The MELOA harvester harvests products from the following collections:
- MELOA Wavy Measurements - Littoral
- MELOA Wavy Measurements - Ocean
- MELOA Wavy Measurements - Basic
New products of these collections are created and published after the campaigns.
The MELOA harvester has the following configuration options:
- `start_date`: (required, datetime string; if the harvester has been run before, harvesting resumes from the ingestion date of the most recently harvested product) determines the start of the date range for harvester queries. Example: "start_date": "2019-10-01T00:00:00Z". Note that the entire datetime string is required; `2019-10-01` is not valid.
- `end_date`: (optional, datetime string, default is "NOW") determines the end of the date range for harvester queries. Example: "end_date": "2020-01-01T00:00:00Z". Note that the entire datetime string is required; `2020-01-01` is not valid.
- `datasets_per_job`: (optional, integer, defaults to 100) determines the maximum number of products that will be harvested during each job.
- `timeout`: (optional, integer, defaults to 10) determines the number of seconds to wait before timing out a request.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"start_date": "2019-10-01T00:00:00Z",
"timeout": 4,
"datasets_per_job": 100,
"make_private": false
}
- Add `meloa` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `MELOA Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
The SAEON harvester collects the products for the following collections:
1. Climate Systems Analysis Group (South Africa)
The SAEON harvester has the following configuration options:
- `datasets_per_job` (optional, integer, defaults to 100) determines the maximum number of products that will be harvested during each job. If a query returns 2,501 results, only the first 100 will be harvested if you're using the default. This is useful for running the harvester via recurring jobs intended to harvest products incrementally (i.e., you want to start from the beginning and harvest all available products). The harvester will harvest products in groups of 100, rather than attempting to harvest all x-hundred-thousand at once. You'll get feedback after each job, so you'll know if there are errors without waiting for the whole job to run. And the harvester will automatically resume from the most recently harvested dataset if you're running it via a recurring cron job.
- `timeout` (optional, integer, defaults to 60) determines the number of seconds to wait before timing out a request.
- `update_all` (optional, boolean, default is `false`) determines whether or not the harvester updates datasets that already have metadata from this source. For example: if we have "update_all": true, and dataset Foo has already been created or updated by harvesting, then it will be updated again when the harvester runs. If we have "update_all": false and Foo has already been created or updated by harvesting, then the dataset will not be updated when the harvester runs. And regardless of whether update_all is true or false, if a dataset has not been collected yet, it will be created in the catalogue.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
- `source_url` determines the base URL for the data source to query.
{
"datasets_per_job": 100,
"timeout": 60,
"make_private": false,
"source_url": "https://staging.saeon.ac.za"
}
- Add `saeon` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `SAEON Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The NOA Groundsegment harvester collects the products for the following instruments:
1. VIIRS (Visible Infrared Imaging Radiometer Suite)
2. MODIS (Moderate Resolution Imaging Spectroradiometer)
3. AIRS (Atmospheric InfraRed Sounder)
4. MERSI (Medium Resolution Spectral Imager)
5. AVHRR/3 (Advanced Very-High-Resolution Radiometer)
The NOA Groundsegment harvester configuration contains the following options:
- `start_date` (optional) determines the date on which the harvesting begins. It must be in the format `YYYY-MM-DDTHH:mm:ssZ`.
- `end_date` (optional) determines the date on which the harvesting ends. It must be in the format `YYYY-MM-DDTHH:mm:ssZ`.
- `username` (required) Enter your NOA Groundsegment username.
- `password` (required) Enter your NOA Groundsegment password.
- `page_timeout` (optional, integer, defaults to 2) determines the maximum number of pages that will be harvested during each job. If a query returns 25 pages, only the first 5 will be harvested if you're using the default. Each page corresponds to 100 products. This is useful for running the harvester via recurring jobs intended to harvest products incrementally (i.e., you want to start from the beginning and harvest all available products). The harvester will harvest products in groups of 500, rather than attempting to harvest all x-hundred-thousand at once. You'll get feedback after each job, so you'll know if there are errors without waiting for the whole job to run. And the harvester will automatically resume from the most recently harvested dataset if you're running it via a recurring cron job.
- `update_all` (optional, boolean, default is `false`) determines whether or not the harvester updates datasets that already have metadata from this source. For example: if we have "update_all": true, and dataset Foo has already been created or updated by harvesting, then it will be updated again when the harvester runs. If we have "update_all": false and Foo has already been created or updated by harvesting, then the dataset will not be updated when the harvester runs. And regardless of whether update_all is true or false, if a dataset has not been collected yet, it will be created in the catalogue.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"start_date":"2019-01-01T00:00:00Z",
"end_date":"2020-08-01T23:59:00Z",
"username":"your_username",
"password":"your_password",
"page_timeout": "2"
}
- Add `noa_groundsegment` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `NOA Groundsegment Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The FSSCAT harvester collects the products for the following file types:
1. FS1_GRF_L1B_CAL
2. FS1_GRF_L1B_SCI
3. FS1_GRF_L1C_SCI
4. FS1_GRF_L2__SIE
5. FS1_GRF_L3__ICM
6. FS1_MWR_L1B_SCI
7. FS1_MWR_L1C_SCI
8. FS1_MWR_L2A_TB_
9. FS1_MWR_L2B_SIT
10. FS1_MWR_L2B_SM_
11. FS1_MWR_L3__TB_
12. FS1_MWR_L3__SIT
13. FS1_MWR_L3__SM_
14. FS1_MWR_L4__SM_
15. FS2_HPS_L1C_SCI
16. FS2_HPS_L2__RDI
17. FSS_SYN_L4__SM_
The FSSCAT harvester has the following configuration options:
- `file_type` (required) determines the FSSCAT file type to be catalogued. It can be `FS1_GRF_L1B_CAL`, `FS1_GRF_L1B_SCI`, `FS1_GRF_L1C_SCI`, `FS1_GRF_L2__SIE`, `FS1_GRF_L3__ICM`, `FS1_MWR_L1B_SCI`, `FS1_MWR_L1C_SCI`, `FS1_MWR_L2A_TB_`, `FS1_MWR_L2B_SIT`, `FS1_MWR_L2B_SM_`, `FS1_MWR_L3__TB_`, `FS1_MWR_L3__SIT`, `FS1_MWR_L3__SM_`, `FS1_MWR_L4__SM_`, `FS2_HPS_L1C_SCI`, `FS2_HPS_L2__RDI` or `FSS_SYN_L4__SM_`.
- `start_date` (required) determines the date on which the harvesting begins. It must be in the format `YYYY-MM-DD`.
- `end_date` (optional) determines the date on which the harvesting ends. It must be in the format `YYYY-MM-DD`.
- `ftp_domain` (required) URL of the FSSCAT FTP.
- `ftp_path` (required) path on the FSSCAT FTP where the list of files is published.
- `ftp_pass` (required) password for the FSSCAT FTP.
- `ftp_user` (required) username of the user allowed to access the harvesting directory on the FSSCAT FTP.
- `ftp_port` (optional, integer, defaults to 21) port of the FSSCAT FTP.
- `ftp_timeout` (optional, integer, defaults to 20) determines the seconds until the timeout when accessing the FTP.
- `update_all` (optional, boolean, default is `false`) determines whether or not the harvester updates datasets that already have metadata from this source. For example: if we have "update_all": true, and dataset Foo has already been created or updated by harvesting, then it will be updated again when the harvester runs. If we have "update_all": false and Foo has already been created or updated by harvesting, then the dataset will not be updated when the harvester runs. And regardless of whether update_all is true or false, if a dataset has not been collected yet, it will be created in the catalogue.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
- `max_datasets` (optional, integer, defaults to 100) determines the maximum number of datasets to be catalogued.
{
"start_date": "2020-10-30",
"end_date": "2020-11-01",
"file_type": "FS1_GRF_L1C_SCI",
"ftp_domain": "<FSSCAT_FTP_DOMAIN>",
"ftp_path": "<FSSCAT_FTP_PATH>",
"ftp_pass": "<FSSCAT_FTP_PASS>",
"ftp_user": "<FSSCAT_FTP_USER>",
"ftp_port": 21,
"make_private": false,
"max_dataset": 10
}
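To illustrate how the `ftp_*` settings above fit together, here is a minimal sketch of connecting to the FTP server and listing the configured directory with Python's `ftplib`. The values are placeholders and this is not the harvester's implementation:

```python
from ftplib import FTP

# Placeholders corresponding to the ftp_* settings above.
ftp_domain, ftp_port, ftp_timeout = "ftp.example.org", 21, 20
ftp_user, ftp_pass = "your_username", "your_password"
ftp_path = "/path/to/published/files"

ftp = FTP()
ftp.connect(ftp_domain, ftp_port, timeout=ftp_timeout)
ftp.login(ftp_user, ftp_pass)
for name in ftp.nlst(ftp_path):  # list the files published for harvesting
    print(name)
ftp.quit()
```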
- Add `fsscat` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `FSSCAT Harvester` from the list of harvesters.
The NOA GeObservatory is activated in major geohazard events (earthquakes, volcanic activity, landslides, etc.) and automatically produces a series of Sentinel-1 based co-event interferograms (DInSAR) to map the surface deformation associated with the event. It also produces pre-event interferograms to be used as a benchmark.
This harvester collects the aforementioned interferograms.
The NOA GeObservatory harvester configuration contains the following options:
- `start_date` (optional) determines the date on which the harvesting begins. It must be in the format `YYYY-MM-DDTHH:mm:ssZ`.
- `end_date` (optional) determines the date on which the harvesting ends. It must be in the format `YYYY-MM-DDTHH:mm:ssZ`.
- `page_timeout` (optional, integer, defaults to 2) determines the maximum number of pages that will be harvested during each job. If a query returns 25 pages, only the first 5 will be harvested if you're using the default. Each page corresponds to 100 products. This is useful for running the harvester via recurring jobs intended to harvest products incrementally (i.e., you want to start from the beginning and harvest all available products). The harvester will harvest products in groups of 500, rather than attempting to harvest all x-hundred-thousand at once. You'll get feedback after each job, so you'll know if there are errors without waiting for the whole job to run. And the harvester will automatically resume from the most recently harvested dataset if you're running it via a recurring cron job.
- `update_all` (optional, boolean, default is `false`) determines whether or not the harvester updates datasets that already have metadata from this source. For example: if we have "update_all": true, and dataset Foo has already been created or updated by harvesting, then it will be updated again when the harvester runs. If we have "update_all": false and Foo has already been created or updated by harvesting, then the dataset will not be updated when the harvester runs. And regardless of whether update_all is true or false, if a dataset has not been collected yet, it will be created in the catalogue.
- `make_private` (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
{
"start_date":"2017-01-01T00:00:00Z",
"end_date":"2020-08-01T23:59:00Z",
"page_timeout": "2"
}
- Add `noa_geobservatory` to the list of plugins in your .ini file.
- Create a new harvester via the harvester interface.
- Select `NOA GeObservatory Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
- Run the harvester. It will programmatically create datasets.
The Energy Data harvester harvests products from the Energy Data API.
The Energy Data harvester has the following configuration options:
- `start_date`: (required, datetime string) determines the start of the date range for harvester queries. The configured value is used when the harvester is new; if it has been run before, harvesting starts from the ingestion date of the most recently harvested product. Example: `"start_date": "2019-10-01T00:00:00Z"`. Note that the entire datetime string is required; `2019-10-01` is not valid.
- `end_date`: (optional, datetime string, default is "NOW") determines the end of the date range for harvester queries. Example: `"end_date": "2020-01-01T00:00:00Z"`. Note that the entire datetime string is required; `2020-01-01` is not valid.
- `datasets_per_job`: (optional, integer, defaults to 100) determines the maximum number of products that will be harvested during each job.
- `timeout`: (optional, integer, defaults to 10) determines the number of seconds to wait before timing out a request.
- `make_private`: (optional) determines whether the datasets created by the harvester will be private or public. The default is `false`, i.e., by default, all datasets created by the harvester will be public.
```json
{
    "start_date": "2017-10-01T00:00:00Z",
    "datasets_per_job": 100
}
```
- Add `energydata` to the list of plugins in your `.ini` file.
- Create a new harvester via the harvester interface.
- Select `Energy Data Harvester` from the list of harvesters.
- Add a config as described above.
- Select `Manual` from the frequency options.
The basic harvester workflow is divided into three stages. Each stage has a related method, and each method must be included in the harvester plugin.
The three methods are:
- `gather_stage()`
- `fetch_stage()`
- `import_stage()`
While the `fetch_stage()` method must be included, it may be the case that the harvester does not require a fetch stage (for instance, if the source is an OpenSearch service, the search results in the gather stage may already include the necessary content, so there's no need to fetch it again). In those cases, the `fetch_stage()` method will still be implemented, but it will just return `True`. The `gather_stage()` and `import_stage()` methods, however, will always include some amount of code, as they will always be used.
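To make the division of labour concrete, here is a minimal, generic sketch built directly on `HarvesterBase` from `ckanext-harvest`; the OpenSearchExample skeleton described below shows how to do this with the libraries in this repository. The class name and the two helper methods (`_query_source` and `_parse_content`) are illustrative placeholders, not part of this extension:

```python
from ckanext.harvest.harvesters.base import HarvesterBase
from ckanext.harvest.model import HarvestObject


class ExampleHarvester(HarvesterBase):
    """Illustrative skeleton of the three-stage workflow."""

    def info(self):
        return {
            'name': 'example',
            'title': 'Example Harvester',
            'description': 'Sketch of the gather/fetch/import stages.'
        }

    def gather_stage(self, harvest_job):
        # Build the list of datasets to create or update and return the IDs
        # of the harvest objects created for them.
        ids = []
        for entry in self._query_source(harvest_job.source.url):  # placeholder helper
            obj = HarvestObject(guid=entry['identifier'], job=harvest_job,
                                content=entry['raw_content'])
            obj.save()
            ids.append(obj.id)
        return ids

    def fetch_stage(self, harvest_object):
        # The content was already stored during the gather stage (the
        # OpenSearch case described above), so there is nothing to fetch.
        return True

    def import_stage(self, harvest_object):
        # Parse .content and create or update the dataset.
        package_dict = self._parse_content(harvest_object.content)  # placeholder helper
        return self._create_or_update_package(package_dict, harvest_object)
```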
To simplify things, the gather stage is used to create a list of datasets that will be created or updated in the final import stage. That's really all it's for. It is not meant for parsing content into dictionaries for creating or updating datasets (that occurs in the import stage). It also isn't meant for acquiring or storing raw content that will be parsed later (that occurs in the fetch stage)—with certain exceptions, like OpenSearch services, where the content is already provided in the initial search results.
The gather_stage()
method returns a list
of harvest object IDs, which the harvester will use for the next two stages. The IDs are generated by creating harvest objects for each dataset that should be created or updated. If the necessary content is already provided, it can be stored in the harvest object's .content
attribute as a str
. You can also create harvest object extras--ad hoc harvest object attributes--to store information like the status of the dataset (e.g., new or change), or to keep track of other information about the harvest object or the dataset that will be created/updated. However, the harvest object extras are not intended to store things like the key/value pairs that will later be used to create the package dictionary for creating/updating the dataset. 1) The gather stage is not the time to perform such parsing and 2) since the raw content can be saved in the .content
attribute, it is easier to just skip the intermediate step and create the package dictionary in the import stage.
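For instance, a gather stage might create one harvest object per dataset and record the dataset's status in a harvest object extra, roughly like this (the `identifier` and `raw_entry` variables are placeholders):

```python
from ckanext.harvest.model import HarvestObject, HarvestObjectExtra

# Inside gather_stage(): one harvest object per dataset to create or update.
extras = [HarvestObjectExtra(key='status', value='new')]
obj = HarvestObject(guid=identifier,    # unique ID of the dataset at the source
                    job=harvest_job,    # the current harvest job
                    content=raw_entry,  # raw content as a str, if already available
                    extras=extras)
obj.save()
```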
The gather stage may proceed quickly because it does not require querying the source for each individual dataset. The goal is not to acquire the content in this stage—just to get a list of the datasets for which content is required. If individual source queries are necessary, they will be performed in the fetch stage.
During the gather stage, the gather_stage()
method will be called once.
During the fetch stage, the fetch_stage()
method will be called for each harvest object/dataset in the list created during the gather stage.
The purpose of the fetch stage is to get the content necessary for creating or updating the dataset in the import stage. The raw content can be stored as a str
in the harvest object's .content
attribute.
As in the gather stage, the harvest object extras should only be used to store information about the harvest object.
The fetch stage is the time to make individual queries to the source. If that's not necessary (e.g., the source is an OpenSearch service), then fetch_stage()
should just return True
.
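Where individual queries are needed, a fetch stage might look roughly like this; the URL pattern, the timeout, and the `source_base_url` attribute are placeholders, not part of this repository:

```python
import requests


def fetch_stage(self, harvest_object):
    # Method of the harvester class (shown standalone for brevity).
    # Query the source for this specific dataset and store the raw content.
    url = '{}/products/{}'.format(self.source_base_url, harvest_object.guid)
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        self._save_object_error('Error fetching {}: {}'.format(url, e),
                                harvest_object, 'Fetch')
        return False
    harvest_object.content = response.text
    harvest_object.save()
    return True
```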
During the import stage, the import_stage()
method will be called for each harvest object/dataset in the list created during the gather stage except for those that raised exceptions during the fetch stage. In other words, the import_stage()
method is called for every harvest object/dataset that has .content
.
The purpose of the import stage is to parse the content and use it, as well as any additional context or information provided by the harvest object extras, to create or update a dataset.
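A minimal import stage, assuming for the sake of illustration that the content is a JSON string, might look like this (`_gen_new_name` and `_create_or_update_package` come from `HarvesterBase`; the field names are placeholders):

```python
import json
import uuid


def import_stage(self, harvest_object):
    # Method of the harvester class (shown standalone for brevity).
    # Parse the raw content stored during the gather or fetch stage.
    parsed = json.loads(harvest_object.content)

    package_dict = {
        'id': str(uuid.uuid4()),
        'name': self._gen_new_name(parsed['title']),
        'title': parsed['title'],
        'notes': parsed.get('summary', ''),
    }

    # HarvesterBase handles creating or updating the package and linking it
    # to the harvest object.
    return self._create_or_update_package(package_dict, harvest_object)
```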
See the OpenSearchExample harvester skeleton for an example of how to use the libraries in this repository to build an OpenSearch-based harvester. There are detailed comments in the code, which can be copied as the starting point of a new harvester. If your harvester will not use an OpenSearch source, you'll also need to modify the gather_stage
and possibly the fetch_stage
methods, but the import_stage
will remain the same.
The iTag "harvester (ITageEnricher
) is better described as a metaharvester. It uses the harvester infrastructure to add new tags and metadata to existing datasets. It is completely separate from the other harvesters, meaning: if you want to harvest Sentinel products, you'll use one of the Sentinel harvesters. If you want to enrich Sentinel datasets, you'll use an instance of ITagEnricher
. But you'll use them separately, and they won't interact with eachother at all.
During the gather stage, it queries the CKAN instance itself to get a list of existing datasets that 1) have the `spatial` extra and 2) have not yet been updated by the `ITagEnricher`. Based on this list, it then creates harvest objects. This stage might be described as self-harvesting.
During the fetch stage, it queries an iTag instance using the coordinates from each dataset's spatial
extra and then stores the response from iTag as .content
, which will be used in the import stage. As long as iTag returns a valid response, the dataset moves on to the import stage—in other words, all that matters is that the query succeeded, not whether iTag was able to find tags for a particular footprint. See below for an explanation.
During the import stage, it parses the iTag response to extract any additional tags and/or metadata. Regardless of whether any additional tags or metadata are found, the extra itag: tagged
will be added to the dataset. This extra is used in the gather stage to filter out datasets for which successful iTag queries have been made.
To set it up, create a new harvester source (we'll call ours "iTag Enricher" for the sake of example). Select manual
for the update frequency. Select an organization (currently required—the metaharvester will only act on datasets that belong to that organization).
There are three configuration options:
- `base_url`: (required, string) determines the base URL to use when querying your iTag instance.
- `timeout`: (integer, defaults to 5) determines the number of seconds before a request times out.
- `datasets_per_job`: (integer, defaults to 10) determines the maximum number of datasets per job.
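An example config using the defaults might look like this (the base URL is a placeholder for your own iTag instance):

```json
{
    "base_url": "https://itag.example.com",
    "timeout": 5,
    "datasets_per_job": 10
}
```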
Once you've created the harvester source, create the cron job below, using the name or ID of the source you just created:
```
* * * * * paster --plugin=ckanext-harvest harvester job {name or id of harvest source} -c {path to CKAN config}
```
The cron job will continually attempt to create a new harvest job. If there already is a running job for the source, the attempt will simply fail (this is the intended behaviour). If there is no running job, then a new job will be created, which will then be run by the harvester run
cron job that you should already have set up. The metaharvester will then make a list of all the datasets that should be enriched with iTag, but which have not yet been enriched, and then try to enrich them.
If a query to iTag fails, 1) it will be reported in the error report for the respective job and 2) the metaharvester will automatically try to enrich that dataset the next time it runs. No additional logs or tracking are required--as long as a dataset hasn't been tagged, and should be tagged, it will be added to the list each time a job is created. Once a dataset has been tagged (or it has been determined that there are no tags that can be added to it), it will no longer appear on the list of datasets that should be tagged.
Currently, ITagEnricher only creates a list of max. 1,000 datasets for each job. This limit is intended to speed up the rate at which jobs are completed (and feedback on performance is available). Since a new job will be created as soon as the current one is marked Finished
, this behaviour does not slow down the pace of tagging.
Sentinel-3 datasets have complex polygons that seem to cause iTag to time out more often than it does when processing requests related to other datasets, so Sentinel-3 datasets are currently filtered out of the list of datasets that need to be tagged.
In general, requests to iTag seem to time out rather often, so it may be necessary to experiment with rate limiting. It may also be necessary to set up a more robust infrastructure for the iTag instance.
All harvesters should have tests that actually run the harvester, from start to finish, more than once. Such tests verify that the harvester will work as intended in production. The requests_mock
library allows us to easily mock the content returned by real requests to real URLs, so we can save the XML returned by OpenSearch interfaces, etc. and re-use it when testing. We can then write tests that verify 1) that the harvester starts, runs, finishes, and runs again (e.g., there are no errors that cause it to hang), 2) that it behaves as expected (e.g., it only updates datasets when a specific flag is set, or it restarts from a specific date following a failed request), and 3) that the datasets it creates or updates have exactly the metadata that we want them to have.
See TestESAHarvester().test_harvester()
for an example of how to run a harvester in a testing environment with mocked requests that return real XML.
The test itself needs to be refined. Some of the blocks should be helper functions or fixtures. But the method itself contains all the necessary components of a full test of harvester functionality: create a harvester with a given config, run it to completion under different conditions, and verify that the results are as expected.
The same structure can be used for our other harvesters (with different mocked requests, of course, and with different expected results).
Using the same structure, we can also add tests that verify that the metadata of the datasets that are created also match the expected/intended results.
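As a minimal illustration of the mocking approach (the URL, query, and fixture file name below are placeholders, not the actual test fixtures in this repository):

```python
import requests
import requests_mock

# XML previously saved from a real OpenSearch response.
with open('tests/data/scihub_response.xml') as f:
    saved_xml = f.read()

with requests_mock.Mocker() as m:
    # Any GET to this host/path returns the saved XML, regardless of the
    # query string, so the code under test never hits the real service.
    m.get('https://scihub.copernicus.eu/dhus/search', text=saved_xml)

    response = requests.get('https://scihub.copernicus.eu/dhus/search?q=*&rows=100')
    assert response.status_code == 200
```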
```
* * * * * paster --plugin=ckanext-harvest harvester run -c /srv/app/production.ini >> /var/log/cron.log 2>&1
* * * * * paster --plugin=ckanext-harvest harvester job itag-sentinel -c /srv/app/production.ini >> /var/log/cron.log 2>&1
* * * * * paster --plugin=ckanext-harvest harvester job code-de-sentinel -c /srv/app/production.ini >> /var/log/cron.log 2>&1
* * * * * paster --plugin=ckanext-harvest harvester job noa-sentinel -c /srv/app/production.ini >> /var/log/cron.log 2>&1
* * * * * paster --plugin=ckanext-harvest harvester job scihub-sentinel -c /srv/app/production.ini >> /var/log/cron.log 2>&1
```
Both the ESA harvester and the iTag metadata harvester can optionally log the status codes and response times of the sources or services that they query. If you want to log the response times and status codes of requests to harvest sources and/or your iTag service, you must include ckanext.nextgeossharvest.provider_log_dir=/path/to/your/logs
in your .ini
file. The log entries will look like this: INFO | esa_scihub | 2018-03-08 14:17:04.474262 | 200 | 2.885231s
(the second field will always be 12 characters and will be padded if necessary).
The data provider log file is called `dataproviders_info.log`. The iTag service provider log is called `itag_uptime.log`.