Tasks are the basic unit to be scheduled.
- A task is identified by its `taskid`, which is `md5(url)` by default.
- Tasks are isolated between different projects.
- A task has one of four statuses:
    - active
    - failed
    - success
    - bad (not used)
- Only tasks in the active status are scheduled. Tasks are scheduled by `exetime` and `priority`.
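
As a rough illustration of the default `taskid`, the sketch below hashes the URL with MD5 and pairs it with the project name to reflect that tasks are isolated per project. The helper names `default_taskid` and `project_key` are hypothetical and only for illustration; they are not taken from the scheduler's code, and the UTF-8 encoding of the URL is an assumption.

```python
import hashlib


def default_taskid(url: str) -> str:
    """Hypothetical helper: hex MD5 digest of the URL (UTF-8 encoding assumed)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def project_key(project: str, url: str) -> tuple:
    """Hypothetical helper: the same URL in two projects gives two distinct keys."""
    return (project, default_taskid(url))
```

With this scheme the same URL always maps to the same `taskid`, which is what lets the scheduler recognize a task it has already seen.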

When a new task comes:
- It is put into the queue and sorted by `exetime` and `priority`.
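
The queue ordering can be sketched with a heap. This is a minimal sketch, not the scheduler's actual queue: the class `TaskQueue` and its methods are hypothetical, and it assumes that an earlier `exetime` is served first and that, among tasks with the same `exetime`, a higher `priority` is served first. The text above does not state either ordering rule.

```python
import heapq
import itertools


class TaskQueue:
    """Hypothetical queue ordered by exetime, then by priority (assumed: higher first)."""

    def __init__(self):
        self._heap = []
        # Insertion counter breaks ties so taskids are never compared directly.
        self._counter = itertools.count()

    def put(self, taskid, exetime=0.0, priority=0):
        # Negate priority because heapq is a min-heap.
        heapq.heappush(self._heap, (exetime, -priority, next(self._counter), taskid))

    def pop_ready(self, now):
        """Return the next taskid whose exetime has arrived, or None if none is due."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[3]
        return None
```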

When a crawled task comes:
- If it is in the `active` status (in the queue), it is ignored unless `force_update` is set.
- If it is a finished task (success or failed), it is re-crawled and rescheduled if `last_crawl_time + age < now` or its `itag` is not equal to its last value.
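
The rules for an already-seen task can be written as a single decision function. This is a sketch under stated assumptions: the function name `should_schedule` and the dictionary layout are hypothetical, and it assumes that `age`, `itag` and `force_update` arrive with the incoming task while `status`, `last_crawl_time` and the last `itag` come from the stored record.

```python
def should_schedule(stored: dict, incoming: dict, now: float) -> bool:
    """Hypothetical check: should a previously seen task be scheduled again?"""
    status = stored["status"]

    if status == "active":
        # Already in the queue: ignored unless force_update is set.
        return bool(incoming.get("force_update", False))

    if status in ("success", "failed"):
        age = incoming.get("age")
        # Re-crawl when the last crawl is older than the allowed age.
        if age is not None and stored["last_crawl_time"] + age < now:
            return True
        itag = incoming.get("itag")
        # Re-crawl when the itag differs from its last value.
        if itag is not None and itag != stored.get("itag"):
            return True
        return False

    # "bad" is not used, so it is never scheduled in this sketch.
    return False
```

For example, a task last crawled at `last_crawl_time = 100` and resubmitted with `age = 10` is accepted once `now` is greater than 110.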