Merge pull request crawlab-team#460 from crawlab-team/release
Release
tikazyq authored Jan 17, 2020
2 parents fb395d7 + ef95ae5 commit 8bf6d3f
Showing 230 changed files with 32,641 additions and 292 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -121,4 +121,5 @@ _book/
.idea
*.lock

backend/spiders
backend/spiders
spiders/*.zip
17 changes: 17 additions & 0 deletions CHANGELOG-zh.md
@@ -1,3 +1,20 @@
# 0.4.4 (2020-01-17)

### Features / Enhancements
- **Email Notification**. Allow users to send email notifications.
- **DingTalk Robot Notification**. Allow users to send DingTalk robot notifications.
- **WeChat Work Robot Notification**. Allow users to send WeChat Work robot notifications.
- **API Address Optimization**. Added a relative path in the frontend so that users no longer need to specify `CRAWLAB_API_ADDRESS` explicitly.
- **SDK Compatibility**. Allow users to integrate Scrapy or general spiders with Crawlab via the Crawlab SDK.
- **Enhanced File Management**. Added a tree-like file sidebar so that users can edit files more easily.
- **Advanced Cron Schedules**. Allow users to edit scheduled tasks with a visual cron editor.

### Bug Fixes
- **`nil returned` error**.
- **Error when using HTTPS**.
- **Unable to run configurable spiders from the spider list page**.
- **Missing form validation when uploading spider files**.

# 0.4.3 (2020-01-07)

### Features / Enhancements
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,19 @@
# 0.4.4 (2020-01-17)
### Features / Enhancement
- **Email Notification**. Allow users to send email notifications.
- **DingTalk Robot Notification**. Allow users to send DingTalk Robot notifications.
- **Wechat Robot Notification**. Allow users to send Wechat Robot notifications.
- **API Address Optimization**. Added relative URL path in the frontend so that users don't have to specify `CRAWLAB_API_ADDRESS` explicitly.
- **SDK Compatibility**. Allow users to integrate Scrapy or general spiders with the Crawlab SDK.
- **Enhanced File Management**. Added a tree-like file sidebar to allow users to edit files much more easily.
- **Advanced Schedule Cron**. Allow users to edit schedule cron with a visualized cron editor.

### Bug Fixes
- **`nil returned` error**.
- **Error when using HTTPS**.
- **Unable to run Configurable Spiders on Spider List**.
- **Missing form validation before uploading spider files**.

# 0.4.3 (2020-01-07)

### Features / Enhancement
68 changes: 39 additions & 29 deletions README-zh.md
@@ -1,10 +1,10 @@
# Crawlab

<p>
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<img src="https://img.shields.io/docker/cloud/build/tikazyq/crawlab.svg?label=build&logo=docker">
</a>
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<img src="https://img.shields.io/docker/pulls/tikazyq/crawlab?label=pulls&logo=docker">
</a>
<a href="https://github.com/crawlab-team/crawlab/releases" target="_blank">
@@ -138,22 +138,26 @@ Docker部署的详情,请见[相关文档](https://tikazyq.github.io/crawlab-d

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-analytics.png)

#### Spider Files
#### Spider File Edit

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-file.png)
![](http://static-docs.crawlab.cn/file-edit.png)

#### Task Detail - Scraped Results

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/task-results.png)

#### Scheduled Tasks

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/schedule.png)
![](http://static-docs.crawlab.cn/schedule-v0.4.4.png)

#### Dependency Installation

![](http://static-docs.crawlab.cn/node-install-dependencies.png)

#### Notifications

<img src="http://static-docs.crawlab.cn/notification-mobile.jpeg" height="480px">

## Architecture

The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and Redis and MongoDB databases, which handle node communication and data storage.
@@ -193,37 +197,43 @@ Redis是非常受欢迎的Key-Value数据库,在Crawlab中主要实现节点

## Integration with Other Frameworks

A crawling task is essentially executed as a shell command. The task ID is passed to the spider process through the environment variable `CRAWLAB_TASK_ID` and is used to associate scraped data with the task. In addition, `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection in which results are stored.
The [Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides `helper` methods that make it easier to integrate your spiders with Crawlab, e.g. saving result data to Crawlab.

In your spider program, save the value of `CRAWLAB_TASK_ID` into the `task_id` field of the records stored in the `CRAWLAB_COLLECTION` collection. This is how Crawlab associates a crawling task with its scraped data. Currently, Crawlab only supports MongoDB.
### Scrapy Integration

In `settings.py`, find the `ITEM_PIPELINES` variable (a `dict`) and add the following content to it.

### Scrapy Integration
```python
ITEM_PIPELINES = {
'crawlab.pipelines.CrawlabMongoPipeline': 888,
}
```

Below is an example of integrating Crawlab with Scrapy, using the task_id and collection_name passed by Crawlab.
Then start the Scrapy spider. After it finishes, you should be able to see the scraped results under **Task Detail - Results**.

### General Python Spider

Add the following code to the part of your spider that saves results.

```python
import os

from pymongo import MongoClient

MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'


# scrapy example in the pipeline
class JuejinPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        self.col.save(item)
        return item
```

```python
# import the result-saving method
from crawlab import save_item

# a result record, which must be a dict
result = {'name': 'crawlab'}

# call the result-saving method
save_item(result)
```

Then start the spider. After it finishes, you should be able to see the scraped results under **Task Detail - Results**.

### Other Frameworks / Languages

A crawling task is essentially executed as a shell command. The task ID is passed to the spider process through the environment variable `CRAWLAB_TASK_ID` and is used to associate scraped data with the task. In addition, `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection in which results are stored.

In your spider program, save the value of `CRAWLAB_TASK_ID` into the `task_id` field of the records stored in the `CRAWLAB_COLLECTION` collection. This is how Crawlab associates a crawling task with its scraped data. Currently, Crawlab only supports MongoDB.

## Comparison with Other Frameworks

There are already some spider management frameworks out there, so why use Crawlab?
66 changes: 39 additions & 27 deletions README.md
@@ -1,10 +1,10 @@
# Crawlab

<p>
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<img src="https://img.shields.io/docker/cloud/build/tikazyq/crawlab.svg?label=build&logo=docker">
</a>
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<img src="https://img.shields.io/docker/pulls/tikazyq/crawlab?label=pulls&logo=docker">
</a>
<a href="https://github.com/crawlab-team/crawlab/releases" target="_blank">
@@ -136,22 +136,26 @@ For Docker Deployment details, please refer to [relevant documentation](https://

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-analytics.png)

#### Spider Files
#### Spider File Edit

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-file.png)
![](http://static-docs.crawlab.cn/file-edit.png)

#### Task Results

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/task-results.png)

#### Cron Job

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/schedule.png)
![](http://static-docs.crawlab.cn/schedule-v0.4.4.png)

#### Dependency Installation

![](http://static-docs.crawlab.cn/node-install-dependencies.png)

#### Notifications

<img src="http://static-docs.crawlab.cn/notification-mobile.jpeg" height="480px">

## Architecture

The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and Redis and MongoDB databases, which are mainly used for node communication and data storage.
@@ -192,35 +196,43 @@ Frontend is a SPA based on

## Integration with Other Frameworks

A crawling task is actually executed through a shell command. The Task ID will be passed to the crawling task process in the form of environment variable named `CRAWLAB_TASK_ID`. By doing so, the data can be related to a task. Also, another environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection to store results data.
[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides some `helper` methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

⚠️Note: make sure you have already installed `crawlab-sdk` using pip.

### Scrapy

Below is an example of integrating Crawlab with Scrapy via pipelines.
In `settings.py` of your Scrapy project, find the variable named `ITEM_PIPELINES` (a `dict` variable) and add the following content to it.

```python
ITEM_PIPELINES = {
'crawlab.pipelines.CrawlabMongoPipeline': 888,
}
```
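
For illustration, here is a minimal sketch of a Scrapy spider whose items would flow through the pipeline above; the spider name, target site, and CSS selectors are hypothetical placeholders, not part of Crawlab.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # hypothetical example spider; any spider in your project works the same way
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # each yielded dict passes through CrawlabMongoPipeline,
            # which relates it to the current task and stores it as a result
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
```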

Then, start the Scrapy spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**

### General Python Spider

Please add below content to your spider files to save results.

```python
import os

from pymongo import MongoClient

MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'


# scrapy example in the pipeline
class JuejinPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        self.col.save(item)
        return item
```

```python
# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)
```

Then, start the spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**

### Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID will be passed to the crawling task process in the form of environment variable named `CRAWLAB_TASK_ID`. By doing so, the data can be related to a task. Also, another environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection to store results data.
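
As a rough sketch (not an official Crawlab API), a spider in any language can read these environment variables and write results itself; the Python version below assumes pymongo and a reachable MongoDB instance, and the host, port, and database name are placeholders for your own deployment.

```python
import os

from pymongo import MongoClient

# assumed connection details; point these at the MongoDB instance used by Crawlab
client = MongoClient(host='localhost', port=27017)
db = client['crawlab_test']

# Crawlab passes the target collection name and the task id as environment variables
col = db[os.environ.get('CRAWLAB_COLLECTION', 'results')]
task_id = os.environ.get('CRAWLAB_TASK_ID')


def save_result(item):
    # tag each record with the task id so Crawlab can relate it to the task
    item['task_id'] = task_id
    col.insert_one(item)


save_result({'title': 'example item'})
```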

## Comparison with Other Frameworks

There are existing spider management frameworks. So why use Crawlab?
13 changes: 11 additions & 2 deletions backend/conf/config.yml
@@ -35,6 +35,15 @@ task:
  workers: 4
other:
  tmppath: "/tmp"
version: 0.4.3
version: 0.4.4
setting:
  allowRegister: "N"
  allowRegister: "N"
notification:
  mail:
    server: ''
    port: ''
    senderEmail: ''
    senderIdentity: ''
    smtp:
      user: ''
      password: ''
13 changes: 13 additions & 0 deletions backend/constants/notification.go
@@ -0,0 +1,13 @@
package constants

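// Notification triggers determine when a notification is sent for a task.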
const (
    NotificationTriggerOnTaskEnd   = "notification_trigger_on_task_end"
    NotificationTriggerOnTaskError = "notification_trigger_on_task_error"
    NotificationTriggerNever       = "notification_trigger_never"
)

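// Notification types identify the channel used to send a notification.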
const (
    NotificationTypeMail     = "notification_type_mail"
    NotificationTypeDingTalk = "notification_type_ding_talk"
    NotificationTypeWechat   = "notification_type_wechat"
)
6 changes: 6 additions & 0 deletions backend/go.mod
@@ -13,10 +13,16 @@ require (
github.com/gomodule/redigo v2.0.0+incompatible
github.com/imroc/req v0.2.4
github.com/leodido/go-urn v1.1.0 // indirect
github.com/matcornic/hermes v1.2.0
github.com/matcornic/hermes/v2 v2.0.2 // indirect
github.com/pkg/errors v0.8.1
github.com/royeo/dingrobot v1.0.0
github.com/satori/go.uuid v1.2.0
github.com/smartystreets/goconvey v0.0.0-20190731233626-505e41936337
github.com/spf13/viper v1.4.0
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc // indirect
gopkg.in/go-playground/validator.v9 v9.29.1
gopkg.in/gomail.v2 v2.0.0-20150902115704-41f357289737
gopkg.in/russross/blackfriday.v2 v2.0.0 // indirect
gopkg.in/yaml.v2 v2.2.2
)