Merge pull request crawlab-team#460 from crawlab-team/release
Release
tikazyq authored Jan 17, 2020
2 parents fb395d7 + ef95ae5 commit 8bf6d3f
Showing 230 changed files with 32,641 additions and 292 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -121,4 +121,5 @@ _book/
.idea
*.lock

backend/spiders
backend/spiders
spiders/*.zip
17 changes: 17 additions & 0 deletions CHANGELOG-zh.md
@@ -1,3 +1,20 @@
# 0.4.4 (2020-01-17)

### Features / Enhancements
- **Email Notification**. Allow users to send email notifications.
- **DingTalk Robot Notification**. Allow users to send DingTalk robot notifications.
- **WeChat Work Robot Notification**. Allow users to send WeChat Work robot notifications.
- **API Address Optimization**. Added a relative path in the frontend so that users no longer need to specify `CRAWLAB_API_ADDRESS` explicitly.
- **SDK Compatibility**. Allow users to integrate Scrapy or general spiders with Crawlab via the Crawlab SDK.
- **Enhanced File Management**. Added a tree-like file sidebar so that users can edit files more easily.
- **Advanced Cron Schedules**. Allow users to edit scheduled tasks with a visual cron editor.

### Bug Fixes
- **`nil returned` error**.
- **Error when using HTTPS**.
- **Unable to run configurable spiders from the spider list page**.
- **Missing form validation when uploading spider files**.

# 0.4.3 (2020-01-07)

### Features / Enhancements
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,19 @@
# 0.4.4 (2020-01-17)
### Features / Enhancement
- **Email Notification**. Allow users to send email notifications.
- **DingTalk Robot Notification**. Allow users to send DingTalk Robot notifications.
- **Wechat Robot Notification**. Allow users to send Wechat Robot notifications.
- **API Address Optimization**. Added relative URL path in the frontend so that users don't have to specify `CRAWLAB_API_ADDRESS` explicitly.
- **SDK Compatibility**. Allow users to integrate Scrapy or general spiders with the Crawlab SDK.
- **Enhanced File Management**. Added a tree-like file sidebar to allow users to edit files much more easily.
- **Advanced Schedule Cron**. Allow users to edit schedule cron with a visualized cron editor.

### Bug Fixes
- **`nil returned` error**.
- **Error when using HTTPS**.
- **Unable to run Configurable Spiders on Spider List**.
- **Missing form validation before uploading spider files**.

# 0.4.3 (2020-01-07)

### Features / Enhancement
68 changes: 39 additions & 29 deletions README-zh.md
@@ -1,10 +1,10 @@
# Crawlab

<p>
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<img src="https://img.shields.io/docker/cloud/build/tikazyq/crawlab.svg?label=build&logo=docker">
</a>
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<img src="https://img.shields.io/docker/pulls/tikazyq/crawlab?label=pulls&logo=docker">
</a>
<a href="https://github.com/crawlab-team/crawlab/releases" target="_blank">
@@ -138,22 +138,26 @@ Docker部署的详情,请见[相关文档](https://tikazyq.github.io/crawlab-d

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-analytics.png)

#### Spider Files
#### Spider File Edit

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-file.png)
![](http://static-docs.crawlab.cn/file-edit.png)

#### Task Detail - Scraped Results

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/task-results.png)

#### Scheduled Tasks

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/schedule.png)
![](http://static-docs.crawlab.cn/schedule-v0.4.4.png)

#### Dependency Installation

![](http://static-docs.crawlab.cn/node-install-dependencies.png)

#### Notifications

<img src="http://static-docs.crawlab.cn/notification-mobile.jpeg" height="480px">

## Architecture

The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and Redis and MongoDB databases, which handle node communication and data storage.
@@ -193,37 +197,43 @@ Redis是非常受欢迎的Key-Value数据库,在Crawlab中主要实现节点

## Integration with Other Frameworks

A crawling task is essentially executed as a shell command. The task ID is passed to the spider process through the environment variable `CRAWLAB_TASK_ID` and is used to associate scraped data with the task. In addition, `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection in which results are stored.
The [Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides `helper` methods that make it easier to integrate your spiders with Crawlab, e.g. saving result data to Crawlab.

In your spider program, save the value of `CRAWLAB_TASK_ID` into the `task_id` field of the records stored in the `CRAWLAB_COLLECTION` collection. This is how Crawlab associates a crawling task with its scraped data. Currently, Crawlab only supports MongoDB.
### Scrapy Integration

In `settings.py`, find the `ITEM_PIPELINES` variable (a `dict`) and add the following content to it.

### Scrapy Integration
```python
ITEM_PIPELINES = {
'crawlab.pipelines.CrawlabMongoPipeline': 888,
}
```

Below is an example of integrating Crawlab with Scrapy, using the task_id and collection_name passed by Crawlab.
Then start the Scrapy spider. After it finishes, you should be able to see the scraped results under **Task Detail - Results**.

### General Python Spider

Add the following code to the part of your spider that saves results.

```python
import os

from pymongo import MongoClient

MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'


# scrapy example in the pipeline
class JuejinPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        self.col.save(item)
        return item
```

```python
# import the result-saving method
from crawlab import save_item

# a result record, which must be a dict
result = {'name': 'crawlab'}

# call the result-saving method
save_item(result)
```

Then start the spider. After it finishes, you should be able to see the scraped results under **Task Detail - Results**.

### Other Frameworks / Languages

A crawling task is essentially executed as a shell command. The task ID is passed to the spider process through the environment variable `CRAWLAB_TASK_ID` and is used to associate scraped data with the task. In addition, `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection in which results are stored.

In your spider program, save the value of `CRAWLAB_TASK_ID` into the `task_id` field of the records stored in the `CRAWLAB_COLLECTION` collection. This is how Crawlab associates a crawling task with its scraped data. Currently, Crawlab only supports MongoDB.

## Comparison with Other Frameworks

There are already some spider management frameworks out there, so why use Crawlab?
66 changes: 39 additions & 27 deletions README.md
@@ -1,10 +1,10 @@
# Crawlab

<p>
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<img src="https://img.shields.io/docker/cloud/build/tikazyq/crawlab.svg?label=build&logo=docker">
</a>
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<img src="https://img.shields.io/docker/pulls/tikazyq/crawlab?label=pulls&logo=docker">
</a>
<a href="https://github.com/crawlab-team/crawlab/releases" target="_blank">
@@ -136,22 +136,26 @@ For Docker Deployment details, please refer to [relevant documentation](https://

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-analytics.png)

#### Spider Files
#### Spider File Edit

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-file.png)
![](http://static-docs.crawlab.cn/file-edit.png)

#### Task Results

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/task-results.png)

#### Cron Job

![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/schedule.png)
![](http://static-docs.crawlab.cn/schedule-v0.4.4.png)

#### Dependency Installation

![](http://static-docs.crawlab.cn/node-install-dependencies.png)

#### Notifications

<img src="http://static-docs.crawlab.cn/notification-mobile.jpeg" height="480px">

## Architecture

The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and Redis and MongoDB databases, which are mainly used for node communication and data storage.
@@ -192,35 +196,43 @@ Frontend is a SPA based on

## Integration with Other Frameworks

A crawling task is actually executed through a shell command. The Task ID will be passed to the crawling task process in the form of environment variable named `CRAWLAB_TASK_ID`. By doing so, the data can be related to a task. Also, another environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection to store results data.
[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides some `helper` methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

⚠️Note: make sure you have already installed `crawlab-sdk` using pip.

### Scrapy

Below is an example of integrating Crawlab with Scrapy via pipelines.
In `settings.py` of your Scrapy project, find the variable named `ITEM_PIPELINES` (a `dict` variable) and add the following content to it.

```python
ITEM_PIPELINES = {
'crawlab.pipelines.CrawlabMongoPipeline': 888,
}
```
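
For illustration, here is a minimal sketch of a Scrapy spider whose items would flow through the pipeline above; the spider name, target site, and CSS selectors are hypothetical placeholders, not part of Crawlab.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # hypothetical example spider; any spider in your project works the same way
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # each yielded dict passes through CrawlabMongoPipeline,
            # which relates it to the current task and stores it as a result
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
```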

Then, start the Scrapy spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**

### General Python Spider

Please add below content to your spider files to save results.

```python
import os

from pymongo import MongoClient

MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'


# scrapy example in the pipeline
class JuejinPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        self.col.save(item)
        return item
```

```python
# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)
```

Then, start the spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**

### Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID will be passed to the crawling task process in the form of environment variable named `CRAWLAB_TASK_ID`. By doing so, the data can be related to a task. Also, another environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection to store results data.
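
As a rough sketch (not an official Crawlab API), a spider in any language can read these environment variables and write results itself; the Python version below assumes pymongo and a reachable MongoDB instance, and the host, port, and database name are placeholders for your own deployment.

```python
import os

from pymongo import MongoClient

# assumed connection details; point these at the MongoDB instance used by Crawlab
client = MongoClient(host='localhost', port=27017)
db = client['crawlab_test']

# Crawlab passes the target collection name and the task id as environment variables
col = db[os.environ.get('CRAWLAB_COLLECTION', 'results')]
task_id = os.environ.get('CRAWLAB_TASK_ID')


def save_result(item):
    # tag each record with the task id so Crawlab can relate it to the task
    item['task_id'] = task_id
    col.insert_one(item)


save_result({'title': 'example item'})
```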

## Comparison with Other Frameworks

There are existing spider management frameworks. So why use Crawlab?
13 changes: 11 additions & 2 deletions backend/conf/config.yml
@@ -35,6 +35,15 @@ task:
  workers: 4
other:
  tmppath: "/tmp"
version: 0.4.3
version: 0.4.4
setting:
  allowRegister: "N"
  allowRegister: "N"
notification:
  mail:
    server: ''
    port: ''
    senderEmail: ''
    senderIdentity: ''
    smtp:
      user: ''
      password: ''
13 changes: 13 additions & 0 deletions backend/constants/notification.go
@@ -0,0 +1,13 @@
package constants

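// Notification triggers determine when a notification is sent for a task.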
const (
    NotificationTriggerOnTaskEnd   = "notification_trigger_on_task_end"
    NotificationTriggerOnTaskError = "notification_trigger_on_task_error"
    NotificationTriggerNever       = "notification_trigger_never"
)

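// Notification types identify the channel used to send a notification.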
const (
    NotificationTypeMail     = "notification_type_mail"
    NotificationTypeDingTalk = "notification_type_ding_talk"
    NotificationTypeWechat   = "notification_type_wechat"
)
6 changes: 6 additions & 0 deletions backend/go.mod
@@ -13,10 +13,16 @@ require (
github.com/gomodule/redigo v2.0.0+incompatible
github.com/imroc/req v0.2.4
github.com/leodido/go-urn v1.1.0 // indirect
github.com/matcornic/hermes v1.2.0
github.com/matcornic/hermes/v2 v2.0.2 // indirect
github.com/pkg/errors v0.8.1
github.com/royeo/dingrobot v1.0.0
github.com/satori/go.uuid v1.2.0
github.com/smartystreets/goconvey v0.0.0-20190731233626-505e41936337
github.com/spf13/viper v1.4.0
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc // indirect
gopkg.in/go-playground/validator.v9 v9.29.1
gopkg.in/gomail.v2 v2.0.0-20150902115704-41f357289737
gopkg.in/russross/blackfriday.v2 v2.0.0 // indirect
gopkg.in/yaml.v2 v2.2.2
)