diff --git a/README b/README new file mode 100755 index 000000000..e69de29bb diff --git a/README.md b/README.md index e6cde27b5..e5b5dd41f 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,101 @@ # DataX + DataX 是阿里巴巴集团内被广泛使用的离线数据同步工具/平台,实现包括 MySQL、Oracle、SqlServer、Postgre、HDFS、Hive、ADS、HBase、OTS、ODPS 等各种异构数据源之间高效的数据同步功能。 -代码近期会上传,敬请期待。 + + + +# Features + +DataX本身作为数据同步框架,将不同数据源的同步抽象为从源头数据源读取数据的Reader插件,以及向目标端写入数据的Writer插件,理论上DataX框架可以支持任意数据源类型的数据同步工作。同时DataX插件体系作为一套生态系统, 每接入一套新数据源该新加入的数据源即可实现和现有的数据源互通。 + + + +# DataX详细介绍 + +请参考: + +# System Requirements + +- Linux +- [JDK(1.6以上,推荐1.6) ](http://www.oracle.com/technetwork/cn/java/javase/downloads/index.html) +- [Python(推荐Python2.6.X) ](https://www.python.org/downloads/) +- [Apache Maven 3.x](https://maven.apache.org/download.cgi) (Compile DataX) + +# Quick Start + +请点击:[Quick Start](https://github.com/alibaba/DataX/wiki/Quick-Start) + +# Support Data Channels + +目前DataX支持的数据源有: + +### Reader + +> **Reader实现了从数据存储系统批量抽取数据,并转换为DataX标准数据交换协议,DataX任意Reader能与DataX任意Writer实现无缝对接,达到任意异构数据互通之目的。** + +**RDBMS 关系型数据库** + +- [MysqlReader](https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md): 使用JDBC批量抽取Mysql数据集。 +- [OracleReader](https://github.com/alibaba/DataX/blob/master/oraclereader/doc/oraclereader.md): 使用JDBC批量抽取Oracle数据集。 +- [SqlServerReader](https://github.com/alibaba/DataX/blob/master/sqlserverreader/doc/sqlserverreader.md): 使用JDBC批量抽取SqlServer数据集 +- [PostgresqlReader](https://github.com/alibaba/DataX/blob/master/postgresqlreader/doc/postgresqlreader.md): 使用JDBC批量抽取PostgreSQL数据集 +- [DrdsReader](https://github.com/alibaba/DataX/blob/master/drdsreader/doc/drdsreader.md): 针对公有云上DRDS的批量数据抽取工具。 + +**数仓数据存储** + +- [ODPSReader](https://github.com/alibaba/DataX/blob/master/odpsreader/doc/odpsreader.md): 使用ODPS Tunnel SDK批量抽取ODPS数据。 + +**NoSQL数据存储** + +- [OTSReader](https://github.com/alibaba/DataX/blob/master/otsreader/doc/otsreader.md): 针对公有云上OTS的批量数据抽取工具。 +- [HBaseReader](https://github.com/alibaba/DataX/blob/master/hbasereader/doc/hbasereader.md): 针对 HBase 0.94版本的在线数据抽取工具 +- [MongoDBReader](https://github.com/alibaba/DataX/blob/master/mongodbreader/doc/mongodbreader.md):MongoDBReader + +**无结构化数据存储** + +- [TxtFileReader](https://github.com/alibaba/DataX/blob/master/txtfilereader/doc/txtfilereader.md): 读取(递归/过滤)本地文件。 +- [FtpReader](https://github.com/alibaba/DataX/blob/master/ftpreader/doc/ftpreader.md): 读取(递归/过滤)远程ftp文件。 +- [HdfsReader](https://github.com/alibaba/DataX/blob/master/hdfsreader/doc/hdfsreader.md): 针对Hdfs文件系统中textfile和orcfile文件批量数据抽取工具。 +- [OssReader](https://github.com/alibaba/DataX/blob/master/ossreader/doc/ossreader.md): 针对公有云OSS产品的批量数据抽取工具。 +- StreamReader + +### Writer + +------ + +> **Writer实现了从DataX标准数据交换协议,翻译为具体的数据存储类型并写入目的数据存储。DataX任意Writer能与DataX任意Reader实现无缝对接,达到任意异构数据互通之目的。** + +------ + +**RDBMS 关系型数据库** + +- [MysqlWriter](https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md): 使用JDBC(Insert,Replace方式)写入Mysql数据库 +- [OracleWriter](https://github.com/alibaba/DataX/blob/master/oraclewriter/doc/oraclewriter.md): 使用JDBC(Insert方式)写入Oracle数据库 +- [PostgresqlWriter](https://github.com/alibaba/DataX/blob/master/postgresqlwriter/doc/postgresqlwriter.md): 使用JDBC(Insert方式)写入PostgreSQL数据库 +- [SqlServerWriter](https://github.com/alibaba/DataX/blob/master/sqlserverwriter/doc/sqlserverwriter.md): 使用JDBC(Insert方式)写入sqlserver数据库 +- [DrdsWriter](https://github.com/alibaba/DataX/blob/master/drdswriter/doc/drdswriter.md): 使用JDBC(Replace方式)写入Drds数据库 + +**数仓数据存储** + +- 
[ODPSWriter](https://github.com/alibaba/DataX/blob/master/odpswriter/doc/odpswriter.md): 使用ODPS Tunnel SDK向ODPS写入数据。 +- [ADSWriter](https://github.com/alibaba/DataX/blob/master/adswriter/doc/adswriter.md): 使用ODPS中转将数据导入ADS。 + +**NoSQL数据存储** + +- [OTSWriter](https://github.com/alibaba/DataX/blob/master/otswriter/doc/otswriter.md): 使用OTS SDK向OTS Public模型的表中导入数据。 +- [OCSWriter](https://github.com/alibaba/DataX/blob/master/ocswriter/doc/ocswriter.md) +- [MongoDBWriter](https://github.com/alibaba/DataX/blob/master/mongodbwriter/doc/mongodbwriter.md):MongoDBWriter + +**无结构化数据存储** + +- [TxtFileWriter](https://github.com/alibaba/DataX/blob/master/txtfilewriter/doc/txtfilewriter.md): 提供写入本地文件功能。 +- [OssWriter](https://github.com/alibaba/DataX/blob/master/osswriter/doc/osswriter.md): 使用OSS SDK写入OSS数据。 +- [HdfsWriter](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md): 提供向Hdfs文件系统中写入textfile文件和orcfile文件功能。 +- StreamWriter + + + +# Contact us + +请及时提出issue给我们。请前往:[DataxIssue](https://github.com/alibaba/DataX/issues) + diff --git a/adswriter/adswriter.iml b/adswriter/adswriter.iml new file mode 100644 index 000000000..7a7f2115e --- /dev/null +++ b/adswriter/adswriter.iml @@ -0,0 +1,72 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/adswriter/doc/adswriter.md b/adswriter/doc/adswriter.md new file mode 100644 index 000000000..bdfdf8ded --- /dev/null +++ b/adswriter/doc/adswriter.md @@ -0,0 +1,298 @@ +# DataX ADS写入 + + +--- + + +## 1 快速介绍 + +
+ +欢迎ADS加入DataX生态圈!ADSWriter插件实现了其他数据源向ADS写入功能,现有DataX所有的数据源均可以无缝接入ADS,实现数据快速导入ADS。 + +ADS写入预计支持两种实现方式: + +* ADSWriter 支持向ODPS中转落地导入ADS方式,优点在于当数据量较大时(>1KW),可以以较快速度进行导入,缺点引入了ODPS作为落地中转,因此牵涉三方系统(DataX、ADS、ODPS)鉴权认证。 + +* ADSWriter 同时支持向ADS直接写入的方式,优点在于小批量数据写入能够较快完成(<1KW),缺点在于大数据导入较慢。 + + +注意: + +> 如果从ODPS导入数据到ADS,请用户提前在源ODPS的Project中授权ADS Build账号具有读取你源表ODPS的权限,同时,ODPS源表创建人和ADS写入属于同一个阿里云账号。 + +- + +> 如果从非ODPS导入数据到ADS,请用户提前在目的端ADS空间授权ADS Build账号具备Load data权限。 + +以上涉及ADS Build账号请联系ADS管理员提供。 + + +## 2 实现原理 + +ADS写入预计支持两种实现方式: + +### 2.1 Load模式 + +DataX 将数据导入ADS为当前导入任务分配的ADS项目表,随后DataX通知ADS完成数据加载。该类数据导入方式实际上是写ADS完成数据同步,由于ADS是分布式存储集群,因此该通道吞吐量较大,可以支持TB级别数据导入。 + +![中转导入](http://aligitlab.oss-cn-hangzhou-zmf.aliyuncs.com/uploads/cdp/cdp/f805dea46b/_____2015-04-10___12.06.21.png) + +1. CDP底层得到明文的 jdbc://host:port/dbname + username + password + table, 以此连接ADS, 执行show grants; 前置检查该用户是否有ADS中目标表的Load Data或者更高的权限。注意,此时ADSWriter使用用户填写的ADS用户名+密码信息完成登录鉴权工作。 + +2. 检查通过后,通过ADS中目标表的元数据反向生成ODPS DDL,在ODPS中间project中,以ADSWriter的账户建立ODPS表(非分区表,生命周期设为1-2Day), 并调用ODPSWriter把数据源的数据写入该ODPS表中。 + + 注意,这里需要使用中转ODPS的账号AK向中转ODPS写入数据。 + +3. 写入完成后,以中转ODPS账号连接ADS,发起Load Data From ‘odps://中转project/中转table/' [overwrite] into adsdb.adstable [partition (xx,xx=xx)]; 这个命令返回一个Job ID需要记录。 + + 注意,此时ADS使用自己的Build账号访问中转ODPS,因此需要中转ODPS对这个Build账号提前开放读取权限。 + +4. 连接ADS一分钟一次轮询执行 select state from information_schema.job_instances where job_id like ‘$Job ID’,查询状态,注意这个第一个一分钟可能查不到状态记录。 + +5. Success或者Fail后返回给用户,然后删除中转ODPS表,任务结束。 + +上述流程是从其他非ODPS数据源导入ADS流程,对于ODPS导入ADS流程使用如下流程: + +![直接导入](http://aligitlab.oss-cn-hangzhou-zmf.aliyuncs.com/uploads/cdp/cdp/b3a76459d1/_____2015-04-10___12.06.25.png) + +### 2.2 Insert模式 + +DataX 将数据直连ADS接口,利用ADS暴露的INSERT接口直写到ADS。该类数据导入方式写入吞吐量较小,不适合大批量数据写入。有如下注意点: + +* ADSWriter使用JDBC连接直连ADS,并只使用了JDBC Statement进行数据插入。ADS不支持PreparedStatement,故ADSWriter只能单行多线程进行写入。 + +* ADSWriter支持筛选部分列,列换序等功能,即用户可以填写列。 + +* 考虑到ADS负载问题,建议ADSWriter Insert模式建议用户使用TPS限流,最高在1W TPS。 + +* ADSWriter在所有Task完成写入任务后,Job Post单例执行flush工作,保证数据在ADS整体更新。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到ADS,使用Load模式进行导入的数据。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [ + { + "value": "DataX", + "type": "string" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 100000 + } + }, + "writer": { + "name": "adswriter", + "parameter": { + "odps": { + "accessId": "xasdfkladslfjsaifw224ysgsa5", + "accessKey": "asfjkljfp0w4624twfswe56346212341adsfa3", + "account": "xxx@aliyun.com", + "odpsServer": "http://service.odpsstg.aliyun-inc.com/stgnew", + "tunnelServer": "http://tunnel.odpsstg.aliyun-inc.com", + "accountType": "aliyun", + "project": "transfer_project" + }, + "writeMode": "load", + "url": "127.0.0.1:3306", + "schema": "schema", + "table": "table", + "username": "username", + "password": "password", + "partition": "", + "lifeCycle": 2, + "overWrite": true, + } + } + } + ] + } +} +``` + +* 这里使用一份从内存产生到ADS,使用Insert模式进行导入的数据。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [ + { + "value": "DataX", + "type": "string" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 100000 + } + }, + "writer": { + "name": "adswriter", + "parameter": { + "writeMode": "insert", + "url": "127.0.0.1:3306", + "schema": "schema", + "table": "table", + "column": ["*"], + "username": "username", + "password": 
"password", + "partition": "id,ds=2015" + } + } + } + ] + } +} +``` + + + +### 3.2 参数说明 (用户配置规格) + +* **url** + + * 描述:ADS连接信息,格式为"ip:port"。 + + * 必选:是
+ + * 默认值:无
+ +* **schema** + + * 描述:ADS的schema名称。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:ADS对应的username,目前就是accessId
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:ADS对应的password,目前就是accessKey
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。 + + * 必选:是
+ + * 默认值:无
+ + +* **partition** + +	* 描述:目标表的分区名称,当目标表为分区表时,需要指定该字段。 + +	* 必选:否
+ + * 默认值:无
+ +* **writeMode** + + * 描述:支持Load和Insert两种写入模式 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表字段列表,可以为["*"],或者具体的字段列表,例如["a", "b", "c"] + + * 必选:是
+ + * 默认值:无
+ + +* **overWrite** + +	* 描述:ADS写入是否覆盖当前写入的表,true为覆盖写入,false为不覆盖(追加)写入。当writeMode为Load时,该值才会生效。 + +	* 必选:是
+ + * 默认值:无
+ + +* **lifeCycle** + + * 描述:ADS 临时表生命周期。当writeMode为Load时,该值才会生效。 + + * 必选:是
+ + * 默认值:无
+ + +### 3.3 类型转换 + +| DataX 内部类型| ADS 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, int, bigint| +| Double |float, double, decimal| +| String |varchar | +| Date |date | +| Boolean |bool | +| Bytes |无 | + + 注意: + +* multivalue ADS支持multivalue类型,DataX对于该类型支持待定? + + +## 4 插件约束 + +如果Reader为ODPS,且ADSWriter写入模式为Load模式时,ODPS的partition只支持如下三种配置方式(以两级分区为例): +``` +"partition":["pt=*,ds=*"] (读取test表所有分区的数据) +"partition":["pt=1,ds=*"] (读取test表下面,一级分区pt=1下面的所有二级分区) +"partition":["pt=1,ds=hangzhou"] (读取test表下面,一级分区pt=1下面,二级分区ds=hz的数据) +``` + +## 5 性能报告(线上环境实测) + +### 5.1 环境准备 + +### 5.2 测试报告 + +## 6 FAQ diff --git a/adswriter/pom.xml b/adswriter/pom.xml new file mode 100644 index 000000000..5e6a12849 --- /dev/null +++ b/adswriter/pom.xml @@ -0,0 +1,113 @@ + + + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + 4.0.0 + + adswriter + adswriter + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + datax-core + ${datax-project-version} + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + org.slf4j + slf4j-api + + + org.apache.commons + commons-exec + 1.3 + + + com.alibaba.datax + odpswriter + ${datax-project-version} + + + ch.qos.logback + logback-classic + + + mysql + mysql-connector-java + 5.1.26 + + + commons-configuration + commons-configuration + 1.9 + + + commons-configuration + commons-configuration + 1.10 + + + commons-configuration + commons-configuration + 1.10 + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/adswriter/src/main/assembly/package.xml b/adswriter/src/main/assembly/package.xml new file mode 100644 index 000000000..c1fb64bb8 --- /dev/null +++ b/adswriter/src/main/assembly/package.xml @@ -0,0 +1,36 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + config.properties + plugin_job_template.json + + plugin/writer/adswriter + + + target/ + + adswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/adswriter + + + + + + false + plugin/writer/adswriter/libs + runtime + + + diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsException.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsException.java new file mode 100644 index 000000000..f0d6f9289 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsException.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.plugin.writer.adswriter; + +public class AdsException extends Exception { + + private static final long serialVersionUID = 1080618043484079794L; + + public final static int ADS_CONN_URL_NOT_SET = -100; + public final static int ADS_CONN_USERNAME_NOT_SET = -101; + public final static int ADS_CONN_PASSWORD_NOT_SET = -102; + public final static int ADS_CONN_SCHEMA_NOT_SET = -103; + + public final static int JOB_NOT_EXIST = -200; + public final static int JOB_FAILED = -201; + + public final static int ADS_LOADDATA_SCHEMA_NULL = -300; + public final static int ADS_LOADDATA_TABLE_NULL = -301; + public final static int ADS_LOADDATA_SOURCEPATH_NULL = -302; + public final static int ADS_LOADDATA_JOBID_NOT_AVAIL = -303; + public final static int ADS_LOADDATA_FAILED = -304; + + public final static int ADS_TABLEMETA_SCHEMA_NULL = -404; + public final static int ADS_TABLEMETA_TABLE_NULL = -405; + + public final static int OTHER = 
-999; + + private int code = OTHER; + private String message; + + public AdsException(int code, String message, Throwable e) { + super(message, e); + this.code = code; + this.message = message; + } + + @Override + public String getMessage() { + return "Code=" + this.code + " Message=" + this.message; + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriter.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriter.java new file mode 100644 index 000000000..f44a6b18b --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriter.java @@ -0,0 +1,328 @@ +package com.alibaba.datax.plugin.writer.adswriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo; +import com.alibaba.datax.plugin.writer.adswriter.insert.AdsInsertProxy; +import com.alibaba.datax.plugin.writer.adswriter.insert.AdsInsertUtil; +import com.alibaba.datax.plugin.writer.adswriter.load.AdsHelper; +import com.alibaba.datax.plugin.writer.adswriter.load.TableMetaHelper; +import com.alibaba.datax.plugin.writer.adswriter.load.TransferProjectConf; +import com.alibaba.datax.plugin.writer.adswriter.odps.TableMeta; +import com.alibaba.datax.plugin.writer.adswriter.util.AdsUtil; +import com.alibaba.datax.plugin.writer.adswriter.util.Constant; +import com.alibaba.datax.plugin.writer.adswriter.util.Key; +import com.alibaba.datax.plugin.writer.odpswriter.OdpsWriter; +import com.aliyun.odps.Instance; +import com.aliyun.odps.Odps; +import com.aliyun.odps.OdpsException; +import com.aliyun.odps.account.Account; +import com.aliyun.odps.account.AliyunAccount; +import com.aliyun.odps.task.SQLTask; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; + +public class AdsWriter extends Writer { + + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Writer.Job.class); + public final static String ODPS_READER = "odpsreader"; + + private OdpsWriter.Job odpsWriterJobProxy = new OdpsWriter.Job(); + private Configuration originalConfig; + private Configuration readerConfig; + + /** + * 持有ads账号的ads helper + */ + private AdsHelper adsHelper; + /** + * 持有odps账号的ads helper + */ + private AdsHelper odpsAdsHelper; + /** + * 中转odps的配置,对应到writer配置的parameter.odps部分 + */ + private TransferProjectConf transProjConf; + private final int ODPSOVERTIME = 120000; + private String odpsTransTableName; + + private String writeMode; + private long startTime; + + @Override + public void init() { + startTime = System.currentTimeMillis(); + this.originalConfig = super.getPluginJobConf(); + this.writeMode = this.originalConfig.getString(Key.WRITE_MODE); + if(null == this.writeMode) { + LOG.warn("您未指定[writeMode]参数, 默认采用load模式, load模式只能用于离线表"); + this.writeMode = Constant.LOADMODE; + this.originalConfig.set(Key.WRITE_MODE, "load"); + } + + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + AdsUtil.checkNecessaryConfig(this.originalConfig, this.writeMode); + loadModeInit(); + } else 
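+            // insert 模式:校验必填配置,并根据 ADS 表元数据补全 column 配置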
if(Constant.INSERTMODE.equalsIgnoreCase(this.writeMode)) { + AdsUtil.checkNecessaryConfig(this.originalConfig, this.writeMode); + List allColumns = AdsInsertUtil.getAdsTableColumnNames(originalConfig); + AdsInsertUtil.dealColumnConf(originalConfig, allColumns); + + LOG.debug("After job init(), originalConfig now is:[\n{}\n]", + originalConfig.toJSON()); + } else { + throw DataXException.asDataXException(AdsWriterErrorCode.INVALID_CONFIG_VALUE, "writeMode 必须为 'load' 或者 'insert'"); + } + } + + private void loadModeInit() { + this.adsHelper = AdsUtil.createAdsHelper(this.originalConfig); + this.odpsAdsHelper = AdsUtil.createAdsHelperWithOdpsAccount(this.originalConfig); + this.transProjConf = TransferProjectConf.create(this.originalConfig); + + /** + * 如果是从odps导入到ads,直接load data然后System.exit() + */ + if (super.getPeerPluginName().equals(ODPS_READER)) { + transferFromOdpsAndExit(); + } + + + Account odpsAccount; + odpsAccount = new AliyunAccount(transProjConf.getAccessId(), transProjConf.getAccessKey()); + + Odps odps = new Odps(odpsAccount); + odps.setEndpoint(transProjConf.getOdpsServer()); + odps.setDefaultProject(transProjConf.getProject()); + + TableMeta tableMeta; + try { + String adsTable = this.originalConfig.getString(Key.ADS_TABLE); + TableInfo tableInfo = adsHelper.getTableInfo(adsTable); + int lifeCycle = this.originalConfig.getInt(Key.Life_CYCLE); + tableMeta = TableMetaHelper.createTempODPSTable(tableInfo, lifeCycle); + this.odpsTransTableName = tableMeta.getTableName(); + String sql = tableMeta.toDDL(); + LOG.info("正在创建ODPS临时表: "+sql); + Instance instance = SQLTask.run(odps, transProjConf.getProject(), sql, null, null); + boolean terminated = false; + int time = 0; + while (!terminated && time < ODPSOVERTIME) { + Thread.sleep(1000); + terminated = instance.isTerminated(); + time += 1000; + } + LOG.info("正在创建ODPS临时表成功"); + } catch (AdsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_CREATETABLE_FAILED, e); + }catch (OdpsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_CREATETABLE_FAILED,e); + } catch (InterruptedException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_CREATETABLE_FAILED,e); + } + + Configuration newConf = AdsUtil.generateConf(this.originalConfig, this.odpsTransTableName, + tableMeta, this.transProjConf); + odpsWriterJobProxy.setPluginJobConf(newConf); + odpsWriterJobProxy.init(); + } + + /** + * 当reader是odps的时候,直接call ads的load接口,完成后退出。 + * 这种情况下,用户在odps reader里头填写的参数只有部分有效。 + * 其中accessId、accessKey是忽略掉iao的。 + */ + private void transferFromOdpsAndExit() { + this.readerConfig = super.getPeerPluginJobConf(); + String odpsTableName = this.readerConfig.getString(Key.ODPSTABLENAME); + List userConfiguredPartitions = this.readerConfig.getList(Key.PARTITION, String.class); + + if (userConfiguredPartitions == null) { + userConfiguredPartitions = Collections.emptyList(); + } + + if(userConfiguredPartitions.size() > 1) + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_PARTITION_FAILED, ""); + + if(userConfiguredPartitions.size() == 0) { + loadAdsData(adsHelper, odpsTableName,null); + }else { + loadAdsData(adsHelper, odpsTableName,userConfiguredPartitions.get(0)); + } + System.exit(0); + } + + // 一般来说,是需要推迟到 task 中进行pre 的执行(单表情况例外) + @Override + public void prepare() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + //导数据到odps表中 + this.odpsWriterJobProxy.prepare(); + } else { + //todo 目前insert模式不支持presql + } + } + + @Override + public List split(int mandatoryNumber) { 
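+            // load 模式委托 OdpsWriter 进行切分;insert 模式按并发数(mandatoryNumber)复制任务配置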
+ if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + return this.odpsWriterJobProxy.split(mandatoryNumber); + } else { + List splitResult = new ArrayList(); + for(int i = 0; i < mandatoryNumber; i++) { + splitResult.add(this.originalConfig.clone()); + } + return splitResult; + } + } + + // 一般来说,是需要推迟到 task 中进行post 的执行(单表情况例外) + @Override + public void post() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + loadAdsData(odpsAdsHelper, this.odpsTransTableName, null); + this.odpsWriterJobProxy.post(); + } else { + //insert mode do noting + } + } + + @Override + public void destroy() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + this.odpsWriterJobProxy.destroy(); + } else { + //insert mode do noting + } + } + + private void loadAdsData(AdsHelper helper, String odpsTableName, String odpsPartition) { + + String table = this.originalConfig.getString(Key.ADS_TABLE); + String project; + if (super.getPeerPluginName().equals(ODPS_READER)) { + project = this.readerConfig.getString(Key.PROJECT); + } else { + project = this.transProjConf.getProject(); + } + String partition = this.originalConfig.getString(Key.PARTITION); + String sourcePath = AdsUtil.generateSourcePath(project,odpsTableName,odpsPartition); + /** + * 因为之前检查过,所以不用担心unbox的时候NPE + */ + boolean overwrite = this.originalConfig.getBool(Key.OVER_WRITE); + try { + String id = helper.loadData(table,partition,sourcePath,overwrite); + LOG.info("ADS Load Data任务已经提交,job id: " + id); + boolean terminated = false; + int time = 0; + while(!terminated) { + Thread.sleep(120000); + terminated = helper.checkLoadDataJobStatus(id); + time += 2; + LOG.info("ADS 正在导数据中,整个过程需要20分钟以上,请耐心等待,目前已执行 "+ time+" 分钟"); + } + LOG.info("ADS 导数据已成功"); + } catch (AdsException e) { + if (super.getPeerPluginName().equals(ODPS_READER)) { + // TODO 使用云账号 + AdsWriterErrorCode.ADS_LOAD_ODPS_FAILED.setAdsAccount(helper.getUserName()); + throw DataXException.asDataXException(AdsWriterErrorCode.ADS_LOAD_ODPS_FAILED,e); + } else { + throw DataXException.asDataXException(AdsWriterErrorCode.ADS_LOAD_TEMP_ODPS_FAILED,e); + } + } catch (InterruptedException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_CREATETABLE_FAILED,e); + } + } + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private OdpsWriter.Task odpsWriterTaskProxy = new OdpsWriter.Task(); + + + private String writeMode; + private int columnNumber; + + @Override + public void init() { + writerSliceConfig = super.getPluginJobConf(); + this.writeMode = this.writerSliceConfig.getString(Key.WRITE_MODE); + + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.setPluginJobConf(writerSliceConfig); + odpsWriterTaskProxy.init(); + } else if(Constant.INSERTMODE.equalsIgnoreCase(this.writeMode)) { + List allColumns = AdsInsertUtil.getAdsTableColumnNames(writerSliceConfig); + AdsInsertUtil.dealColumnConf(writerSliceConfig, allColumns); + List userColumns = writerSliceConfig.getList(Key.COLUMN, String.class); + this.columnNumber = userColumns.size(); + } else { + throw DataXException.asDataXException(AdsWriterErrorCode.INVALID_CONFIG_VALUE, "writeMode 必须为 'load' 或者 'insert'"); + } + + } + + @Override + public void prepare() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.prepare(); + } else { + //do nothing + } + } + + //TODO 改用连接池,确保每次获取的连接都是可用的(注意:连接可能需要每次都初始化其 session) + public void startWrite(RecordReceiver recordReceiver) { + 
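+            // load 模式代理给 OdpsWriter Task;insert 模式通过 MySQL 协议的 JDBC 连接接入 ADS,由 AdsInsertProxy 拼接 insert 语句写入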
if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.setTaskPluginCollector(super.getTaskPluginCollector()); + odpsWriterTaskProxy.startWrite(recordReceiver); + } else { + //todo insert 模式 + String username = writerSliceConfig.getString(Key.USERNAME); + String password = writerSliceConfig.getString(Key.PASSWORD); + String adsURL = writerSliceConfig.getString(Key.ADS_URL); + String schema = writerSliceConfig.getString(Key.SCHEMA); + String table = writerSliceConfig.getString(Key.ADS_TABLE); + List columns = writerSliceConfig.getList(Key.COLUMN, String.class); + String jdbcUrl = "jdbc:mysql://" + adsURL + "/" + schema + "?useUnicode=true&characterEncoding=UTF-8"; + Connection connection = DBUtil.getConnection(DataBaseType.ADS, + jdbcUrl, username, password); + TaskPluginCollector taskPluginCollector = super.getTaskPluginCollector(); + AdsInsertProxy proxy = new AdsInsertProxy(schema + "." + table, columns, writerSliceConfig, taskPluginCollector); + proxy.startWriteWithConnection(recordReceiver, connection, columnNumber); + } + } + + @Override + public void post() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.post(); + } else { + //do noting until now + } + } + + @Override + public void destroy() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.destroy(); + } else { + //do noting until now + } + } + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriterErrorCode.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriterErrorCode.java new file mode 100644 index 000000000..a1ac3c107 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriterErrorCode.java @@ -0,0 +1,54 @@ +package com.alibaba.datax.plugin.writer.adswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum AdsWriterErrorCode implements ErrorCode { + REQUIRED_VALUE("AdsWriter-00", "您缺失了必须填写的参数值."), + NO_ADS_TABLE("AdsWriter-01", "ADS表不存在."), + ODPS_CREATETABLE_FAILED("AdsWriter-02", "创建ODPS临时表失败,请联系ADS 技术支持"), + ADS_LOAD_TEMP_ODPS_FAILED("AdsWriter-03", "ADS从ODPS临时表导数据失败,请联系ADS 技术支持"), + TABLE_TRUNCATE_ERROR("AdsWriter-04", "清空 ODPS 目的表时出错."), + CREATE_ADS_HELPER_FAILED("AdsWriter-05", "创建ADSHelper对象出错,请联系ADS 技术支持"), + ODPS_PARTITION_FAILED("AdsWriter-06", "ODPS Reader不允许配置多个partition,目前只支持三种配置方式,\"partition\":[\"pt=*,ds=*\"](读取test表所有分区的数据); \n" + + "\"partition\":[\"pt=1,ds=*\"](读取test表下面,一级分区pt=1下面的所有二级分区); \n" + + "\"partition\":[\"pt=1,ds=hangzhou\"](读取test表下面,一级分区pt=1下面,二级分区ds=hz的数据)"), + ADS_LOAD_ODPS_FAILED("AdsWriter-07", "ADS从ODPS导数据失败,请联系ADS 技术支持,先检查ADS账号是否已加到该ODPS Project中。ADS账号为:"), + INVALID_CONFIG_VALUE("AdsWriter-08", "不合法的配置值."), + + GET_ADS_TABLE_MEATA_FAILED("AdsWriter-11", "获取ADS table原信息失败"); + + private final String code; + private final String description; + private String adsAccount; + + + private AdsWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + public void setAdsAccount(String adsAccount) { + this.adsAccount = adsAccount; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + if (this.code.equals("AdsWriter-07")){ + return String.format("Code:[%s], Description:[%s][%s]. ", this.code, + this.description,adsAccount); + }else{ + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnDataType.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnDataType.java new file mode 100644 index 000000000..9062d0fd7 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnDataType.java @@ -0,0 +1,414 @@ +package com.alibaba.datax.plugin.writer.adswriter.ads; + +import java.math.BigDecimal; +import java.sql.Date; +import java.sql.Time; +import java.sql.Timestamp; +import java.sql.Types; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; + +/** + * ADS column data type. + * + * @since 0.0.1 + */ +public class ColumnDataType { + + // public static final int NULL = 0; + public static final int BOOLEAN = 1; + public static final int BYTE = 2; + public static final int SHORT = 3; + public static final int INT = 4; + public static final int LONG = 5; + public static final int DECIMAL = 6; + public static final int DOUBLE = 7; + public static final int FLOAT = 8; + public static final int TIME = 9; + public static final int DATE = 10; + public static final int TIMESTAMP = 11; + public static final int STRING = 13; + // public static final int STRING_IGNORECASE = 14; + // public static final int STRING_FIXED = 21; + + public static final int MULTI_VALUE = 22; + + public static final int TYPE_COUNT = MULTI_VALUE + 1; + + /** + * The list of types. An ArrayList so that Tomcat doesn't set it to null when clearing references. + */ + private static final ArrayList TYPES = new ArrayList(); + private static final HashMap TYPES_BY_NAME = new HashMap(); + private static final ArrayList TYPES_BY_VALUE_TYPE = new ArrayList(); + + /** + * @param dataTypes + * @return + */ + public static String getNames(int[] dataTypes) { + List names = new ArrayList(dataTypes.length); + for (final int dataType : dataTypes) { + names.add(ColumnDataType.getDataType(dataType).name); + } + return names.toString(); + } + + public int type; + public String name; + public int sqlType; + public String jdbc; + + /** + * How closely the data type maps to the corresponding JDBC SQL type (low is best). 
+ */ + public int sqlTypePos; + + static { + for (int i = 0; i < TYPE_COUNT; i++) { + TYPES_BY_VALUE_TYPE.add(null); + } + // add(NULL, Types.NULL, "Null", new String[] { "NULL" }); + add(STRING, Types.VARCHAR, "String", new String[] { "VARCHAR", "VARCHAR2", "NVARCHAR", "NVARCHAR2", + "VARCHAR_CASESENSITIVE", "CHARACTER VARYING", "TID" }); + add(STRING, Types.LONGVARCHAR, "String", new String[] { "LONGVARCHAR", "LONGNVARCHAR" }); + // add(STRING_FIXED, Types.CHAR, "String", new String[] { "CHAR", "CHARACTER", "NCHAR" }); + // add(STRING_IGNORECASE, Types.VARCHAR, "String", new String[] { "VARCHAR_IGNORECASE" }); + add(BOOLEAN, Types.BOOLEAN, "Boolean", new String[] { "BOOLEAN", "BIT", "BOOL" }); + add(BYTE, Types.TINYINT, "Byte", new String[] { "TINYINT" }); + add(SHORT, Types.SMALLINT, "Short", new String[] { "SMALLINT", "YEAR", "INT2" }); + add(INT, Types.INTEGER, "Int", new String[] { "INTEGER", "INT", "MEDIUMINT", "INT4", "SIGNED" }); + add(INT, Types.INTEGER, "Int", new String[] { "SERIAL" }); + add(LONG, Types.BIGINT, "Long", new String[] { "BIGINT", "INT8", "LONG" }); + add(LONG, Types.BIGINT, "Long", new String[] { "IDENTITY", "BIGSERIAL" }); + add(DECIMAL, Types.DECIMAL, "BigDecimal", new String[] { "DECIMAL", "DEC" }); + add(DECIMAL, Types.NUMERIC, "BigDecimal", new String[] { "NUMERIC", "NUMBER" }); + add(FLOAT, Types.REAL, "Float", new String[] { "REAL", "FLOAT4" }); + add(DOUBLE, Types.DOUBLE, "Double", new String[] { "DOUBLE", "DOUBLE PRECISION" }); + add(DOUBLE, Types.FLOAT, "Double", new String[] { "FLOAT", "FLOAT8" }); + add(TIME, Types.TIME, "Time", new String[] { "TIME" }); + add(DATE, Types.DATE, "Date", new String[] { "DATE" }); + add(TIMESTAMP, Types.TIMESTAMP, "Timestamp", new String[] { "TIMESTAMP", "DATETIME", "SMALLDATETIME" }); + add(MULTI_VALUE, Types.VARCHAR, "String", new String[] { "MULTIVALUE" }); + } + + private static void add(int type, int sqlType, String jdbc, String[] names) { + for (int i = 0; i < names.length; i++) { + ColumnDataType dt = new ColumnDataType(); + dt.type = type; + dt.sqlType = sqlType; + dt.jdbc = jdbc; + dt.name = names[i]; + for (ColumnDataType t2 : TYPES) { + if (t2.sqlType == dt.sqlType) { + dt.sqlTypePos++; + } + } + TYPES_BY_NAME.put(dt.name, dt); + if (TYPES_BY_VALUE_TYPE.get(type) == null) { + TYPES_BY_VALUE_TYPE.set(type, dt); + } + TYPES.add(dt); + } + } + +// /** +// * Get the list of ads data types. +// * +// * @return the ads data types +// */ +// public static ArrayList getTypes() { +// return TYPES; +// } + +// /** +// * Get the name of the Java class for the given value type. 
+// * +// * @param type the value type +// * @return the class name +// */ +// public static String getTypeClassName(int type) { +// switch (type) { +// case BOOLEAN: +// // "java.lang.Boolean"; +// return Boolean.class.getName(); +// case BYTE: +// // "java.lang.Byte"; +// return Byte.class.getName(); +// case SHORT: +// // "java.lang.Short"; +// return Short.class.getName(); +// case INT: +// // "java.lang.Integer"; +// return Integer.class.getName(); +// case LONG: +// // "java.lang.Long"; +// return Long.class.getName(); +// case DECIMAL: +// // "java.math.BigDecimal"; +// return BigDecimal.class.getName(); +// case TIME: +// // "java.sql.Time"; +// return Time.class.getName(); +// case DATE: +// // "java.sql.Date"; +// return Date.class.getName(); +// case TIMESTAMP: +// // "java.sql.Timestamp"; +// return Timestamp.class.getName(); +// case STRING: +// // case STRING_IGNORECASE: +// // case STRING_FIXED: +// case MULTI_VALUE: +// // "java.lang.String"; +// return String.class.getName(); +// case DOUBLE: +// // "java.lang.Double"; +// return Double.class.getName(); +// case FLOAT: +// // "java.lang.Float"; +// return Float.class.getName(); +// // case NULL: +// // return null; +// default: +// throw new IllegalArgumentException("type=" + type); +// } +// } + + /** + * Get the data type object for the given value type. + * + * @param type the value type + * @return the data type object + */ + public static ColumnDataType getDataType(int type) { + if (type < 0 || type >= TYPE_COUNT) { + throw new IllegalArgumentException("type=" + type); + } + ColumnDataType dt = TYPES_BY_VALUE_TYPE.get(type); + // if (dt == null) { + // dt = TYPES_BY_VALUE_TYPE.get(NULL); + // } + return dt; + } + + /** + * Convert a value type to a SQL type. + * + * @param type the value type + * @return the SQL type + */ + public static int convertTypeToSQLType(int type) { + return getDataType(type).sqlType; + } + + /** + * Convert a SQL type to a value type. + * + * @param sqlType the SQL type + * @return the value type + */ + public static int convertSQLTypeToValueType(int sqlType) { + switch (sqlType) { + // case Types.CHAR: + // case Types.NCHAR: + // return STRING_FIXED; + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + return STRING; + case Types.NUMERIC: + case Types.DECIMAL: + return DECIMAL; + case Types.BIT: + case Types.BOOLEAN: + return BOOLEAN; + case Types.INTEGER: + return INT; + case Types.SMALLINT: + return SHORT; + case Types.TINYINT: + return BYTE; + case Types.BIGINT: + return LONG; + case Types.REAL: + return FLOAT; + case Types.DOUBLE: + case Types.FLOAT: + return DOUBLE; + case Types.DATE: + return DATE; + case Types.TIME: + return TIME; + case Types.TIMESTAMP: + return TIMESTAMP; + // case Types.NULL: + // return NULL; + default: + throw new IllegalArgumentException("JDBC Type: " + sqlType); + } + } + + /** + * Get the value type for the given Java class. 
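+     * Primitive classes are boxed first; java.util.Date subclasses other than java.sql.Date/Time/Timestamp map to TIMESTAMP.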
+ * + * @param x the Java class + * @return the value type + */ + public static int getTypeFromClass(Class x) { + // if (x == null || Void.TYPE == x) { + // return NULL; + // } + if (x.isPrimitive()) { + x = getNonPrimitiveClass(x); + } + if (String.class == x) { + return STRING; + } else if (Integer.class == x) { + return INT; + } else if (Long.class == x) { + return LONG; + } else if (Boolean.class == x) { + return BOOLEAN; + } else if (Double.class == x) { + return DOUBLE; + } else if (Byte.class == x) { + return BYTE; + } else if (Short.class == x) { + return SHORT; + } else if (Float.class == x) { + return FLOAT; + // } else if (Void.class == x) { + // return NULL; + } else if (BigDecimal.class.isAssignableFrom(x)) { + return DECIMAL; + } else if (Date.class.isAssignableFrom(x)) { + return DATE; + } else if (Time.class.isAssignableFrom(x)) { + return TIME; + } else if (Timestamp.class.isAssignableFrom(x)) { + return TIMESTAMP; + } else if (java.util.Date.class.isAssignableFrom(x)) { + return TIMESTAMP; + } else { + throw new IllegalArgumentException("class=" + x); + } + } + + /** + * ads getNonPrimitiveClass + * @param clazz + * @return + */ + public static Class getNonPrimitiveClass(Class clazz) { + if (!clazz.isPrimitive()) { + return clazz; + } else if (clazz == boolean.class) { + // ads return "java.lang.Boolean"; + return Boolean.class; + } else if (clazz == char.class) { + // ads return "java.lang.Character"; + return Character.class; + } else if (clazz == byte.class) { + // ads return "java.lang.Byte"; + return Byte.class; + } else if (clazz == double.class) { + // ads return "java.lang.Double"; + return Double.class; + } else if (clazz == float.class) { + // ads return "java.lang.Float"; + return Float.class; + } else if (clazz == int.class) { + // ads return "java.lang.Integer"; + return Integer.class; + } else if (clazz == short.class) { + // ads return "java.lang.Short"; + return Short.class; + } else if (clazz == long.class) { + // ads return "java.lang.Long"; + return Long.class; + } else if (clazz == void.class) { + // ads return "java.lang.Void"; + return Void.class; + } + return clazz; + } + + /** + * Get a data type object from a type name. + * + * @param s the type name + * @return the data type object + */ + public static ColumnDataType getTypeByName(String s) { + return TYPES_BY_NAME.get(s); + } + + /** + * Check if the given value type is a String (VARCHAR,...). + * + * @param type the value type + * @return true if the value type is a String type + */ + public static boolean isStringType(int type) { + if (type == STRING /* || type == STRING_FIXED || type == STRING_IGNORECASE */ + || type == MULTI_VALUE) { + return true; + } + return false; + } + + /** + * @return + */ + public boolean supportsAdd() { + return supportsAdd(type); + } + + /** + * Check if the given value type supports the add operation. + * + * @param type the value type + * @return true if add is supported + */ + public static boolean supportsAdd(int type) { + switch (type) { + case BYTE: + case DECIMAL: + case DOUBLE: + case FLOAT: + case INT: + case LONG: + case SHORT: + return true; + default: + return false; + } + } + + /** + * Get the data type that will not overflow when calling 'add' 2 billion times. 
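+     * BYTE, SHORT and INT widen to LONG, FLOAT to DOUBLE, and LONG to DECIMAL; other types are returned unchanged.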
+ * + * @param type the value type + * @return the data type that supports adding + */ + public static int getAddProofType(int type) { + switch (type) { + case BYTE: + return LONG; + case FLOAT: + return DOUBLE; + case INT: + return LONG; + case LONG: + return DECIMAL; + case SHORT: + return LONG; + default: + return type; + } + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnInfo.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnInfo.java new file mode 100644 index 000000000..030ce35d1 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnInfo.java @@ -0,0 +1,72 @@ +package com.alibaba.datax.plugin.writer.adswriter.ads; + +/** + * ADS column meta.
+ *
+ * select ordinal_position,column_name,data_type,type_name,column_comment
+ * from information_schema.columns
+ * where table_schema='db_name' and table_name='table_name'
+ * and is_deleted=0
+ * order by ordinal_position limit 1000
+ *
+ * + * @since 0.0.1 + */ +public class ColumnInfo { + + private int ordinal; + private String name; + private ColumnDataType dataType; + private boolean isDeleted; + private String comment; + + public int getOrdinal() { + return ordinal; + } + + public void setOrdinal(int ordinal) { + this.ordinal = ordinal; + } + + public String getName() { + return name; + } + + public void setName(String name) { + this.name = name; + } + + public ColumnDataType getDataType() { + return dataType; + } + + public void setDataType(ColumnDataType dataType) { + this.dataType = dataType; + } + + public boolean isDeleted() { + return isDeleted; + } + + public void setDeleted(boolean isDeleted) { + this.isDeleted = isDeleted; + } + + public String getComment() { + return comment; + } + + public void setComment(String comment) { + this.comment = comment; + } + + @Override + public String toString() { + StringBuilder builder = new StringBuilder(); + builder.append("ColumnInfo [ordinal=").append(ordinal).append(", name=").append(name).append(", dataType=") + .append(dataType).append(", isDeleted=").append(isDeleted).append(", comment=").append(comment) + .append("]"); + return builder.toString(); + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/TableInfo.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/TableInfo.java new file mode 100644 index 000000000..f2395d6b7 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/TableInfo.java @@ -0,0 +1,69 @@ +package com.alibaba.datax.plugin.writer.adswriter.ads; + +import java.util.List; + +/** + * ADS table meta.
+ *
+ * select table_schema, table_name,comments
+ * from information_schema.tables
+ * where table_schema='alimama' and table_name='click_af' limit 1
+ *
+ * select ordinal_position,column_name,data_type,type_name,column_comment
+ * from information_schema.columns
+ * where table_schema='db_name' and table_name='table_name'
+ * and is_deleted=0
+ * order by ordinal_position limit 1000
+ *
+ * + * @since 0.0.1 + */ +public class TableInfo { + + private String tableSchema; + private String tableName; + private List columns; + private String comments; + + @Override + public String toString() { + StringBuilder builder = new StringBuilder(); + builder.append("TableInfo [tableSchema=").append(tableSchema).append(", tableName=").append(tableName) + .append(", columns=").append(columns).append(", comments=").append(comments).append("]"); + return builder.toString(); + } + + public String getTableSchema() { + return tableSchema; + } + + public void setTableSchema(String tableSchema) { + this.tableSchema = tableSchema; + } + + public String getTableName() { + return tableName; + } + + public void setTableName(String tableName) { + this.tableName = tableName; + } + + public List getColumns() { + return columns; + } + + public void setColumns(List columns) { + this.columns = columns; + } + + public String getComments() { + return comments; + } + + public void setComments(String comments) { + this.comments = comments; + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/package-info.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/package-info.java new file mode 100644 index 000000000..b396c49ff --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/package-info.java @@ -0,0 +1,6 @@ +/** + * ADS meta and service. + * + * @since 0.0.1 + */ +package com.alibaba.datax.plugin.writer.adswriter.ads; \ No newline at end of file diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertProxy.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertProxy.java new file mode 100644 index 000000000..bd01a1553 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertProxy.java @@ -0,0 +1,286 @@ +package com.alibaba.datax.plugin.writer.adswriter.insert; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.writer.adswriter.util.Constant; +import com.alibaba.datax.plugin.writer.adswriter.util.Key; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.Triple; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.*; +import java.util.ArrayList; +import java.util.List; + + +public class AdsInsertProxy { + + private static final Logger LOG = LoggerFactory + .getLogger(AdsInsertProxy.class); + + private String table; + private List columns; + private TaskPluginCollector taskPluginCollector; + private Configuration configuration; + private Boolean emptyAsNull; + + private Triple, List, List> resultSetMetaData; + + public AdsInsertProxy(String table, List columns, Configuration configuration, TaskPluginCollector taskPluginCollector) { + this.table = table; + this.columns = columns; + this.configuration = configuration; + this.taskPluginCollector = taskPluginCollector; + this.emptyAsNull = configuration.getBool(Key.EMPTY_AS_NULL, false); + } + + public void startWriteWithConnection(RecordReceiver recordReceiver, + Connection 
connection, + int columnNumber) { + //目前 ads 新建的表 如果未插入数据 不能通过select colums from table where 1=2,获取列信息。 +// this.resultSetMetaData = DBUtil.getColumnMetaData(connection, +// this.table, StringUtils.join(this.columns, ",")); + + this.resultSetMetaData = AdsInsertUtil.getColumnMetaData(configuration, columns); + + int batchSize = this.configuration.getInt(Key.BATCH_SIZE, Constant.DEFAULT_BATCH_SIZE); + List writeBuffer = new ArrayList(batchSize); + try { + Record record; + while ((record = recordReceiver.getFromReader()) != null) { + if (record.getColumnNumber() != columnNumber) { + // 源头读取字段列数与目的表字段写入列数不相等,直接报错 + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "列配置信息有错误. 因为您配置的任务中,源头读取字段数:%s 与 目的表要写入的字段数:%s 不相等. 请检查您的配置并作出修改.", + record.getColumnNumber(), + columnNumber)); + } + + writeBuffer.add(record); + + if (writeBuffer.size() >= batchSize) { + doOneInsert(connection, writeBuffer); + writeBuffer.clear(); + } + } + if (!writeBuffer.isEmpty()) { + doOneInsert(connection, writeBuffer); + writeBuffer.clear(); + } + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + writeBuffer.clear(); + DBUtil.closeDBResources(null, null, connection); + } + } + + protected void doBatchInsert(Connection connection, List buffer) throws SQLException { + Statement statement = null; + try { + connection.setAutoCommit(false); + statement = connection.createStatement(); + + for (Record record : buffer) { + String sql = generateInsertSql(record); + statement.addBatch(sql); + } + statement.executeBatch(); + connection.commit(); + } catch (SQLException e) { + LOG.warn("回滚此次写入, 采用每次写入一行方式提交. 因为:" + e.getMessage()); + connection.rollback(); + doOneInsert(connection, buffer); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + DBUtil.closeDBResources(statement, null); + } + } + + protected void doOneInsert(Connection connection, List buffer) { + Statement statement = null; + String sql = null; + try { + connection.setAutoCommit(true); + statement = connection.createStatement(); + + for (Record record : buffer) { + try { + sql = generateInsertSql(record); + int status = statement.executeUpdate(sql); + sql = null; + } catch (SQLException e) { + LOG.error("sql: " + sql, e.getMessage()); + this.taskPluginCollector.collectDirtyRecord(record, e); + } + } + } catch (Exception e) { + LOG.error("插入异常, sql: " + sql); + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + DBUtil.closeDBResources(statement, null); + } + } + + private String generateInsertSql(Record record) throws SQLException { + StringBuilder sqlSb = new StringBuilder("insert into " + this.table + "(" + + StringUtils.join(columns, ",") + ") values("); + for (int i = 0; i < columns.size(); i++) { + int columnSqltype = this.resultSetMetaData.getMiddle().get(i); + checkColumnType(columnSqltype, sqlSb, record.getColumn(i), i); + if((i+1) != columns.size()) { + sqlSb.append(","); + } + } + sqlSb.append(")"); + return sqlSb.toString(); + } + + private void checkColumnType(int columnSqltype, StringBuilder sqlSb, Column column, int columnIndex) throws SQLException { + java.util.Date utilDate; + switch (columnSqltype) { + case Types.CHAR: + case Types.NCHAR: + case Types.CLOB: + case Types.NCLOB: + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + String strValue = column.asString(); + if(null == 
strValue) { + sqlSb.append("null"); + } else { + String optStr = column.asString().replace("\\","\\\\"); + sqlSb.append("'").append(optStr).append("'"); + } + break; + + case Types.SMALLINT: + case Types.INTEGER: + case Types.BIGINT: + case Types.NUMERIC: + case Types.DECIMAL: + case Types.FLOAT: + case Types.REAL: + case Types.DOUBLE: + String numValue = column.asString(); + if(emptyAsNull && "".equals(numValue) || numValue == null){ + sqlSb.append("null"); + } else{ + sqlSb.append(numValue); + } + break; + + //tinyint is a little special in some database like mysql {boolean->tinyint(1)} + case Types.TINYINT: + Long longValue = column.asLong(); + if (null == longValue) { + sqlSb.append("null"); + } else { + sqlSb.append(longValue); + } + break; + + case Types.DATE: + java.sql.Date sqlDate = null; + try { + if("".equals(column.getRawData())) { + utilDate = null; + } else { + utilDate = column.asDate(); + } + } catch (DataXException e) { + throw new SQLException(String.format( + "Date 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlDate = new java.sql.Date(utilDate.getTime()); + sqlSb.append("'").append(sqlDate).append("'"); + } else { + sqlSb.append("null"); + } + break; + + case Types.TIME: + java.sql.Time sqlTime = null; + try { + if("".equals(column.getRawData())) { + utilDate = null; + } else { + utilDate = column.asDate(); + } + } catch (DataXException e) { + throw new SQLException(String.format( + "TIME 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlTime = new java.sql.Time(utilDate.getTime()); + sqlSb.append("'").append(sqlTime).append("'"); + } else { + sqlSb.append("null"); + } + break; + + case Types.TIMESTAMP: + java.sql.Timestamp sqlTimestamp = null; + try { + if("".equals(column.getRawData())) { + utilDate = null; + } else { + utilDate = column.asDate(); + } + } catch (DataXException e) { + throw new SQLException(String.format( + "TIMESTAMP 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlTimestamp = new java.sql.Timestamp( + utilDate.getTime()); + sqlSb.append("'").append(sqlTimestamp).append("'"); + } else { + sqlSb.append("null"); + } + break; + + case Types.BOOLEAN: + case Types.BIT: + String bitValue = column.asString(); + if(bitValue == null) { + sqlSb.append("null"); + } else { + sqlSb.append(bitValue); + } + break; + default: + throw DataXException + .asDataXException( + DBUtilErrorCode.UNSUPPORTED_TYPE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 字段名:[%s], 字段类型:[%d], 字段Java类型:[%s]. 
请修改表中该字段的类型或者不同步该字段.", + this.resultSetMetaData.getLeft() + .get(columnIndex), + this.resultSetMetaData.getMiddle() + .get(columnIndex), + this.resultSetMetaData.getRight() + .get(columnIndex))); + } + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertUtil.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertUtil.java new file mode 100644 index 000000000..11550b979 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertUtil.java @@ -0,0 +1,134 @@ +package com.alibaba.datax.plugin.writer.adswriter.insert; + +import com.alibaba.datax.common.exception.DataXException; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.ListUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.writer.adswriter.AdsException; +import com.alibaba.datax.plugin.writer.adswriter.AdsWriterErrorCode; +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnInfo; +import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo; +import com.alibaba.datax.plugin.writer.adswriter.load.AdsHelper; +import com.alibaba.datax.plugin.writer.adswriter.util.Key; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.ImmutableTriple; +import org.apache.commons.lang3.tuple.Triple; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.*; +import java.util.ArrayList; +import java.util.List; + + +public class AdsInsertUtil { + + private static final Logger LOG = LoggerFactory + .getLogger(AdsInsertUtil.class); + + public Connection getAdsConnect(Configuration conf) { + String userName = conf.getString(Key.USERNAME); + String passWord = conf.getString(Key.PASSWORD); + String adsURL = conf.getString(Key.ADS_URL); + String schema = conf.getString(Key.SCHEMA); + String jdbcUrl = "jdbc:mysql://" + adsURL + "/" + schema + "?useUnicode=true&characterEncoding=UTF-8"; + + Connection connection = DBUtil.getConnection(DataBaseType.ADS, userName, passWord, jdbcUrl); + return connection; + } + + + public static List getAdsTableColumnNames(Configuration conf) { + List tableColumns = new ArrayList(); + String userName = conf.getString(Key.USERNAME); + String passWord = conf.getString(Key.PASSWORD); + String adsUrl = conf.getString(Key.ADS_URL); + String schema = conf.getString(Key.SCHEMA); + String tableName = conf.getString(Key.ADS_TABLE); + AdsHelper adsHelper = new AdsHelper(adsUrl, userName, passWord, schema); + TableInfo tableInfo= null; + try { + tableInfo = adsHelper.getTableInfo(tableName); + } catch (AdsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.GET_ADS_TABLE_MEATA_FAILED, e); + } + + List columnInfos = tableInfo.getColumns(); + for(ColumnInfo columnInfo: columnInfos) { + tableColumns.add(columnInfo.getName()); + } + + LOG.info("table:[{}] all columns:[\n{}\n].", tableName, + StringUtils.join(tableColumns, ",")); + return tableColumns; + } + + public static Triple, List, List> getColumnMetaData + (Configuration configuration, List userColumns) { + Triple, List, List> columnMetaData = new ImmutableTriple, List, List>( + new ArrayList(), new ArrayList(), + new ArrayList()); + + List columnInfoList = getAdsTableColumns(configuration); + for(String column : userColumns) { + for (ColumnInfo columnInfo : columnInfoList) { + 
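+                    // 按用户配置的列顺序匹配 ADS 表元数据,收集列名、SQL 类型和类型名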
if(column.equals(columnInfo.getName())) { + columnMetaData.getLeft().add(columnInfo.getName()); + columnMetaData.getMiddle().add(columnInfo.getDataType().sqlType); + columnMetaData.getRight().add( + columnInfo.getDataType().name); + } + } + } + return columnMetaData; + } + + public static List getAdsTableColumns(Configuration conf) { + String userName = conf.getString(Key.USERNAME); + String passWord = conf.getString(Key.PASSWORD); + String adsUrl = conf.getString(Key.ADS_URL); + String schema = conf.getString(Key.SCHEMA); + String tableName = conf.getString(Key.ADS_TABLE); + AdsHelper adsHelper = new AdsHelper(adsUrl, userName, passWord, schema); + TableInfo tableInfo= null; + try { + tableInfo = adsHelper.getTableInfo(tableName); + } catch (AdsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.GET_ADS_TABLE_MEATA_FAILED, e); + } + + List columnInfos = tableInfo.getColumns(); + + return columnInfos; + } + + public static void dealColumnConf(Configuration originalConfig, List tableColumns) { + List userConfiguredColumns = originalConfig.getList(Key.COLUMN, String.class); + if (null == userConfiguredColumns || userConfiguredColumns.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的配置文件中的列配置信息有误. 因为您未配置写入数据库表的列名称,DataX获取不到列信息. 请检查您的配置并作出修改."); + } else { + if (1 == userConfiguredColumns.size() && "*".equals(userConfiguredColumns.get(0))) { + LOG.warn("您的配置文件中的列配置信息存在风险. 因为您配置的写入数据库表的列为*,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改."); + + // 回填其值,需要以 String 的方式转交后续处理 + originalConfig.set(Key.COLUMN, tableColumns); + } else if (userConfiguredColumns.size() > tableColumns.size()) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + String.format("您的配置文件中的列配置信息有误. 因为您所配置的写入数据库表的字段个数:%s 大于目的表的总字段总个数:%s. 
请检查您的配置并作出修改.", + userConfiguredColumns.size(), tableColumns.size())); + } else { + // 确保用户配置的 column 不重复 + ListUtil.makeSureNoValueDuplicate(userConfiguredColumns, false); + + // 检查列是否都为数据库表中正确的列(通过执行一次 select column from table 进行判断) + ListUtil.makeSureBInA(tableColumns, userConfiguredColumns, true); + } + } + } + + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/AdsHelper.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/AdsHelper.java new file mode 100644 index 000000000..5f429294c --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/AdsHelper.java @@ -0,0 +1,388 @@ +/** + * + */ +package com.alibaba.datax.plugin.writer.adswriter.load; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.writer.adswriter.AdsException; +import com.alibaba.datax.plugin.writer.adswriter.AdsWriterErrorCode; +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnDataType; +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnInfo; +import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.*; +import java.util.ArrayList; +import java.util.List; +import java.util.Properties; + +public class AdsHelper { + private static final Logger LOG = LoggerFactory + .getLogger(AdsHelper.class); + + private String adsURL; + private String userName; + private String password; + private String schema; + + public AdsHelper(String adsUrl, String userName, String password, String schema) { + this.adsURL = adsUrl; + this.userName = userName; + this.password = password; + this.schema = schema; + } + + public String getAdsURL() { + return adsURL; + } + + public void setAdsURL(String adsURL) { + this.adsURL = adsURL; + } + + public String getUserName() { + return userName; + } + + public void setUserName(String userName) { + this.userName = userName; + } + + public String getPassword() { + return password; + } + + public void setPassword(String password) { + this.password = password; + } + + public String getSchema() { + return schema; + } + + public void setSchema(String schema) { + this.schema = schema; + } + + /** + * Obtain the table meta information. 
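+     * Column metadata is read from {@code information_schema.columns} over the ADS MySQL protocol endpoint.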
+ * + * @param table The table + * @return The table meta information + * @throws com.alibaba.datax.plugin.writer.adswriter.AdsException + */ + public TableInfo getTableInfo(String table) throws AdsException { + + if (table == null) { + throw new AdsException(AdsException.ADS_TABLEMETA_TABLE_NULL, "Table is null.", null); + } + + if (adsURL == null) { + throw new AdsException(AdsException.ADS_CONN_URL_NOT_SET, "ADS JDBC connection URL was not set.", null); + } + + if (userName == null) { + throw new AdsException(AdsException.ADS_CONN_USERNAME_NOT_SET, + "ADS JDBC connection user name was not set.", null); + } + + if (password == null) { + throw new AdsException(AdsException.ADS_CONN_PASSWORD_NOT_SET, "ADS JDBC connection password was not set.", + null); + } + + if (schema == null) { + throw new AdsException(AdsException.ADS_CONN_SCHEMA_NOT_SET, "ADS JDBC connection schema was not set.", + null); + } + + String sql = "select ordinal_position,column_name,data_type,type_name,column_comment from information_schema.columns where table_schema='" + + schema + "' and table_name='" + table + "' order by ordinal_position"; + + Connection connection = null; + Statement statement = null; + ResultSet rs = null; + try { + Class.forName("com.mysql.jdbc.Driver"); + String url = "jdbc:mysql://" + adsURL + "/" + schema + "?useUnicode=true&characterEncoding=UTF-8&socketTimeout=3600000"; + + Properties connectionProps = new Properties(); + connectionProps.put("user", userName); + connectionProps.put("password", password); + connection = DriverManager.getConnection(url, connectionProps); + statement = connection.createStatement(); + + rs = statement.executeQuery(sql); + + TableInfo tableInfo = new TableInfo(); + List columnInfoList = new ArrayList(); + while (DBUtil.asyncResultSetNext(rs)) { + ColumnInfo columnInfo = new ColumnInfo(); + columnInfo.setOrdinal(rs.getInt(1)); + columnInfo.setName(rs.getString(2)); + //columnInfo.setDataType(ColumnDataType.getDataType(rs.getInt(3))); //for ads version < 0.7 + //columnInfo.setDataType(ColumnDataType.getTypeByName(rs.getString(3).toUpperCase())); //for ads version 0.8 + columnInfo.setDataType(ColumnDataType.getTypeByName(rs.getString(4).toUpperCase())); //for ads version 0.8 & 0.7 + columnInfo.setComment(rs.getString(5)); + columnInfoList.add(columnInfo); + } + if (columnInfoList.isEmpty()) { + throw DataXException.asDataXException(AdsWriterErrorCode.NO_ADS_TABLE, table + "不存在或者查询不到列信息. "); + } + + tableInfo.setColumns(columnInfoList); + tableInfo.setTableSchema(schema); + tableInfo.setTableName(table); + + return tableInfo; + + } catch (ClassNotFoundException e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } catch (SQLException e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } catch ( DataXException e) { + throw e; + } catch (Exception e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } finally { + if (rs != null) { + try { + rs.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (statement != null) { + try { + statement.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (connection != null) { + try { + connection.close(); + } catch (SQLException e) { + // Ignore exception + } + } + } + + } + + /** + * Submit LOAD DATA command. 
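+ * The submitted statement has the shape (illustrative values only): LOAD DATA FROM 'odps://some_project/tmp_table' OVERWRITE INTO TABLE some_schema.some_table PARTITION (dt=20150101).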
+ * + * @param table The target ADS table + * @param partition The partition option in the form of "(partition_name,...)" + * @param sourcePath The source path + * @param overwrite + * @return + * @throws AdsException + */ + public String loadData(String table, String partition, String sourcePath, boolean overwrite) throws AdsException { + + if (table == null) { + throw new AdsException(AdsException.ADS_LOADDATA_TABLE_NULL, "ADS LOAD DATA table is null.", null); + } + + if (sourcePath == null) { + throw new AdsException(AdsException.ADS_LOADDATA_SOURCEPATH_NULL, "ADS LOAD DATA source path is null.", + null); + } + + if (adsURL == null) { + throw new AdsException(AdsException.ADS_CONN_URL_NOT_SET, "ADS JDBC connection URL was not set.", null); + } + + if (userName == null) { + throw new AdsException(AdsException.ADS_CONN_USERNAME_NOT_SET, + "ADS JDBC connection user name was not set.", null); + } + + if (password == null) { + throw new AdsException(AdsException.ADS_CONN_PASSWORD_NOT_SET, "ADS JDBC connection password was not set.", + null); + } + + if (schema == null) { + throw new AdsException(AdsException.ADS_CONN_SCHEMA_NOT_SET, "ADS JDBC connection schema was not set.", + null); + } + + StringBuilder sb = new StringBuilder(); + sb.append("LOAD DATA FROM "); + if (sourcePath.startsWith("'") && sourcePath.endsWith("'")) { + sb.append(sourcePath); + } else { + sb.append("'" + sourcePath + "'"); + } + if (overwrite) { + sb.append(" OVERWRITE"); + } + sb.append(" INTO TABLE "); + sb.append(schema + "." + table); + if (partition != null && !partition.trim().equals("")) { + String partitionTrim = partition.trim(); + if(partitionTrim.startsWith("(") && partitionTrim.endsWith(")")) { + sb.append(" PARTITION " + partition); + } else { + sb.append(" PARTITION " + "(" + partition + ")"); + } + } + + Connection connection = null; + Statement statement = null; + ResultSet rs = null; + try { + Class.forName("com.mysql.jdbc.Driver"); + String url = "jdbc:mysql://" + adsURL + "/" + schema + "?useUnicode=true&characterEncoding=UTF-8&socketTimeout=3600000"; + + Properties connectionProps = new Properties(); + connectionProps.put("user", userName); + connectionProps.put("password", password); + connection = DriverManager.getConnection(url, connectionProps); + statement = connection.createStatement(); + LOG.info("正在从ODPS数据库导数据到ADS中: "+sb.toString()); + LOG.info("由于ADS的限制,ADS导数据最少需要20分钟,请耐心等待"); + rs = statement.executeQuery(sb.toString()); + + String jobId = null; + while (DBUtil.asyncResultSetNext(rs)) { + jobId = rs.getString(1); + } + + if (jobId == null) { + throw new AdsException(AdsException.ADS_LOADDATA_JOBID_NOT_AVAIL, + "Job id is not available for the submitted LOAD DATA." + jobId, null); + } + + return jobId; + + } catch (ClassNotFoundException e) { + throw new AdsException(AdsException.ADS_LOADDATA_FAILED, e.getMessage(), e); + } catch (SQLException e) { + throw new AdsException(AdsException.ADS_LOADDATA_FAILED, e.getMessage(), e); + } catch (Exception e) { + throw new AdsException(AdsException.ADS_LOADDATA_FAILED, e.getMessage(), e); + } finally { + if (rs != null) { + try { + rs.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (statement != null) { + try { + statement.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (connection != null) { + try { + connection.close(); + } catch (SQLException e) { + // Ignore exception + } + } + } + + } + + /** + * Check the load data job status. 
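+ * The state is read from information_schema.job_instances: "SUCCEEDED" returns true, "FAILED" raises an AdsException, and any other state returns false.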
+ * + * @param jobId The job id to + * @return true if load data job succeeded, false if load data job failed. + * @throws AdsException + */ + public boolean checkLoadDataJobStatus(String jobId) throws AdsException { + + if (adsURL == null) { + throw new AdsException(AdsException.ADS_CONN_URL_NOT_SET, "ADS JDBC connection URL was not set.", null); + } + + if (userName == null) { + throw new AdsException(AdsException.ADS_CONN_USERNAME_NOT_SET, + "ADS JDBC connection user name was not set.", null); + } + + if (password == null) { + throw new AdsException(AdsException.ADS_CONN_PASSWORD_NOT_SET, "ADS JDBC connection password was not set.", + null); + } + + if (schema == null) { + throw new AdsException(AdsException.ADS_CONN_SCHEMA_NOT_SET, "ADS JDBC connection schema was not set.", + null); + } + + Connection connection = null; + Statement statement = null; + ResultSet rs = null; + try { + Class.forName("com.mysql.jdbc.Driver"); + String url = "jdbc:mysql://" + adsURL + "/" + schema + "?useUnicode=true&characterEncoding=UTF-8&socketTimeout=3600000"; + + Properties connectionProps = new Properties(); + connectionProps.put("user", userName); + connectionProps.put("password", password); + connection = DriverManager.getConnection(url, connectionProps); + statement = connection.createStatement(); + + String sql = "select state from information_schema.job_instances where job_id like '" + jobId + "'"; + rs = statement.executeQuery(sql); + + String state = null; + while (DBUtil.asyncResultSetNext(rs)) { + state = rs.getString(1); + } + + if (state == null) { + throw new AdsException(AdsException.JOB_NOT_EXIST, "Target job does not exist for id: " + jobId, null); + } + + if (state.equals("SUCCEEDED")) { + return true; + } else if (state.equals("FAILED")) { + throw new AdsException(AdsException.JOB_FAILED, "Target job failed for id: " + jobId, null); + } else { + return false; + } + + } catch (ClassNotFoundException e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } catch (SQLException e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } catch (Exception e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } finally { + if (rs != null) { + try { + rs.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (statement != null) { + try { + statement.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (connection != null) { + try { + connection.close(); + } catch (SQLException e) { + // Ignore exception + } + } + } + + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TableMetaHelper.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TableMetaHelper.java new file mode 100644 index 000000000..1ecad7561 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TableMetaHelper.java @@ -0,0 +1,87 @@ +package com.alibaba.datax.plugin.writer.adswriter.load; + +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnDataType; +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnInfo; +import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo; +import com.alibaba.datax.plugin.writer.adswriter.odps.DataType; +import com.alibaba.datax.plugin.writer.adswriter.odps.FieldSchema; +import com.alibaba.datax.plugin.writer.adswriter.odps.TableMeta; + +import java.util.ArrayList; +import java.util.List; +import java.util.Random; + +/** + * Table meta helper for ADS writer. 
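+ * Derives a temporary ODPS table definition from an ADS TableInfo: column types are mapped to ODPS bigint/double/string and the table name is built from the schema, the table name, a timestamp and a random suffix.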
+ * + * @since 0.0.1 + */ +public class TableMetaHelper { + + private TableMetaHelper() { + } + + /** + * Create temporary ODPS table. + * + * @param tableMeta table meta + * @param lifeCycle for temporary table + * @return ODPS temporary table meta + */ + public static TableMeta createTempODPSTable(TableInfo tableMeta, int lifeCycle) { + TableMeta tempTable = new TableMeta(); + tempTable.setComment(tableMeta.getComments()); + tempTable.setLifeCycle(lifeCycle); + String tableSchema = tableMeta.getTableSchema(); + String tableName = tableMeta.getTableName(); + tempTable.setTableName(generateTempTableName(tableSchema, tableName)); + List tempColumns = new ArrayList(); + List columns = tableMeta.getColumns(); + for (ColumnInfo column : columns) { + FieldSchema tempColumn = new FieldSchema(); + tempColumn.setName(column.getName()); + tempColumn.setType(toODPSDataType(column.getDataType())); + tempColumn.setComment(column.getComment()); + tempColumns.add(tempColumn); + } + tempTable.setCols(tempColumns); + tempTable.setPartitionKeys(null); + return tempTable; + } + + private static String toODPSDataType(ColumnDataType columnDataType) { + int type; + switch (columnDataType.type) { + case ColumnDataType.BOOLEAN: + type = DataType.STRING; + break; + case ColumnDataType.BYTE: + case ColumnDataType.SHORT: + case ColumnDataType.INT: + case ColumnDataType.LONG: + type = DataType.INTEGER; + break; + case ColumnDataType.DECIMAL: + case ColumnDataType.DOUBLE: + case ColumnDataType.FLOAT: + type = DataType.DOUBLE; + break; + case ColumnDataType.DATE: + case ColumnDataType.TIME: + case ColumnDataType.TIMESTAMP: + case ColumnDataType.STRING: + case ColumnDataType.MULTI_VALUE: + type = DataType.STRING; + break; + default: + throw new IllegalArgumentException("columnDataType=" + columnDataType); + } + return DataType.toString(type); + } + + private static String generateTempTableName(String tableSchema, String tableName) { + int randNum = 1000 + new Random(System.currentTimeMillis()).nextInt(1000); + return tableSchema + "__" + tableName + "_" + System.currentTimeMillis() + randNum; + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TransferProjectConf.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TransferProjectConf.java new file mode 100644 index 000000000..bff4b7b90 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TransferProjectConf.java @@ -0,0 +1,65 @@ +package com.alibaba.datax.plugin.writer.adswriter.load; + +import com.alibaba.datax.common.util.Configuration; + +/** + * Created by xiafei.qiuxf on 15/4/13. 
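+ * Holds the intermediate ODPS project settings (odps.accessId, odps.accessKey, odps.account, odps.odpsServer, odps.tunnelServer, odps.accountType, odps.project) read from the writer configuration.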
+ */ +public class TransferProjectConf { + + public final static String KEY_ACCESS_ID = "odps.accessId"; + public final static String KEY_ACCESS_KEY = "odps.accessKey"; + public final static String KEY_ACCOUNT = "odps.account"; + public final static String KEY_ODPS_SERVER = "odps.odpsServer"; + public final static String KEY_ODPS_TUNNEL = "odps.tunnelServer"; + public final static String KEY_ACCOUNT_TYPE = "odps.accountType"; + public final static String KEY_PROJECT = "odps.project"; + + private String accessId; + private String accessKey; + private String account; + private String odpsServer; + private String odpsTunnel; + private String accountType; + private String project; + + public static TransferProjectConf create(Configuration adsWriterConf) { + TransferProjectConf res = new TransferProjectConf(); + res.accessId = adsWriterConf.getString(KEY_ACCESS_ID); + res.accessKey = adsWriterConf.getString(KEY_ACCESS_KEY); + res.account = adsWriterConf.getString(KEY_ACCOUNT); + res.odpsServer = adsWriterConf.getString(KEY_ODPS_SERVER); + res.odpsTunnel = adsWriterConf.getString(KEY_ODPS_TUNNEL); + res.accountType = adsWriterConf.getString(KEY_ACCOUNT_TYPE, "aliyun"); + res.project = adsWriterConf.getString(KEY_PROJECT); + return res; + } + + public String getAccessId() { + return accessId; + } + + public String getAccessKey() { + return accessKey; + } + + public String getAccount() { + return account; + } + + public String getOdpsServer() { + return odpsServer; + } + + public String getOdpsTunnel() { + return odpsTunnel; + } + + public String getAccountType() { + return accountType; + } + + public String getProject() { + return project; + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/DataType.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/DataType.java new file mode 100644 index 000000000..595b1dfd2 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/DataType.java @@ -0,0 +1,77 @@ +package com.alibaba.datax.plugin.writer.adswriter.odps; + +/** + * ODPS 数据类型. + *

+ * 当前定义了如下类型:
+ *   • INTEGER
+ *   • DOUBLE
+ *   • BOOLEAN
+ *   • STRING
+ *   • DATETIME

+ * + * @since 0.0.1 + */ +public class DataType { + + public final static byte INTEGER = 0; + public final static byte DOUBLE = 1; + public final static byte BOOLEAN = 2; + public final static byte STRING = 3; + public final static byte DATETIME = 4; + + public static String toString(int type) { + switch (type) { + case INTEGER: + return "bigint"; + case DOUBLE: + return "double"; + case BOOLEAN: + return "boolean"; + case STRING: + return "string"; + case DATETIME: + return "datetime"; + default: + throw new IllegalArgumentException("type=" + type); + } + } + + /** + * 字符串的数据类型转换为byte常量定义的数据类型. + *

+ * 转换规则:
+ *   • tinyint, int, bigint, long - {@link #INTEGER}
+ *   • double, float - {@link #DOUBLE}
+ *   • string - {@link #STRING}
+ *   • boolean, bool - {@link #BOOLEAN}
+ *   • datetime - {@link #DATETIME}

+ * + * @param type 字符串的数据类型 + * @return byte常量定义的数据类型 + * @throws IllegalArgumentException + */ + public static byte convertToDataType(String type) throws IllegalArgumentException { + type = type.toLowerCase().trim(); + if ("string".equals(type)) { + return STRING; + } else if ("bigint".equals(type) || "int".equals(type) || "tinyint".equals(type) || "long".equals(type)) { + return INTEGER; + } else if ("boolean".equals(type) || "bool".equals(type)) { + return BOOLEAN; + } else if ("double".equals(type) || "float".equals(type)) { + return DOUBLE; + } else if ("datetime".equals(type)) { + return DATETIME; + } else { + throw new IllegalArgumentException("unkown type: " + type); + } + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/FieldSchema.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/FieldSchema.java new file mode 100644 index 000000000..701ee261c --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/FieldSchema.java @@ -0,0 +1,63 @@ +package com.alibaba.datax.plugin.writer.adswriter.odps; + +/** + * ODPS列属性,包含列名和类型 列名和类型与SQL的DESC表或分区显示的列名和类型一致 + * + * @since 0.0.1 + */ +public class FieldSchema { + + /** 列名 */ + private String name; + + /** 列类型,如:string, bigint, boolean, datetime等等 */ + private String type; + + private String comment; + + public String getName() { + return name; + } + + public void setName(String name) { + this.name = name; + } + + public String getType() { + return type; + } + + public void setType(String type) { + this.type = type; + } + + public String getComment() { + return comment; + } + + public void setComment(String comment) { + this.comment = comment; + } + + @Override + public String toString() { + StringBuilder builder = new StringBuilder(); + builder.append("FieldSchema [name=").append(name).append(", type=").append(type).append(", comment=") + .append(comment).append("]"); + return builder.toString(); + } + + /** + * @return "col_name data_type [COMMENT col_comment]" + */ + public String toDDL() { + StringBuilder builder = new StringBuilder(); + builder.append(name).append(" ").append(type); + String comment = this.comment; + if (comment != null && comment.length() > 0) { + builder.append(" ").append("COMMENT \"" + comment + "\""); + } + return builder.toString(); + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/TableMeta.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/TableMeta.java new file mode 100644 index 000000000..d0adc4eae --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/TableMeta.java @@ -0,0 +1,114 @@ +package com.alibaba.datax.plugin.writer.adswriter.odps; + +import java.util.Iterator; +import java.util.List; + +/** + * ODPS table meta. 
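+ * Holds the table name, columns, optional partition keys, lifecycle and comment, and can render them as a CREATE TABLE statement via toDDL().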
+ * + * @since 0.0.1 + */ +public class TableMeta { + + private String tableName; + + private List cols; + + private List partitionKeys; + + private int lifeCycle; + + private String comment; + + public String getTableName() { + return tableName; + } + + public void setTableName(String tableName) { + this.tableName = tableName; + } + + public List getCols() { + return cols; + } + + public void setCols(List cols) { + this.cols = cols; + } + + public List getPartitionKeys() { + return partitionKeys; + } + + public void setPartitionKeys(List partitionKeys) { + this.partitionKeys = partitionKeys; + } + + public int getLifeCycle() { + return lifeCycle; + } + + public void setLifeCycle(int lifeCycle) { + this.lifeCycle = lifeCycle; + } + + public String getComment() { + return comment; + } + + public void setComment(String comment) { + this.comment = comment; + } + + @Override + public String toString() { + StringBuilder builder = new StringBuilder(); + builder.append("TableMeta [tableName=").append(tableName).append(", cols=").append(cols) + .append(", partitionKeys=").append(partitionKeys).append(", lifeCycle=").append(lifeCycle) + .append(", comment=").append(comment).append("]"); + return builder.toString(); + } + + /** + * @return
+ * "CREATE TABLE [IF NOT EXISTS] table_name
+ * [(col_name data_type [COMMENT col_comment], ...)]
+ * [COMMENT table_comment]
+ * [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
+ * [LIFECYCLE days]
+ * [AS select_statement] "
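+ * e.g. (illustrative): CREATE TABLE tmp_t (id bigint COMMENT "pk", name string) COMMENT "demo table" LIFECYCLE 1 ;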
+ */ + public String toDDL() { + StringBuilder builder = new StringBuilder(); + builder.append("CREATE TABLE " + tableName).append(" "); + List cols = this.cols; + if (cols != null && cols.size() > 0) { + builder.append("(").append(toDDL(cols)).append(")").append(" "); + } + String comment = this.comment; + if (comment != null && comment.length() > 0) { + builder.append("COMMENT \"" + comment + "\" "); + } + List partitionKeys = this.partitionKeys; + if (partitionKeys != null && partitionKeys.size() > 0) { + builder.append("PARTITIONED BY "); + builder.append("(").append(toDDL(partitionKeys)).append(")").append(" "); + } + if (lifeCycle > 0) { + builder.append("LIFECYCLE " + lifeCycle).append(" "); + } + builder.append(";"); + return builder.toString(); + } + + private String toDDL(List cols) { + StringBuilder builder = new StringBuilder(); + Iterator iter = cols.iterator(); + builder.append(iter.next().toDDL()); + while (iter.hasNext()) { + builder.append(", ").append(iter.next().toDDL()); + } + return builder.toString(); + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/package-info.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/package-info.java new file mode 100644 index 000000000..92dfd09da --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/package-info.java @@ -0,0 +1,6 @@ +/** + * ODPS meta. + * + * @since 0.0.1 + */ +package com.alibaba.datax.plugin.writer.adswriter.odps; \ No newline at end of file diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/package-info.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/package-info.java new file mode 100644 index 000000000..139a39106 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/package-info.java @@ -0,0 +1,6 @@ +/** + * ADS Writer. + * + * @since 0.0.1 + */ +package com.alibaba.datax.plugin.writer.adswriter; \ No newline at end of file diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/AdsUtil.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/AdsUtil.java new file mode 100644 index 000000000..778ce5eb8 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/AdsUtil.java @@ -0,0 +1,134 @@ +package com.alibaba.datax.plugin.writer.adswriter.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.adswriter.load.AdsHelper; +import com.alibaba.datax.plugin.writer.adswriter.AdsWriterErrorCode; +import com.alibaba.datax.plugin.writer.adswriter.load.TransferProjectConf; +import com.alibaba.datax.plugin.writer.adswriter.odps.FieldSchema; +import com.alibaba.datax.plugin.writer.adswriter.odps.TableMeta; +import com.alibaba.datax.plugin.writer.odpswriter.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +/** + * Created by judy.lt on 2015/1/30. 
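+ * Utility methods for the ADS writer: validating required configuration keys, building AdsHelper instances, generating the configuration passed on to the ODPS writer, and composing the LOAD DATA source path.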
+ */ +public class AdsUtil { + private static final Logger LOG = LoggerFactory.getLogger(AdsUtil.class); + + /*检查配置文件中必填的配置项是否都已填 + * */ + public static void checkNecessaryConfig(Configuration originalConfig, String writeMode) { + //检查ADS必要参数 + originalConfig.getNecessaryValue(Key.ADS_URL, + AdsWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.USERNAME, + AdsWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.PASSWORD, + AdsWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.SCHEMA, + AdsWriterErrorCode.REQUIRED_VALUE); + if(Constant.LOADMODE.equals(writeMode)) { + originalConfig.getNecessaryValue(Key.Life_CYCLE, + AdsWriterErrorCode.REQUIRED_VALUE); + Integer lifeCycle = originalConfig.getInt(Key.Life_CYCLE); + if (lifeCycle <= 0) { + throw DataXException.asDataXException(AdsWriterErrorCode.INVALID_CONFIG_VALUE, "配置项[lifeCycle]的值必须大于零."); + } + originalConfig.getNecessaryValue(Key.ADS_TABLE, + AdsWriterErrorCode.REQUIRED_VALUE); + Boolean overwrite = originalConfig.getBool(Key.OVER_WRITE); + if (overwrite == null) { + throw DataXException.asDataXException(AdsWriterErrorCode.REQUIRED_VALUE, "配置项[overWrite]是必填项."); + } + } + } + + /*生成AdsHelp实例 + * */ + public static AdsHelper createAdsHelper(Configuration originalConfig){ + //Get adsUrl,userName,password,schema等参数,创建AdsHelp实例 + String adsUrl = originalConfig.getString(Key.ADS_URL); + String userName = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + String schema = originalConfig.getString(Key.SCHEMA); + return new AdsHelper(adsUrl,userName,password,schema); + } + + public static AdsHelper createAdsHelperWithOdpsAccount(Configuration originalConfig) { + String adsUrl = originalConfig.getString(Key.ADS_URL); + String userName = originalConfig.getString(TransferProjectConf.KEY_ACCESS_ID); + String password = originalConfig.getString(TransferProjectConf.KEY_ACCESS_KEY); + String schema = originalConfig.getString(Key.SCHEMA); + return new AdsHelper(adsUrl, userName, password, schema); + } + + /*生成ODPSWriter Plugin所需要的配置文件 + * */ + public static Configuration generateConf(Configuration originalConfig, String odpsTableName, TableMeta tableMeta, TransferProjectConf transConf){ + Configuration newConfig = originalConfig.clone(); + newConfig.set(Key.ODPSTABLENAME, odpsTableName); + newConfig.set(Key.ODPS_SERVER, transConf.getOdpsServer()); + newConfig.set(Key.TUNNEL_SERVER,transConf.getOdpsTunnel()); + newConfig.set(Key.ACCESS_ID,transConf.getAccessId()); + newConfig.set(Key.ACCESS_KEY,transConf.getAccessKey()); + newConfig.set(Key.PROJECT,transConf.getProject()); + newConfig.set(Key.TRUNCATE, true); + newConfig.set(Key.PARTITION,null); +// newConfig.remove(Key.PARTITION); + List cols = tableMeta.getCols(); + List allColumns = new ArrayList(); + if(cols != null && !cols.isEmpty()){ + for(FieldSchema col:cols){ + allColumns.add(col.getName()); + } + } + newConfig.set(Key.COLUMN,allColumns); + return newConfig; + } + + /*生成ADS数据导入时的source_path + * */ + public static String generateSourcePath(String project, String tmpOdpsTableName, String odpsPartition){ + StringBuilder builder = new StringBuilder(); + String partition = transferOdpsPartitionToAds(odpsPartition); + builder.append("odps://").append(project).append("/").append(tmpOdpsTableName); + if(odpsPartition != null && !odpsPartition.isEmpty()){ + builder.append("/").append(partition); + } + return builder.toString(); + } + + public static String transferOdpsPartitionToAds(String 
odpsPartition){ + if(odpsPartition == null || odpsPartition.isEmpty()) + return null; + String adsPartition = formatPartition(odpsPartition);; + String[] partitions = adsPartition.split("/"); + for(int last = partitions.length; last > 0; last--){ + + String partitionPart = partitions[last-1]; + String newPart = partitionPart.replace(".*", "*").replace("*", ".*"); + if(newPart.split("=")[1].equals(".*")){ + adsPartition = adsPartition.substring(0,adsPartition.length()-partitionPart.length()); + }else{ + break; + } + if(adsPartition.endsWith("/")){ + adsPartition = adsPartition.substring(0,adsPartition.length()-1); + } + } + if (adsPartition.contains("*")) + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_PARTITION_FAILED, ""); + return adsPartition; + } + + public static String formatPartition(String partition) { + return partition.trim().replaceAll(" *= *", "=") + .replaceAll(" */ *", ",").replaceAll(" *, *", ",") + .replaceAll("'", "").replaceAll(",", "/"); + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Constant.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Constant.java new file mode 100644 index 000000000..3842cd011 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Constant.java @@ -0,0 +1,13 @@ +package com.alibaba.datax.plugin.writer.adswriter.util; + +public class Constant { + + public static final String LOADMODE = "load"; + + public static final String INSERTMODE = "insert"; + + public static final String REPLACEMODE = "replace"; + + public static final int DEFAULT_BATCH_SIZE = 32; + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Key.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Key.java new file mode 100644 index 000000000..e0822878b --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Key.java @@ -0,0 +1,48 @@ +package com.alibaba.datax.plugin.writer.adswriter.util; + + +public final class Key { + + public final static String ADS_URL = "url"; + + public final static String USERNAME = "username"; + + public final static String PASSWORD = "password"; + + public final static String SCHEMA = "schema"; + + public final static String ADS_TABLE = "table"; + + public final static String Life_CYCLE = "lifeCycle"; + + public final static String OVER_WRITE = "overWrite"; + + public final static String WRITE_MODE = "writeMode"; + + + public final static String COLUMN = "column"; + + public final static String EMPTY_AS_NULL = "emptyAsNull"; + + public final static String BATCH_SIZE = "batchSize"; + + /** + * 以下是odps writer的key + */ + public final static String PARTITION = "partition"; + + public final static String ODPSTABLENAME = "table"; + + public final static String ODPS_SERVER = "odpsServer"; + + public final static String TUNNEL_SERVER = "tunnelServer"; + + public final static String ACCESS_ID = "accessId"; + + public final static String ACCESS_KEY = "accessKey"; + + public final static String PROJECT = "project"; + + public final static String TRUNCATE = "truncate"; + +} \ No newline at end of file diff --git a/adswriter/src/main/resources/plugin.json b/adswriter/src/main/resources/plugin.json new file mode 100644 index 000000000..a70fb3646 --- /dev/null +++ b/adswriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "adswriter", + "class": "com.alibaba.datax.plugin.writer.adswriter.AdsWriter", + "description": "", + "developer": 
"alibaba" +} \ No newline at end of file diff --git a/adswriter/src/main/resources/plugin_job_template.json b/adswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..0753a226e --- /dev/null +++ b/adswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "adswriter", + "parameter": { + "url": "", + "username": "", + "password": "", + "schema": "", + "table": "", + "partition": "", + "overWrite": "", + "lifeCycle": 2 + } +} \ No newline at end of file diff --git a/common/datax-common.iml b/common/datax-common.iml new file mode 100644 index 000000000..87f7dace6 --- /dev/null +++ b/common/datax-common.iml @@ -0,0 +1,30 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/common/pom.xml b/common/pom.xml new file mode 100755 index 000000000..86c0a22dd --- /dev/null +++ b/common/pom.xml @@ -0,0 +1,81 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + datax-common + datax-common + jar + + + + org.apache.commons + commons-lang3 + + + com.google.guava + guava + 18.0 + + + com.alibaba + fastjson + + + commons-io + commons-io + + + + junit + junit + test + + + + org.slf4j + slf4j-api + + + + ch.qos.logback + logback-classic + + + + org.apache.httpcomponents + httpclient + 4.4 + test + + + org.apache.httpcomponents + fluent-hc + 4.4 + test + + + org.apache.commons + commons-math3 + 3.1.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + \ No newline at end of file diff --git a/common/src/main/java/com/alibaba/datax/common/base/BaseObject.java b/common/src/main/java/com/alibaba/datax/common/base/BaseObject.java new file mode 100755 index 000000000..e7d06a950 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/base/BaseObject.java @@ -0,0 +1,25 @@ +package com.alibaba.datax.common.base; + +import org.apache.commons.lang3.builder.EqualsBuilder; +import org.apache.commons.lang3.builder.HashCodeBuilder; +import org.apache.commons.lang3.builder.ToStringBuilder; +import org.apache.commons.lang3.builder.ToStringStyle; + +public class BaseObject { + + @Override + public int hashCode() { + return HashCodeBuilder.reflectionHashCode(this, false); + } + + @Override + public boolean equals(Object object) { + return EqualsBuilder.reflectionEquals(this, object, false); + } + + @Override + public String toString() { + return ToStringBuilder.reflectionToString(this, + ToStringStyle.MULTI_LINE_STYLE); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/constant/CommonConstant.java b/common/src/main/java/com/alibaba/datax/common/constant/CommonConstant.java new file mode 100755 index 000000000..423e16f92 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/constant/CommonConstant.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.common.constant; + +public final class CommonConstant { + /** + * 用于插件对自身 split 的每个 task 标识其使用的资源,以告知core 对 reader/writer split 之后的 task 进行拼接时需要根据资源标签进行更有意义的 shuffle 操作 + */ + public static String LOAD_BALANCE_RESOURCE_MARK = "loadBalanceResourceMark"; + +} diff --git a/common/src/main/java/com/alibaba/datax/common/constant/PluginType.java b/common/src/main/java/com/alibaba/datax/common/constant/PluginType.java new file mode 100755 index 000000000..ceee089e9 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/constant/PluginType.java @@ -0,0 +1,20 @@ +package com.alibaba.datax.common.constant; + +/** + * Created by jingxing on 14-8-31. 
+ */ +public enum PluginType { + //pluginType还代表了资源目录,很难扩展,或者说需要足够必要才扩展。先mark Handler(其实和transformer一样),再讨论 + READER("reader"), TRANSFORMER("transformer"), WRITER("writer"), HANDLER("handler"); + + private String pluginType; + + private PluginType(String pluginType) { + this.pluginType = pluginType; + } + + @Override + public String toString() { + return this.pluginType; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/BoolColumn.java b/common/src/main/java/com/alibaba/datax/common/element/BoolColumn.java new file mode 100755 index 000000000..7699e152a --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/BoolColumn.java @@ -0,0 +1,115 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +/** + * Created by jingxing on 14-8-24. + */ +public class BoolColumn extends Column { + + public BoolColumn(Boolean bool) { + super(bool, Column.Type.BOOL, 1); + } + + public BoolColumn(final String data) { + this(true); + this.validate(data); + if (null == data) { + this.setRawData(null); + this.setByteSize(0); + } else { + this.setRawData(Boolean.valueOf(data)); + this.setByteSize(1); + } + return; + } + + public BoolColumn() { + super(null, Column.Type.BOOL, 1); + } + + @Override + public Boolean asBoolean() { + if (null == super.getRawData()) { + return null; + } + + return (Boolean) super.getRawData(); + } + + @Override + public Long asLong() { + if (null == this.getRawData()) { + return null; + } + + return this.asBoolean() ? 1L : 0L; + } + + @Override + public Double asDouble() { + if (null == this.getRawData()) { + return null; + } + + return this.asBoolean() ? 1.0d : 0.0d; + } + + @Override + public String asString() { + if (null == super.getRawData()) { + return null; + } + + return this.asBoolean() ? 
"true" : "false"; + } + + @Override + public BigInteger asBigInteger() { + if (null == this.getRawData()) { + return null; + } + + return BigInteger.valueOf(this.asLong()); + } + + @Override + public BigDecimal asBigDecimal() { + if (null == this.getRawData()) { + return null; + } + + return BigDecimal.valueOf(this.asLong()); + } + + @Override + public Date asDate() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bool类型不能转为Date ."); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Boolean类型不能转为Bytes ."); + } + + private void validate(final String data) { + if (null == data) { + return; + } + + if ("true".equalsIgnoreCase(data) || "false".equalsIgnoreCase(data)) { + return; + } + + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[%s]不能转为Bool .", data)); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/BytesColumn.java b/common/src/main/java/com/alibaba/datax/common/element/BytesColumn.java new file mode 100755 index 000000000..d3cc59936 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/BytesColumn.java @@ -0,0 +1,84 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang3.ArrayUtils; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +/** + * Created by jingxing on 14-8-24. + */ +public class BytesColumn extends Column { + + public BytesColumn() { + this(null); + } + + public BytesColumn(byte[] bytes) { + super(ArrayUtils.clone(bytes), Column.Type.BYTES, null == bytes ? 
0 + : bytes.length); + } + + @Override + public byte[] asBytes() { + if (null == this.getRawData()) { + return null; + } + + return (byte[]) this.getRawData(); + } + + @Override + public String asString() { + if (null == this.getRawData()) { + return null; + } + + try { + return ColumnCast.bytes2String(this); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("Bytes[%s]不能转为String .", this.toString())); + } + } + + @Override + public Long asLong() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为Long ."); + } + + @Override + public BigDecimal asBigDecimal() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为BigDecimal ."); + } + + @Override + public BigInteger asBigInteger() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为BigInteger ."); + } + + @Override + public Double asDouble() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为Long ."); + } + + @Override + public Date asDate() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为Date ."); + } + + @Override + public Boolean asBoolean() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为Boolean ."); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/Column.java b/common/src/main/java/com/alibaba/datax/common/element/Column.java new file mode 100755 index 000000000..ed68e88d6 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/Column.java @@ -0,0 +1,75 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.fastjson.JSON; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +/** + * Created by jingxing on 14-8-24. + *
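+ * Base class of the column values exchanged between plugins: wraps the raw data together with its Type and byte size and declares the asXxx() conversion methods.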

+ */ +public abstract class Column { + + private Type type; + + private Object rawData; + + private int byteSize; + + public Column(final Object object, final Type type, int byteSize) { + this.rawData = object; + this.type = type; + this.byteSize = byteSize; + } + + public Object getRawData() { + return this.rawData; + } + + public Type getType() { + return this.type; + } + + public int getByteSize() { + return this.byteSize; + } + + protected void setType(Type type) { + this.type = type; + } + + protected void setRawData(Object rawData) { + this.rawData = rawData; + } + + protected void setByteSize(int byteSize) { + this.byteSize = byteSize; + } + + public abstract Long asLong(); + + public abstract Double asDouble(); + + public abstract String asString(); + + public abstract Date asDate(); + + public abstract byte[] asBytes(); + + public abstract Boolean asBoolean(); + + public abstract BigDecimal asBigDecimal(); + + public abstract BigInteger asBigInteger(); + + @Override + public String toString() { + return JSON.toJSONString(this); + } + + public enum Type { + BAD, NULL, INT, LONG, DOUBLE, STRING, BOOL, DATE, BYTES + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/ColumnCast.java b/common/src/main/java/com/alibaba/datax/common/element/ColumnCast.java new file mode 100755 index 000000000..89d0a7c62 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/ColumnCast.java @@ -0,0 +1,199 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.time.DateFormatUtils; +import org.apache.commons.lang3.time.FastDateFormat; + +import java.io.UnsupportedEncodingException; +import java.text.ParseException; +import java.util.*; + +public final class ColumnCast { + + public static void bind(final Configuration configuration) { + StringCast.init(configuration); + DateCast.init(configuration); + BytesCast.init(configuration); + } + + public static Date string2Date(final StringColumn column) + throws ParseException { + return StringCast.asDate(column); + } + + public static byte[] string2Bytes(final StringColumn column) + throws UnsupportedEncodingException { + return StringCast.asBytes(column); + } + + public static String date2String(final DateColumn column) { + return DateCast.asString(column); + } + + public static String bytes2String(final BytesColumn column) + throws UnsupportedEncodingException { + return BytesCast.asString(column); + } +} + +class StringCast { + static String datetimeFormat = "yyyy-MM-dd HH:mm:ss"; + + static String dateFormat = "yyyy-MM-dd"; + + static String timeFormat = "HH:mm:ss"; + + static List extraFormats = Collections.emptyList(); + + static String timeZone = "GMT+8"; + + static FastDateFormat dateFormatter; + + static FastDateFormat timeFormatter; + + static FastDateFormat datetimeFormatter; + + static TimeZone timeZoner; + + static String encoding = "UTF-8"; + + static void init(final Configuration configuration) { + StringCast.datetimeFormat = configuration.getString( + "common.column.datetimeFormat", StringCast.datetimeFormat); + StringCast.dateFormat = configuration.getString( + "common.column.dateFormat", StringCast.dateFormat); + StringCast.timeFormat = configuration.getString( + "common.column.timeFormat", StringCast.timeFormat); + StringCast.extraFormats = configuration.getList( + "common.column.extraFormats", 
Collections.emptyList(), String.class); + + StringCast.timeZone = configuration.getString("common.column.timeZone", + StringCast.timeZone); + StringCast.timeZoner = TimeZone.getTimeZone(StringCast.timeZone); + + StringCast.datetimeFormatter = FastDateFormat.getInstance( + StringCast.datetimeFormat, StringCast.timeZoner); + StringCast.dateFormatter = FastDateFormat.getInstance( + StringCast.dateFormat, StringCast.timeZoner); + StringCast.timeFormatter = FastDateFormat.getInstance( + StringCast.timeFormat, StringCast.timeZoner); + + StringCast.encoding = configuration.getString("common.column.encoding", + StringCast.encoding); + } + + static Date asDate(final StringColumn column) throws ParseException { + if (null == column.asString()) { + return null; + } + + try { + return StringCast.datetimeFormatter.parse(column.asString()); + } catch (ParseException ignored) { + } + + try { + return StringCast.dateFormatter.parse(column.asString()); + } catch (ParseException ignored) { + } + + ParseException e; + try { + return StringCast.timeFormatter.parse(column.asString()); + } catch (ParseException ignored) { + e = ignored; + } + + for (String format : StringCast.extraFormats) { + try{ + return FastDateFormat.getInstance(format, StringCast.timeZoner).parse(column.asString()); + } catch (ParseException ignored){ + e = ignored; + } + } + throw e; + } + + static byte[] asBytes(final StringColumn column) + throws UnsupportedEncodingException { + if (null == column.asString()) { + return null; + } + + return column.asString().getBytes(StringCast.encoding); + } +} + +/** + * 后续为了可维护性,可以考虑直接使用 apache 的DateFormatUtils. + * + * 迟南已经修复了该问题,但是为了维护性,还是直接使用apache的内置函数 + */ +class DateCast { + + static String datetimeFormat = "yyyy-MM-dd HH:mm:ss"; + + static String dateFormat = "yyyy-MM-dd"; + + static String timeFormat = "HH:mm:ss"; + + static String timeZone = "GMT+8"; + + static TimeZone timeZoner = TimeZone.getTimeZone(DateCast.timeZone); + + static void init(final Configuration configuration) { + DateCast.datetimeFormat = configuration.getString( + "common.column.datetimeFormat", datetimeFormat); + DateCast.timeFormat = configuration.getString( + "common.column.timeFormat", timeFormat); + DateCast.dateFormat = configuration.getString( + "common.column.dateFormat", dateFormat); + DateCast.timeZone = configuration.getString("common.column.timeZone", + DateCast.timeZone); + DateCast.timeZoner = TimeZone.getTimeZone(DateCast.timeZone); + return; + } + + static String asString(final DateColumn column) { + if (null == column.asDate()) { + return null; + } + + switch (column.getSubType()) { + case DATE: + return DateFormatUtils.format(column.asDate(), DateCast.dateFormat, + DateCast.timeZoner); + case TIME: + return DateFormatUtils.format(column.asDate(), DateCast.timeFormat, + DateCast.timeZoner); + case DATETIME: + return DateFormatUtils.format(column.asDate(), + DateCast.datetimeFormat, DateCast.timeZoner); + default: + throw DataXException + .asDataXException(CommonErrorCode.CONVERT_NOT_SUPPORT, + "时间类型出现不支持类型,目前仅支持DATE/TIME/DATETIME。该类型属于编程错误,请反馈给DataX开发团队 ."); + } + } +} + +class BytesCast { + static String encoding = "utf-8"; + + static void init(final Configuration configuration) { + BytesCast.encoding = configuration.getString("common.column.encoding", + BytesCast.encoding); + return; + } + + static String asString(final BytesColumn column) + throws UnsupportedEncodingException { + if (null == column.asBytes()) { + return null; + } + + return new String(column.asBytes(), encoding); + } +} \ No newline at 
end of file diff --git a/common/src/main/java/com/alibaba/datax/common/element/DateColumn.java b/common/src/main/java/com/alibaba/datax/common/element/DateColumn.java new file mode 100755 index 000000000..6626a6fbd --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/DateColumn.java @@ -0,0 +1,130 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +/** + * Created by jingxing on 14-8-24. + */ +public class DateColumn extends Column { + + private DateType subType = DateType.DATETIME; + + public static enum DateType { + DATE, TIME, DATETIME + } + + /** + * 构建值为null的DateColumn,使用Date子类型为DATETIME + * */ + public DateColumn() { + this((Long)null); + } + + /** + * 构建值为stamp(Unix时间戳)的DateColumn,使用Date子类型为DATETIME + * 实际存储有date改为long的ms,节省存储 + * */ + public DateColumn(final Long stamp) { + super(stamp, Column.Type.DATE, (null == stamp ? 0 : 8)); + } + + /** + * 构建值为date(java.util.Date)的DateColumn,使用Date子类型为DATETIME + * */ + public DateColumn(final Date date) { + this(date == null ? null : date.getTime()); + } + + /** + * 构建值为date(java.sql.Date)的DateColumn,使用Date子类型为DATE,只有日期,没有时间 + * */ + public DateColumn(final java.sql.Date date) { + this(date == null ? null : date.getTime()); + this.setSubType(DateType.DATE); + } + + /** + * 构建值为time(java.sql.Time)的DateColumn,使用Date子类型为TIME,只有时间,没有日期 + * */ + public DateColumn(final java.sql.Time time) { + this(time == null ? null : time.getTime()); + this.setSubType(DateType.TIME); + } + + /** + * 构建值为ts(java.sql.Timestamp)的DateColumn,使用Date子类型为DATETIME + * */ + public DateColumn(final java.sql.Timestamp ts) { + this(ts == null ? 
null : ts.getTime()); + this.setSubType(DateType.DATETIME); + } + + @Override + public Long asLong() { + + return (Long)this.getRawData(); + } + + @Override + public String asString() { + try { + return ColumnCast.date2String(this); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("Date[%s]类型不能转为String .", this.toString())); + } + } + + @Override + public Date asDate() { + if (null == this.getRawData()) { + return null; + } + + return new Date((Long)this.getRawData()); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为Bytes ."); + } + + @Override + public Boolean asBoolean() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为Boolean ."); + } + + @Override + public Double asDouble() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为Double ."); + } + + @Override + public BigInteger asBigInteger() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为BigInteger ."); + } + + @Override + public BigDecimal asBigDecimal() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为BigDecimal ."); + } + + public DateType getSubType() { + return subType; + } + + public void setSubType(DateType subType) { + this.subType = subType; + } +} \ No newline at end of file diff --git a/common/src/main/java/com/alibaba/datax/common/element/DoubleColumn.java b/common/src/main/java/com/alibaba/datax/common/element/DoubleColumn.java new file mode 100755 index 000000000..17170ea6c --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/DoubleColumn.java @@ -0,0 +1,161 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +public class DoubleColumn extends Column { + + public DoubleColumn(final String data) { + this(data, null == data ? 0 : data.length()); + this.validate(data); + } + + public DoubleColumn(Long data) { + this(data == null ? (String) null : String.valueOf(data)); + } + + public DoubleColumn(Integer data) { + this(data == null ? (String) null : String.valueOf(data)); + } + + /** + * Double无法表示准确的小数数据,我们不推荐使用该方法保存Double数据,建议使用String作为构造入参 + * + * */ + public DoubleColumn(final Double data) { + this(data == null ? (String) null + : new BigDecimal(String.valueOf(data)).toPlainString()); + } + + /** + * Float无法表示准确的小数数据,我们不推荐使用该方法保存Float数据,建议使用String作为构造入参 + * + * */ + public DoubleColumn(final Float data) { + this(data == null ? (String) null + : new BigDecimal(String.valueOf(data)).toPlainString()); + } + + public DoubleColumn(final BigDecimal data) { + this(null == data ? (String) null : data.toPlainString()); + } + + public DoubleColumn(final BigInteger data) { + this(null == data ? 
(String) null : data.toString()); + } + + public DoubleColumn() { + this((String) null); + } + + private DoubleColumn(final String data, int byteSize) { + super(data, Column.Type.DOUBLE, byteSize); + } + + @Override + public BigDecimal asBigDecimal() { + if (null == this.getRawData()) { + return null; + } + + try { + return new BigDecimal((String) this.getRawData()); + } catch (NumberFormatException e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[%s] 无法转换为Double类型 .", + (String) this.getRawData())); + } + } + + @Override + public Double asDouble() { + if (null == this.getRawData()) { + return null; + } + + String string = (String) this.getRawData(); + + boolean isDoubleSpecific = string.equals("NaN") + || string.equals("-Infinity") || string.equals("+Infinity"); + if (isDoubleSpecific) { + return Double.valueOf(string); + } + + BigDecimal result = this.asBigDecimal(); + OverFlowUtil.validateDoubleNotOverFlow(result); + + return result.doubleValue(); + } + + @Override + public Long asLong() { + if (null == this.getRawData()) { + return null; + } + + BigDecimal result = this.asBigDecimal(); + OverFlowUtil.validateLongNotOverFlow(result.toBigInteger()); + + return result.longValue(); + } + + @Override + public BigInteger asBigInteger() { + if (null == this.getRawData()) { + return null; + } + + return this.asBigDecimal().toBigInteger(); + } + + @Override + public String asString() { + if (null == this.getRawData()) { + return null; + } + return (String) this.getRawData(); + } + + @Override + public Boolean asBoolean() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Double类型无法转为Bool ."); + } + + @Override + public Date asDate() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Double类型无法转为Date类型 ."); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Double类型无法转为Bytes类型 ."); + } + + private void validate(final String data) { + if (null == data) { + return; + } + + if (data.equalsIgnoreCase("NaN") || data.equalsIgnoreCase("-Infinity") + || data.equalsIgnoreCase("Infinity")) { + return; + } + + try { + new BigDecimal(data); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[%s]无法转为Double类型 .", data)); + } + } + +} \ No newline at end of file diff --git a/common/src/main/java/com/alibaba/datax/common/element/LongColumn.java b/common/src/main/java/com/alibaba/datax/common/element/LongColumn.java new file mode 100755 index 000000000..d8113f7c0 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/LongColumn.java @@ -0,0 +1,135 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang3.math.NumberUtils; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +public class LongColumn extends Column { + + /** + * 从整形字符串表示转为LongColumn,支持Java科学计数法 + * + * NOTE:
+ * 如果data为浮点类型的字符串表示,数据将会失真,请使用DoubleColumn对接浮点字符串 + * + * */ + public LongColumn(final String data) { + super(null, Column.Type.LONG, 0); + if (null == data) { + return; + } + + try { + BigInteger rawData = NumberUtils.createBigDecimal(data) + .toBigInteger(); + super.setRawData(rawData); + + // 当 rawData 为[0-127]时,rawData.bitLength() < 8,导致其 byteSize = 0,简单起见,直接认为其长度为 data.length() + // super.setByteSize(rawData.bitLength() / 8); + super.setByteSize(data.length()); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[%s]不能转为Long .", data)); + } + } + + public LongColumn(Long data) { + this(null == data ? (BigInteger) null : BigInteger.valueOf(data)); + } + + public LongColumn(Integer data) { + this(null == data ? (BigInteger) null : BigInteger.valueOf(data)); + } + + public LongColumn(BigInteger data) { + this(data, null == data ? 0 : 8); + } + + private LongColumn(BigInteger data, int byteSize) { + super(data, Column.Type.LONG, byteSize); + } + + public LongColumn() { + this((BigInteger) null); + } + + @Override + public BigInteger asBigInteger() { + if (null == this.getRawData()) { + return null; + } + + return (BigInteger) this.getRawData(); + } + + @Override + public Long asLong() { + BigInteger rawData = (BigInteger) this.getRawData(); + if (null == rawData) { + return null; + } + + OverFlowUtil.validateLongNotOverFlow(rawData); + + return rawData.longValue(); + } + + @Override + public Double asDouble() { + if (null == this.getRawData()) { + return null; + } + + BigDecimal decimal = this.asBigDecimal(); + OverFlowUtil.validateDoubleNotOverFlow(decimal); + + return decimal.doubleValue(); + } + + @Override + public Boolean asBoolean() { + if (null == this.getRawData()) { + return null; + } + + return this.asBigInteger().compareTo(BigInteger.ZERO) != 0 ? 
true + : false; + } + + @Override + public BigDecimal asBigDecimal() { + if (null == this.getRawData()) { + return null; + } + + return new BigDecimal(this.asBigInteger()); + } + + @Override + public String asString() { + if (null == this.getRawData()) { + return null; + } + return ((BigInteger) this.getRawData()).toString(); + } + + @Override + public Date asDate() { + if (null == this.getRawData()) { + return null; + } + return new Date(this.asLong()); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Long类型不能转为Bytes ."); + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/OverFlowUtil.java b/common/src/main/java/com/alibaba/datax/common/element/OverFlowUtil.java new file mode 100755 index 000000000..39460c7eb --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/OverFlowUtil.java @@ -0,0 +1,62 @@ +package com.alibaba.datax.common.element; + +import java.math.BigDecimal; +import java.math.BigInteger; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +public final class OverFlowUtil { + public static final BigInteger MAX_LONG = BigInteger + .valueOf(Long.MAX_VALUE); + + public static final BigInteger MIN_LONG = BigInteger + .valueOf(Long.MIN_VALUE); + + public static final BigDecimal MIN_DOUBLE_POSITIVE = new BigDecimal( + String.valueOf(Double.MIN_VALUE)); + + public static final BigDecimal MAX_DOUBLE_POSITIVE = new BigDecimal( + String.valueOf(Double.MAX_VALUE)); + + public static boolean isLongOverflow(final BigInteger integer) { + return (integer.compareTo(OverFlowUtil.MAX_LONG) > 0 || integer + .compareTo(OverFlowUtil.MIN_LONG) < 0); + + } + + public static void validateLongNotOverFlow(final BigInteger integer) { + boolean isOverFlow = OverFlowUtil.isLongOverflow(integer); + + if (isOverFlow) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_OVER_FLOW, + String.format("[%s] 转为Long类型出现溢出 .", integer.toString())); + } + } + + public static boolean isDoubleOverFlow(final BigDecimal decimal) { + if (decimal.signum() == 0) { + return false; + } + + BigDecimal newDecimal = decimal; + boolean isPositive = decimal.signum() == 1; + if (!isPositive) { + newDecimal = decimal.negate(); + } + + return (newDecimal.compareTo(MIN_DOUBLE_POSITIVE) < 0 || newDecimal + .compareTo(MAX_DOUBLE_POSITIVE) > 0); + } + + public static void validateDoubleNotOverFlow(final BigDecimal decimal) { + boolean isOverFlow = OverFlowUtil.isDoubleOverFlow(decimal); + if (isOverFlow) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_OVER_FLOW, + String.format("[%s]转为Double类型出现溢出 .", + decimal.toPlainString())); + } + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/Record.java b/common/src/main/java/com/alibaba/datax/common/element/Record.java new file mode 100755 index 000000000..d06d80aaf --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/Record.java @@ -0,0 +1,23 @@ +package com.alibaba.datax.common.element; + +/** + * Created by jingxing on 14-8-24. 
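+ * A Record is an ordered list of Column values; implementations also expose the column count plus byte and in-memory sizes.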
+ */ + +public interface Record { + + public void addColumn(Column column); + + public void setColumn(int i, final Column column); + + public Column getColumn(int i); + + public String toString(); + + public int getColumnNumber(); + + public int getByteSize(); + + public int getMemorySize(); + +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/StringColumn.java b/common/src/main/java/com/alibaba/datax/common/element/StringColumn.java new file mode 100755 index 000000000..11209f468 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/StringColumn.java @@ -0,0 +1,163 @@ +package com.alibaba.datax.common.element; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +/** + * Created by jingxing on 14-8-24. + */ + +public class StringColumn extends Column { + + public StringColumn() { + this((String) null); + } + + public StringColumn(final String rawData) { + super(rawData, Column.Type.STRING, (null == rawData ? 0 : rawData + .length())); + } + + @Override + public String asString() { + if (null == this.getRawData()) { + return null; + } + + return (String) this.getRawData(); + } + + private void validateDoubleSpecific(final String data) { + if ("NaN".equals(data) || "Infinity".equals(data) + || "-Infinity".equals(data)) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]属于Double特殊类型,不能转为其他类型 .", data)); + } + + return; + } + + @Override + public BigInteger asBigInteger() { + if (null == this.getRawData()) { + return null; + } + + this.validateDoubleSpecific((String) this.getRawData()); + + try { + return this.asBigDecimal().toBigInteger(); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, String.format( + "String[\"%s\"]不能转为BigInteger .", this.asString())); + } + } + + @Override + public Long asLong() { + if (null == this.getRawData()) { + return null; + } + + this.validateDoubleSpecific((String) this.getRawData()); + + try { + BigInteger integer = this.asBigInteger(); + OverFlowUtil.validateLongNotOverFlow(integer); + return integer.longValue(); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]不能转为Long .", this.asString())); + } + } + + @Override + public BigDecimal asBigDecimal() { + if (null == this.getRawData()) { + return null; + } + + this.validateDoubleSpecific((String) this.getRawData()); + + try { + return new BigDecimal(this.asString()); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, String.format( + "String [\"%s\"] 不能转为BigDecimal .", this.asString())); + } + } + + @Override + public Double asDouble() { + if (null == this.getRawData()) { + return null; + } + + String data = (String) this.getRawData(); + if ("NaN".equals(data)) { + return Double.NaN; + } + + if ("Infinity".equals(data)) { + return Double.POSITIVE_INFINITY; + } + + if ("-Infinity".equals(data)) { + return Double.NEGATIVE_INFINITY; + } + + BigDecimal decimal = this.asBigDecimal(); + OverFlowUtil.validateDoubleNotOverFlow(decimal); + + return decimal.doubleValue(); + } + + @Override + public Boolean asBoolean() { + if (null == this.getRawData()) { + return null; + } + + if ("true".equalsIgnoreCase(this.asString())) { + return true; + } + + if 
("false".equalsIgnoreCase(this.asString())) { + return false; + } + + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]不能转为Bool .", this.asString())); + } + + @Override + public Date asDate() { + try { + return ColumnCast.string2Date(this); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]不能转为Date .", this.asString())); + } + } + + @Override + public byte[] asBytes() { + try { + return ColumnCast.string2Bytes(this); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]不能转为Bytes .", this.asString())); + } + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/exception/CommonErrorCode.java b/common/src/main/java/com/alibaba/datax/common/exception/CommonErrorCode.java new file mode 100755 index 000000000..8679ffb47 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/exception/CommonErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.common.exception; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * + */ +public enum CommonErrorCode implements ErrorCode { + + CONFIG_ERROR("Common-00", "您提供的配置文件存在错误信息,请检查您的作业配置 ."), + CONVERT_NOT_SUPPORT("Common-01", "同步数据出现业务脏数据情况,数据类型转换错误 ."), + CONVERT_OVER_FLOW("Common-02", "同步数据出现业务脏数据情况,数据类型转换溢出 ."), + RETRY_FAIL("Common-10", "方法调用多次仍旧失败 ."), + RUNTIME_ERROR("Common-11", "运行时内部调用错误 ."), + HOOK_INTERNAL_ERROR("Common-12", "Hook运行错误 ."), + SHUT_DOWN_TASK("Common-20", "Task收到了shutdown指令,为failover做准备"), + WAIT_TIME_EXCEED("Common-21", "等待时间超出范围"), + TASK_HUNG_EXPIRED("Common-22", "任务hung住,Expired"); + + private final String code; + + private final String describe; + + private CommonErrorCode(String code, String describe) { + this.code = code; + this.describe = describe; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.describe; + } + + @Override + public String toString() { + return String.format("Code:[%s], Describe:[%s]", this.code, + this.describe); + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/exception/DataXException.java b/common/src/main/java/com/alibaba/datax/common/exception/DataXException.java new file mode 100755 index 000000000..9def28de7 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/exception/DataXException.java @@ -0,0 +1,60 @@ +package com.alibaba.datax.common.exception; + +import com.alibaba.datax.common.spi.ErrorCode; + +public class DataXException extends RuntimeException { + + private static final long serialVersionUID = 1L; + + private ErrorCode errorCode; + + public DataXException(ErrorCode errorCode, String errorMessage) { + super(errorCode.toString() + " - " + errorMessage); + this.errorCode = errorCode; + } + + private DataXException(ErrorCode errorCode, String errorMessage, + Throwable cause) { + super(errorCode.toString() + " - " + getMessage(errorMessage) + + " - " + getMessage(cause), cause); + + this.errorCode = errorCode; + } + + public static DataXException asDataXException(ErrorCode errorCode, String message) { + return new DataXException(errorCode, message); + } + + public static DataXException asDataXException(ErrorCode errorCode, String message, + Throwable cause) { + if (cause instanceof DataXException) { + return (DataXException) cause; + } + return new DataXException(errorCode, message, cause); + } + + public static 
DataXException asDataXException(ErrorCode errorCode, + Throwable cause) { + if (cause instanceof DataXException) { + return (DataXException) cause; + } + return new DataXException(errorCode, getMessage(cause), cause); + } + + public ErrorCode getErrorCode() { + return this.errorCode; + } + + + private static String getMessage(Object obj) { + if (obj == null) { + return ""; + } + + if (obj instanceof Throwable) { + return ((Throwable) obj).getMessage(); + } else { + return obj.toString(); + } + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/exception/ExceptionTracker.java b/common/src/main/java/com/alibaba/datax/common/exception/ExceptionTracker.java new file mode 100644 index 000000000..f6d3732e2 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/exception/ExceptionTracker.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.common.exception; + +import java.io.PrintWriter; +import java.io.StringWriter; + +public final class ExceptionTracker { + public static final int STRING_BUFFER = 1024; + + public static String trace(Throwable ex) { + StringWriter sw = new StringWriter(STRING_BUFFER); + PrintWriter pw = new PrintWriter(sw); + ex.printStackTrace(pw); + return sw.toString(); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/AbstractJobPlugin.java b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractJobPlugin.java new file mode 100755 index 000000000..946adfd0e --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractJobPlugin.java @@ -0,0 +1,25 @@ +package com.alibaba.datax.common.plugin; + +/** + * Created by jingxing on 14-8-24. + */ +public abstract class AbstractJobPlugin extends AbstractPlugin { + /** + * @return the jobPluginCollector + */ + public JobPluginCollector getJobPluginCollector() { + return jobPluginCollector; + } + + /** + * @param jobPluginCollector + * the jobPluginCollector to set + */ + public void setJobPluginCollector( + JobPluginCollector jobPluginCollector) { + this.jobPluginCollector = jobPluginCollector; + } + + private JobPluginCollector jobPluginCollector; + +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/AbstractPlugin.java b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractPlugin.java new file mode 100755 index 000000000..184ee89ec --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractPlugin.java @@ -0,0 +1,87 @@ +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.base.BaseObject; +import com.alibaba.datax.common.util.Configuration; + +public abstract class AbstractPlugin extends BaseObject implements Pluginable { + //作业的config + private Configuration pluginJobConf; + + //插件本身的plugin + private Configuration pluginConf; + + // by qiangsi.lq。 修改为对端的作业configuration + private Configuration peerPluginJobConf; + + private String peerPluginName; + + @Override + public String getPluginName() { + assert null != this.pluginConf; + return this.pluginConf.getString("name"); + } + + @Override + public String getDeveloper() { + assert null != this.pluginConf; + return this.pluginConf.getString("developer"); + } + + @Override + public String getDescription() { + assert null != this.pluginConf; + return this.pluginConf.getString("description"); + } + + @Override + public Configuration getPluginJobConf() { + return pluginJobConf; + } + + @Override + public void setPluginJobConf(Configuration pluginJobConf) { + this.pluginJobConf = pluginJobConf; + } + + @Override + public void setPluginConf(Configuration 
pluginConf) { + this.pluginConf = pluginConf; + } + + @Override + public Configuration getPeerPluginJobConf() { + return peerPluginJobConf; + } + + @Override + public void setPeerPluginJobConf(Configuration peerPluginJobConf) { + this.peerPluginJobConf = peerPluginJobConf; + } + + @Override + public String getPeerPluginName() { + return peerPluginName; + } + + @Override + public void setPeerPluginName(String peerPluginName) { + this.peerPluginName = peerPluginName; + } + + public void preCheck() { + } + + public void prepare() { + } + + public void post() { + } + + public void preHandler(Configuration jobConfiguration){ + + } + + public void postHandler(Configuration jobConfiguration){ + + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/AbstractTaskPlugin.java b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractTaskPlugin.java new file mode 100755 index 000000000..39fbbe9b5 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractTaskPlugin.java @@ -0,0 +1,37 @@ +package com.alibaba.datax.common.plugin; + +/** + * Created by jingxing on 14-8-24. + */ +public abstract class AbstractTaskPlugin extends AbstractPlugin { + + //TaskPlugin 应该具备taskId + private int taskGroupId; + private int taskId; + private TaskPluginCollector taskPluginCollector; + + public TaskPluginCollector getTaskPluginCollector() { + return taskPluginCollector; + } + + public void setTaskPluginCollector( + TaskPluginCollector taskPluginCollector) { + this.taskPluginCollector = taskPluginCollector; + } + + public int getTaskId() { + return taskId; + } + + public void setTaskId(int taskId) { + this.taskId = taskId; + } + + public int getTaskGroupId() { + return taskGroupId; + } + + public void setTaskGroupId(int taskGroupId) { + this.taskGroupId = taskGroupId; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/JobPluginCollector.java b/common/src/main/java/com/alibaba/datax/common/plugin/JobPluginCollector.java new file mode 100755 index 000000000..6eb02ab4e --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/JobPluginCollector.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.common.plugin; + +import java.util.List; +import java.util.Map; + +/** + * Created by jingxing on 14-9-9. 
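 + *
 + * JobPluginCollector 供 Job 插件在 post() 阶段取回各 Task 通过 TaskPluginCollector.collectMessage
 + * 上报的自定义信息。下面是一个仅作示意的片段(位于继承 AbstractJobPlugin 的 Job 实现内,"rowCount" 为假设的 key):
 + *
 + * List<String> rowCounts = this.getJobPluginCollector().getMessage("rowCount");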
+ */ +public interface JobPluginCollector extends PluginCollector { + + /** + * 从Task获取自定义收集信息 + * + * */ + Map> getMessage(); + + /** + * 从Task获取自定义收集信息 + * + * */ + List getMessage(String key); +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/PluginCollector.java b/common/src/main/java/com/alibaba/datax/common/plugin/PluginCollector.java new file mode 100755 index 000000000..f2af398dd --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/PluginCollector.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.common.plugin; + + +/** + * 这里只是一个标示类 + * */ +public interface PluginCollector { + +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/Pluginable.java b/common/src/main/java/com/alibaba/datax/common/plugin/Pluginable.java new file mode 100755 index 000000000..ac28f6a29 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/Pluginable.java @@ -0,0 +1,30 @@ +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.util.Configuration; + +public interface Pluginable { + String getDeveloper(); + + String getDescription(); + + void setPluginConf(Configuration pluginConf); + + void init(); + + void destroy(); + + String getPluginName(); + + Configuration getPluginJobConf(); + + Configuration getPeerPluginJobConf(); + + public String getPeerPluginName(); + + void setPluginJobConf(Configuration jobConf); + + void setPeerPluginJobConf(Configuration peerPluginJobConf); + + public void setPeerPluginName(String peerPluginName); + +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/RecordReceiver.java b/common/src/main/java/com/alibaba/datax/common/plugin/RecordReceiver.java new file mode 100755 index 000000000..74f236f37 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/RecordReceiver.java @@ -0,0 +1,26 @@ +/** + * (C) 2010-2013 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.element.Record; + +public interface RecordReceiver { + + public Record getFromReader(); + + public void shutdown(); +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/RecordSender.java b/common/src/main/java/com/alibaba/datax/common/plugin/RecordSender.java new file mode 100755 index 000000000..0d6926098 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/RecordSender.java @@ -0,0 +1,32 @@ +/** + * (C) 2010-2013 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.element.Record; + +public interface RecordSender { + + public Record createRecord(); + + public void sendToWriter(Record record); + + public void flush(); + + public void terminate(); + + public void shutdown(); +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/TaskPluginCollector.java b/common/src/main/java/com/alibaba/datax/common/plugin/TaskPluginCollector.java new file mode 100755 index 000000000..f0c85fe6c --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/TaskPluginCollector.java @@ -0,0 +1,57 @@ +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.element.Record; + +/** + * + * 该接口提供给Task Plugin用来记录脏数据和自定义信息。
+ * + * 1. 脏数据记录,TaskPluginCollector提供多种脏数据记录的适配,包括本地输出、集中式汇报等等
+ * 2. 自定义信息,所有的task插件运行过程中可以通过TaskPluginCollector收集信息,
+ * Job的插件在POST过程中通过getMessage()接口获取信息 + */ +public abstract class TaskPluginCollector implements PluginCollector { + /** + * 收集脏数据 + * + * @param dirtyRecord + * 脏数据信息 + * @param t + * 异常信息 + * @param errorMessage + * 错误的提示信息 + */ + public abstract void collectDirtyRecord(final Record dirtyRecord, + final Throwable t, final String errorMessage); + + /** + * 收集脏数据 + * + * @param dirtyRecord + * 脏数据信息 + * @param errorMessage + * 错误的提示信息 + */ + public void collectDirtyRecord(final Record dirtyRecord, + final String errorMessage) { + this.collectDirtyRecord(dirtyRecord, null, errorMessage); + } + + /** + * 收集脏数据 + * + * @param dirtyRecord + * 脏数据信息 + * @param t + * 异常信息 + */ + public void collectDirtyRecord(final Record dirtyRecord, final Throwable t) { + this.collectDirtyRecord(dirtyRecord, t, ""); + } + + /** + * 收集自定义信息,Job插件可以通过getMessage获取该信息
+ * 如果多个key冲突,内部使用List记录同一个key,多个value情况。
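 + *
 + * 仅作示意的调用片段(位于继承 AbstractTaskPlugin 的 Task 实现内,record、e、count 均为假设的上下文变量):
 + *
 + * // 脏数据:附带原始记录、异常与提示信息,由框架按配置本地输出或集中汇报
 + * this.getTaskPluginCollector().collectDirtyRecord(record, e, "字段类型转换失败");
 + * // 自定义信息:Job 插件可在 post() 阶段通过 JobPluginCollector.getMessage("rowCount") 取回
 + * this.getTaskPluginCollector().collectMessage("rowCount", String.valueOf(count));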
+ * */ + public abstract void collectMessage(final String key, final String value); +} diff --git a/common/src/main/java/com/alibaba/datax/common/spi/ErrorCode.java b/common/src/main/java/com/alibaba/datax/common/spi/ErrorCode.java new file mode 100755 index 000000000..053f99a47 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/spi/ErrorCode.java @@ -0,0 +1,33 @@ +package com.alibaba.datax.common.spi; + +/** + * 尤其注意:最好提供toString()实现。例如: + * + *

+ * 
+ * @Override
+ * public String toString() {
+ * 	return String.format("Code:[%s], Description:[%s]. ", this.code, this.describe);
+ * }
+ * 
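 + *
 + * 插件的错误码通常用枚举实现该接口(仅作示意,枚举名与错误码取值均为假设):
 + *
 + * public enum MyReaderErrorCode implements ErrorCode {
 + *     BAD_CONFIG("MyReader-00", "您的配置有误.");
 + *
 + *     private final String code;
 + *     private final String describe;
 + *     // 构造函数以及 getCode()/getDescription()/toString() 的实现与上述示例一致,此处从略
 + * }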
+ * + */ +public interface ErrorCode { + // 错误码编号 + String getCode(); + + // 错误码描述 + String getDescription(); + + /** 必须提供toString的实现 + * + *
+	 * @Override
+	 * public String toString() {
+	 * 	return String.format("Code:[%s], Description:[%s]. ", this.code, this.describe);
+	 * }
+	 * 
+ * + */ + String toString(); +} diff --git a/common/src/main/java/com/alibaba/datax/common/spi/Hook.java b/common/src/main/java/com/alibaba/datax/common/spi/Hook.java new file mode 100755 index 000000000..d510f57c1 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/spi/Hook.java @@ -0,0 +1,27 @@ +package com.alibaba.datax.common.spi; + +import com.alibaba.datax.common.util.Configuration; + +import java.util.Map; + +/** + * Created by xiafei.qiuxf on 14/12/17. + */ +public interface Hook { + + /** + * 返回名字 + * + * @return + */ + public String getName(); + + /** + * TODO 文档 + * + * @param jobConf + * @param msg + */ + public void invoke(Configuration jobConf, Map msg); + +} diff --git a/common/src/main/java/com/alibaba/datax/common/spi/Reader.java b/common/src/main/java/com/alibaba/datax/common/spi/Reader.java new file mode 100755 index 000000000..fec41a9f0 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/spi/Reader.java @@ -0,0 +1,52 @@ +package com.alibaba.datax.common.spi; + +import java.util.List; + +import com.alibaba.datax.common.base.BaseObject; +import com.alibaba.datax.common.plugin.AbstractJobPlugin; +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.plugin.RecordSender; + +/** + * 每个Reader插件在其内部内部实现Job、Task两个内部类。 + * + * + * */ +public abstract class Reader extends BaseObject { + + /** + * 每个Reader插件必须实现Job内部类。 + * + * */ + public static abstract class Job extends AbstractJobPlugin { + + /** + * 切分任务 + * + * @param adviceNumber + * + * 着重说明下,adviceNumber是框架建议插件切分的任务数,插件开发人员最好切分出来的任务数>= + * adviceNumber。
+ *
+ * 之所以采取这个建议是为了给用户最好的实现,例如框架根据计算认为用户数据存储可以支持100个并发连接, + * 并且用户认为需要100个并发。 此时,插件开发人员如果能够根据上述切分规则进行切分并做到>=100连接信息, + * DataX就可以同时启动100个Channel,这样给用户最好的吞吐量
+ * 例如用户同步一张Mysql单表,但是认为可以到10并发吞吐量,插件开发人员最好对该表进行切分,比如使用主键范围切分, + * 并且如果最终切分任务数到>=10,我们就可以提供给用户最大的吞吐量。
+ *
+ * 当然,我们这里只是提供一个建议值,Reader插件可以按照自己规则切分。但是我们更建议按照框架提供的建议值来切分。
+ *
+ * 对于ODPS写入OTS而言,如果存在预排序预切分问题,这样就可能只能按照分区信息切分,无法更细粒度切分, + * 这类情况只能按照源头物理信息切分规则切分。
+ *
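 + *
 + * 仅作示意的切分片段(以按主键范围切分单表为例;originalConfig、splitPkRange、buildQuerySql
 + * 均为假设的成员与辅助方法,并非框架接口,querySql 等配置项名称亦仅为示例):
 + *
 + * List<Configuration> configs = new ArrayList<Configuration>();
 + * for (String pkRange : splitPkRange(originalConfig, adviceNumber)) {
 + *     Configuration sliceConfig = originalConfig.clone();
 + *     sliceConfig.set("querySql", buildQuerySql(pkRange));
 + *     configs.add(sliceConfig);
 + * }
 + * return configs;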
+ * + * + * */ + public abstract List split(int adviceNumber); + } + + public static abstract class Task extends AbstractTaskPlugin { + public abstract void startRead(RecordSender recordSender); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/spi/Writer.java b/common/src/main/java/com/alibaba/datax/common/spi/Writer.java new file mode 100755 index 000000000..457eb6860 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/spi/Writer.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.common.spi; + +import com.alibaba.datax.common.base.BaseObject; +import com.alibaba.datax.common.plugin.AbstractJobPlugin; +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.plugin.RecordReceiver; + +import java.util.List; + +/** + * 每个Writer插件需要实现Writer类,并在其内部实现Job、Task两个内部类。 + * + * + * */ +public abstract class Writer extends BaseObject { + /** + * 每个Writer插件必须实现Job内部类 + */ + public abstract static class Job extends AbstractJobPlugin { + /** + * 切分任务。
+ * + * @param mandatoryNumber + * 为了做到Reader、Writer任务数对等,这里要求Writer插件必须按照源端的切分数进行切分。否则框架报错! + * + * */ + public abstract List split(int mandatoryNumber); + } + + /** + * 每个Writer插件必须实现Task内部类 + */ + public abstract static class Task extends AbstractTaskPlugin { + + public abstract void startWrite(RecordReceiver lineReceiver); + + public boolean supportFailOver(){return false;} + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/statistics/PerfRecord.java b/common/src/main/java/com/alibaba/datax/common/statistics/PerfRecord.java new file mode 100644 index 000000000..5174fcad2 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/statistics/PerfRecord.java @@ -0,0 +1,246 @@ +package com.alibaba.datax.common.statistics; + +import com.alibaba.datax.common.util.HostUtils; +import com.google.common.base.Objects; +import org.apache.commons.lang3.time.DateFormatUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Date; + +/** + * Created by liqiang on 15/8/23. + */ +public class PerfRecord implements Comparable { + private static Logger perf = LoggerFactory.getLogger(PerfRecord.class); + private static String datetimeFormat = "yyyy-MM-dd HH:mm:ss"; + + + public enum PHASE { + /** + * task total运行的时间,前10为框架统计,后面为部分插件的个性统计 + */ + TASK_TOTAL(0), + + READ_TASK_INIT(1), + READ_TASK_PREPARE(2), + READ_TASK_DATA(3), + READ_TASK_POST(4), + READ_TASK_DESTROY(5), + + WRITE_TASK_INIT(6), + WRITE_TASK_PREPARE(7), + WRITE_TASK_DATA(8), + WRITE_TASK_POST(9), + WRITE_TASK_DESTROY(10), + + /** + * SQL_QUERY: sql query阶段, 部分reader的个性统计 + */ + SQL_QUERY(100), + /** + * 数据从sql全部读出来 + */ + RESULT_NEXT_ALL(101), + + /** + * only odps block close + */ + ODPS_BLOCK_CLOSE(102), + + WAIT_READ_TIME(103), + + WAIT_WRITE_TIME(104); + + private int val; + + PHASE(int val) { + this.val = val; + } + + public int toInt(){ + return val; + } + } + + public enum ACTION{ + start, + end + } + + private final int taskGroupId; + private final int taskId; + private final PHASE phase; + private volatile ACTION action; + private volatile Date startTime; + private volatile long elapsedTimeInNs = -1; + private volatile long count = 0; + private volatile long size = 0; + + private volatile long startTimeInNs; + private volatile boolean isReport = false; + + public PerfRecord(int taskGroupId, int taskId, PHASE phase) { + this.taskGroupId = taskGroupId; + this.taskId = taskId; + this.phase = phase; + } + + public static void addPerfRecord(int taskGroupId, int taskId, PHASE phase, long startTime,long elapsedTimeInNs) { + if(PerfTrace.getInstance().isEnable()) { + PerfRecord perfRecord = new PerfRecord(taskGroupId, taskId, phase); + perfRecord.elapsedTimeInNs = elapsedTimeInNs; + perfRecord.action = ACTION.end; + perfRecord.startTime = new Date(startTime); + //在PerfTrace里注册 + PerfTrace.getInstance().tracePerfRecord(perfRecord); + perf.info(perfRecord.toString()); + } + } + + public void start() { + if(PerfTrace.getInstance().isEnable()) { + this.startTime = new Date(); + this.startTimeInNs = System.nanoTime(); + this.action = ACTION.start; + //在PerfTrace里注册 + PerfTrace.getInstance().tracePerfRecord(this); + perf.info(toString()); + } + } + + public void addCount(long count) { + this.count += count; + } + + public void addSize(long size) { + this.size += size; + } + + public void end() { + if(PerfTrace.getInstance().isEnable()) { + this.elapsedTimeInNs = System.nanoTime() - startTimeInNs; + this.action = ACTION.end; + PerfTrace.getInstance().tracePerfRecord(this); + 
perf.info(toString()); + } + } + + public void end(long elapsedTimeInNs) { + if(PerfTrace.getInstance().isEnable()) { + this.elapsedTimeInNs = elapsedTimeInNs; + this.action = ACTION.end; + perf.info(toString()); + } + } + + public String toString() { + return String.format("%s,%s,%s,%s,%s,%s,%s,%s,%s,%s" + , getJobId(), taskGroupId, taskId, phase, action, + DateFormatUtils.format(startTime, datetimeFormat), elapsedTimeInNs, count, size,getHostIP()); + } + + + @Override + public int compareTo(PerfRecord o) { + if (o == null) { + return 1; + } + return this.elapsedTimeInNs > o.elapsedTimeInNs ? 1 : this.elapsedTimeInNs == o.elapsedTimeInNs ? 0 : -1; + } + + @Override + public int hashCode() { + return Objects.hashCode(getJobId(),taskGroupId,taskId,phase,startTime); + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + if(!(o instanceof PerfRecord)){ + return false; + } + + PerfRecord dst = (PerfRecord)o; + + if(!Objects.equal(this.getJobId(),dst.getJobId())) return false; + if(!Objects.equal(this.taskGroupId,dst.taskGroupId)) return false; + if(!Objects.equal(this.taskId,dst.taskId)) return false; + if(!Objects.equal(this.phase,dst.phase)) return false; + if(!Objects.equal(this.startTime,dst.startTime)) return false; + + return true; + } + + public PerfRecord copy() { + PerfRecord copy = new PerfRecord(this.taskGroupId, this.getTaskId(), this.phase); + copy.action = this.action; + copy.startTime = this.startTime; + copy.elapsedTimeInNs = this.elapsedTimeInNs; + copy.count = this.count; + copy.size = this.size; + return copy; + } + public int getTaskGroupId() { + return taskGroupId; + } + + public int getTaskId() { + return taskId; + } + + public PHASE getPhase() { + return phase; + } + + public ACTION getAction() { + return action; + } + + public long getElapsedTimeInNs() { + return elapsedTimeInNs; + } + + public long getCount() { + return count; + } + + public long getSize() { + return size; + } + + public long getJobId(){ + return PerfTrace.getInstance().getJobId(); + } + + public String getHostIP(){ + return HostUtils.IP; + } + + public String getHostName(){ + return HostUtils.HOSTNAME; + } + + public Date getStartTime() { + return startTime; + } + + public long getStartTimeInNs() { + return startTimeInNs; + } + + public String getDatetime(){ + if(startTime == null){ + return "null time"; + } + return DateFormatUtils.format(startTime, datetimeFormat); + } + + public boolean isReport() { + return isReport; + } + + public void setIsReport(boolean isReport) { + this.isReport = isReport; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/statistics/PerfTrace.java b/common/src/main/java/com/alibaba/datax/common/statistics/PerfTrace.java new file mode 100644 index 000000000..eb0706079 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/statistics/PerfTrace.java @@ -0,0 +1,422 @@ +package com.alibaba.datax.common.statistics; + +import com.alibaba.datax.common.util.Configuration; +import com.google.common.base.Optional; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.TimeUnit; + +/** + * PerfTrace 记录 job(local模式),taskGroup(distribute模式),因为这2种都是jvm,即一个jvm里只需要有1个PerfTrace。 + */ + +public class PerfTrace { + + private static Logger LOG = LoggerFactory.getLogger(PerfTrace.class); + private static PerfTrace instance; + private static final Object lock = new Object(); + private String 
perfTraceId; + private volatile boolean enable; + private volatile boolean isJob; + private long jobId; + private int priority; + private int batchSize = 500; + private volatile boolean perfReportEnalbe = true; + + //jobid_jobversion,instanceid,taskid, src_mark, dst_mark, + private Map taskDetails = new ConcurrentHashMap(); + //PHASE => PerfRecord + private ConcurrentHashMap perfRecordMaps = new ConcurrentHashMap(); + private Configuration jobInfo; + private final List startReportPool = new ArrayList(); + private final List endReportPool = new ArrayList(); + private final List totalEndReport = new ArrayList(); + private final Set waitingReportSet = new HashSet(); + + + /** + * 单实例 + * + * @param isJob + * @param jobId + * @param taskGroupId + * @return + */ + public static PerfTrace getInstance(boolean isJob, long jobId, int taskGroupId,int priority, boolean enable) { + + if (instance == null) { + synchronized (lock) { + if (instance == null) { + instance = new PerfTrace(isJob, jobId, taskGroupId,priority, enable); + } + } + } + return instance; + } + + /** + * 因为一个JVM只有一个,因此在getInstance(isJob,jobId,taskGroupId)调用完成实例化后,方便后续调用,直接返回该实例 + * + * @return + */ + public static PerfTrace getInstance() { + if (instance == null) { + LOG.error("PerfTrace instance not be init! must have some error! "); + synchronized (lock) { + if (instance == null) { + instance = new PerfTrace(false, -1111, -1111, 0, false); + } + } + } + return instance; + } + + private PerfTrace(boolean isJob, long jobId, int taskGroupId, int priority, boolean enable) { + this.perfTraceId = isJob ? "job_" + jobId : String.format("taskGroup_%s_%s", jobId, taskGroupId); + this.enable = enable; + this.isJob = isJob; + this.jobId = jobId; + this.priority = priority; + LOG.info(String.format("PerfTrace traceId=%s, isEnable=%s, priority=%s", this.perfTraceId, this.enable, this.priority)); + + } + + public void addTaskDetails(int taskId, String detail) { + if (enable) { + String before = ""; + int index = detail.indexOf("?"); + String current = detail.substring(0, index == -1 ? detail.length() : index); + if(current.indexOf("[")>=0){ + current+="]"; + } + if (taskDetails.containsKey(taskId)) { + before = taskDetails.get(taskId).trim(); + } + if (StringUtils.isEmpty(before)) { + before = ""; + } else { + before += ","; + } + this.taskDetails.put(taskId, before + current); + } + } + + public void tracePerfRecord(PerfRecord perfRecord) { + if (enable) { + //ArrayList非线程安全 + switch (perfRecord.getAction()) { + case start: + synchronized (startReportPool) { + startReportPool.add(perfRecord); + } + break; + case end: + synchronized (endReportPool) { + endReportPool.add(perfRecord); + } + break; + } + } + } + + public String summarizeNoException(){ + String res; + try { + res = summarize(); + } catch (Exception e) { + res = "PerfTrace summarize has Exception "+e.getMessage(); + } + return res; + } + + //任务结束时,对当前的perf总汇总统计 + private synchronized String summarize() { + if (!enable) { + return "PerfTrace not enable!"; + } + + if (totalEndReport.size() > 0) { + sumEndPerfRecords(totalEndReport); + } + + StringBuilder info = new StringBuilder(); + info.append("\n === total summarize info === \n"); + info.append("\n 1. 
all phase average time info and max time task info: \n\n"); + info.append(String.format("%-20s | %18s | %18s | %18s | %18s | %-100s\n", "PHASE", "AVERAGE USED TIME", "ALL TASK NUM", "MAX USED TIME", "MAX TASK ID", "MAX TASK INFO")); + + List keys = new ArrayList(perfRecordMaps.keySet()); + Collections.sort(keys, new Comparator() { + @Override + public int compare(PerfRecord.PHASE o1, PerfRecord.PHASE o2) { + return o1.toInt() - o2.toInt(); + } + }); + for (PerfRecord.PHASE phase : keys) { + SumPerfRecord sumPerfRecord = perfRecordMaps.get(phase); + if (sumPerfRecord == null) { + continue; + } + long averageTime = sumPerfRecord.getAverageTime(); + long maxTime = sumPerfRecord.getMaxTime(); + int maxTaskId = sumPerfRecord.maxTaskId; + int maxTaskGroupId = sumPerfRecord.getMaxTaskGroupId(); + info.append(String.format("%-20s | %18s | %18s | %18s | %18s | %-100s\n", + phase, unitTime(averageTime), sumPerfRecord.totalCount, unitTime(maxTime), jobId + "-" + maxTaskGroupId + "-" + maxTaskId, taskDetails.get(maxTaskId))); + } + + SumPerfRecord countSumPerf = Optional.fromNullable(perfRecordMaps.get(PerfRecord.PHASE.READ_TASK_DATA)).or(new SumPerfRecord()); + + long averageRecords = countSumPerf.getAverageRecords(); + long averageBytes = countSumPerf.getAverageBytes(); + long maxRecord = countSumPerf.getMaxRecord(); + long maxByte = countSumPerf.getMaxByte(); + int maxTaskId4Records = countSumPerf.getMaxTaskId4Records(); + int maxTGID4Records = countSumPerf.getMaxTGID4Records(); + + info.append("\n\n 2. record average count and max count task info :\n\n"); + info.append(String.format("%-20s | %18s | %18s | %18s | %18s | %18s | %-100s\n", "PHASE", "AVERAGE RECORDS", "AVERAGE BYTES", "MAX RECORDS", "MAX RECORD`S BYTES", "MAX TASK ID", "MAX TASK INFO")); + if (maxTaskId4Records > -1) { + info.append(String.format("%-20s | %18s | %18s | %18s | %18s | %18s | %-100s\n" + , PerfRecord.PHASE.READ_TASK_DATA, averageRecords, unitSize(averageBytes), maxRecord, unitSize(maxByte), jobId + "-" + maxTGID4Records + "-" + maxTaskId4Records, taskDetails.get(maxTaskId4Records))); + + } + return info.toString(); + } + + //缺省传入的时间是nano + public static String unitTime(long time) { + return unitTime(time, TimeUnit.NANOSECONDS); + } + + public static String unitTime(long time, TimeUnit timeUnit) { + return String.format("%,.3fs", ((float) timeUnit.toNanos(time)) / 1000000000); + } + + public static String unitSize(long size) { + if (size > 1000000000) { + return String.format("%,.2fG", (float) size / 1000000000); + } else if (size > 1000000) { + return String.format("%,.2fM", (float) size / 1000000); + } else if (size > 1000) { + return String.format("%,.2fK", (float) size / 1000); + } else { + return size + "B"; + } + } + + + public synchronized ConcurrentHashMap getPerfRecordMaps() { + synchronized (endReportPool) { + // perfRecordMaps.get(perfRecord.getPhase()).add(perfRecord); + waitingReportSet.addAll(endReportPool); + totalEndReport.addAll(endReportPool); + endReportPool.clear(); + } + if(totalEndReport.size() > 0 ){ + sumEndPerfRecords(totalEndReport); + } + return perfRecordMaps; + } + + public List getWaitingReportList() { + return new ArrayList(waitingReportSet); + } + + public List getStartReportPool() { + return startReportPool; + } + + public List getEndReportPool() { + return endReportPool; + } + + public List getTotalEndReport() { + return totalEndReport; + } + + public Map getTaskDetails() { + return taskDetails; + } + + public boolean isEnable() { + return enable; + } + + public boolean isJob() { + return 
isJob; + } + + public long getJobId() { + return jobId; + } + + private String cluster; + private String jobDomain; + private String srcType; + private String dstType; + private String srcGuid; + private String dstGuid; + private String dataxType; + + public void setJobInfo(Configuration jobInfo) { + this.jobInfo = jobInfo; + if (jobInfo != null) { + cluster = jobInfo.getString("cluster"); + + String srcDomain = jobInfo.getString("srcDomain", "null"); + String dstDomain = jobInfo.getString("dstDomain", "null"); + jobDomain = srcDomain + "|" + dstDomain; + srcType = jobInfo.getString("srcType"); + dstType = jobInfo.getString("dstType"); + srcGuid = jobInfo.getString("srcGuid"); + dstGuid = jobInfo.getString("dstGuid"); + long jobId = jobInfo.getLong("jobId"); + if (jobId > 0) { + //同步中心任务 + dataxType = "dsc"; + } else { + dataxType = "datax3"; + } + } else { + dataxType = "datax3"; + } + } + + public Configuration getJobInfo() { + return jobInfo; + } + + public void setBatchSize(int batchSize) { + this.batchSize = batchSize; + } + + public void setPerfReportEnalbe(boolean perfReportEnalbe) { + this.perfReportEnalbe = perfReportEnalbe; + } + + + private void sumEndPerfRecords(List totalEndReport) { + if (!enable || totalEndReport == null) { + return; + } + + for (PerfRecord perfRecord : totalEndReport) { + perfRecordMaps.putIfAbsent(perfRecord.getPhase(), new SumPerfRecord()); + perfRecordMaps.get(perfRecord.getPhase()).add(perfRecord); + } + + totalEndReport.clear(); + } + + + + public static class SumPerfRecord { + private long perfTimeTotal = 0; + private long averageTime = 0; + private long maxTime = 0; + private int maxTaskId = -1; + private int maxTaskGroupId = -1; + private int totalCount = 0; + + private long recordsTotal = 0; + private long sizesTotal = 0; + private long averageRecords = 0; + private long averageBytes = 0; + private long maxRecord = 0; + private long maxByte = 0; + private int maxTaskId4Records = -1; + private int maxTGID4Records = -1; + + synchronized void add(PerfRecord perfRecord) { + if (perfRecord == null) { + return; + } + perfTimeTotal += perfRecord.getElapsedTimeInNs(); + if (perfRecord.getElapsedTimeInNs() > maxTime) { + maxTime = perfRecord.getElapsedTimeInNs(); + maxTaskId = perfRecord.getTaskId(); + maxTaskGroupId = perfRecord.getTaskGroupId(); + } + + recordsTotal += perfRecord.getCount(); + sizesTotal += perfRecord.getSize(); + if (perfRecord.getCount() > maxRecord) { + maxRecord = perfRecord.getCount(); + maxByte = perfRecord.getSize(); + maxTaskId4Records = perfRecord.getTaskId(); + maxTGID4Records = perfRecord.getTaskGroupId(); + } + + totalCount++; + } + + public long getPerfTimeTotal() { + return perfTimeTotal; + } + + public long getAverageTime() { + if (totalCount > 0) { + averageTime = perfTimeTotal / totalCount; + } + return averageTime; + } + + public long getMaxTime() { + return maxTime; + } + + public int getMaxTaskId() { + return maxTaskId; + } + + public int getMaxTaskGroupId() { + return maxTaskGroupId; + } + + public long getRecordsTotal() { + return recordsTotal; + } + + public long getSizesTotal() { + return sizesTotal; + } + + public long getAverageRecords() { + if (totalCount > 0) { + averageRecords = recordsTotal / totalCount; + } + return averageRecords; + } + + public long getAverageBytes() { + if (totalCount > 0) { + averageBytes = sizesTotal / totalCount; + } + return averageBytes; + } + + public long getMaxRecord() { + return maxRecord; + } + + public long getMaxByte() { + return maxByte; + } + + public int 
getMaxTaskId4Records() { + return maxTaskId4Records; + } + + public int getMaxTGID4Records() { + return maxTGID4Records; + } + + public int getTotalCount() { + return totalCount; + } + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/statistics/VMInfo.java b/common/src/main/java/com/alibaba/datax/common/statistics/VMInfo.java new file mode 100644 index 000000000..85535fddd --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/statistics/VMInfo.java @@ -0,0 +1,412 @@ +package com.alibaba.datax.common.statistics; + +import com.google.common.collect.Maps; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.lang.management.GarbageCollectorMXBean; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.OperatingSystemMXBean; +import java.lang.management.RuntimeMXBean; +import java.lang.reflect.Method; +import java.util.List; +import java.util.Map; + +/** + * Created by liqiang on 15/11/12. + */ +public class VMInfo { + private static final Logger LOG = LoggerFactory.getLogger(VMInfo.class); + static final long MB = 1024 * 1024; + static final long GB = 1024 * 1024 * 1024; + public static Object lock = new Object(); + private static VMInfo vmInfo; + + /** + * @return null or vmInfo. null is something error, job no care it. + */ + public static VMInfo getVmInfo() { + if (vmInfo == null) { + synchronized (lock) { + if (vmInfo == null) { + try { + vmInfo = new VMInfo(); + } catch (Exception e) { + LOG.warn("no need care, the fail is ignored : vmInfo init failed " + e.getMessage(), e); + } + } + } + + } + return vmInfo; + } + + // 数据的MxBean + private final OperatingSystemMXBean osMXBean; + private final RuntimeMXBean runtimeMXBean; + private final List garbageCollectorMXBeanList; + private final List memoryPoolMXBeanList; + /** + * 静态信息 + */ + private final String osInfo; + private final String jvmInfo; + + /** + * cpu个数 + */ + private final int totalProcessorCount; + + /** + * 机器的各个状态,用于中间打印和统计上报 + */ + private final PhyOSStatus startPhyOSStatus; + private final ProcessCpuStatus processCpuStatus = new ProcessCpuStatus(); + private final ProcessGCStatus processGCStatus = new ProcessGCStatus(); + private final ProcessMemoryStatus processMomoryStatus = new ProcessMemoryStatus(); + //ms + private long lastUpTime = 0; + //nano + private long lastProcessCpuTime = 0; + + + private VMInfo() { + //初始化静态信息 + osMXBean = java.lang.management.ManagementFactory.getOperatingSystemMXBean(); + runtimeMXBean = java.lang.management.ManagementFactory.getRuntimeMXBean(); + garbageCollectorMXBeanList = java.lang.management.ManagementFactory.getGarbageCollectorMXBeans(); + memoryPoolMXBeanList = java.lang.management.ManagementFactory.getMemoryPoolMXBeans(); + + osInfo = runtimeMXBean.getVmVendor() + " " + runtimeMXBean.getSpecVersion() + " " + runtimeMXBean.getVmVersion(); + jvmInfo = osMXBean.getName() + " " + osMXBean.getArch() + " " + osMXBean.getVersion(); + totalProcessorCount = osMXBean.getAvailableProcessors(); + + //构建startPhyOSStatus + startPhyOSStatus = new PhyOSStatus(); + LOG.info("VMInfo# operatingSystem class => " + osMXBean.getClass().getName()); + if (VMInfo.isSunOsMBean(osMXBean)) { + { + startPhyOSStatus.totalPhysicalMemory = VMInfo.getLongFromOperatingSystem(osMXBean, "getTotalPhysicalMemorySize"); + startPhyOSStatus.freePhysicalMemory = VMInfo.getLongFromOperatingSystem(osMXBean, "getFreePhysicalMemorySize"); + startPhyOSStatus.maxFileDescriptorCount = VMInfo.getLongFromOperatingSystem(osMXBean, "getMaxFileDescriptorCount"); + 
startPhyOSStatus.currentOpenFileDescriptorCount = VMInfo.getLongFromOperatingSystem(osMXBean, "getOpenFileDescriptorCount"); + } + } + + //初始化processGCStatus; + for (GarbageCollectorMXBean garbage : garbageCollectorMXBeanList) { + GCStatus gcStatus = new GCStatus(); + gcStatus.name = garbage.getName(); + processGCStatus.gcStatusMap.put(garbage.getName(), gcStatus); + } + + //初始化processMemoryStatus + if (memoryPoolMXBeanList != null && !memoryPoolMXBeanList.isEmpty()) { + for (MemoryPoolMXBean pool : memoryPoolMXBeanList) { + MemoryStatus memoryStatus = new MemoryStatus(); + memoryStatus.name = pool.getName(); + memoryStatus.initSize = pool.getUsage().getInit(); + memoryStatus.maxSize = pool.getUsage().getMax(); + processMomoryStatus.memoryStatusMap.put(pool.getName(), memoryStatus); + } + } + } + + public String toString() { + return "the machine info => \n\n" + + "\tosInfo:\t" + osInfo + "\n" + + "\tjvmInfo:\t" + jvmInfo + "\n" + + "\tcpu num:\t" + totalProcessorCount + "\n\n" + + startPhyOSStatus.toString() + "\n" + + processGCStatus.toString() + "\n" + + processMomoryStatus.toString() + "\n"; + } + + public String totalString() { + return (processCpuStatus.getTotalString() + processGCStatus.getTotalString()); + } + + public void getDelta() { + getDelta(true); + } + + public synchronized void getDelta(boolean print) { + + try { + if (VMInfo.isSunOsMBean(osMXBean)) { + long curUptime = runtimeMXBean.getUptime(); + long curProcessTime = getLongFromOperatingSystem(osMXBean, "getProcessCpuTime"); + //百分比, uptime是ms,processTime是nano + if ((curUptime > lastUpTime) && (curProcessTime >= lastProcessCpuTime)) { + float curDeltaCpu = (float) (curProcessTime - lastProcessCpuTime) / ((curUptime - lastUpTime) * totalProcessorCount * 10000); + processCpuStatus.setMaxMinCpu(curDeltaCpu); + processCpuStatus.averageCpu = (float) curProcessTime / (curUptime * totalProcessorCount * 10000); + + lastUpTime = curUptime; + lastProcessCpuTime = curProcessTime; + } + } + + for (GarbageCollectorMXBean garbage : garbageCollectorMXBeanList) { + + GCStatus gcStatus = processGCStatus.gcStatusMap.get(garbage.getName()); + if (gcStatus == null) { + gcStatus = new GCStatus(); + gcStatus.name = garbage.getName(); + processGCStatus.gcStatusMap.put(garbage.getName(), gcStatus); + } + + long curTotalGcCount = garbage.getCollectionCount(); + gcStatus.setCurTotalGcCount(curTotalGcCount); + + long curtotalGcTime = garbage.getCollectionTime(); + gcStatus.setCurTotalGcTime(curtotalGcTime); + } + + if (memoryPoolMXBeanList != null && !memoryPoolMXBeanList.isEmpty()) { + for (MemoryPoolMXBean pool : memoryPoolMXBeanList) { + + MemoryStatus memoryStatus = processMomoryStatus.memoryStatusMap.get(pool.getName()); + if (memoryStatus == null) { + memoryStatus = new MemoryStatus(); + memoryStatus.name = pool.getName(); + processMomoryStatus.memoryStatusMap.put(pool.getName(), memoryStatus); + } + memoryStatus.commitedSize = pool.getUsage().getCommitted(); + memoryStatus.setMaxMinUsedSize(pool.getUsage().getUsed()); + long maxMemory = memoryStatus.commitedSize > 0 ? memoryStatus.commitedSize : memoryStatus.maxSize; + memoryStatus.setMaxMinPercent(maxMemory > 0 ? 
(float) 100 * memoryStatus.usedSize / maxMemory : -1); + } + } + + if (print) { + LOG.info(processCpuStatus.getDeltaString() + processMomoryStatus.getDeltaString() + processGCStatus.getDeltaString()); + } + + } catch (Exception e) { + LOG.warn("no need care, the fail is ignored : vmInfo getDelta failed " + e.getMessage(), e); + } + } + + public static boolean isSunOsMBean(OperatingSystemMXBean operatingSystem) { + final String className = operatingSystem.getClass().getName(); + + return "com.sun.management.UnixOperatingSystem".equals(className); + } + + public static long getLongFromOperatingSystem(OperatingSystemMXBean operatingSystem, String methodName) { + try { + final Method method = operatingSystem.getClass().getMethod(methodName, (Class[]) null); + method.setAccessible(true); + return (Long) method.invoke(operatingSystem, (Object[]) null); + } catch (final Exception e) { + LOG.info(String.format("OperatingSystemMXBean %s failed, Exception = %s ", methodName, e.getMessage())); + } + + return -1; + } + + private class PhyOSStatus { + long totalPhysicalMemory = -1; + long freePhysicalMemory = -1; + long maxFileDescriptorCount = -1; + long currentOpenFileDescriptorCount = -1; + + public String toString() { + return String.format("\ttotalPhysicalMemory:\t%,.2fG\n" + + "\tfreePhysicalMemory:\t%,.2fG\n" + + "\tmaxFileDescriptorCount:\t%s\n" + + "\tcurrentOpenFileDescriptorCount:\t%s\n", + (float) totalPhysicalMemory / GB, (float) freePhysicalMemory / GB, maxFileDescriptorCount, currentOpenFileDescriptorCount); + } + } + + private class ProcessCpuStatus { + // 百分比的值 比如30.0 表示30.0% + float maxDeltaCpu = -1; + float minDeltaCpu = -1; + float curDeltaCpu = -1; + float averageCpu = -1; + + public void setMaxMinCpu(float curCpu) { + this.curDeltaCpu = curCpu; + if (maxDeltaCpu < curCpu) { + maxDeltaCpu = curCpu; + } + + if (minDeltaCpu == -1 || minDeltaCpu > curCpu) { + minDeltaCpu = curCpu; + } + } + + public String getDeltaString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [delta cpu info] => \n"); + sb.append("\t\t"); + sb.append(String.format("%-30s | %-30s | %-30s | %-30s \n", "curDeltaCpu", "averageCpu", "maxDeltaCpu", "minDeltaCpu")); + sb.append("\t\t"); + sb.append(String.format("%-30s | %-30s | %-30s | %-30s \n", + String.format("%,.2f%%", processCpuStatus.curDeltaCpu), + String.format("%,.2f%%", processCpuStatus.averageCpu), + String.format("%,.2f%%", processCpuStatus.maxDeltaCpu), + String.format("%,.2f%%\n", processCpuStatus.minDeltaCpu))); + + return sb.toString(); + } + + public String getTotalString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [total cpu info] => \n"); + sb.append("\t\t"); + sb.append(String.format("%-30s | %-30s | %-30s \n", "averageCpu", "maxDeltaCpu", "minDeltaCpu")); + sb.append("\t\t"); + sb.append(String.format("%-30s | %-30s | %-30s \n", + String.format("%,.2f%%", processCpuStatus.averageCpu), + String.format("%,.2f%%", processCpuStatus.maxDeltaCpu), + String.format("%,.2f%%\n", processCpuStatus.minDeltaCpu))); + + return sb.toString(); + } + + } + + private class ProcessGCStatus { + final Map gcStatusMap = Maps.newHashMap(); + + public String toString() { + return "\tGC Names\t" + gcStatusMap.keySet() + "\n"; + } + + public String getDeltaString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [delta gc info] => \n"); + sb.append("\t\t "); + sb.append(String.format("%-20s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s \n", "NAME", "curDeltaGCCount", "totalGCCount", "maxDeltaGCCount", 
"minDeltaGCCount", "curDeltaGCTime", "totalGCTime", "maxDeltaGCTime", "minDeltaGCTime")); + for (GCStatus gc : gcStatusMap.values()) { + sb.append("\t\t "); + sb.append(String.format("%-20s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s \n", + gc.name, gc.curDeltaGCCount, gc.totalGCCount, gc.maxDeltaGCCount, gc.minDeltaGCCount, + String.format("%,.3fs",(float)gc.curDeltaGCTime/1000), + String.format("%,.3fs",(float)gc.totalGCTime/1000), + String.format("%,.3fs",(float)gc.maxDeltaGCTime/1000), + String.format("%,.3fs",(float)gc.minDeltaGCTime/1000))); + + } + return sb.toString(); + } + + public String getTotalString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [total gc info] => \n"); + sb.append("\t\t "); + sb.append(String.format("%-20s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s \n", "NAME", "totalGCCount", "maxDeltaGCCount", "minDeltaGCCount", "totalGCTime", "maxDeltaGCTime", "minDeltaGCTime")); + for (GCStatus gc : gcStatusMap.values()) { + sb.append("\t\t "); + sb.append(String.format("%-20s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s \n", + gc.name, gc.totalGCCount, gc.maxDeltaGCCount, gc.minDeltaGCCount, + String.format("%,.3fs",(float)gc.totalGCTime/1000), + String.format("%,.3fs",(float)gc.maxDeltaGCTime/1000), + String.format("%,.3fs",(float)gc.minDeltaGCTime/1000))); + + } + return sb.toString(); + } + } + + private class ProcessMemoryStatus { + final Map memoryStatusMap = Maps.newHashMap(); + + public String toString() { + StringBuilder sb = new StringBuilder(); + sb.append("\t"); + sb.append(String.format("%-30s | %-30s | %-30s \n", "MEMORY_NAME", "allocation_size", "init_size")); + for (MemoryStatus ms : memoryStatusMap.values()) { + sb.append("\t"); + sb.append(String.format("%-30s | %-30s | %-30s \n", + ms.name, String.format("%,.2fMB", (float) ms.maxSize / MB), String.format("%,.2fMB", (float) ms.initSize / MB))); + } + return sb.toString(); + } + + public String getDeltaString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [delta memory info] => \n"); + sb.append("\t\t "); + sb.append(String.format("%-30s | %-30s | %-30s | %-30s | %-30s \n", "NAME", "used_size", "used_percent", "max_used_size", "max_percent")); + for (MemoryStatus ms : memoryStatusMap.values()) { + sb.append("\t\t "); + sb.append(String.format("%-30s | %-30s | %-30s | %-30s | %-30s \n", + ms.name, String.format("%,.2f", (float) ms.usedSize / MB) + "MB", + String.format("%,.2f", (float) ms.percent) + "%", + String.format("%,.2f", (float) ms.maxUsedSize / MB) + "MB", + String.format("%,.2f", (float) ms.maxpercent) + "%")); + + } + return sb.toString(); + } + } + + private class GCStatus { + String name; + long maxDeltaGCCount = -1; + long minDeltaGCCount = -1; + long curDeltaGCCount; + long totalGCCount = 0; + long maxDeltaGCTime = -1; + long minDeltaGCTime = -1; + long curDeltaGCTime; + long totalGCTime = 0; + + public void setCurTotalGcCount(long curTotalGcCount) { + this.curDeltaGCCount = curTotalGcCount - totalGCCount; + this.totalGCCount = curTotalGcCount; + + if (maxDeltaGCCount < curDeltaGCCount) { + maxDeltaGCCount = curDeltaGCCount; + } + + if (minDeltaGCCount == -1 || minDeltaGCCount > curDeltaGCCount) { + minDeltaGCCount = curDeltaGCCount; + } + } + + public void setCurTotalGcTime(long curTotalGcTime) { + this.curDeltaGCTime = curTotalGcTime - totalGCTime; + this.totalGCTime = curTotalGcTime; + + if (maxDeltaGCTime < curDeltaGCTime) { + maxDeltaGCTime = curDeltaGCTime; + } + + if (minDeltaGCTime == -1 || minDeltaGCTime > 
curDeltaGCTime) { + minDeltaGCTime = curDeltaGCTime; + } + } + } + + private class MemoryStatus { + String name; + long initSize; + long maxSize; + long commitedSize; + long usedSize; + float percent; + long maxUsedSize = -1; + float maxpercent = 0; + + void setMaxMinUsedSize(long curUsedSize) { + if (maxUsedSize < curUsedSize) { + maxUsedSize = curUsedSize; + } + this.usedSize = curUsedSize; + } + + void setMaxMinPercent(float curPercent) { + if (maxpercent < curPercent) { + maxpercent = curPercent; + } + this.percent = curPercent; + } + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/Configuration.java b/common/src/main/java/com/alibaba/datax/common/util/Configuration.java new file mode 100755 index 000000000..456920805 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/Configuration.java @@ -0,0 +1,1078 @@ +package com.alibaba.datax.common.util; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.spi.ErrorCode; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.serializer.SerializerFeature; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.CharUtils; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.builder.ToStringBuilder; + +import java.io.*; +import java.util.*; + +/** + * Configuration 提供多级JSON配置信息无损存储
+ *
+ *

+ * 实例代码:
+ *
+ * 获取job的配置信息:
+ * Configuration configuration = Configuration.from(new File("Config.json"));
+ * String jobContainerClass = configuration.getString("core.container.job.class");
+ *
+ * 设置多级List:
+ * configuration.set("job.reader.parameter.jdbcUrl", Arrays.asList(new String[] {"jdbc", "jdbc"}));
+ *
+ * 合并Configuration:
+ * configuration.merge(another);
+ *
+ * Configuration 存在两种较好的实现方式:
+ * 第一种是将JSON配置信息中所有的Key全部打平,用a.b.c的级联方式作为Map的Key,内部使用一个Map保存信息;
+ * 第二种是将JSON的对象直接使用结构化树形结构保存。
+ *
+ * 目前使用的是第二种实现方式,使用第一种的问题在于:
+ * 1. 插入新对象比较难处理,例如已有a.b.c="bazhen",此时如果需要插入a="bazhen",也即是根目录下第一层所有类型全部要废弃,使用"bazhen"作为value;第一种方式使用字符串表示key,难以处理这类问题。
+ * 2. 返回树形结构,例如 a.b.c.d = "bazhen",如果返回"a"下的所有元素,实际上是一个Map,需要合并处理。
+ * 3. 输出JSON,将上述对象转为JSON,要把上述Map的多级key转为树形结构,并输出为JSON
+ */ +public class Configuration { + + /** + * 对于加密的keyPath,需要记录下来 + * 为的是后面分布式情况下将该值加密后抛到DataXServer中 + */ + private Set secretKeyPathSet = + new HashSet(); + + private Object root = null; + + /** + * 初始化空白的Configuration + */ + public static Configuration newDefault() { + return Configuration.from("{}"); + } + + /** + * 从JSON字符串加载Configuration + */ + public static Configuration from(String json) { + json = StrUtil.replaceVariable(json); + checkJSON(json); + + try { + return new Configuration(json); + } catch (Exception e) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + e); + } + + } + + /** + * 从包括json的File对象加载Configuration + */ + public static Configuration from(File file) { + try { + return Configuration.from(IOUtils + .toString(new FileInputStream(file))); + } catch (FileNotFoundException e) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + String.format("配置信息错误,您提供的配置文件[%s]不存在. 请检查您的配置文件.", file.getAbsolutePath())); + } catch (IOException e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("配置信息错误. 您提供配置文件[%s]读取失败,错误原因: %s. 请检查您的配置文件的权限设置.", + file.getAbsolutePath(), e)); + } + } + + /** + * 从包括json的InputStream对象加载Configuration + */ + public static Configuration from(InputStream is) { + try { + return Configuration.from(IOUtils.toString(is)); + } catch (IOException e) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + String.format("请检查您的配置文件. 您提供的配置文件读取失败,错误原因: %s. 请检查您的配置文件的权限设置.", e)); + } + } + + /** + * 从Map对象加载Configuration + */ + public static Configuration from(final Map object) { + return Configuration.from(Configuration.toJSONString(object)); + } + + /** + * 从List对象加载Configuration + */ + public static Configuration from(final List object) { + return Configuration.from(Configuration.toJSONString(object)); + } + + public String getNecessaryValue(String key, ErrorCode errorCode) { + String value = this.getString(key, null); + if (StringUtils.isBlank(value)) { + throw DataXException.asDataXException(errorCode, + String.format("您提供配置文件有误,[%s]是必填参数,不允许为空或者留白 .", key)); + } + + return value; + } + + public String getUnnecessaryValue(String key,String defaultValue,ErrorCode errorCode) { + String value = this.getString(key, defaultValue); + if (StringUtils.isBlank(value)) { + value = defaultValue; + } + return value; + } + + public Boolean getNecessaryBool(String key, ErrorCode errorCode) { + Boolean value = this.getBool(key); + if (value == null) { + throw DataXException.asDataXException(errorCode, + String.format("您提供配置文件有误,[%s]是必填参数,不允许为空或者留白 .", key)); + } + + return value; + } + + /** + * 根据用户提供的json path,寻址具体的对象。 + *

+ *
+ *

+ * NOTE: 目前仅支持Map以及List下标寻址, 例如: + *

+ *
+ *

+ * 对于如下JSON + *

+ * {"a": {"b": {"c": [0,1,2,3]}}} + *

+ * config.get("") 返回整个Map
+ * config.get("a") 返回a下属整个Map
+ * config.get("a.b.c") 返回c对应的数组List
+ * config.get("a.b.c[0]") 返回数字0 + * + * @return Java表示的JSON对象,如果path不存在或者对象不存在,均返回null。 + */ + public Object get(final String path) { + this.checkPath(path); + try { + return this.findObject(path); + } catch (Exception e) { + return null; + } + } + + /** + * 用户指定部分path,获取Configuration的子集 + *

+ *
+ * 如果path获取的路径或者对象不存在,返回null + */ + public Configuration getConfiguration(final String path) { + Object object = this.get(path); + if (null == object) { + return null; + } + + return Configuration.from(Configuration.toJSONString(object)); + } + + /** + * 根据用户提供的json path,寻址String对象 + * + * @return String对象,如果path不存在或者String不存在,返回null + */ + public String getString(final String path) { + Object string = this.get(path); + if (null == string) { + return null; + } + return String.valueOf(string); + } + + /** + * 根据用户提供的json path,寻址String对象,如果对象不存在,返回默认字符串 + * + * @return String对象,如果path不存在或者String不存在,返回默认字符串 + */ + public String getString(final String path, final String defaultValue) { + String result = this.getString(path); + + if (null == result) { + return defaultValue; + } + + return result; + } + + /** + * 根据用户提供的json path,寻址Character对象 + * + * @return Character对象,如果path不存在或者Character不存在,返回null + */ + public Character getChar(final String path) { + String result = this.getString(path); + if (null == result) { + return null; + } + + try { + return CharUtils.toChar(result); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("任务读取配置文件出错. 因为配置文件路径[%s] 值非法,期望是字符类型: %s. 请检查您的配置并作出修改.", path, + e.getMessage())); + } + } + + /** + * 根据用户提供的json path,寻址Boolean对象,如果对象不存在,返回默认Character对象 + * + * @return Character对象,如果path不存在或者Character不存在,返回默认Character对象 + */ + public Character getChar(final String path, char defaultValue) { + Character result = this.getChar(path); + if (null == result) { + return defaultValue; + } + return result; + } + + /** + * 根据用户提供的json path,寻址Boolean对象 + * + * @return Boolean对象,如果path值非true,false ,将报错.特别注意:当 path 不存在时,会返回:null. + */ + public Boolean getBool(final String path) { + String result = this.getString(path); + + if (null == result) { + return null; + } else if ("true".equalsIgnoreCase(result)) { + return Boolean.TRUE; + } else if ("false".equalsIgnoreCase(result)) { + return Boolean.FALSE; + } else { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + String.format("您提供的配置信息有误,因为从[%s]获取的值[%s]无法转换为bool类型. 请检查源表的配置并且做出相应的修改.", + path, result)); + } + + } + + /** + * 根据用户提供的json path,寻址Boolean对象,如果对象不存在,返回默认Boolean对象 + * + * @return Boolean对象,如果path不存在或者Boolean不存在,返回默认Boolean对象 + */ + public Boolean getBool(final String path, boolean defaultValue) { + Boolean result = this.getBool(path); + if (null == result) { + return defaultValue; + } + return result; + } + + /** + * 根据用户提供的json path,寻址Integer对象 + * + * @return Integer对象,如果path不存在或者Integer不存在,返回null + */ + public Integer getInt(final String path) { + String result = this.getString(path); + if (null == result) { + return null; + } + + try { + return Integer.valueOf(result); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("任务读取配置文件出错. 配置文件路径[%s] 值非法, 期望是整数类型: %s. 
请检查您的配置并作出修改.", path, + e.getMessage())); + } + } + + /** + * 根据用户提供的json path,寻址Integer对象,如果对象不存在,返回默认Integer对象 + * + * @return Integer对象,如果path不存在或者Integer不存在,返回默认Integer对象 + */ + public Integer getInt(final String path, int defaultValue) { + Integer object = this.getInt(path); + if (null == object) { + return defaultValue; + } + return object; + } + + /** + * 根据用户提供的json path,寻址Long对象 + * + * @return Long对象,如果path不存在或者Long不存在,返回null + */ + public Long getLong(final String path) { + String result = this.getString(path); + if (null == result) { + return null; + } + + try { + return Long.valueOf(result); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("任务读取配置文件出错. 配置文件路径[%s] 值非法, 期望是整数类型: %s. 请检查您的配置并作出修改.", path, + e.getMessage())); + } + } + + /** + * 根据用户提供的json path,寻址Long对象,如果对象不存在,返回默认Long对象 + * + * @return Long对象,如果path不存在或者Integer不存在,返回默认Long对象 + */ + public Long getLong(final String path, long defaultValue) { + Long result = this.getLong(path); + if (null == result) { + return defaultValue; + } + return result; + } + + /** + * 根据用户提供的json path,寻址Double对象 + * + * @return Double对象,如果path不存在或者Double不存在,返回null + */ + public Double getDouble(final String path) { + String result = this.getString(path); + if (null == result) { + return null; + } + + try { + return Double.valueOf(result); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("任务读取配置文件出错. 配置文件路径[%s] 值非法, 期望是浮点类型: %s. 请检查您的配置并作出修改.", path, + e.getMessage())); + } + } + + /** + * 根据用户提供的json path,寻址Double对象,如果对象不存在,返回默认Double对象 + * + * @return Double对象,如果path不存在或者Double不存在,返回默认Double对象 + */ + public Double getDouble(final String path, double defaultValue) { + Double result = this.getDouble(path); + if (null == result) { + return defaultValue; + } + return result; + } + + /** + * 根据用户提供的json path,寻址List对象,如果对象不存在,返回null + */ + @SuppressWarnings("unchecked") + public List getList(final String path) { + List list = this.get(path, List.class); + if (null == list) { + return null; + } + return list; + } + + /** + * 根据用户提供的json path,寻址List对象,如果对象不存在,返回null + */ + @SuppressWarnings("unchecked") + public List getList(final String path, Class t) { + Object object = this.get(path, List.class); + if (null == object) { + return null; + } + + List result = new ArrayList(); + + List origin = (List) object; + for (final Object each : origin) { + result.add((T) each); + } + + return result; + } + + /** + * 根据用户提供的json path,寻址List对象,如果对象不存在,返回默认List + */ + @SuppressWarnings("unchecked") + public List getList(final String path, + final List defaultList) { + Object object = this.getList(path); + if (null == object) { + return defaultList; + } + return (List) object; + } + + /** + * 根据用户提供的json path,寻址List对象,如果对象不存在,返回默认List + */ + public List getList(final String path, final List defaultList, + Class t) { + List list = this.getList(path, t); + if (null == list) { + return defaultList; + } + return list; + } + + /** + * 根据用户提供的json path,寻址包含Configuration的List,如果对象不存在,返回默认null + */ + public List getListConfiguration(final String path) { + List lists = getList(path); + if (lists == null) { + return null; + } + + List result = new ArrayList(); + for (final Object object : lists) { + result.add(Configuration.from(Configuration.toJSONString(object))); + } + return result; + } + + /** + * 根据用户提供的json path,寻址Map对象,如果对象不存在,返回null + */ + @SuppressWarnings("unchecked") + public Map getMap(final String path) { + Map result = 
this.get(path, Map.class); + if (null == result) { + return null; + } + return result; + } + + /** + * 根据用户提供的json path,寻址Map对象,如果对象不存在,返回null; + */ + @SuppressWarnings("unchecked") + public Map getMap(final String path, Class t) { + Map map = this.get(path, Map.class); + if (null == map) { + return null; + } + + Map result = new HashMap(); + for (final String key : map.keySet()) { + result.put(key, (T) map.get(key)); + } + + return result; + } + + /** + * 根据用户提供的json path,寻址Map对象,如果对象不存在,返回默认map + */ + @SuppressWarnings("unchecked") + public Map getMap(final String path, + final Map defaultMap) { + Object object = this.getMap(path); + if (null == object) { + return defaultMap; + } + return (Map) object; + } + + /** + * 根据用户提供的json path,寻址Map对象,如果对象不存在,返回默认map + */ + public Map getMap(final String path, + final Map defaultMap, Class t) { + Map result = getMap(path, t); + if (null == result) { + return defaultMap; + } + return result; + } + + /** + * 根据用户提供的json path,寻址包含Configuration的Map,如果对象不存在,返回默认null + */ + @SuppressWarnings("unchecked") + public Map getMapConfiguration(final String path) { + Map map = this.get(path, Map.class); + if (null == map) { + return null; + } + + Map result = new HashMap(); + for (final String key : map.keySet()) { + result.put(key, Configuration.from(Configuration.toJSONString(map + .get(key)))); + } + + return result; + } + + /** + * 根据用户提供的json path,寻址具体的对象,并转为用户提供的类型 + *

+ *
+ *

+ * NOTE: 目前仅支持Map以及List下标寻址, 例如: + *

+ *
+ *

+ * 对于如下JSON + *

+ * {"a": {"b": {"c": [0,1,2,3]}}} + *

+ * config.get("") 返回整个Map
+ * config.get("a") 返回a下属整个Map
+ * config.get("a.b.c") 返回c对应的数组List
+ * config.get("a.b.c[0]") 返回数字0 + * + * @return Java表示的JSON对象,如果转型失败,将抛出异常 + */ + @SuppressWarnings("unchecked") + public T get(final String path, Class clazz) { + this.checkPath(path); + return (T) this.get(path); + } + + /** + * 格式化Configuration输出 + */ + public String beautify() { + return JSON.toJSONString(this.getInternal(), + SerializerFeature.PrettyFormat); + } + + /** + * 根据用户提供的json path,插入指定对象,并返回之前存在的对象(如果存在) + *

+ *
+ *

+ * 目前仅支持.以及数组下标寻址, 例如: + *

+ *
+ *

+ * config.set("a.b.c[3]", object); + *

+ *
+ * 对于插入对象,Configuration不做任何限制,但是请务必保证该对象是简单对象(包括Map、List),不要使用自定义对象,否则后续对于JSON序列化等情况会出现未定义行为。 + * + * @param path + * JSON path对象 + * @param object + * 需要插入的对象 + * @return Java表示的JSON对象 + */ + public Object set(final String path, final Object object) { + checkPath(path); + + Object result = this.get(path); + + setObject(path, extractConfiguration(object)); + + return result; + } + + /** + * 获取Configuration下所有叶子节点的key + *

+ *
+ *

+ * 对于
+ *

+ * {"a": {"b": {"c": [0,1,2,3]}}, "x": "y"} + *

+ * 下属的key包括: a.b.c[0],a.b.c[1],a.b.c[2],a.b.c[3],x + */ + public Set getKeys() { + Set collect = new HashSet(); + this.getKeysRecursive(this.getInternal(), "", collect); + return collect; + } + + /** + * 删除path对应的值,如果path不存在,将抛出异常。 + */ + public Object remove(final String path) { + final Object result = this.get(path); + if (null == result) { + throw DataXException.asDataXException( + CommonErrorCode.RUNTIME_ERROR, + String.format("配置文件对应Key[%s]并不存在,该情况是代码编程错误. 请联系DataX团队的同学.", path)); + } + + this.set(path, null); + return result; + } + + /** + * 合并其他Configuration,并修改两者冲突的KV配置 + * + * @param another + * 合并加入的第三方Configuration + * @param updateWhenConflict + * 当合并双方出现KV冲突时候,选择更新当前KV,或者忽略该KV + * @return 返回合并后对象 + */ + public Configuration merge(final Configuration another, + boolean updateWhenConflict) { + Set keys = another.getKeys(); + + for (final String key : keys) { + // 如果使用更新策略,凡是another存在的key,均需要更新 + if (updateWhenConflict) { + this.set(key, another.get(key)); + continue; + } + + // 使用忽略策略,只有another Configuration存在但是当前Configuration不存在的key,才需要更新 + boolean isCurrentExists = this.get(key) != null; + if (isCurrentExists) { + continue; + } + + this.set(key, another.get(key)); + } + return this; + } + + @Override + public String toString() { + return this.toJSON(); + } + + /** + * 将Configuration作为JSON输出 + */ + public String toJSON() { + return Configuration.toJSONString(this.getInternal()); + } + + /** + * 拷贝当前Configuration,注意,这里使用了深拷贝,避免冲突 + */ + public Configuration clone() { + Configuration config = Configuration + .from(Configuration.toJSONString(this.getInternal())); + config.addSecretKeyPath(this.secretKeyPathSet); + return config; + } + + /** + * 按照configuration要求格式的path + * 比如: + * a.b.c + * a.b[2].c + * @param path + */ + public void addSecretKeyPath(String path) { + if(StringUtils.isNotBlank(path)) { + this.secretKeyPathSet.add(path); + } + } + + public void addSecretKeyPath(Set pathSet) { + if(pathSet != null) { + this.secretKeyPathSet.addAll(pathSet); + } + } + + public void setSecretKeyPathSet(Set keyPathSet) { + if(keyPathSet != null) { + this.secretKeyPathSet = keyPathSet; + } + } + + public boolean isSecretPath(String path) { + return this.secretKeyPathSet.contains(path); + } + + @SuppressWarnings("unchecked") + void getKeysRecursive(final Object current, String path, Set collect) { + boolean isRegularElement = !(current instanceof Map || current instanceof List); + if (isRegularElement) { + collect.add(path); + return; + } + + boolean isMap = current instanceof Map; + if (isMap) { + Map mapping = ((Map) current); + for (final String key : mapping.keySet()) { + if (StringUtils.isBlank(path)) { + getKeysRecursive(mapping.get(key), key.trim(), collect); + } else { + getKeysRecursive(mapping.get(key), path + "." 
+ key.trim(), + collect); + } + } + return; + } + + boolean isList = current instanceof List; + if (isList) { + List lists = (List) current; + for (int i = 0; i < lists.size(); i++) { + getKeysRecursive(lists.get(i), path + String.format("[%d]", i), + collect); + } + return; + } + + return; + } + + public Object getInternal() { + return this.root; + } + + private void setObject(final String path, final Object object) { + Object newRoot = setObjectRecursive(this.root, split2List(path), 0, + object); + + if (isSuitForRoot(newRoot)) { + this.root = newRoot; + return; + } + + throw DataXException.asDataXException(CommonErrorCode.RUNTIME_ERROR, + String.format("值[%s]无法适配您提供[%s], 该异常代表系统编程错误, 请联系DataX开发团队!", + ToStringBuilder.reflectionToString(object), path)); + } + + @SuppressWarnings("unchecked") + private Object extractConfiguration(final Object object) { + if (object instanceof Configuration) { + return extractFromConfiguration(object); + } + + if (object instanceof List) { + List result = new ArrayList(); + for (final Object each : (List) object) { + result.add(extractFromConfiguration(each)); + } + return result; + } + + if (object instanceof Map) { + Map result = new HashMap(); + for (final String key : ((Map) object).keySet()) { + result.put(key, + extractFromConfiguration(((Map) object) + .get(key))); + } + return result; + } + + return object; + } + + private Object extractFromConfiguration(final Object object) { + if (object instanceof Configuration) { + return ((Configuration) object).getInternal(); + } + + return object; + } + + Object buildObject(final List paths, final Object object) { + if (null == paths) { + throw DataXException.asDataXException( + CommonErrorCode.RUNTIME_ERROR, + "Path不能为null,该异常代表系统编程错误, 请联系DataX开发团队 !"); + } + + if (1 == paths.size() && StringUtils.isBlank(paths.get(0))) { + return object; + } + + Object child = object; + for (int i = paths.size() - 1; i >= 0; i--) { + String path = paths.get(i); + + if (isPathMap(path)) { + Map mapping = new HashMap(); + mapping.put(path, child); + child = mapping; + continue; + } + + if (isPathList(path)) { + List lists = new ArrayList( + this.getIndex(path) + 1); + expand(lists, this.getIndex(path) + 1); + lists.set(this.getIndex(path), child); + child = lists; + continue; + } + + throw DataXException.asDataXException( + CommonErrorCode.RUNTIME_ERROR, String.format( + "路径[%s]出现非法值类型[%s],该异常代表系统编程错误, 请联系DataX开发团队! 
.", + StringUtils.join(paths, "."), path)); + } + + return child; + } + + @SuppressWarnings("unchecked") + Object setObjectRecursive(Object current, final List paths, + int index, final Object value) { + + // 如果是已经超出path,我们就返回value即可,作为最底层叶子节点 + boolean isLastIndex = index == paths.size(); + if (isLastIndex) { + return value; + } + + String path = paths.get(index).trim(); + boolean isNeedMap = isPathMap(path); + if (isNeedMap) { + Map mapping; + + // 当前不是map,因此全部替换为map,并返回新建的map对象 + boolean isCurrentMap = current instanceof Map; + if (!isCurrentMap) { + mapping = new HashMap(); + mapping.put( + path, + buildObject(paths.subList(index + 1, paths.size()), + value)); + return mapping; + } + + // 当前是map,但是没有对应的key,也就是我们需要新建对象插入该map,并返回该map + mapping = ((Map) current); + boolean hasSameKey = mapping.containsKey(path); + if (!hasSameKey) { + mapping.put( + path, + buildObject(paths.subList(index + 1, paths.size()), + value)); + return mapping; + } + + // 当前是map,而且还竟然存在这个值,好吧,继续递归遍历 + current = mapping.get(path); + mapping.put(path, + setObjectRecursive(current, paths, index + 1, value)); + return mapping; + } + + boolean isNeedList = isPathList(path); + if (isNeedList) { + List lists; + int listIndexer = getIndex(path); + + // 当前是list,直接新建并返回即可 + boolean isCurrentList = current instanceof List; + if (!isCurrentList) { + lists = expand(new ArrayList(), listIndexer + 1); + lists.set( + listIndexer, + buildObject(paths.subList(index + 1, paths.size()), + value)); + return lists; + } + + // 当前是list,但是对应的indexer是没有具体的值,也就是我们新建对象然后插入到该list,并返回该List + lists = (List) current; + lists = expand(lists, listIndexer + 1); + + boolean hasSameIndex = lists.get(listIndexer) != null; + if (!hasSameIndex) { + lists.set( + listIndexer, + buildObject(paths.subList(index + 1, paths.size()), + value)); + return lists; + } + + // 当前是list,并且存在对应的index,没有办法继续递归寻找 + current = lists.get(listIndexer); + lists.set(listIndexer, + setObjectRecursive(current, paths, index + 1, value)); + return lists; + } + + throw DataXException.asDataXException(CommonErrorCode.RUNTIME_ERROR, + "该异常代表系统编程错误, 请联系DataX开发团队 !"); + } + + private Object findObject(final String path) { + boolean isRootQuery = StringUtils.isBlank(path); + if (isRootQuery) { + return this.root; + } + + Object target = this.root; + + for (final String each : split2List(path)) { + if (isPathMap(each)) { + target = findObjectInMap(target, each); + continue; + } else { + target = findObjectInList(target, each); + continue; + } + } + + return target; + } + + @SuppressWarnings("unchecked") + private Object findObjectInMap(final Object target, final String index) { + boolean isMap = (target instanceof Map); + if (!isMap) { + throw new IllegalArgumentException(String.format( + "您提供的配置文件有误. 路径[%s]需要配置Json格式的Map对象,但该节点发现实际类型是[%s]. 请检查您的配置并作出修改.", + index, target.getClass().toString())); + } + + Object result = ((Map) target).get(index); + if (null == result) { + throw new IllegalArgumentException(String.format( + "您提供的配置文件有误. 路径[%s]值为null,datax无法识别该配置. 请检查您的配置并作出修改.", index)); + } + + return result; + } + + @SuppressWarnings({ "unchecked" }) + private Object findObjectInList(final Object target, final String each) { + boolean isList = (target instanceof List); + if (!isList) { + throw new IllegalArgumentException(String.format( + "您提供的配置文件有误. 路径[%s]需要配置Json格式的Map对象,但该节点发现实际类型是[%s]. 
请检查您的配置并作出修改.", + each, target.getClass().toString())); + } + + String index = each.replace("[", "").replace("]", ""); + if (!StringUtils.isNumeric(index)) { + throw new IllegalArgumentException( + String.format( + "系统编程错误,列表下标必须为数字类型,但该节点发现实际类型是[%s] ,该异常代表系统编程错误, 请联系DataX开发团队 !", + index)); + } + + return ((List) target).get(Integer.valueOf(index)); + } + + private List expand(List list, int size) { + int expand = size - list.size(); + while (expand-- > 0) { + list.add(null); + } + return list; + } + + private boolean isPathList(final String path) { + return path.contains("[") && path.contains("]"); + } + + private boolean isPathMap(final String path) { + return StringUtils.isNotBlank(path) && !isPathList(path); + } + + private int getIndex(final String index) { + return Integer.valueOf(index.replace("[", "").replace("]", "")); + } + + private boolean isSuitForRoot(final Object object) { + if (null != object && (object instanceof List || object instanceof Map)) { + return true; + } + + return false; + } + + private String split(final String path) { + return StringUtils.replace(path, "[", ".["); + } + + private List split2List(final String path) { + return Arrays.asList(StringUtils.split(split(path), ".")); + } + + private void checkPath(final String path) { + if (null == path) { + throw new IllegalArgumentException( + "系统编程错误, 该异常代表系统编程错误, 请联系DataX开发团队!."); + } + + for (final String each : StringUtils.split(".")) { + if (StringUtils.isBlank(each)) { + throw new IllegalArgumentException(String.format( + "系统编程错误, 路径[%s]不合法, 路径层次之间不能出现空白字符 .", path)); + } + } + } + + @SuppressWarnings("unused") + private String toJSONPath(final String path) { + return (StringUtils.isBlank(path) ? "$" : "$." + path).replace("$.[", + "$["); + } + + private static void checkJSON(final String json) { + if (StringUtils.isBlank(json)) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + "配置信息错误. 因为您提供的配置信息不是合法的JSON格式, JSON不能为空白. 请按照标准json格式提供配置信息. "); + } + } + + private Configuration(final String json) { + try { + this.root = JSON.parse(json); + } catch (Exception e) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + String.format("配置信息错误. 您提供的配置信息不是合法的JSON格式: %s . 请按照标准json格式提供配置信息. ", e.getMessage())); + } + } + + private static String toJSONString(final Object object) { + return JSON.toJSONString(object); + } + + public Set getSecretKeyPathSet() { + return secretKeyPathSet; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/FilterUtil.java b/common/src/main/java/com/alibaba/datax/common/util/FilterUtil.java new file mode 100755 index 000000000..37b319a19 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/FilterUtil.java @@ -0,0 +1,52 @@ +package com.alibaba.datax.common.util; + +import java.util.*; +import java.util.regex.Pattern; + +/** + * 提供从 List 中根据 regular 过滤的通用工具(返回值已经去重). 
使用场景,比如:odpsreader + * 的分区筛选,hdfsreader/txtfilereader的路径筛选等 + */ +public final class FilterUtil { + + //已经去重 + public static List filterByRegular(List allStrs, + String regular) { + List matchedValues = new ArrayList(); + + // 语法习惯上的兼容处理(pt=* 实际正则应该是:pt=.*) + String newReqular = regular.replace(".*", "*").replace("*", ".*"); + + Pattern p = Pattern.compile(newReqular); + + for (String partition : allStrs) { + if (p.matcher(partition).matches()) { + if (!matchedValues.contains(partition)) { + matchedValues.add(partition); + } + } + } + + return matchedValues; + } + + //已经去重 + public static List filterByRegulars(List allStrs, + List regulars) { + List matchedValues = new ArrayList(); + + List tempMatched = null; + for (String regular : regulars) { + tempMatched = filterByRegular(allStrs, regular); + if (null != tempMatched && !tempMatched.isEmpty()) { + for (String temp : tempMatched) { + if (!matchedValues.contains(temp)) { + matchedValues.add(temp); + } + } + } + } + + return matchedValues; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/HostUtils.java b/common/src/main/java/com/alibaba/datax/common/util/HostUtils.java new file mode 100644 index 000000000..3980076c6 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/HostUtils.java @@ -0,0 +1,50 @@ +package com.alibaba.datax.common.util; + +import com.google.common.base.CharMatcher; +import com.google.common.io.ByteStreams; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.net.InetAddress; +import java.net.UnknownHostException; + +/** + * Created by liqiang on 15/8/25. + */ +public class HostUtils { + + public static final String IP; + public static final String HOSTNAME; + private static final Logger log = LoggerFactory.getLogger(HostUtils.class); + + static { + String ip; + String hostname; + try { + InetAddress addr = InetAddress.getLocalHost(); + ip = addr.getHostAddress(); + hostname = addr.getHostName(); + } catch (UnknownHostException e) { + log.error("Can't find out address: " + e.getMessage(), e); + ip = "UNKNOWN"; + hostname = "UNKNOWN"; + } + if (ip.equals("127.0.0.1") || ip.equals("::1") || ip.equals("UNKNOWN")) { + try { + Process process = Runtime.getRuntime().exec("hostname -i"); + if (process.waitFor() == 0) { + ip = new String(ByteStreams.toByteArray(process.getInputStream()), "UTF8"); + } + process = Runtime.getRuntime().exec("hostname"); + if (process.waitFor() == 0) { + hostname = CharMatcher.BREAKING_WHITESPACE.trimFrom(new String(ByteStreams.toByteArray(process.getInputStream()), "UTF8")); + } + } catch (Exception e) { + log.warn("get hostname failed {}", e.getMessage()); + } + } + IP = ip; + HOSTNAME = hostname; + log.info("IP {} HOSTNAME {}", IP, HOSTNAME); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/ListUtil.java b/common/src/main/java/com/alibaba/datax/common/util/ListUtil.java new file mode 100755 index 000000000..d7a5b7646 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/ListUtil.java @@ -0,0 +1,139 @@ +package com.alibaba.datax.common.util; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang3.StringUtils; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; + +/** + * 提供针对 DataX中使用的 List 较为常见的一些封装。 比如:checkIfValueDuplicate 可以用于检查用户配置的 writer + * 的列不能重复。makeSureNoValueDuplicate亦然,只是会严格报错。 + */ +public final class ListUtil { + + public static boolean 
checkIfValueDuplicate(List aList, + boolean caseSensitive) { + if (null == aList || aList.isEmpty()) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + "您提供的作业配置有误,List不能为空."); + } + + try { + makeSureNoValueDuplicate(aList, caseSensitive); + } catch (Exception e) { + return true; + } + return false; + } + + public static void makeSureNoValueDuplicate(List aList, + boolean caseSensitive) { + if (null == aList || aList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + + if (1 == aList.size()) { + return; + } else { + List list = null; + if (!caseSensitive) { + list = valueToLowerCase(aList); + } else { + list = new ArrayList(aList); + } + + Collections.sort(list); + + for (int i = 0, len = list.size() - 1; i < len; i++) { + if (list.get(i).equals(list.get(i + 1))) { + throw DataXException + .asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format( + "您提供的作业配置信息有误, String:[%s] 不允许重复出现在列表中: [%s].", + list.get(i), + StringUtils.join(aList, ","))); + } + } + } + } + + public static boolean checkIfBInA(List aList, List bList, + boolean caseSensitive) { + if (null == aList || aList.isEmpty() || null == bList + || bList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + + try { + makeSureBInA(aList, bList, caseSensitive); + } catch (Exception e) { + return false; + } + return true; + } + + public static void makeSureBInA(List aList, List bList, + boolean caseSensitive) { + if (null == aList || aList.isEmpty() || null == bList + || bList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + + List all = null; + List part = null; + + if (!caseSensitive) { + all = valueToLowerCase(aList); + part = valueToLowerCase(bList); + } else { + all = new ArrayList(aList); + part = new ArrayList(bList); + } + + for (String oneValue : part) { + if (!all.contains(oneValue)) { + throw DataXException + .asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format( + "您提供的作业配置信息有误, String:[%s] 不存在于列表中:[%s].", + oneValue, StringUtils.join(aList, ","))); + } + } + + } + + public static boolean checkIfValueSame(List aList) { + if (null == aList || aList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + + if (1 == aList.size()) { + return true; + } else { + Boolean firstValue = aList.get(0); + for (int i = 1, len = aList.size(); i < len; i++) { + if (firstValue.booleanValue() != aList.get(i).booleanValue()) { + return false; + } + } + return true; + } + } + + public static List valueToLowerCase(List aList) { + if (null == aList || aList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + List result = new ArrayList(aList.size()); + for (String oneValue : aList) { + result.add(null != oneValue ? oneValue.toLowerCase() : null); + } + + return result; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/RangeSplitUtil.java b/common/src/main/java/com/alibaba/datax/common/util/RangeSplitUtil.java new file mode 100755 index 000000000..791f9ea12 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/RangeSplitUtil.java @@ -0,0 +1,209 @@ +package com.alibaba.datax.common.util; + +import org.apache.commons.lang3.tuple.ImmutablePair; +import org.apache.commons.lang3.tuple.Pair; + +import java.math.BigInteger; +import java.util.*; + +/** + * 提供通用的根据数字范围、字符串范围等进行切分的通用功能. 
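A small sketch of how the range-splitting helpers behave; the expected outputs in the comments are derived from the code in this patch:

```java
import com.alibaba.datax.common.util.RangeSplitUtil;
import java.util.Arrays;

public class RangeSplitSketch {
    public static void main(String[] args) {
        // Split [0, 100] into 4 slices: returns the 5 boundary points, here [0, 25, 50, 75, 100].
        long[] points = RangeSplitUtil.doLongSplit(0L, 100L, 4);
        System.out.println(Arrays.toString(points));

        // Split an ASCII string range; the endpoints are always returned unchanged,
        // and the interior cut points here come out as "j" and "r".
        String[] cuts = RangeSplitUtil.doAsciiStringSplit("a", "z", 3);
        System.out.println(Arrays.toString(cuts));
    }
}
```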
+ */ +public final class RangeSplitUtil { + + public static String[] doAsciiStringSplit(String left, String right, int expectSliceNumber) { + int radix = 128; + + BigInteger[] tempResult = doBigIntegerSplit(stringToBigInteger(left, radix), + stringToBigInteger(right, radix), expectSliceNumber); + String[] result = new String[tempResult.length]; + + //处理第一个字符串(因为:在转换为数字,再还原的时候,如果首字符刚好是 basic,则不知道应该添加多少个 basic) + result[0] = left; + result[tempResult.length - 1] = right; + + for (int i = 1, len = tempResult.length - 1; i < len; i++) { + result[i] = bigIntegerToString(tempResult[i], radix); + } + + return result; + } + + + public static long[] doLongSplit(long left, long right, int expectSliceNumber) { + BigInteger[] result = doBigIntegerSplit(BigInteger.valueOf(left), + BigInteger.valueOf(right), expectSliceNumber); + long[] returnResult = new long[result.length]; + for (int i = 0, len = result.length; i < len; i++) { + returnResult[i] = result[i].longValue(); + } + return returnResult; + } + + public static BigInteger[] doBigIntegerSplit(BigInteger left, BigInteger right, int expectSliceNumber) { + if (expectSliceNumber < 1) { + throw new IllegalArgumentException(String.format( + "切分份数不能小于1. 此处:expectSliceNumber=[%s].", expectSliceNumber)); + } + + if (null == left || null == right) { + throw new IllegalArgumentException(String.format( + "对 BigInteger 进行切分时,其左右区间不能为 null. 此处:left=[%s],right=[%s].", left, right)); + } + + if (left.compareTo(right) == 0) { + return new BigInteger[]{left, right}; + } else { + // 调整大小顺序,确保 left < right + if (left.compareTo(right) > 0) { + BigInteger temp = left; + left = right; + right = temp; + } + + //left < right + BigInteger endAndStartGap = right.subtract(left); + + BigInteger step = endAndStartGap.divide(BigInteger.valueOf(expectSliceNumber)); + BigInteger remainder = endAndStartGap.remainder(BigInteger.valueOf(expectSliceNumber)); + + //remainder 不可能超过expectSliceNumber,所以不需要检查remainder的 Integer 的范围 + + // 这里不能 step.intValue()==0,因为可能溢出 + if (step.compareTo(BigInteger.ZERO) == 0) { + expectSliceNumber = remainder.intValue(); + } + + BigInteger[] result = new BigInteger[expectSliceNumber + 1]; + result[0] = left; + result[expectSliceNumber] = right; + + BigInteger lowerBound; + BigInteger upperBound = left; + for (int i = 1; i < expectSliceNumber; i++) { + lowerBound = upperBound; + upperBound = lowerBound.add(step); + upperBound = upperBound.add((remainder.compareTo(BigInteger.valueOf(i)) >= 0) + ? 
BigInteger.ONE : BigInteger.ZERO); + result[i] = upperBound; + } + + return result; + } + } + + private static void checkIfBetweenRange(int value, int left, int right) { + if (value < left || value > right) { + throw new IllegalArgumentException(String.format("parameter can not <[%s] or >[%s].", + left, right)); + } + } + + /** + * 由于只支持 ascii 码对应字符,所以radix 范围为[1,128] + */ + public static BigInteger stringToBigInteger(String aString, int radix) { + if (null == aString) { + throw new IllegalArgumentException("参数 bigInteger 不能为空."); + } + + checkIfBetweenRange(radix, 1, 128); + + BigInteger result = BigInteger.ZERO; + BigInteger radixBigInteger = BigInteger.valueOf(radix); + + int tempChar; + int k = 0; + + for (int i = aString.length() - 1; i >= 0; i--) { + tempChar = aString.charAt(i); + if (tempChar >= 128) { + throw new IllegalArgumentException(String.format("根据字符串进行切分时仅支持 ASCII 字符串,而字符串:[%s]非 ASCII 字符串.", aString)); + } + result = result.add(BigInteger.valueOf(tempChar).multiply(radixBigInteger.pow(k))); + k++; + } + + return result; + } + + /** + * 把BigInteger 转换为 String.注意:radix 和 basic 范围都为[1,128], radix + basic 的范围也必须在[1,128]. + */ + private static String bigIntegerToString(BigInteger bigInteger, int radix) { + if (null == bigInteger) { + throw new IllegalArgumentException("参数 bigInteger 不能为空."); + } + + checkIfBetweenRange(radix, 1, 128); + + StringBuilder resultStringBuilder = new StringBuilder(); + + List list = new ArrayList(); + BigInteger radixBigInteger = BigInteger.valueOf(radix); + BigInteger currentValue = bigInteger; + + BigInteger quotient = currentValue.divide(radixBigInteger); + while (quotient.compareTo(BigInteger.ZERO) > 0) { + list.add(currentValue.remainder(radixBigInteger).intValue()); + currentValue = currentValue.divide(radixBigInteger); + quotient = currentValue; + } + Collections.reverse(list); + + if (list.isEmpty()) { + list.add(0, bigInteger.remainder(radixBigInteger).intValue()); + } + + Map map = new HashMap(); + for (int i = 0; i < radix; i++) { + map.put(i, (char) (i)); + } + +// String msg = String.format("%s 转为 %s 进制,结果为:%s", bigInteger.longValue(), radix, list); +// System.out.println(msg); + + for (Integer aList : list) { + resultStringBuilder.append(map.get(aList)); + } + + return resultStringBuilder.toString(); + } + + /** + * 获取字符串中的最小字符和最大字符(依据 ascii 进行判断).要求字符串必须非空,并且为 ascii 字符串. + * 返回的Pair,left=最小字符,right=最大字符. + */ + public static Pair getMinAndMaxCharacter(String aString) { + if (!isPureAscii(aString)) { + throw new IllegalArgumentException(String.format("根据字符串进行切分时仅支持 ASCII 字符串,而字符串:[%s]非 ASCII 字符串.", aString)); + } + + char min = aString.charAt(0); + char max = min; + + char temp; + for (int i = 1, len = aString.length(); i < len; i++) { + temp = aString.charAt(i); + min = min < temp ? min : temp; + max = max > temp ? 
max : temp; + } + + return new ImmutablePair(min, max); + } + + private static boolean isPureAscii(String aString) { + if (null == aString) { + return false; + } + + for (int i = 0, len = aString.length(); i < len; i++) { + char ch = aString.charAt(i); + if (ch >= 127 || ch < 0) { + return false; + } + } + return true; + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/RetryUtil.java b/common/src/main/java/com/alibaba/datax/common/util/RetryUtil.java new file mode 100755 index 000000000..51f3f277c --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/RetryUtil.java @@ -0,0 +1,171 @@ +package com.alibaba.datax.common.util; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.concurrent.*; + +public final class RetryUtil { + + private static final Logger LOG = LoggerFactory.getLogger(RetryUtil.class); + + private static final long MAX_SLEEP_MILLISECOND = 256 * 1000; + + /** + * 重试次数工具方法. + * + * @param callable 实际逻辑 + * @param retryTimes 最大重试次数(>1) + * @param sleepTimeInMilliSecond 运行失败后休眠对应时间再重试 + * @param exponential 休眠时间是否指数递增 + * @param 返回值类型 + * @return 经过重试的callable的执行结果 + */ + public static T executeWithRetry(Callable callable, + int retryTimes, + long sleepTimeInMilliSecond, + boolean exponential) throws Exception { + Retry retry = new Retry(); + return retry.doRetry(callable, retryTimes, sleepTimeInMilliSecond, exponential); + } + + /** + * 在外部线程执行并且重试。每次执行需要在timeoutMs内执行完,不然视为失败。 + * 执行异步操作的线程池从外部传入,线程池的共享粒度由外部控制。比如,HttpClientUtil共享一个线程池。 + *

+ * 限制条件:仅仅能够在阻塞的时候interrupt线程 + * + * @param callable 实际逻辑 + * @param retryTimes 最大重试次数(>1) + * @param sleepTimeInMilliSecond 运行失败后休眠对应时间再重试 + * @param exponential 休眠时间是否指数递增 + * @param timeoutMs callable执行超时时间,毫秒 + * @param executor 执行异步操作的线程池 + * @param 返回值类型 + * @return 经过重试的callable的执行结果 + */ + public static T asyncExecuteWithRetry(Callable callable, + int retryTimes, + long sleepTimeInMilliSecond, + boolean exponential, + long timeoutMs, + ThreadPoolExecutor executor) throws Exception { + Retry retry = new AsyncRetry(timeoutMs, executor); + return retry.doRetry(callable, retryTimes, sleepTimeInMilliSecond, exponential); + } + + /** + * 创建异步执行的线程池。特性如下: + * core大小为0,初始状态下无线程,无初始消耗。 + * max大小为5,最多五个线程。 + * 60秒超时时间,闲置超过60秒线程会被回收。 + * 使用SynchronousQueue,任务不会排队,必须要有可用线程才能提交成功,否则会RejectedExecutionException。 + * + * @return 线程池 + */ + public static ThreadPoolExecutor createThreadPoolExecutor() { + return new ThreadPoolExecutor(0, 5, + 60L, TimeUnit.SECONDS, + new SynchronousQueue()); + } + + + private static class Retry { + + public T doRetry(Callable callable, int retryTimes, long sleepTimeInMilliSecond, boolean exponential) + throws Exception { + + if (null == callable) { + throw new IllegalArgumentException("系统编程错误, 入参callable不能为空 ! "); + } + + if (retryTimes < 1) { + throw new IllegalArgumentException(String.format( + "系统编程错误, 入参retrytime[%d]不能小于1 !", retryTimes)); + } + + Exception saveException = null; + for (int i = 0; i < retryTimes; i++) { + try { + return call(callable); + } catch (Exception e) { + saveException = e; + + if (i + 1 < retryTimes && sleepTimeInMilliSecond > 0) { + long startTime = System.currentTimeMillis(); + + long timeToSleep; + if (exponential) { + timeToSleep = sleepTimeInMilliSecond * (long) Math.pow(2, i); + if(timeToSleep >= MAX_SLEEP_MILLISECOND) { + timeToSleep = MAX_SLEEP_MILLISECOND; + } + } else { + timeToSleep = sleepTimeInMilliSecond; + if(timeToSleep >= MAX_SLEEP_MILLISECOND) { + timeToSleep = MAX_SLEEP_MILLISECOND; + } + } + + try { + Thread.sleep(timeToSleep); + } catch (InterruptedException ignored) { + } + + long realTimeSleep = System.currentTimeMillis()-startTime; + + LOG.error(String.format("Exception when calling callable, 即将尝试执行第%s次重试.本次重试计划等待[%s]ms,实际等待[%s]ms, 异常Msg:[%s]", + i+1, timeToSleep,realTimeSleep, e.getMessage())); + + } + } + } + throw saveException; + } + + protected T call(Callable callable) throws Exception { + return callable.call(); + } + } + + private static class AsyncRetry extends Retry { + + private long timeoutMs; + private ThreadPoolExecutor executor; + + public AsyncRetry(long timeoutMs, ThreadPoolExecutor executor) { + this.timeoutMs = timeoutMs; + this.executor = executor; + } + + /** + * 使用传入的线程池异步执行任务,并且等待。 + *
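A sketch of wrapping a flaky call with the synchronous retry helper defined above (RetryUtil comes from this patch; fetchOnce() is a made-up placeholder):

```java
import com.alibaba.datax.common.util.RetryUtil;
import java.util.concurrent.Callable;

public class RetrySketch {
    public static void main(String[] args) throws Exception {
        // Up to 3 attempts, sleeping 1000 ms after a failure; 'true' doubles the sleep on each retry
        // (the implementation above also caps a single sleep at 256 seconds).
        String result = RetryUtil.executeWithRetry(new Callable<String>() {
            @Override
            public String call() throws Exception {
                // Any logic that may throw; the last exception is rethrown once retries are exhausted.
                return fetchOnce();
            }
        }, 3, 1000L, true);
        System.out.println(result);
    }

    private static String fetchOnce() throws Exception {
        return "ok"; // placeholder for the real remote call
    }
}
```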

+ * future.get()方法,等待指定的毫秒数。如果任务在超时时间内结束,则正常返回。 + * 如果抛异常(可能是执行超时、执行异常、被其他线程cancel或interrupt),都记录日志并且网上抛异常。 + * 正常和非正常的情况都会判断任务是否结束,如果没有结束,则cancel任务。cancel参数为true,表示即使 + * 任务正在执行,也会interrupt线程。 + * + * @param callable + * @param + * @return + * @throws Exception + */ + @Override + protected T call(Callable callable) throws Exception { + Future future = executor.submit(callable); + try { + return future.get(timeoutMs, TimeUnit.MILLISECONDS); + } catch (Exception e) { + LOG.warn("Try once failed", e); + throw e; + } finally { + if (!future.isDone()) { + future.cancel(true); + LOG.warn("Try once task not done, cancel it, active count: " + executor.getActiveCount()); + } + } + } + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/StrUtil.java b/common/src/main/java/com/alibaba/datax/common/util/StrUtil.java new file mode 100755 index 000000000..82222b0d4 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/StrUtil.java @@ -0,0 +1,85 @@ +package com.alibaba.datax.common.util; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; + +import java.text.DecimalFormat; +import java.util.HashMap; +import java.util.Map; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public class StrUtil { + + private final static long KB_IN_BYTES = 1024; + + private final static long MB_IN_BYTES = 1024 * KB_IN_BYTES; + + private final static long GB_IN_BYTES = 1024 * MB_IN_BYTES; + + private final static long TB_IN_BYTES = 1024 * GB_IN_BYTES; + + private final static DecimalFormat df = new DecimalFormat("0.00"); + + private static final Pattern VARIABLE_PATTERN = Pattern + .compile("(\\$)\\{?(\\w+)\\}?"); + + private static String SYSTEM_ENCODING = System.getProperty("file.encoding"); + + static { + if (SYSTEM_ENCODING == null) { + SYSTEM_ENCODING = "UTF-8"; + } + } + + private StrUtil() { + } + + public static String stringify(long byteNumber) { + if (byteNumber / TB_IN_BYTES > 0) { + return df.format((double) byteNumber / (double) TB_IN_BYTES) + "TB"; + } else if (byteNumber / GB_IN_BYTES > 0) { + return df.format((double) byteNumber / (double) GB_IN_BYTES) + "GB"; + } else if (byteNumber / MB_IN_BYTES > 0) { + return df.format((double) byteNumber / (double) MB_IN_BYTES) + "MB"; + } else if (byteNumber / KB_IN_BYTES > 0) { + return df.format((double) byteNumber / (double) KB_IN_BYTES) + "KB"; + } else { + return String.valueOf(byteNumber) + "B"; + } + } + + + public static String replaceVariable(final String param) { + Map mapping = new HashMap(); + + Matcher matcher = VARIABLE_PATTERN.matcher(param); + while (matcher.find()) { + String variable = matcher.group(2); + String value = System.getProperty(variable); + if (StringUtils.isBlank(value)) { + value = matcher.group(); + } + mapping.put(matcher.group(), value); + } + + String retString = param; + for (final String key : mapping.keySet()) { + retString = retString.replace(key, mapping.get(key)); + } + + return retString; + } + + public static String compressMiddle(String s, int headLength, int tailLength) { + Validate.notNull(s, "Input string must not be null"); + Validate.isTrue(headLength > 0, "Head length must be larger than 0"); + Validate.isTrue(tailLength > 0, "Tail length must be larger than 0"); + + if(headLength + tailLength >= s.length()) { + return s; + } + return s.substring(0, headLength) + "..." 
+ s.substring(s.length() - tailLength); + } + +} diff --git a/common/src/test/java/com/alibaba/datax/common/base/BaseTest.java b/common/src/test/java/com/alibaba/datax/common/base/BaseTest.java new file mode 100755 index 000000000..bbc88d9bb --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/base/BaseTest.java @@ -0,0 +1,30 @@ +package com.alibaba.datax.common.base; + +import java.io.File; +import java.io.IOException; + +import org.apache.commons.io.FileUtils; +import org.apache.commons.lang3.StringUtils; +import org.junit.BeforeClass; + +import com.alibaba.datax.common.element.ColumnCast; +import com.alibaba.datax.common.element.ColumnCastTest; +import com.alibaba.datax.common.util.Configuration; +import org.junit.Test; + +public class BaseTest { + + @BeforeClass + public static void beforeClass() throws IOException { + String path = ColumnCastTest.class.getClassLoader().getResource(".") + .getFile(); + ColumnCast.bind(Configuration.from(FileUtils.readFileToString(new File( + StringUtils.join(new String[] { path, "all.json" }, + File.separator))))); + } + + @Test + public void emptyTest() { + + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/element/BoolColumnTest.java b/common/src/test/java/com/alibaba/datax/common/element/BoolColumnTest.java new file mode 100755 index 000000000..d1db4d0c6 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/element/BoolColumnTest.java @@ -0,0 +1,79 @@ +package com.alibaba.datax.common.element; + +import org.junit.Assert; +import org.junit.Test; + +import com.alibaba.datax.common.base.BaseTest; +import com.alibaba.datax.common.exception.DataXException; + +public class BoolColumnTest extends BaseTest { + @Test + public void test_true() { + BoolColumn bool = new BoolColumn(true); + Assert.assertTrue(bool.asBoolean().equals(true)); + Assert.assertTrue(bool.asString().equals("true")); + Assert.assertTrue(bool.asDouble().equals(1.0d)); + Assert.assertTrue(bool.asLong().equals(1L)); + + try { + bool.asDate(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + try { + bool.asBytes(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + } + + @Test + public void test_false() { + BoolColumn bool = new BoolColumn(false); + Assert.assertTrue(bool.asBoolean().equals(false)); + Assert.assertTrue(bool.asString().equals("false")); + Assert.assertTrue(bool.asDouble().equals(0.0d)); + Assert.assertTrue(bool.asLong().equals(0L)); + + try { + bool.asDate(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + try { + bool.asBytes(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + } + + @Test + public void test_null() { + BoolColumn bool = new BoolColumn(); + Assert.assertTrue(bool.asBoolean() == null); + Assert.assertTrue(bool.asString() == null); + Assert.assertTrue(bool.asDouble() == null); + Assert.assertTrue(bool.asLong() == null); + + try { + bool.asDate(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + try { + bool.asBytes(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + } + + @Test + public void test_nullReference() { + Boolean b = null; + BoolColumn boolColumn = new BoolColumn(b); + Assert.assertTrue(boolColumn.asBoolean() == null); + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/element/ColumnCastTest.java b/common/src/test/java/com/alibaba/datax/common/element/ColumnCastTest.java new file mode 100755 index 
000000000..ff800b5cb --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/element/ColumnCastTest.java @@ -0,0 +1,95 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.io.FileUtils; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.time.DateFormatUtils; +import org.junit.Assert; +import org.junit.Test; + +import java.io.File; +import java.io.IOException; +import java.sql.Date; +import java.sql.Time; +import java.text.ParseException; + +public class ColumnCastTest { + private Configuration produce() throws IOException { + String path = ColumnCastTest.class.getClassLoader().getResource(".") + .getFile(); + String content = FileUtils.readFileToString(new File(StringUtils.join( + new String[] { path, "all.json" }, File.separator))); + return Configuration.from(content); + } + + @Test + public void test_string() throws IOException, ParseException { + Configuration configuration = this.produce(); + StringCast.init(configuration); + + System.out.println(StringCast.asDate(new StringColumn("2014-09-18"))); + Assert.assertTrue(StringCast.asDate(new StringColumn("2014-09-18")) + .getTime() == 1410969600000L); + + Assert.assertTrue(StringCast.asDate(new StringColumn("20140918")) + .getTime() == 1410969600000L); + + Assert.assertTrue(StringCast.asDate(new StringColumn("08:00:00")) + .getTime() == 0L); + + Assert.assertTrue(StringCast.asDate( + new StringColumn("2014-09-18 16:00:00")).getTime() == 1411027200000L); + configuration + .set("common.column.datetimeFormat", "yyyy/MM/dd HH:mm:ss"); + StringCast.init(configuration); + Assert.assertTrue(StringCast.asDate( + new StringColumn("2014/09/18 16:00:00")).getTime() == 1411027200000L); + + configuration.set("common.column.timeZone", "GMT"); + StringCast.init(configuration); + + java.util.Date date = StringCast.asDate(new StringColumn( + "2014/09/18 16:00:00")); + System.out.println(DateFormatUtils.format(date, "yyyy/MM/dd HH:mm:ss")); + Assert.assertTrue("2014/09/19 00:00:00".equals(DateFormatUtils.format( + date, "yyyy/MM/dd HH:mm:ss"))); + + } + + @Test + public void test_date() throws IOException { + Assert.assertTrue(DateCast.asString( + new DateColumn(System.currentTimeMillis())).startsWith("201")); + + Configuration configuration = this.produce(); + configuration + .set("common.column.datetimeFormat", "MM/dd/yyyy HH:mm:ss"); + DateCast.init(configuration); + System.out.println(DateCast.asString(new DateColumn(System + .currentTimeMillis()))); + Assert.assertTrue(!DateCast.asString( + new DateColumn(System.currentTimeMillis())).startsWith("2014")); + + DateColumn dateColumn = new DateColumn(new Time(0L)); + System.out.println(dateColumn.asString()); + Assert.assertTrue(dateColumn.asString().equals("08:00:00")); + + configuration.set("common.column.timeZone", "GMT"); + DateCast.init(configuration); + System.err.println(DateCast.asString(dateColumn)); + Assert.assertTrue(dateColumn.asString().equals("00:00:00")); + + configuration.set("common.column.timeZone", "GMT+8"); + DateCast.init(configuration); + System.out.println(dateColumn.asString()); + Assert.assertTrue(dateColumn.asString().equals("08:00:00")); + + dateColumn = new DateColumn(new Date(0L)); + System.out.println(dateColumn.asString()); + Assert.assertTrue(dateColumn.asString().equals("1970-01-01")); + + dateColumn = new DateColumn(new java.util.Date(0L)); + System.out.println(dateColumn.asString()); + Assert.assertTrue(dateColumn.asString().equals("01/01/1970 08:00:00")); + 
} +} \ No newline at end of file diff --git a/common/src/test/java/com/alibaba/datax/common/element/DateColumnTest.java b/common/src/test/java/com/alibaba/datax/common/element/DateColumnTest.java new file mode 100755 index 000000000..2c063db12 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/element/DateColumnTest.java @@ -0,0 +1,109 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.base.BaseTest; +import com.alibaba.datax.common.exception.DataXException; +import org.junit.Assert; +import org.junit.Test; + +import java.util.Date; + +public class DateColumnTest extends BaseTest { + @Test + public void test() { + long time = System.currentTimeMillis(); + DateColumn date = new DateColumn(time); + Assert.assertTrue(date.getType().equals(Column.Type.DATE)); + Assert.assertTrue(date.asDate().getTime() == time); + Assert.assertTrue(date.asLong().equals(time)); + System.out.println(date.asString()); + Assert.assertTrue(date.asString().startsWith("201")); + + try { + date.asBytes(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + try { + date.asDouble(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + } + + @Test + public void test_null() { + DateColumn date = new DateColumn(); + DateColumn nul1 = new DateColumn((Long)null); + DateColumn nul2 = new DateColumn((Date)null); + DateColumn nul3 = new DateColumn((java.sql.Date)null); + DateColumn nul4 = new DateColumn((java.sql.Time)null); + DateColumn nul5 = new DateColumn((java.sql.Timestamp)null); + Assert.assertTrue(date.getType().equals(Column.Type.DATE)); + + Assert.assertTrue(date.asDate() == null); + Assert.assertTrue(date.asLong() == null); + Assert.assertTrue(date.asString() == null); + + Assert.assertTrue(nul1.asDate() == null); + Assert.assertTrue(nul1.asLong() == null); + Assert.assertTrue(nul1.asString() == null); + + Assert.assertTrue(nul2.asDate() == null); + Assert.assertTrue(nul2.asLong() == null); + Assert.assertTrue(nul2.asString() == null); + + Assert.assertTrue(nul3.asDate() == null); + Assert.assertTrue(nul3.asLong() == null); + Assert.assertTrue(nul3.asString() == null); + + Assert.assertTrue(nul4.asDate() == null); + Assert.assertTrue(nul4.asLong() == null); + Assert.assertTrue(nul4.asString() == null); + + Assert.assertTrue(nul5.asDate() == null); + Assert.assertTrue(nul5.asLong() == null); + Assert.assertTrue(nul5.asString() == null); + + try { + date.asBytes(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + try { + date.asDouble(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + try { + date.asBoolean(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + } + + @Test + public void testDataColumn() throws Exception { + DateColumn date = new DateColumn(1449925250000L); + Assert.assertEquals(date.asString(),"2015-12-12 21:00:50"); + Assert.assertEquals(date.asDate(),new Date(1449925250000L)); + + java.sql.Date dat2 = new java.sql.Date(1449925251001L); + date = new DateColumn(dat2); + Assert.assertEquals(date.asString(),"2015-12-12"); + Assert.assertEquals(date.asDate(),new Date(1449925251001L)); + + java.sql.Time dat3 = new java.sql.Time(1449925252002L); + date = new DateColumn(dat3); + Assert.assertEquals(date.asString(),"21:00:52"); + Assert.assertEquals(date.asDate(),new Date(1449925252002L)); + + java.sql.Timestamp ts = new java.sql.Timestamp(1449925253003L); + date = new DateColumn(ts); + 
Assert.assertEquals(date.asString(),"2015-12-12 21:00:53"); + Assert.assertEquals(date.asDate(),new Date(1449925253003L)); + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/element/DoubleColumnTest.java b/common/src/test/java/com/alibaba/datax/common/element/DoubleColumnTest.java new file mode 100755 index 000000000..ab8d12ed7 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/element/DoubleColumnTest.java @@ -0,0 +1,250 @@ +package com.alibaba.datax.common.element; + +import org.junit.Assert; +import org.junit.Test; + +import java.math.BigDecimal; +import java.math.BigInteger; + +public class DoubleColumnTest { + @Test + public void test_null() { + DoubleColumn column = new DoubleColumn(); + + System.out.println(column.asString()); + Assert.assertTrue(column.asString() == null); + System.out.println(column.toString()); + Assert.assertTrue(column.toString().equals( + "{\"byteSize\":0,\"type\":\"DOUBLE\"}")); + Assert.assertTrue(column.asDouble() == null); + Assert.assertTrue(column.asString() == null); + + try { + Assert.assertTrue(column.asBoolean() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + try { + Assert.assertTrue(column.asDate() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_double() { + DoubleColumn column = new DoubleColumn(1.0d); + + System.out.println(column.asString()); + Assert.assertTrue(column.asString().equals("1.0")); + System.out.println(column.toString()); + Assert.assertTrue(column.toString().equals( + "{\"byteSize\":3,\"rawData\":\"1.0\",\"type\":\"DOUBLE\"}")); + + System.out.println(column.asDouble()); + Assert.assertTrue(column.asDouble().equals(1.0d)); + + try { + Assert.assertTrue(column.asBoolean() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_float() { + DoubleColumn column = new DoubleColumn(1.0f); + + System.out.println(column.asString()); + Assert.assertTrue(column.asString().equals("1.0")); + System.out.println(column.toString()); + Assert.assertTrue(column.toString().equals( + "{\"byteSize\":3,\"rawData\":\"1.0\",\"type\":\"DOUBLE\"}")); + + System.out.println(column.asDouble()); + Assert.assertTrue(column.asDouble().equals(1.0d)); + + try { + Assert.assertTrue(column.asBoolean() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_string() { + DoubleColumn column = new DoubleColumn("1.0"); + + System.out.println(column.asString()); + Assert.assertTrue(column.asString().equals("1.0")); + System.out.println(column.toString()); + Assert.assertTrue(column.toString().equals( + "{\"byteSize\":3,\"rawData\":\"1.0\",\"type\":\"DOUBLE\"}")); + + System.out.println(column.asDouble()); + Assert.assertTrue(column.asDouble().equals(1.0d)); + + try { + Assert.assertTrue(column.asBoolean() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + try { + Assert.assertTrue(column.asBytes() == 
null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_BigDecimal() { + DoubleColumn column = new DoubleColumn(new BigDecimal("1E-100")); + + System.out.println(column.asString()); + System.out.println(column.asString().length()); + Assert.assertTrue(column.asString().length() == 102); + + Assert.assertTrue(column.asString().equals( + new BigDecimal("1E-100").toPlainString())); + + Assert.assertTrue(column + .toString() + .equals("{\"byteSize\":102,\"rawData\":\"0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001\",\"type\":\"DOUBLE\"}")); + + System.out.println(column.asDouble()); + Assert.assertTrue(column.asDouble().equals(1.0E-100)); + + try { + Assert.assertTrue(column.asBoolean() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_overflow() { + DoubleColumn column = new DoubleColumn(new BigDecimal("1E-1000")); + + System.out.println(column.asString()); + + Assert.assertTrue(column.asBigDecimal().equals( + new BigDecimal("1E-1000"))); + + Assert.assertTrue(column.asBigInteger().compareTo(BigInteger.ZERO) == 0); + Assert.assertTrue(column.asLong().equals(0L)); + + try { + column.asDouble(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + + column = new DoubleColumn(new BigDecimal("1E1000")); + Assert.assertTrue(column.asBigDecimal().compareTo( + new BigDecimal("1E1000")) == 0); + Assert.assertTrue(column.asBigInteger().compareTo( + new BigDecimal("1E1000").toBigInteger()) == 0); + try { + column.asDouble(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + + try { + column.asLong(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + } + + @Test + public void test_NaN() { + DoubleColumn column = new DoubleColumn(String.valueOf(Double.NaN)); + Assert.assertTrue(column.asString().equals("NaN")); + try { + column.asBigDecimal(); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + column = new DoubleColumn(String.valueOf(Double.POSITIVE_INFINITY)); + Assert.assertTrue(column.asString().equals("Infinity")); + try { + column.asBigDecimal(); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + column = new DoubleColumn(String.valueOf(Double.NEGATIVE_INFINITY)); + Assert.assertTrue(column.asString().equals("-Infinity")); + try { + column.asBigDecimal(); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_doubleFormat() { + System.out.println(new BigDecimal("9801523474.1234567890987654321") + .toPlainString()); + + System.out.println("double: " + 9801523474.399621d); + System.out.println("bigdecimal: " + new BigDecimal(9801523474.399621d).toPlainString()); + System.out.println("bigdecimal: " + new BigDecimal(String.valueOf(9801523474.399621d)).toPlainString()); + System.out.println(new DoubleColumn(9801523474.399621d).asString()); + Assert.assertTrue("9801523474.39962".equals(new DoubleColumn( + 9801523474.399621d).asString())); + + Assert.assertTrue(!new DoubleColumn(Double.MAX_VALUE).asString() + .contains("E")); + Assert.assertTrue(!new 
DoubleColumn(Float.MAX_VALUE).asString() + .contains("E")); + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/element/LongColumnTest.java b/common/src/test/java/com/alibaba/datax/common/element/LongColumnTest.java new file mode 100755 index 000000000..aa9f242f9 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/element/LongColumnTest.java @@ -0,0 +1,217 @@ +package com.alibaba.datax.common.element; + +import org.junit.Assert; +import org.junit.Test; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.sql.Date; + +public class LongColumnTest { + + @Test + public void test_null() { + LongColumn column = new LongColumn(); + System.out.println(column.asString()); + Assert.assertTrue(column.asString() == null); + System.out.println(column.toString()); + Assert.assertTrue(column.toString().equals( + "{\"byteSize\":0,\"type\":\"LONG\"}")); + Assert.assertTrue(column.asBoolean() == null); + Assert.assertTrue(column.asDouble() == null); + Assert.assertTrue(column.asString() == null); + Assert.assertTrue(column.asDate() == null); + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_normal() { + LongColumn column = new LongColumn(1); + System.out.println(column.asString()); + Assert.assertTrue(column.asString().equals("1")); + System.out.println(column.toString()); + Assert.assertEquals(column.toString(), + "{\"byteSize\":8,\"rawData\":1,\"type\":\"LONG\"}"); + Assert.assertTrue(column.asBoolean().equals(true)); + + System.out.println(column.asDouble()); + Assert.assertTrue(column.asDouble().equals(1.0d)); + Assert.assertTrue(column.asDate().equals(new Date(1L))); + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_max() { + LongColumn column = new LongColumn(Long.MAX_VALUE); + System.out.println(column.asString()); + Assert.assertTrue(column.asString().equals( + String.valueOf(Long.MAX_VALUE))); + System.out.println(column.toString()); + Assert.assertTrue(column + .toString() + .equals(String + .format("{\"byteSize\":8,\"rawData\":9223372036854775807,\"type\":\"LONG\"}", + Long.MAX_VALUE))); + Assert.assertTrue(column.asBoolean().equals(true)); + + System.out.println(column.asDouble()); + Assert.assertTrue(column.asDouble().equals((double) Long.MAX_VALUE)); + Assert.assertTrue(column.asDate().equals(new Date(Long.MAX_VALUE))); + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_min() { + LongColumn column = new LongColumn(Long.MIN_VALUE); + System.out.println(column.asString()); + Assert.assertTrue(column.asString().equals( + String.valueOf(Long.MIN_VALUE))); + System.out.println(column.toString()); + Assert.assertTrue(column + .toString() + .equals(String + .format("{\"byteSize\":8,\"rawData\":-9223372036854775808,\"type\":\"LONG\"}", + Long.MIN_VALUE))); + Assert.assertTrue(column.asBoolean().equals(true)); + + System.out.println(column.asDouble()); + Assert.assertTrue(column.asDouble().equals((double) Long.MIN_VALUE)); + Assert.assertTrue(column.asDate().equals(new Date(Long.MIN_VALUE))); + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_string() { + 
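+        // build a LongColumn from the string form of Long.MIN_VALUE and check that the numeric, boolean and date conversions below still hold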
LongColumn column = new LongColumn(String.valueOf(Long.MIN_VALUE)); + System.out.println(column.asString()); + Assert.assertTrue(column.asString().equals( + String.valueOf(Long.MIN_VALUE))); + System.out.println(column.toString()); + Assert.assertTrue(column + .toString() + .equals("{\"byteSize\":20,\"rawData\":-9223372036854775808,\"type\":\"LONG\"}")); + Assert.assertTrue(column.asBoolean().equals(true)); + + System.out.println(column.asDouble()); + Assert.assertTrue(column.asDouble().equals((double) Long.MIN_VALUE)); + Assert.assertTrue(column.asDate().equals(new Date(Long.MIN_VALUE))); + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_science() { + LongColumn column = new LongColumn(String.valueOf("4.7E+38")); + System.out.println(column.asString()); + Assert.assertTrue(column.asString().equals( + "470000000000000000000000000000000000000")); + System.out.println(column.toString()); + Assert.assertTrue(column.asBoolean().equals(true)); + + System.out.println(">>" + column.asBigDecimal()); + System.out.println(">>" + new BigDecimal("4.7E+38").toPlainString()); + Assert.assertTrue(column.asBigDecimal().toPlainString() + .equals(new BigDecimal("4.7E+38").toPlainString())); + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_bigInteger() { + LongColumn column = new LongColumn(BigInteger.valueOf(Long.MIN_VALUE)); + System.out.println(column.asString()); + Assert.assertTrue(column.asString().equals( + String.valueOf(Long.MIN_VALUE))); + System.out.println(column.toString()); + Assert.assertEquals(column.toString() + ,String.format("{\"byteSize\":8,\"rawData\":-9223372036854775808,\"type\":\"LONG\"}", + Long.MIN_VALUE)); + Assert.assertTrue(column.asBoolean().equals(true)); + + System.out.println(column.asDouble()); + Assert.assertTrue(column.asDouble().equals((double) Long.MIN_VALUE)); + Assert.assertTrue(column.asDate().equals(new Date(Long.MIN_VALUE))); + + try { + Assert.assertTrue(column.asBytes() == null); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_overflow() { + LongColumn column = new LongColumn(String.valueOf(Long.MAX_VALUE) + + "000"); + + Assert.assertTrue(column.asBoolean().equals(true)); + Assert.assertTrue(column.asBigDecimal().equals( + new BigDecimal(String.valueOf(Long.MAX_VALUE) + "000"))); + Assert.assertTrue(column.asString().equals( + String.valueOf(Long.MAX_VALUE) + "000")); + Assert.assertTrue(column.asBigInteger().equals( + new BigInteger(String.valueOf(Long.MAX_VALUE) + "000"))); + + try { + column.asLong(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + + column = new LongColumn(String.valueOf(Long.MIN_VALUE) + "000"); + + Assert.assertTrue(column.asBoolean().equals(true)); + Assert.assertTrue(column.asBigDecimal().equals( + new BigDecimal(String.valueOf(Long.MIN_VALUE) + "000"))); + Assert.assertTrue(column.asString().equals( + String.valueOf(Long.MIN_VALUE) + "000")); + Assert.assertTrue(column.asBigInteger().equals( + new BigInteger(String.valueOf(Long.MIN_VALUE) + "000"))); + + try { + column.asLong(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + + } +} diff --git 
a/common/src/test/java/com/alibaba/datax/common/element/ScientificTester.java b/common/src/test/java/com/alibaba/datax/common/element/ScientificTester.java new file mode 100755 index 000000000..e6a3a1e19 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/element/ScientificTester.java @@ -0,0 +1,12 @@ +package com.alibaba.datax.common.element; + +import org.apache.commons.lang3.math.NumberUtils; +import org.junit.Test; + +public class ScientificTester { + @Test + public void test() { + System.out.println(NumberUtils.createBigDecimal("10E+6").toBigInteger().toString()); + System.err.println((String) null); + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/element/StringColumnTest.java b/common/src/test/java/com/alibaba/datax/common/element/StringColumnTest.java new file mode 100755 index 000000000..ce32ef5a5 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/element/StringColumnTest.java @@ -0,0 +1,216 @@ +package com.alibaba.datax.common.element; + +import java.io.UnsupportedEncodingException; +import java.math.BigDecimal; +import java.math.BigInteger; + +import org.junit.Assert; +import org.junit.Test; + +import com.alibaba.datax.common.base.BaseTest; +import com.alibaba.datax.common.exception.DataXException; + +public class StringColumnTest extends BaseTest { + + @Test + public void test_double() { + DoubleColumn real = new DoubleColumn("3.14"); + Assert.assertTrue(real.asString().equals("3.14")); + Assert.assertTrue(real.asDouble().equals(3.14d)); + Assert.assertTrue(real.asLong().equals(3L)); + + try { + real.asBoolean(); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + try { + real.asDate(); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + } + + @Test + public void test_int() { + LongColumn integer = new LongColumn("3"); + Assert.assertTrue(integer.asString().equals("3")); + Assert.assertTrue(integer.asDouble().equals(3.0d)); + Assert.assertTrue(integer.asBoolean().equals(true)); + Assert.assertTrue(integer.asLong().equals(3L)); + System.out.println(integer.asDate()); + } + + @Test + public void test_string() { + StringColumn string = new StringColumn("bazhen"); + Assert.assertTrue(string.asString().equals("bazhen")); + try { + string.asLong(); + Assert.assertTrue(false); + + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + try { + string.asDouble(); + Assert.assertTrue(false); + + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + try { + string.asDate(); + Assert.assertTrue(false); + + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + Assert.assertTrue(new String(string.asString().getBytes()) + .equals("bazhen")); + } + + @Test + public void test_bool() { + StringColumn string = new StringColumn("true"); + Assert.assertTrue(string.asString().equals("true")); + Assert.assertTrue(string.asBoolean().equals(true)); + + try { + string.asDate(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + try { + string.asDouble(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + + try { + string.asLong(); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + } + + @Test + public void test_null() throws UnsupportedEncodingException { + StringColumn string = new StringColumn(); + Assert.assertTrue(string.asString() == null); + 
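+        // a StringColumn created without a value should yield null for every as-* conversion checked below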
Assert.assertTrue(string.asLong() == null); + Assert.assertTrue(string.asDouble() == null); + Assert.assertTrue(string.asDate() == null); + Assert.assertTrue(string.asBytes() == null); + } + + @Test + public void test_overflow() { + StringColumn column = new StringColumn( + new BigDecimal("1E-1000").toPlainString()); + + System.out.println(column.asString()); + + Assert.assertTrue(column.asBigDecimal().equals( + new BigDecimal("1E-1000"))); + + Assert.assertTrue(column.asBigInteger().compareTo(BigInteger.ZERO) == 0); + Assert.assertTrue(column.asLong().equals(0L)); + + try { + column.asDouble(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + + column = new StringColumn(new BigDecimal("1E1000").toPlainString()); + Assert.assertTrue(column.asBigDecimal().compareTo( + new BigDecimal("1E1000")) == 0); + Assert.assertTrue(column.asBigInteger().compareTo( + new BigDecimal("1E1000").toBigInteger()) == 0); + try { + column.asDouble(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + + try { + column.asLong(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + + column = new StringColumn(new BigDecimal("-1E1000").toPlainString()); + Assert.assertTrue(column.asBigDecimal().compareTo( + new BigDecimal("-1E1000")) == 0); + Assert.assertTrue(column.asBigInteger().compareTo( + new BigDecimal("-1E1000").toBigInteger()) == 0); + try { + column.asDouble(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + + try { + column.asLong(); + Assert.assertTrue(false); + } catch (Exception e) { + e.printStackTrace(); + Assert.assertTrue(true); + } + } + + @Test + public void test_NaN() { + StringColumn column = new StringColumn(String.valueOf(Double.NaN)); + Assert.assertTrue(column.asDouble().equals(Double.NaN)); + try { + column.asBigDecimal(); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + column = new StringColumn(String.valueOf(Double.POSITIVE_INFINITY)); + Assert.assertTrue(column.asDouble().equals(Double.POSITIVE_INFINITY)); + try { + column.asBigDecimal(); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + column = new StringColumn(String.valueOf(Double.NEGATIVE_INFINITY)); + Assert.assertTrue(column.asDouble().equals(Double.NEGATIVE_INFINITY)); + try { + column.asBigDecimal(); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void testEmptyString() { + StringColumn column = new StringColumn(""); + try { + BigDecimal num = column.asBigDecimal(); + } catch(Exception e) { + Assert.assertTrue(e.getMessage().contains("String [\"\"] 不能转为BigDecimal")); + } + + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/exception/DataXExceptionTest.java b/common/src/test/java/com/alibaba/datax/common/exception/DataXExceptionTest.java new file mode 100755 index 000000000..96cebb966 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/exception/DataXExceptionTest.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.common.exception; + +import org.junit.Assert; +import org.junit.Test; + +import com.alibaba.datax.common.spi.ErrorCode; + +public class DataXExceptionTest { + + private DataXException dataXException; + + @Test + public void basicTest() { + ErrorCode errorCode = FakeErrorCode.FAKE_ERROR_CODE_ONLY_FOR_TEST_00; + String 
errorMsg = "basicTest"; + dataXException = DataXException.asDataXException(errorCode, errorMsg); + Assert.assertEquals(errorCode.toString() + " - " + errorMsg, + dataXException.getMessage()); + } + + @Test + public void basicTest_中文() { + ErrorCode errorCode = FakeErrorCode.FAKE_ERROR_CODE_ONLY_FOR_TEST_01; + String errorMsg = "basicTest中文"; + dataXException = DataXException.asDataXException(errorCode, errorMsg); + Assert.assertEquals(errorCode.toString() + " - " + errorMsg, + dataXException.getMessage()); + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/exception/FakeErrorCode.java b/common/src/test/java/com/alibaba/datax/common/exception/FakeErrorCode.java new file mode 100755 index 000000000..27e2bdafa --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/exception/FakeErrorCode.java @@ -0,0 +1,37 @@ +package com.alibaba.datax.common.exception; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum FakeErrorCode implements ErrorCode { + + FAKE_ERROR_CODE_ONLY_FOR_TEST_00("FakeErrorCode-00", + "only a test, FakeErrorCode."), FAKE_ERROR_CODE_ONLY_FOR_TEST_01( + "FakeErrorCode-01", + "only a test, FakeErrorCode,测试中文."), + + ; + + private final String code; + private final String description; + + private FakeErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Describe:[%s]", this.code, + this.description); + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/statistics/PerfRecordTest.java b/common/src/test/java/com/alibaba/datax/common/statistics/PerfRecordTest.java new file mode 100644 index 000000000..66c07570e --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/statistics/PerfRecordTest.java @@ -0,0 +1,471 @@ +package com.alibaba.datax.common.statistics; + +import org.junit.Assert; +import org.junit.Before; +import org.junit.FixMethodOrder; +import org.junit.Test; +import org.junit.runners.MethodSorters; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.lang.reflect.Field; +import java.util.List; + +/** + * Created by liqiang on 15/8/26. 
+ */ +@FixMethodOrder(MethodSorters.NAME_ASCENDING) +public class PerfRecordTest { + private static Logger LOG = LoggerFactory.getLogger(PerfRecordTest.class); + private final int TGID = 1; + + + @Before + public void setUp() throws Exception { + Field instance=PerfTrace.class.getDeclaredField("instance"); + instance.setAccessible(true); + instance.set(null,null); + } + + public boolean hasRecordInList(List perfRecordList,PerfRecord perfRecord){ + if(perfRecordList==null || perfRecordList.size()==0){ + return false; + } + + for(PerfRecord perfRecord1:perfRecordList){ + if(perfRecord.equals(perfRecord1)){ + return true; + } + } + + return false; + } + + @Test + public void test001PerfRecordEquals() throws Exception { + PerfTrace.getInstance(true, 1001, 1, 0, true); + + PerfRecord initPerfRecord = new PerfRecord(TGID, 1, PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord.start(); + Thread.sleep(50); + initPerfRecord.end(); + + PerfRecord initPerfRecord2 = initPerfRecord.copy(); + + Assert.assertTrue(initPerfRecord.equals(initPerfRecord2)); + + PerfRecord initPerfRecord3 = new PerfRecord(TGID, 1, PerfRecord.PHASE.READ_TASK_DESTROY); + initPerfRecord3.start(); + Thread.sleep(1050); + initPerfRecord3.end(); + + Assert.assertTrue(!initPerfRecord.equals(initPerfRecord3)); + + PerfRecord initPerfRecord4 = new PerfRecord(TGID, 1, PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord4.start(); + Thread.sleep(2050); + initPerfRecord4.end(); + + System.out.println(initPerfRecord4.toString()); + System.out.println(initPerfRecord.toString()); + + Assert.assertTrue(!initPerfRecord.equals(initPerfRecord4)); + + PerfRecord initPerfRecord5 = new PerfRecord(TGID, 1, PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord5.start(); + Thread.sleep(50); + initPerfRecord5.end(); + + initPerfRecord5.addCount(100); + initPerfRecord5.addSize(200); + + Assert.assertTrue(!initPerfRecord.equals(initPerfRecord5)); + + PerfRecord initPerfRecord6 = initPerfRecord.copy(); + initPerfRecord6.addCount(1001); + initPerfRecord6.addSize(1001); + + Assert.assertTrue(initPerfRecord.equals(initPerfRecord6)); + + } + + @Test + public void test002Normal() throws Exception { + + PerfTrace.getInstance(true, 1001, 1, 0, true); + + PerfRecord initPerfRecord = new PerfRecord(TGID, 1, PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord.start(); + Thread.sleep(1050); + initPerfRecord.end(); + + Assert.assertTrue(initPerfRecord.getAction().name().equals("end")); + Assert.assertTrue(initPerfRecord.getElapsedTimeInNs() >= 1050000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_INIT).getTotalCount() == 1); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), initPerfRecord)); + + + LOG.debug("task writer starts to do prepare ..."); + PerfRecord preparePerfRecord = new PerfRecord(TGID, 1, PerfRecord.PHASE.WRITE_TASK_PREPARE); + preparePerfRecord.start(); + Thread.sleep(1020); + preparePerfRecord.end(); + + Assert.assertTrue(preparePerfRecord.getAction().name().equals("end")); + Assert.assertTrue(preparePerfRecord.getElapsedTimeInNs() >= 1020000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_PREPARE).getTotalCount() == 1); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), preparePerfRecord)); + + + LOG.debug("task writer starts to write ..."); + PerfRecord dataPerfRecord = new PerfRecord(TGID, 1, PerfRecord.PHASE.READ_TASK_DATA); + dataPerfRecord.start(); + + Thread.sleep(1200); + 
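+        // after roughly 1.2s of simulated work, record the row count and byte size handled in the READ_TASK_DATA phase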
dataPerfRecord.addCount(1001); + dataPerfRecord.addSize(1002); + dataPerfRecord.end(); + + Assert.assertTrue(dataPerfRecord.getAction().name().equals("end")); + Assert.assertTrue(dataPerfRecord.getElapsedTimeInNs() >= 1020000000); + Assert.assertTrue(dataPerfRecord.getCount() == 1001); + Assert.assertTrue(dataPerfRecord.getSize() == 1002); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.READ_TASK_DATA).getTotalCount() == 1); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), dataPerfRecord)); + + + PerfRecord destoryPerfRecord = new PerfRecord(TGID, 1, PerfRecord.PHASE.READ_TASK_DESTROY); + destoryPerfRecord.start(); + + Thread.sleep(250); + destoryPerfRecord.end(); + + Assert.assertTrue(destoryPerfRecord.getAction().name().equals("end")); + Assert.assertTrue(destoryPerfRecord.getElapsedTimeInNs() >= 250000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.READ_TASK_DESTROY).getTotalCount() == 1); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), destoryPerfRecord)); + + PerfRecord waitTimePerfRecord = new PerfRecord(TGID, 1, PerfRecord.PHASE.WAIT_READ_TIME); + waitTimePerfRecord.start(); + + Thread.sleep(250); + waitTimePerfRecord.end(); + + Assert.assertTrue(waitTimePerfRecord.getAction().name().equals("end")); + Assert.assertTrue(waitTimePerfRecord.getElapsedTimeInNs() >= 250000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WAIT_READ_TIME).getTotalCount() == 1); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), waitTimePerfRecord)); + + + + PerfRecord initPerfRecord2 = new PerfRecord(TGID, 2, PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord2.start(); + Thread.sleep(50); + initPerfRecord2.end(); + + Assert.assertTrue(initPerfRecord2.getAction().name().equals("end")); + Assert.assertTrue(initPerfRecord2.getElapsedTimeInNs() >= 50000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_INIT).getTotalCount() == 2); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), initPerfRecord2)); + + LOG.debug("task writer starts to do prepare ..."); + PerfRecord preparePerfRecord2 = new PerfRecord(TGID, 2, PerfRecord.PHASE.WRITE_TASK_PREPARE); + preparePerfRecord2.start(); + Thread.sleep(20); + preparePerfRecord2.end(); + LOG.debug("task writer starts to write ..."); + + Assert.assertTrue(preparePerfRecord2.getAction().name().equals("end")); + Assert.assertTrue(preparePerfRecord2.getElapsedTimeInNs() >= 20000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_PREPARE).getTotalCount() == 2); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), preparePerfRecord2)); + + + PerfRecord dataPerfRecor2 = new PerfRecord(TGID, 2, PerfRecord.PHASE.READ_TASK_DATA); + dataPerfRecor2.start(); + + Thread.sleep(2200); + dataPerfRecor2.addCount(2001); + dataPerfRecor2.addSize(2002); + dataPerfRecor2.end(); + + Assert.assertTrue(dataPerfRecor2.getAction().name().equals("end")); + Assert.assertTrue(dataPerfRecor2.getElapsedTimeInNs() >= 2200000000L); + Assert.assertTrue(dataPerfRecor2.getCount() == 2001); + Assert.assertTrue(dataPerfRecor2.getSize() == 2002); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.READ_TASK_DATA).getTotalCount() == 2); + 
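+        // the completed record should also be queued in the waiting-report list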
Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), dataPerfRecor2)); + + + PerfRecord destoryPerfRecord2 = new PerfRecord(TGID, 2, PerfRecord.PHASE.READ_TASK_DESTROY); + destoryPerfRecord2.start(); + + Thread.sleep(1250); + destoryPerfRecord2.end(); + + Assert.assertTrue(destoryPerfRecord2.getAction().name().equals("end")); + Assert.assertTrue(destoryPerfRecord2.getElapsedTimeInNs() >= 1250000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.READ_TASK_DESTROY).getTotalCount() == 2); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), destoryPerfRecord2)); + + PerfRecord waitPerfRecord2 = new PerfRecord(TGID, 2, PerfRecord.PHASE.WAIT_READ_TIME); + waitPerfRecord2.start(); + + Thread.sleep(1250); + waitPerfRecord2.end(); + + Assert.assertTrue(waitPerfRecord2.getAction().name().equals("end")); + Assert.assertTrue(waitPerfRecord2.getElapsedTimeInNs() >= 1250000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WAIT_READ_TIME).getTotalCount() == 2); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), waitPerfRecord2)); + + + PerfTrace.getInstance().addTaskDetails(1, " "); + PerfTrace.getInstance().addTaskDetails(1, "task 1 some thing abcdf"); + PerfTrace.getInstance().addTaskDetails(2,"before char"); + PerfTrace.getInstance().addTaskDetails(2,"task 2 some thing abcdf"); + + Assert.assertTrue(PerfTrace.getInstance().getTaskDetails().get(1).equals("task 1 some thing abcdf")); + Assert.assertTrue(PerfTrace.getInstance().getTaskDetails().get(2).equals("before char,task 2 some thing abcdf")); + System.out.println(PerfTrace.getInstance().summarizeNoException()); + } + @Test + public void test003Disable() throws Exception { + + PerfTrace.getInstance(true, 1001, 1, 0, false); + + PerfRecord initPerfRecord = new PerfRecord(TGID, 1, PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord.start(); + Thread.sleep(1050); + initPerfRecord.end(); + + Assert.assertTrue(initPerfRecord.getDatetime().equals("null time")); + Assert.assertTrue(initPerfRecord.getElapsedTimeInNs() == -1); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_INIT) == null); + + + LOG.debug("task writer starts to do prepare ..."); + PerfRecord preparePerfRecord = new PerfRecord(TGID, 1, PerfRecord.PHASE.WRITE_TASK_PREPARE); + preparePerfRecord.start(); + Thread.sleep(1020); + preparePerfRecord.end(); + LOG.debug("task writer starts to write ..."); + + Assert.assertTrue(preparePerfRecord.getDatetime().equals("null time")); + Assert.assertTrue(preparePerfRecord.getElapsedTimeInNs() == -1); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_PREPARE) == null); + + + PerfRecord dataPerfRecord = new PerfRecord(TGID, 1, PerfRecord.PHASE.READ_TASK_DATA); + dataPerfRecord.start(); + + Thread.sleep(1200); + dataPerfRecord.addCount(1001); + dataPerfRecord.addSize(1001); + dataPerfRecord.end(); + + Assert.assertTrue(dataPerfRecord.getDatetime().equals("null time")); + Assert.assertTrue(dataPerfRecord.getElapsedTimeInNs() == -1); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.READ_TASK_DATA) == null); + + PerfRecord waitPerfRecor1 = new PerfRecord(TGID, 1, PerfRecord.PHASE.WAIT_WRITE_TIME); + waitPerfRecor1.start(); + + Thread.sleep(2200); + waitPerfRecor1.end(); + + Assert.assertTrue(waitPerfRecor1.getDatetime().equals("null time")); + 
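+        // tracing is disabled here, so elapsed time stays at the -1 sentinel and no entry is added to the phase map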
Assert.assertTrue(waitPerfRecor1.getElapsedTimeInNs() == -1); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WAIT_WRITE_TIME) == null); + + + PerfRecord initPerfRecord2 = new PerfRecord(TGID, 2, PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord2.start(); + Thread.sleep(50); + initPerfRecord2.end(); + + Assert.assertTrue(initPerfRecord2.getDatetime().equals("null time")); + Assert.assertTrue(initPerfRecord2.getElapsedTimeInNs() == -1); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_INIT) == null); + + LOG.debug("task writer starts to do prepare ..."); + PerfRecord preparePerfRecord2 = new PerfRecord(TGID, 2, PerfRecord.PHASE.WRITE_TASK_PREPARE); + preparePerfRecord2.start(); + Thread.sleep(20); + preparePerfRecord2.end(); + LOG.debug("task writer starts to write ..."); + + Assert.assertTrue(preparePerfRecord2.getDatetime().equals("null time")); + Assert.assertTrue(preparePerfRecord2.getElapsedTimeInNs() == -1); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_PREPARE) == null); + + + + PerfRecord dataPerfRecor2 = new PerfRecord(TGID, 2, PerfRecord.PHASE.READ_TASK_DATA); + dataPerfRecor2.start(); + + Thread.sleep(2200); + dataPerfRecor2.addCount(2001); + dataPerfRecor2.addSize(2001); + dataPerfRecor2.end(); + + Assert.assertTrue(dataPerfRecor2.getDatetime().equals("null time")); + Assert.assertTrue(dataPerfRecor2.getElapsedTimeInNs() == -1); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.READ_TASK_DATA) == null); + + PerfRecord waitPerfRecor2 = new PerfRecord(TGID, 2, PerfRecord.PHASE.WAIT_WRITE_TIME); + waitPerfRecor2.start(); + + Thread.sleep(2200); + waitPerfRecor2.end(); + + Assert.assertTrue(waitPerfRecor2.getDatetime().equals("null time")); + Assert.assertTrue(waitPerfRecor2.getElapsedTimeInNs() == -1); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WAIT_WRITE_TIME) == null); + + PerfTrace.getInstance().addTaskDetails(1, "task 1 some thing abcdf"); + PerfTrace.getInstance().addTaskDetails(2, "task 2 some thing abcdf"); + + Assert.assertTrue(PerfTrace.getInstance().getTaskDetails().size()==0); + System.out.println(PerfTrace.getInstance().summarizeNoException()); + } + + @Test + public void test004Normal2() throws Exception { + int priority = 0; + try { + priority = Integer.parseInt(System.getenv("SKYNET_PRIORITY")); + }catch (NumberFormatException e){ + LOG.warn("prioriy set to 0, because NumberFormatException, the value is: "+System.getProperty("PROIORY")); + } + + System.out.println("priority====" + priority); + + PerfTrace.getInstance(false, 1001001001001L, 1, 0, true); + + PerfRecord initPerfRecord = new PerfRecord(TGID, 10000001, PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord.start(); + Thread.sleep(1050); + initPerfRecord.end(); + + Assert.assertTrue(initPerfRecord.getAction().name().equals("end")); + Assert.assertTrue(initPerfRecord.getElapsedTimeInNs() >= 1050000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_INIT).getTotalCount() == 1); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), initPerfRecord)); + + + LOG.debug("task writer starts to do prepare ..."); + PerfRecord preparePerfRecord = new PerfRecord(TGID, 10000001, PerfRecord.PHASE.WRITE_TASK_PREPARE); + preparePerfRecord.start(); + Thread.sleep(1020); + preparePerfRecord.end(); + + 
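+        // the prepare phase slept at least 1020ms, so the recorded elapsed time must be at least that long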
Assert.assertTrue(preparePerfRecord.getAction().name().equals("end")); + Assert.assertTrue(preparePerfRecord.getElapsedTimeInNs() >= 1020000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_PREPARE).getTotalCount() == 1); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), preparePerfRecord)); + + LOG.debug("task wait time ..."); + PerfRecord waitPerfRecord = new PerfRecord(TGID, 10000001, PerfRecord.PHASE.WAIT_WRITE_TIME); + waitPerfRecord.start(); + Thread.sleep(1030); + waitPerfRecord.end(); + + Assert.assertTrue(waitPerfRecord.getAction().name().equals("end")); + Assert.assertTrue(waitPerfRecord.getElapsedTimeInNs() >= 1030000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WAIT_WRITE_TIME).getTotalCount() == 1); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), waitPerfRecord)); + + + + LOG.debug("task writer starts to write ..."); + + PerfRecord dataPerfRecord = new PerfRecord(TGID, 10000001, PerfRecord.PHASE.READ_TASK_DATA); + dataPerfRecord.start(); + + Thread.sleep(1200); + dataPerfRecord.addCount(1001); + dataPerfRecord.addSize(1002); + dataPerfRecord.end(); + + Assert.assertTrue(dataPerfRecord.getAction().name().equals("end")); + Assert.assertTrue(dataPerfRecord.getElapsedTimeInNs() >= 1020000000); + Assert.assertTrue(dataPerfRecord.getCount() == 1001); + Assert.assertTrue(dataPerfRecord.getSize() == 1002); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.READ_TASK_DATA).getTotalCount() == 1); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), dataPerfRecord)); + + + PerfRecord initPerfRecord2 = new PerfRecord(TGID, 10000002, PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord2.start(); + Thread.sleep(50); + initPerfRecord2.end(); + + Assert.assertTrue(initPerfRecord2.getAction().name().equals("end")); + Assert.assertTrue(initPerfRecord2.getElapsedTimeInNs() >= 50000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_INIT).getTotalCount() == 2); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), initPerfRecord2)); + + LOG.debug("task wait time ..."); + PerfRecord waitPerfRecord2 = new PerfRecord(TGID, 10000002, PerfRecord.PHASE.WAIT_WRITE_TIME); + waitPerfRecord2.start(); + Thread.sleep(2030); + waitPerfRecord2.end(); + + Assert.assertTrue(waitPerfRecord2.getAction().name().equals("end")); + Assert.assertTrue(waitPerfRecord2.getElapsedTimeInNs() >= 2030000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WAIT_WRITE_TIME).getTotalCount() == 2); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), waitPerfRecord2)); + + + LOG.debug("task writer starts to do prepare ..."); + PerfRecord preparePerfRecord2 = new PerfRecord(TGID, 10000002, PerfRecord.PHASE.WRITE_TASK_PREPARE); + preparePerfRecord2.start(); + Thread.sleep(20); + preparePerfRecord2.end(); + + Assert.assertTrue(preparePerfRecord2.getAction().name().equals("end")); + Assert.assertTrue(preparePerfRecord2.getElapsedTimeInNs() >= 20000000); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.WRITE_TASK_PREPARE).getTotalCount() == 2); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), preparePerfRecord2)); + + + LOG.debug("task writer starts to write ..."); + + PerfRecord 
dataPerfRecor2 = new PerfRecord(TGID, 10000002, PerfRecord.PHASE.READ_TASK_DATA); + dataPerfRecor2.start(); + + Thread.sleep(2200); + dataPerfRecor2.addCount(2001); + dataPerfRecor2.addSize(2002); + dataPerfRecor2.end(); + + Assert.assertTrue(dataPerfRecor2.getAction().name().equals("end")); + Assert.assertTrue(dataPerfRecor2.getElapsedTimeInNs() >= 2200000000L); + Assert.assertTrue(dataPerfRecor2.getCount() == 2001); + Assert.assertTrue(dataPerfRecor2.getSize() == 2002); + Assert.assertTrue(PerfTrace.getInstance().getPerfRecordMaps().get(PerfRecord.PHASE.READ_TASK_DATA).getTotalCount() == 2); + Assert.assertTrue(hasRecordInList(PerfTrace.getInstance().getWaitingReportList(), dataPerfRecor2)); + + + PerfTrace.getInstance().addTaskDetails(10000001, "task 100000011 some thing abcdf"); + PerfTrace.getInstance().addTaskDetails(10000002, "task 100000012 some thing abcdf"); + PerfTrace.getInstance().addTaskDetails(10000004, "task 100000012 some thing abcdf?123?345"); + PerfTrace.getInstance().addTaskDetails(10000005, "task 100000012 some thing abcdf?456"); + PerfTrace.getInstance().addTaskDetails(10000006, "[task 100000012? some thing abcdf?456"); + + Assert.assertTrue(PerfTrace.getInstance().getTaskDetails().get(10000001).equals("task 100000011 some thing abcdf")); + Assert.assertTrue(PerfTrace.getInstance().getTaskDetails().get(10000002).equals("task 100000012 some thing abcdf")); + + PerfRecord.addPerfRecord(TGID, 10000003, PerfRecord.PHASE.TASK_TOTAL, System.currentTimeMillis(), 12300123L * 1000L * 1000L); + PerfRecord.addPerfRecord(TGID, 10000004, PerfRecord.PHASE.TASK_TOTAL, System.currentTimeMillis(), 22300123L * 1000L * 1000L); + PerfRecord.addPerfRecord(TGID, 10000005, PerfRecord.PHASE.SQL_QUERY, System.currentTimeMillis(), 4L); + PerfRecord.addPerfRecord(TGID, 10000006, PerfRecord.PHASE.RESULT_NEXT_ALL, System.currentTimeMillis(), 3000L); + PerfRecord.addPerfRecord(TGID, 10000006, PerfRecord.PHASE.ODPS_BLOCK_CLOSE, System.currentTimeMillis(), 2000000L); + + System.out.println(PerfTrace.getInstance().summarizeNoException()); + + + } + +} \ No newline at end of file diff --git a/common/src/test/java/com/alibaba/datax/common/statistics/VMInfoTest.java b/common/src/test/java/com/alibaba/datax/common/statistics/VMInfoTest.java new file mode 100644 index 000000000..ce97007c5 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/statistics/VMInfoTest.java @@ -0,0 +1,109 @@ +package com.alibaba.datax.common.statistics; + +import org.junit.Assert; +import org.junit.Test; + +import java.lang.management.*; +import java.util.Arrays; +import java.util.List; + +/** + * Created by liqiang on 15/11/12. 
+ */ +public class VMInfoTest { + static final long MB = 1024 * 1024; + + @Test + public void testOs() throws Exception { + + RuntimeMXBean runtimeMXBean = ManagementFactory.getRuntimeMXBean(); + System.out.println(runtimeMXBean.getName()); + System.out.println("jvm运营商:" + runtimeMXBean.getVmVendor()); + System.out.println("jvm规范版本:" + runtimeMXBean.getSpecVersion()); + System.out.println("jvm实现版本:" + runtimeMXBean.getVmVersion()); + + + OperatingSystemMXBean osMXBean = ManagementFactory.getOperatingSystemMXBean(); + System.out.println(osMXBean.getName()); + System.out.println(osMXBean.getArch()); + System.out.println(osMXBean.getVersion()); + System.out.println(osMXBean.getAvailableProcessors()); + + + if (VMInfo.isSunOsMBean(osMXBean)) { + long totalPhysicalMemory = VMInfo.getLongFromOperatingSystem(osMXBean, "getTotalPhysicalMemorySize"); + long freePhysicalMemory = VMInfo.getLongFromOperatingSystem(osMXBean, "getFreePhysicalMemorySize"); + System.out.println("总物理内存(M):" + totalPhysicalMemory / MB); + System.out.println("剩余物理内存(M):" + freePhysicalMemory / MB); + + long maxFileDescriptorCount = VMInfo.getLongFromOperatingSystem(osMXBean, "getMaxFileDescriptorCount"); + long currentOpenFileDescriptorCount = VMInfo.getLongFromOperatingSystem(osMXBean, "getOpenFileDescriptorCount"); + long getProcessCpuTime = VMInfo.getLongFromOperatingSystem(osMXBean, "getProcessCpuTime"); + System.out.println(osMXBean.getSystemLoadAverage()); + System.out.println("maxFileDescriptorCount=>" + maxFileDescriptorCount); + System.out.println("currentOpenFileDescriptorCount=>" + currentOpenFileDescriptorCount); + System.out.println("jvm运行时间(毫秒):" + runtimeMXBean.getUptime()); + System.out.println("getProcessCpuTime=>" + getProcessCpuTime); + + long startTime = System.currentTimeMillis(); + while (true) { + if (System.currentTimeMillis() > startTime + 1000) { + break; + } + } +// system = ManagementFactory.getOperatingSystemMXBean(); +// runtime = ManagementFactory.getRuntimeMXBean(); + System.out.println("test!!" + 2 * 2 * 2 * 123456789); + System.out.println("test!!" + 123456789 * 987654321); + System.out.println("test!!" + 2 * 2 * 2 * 2); + System.out.println("test!!" 
+ 3 * 2 * 4); + System.out.println("test123!!"); + long upTime = runtimeMXBean.getUptime(); + long processTime = VMInfo.getLongFromOperatingSystem(osMXBean, "getProcessCpuTime"); + System.out.println("jvm运行时间(毫秒):" + upTime); + System.out.println("getProcessCpuTime=>" + processTime); + + System.out.println(String.format("%,.1f", (float) processTime / (upTime * osMXBean.getAvailableProcessors() * 10000))); + + + List garbages = ManagementFactory.getGarbageCollectorMXBeans(); + for (GarbageCollectorMXBean garbage : garbages) { + System.out.println("垃圾收集器:名称=" + garbage.getName() + ",收集=" + garbage.getCollectionCount() + ",总花费时间=" + garbage.getCollectionTime() + ",内存区名称=" + Arrays.deepToString(garbage.getMemoryPoolNames())); + } + + List pools = ManagementFactory.getMemoryPoolMXBeans(); + if (pools != null && !pools.isEmpty()) { + for (MemoryPoolMXBean pool : pools) { + //只打印一些各个内存区都有的属性,一些区的特殊属性,可看文档或百度 + // 最大值,初始值,如果没有定义的话,返回-1,所以真正使用时,要注意 + System.out.println("vm内存区:\n\t名称=" + pool.getName() + "\n\t所属内存管理者=" + Arrays.deepToString(pool.getMemoryManagerNames()) + "\n\t ObjectName=" + "\n\t初始大小(M)=" + pool.getUsage().getInit() / MB + "\n\t最大(上限)(M)=" + pool.getUsage().getMax() / MB + "\n\t已用大小(M)=" + pool.getUsage().getUsed() / MB + "\n\t已提交(已申请)(M)=" + pool.getUsage().getCommitted() / MB + "\n\t使用率=" + (pool.getUsage().getUsed() * 100 / pool.getUsage().getCommitted()) + "%"); + } + } + + } + } + + @Test + public void testVMInfo() throws Exception { + VMInfo vmInfo = VMInfo.getVmInfo(); + Assert.assertTrue(vmInfo != null); + System.out.println(vmInfo.toString()); + vmInfo.getDelta(); + int count = 0; + + while(count < 10) { + long startTime = System.currentTimeMillis(); + while (true) { + if (System.currentTimeMillis() > startTime + 1000) { + break; + } + } + vmInfo.getDelta(); + count++; + Thread.sleep(1000); + } + + vmInfo.getDelta(false); + System.out.println(vmInfo.totalString()); + } +} \ No newline at end of file diff --git a/common/src/test/java/com/alibaba/datax/common/util/ConfigurationTest.java b/common/src/test/java/com/alibaba/datax/common/util/ConfigurationTest.java new file mode 100755 index 000000000..1b8a56858 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/util/ConfigurationTest.java @@ -0,0 +1,699 @@ +package com.alibaba.datax.common.util; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.fastjson.JSON; +import org.apache.commons.lang3.StringUtils; +import org.junit.Assert; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.ExpectedException; + +import java.util.*; + +public class ConfigurationTest { + + @Test + public void test_get() { + Configuration configuration = Configuration + .from("{\"a\":[{\"b\":[{\"c\":[\"bazhen\"]}]}]}"); + + String path = ""; + Assert.assertTrue(JSON.toJSONString(configuration.get(path)).equals( + "{\"a\":[{\"b\":[{\"c\":[\"bazhen\"]}]}]}")); + + path = "a[0].b[0].c[0]"; + Assert.assertTrue(JSON.toJSONString(configuration.get(path)).equals( + "\"bazhen\"")); + + configuration = Configuration.from("{\"a\": [[[0]]]}"); + path = "a[0][0][0]"; + System.out.println(JSON.toJSONString(configuration.get(path))); + Assert.assertTrue(JSON.toJSONString(configuration.get(path)) + .equals("0")); + + path = "a[0]"; + System.out.println(JSON.toJSONString(configuration.get(path))); + Assert.assertTrue(JSON.toJSONString(configuration.get(path)).equals( + "[[0]]")); + + path = "c[0]"; + 
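+        // a path that does not exist in the configuration resolves to null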
System.out.println(JSON.toJSONString(configuration.get(path))); + Assert.assertTrue(JSON.toJSONString(configuration.get(path)).equals( + "null")); + + configuration = Configuration.from("[1,2]"); + System.out.println(configuration.get("[0]")); + Assert.assertTrue(configuration.getString("[0]").equals("1")); + Assert.assertTrue(configuration.getString("[1]").equals("2")); + + } + + @Test + public void test_buildObject() { + + // 非法参数 + try { + Configuration.from("{}").buildObject(null, "bazhen"); + Assert.assertTrue(false); + } catch (Exception e) { + Assert.assertTrue(true); + } + + // 测试单元素 + Assert.assertTrue(Configuration.from("{}") + .buildObject(new ArrayList(), "bazhen") + .equals("bazhen")); + Assert.assertTrue(Configuration.from("{}").buildObject( + new ArrayList(), new HashMap()) instanceof Map); + Assert.assertTrue(Configuration.from("{}").buildObject( + new ArrayList(), null) == null); + + // 测试多级元素 + String path = null; + String json = null; + + path = ""; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), "bazhen")); + System.out.println(json); + Assert.assertTrue("\"bazhen\"".equals(json)); + + path = "a"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), "bazhen")); + System.out.println(json); + Assert.assertTrue("{\"a\":\"bazhen\"}".equals(json)); + + path = "a"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), + new HashMap())); + System.out.println(json); + Assert.assertTrue("{\"a\":{}}".equals(json)); + + path = "a"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), + new ArrayList())); + System.out.println(json); + Assert.assertTrue("{\"a\":[]}".equals(json)); + + path = "a"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), 1L)); + System.out.println(json); + Assert.assertTrue("{\"a\":1}".equals(json)); + + path = "a"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), 1.1)); + System.out.println(json); + Assert.assertTrue("{\"a\":1.1}".equals(json)); + + path = "[0]"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), "bazhen")); + System.out.println(json); + Assert.assertTrue("[\"bazhen\"]".equals(json)); + + path = "[1]"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), "bazhen")); + System.out.println(json); + Assert.assertTrue("[null,\"bazhen\"]".equals(json)); + + path = "a.b.c.d.e.f"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), "bazhen")); + System.out.println(json); + Assert.assertTrue("{\"a\":{\"b\":{\"c\":{\"d\":{\"e\":{\"f\":\"bazhen\"}}}}}}" + .equals(json)); + + path = "[1].[1]"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), "bazhen")); + System.out.println(json); + Assert.assertTrue("[null,[null,\"bazhen\"]]".equals(json)); + + path = "a.[10].b.[0].c.[1]"; + json = JSON.toJSONString(Configuration.from("{}").buildObject( + Arrays.asList(StringUtils.split(path, ".")), "bazhen")); + System.out.println(json); + Assert.assertTrue("{\"a\":[null,null,null,null,null,null,null,null,null,null,{\"b\":[{\"c\":[null,\"bazhen\"]}]}]}" + .equals(json)); 
+ } + + @Test + public void test_setObjectRecursive() { + // 当current完全为空,类似新插入对象 + + String path = ""; + Object root = null; + + root = Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, "bazhen"); + System.out.println(root); + Assert.assertTrue(JSON.toJSONString(root).equals("\"bazhen\"")); + + root = JSON.toJSONString(Configuration.from("{}").setObjectRecursive( + null, Arrays.asList(StringUtils.split(path, ".")), 0, + new ArrayList())); + System.out.println(root); + Assert.assertTrue(root.equals("[]")); + + root = JSON.toJSONString(Configuration.from("{}").setObjectRecursive( + null, Arrays.asList(StringUtils.split(path, ".")), 0, + new HashMap())); + System.out.println(root); + Assert.assertTrue(root.equals("{}")); + + root = JSON.toJSONString(Configuration.from("{}").setObjectRecursive( + null, Arrays.asList(StringUtils.split(path, ".")), 0, 0L)); + System.out.println(root); + Assert.assertTrue(root.equals("0")); + + // 当current当前为空,但是path存在路径,类似新插入对象 + path = "a"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue("{\"a\":\"bazhen\"}".equals(root)); + + path = "a.b"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue("{\"a\":{\"b\":\"bazhen\"}}".equals(root)); + + path = "a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p.q.r.s.t.u.v.w.x.y.z"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue("{\"a\":{\"b\":{\"c\":{\"d\":{\"e\":{\"f\":{\"g\":{\"h\":{\"i\":{\"j\":{\"k\":{\"l\":{\"m\":{\"n\":{\"o\":{\"p\":{\"q\":{\"r\":{\"s\":{\"t\":{\"u\":{\"v\":{\"w\":{\"x\":{\"y\":{\"z\":\"bazhen\"}}}}}}}}}}}}}}}}}}}}}}}}}}" + .equals(root)); + + path = "1.1"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue("{\"1\":{\"1\":\"bazhen\"}}".equals(root)); + + path = "-.-"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue("{\"-\":{\"-\":\"bazhen\"}}".equals(root)); + + path = "[0]"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue(root.equals("[\"bazhen\"]")); + + path = "[0].[0]"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue(root.equals("[[\"bazhen\"]]")); + + path = "[0].[0].[0].[0].[0].[0].[0].[0].[0]"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue(root.equals("[[[[[[[[[\"bazhen\"]]]]]]]]]")); + + path = "[0].[1].[2].[3].[4].[5].[6].[7].[8]"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue(root + 
.equals("[[null,[null,null,[null,null,null,[null,null,null,null,[null,null,null,null,null,[null,null,null,null,null,null,[null,null,null,null,null,null,null,[null,null,null,null,null,null,null,null,\"bazhen\"]]]]]]]]]")); + + path = "a.[0].b.[0].c.[0]"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(null, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue(root + .equals("{\"a\":[{\"b\":[{\"c\":[\"bazhen\"]}]}]}")); + + // 初始化为list,测试插入对象 + + root = JSON.parse("[]"); + path = "a"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue("{\"a\":\"bazhen\"}".equals(root)); + + root = JSON.parse("[]"); + path = "a.b"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue("{\"a\":{\"b\":\"bazhen\"}}".equals(root)); + + root = JSON.parse("[]"); + path = "[0]"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue(root.equals("[\"bazhen\"]")); + + root = JSON.parse("[]"); + path = "[0].[0]"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue(root.equals("[[\"bazhen\"]]")); + + // 初始化为map,测试插入对象 + root = JSON.parse("{}"); + path = "a"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue("{\"a\":\"bazhen\"}".equals(root)); + + root = JSON.parse("{}"); + path = "a.b"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue("{\"a\":{\"b\":\"bazhen\"}}".equals(root)); + + root = JSON.parse("{}"); + path = "[0]"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue(root.equals("[\"bazhen\"]")); + + root = JSON.parse("{}"); + path = "[0].[0]"; + root = JSON + .toJSONString(Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, + "bazhen")); + System.out.println(root); + Assert.assertTrue(root.equals("[[\"bazhen\"]]")); + + root = JSON.parse("{\"a\": \"a\", \"b\":\"b\"}"); + path = "a.[0]"; + root = Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, "bazhen"); + System.out.println(root); + System.out.println(JSON.toJSONString(root).equals( + "{\"a\":[\"bazhen\"],\"b\":\"b\"}")); + + root = JSON + .parse("{\"a\":{\"b\":{\"c\":[0],\"B\": \"B\"},\"A\": \"A\"}}"); + path = "a.b.c.[0]"; + root = Configuration.from("{}").setObjectRecursive(root, + Arrays.asList(StringUtils.split(path, ".")), 0, "bazhen"); + System.out.println(root); + Assert.assertTrue(JSON.toJSONString(root).equals( + "{\"a\":{\"A\":\"A\",\"b\":{\"B\":\"B\",\"c\":[\"bazhen\"]}}}")); + } + + @Test + public void test_setConfiguration() { + Configuration configuration = Configuration.from("{}"); + 
configuration.set("b", Configuration.from("{}")); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("{\"b\":{}}")); + + configuration = Configuration.newDefault(); + List list = new ArrayList(); + for (int i = 0; i < 3; i++) { + list.add(Configuration.newDefault()); + } + configuration.set("a", list); + System.out.println(configuration.toJSON()); + Assert.assertTrue("{\"a\":[{},{},{}]}".equals(configuration.toJSON())); + + Map map = new HashMap(); + map.put("a", Configuration.from("{\"a\": 1}")); + configuration.set("a", map); + System.out.println(configuration.toJSON()); + Assert.assertTrue("{\"a\":{\"a\":{\"a\":1}}}".equals(configuration + .toJSON())); + } + + @Test + public void test_set() { + Configuration configuration = Configuration + .from("{\"a\":{\"b\":{\"c\":[0],\"B\": \"B\"},\"A\": \"A\"}}"); + configuration.set("a.b.c[0]", 3.1415); + Assert.assertTrue(configuration.toJSON().equals( + "{\"a\":{\"A\":\"A\",\"b\":{\"B\":\"B\",\"c\":[3.1415]}}}")); + + configuration.set("a.b.c[1]", 3.1415); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration + .toJSON() + .equals("{\"a\":{\"A\":\"A\",\"b\":{\"B\":\"B\",\"c\":[3.1415,3.1415]}}}")); + configuration.set("a.b.c[0]", null); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration + .toJSON() + .equals("{\"a\":{\"A\":\"A\",\"b\":{\"B\":\"B\",\"c\":[null,3.1415]}}}")); + + configuration.set("[0]", 3.14); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("[3.14]")); + + configuration.set("[1]", 3.14); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("[3.14,3.14]")); + + configuration.set("", new HashMap()); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("{}")); + + configuration = Configuration.newDefault(); + configuration.set("a[0].b", 1); + configuration.set("a[0].b", 1); + System.out.println(configuration.toJSON()); + Assert.assertTrue("{\"a\":[{\"b\":1}]}".equals(configuration.toJSON())); + + try { + configuration.set(null, 3.14); + Assert.assertFalse(true); + } catch (Exception e) { + Assert.assertTrue(true); + } + + try { + configuration.set("", 3.14); + Assert.assertFalse(true); + } catch (Exception e) { + Assert.assertTrue(true); + } + } + + @Test + public void test_getKeys() { + Set sets = new HashSet(); + + sets.clear(); + Configuration configuration = Configuration.from("{}"); + System.out.println(JSON.toJSONString(configuration.getKeys())); + Assert.assertTrue(configuration.getKeys().isEmpty()); + + sets.clear(); + configuration = Configuration.from("[]"); + System.out.println(JSON.toJSONString(configuration.getKeys())); + Assert.assertTrue(configuration.getKeys().isEmpty()); + + sets.clear(); + configuration = Configuration.from("[0]"); + System.out.println(JSON.toJSONString(configuration.getKeys())); + Assert.assertTrue(configuration.getKeys().contains("[0]")); + + sets.clear(); + configuration = Configuration.from("[1,2]"); + System.out.println(JSON.toJSONString(configuration.getKeys())); + Assert.assertTrue(configuration.getKeys().contains("[0]")); + Assert.assertTrue(configuration.getKeys().contains("[1]")); + + sets.clear(); + configuration = Configuration.from("[[[0]]]"); + System.out.println(JSON.toJSONString(configuration.getKeys())); + Assert.assertTrue(configuration.getKeys().contains("[0][0][0]")); + + sets.clear(); + configuration = Configuration + 
.from("{\"a\":{\"b\":{\"c\":[0],\"B\": \"B\"},\"A\": \"A\"}}"); + System.out.println(JSON.toJSONString(configuration.getKeys())); + Assert.assertTrue(JSON.toJSONString(configuration.getKeys()).equals( + "[\"a.b.B\",\"a.b.c[0]\",\"a.A\"]")); + } + + @Test + public void test_merge() { + Configuration configuration = Configuration.from("{}"); + configuration.merge(Configuration.from("[1,2]"), true); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("[1,2]")); + + configuration = Configuration.from("{}"); + configuration.merge(Configuration.from("[1,2]"), false); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("[1,2]")); + + configuration = Configuration.from("{}"); + configuration.merge(Configuration.from("{\"1\": 2}"), true); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("{\"1\":2}")); + + configuration = Configuration.from("{}"); + configuration.merge(Configuration.from("{\"1\": 2}"), false); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("{\"1\":2}")); + + configuration = Configuration.from("{}"); + configuration.merge(Configuration.from("{\"1\":\"2\"}"), true); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("{\"1\":\"2\"}")); + + configuration = Configuration.from("{}"); + configuration.merge(Configuration.from("{\"1\":\"2\"}"), false); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("{\"1\":\"2\"}")); + + configuration = Configuration + .from("{\"a\":{\"b\":{\"c\":[0],\"B\": \"B\"},\"A\": \"A\"}}"); + configuration + .merge(Configuration + .from("{\"a\":{\"b\":{\"c\":[\"bazhen\"],\"B\": \"B\"},\"A\": \"A\"}}"), + true); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals( + "{\"a\":{\"A\":\"A\",\"b\":{\"B\":\"B\",\"c\":[\"bazhen\"]}}}")); + + configuration = Configuration + .from("{\"a\":{\"b\":{\"c\":[0],\"B\": \"B\"},\"A\": \"A\"}}"); + configuration + .merge(Configuration + .from("{\"a\":{\"b\":{\"c\":[\"bazhen\"],\"B\": \"B\",\"C\": \"C\"}}}"), + false); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration + .toJSON() + .equals("{\"a\":{\"A\":\"A\",\"b\":{\"B\":\"B\",\"C\":\"C\",\"c\":[0]}}}")); + } + + @Test + public void test_type() { + Configuration configuration = Configuration.from("{\"a\": 1}"); + Assert.assertTrue(configuration.getLong("a") == 1); + } + + @Test + public void test_beautify() { + Configuration configuration = Configuration + .from(ConfigurationTest.class.getClassLoader() + .getResourceAsStream("all.json")); + System.out.println(configuration.getConfiguration("job.content") + .beautify()); + } + + @SuppressWarnings("unchecked") + @Test + public void test() { + Configuration configuration = Configuration + .from(ConfigurationTest.class.getClassLoader() + .getResourceAsStream("all.json")); + System.out.println(configuration.toJSON()); + configuration.merge(Configuration.from(ConfigurationTest.class + .getClassLoader().getResourceAsStream("all.json")), true); + Assert.assertTrue(((List) configuration + .get("job.content[0].reader.parameter.jdbcUrl")).size() == 2); + + } + + @Test(expected = DataXException.class) + public void test_remove() { + Configuration configuration = Configuration.from("{\"a\": \"b\"}"); + configuration.remove("a"); + System.out.println(configuration.toJSON()); + 
Assert.assertTrue(configuration.toJSON().equals("{}")); + + configuration.set("a[1]", "b"); + System.out.println(configuration.toJSON()); + configuration.remove("a[1]"); + System.out.println(configuration.toJSON()); + Assert.assertTrue(configuration.toJSON().equals("{\"a\":[null,null]}")); + + configuration.set("a", "b"); + configuration.remove("b"); + } + + @Test + public void test_unescape() { + Configuration configuration = Configuration.from("{\"a\": \"\\t\"}"); + System.out.println("|" + configuration.getString("a") + "|"); + Assert.assertTrue("|\t|".equals("|" + configuration.getString("a") + + "|")); + + configuration = Configuration.from("{\"a\": \"\u0001\"}"); + Assert.assertTrue(configuration.getString("a").equals("\u0001")); + Assert.assertTrue(new String(new byte[] { 0x01 }).equals(configuration + .get("a"))); + + } + + @Test + public void test_list() { + Configuration configuration = Configuration.newDefault(); + List lists = new ArrayList(); + lists.add("bazhen.csy"); + configuration.set("a.b.c", lists); + System.out.println(configuration); + configuration.set("a.b.c.d", lists); + System.out.println(configuration); + } + + @Test + public void test_serialize() { + StringBuilder sb = new StringBuilder(); + for (int i = 1; i < 128; i++) { + sb.append((char) i); + } + + Configuration configuration = Configuration.newDefault(); + configuration.set("a", sb.toString()); + Configuration another = Configuration.from(configuration.toJSON()); + Assert.assertTrue(another.getString("a").equals(configuration.get("a"))); + } + + @Test + public void test_variable() { + Properties prop = new Properties(); + System.setProperties(prop); + System.setProperty("bizdate", "20141125"); + System.setProperty("errRec", "1"); + System.setProperty("errPercent", "0.5"); + String json = "{\n" + + " \"core\": {\n" + + " \"where\": \"gmt_modified >= ${bizdate}\"\n" + + " },\n" + + " \"errorLimit\": {\n" + + " \t\"record\": ${errRec},\n" + + " \t\"percentage\": ${errPercent}\n" + + " }\n" + + "}"; + Configuration conf = Configuration.from(json); + Assert.assertEquals("gmt_modified >= 20141125", conf.getString("core.where")); + Assert.assertEquals(Integer.valueOf(1), conf.getInt("errorLimit.record")); + Assert.assertEquals(Double.valueOf(0.5), conf.getDouble("errorLimit.percentage")); + + // 依然能够转回来 + Configuration.from(conf.toJSON()); + } + + @Test + public void test_secretKey() { + Configuration config = Configuration.newDefault(); + + String keyPath1 = "a.b.c"; + String keyPath2 = "a.b.c[2].d"; + config.addSecretKeyPath(keyPath1); + config.addSecretKeyPath(keyPath2); + + Assert.assertTrue(config.isSecretPath(keyPath1)); + Assert.assertTrue(config.isSecretPath(keyPath2)); + + Configuration configClone = config.clone(); + Assert.assertTrue(configClone.isSecretPath(keyPath1)); + Assert.assertTrue(configClone.isSecretPath(keyPath2)); + + config.setSecretKeyPathSet(new HashSet()); + Assert.assertTrue(configClone.isSecretPath(keyPath1)); + Assert.assertTrue(configClone.isSecretPath(keyPath2)); + } + + @Rule + public ExpectedException expectedEx = ExpectedException.none(); + + @Test + public void test_get_list() { + Configuration configuration = Configuration + .from(ConfigurationTest.class.getClassLoader() + .getResourceAsStream("all.json")); +// System.out.println(configuration.toJSON()); + + List noPathNameThis = configuration.get("job.no_path_named_this", List.class); + Assert.assertNull(noPathNameThis); + + noPathNameThis = configuration.getList("job.no_path_named_this", String.class); + 
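        // getList(...) on a missing path is expected to behave like get(...) here and
        // return null rather than throw, which the following assertion verifies.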
Assert.assertNull(noPathNameThis); + + System.out.println(configuration.getString("job.setting")); + + expectedEx.expect(ClassCastException.class); + expectedEx.expectMessage("com.alibaba.fastjson.JSONObject cannot be cast to java.util.List"); + List aStringCantConvertToList = configuration.getList("job.setting"); + } + + @Test + public void test_getNecessaryValue() { + Configuration configuration = Configuration.newDefault(); + configuration.set("a.b.c", "XX"); + configuration.set("x.y.z", "true"); + configuration.getNecessaryValue("a.b.c", CommonErrorCode.CONFIG_ERROR); + configuration.getNecessaryBool("x.y.z", CommonErrorCode.CONFIG_ERROR); + } + + + @Test + public void test_getNecessaryValue2() { + expectedEx.expect(DataXException.class); + Configuration configuration = Configuration.newDefault(); + configuration.set("x.y.z", "yes"); + configuration.getNecessaryBool("x.y.z", CommonErrorCode.CONFIG_ERROR); + } + + @Test + public void test_getNecessaryValue3() { + expectedEx.expect(DataXException.class); + Configuration configuration = Configuration.newDefault(); + configuration.getNecessaryBool("x.y.z", CommonErrorCode.CONFIG_ERROR); + } + +} diff --git a/common/src/test/java/com/alibaba/datax/common/util/FilterUtilTest.java b/common/src/test/java/com/alibaba/datax/common/util/FilterUtilTest.java new file mode 100755 index 000000000..cd756e70a --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/util/FilterUtilTest.java @@ -0,0 +1,169 @@ +package com.alibaba.datax.common.util; + +import org.junit.Assert; +import org.junit.BeforeClass; +import org.junit.Test; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; + +public class FilterUtilTest { + private static List ALL_STRS; + + @BeforeClass + public static void beforeClass() { + ALL_STRS = new ArrayList(); + ALL_STRS.add("pt=1/ds=hangzhou"); + ALL_STRS.add("pt=1/ds=shanghai"); + ALL_STRS.add("pt=2/ds2=hangzhou"); + } + + @Test + public void test00() { + String regular = "pt=[1|2]/ds=*"; + + List matched = FilterUtil.filterByRegular(ALL_STRS, regular); + + System.out.println("matched:" + matched); + List expected = new ArrayList(); + expected.add(ALL_STRS.get(0)); + expected.add(ALL_STRS.get(1)); + + Assert.assertEquals(expected.size(), matched.size()); + + Collections.sort(expected); + Collections.sort(matched); + Assert.assertArrayEquals(expected.toArray(), matched.toArray()); + } + + @Test + public void test01() { + String regular = "pt=[1|2]/ds=.*"; + + List matched = FilterUtil.filterByRegular(ALL_STRS, regular); + + System.out.println("matched:" + matched); + List expected = new ArrayList(); + expected.add(ALL_STRS.get(0)); + expected.add(ALL_STRS.get(1)); + + Assert.assertEquals(expected.size(), matched.size()); + + Collections.sort(expected); + Collections.sort(matched); + Assert.assertArrayEquals(expected.toArray(), matched.toArray()); + } + + @Test + public void test02() { + String regular = "pt=[1|2]/ds=.*"; + + List matched = FilterUtil.filterByRegular(ALL_STRS, regular); + + System.out.println("matched:" + matched); + List expected = new ArrayList(); + expected.add(ALL_STRS.get(0)); + expected.add(ALL_STRS.get(1)); + + Assert.assertEquals(expected.size(), matched.size()); + + Collections.sort(expected); + Collections.sort(matched); + Assert.assertArrayEquals(expected.toArray(), matched.toArray()); + } + + @Test + public void test03() { + String regular = "pt=*"; + + List matched = FilterUtil.filterByRegular(ALL_STRS, regular); + + System.out.println("matched:" + matched); + List 
expected = new ArrayList(ALL_STRS); + + Assert.assertEquals(expected.size(), matched.size()); + + Collections.sort(expected); + Collections.sort(matched); + Assert.assertArrayEquals(expected.toArray(), matched.toArray()); + } + + @Test + public void test04() { + String regular = "^pt=*"; + + List matched = FilterUtil.filterByRegular(ALL_STRS, regular); + + System.out.println("matched:" + matched); + List expected = new ArrayList(ALL_STRS); + + Assert.assertEquals(expected.size(), matched.size()); + + Collections.sort(expected); + Collections.sort(matched); + Assert.assertArrayEquals(expected.toArray(), matched.toArray()); + } + + @Test + public void test05() { + String regular = "pt=1/ds=s[a-z]*"; + + List matched = FilterUtil.filterByRegular(ALL_STRS, regular); + + System.out.println("matched:" + matched); + List expected = new ArrayList(); + expected.add(ALL_STRS.get(1)); + + Assert.assertEquals(expected.size(), matched.size()); + + Collections.sort(expected); + Collections.sort(matched); + Assert.assertArrayEquals(expected.toArray(), matched.toArray()); + } + + @Test + public void test06() { + // 两个规则,其中规则一匹配到1个,规则二匹配到2个。希望返回值为二者的并集 + List regulars = new ArrayList(); + String regular1 = "pt=1/ds=s[a-z]*"; + String regular2 = "pt=1/ds=*"; + regulars.add(regular1); + regulars.add(regular2); + + List matched = FilterUtil.filterByRegulars(ALL_STRS, regulars); + + System.out.println("matched:" + matched); + List expected = new ArrayList(); + expected.add(ALL_STRS.get(0)); + expected.add(ALL_STRS.get(1)); + + Assert.assertEquals(expected.size(), matched.size()); + + Collections.sort(expected); + Collections.sort(matched); + Assert.assertArrayEquals(expected.toArray(), matched.toArray()); + } + + @Test + public void test07() { + // 两个规则 一模一样,都是只能匹配到一个 + List regulars = new ArrayList(); + String regular1 = "pt=1/ds=s[a-z]*"; + String regular2 = "pt=1/ds=s[a-z]*"; + regulars.add(regular1); + regulars.add(regular2); + + List matched = FilterUtil.filterByRegulars(ALL_STRS, regulars); + + System.out.println("matched:" + matched); + List expected = new ArrayList(); + expected.add(ALL_STRS.get(1)); + + Assert.assertEquals(expected.size(), matched.size()); + + Collections.sort(expected); + Collections.sort(matched); + Assert.assertArrayEquals(expected.toArray(), matched.toArray()); + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/util/ListUtilTest.java b/common/src/test/java/com/alibaba/datax/common/util/ListUtilTest.java new file mode 100755 index 000000000..23691efd0 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/util/ListUtilTest.java @@ -0,0 +1,90 @@ +package com.alibaba.datax.common.util; + +import org.junit.Assert; +import org.junit.BeforeClass; +import org.junit.Test; + +import java.util.ArrayList; +import java.util.List; + +public class ListUtilTest { + private static List aList = null; + + @BeforeClass + public static void beforeClass() { + aList = new ArrayList(); + aList.add("one"); + aList.add("onE"); + aList.add("two"); + aList.add("阿里巴巴"); + } + + @Test + public void testCheckIfValueDuplicate() { + List list = new ArrayList(aList); + list.add(aList.get(0)); + boolean result = ListUtil.checkIfValueDuplicate(list, true); + Assert.assertTrue(list + " has no duplicate value.", result); + + list = new ArrayList(aList); + list.add(aList.get(0)); + result = ListUtil.checkIfValueDuplicate(list, false); + Assert.assertTrue(list + " has duplicate value.", result); + + + list = new ArrayList(aList); + list.add(aList.get(0)); + list.set(list.size() - 1, 
list.get(list.size() - 1).toUpperCase()); + result = ListUtil.checkIfValueDuplicate(list, true); + Assert.assertTrue(list + " has duplicate value.", result == false); + + list = new ArrayList(aList); + list.add(aList.get(0)); + list.set(list.size() - 1, list.get(list.size() - 1).toUpperCase()); + result = ListUtil.checkIfValueDuplicate(list, false); + Assert.assertTrue(list + " has duplicate value.", result); + } + + @Test + public void testValueToLowerCase() { + List list = new ArrayList(aList); + for (int i = 0, len = list.size(); i < len; i++) { + list.set(i, list.get(i).toLowerCase()); + } + + Assert.assertArrayEquals(list.toArray(), ListUtil.valueToLowerCase(list).toArray()); + } + + @Test + public void testCheckIfValueSame() { + List boolList = new ArrayList(); + boolList.add(true); + boolList.add(true); + boolList.add(true); + Assert.assertTrue(boolList + " all value same.", ListUtil.checkIfValueSame(boolList)); + + boolList.add(false); + Assert.assertTrue(boolList + "not all value same.", ListUtil.checkIfValueSame(boolList) == false); + } + + @Test + public void testCheckIfBInA() { + List bList = new ArrayList(aList); + bList.set(0, bList.get(0) + "_hello"); + Assert.assertTrue(bList + " not all in " + aList, ListUtil.checkIfBInA(aList, bList, false) == false); + + Assert.assertTrue(bList + " not all in " + aList, ListUtil.checkIfBInA(aList, bList, true) == false); + + + bList = new ArrayList(aList); + bList.set(0, bList.get(0).toUpperCase()); + Assert.assertTrue(bList + " all in " + aList, ListUtil.checkIfBInA(aList, bList, false)); + + + bList = new ArrayList(aList); + bList.set(0, bList.get(0).toUpperCase()); + Assert.assertTrue(bList + " not all in " + aList, ListUtil.checkIfBInA(aList, bList, true) == false); + + } + +} diff --git a/common/src/test/java/com/alibaba/datax/common/util/RangeSplitUtilTest.java b/common/src/test/java/com/alibaba/datax/common/util/RangeSplitUtilTest.java new file mode 100755 index 000000000..ec6e750a8 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/util/RangeSplitUtilTest.java @@ -0,0 +1,185 @@ +package com.alibaba.datax.common.util; + +import org.apache.commons.lang3.RandomStringUtils; +import org.apache.commons.lang3.builder.ToStringBuilder; +import org.apache.commons.lang3.builder.ToStringStyle; +import org.apache.commons.lang3.tuple.Pair; +import org.junit.Assert; +import org.junit.Test; + +import java.util.Arrays; +import java.util.Collections; +import java.util.Random; + +public class RangeSplitUtilTest { + + @Test + public void testSplitString() { + int expectSliceNumber = 3; + + String left = "00468374-8cdb-11e4-a66a-008cfac1c3b8"; + String right = "fcbc8a79-8427-11e4-a66a-008cfac1c3b8"; + + String[] result = RangeSplitUtil.doAsciiStringSplit(left, right, expectSliceNumber); + + Assert.assertTrue(result.length - 1 == expectSliceNumber); + System.out.println(Arrays.toString(result)); + } + + @Test + public void testSplitStringRandom() { + String left = RandomStringUtils.randomAlphanumeric(40); + String right = RandomStringUtils.randomAlphanumeric(40); + + for (int expectSliceNumber = 1; expectSliceNumber < 100; expectSliceNumber++) { + String[] result = RangeSplitUtil.doAsciiStringSplit(left, right, expectSliceNumber); + + Assert.assertTrue(result.length - 1 == expectSliceNumber); + + String[] clonedResult = result.clone(); +// Collections.sort(Arrays.asList(result)); + + Assert.assertTrue(Arrays.toString(clonedResult).equals(Arrays.toString(result))); + + System.out.println(result); + } + } + + //TODO + @Test + public void 
testLong_00() { + long count = 0; + long left = 0; + long right = count - 1; + int expectSliceNumber = 3; + long[] result = RangeSplitUtil.doLongSplit(left, right, expectSliceNumber); + + result[result.length - 1]++; + for (int i = 0; i < result.length - 1; i++) { + System.out.println("start:" + result[i] + " count:" + (result[i + 1] - result[i])); + } + +// Assert.assertTrue(result.length - 1 == expectSliceNumber); + System.out.println(Arrays.toString(result)); + } + + @Test + public void testLong_01() { + long count = 8; + long left = 0; + long right = count - 1; + int expectSliceNumber = 3; + long[] result = RangeSplitUtil.doLongSplit(left, right, expectSliceNumber); + + result[result.length - 1]++; + for (int i = 0; i < result.length - 1; i++) { + System.out.println("start:" + result[i] + " count:" + (result[i + 1] - result[i])); + } + + Assert.assertTrue(result.length - 1 == expectSliceNumber); + System.out.println(Arrays.toString(result)); + } + + @Test + public void testLong() { + long left = 8L; + long right = 301L; + int expectSliceNumber = 93; + doTest(left, right, expectSliceNumber); + + for (int i = 1; i < right * 20; i++) { + doTest(left, right, i); + } + + System.out.println(" 测试随机值..."); + int testTimes = 200; + for (int i = 0; i < testTimes; i++) { + left = getRandomLong(); + right = getRandomLong(); + expectSliceNumber = getRandomInteger(); + doTest(left, right, expectSliceNumber); + } + + } + + + @Test + public void testGetMinAndMaxCharacter() { + Pair result = RangeSplitUtil.getMinAndMaxCharacter("abc%^&"); + Assert.assertEquals('%', result.getLeft().charValue()); + Assert.assertEquals('c', result.getRight().charValue()); + + result = RangeSplitUtil.getMinAndMaxCharacter("\tAabcZx"); + Assert.assertEquals('\t', result.getLeft().charValue()); + Assert.assertEquals('x', result.getRight().charValue()); + } + + + //TODO 自动化测试 + @Test + public void testDoAsciiStringSplit() { +// String left = "adde"; +// String right = "xyz"; +// int expectSliceNumber = 4; + String left = "a"; + String right = "z"; + int expectSliceNumber = 3; + + String[] result = RangeSplitUtil.doAsciiStringSplit(left, right, expectSliceNumber); + System.out.println(ToStringBuilder.reflectionToString(result, ToStringStyle.SIMPLE_STYLE)); + + } + + private long getRandomLong() { + Random r = new Random(); + return r.nextLong(); + } + + private int getRandomInteger() { + Random r = new Random(); + return Math.abs(r.nextInt(1000) + 1); + } + + private void doTest(long left, long right, int expectSliceNumber) { + long[] result = RangeSplitUtil.doLongSplit(left, right, expectSliceNumber); + + System.out.println(String.format("left:[%s],right:[%s],expectSliceNumber:[%s]====> splitResult:[\n%s\n].\n", + left, right, expectSliceNumber, ToStringBuilder.reflectionToString(result, ToStringStyle.SIMPLE_STYLE))); + + Assert.assertTrue(doCheck(result, left, right, Math.abs(right - left) > + expectSliceNumber ? expectSliceNumber : -1)); + } + + + private boolean doCheck(long[] result, long left, + long right, int expectSliceNumber) { + if (null == result) { + throw new IllegalArgumentException("parameter result can not be null."); + } + + // 调整大小顺序,确保 left right) { + long temp = left; + left = right; + right = temp; + } + + //为了方法共用,expectSliceNumber == -1 表示不对切分份数进行校验. 
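        // i.e. expectSliceNumber == -1 means the slice-count check is skipped, so this
        // helper can also be reused by tests that only validate endpoints and ordering.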
+ boolean skipSliceNumberCheck = expectSliceNumber == -1; + if (skipSliceNumberCheck || expectSliceNumber == result.length - 1) { + boolean leftCheckOk = left == result[0]; + boolean rightCheckOk = right == result[result.length - 1]; + + if (leftCheckOk && rightCheckOk) { + for (int i = 0, len = result.length; i < len - 1; i++) { + if (result[i] > result[i + 1]) { + return false; + } + } + return true; + } + } + + return false; + } +} diff --git a/common/src/test/java/com/alibaba/datax/common/util/RetryUtilTest.java b/common/src/test/java/com/alibaba/datax/common/util/RetryUtilTest.java new file mode 100755 index 000000000..f3d2ab162 --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/util/RetryUtilTest.java @@ -0,0 +1,259 @@ +package com.alibaba.datax.common.util; + +import org.hamcrest.core.StringContains; +import org.junit.Assert; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.ExpectedException; + +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.*; +import java.util.concurrent.atomic.AtomicInteger; + +public class RetryUtilTest { + + private static String OK = "I am ok now."; + + private static String BAD = "I am bad now."; + + + /** + * 模拟一个不靠谱的方法,其不靠谱体现在:调用它,前2次必定失败,第3次才能成功. 运行成功时,输出为:I am ok now. + * 运行报错时,报错中信息为:I am bad now. + */ + static class SomeService implements Callable { + private int i = 0; + + @Override + public String call() throws Exception { + i++; + if (i <= 2) { + throw new Exception(BAD); + } + return OK; + } + } + + @Test(timeout = 3000L) + public void test1() throws Exception { + long startTime = System.currentTimeMillis(); + + String result = RetryUtil.executeWithRetry(new SomeService(), 3, 1000L, + false); + long endTime = System.currentTimeMillis(); + Assert.assertEquals(result, OK); + long executeTime = endTime - startTime; + + System.out.println("executeTime:" + executeTime); + Assert.assertTrue(executeTime < 3 * 1000L); + } + + @Test(timeout = 3000L) + public void test2() throws Exception { + long startTime = System.currentTimeMillis(); + String result = RetryUtil.executeWithRetry(new SomeService(), 4, 1000L, + false); + long endTime = System.currentTimeMillis(); + Assert.assertEquals(result, OK); + long executeTime = endTime - startTime; + + System.out.println("executeTime:" + executeTime); + Assert.assertTrue(executeTime < 3 * 1000L); + } + + @Test(timeout = 3000L) + public void test3() throws Exception { + long startTime = System.currentTimeMillis(); + String result = RetryUtil.executeWithRetry(new SomeService(), 40, + 1000L, false); + long endTime = System.currentTimeMillis(); + Assert.assertEquals(result, OK); + long executeTime = endTime - startTime; + + System.out.println("executeTime:" + executeTime); + Assert.assertTrue(executeTime < 3 * 1000L); + } + + @Test(timeout = 4000L) + public void test4() throws Exception { + long startTime = System.currentTimeMillis(); + String result = RetryUtil.executeWithRetry(new SomeService(), 40, + 1000L, true); + long endTime = System.currentTimeMillis(); + Assert.assertEquals(result, OK); + long executeTime = endTime - startTime; + + System.out.println("executeTime:" + executeTime); + Assert.assertTrue(executeTime < 4 * 1000L); + Assert.assertTrue(executeTime > 3 * 1000L); + } + + @Rule + public ExpectedException expectedEx = ExpectedException.none(); + + @Test(timeout = 3000L) + public void test5() throws Exception { + expectedEx.expect(Exception.class); + expectedEx.expectMessage(StringContains.containsString(BAD)); + + 
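        // SomeService only succeeds on its third call, but only 2 attempts are allowed
        // here, so the retry is expected to give up and rethrow the "I am bad now." error.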
RetryUtil.executeWithRetry(new SomeService(), 2, 100L, false); + } + + /** + * 线程池无法释放,后续提交被拒绝 + * + * @throws Exception + */ + @Test + public void testExecutorService线程池占满() throws Exception { + ThreadPoolExecutor executor = RetryUtil.createThreadPoolExecutor(); + expectedEx.expect(RejectedExecutionException.class); + for (int i = 0; i < 10; i++) { + executor.submit(new Callable() { + @Override + public Object call() throws Exception { + TimeUnit.SECONDS.sleep(10); + return null; + } + }); + System.out.println("Submit: " + i + ", running tasks: " + executor.getActiveCount()); + } + + } + + /** + * 保持有任务运行,最多4个,所有提交过来的任务都能运行 + * + * @throws Exception + */ + @Test + public void testExecutorService正常运行() throws Exception { + ThreadPoolExecutor executor = RetryUtil.createThreadPoolExecutor(); + for (int i = 0; i < 10; i++) { + executor.submit(new Callable() { + @Override + public Object call() throws Exception { + TimeUnit.SECONDS.sleep(4); + return null; + } + }); + System.out.println("Submit: " + i + ", running tasks: " + executor.getActiveCount()); + TimeUnit.SECONDS.sleep(1); + } + } + + /** + * 线程池没有被全部占用,但是正在运行的总数超过限制,后续提交拒绝 + * + * @throws Exception + */ + @Test + public void testExecutorService正在运行的总数超过限制() throws Exception { + ThreadPoolExecutor executor = RetryUtil.createThreadPoolExecutor(); + expectedEx.expect(RejectedExecutionException.class); + for (int i = 0; i < 10; i++) { + executor.submit(new Callable() { + @Override + public Object call() throws Exception { + TimeUnit.SECONDS.sleep(6); + return null; + } + }); + System.out.println("Submit: " + i + ", running tasks: " + executor.getActiveCount()); + TimeUnit.SECONDS.sleep(1); + } + } + + @Test + public void testExecutorService取消正在运行的任务() throws Exception { + ThreadPoolExecutor executor = RetryUtil.createThreadPoolExecutor(); + List> futures = new ArrayList>(10); + for (int i = 0; i < 10; i++) { + Future f = executor.submit(new Callable() { + @Override + public Object call() throws Exception { + TimeUnit.SECONDS.sleep(6); + return null; + } + }); + futures.add(f); + System.out.println("Submit: " + i + ", running tasks: " + executor.getActiveCount()); + + if (i == 4) { + for (Future future : futures) { + future.cancel(true); + } + System.out.println("Cancel all"); + System.out.println("Submit: " + i + ", running tasks: " + executor.getActiveCount()); + } + + TimeUnit.SECONDS.sleep(1); + } + } + + @Test + public void testExecutorService取消方式错误() throws Exception { + expectedEx.expect(RejectedExecutionException.class); + + ThreadPoolExecutor executor = RetryUtil.createThreadPoolExecutor(); + + List> futures = new ArrayList>(10); + for (int i = 0; i < 10; i++) { + Future f = executor.submit(new Callable() { + @Override + public Object call() throws Exception { + TimeUnit.SECONDS.sleep(6); + return null; + } + }); + futures.add(f); + System.out.println("Submit: " + i + ", running tasks: " + executor.getActiveCount()); + + if (i == 4) { + for (Future future : futures) { + future.cancel(false); + } + System.out.println("Cancel all"); + } + + TimeUnit.SECONDS.sleep(1); + } + } + + @Test + public void testRetryAsync() throws Exception { + ThreadPoolExecutor executor = RetryUtil.createThreadPoolExecutor(); + final AtomicInteger runCnt = new AtomicInteger(); + String res = RetryUtil.asyncExecuteWithRetry(new Callable() { + @Override + public String call() throws Exception { + runCnt.incrementAndGet(); + if (runCnt.get() < 3) { + TimeUnit.SECONDS.sleep(10); + } else { + TimeUnit.SECONDS.sleep(1); + } + + return OK; + } + }, 3, 1000L, 
false, 2000L, executor); + Assert.assertEquals(res, OK); +// Assert.assertEquals(RetryUtil.EXECUTOR.getActiveCount(), 0); + } + + + @Test + public void testRetryAsync2() throws Exception { + expectedEx.expect(TimeoutException.class); + ThreadPoolExecutor executor = RetryUtil.createThreadPoolExecutor(); + String res = RetryUtil.asyncExecuteWithRetry(new Callable() { + @Override + public String call() throws Exception { + TimeUnit.SECONDS.sleep(10); + return OK; + } + }, 3, 1000L, false, 2000L, executor); + } + +} diff --git a/common/src/test/java/com/alibaba/datax/common/util/StrUtilTest.java b/common/src/test/java/com/alibaba/datax/common/util/StrUtilTest.java new file mode 100644 index 000000000..42b9aba9b --- /dev/null +++ b/common/src/test/java/com/alibaba/datax/common/util/StrUtilTest.java @@ -0,0 +1,59 @@ +package com.alibaba.datax.common.util; + +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.ExpectedException; + +import java.util.Properties; + +import static org.junit.Assert.assertEquals; + +public class StrUtilTest { + + @Rule + public ExpectedException ex= ExpectedException.none(); + + + @Test + public void testReplaceVariable() throws Exception { + Properties valuesMap = System.getProperties(); + valuesMap.put("animal", "quick brown fox"); + valuesMap.put("target", "lazy dog"); + String templateString = "The $animal jumped over the ${target}."; + String resolvedString =StrUtil.replaceVariable(templateString); + System.out.println(resolvedString); + assertEquals(resolvedString, "The quick brown fox jumped over the lazy dog."); + } + + @Test + public void testCompressMiddle() throws Exception { + assertEquals(StrUtil.compressMiddle("0123456789", 2, 2), "01...89"); + assertEquals(StrUtil.compressMiddle("0123456789", 5, 5), "0123456789"); + assertEquals(StrUtil.compressMiddle("0123456789", 6, 7), "0123456789"); + assertEquals(StrUtil.compressMiddle("0123456789", 10, 1), "0123456789"); + assertEquals(StrUtil.compressMiddle("0123456789", 20, 2), "0123456789"); + assertEquals(StrUtil.compressMiddle("0123456789", 2, 20), "0123456789"); + } + + + @Test + public void testCompressMiddleFailed() throws Exception { + ex.expect(NullPointerException.class); + StrUtil.compressMiddle(null, 2, 20); + } + + @Test + public void testCompressMiddleFailed2() throws Exception { + ex.expect(IllegalArgumentException.class); + StrUtil.compressMiddle("sssss", 0, 20); + } + + @Test + public void testCompressMiddleFailed3() throws Exception { + ex.expect(IllegalArgumentException.class); + StrUtil.compressMiddle("sfsdfsd", 2, -1); + } + + + +} \ No newline at end of file diff --git a/common/src/test/resources/all.json b/common/src/test/resources/all.json new file mode 100755 index 000000000..19c5ad977 --- /dev/null +++ b/common/src/test/resources/all.json @@ -0,0 +1,151 @@ + +{ + "entry": { + "jvm": "-Xms1G -Xmx1G", + "environment": { + "PATH": "/home/admin", + "DATAX_HOME": "/home/admin" + } + }, + "common": { + "column": { + "datetimeFormat": "yyyy-MM-dd HH:mm:ss", + "timeFormat": "HH:mm:ss", + "dateFormat": "yyyy-MM-dd", + "extraFormats":["yyyyMMdd"], + "timeZone": "GMT+8", + "encoding": "utf-8" + } + }, + "core": { + "transport": { + "channel": { + "class": "com.alibaba.datax.core.transport.channel.memory.MemoryChannel", + "speed": { + "byte": 1048576 + }, + "capacity": 32 + }, + "exchanger": { + "class": "com.alibaba.datax.core.plugin.BufferedRecordExchanger", + "bufferSize": 32 + } + }, + "container": { + "job": { + "reportInterval": 1000 + }, + "taskGroup": { + "channel": 3 + } + 
}, + "statistics": { + "collector": { + "plugin": { + "taskClass": "com.alibaba.datax.core.statistics.plugin.task.StdoutPluginCollector", + "maxDirtyNumber": 1000 + } + } + } + }, + "plugin": { + "reader": { + "mysqlreader": { + "name": "fakereader", + "class": "com.alibaba.datax.plugins.reader.fakereader.FakeReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + }, + "oraclereader": { + "name": "oraclereader", + "class": "com.alibaba.datax.plugins.reader.oraclereader.OracleReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + }, + "fakereader": { + "name": "fakereader", + "class": "com.alibaba.datax.core.faker.FakeReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + } + }, + "writer": { + "fakewriter": { + "name": "fakewriter", + "class": "com.alibaba.datax.core.faker.FakeWriter", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + } + }, + "transformer": { + "groovyTranformer": {} + } + }, + "job": { + "setting": { + "speed": { + "byte": 104857600 + }, + "errorLimit": { + "record": null, + "percentage": null + } + }, + "content": [ + { + "reader": { + "name": "fakereader", + "parameter": { + "jdbcUrl": [ + [ + "jdbc:mysql://localhost:3305/db1", + "jdbc:mysql://localhost:3306/db1" + ], + [ + "jdbc:mysql://localhost:3305/db2", + "jdbc:mysql://localhost:3306/db2" + ] + ], + "table": [ + "bazhen_[0-15]", + "bazhen_[15-31]" + ] + } + }, + "writer": { + "name": "fakewriter", + "parameter": { + "column": [ + { + "type": "string", + "name": "id" + }, + { + "type": "int", + "name": "age" + } + ], + "encode": "utf-8", + "hbase-conf": "/home/hbase/hbase-conf.xml" + } + } + } + ] + } +} \ No newline at end of file diff --git a/common/src/test/resources/logback-test.xml b/common/src/test/resources/logback-test.xml new file mode 100644 index 000000000..d91666eba --- /dev/null +++ b/common/src/test/resources/logback-test.xml @@ -0,0 +1,18 @@ + + + + + + + + UTF-8 + + %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n + + + + + + + + \ No newline at end of file diff --git a/common/src/test/resources/logback.xml b/common/src/test/resources/logback.xml new file mode 100644 index 000000000..d91666eba --- /dev/null +++ b/common/src/test/resources/logback.xml @@ -0,0 +1,18 @@ + + + + + + + + UTF-8 + + %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n + + + + + + + + \ No newline at end of file diff --git a/core/datax-core.iml b/core/datax-core.iml new file mode 100644 index 000000000..503cdb82b --- /dev/null +++ b/core/datax-core.iml @@ -0,0 +1,47 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/core/pom.xml b/core/pom.xml new file mode 100755 index 000000000..ba8dcf86f --- /dev/null +++ b/core/pom.xml @@ -0,0 +1,145 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + datax-core + 
datax-core + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + commons-configuration + commons-configuration + ${commons-configuration-version} + + + commons-cli + commons-cli + ${commons-cli-version} + + + commons-beanutils + commons-beanutils + 1.9.2 + + + org.apache.httpcomponents + httpclient + 4.4 + + + org.apache.httpcomponents + fluent-hc + 4.4 + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + org.codehaus.janino + janino + 2.5.16 + + + + junit + junit + test + + + + org.mockito + mockito-core + 1.8.5 + test + + + org.powermock + powermock-api-mockito + 1.4.10 + test + + + + org.powermock + powermock-module-junit4 + 1.4.10 + test + + + org.apache.commons + commons-lang3 + 3.3.2 + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + + com.alibaba.datax.core.Engine + + + + + + + maven-assembly-plugin + + + + com.alibaba.datax.core.Engine + + + datax + + src/main/assembly/package.xml + + + + + + package + + single + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + diff --git a/core/src/main/assembly/package.xml b/core/src/main/assembly/package.xml new file mode 100755 index 000000000..7369f5635 --- /dev/null +++ b/core/src/main/assembly/package.xml @@ -0,0 +1,98 @@ + + + + dir + + false + + + + src/main/bin + + *.* + + + *.pyc + + 775 + /bin + + + + src/main/script + + *.* + + 775 + /script + + + + src/main/conf + + *.* + + /conf + + + + target/ + + datax-core-0.0.1-SNAPSHOT.jar + + /lib + + + + + + + + + + + + + + + + + + + + src/main/job/ + + *.json + + /job + + + + src/main/tools/ + + *.* + + /tools + + + + 777 + src/main/tmp + + *.* + + /tmp + + + + + + false + /lib + runtime + + + diff --git a/core/src/main/bin/datax.py b/core/src/main/bin/datax.py new file mode 100755 index 000000000..3e50dc636 --- /dev/null +++ b/core/src/main/bin/datax.py @@ -0,0 +1,216 @@ +#!/usr/bin/env python +# -*- coding:utf-8 -*- + +import sys +import os +import signal +import subprocess +import time +import re +import socket +import json +from optparse import OptionParser +from optparse import OptionGroup +from string import Template + +DATAX_HOME = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) + +DATAX_VERSION = 'UNKNOWN_DATAX_VERSION' +CLASS_PATH = ("%s/lib/*:.") % (DATAX_HOME) +LOGBACK_FILE = ("%s/conf/logback.xml") % (DATAX_HOME) +DEFAULT_JVM = "-Xms1g -Xmx1g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=%s/log" % (DATAX_HOME) +DEFAULT_PROPERTY_CONF = "-Dfile.encoding=UTF-8 -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Ddatax.home=%s -Dlogback.configurationFile=%s" % ( + DATAX_HOME, LOGBACK_FILE) +ENGINE_COMMAND = "java -server ${jvm} %s -classpath %s ${params} com.alibaba.datax.core.Engine -mode ${mode} -jobid ${jobid} -job ${job}" % ( + DEFAULT_PROPERTY_CONF, CLASS_PATH) +REMOTE_DEBUG_CONFIG = "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=9999" + +RET_STATE = { + "KILL": 143, + "FAIL": -1, + "OK": 0, + "RUN": 1, + "RETRY": 2 +} + + +def getLocalIp(): + try: + return socket.gethostbyname(socket.getfqdn(socket.gethostname())) + except: + return "Unknown" + + +def suicide(signum, e): + global child_process + print >> sys.stderr, "[Error] DataX receive unexpected signal %d, starts to suicide." % (signum) + + if child_process: + child_process.send_signal(signal.SIGQUIT) + time.sleep(1) + child_process.kill() + print >> sys.stderr, "DataX Process was killed ! you did ?" 
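    # Exit with the dedicated KILL code (143) so callers can tell a signal-triggered
    # shutdown apart from an ordinary failure (FAIL = -1).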
+ sys.exit(RET_STATE["KILL"]) + + +def register_signal(): + global child_process + signal.signal(2, suicide) + signal.signal(3, suicide) + signal.signal(15, suicide) + + +def getOptionParser(): + usage = "usage: %prog [options] job-url-or-path" + parser = OptionParser(usage=usage) + + prodEnvOptionGroup = OptionGroup(parser, "Product Env Options", + "Normal user use these options to set jvm parameters, job runtime mode etc. " + "Make sure these options can be used in Product Env.") + prodEnvOptionGroup.add_option("-j", "--jvm", metavar="", dest="jvmParameters", action="store", + default=DEFAULT_JVM, help="Set jvm parameters if necessary.") + prodEnvOptionGroup.add_option("--jobid", metavar="", dest="jobid", action="store", default="-1", + help="Set job unique id when running by Distribute/Local Mode.") + prodEnvOptionGroup.add_option("-m", "--mode", metavar="", + action="store", default="standalone", + help="Set job runtime mode such as: standalone, local, distribute. " + "Default mode is standalone.") + prodEnvOptionGroup.add_option("-p", "--params", metavar="", + action="store", dest="params", + help='Set job parameter, eg: the source tableName you want to set it by command, ' + 'then you can use like this: -v"-DtableName=you-wanted-table-name". ' + 'Note: you should config in you job tableName with ${tableName}.') + prodEnvOptionGroup.add_option("-r", "--reader", metavar="", + action="store", dest="reader",type="string", + help='View job config[reader] template, eg: mysqlreader,streamreader') + prodEnvOptionGroup.add_option("-w", "--writer", metavar="", + action="store", dest="writer",type="string", + help='View job config[writer] template, eg: mysqlwriter,streamwriter') + parser.add_option_group(prodEnvOptionGroup) + + devEnvOptionGroup = OptionGroup(parser, "Develop/Debug Options", + "Developer use these options to trace more details of DataX.") + devEnvOptionGroup.add_option("-d", "--debug", dest="remoteDebug", action="store_true", + help="Set to remote debug mode.") + devEnvOptionGroup.add_option("--loglevel", metavar="", dest="loglevel", action="store", + default="info", help="Set log level such as: debug, info, all etc.") + parser.add_option_group(devEnvOptionGroup) + return parser + +def generateJobConfigTemplate(reader, writer): + readerRef = "Please refer to the %s document:\n https://github.com/alibaba/DataX/blob/master/%s/doc/%s.md \n" % (reader,reader,reader) + writerRef = "Please refer to the %s document:\n https://github.com/alibaba/DataX/blob/master/%s/doc/%s.md \n " % (writer,writer,writer) + print readerRef + print writerRef + jobGuid = 'Please save the following configuration as a json file and use\n python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json \nto run the job.\n' + print jobGuid + jobTemplate={ + "job": { + "setting": { + "speed": { + "channel": "" + } + }, + "content": [ + { + "reader": {}, + "writer": {} + } + ] + } + } + readerTemplatePath = "%s/plugin/reader/%s/plugin_job_template.json" % (DATAX_HOME,reader) + writerTemplatePath = "%s/plugin/writer/%s/plugin_job_template.json" % (DATAX_HOME,writer) + try: + readerPar = readPluginTemplate(readerTemplatePath); + except Exception, e: + print "Read reader[%s] template error: can\'t find file %s" % (reader,readerTemplatePath) + try: + writerPar = readPluginTemplate(writerTemplatePath); + except Exception, e: + print "Read writer[%s] template error: : can\'t find file %s" % (writer,writerTemplatePath) + jobTemplate['job']['content'][0]['reader'] = readerPar; + jobTemplate['job']['content'][0]['writer'] = 
writerPar; + print json.dumps(jobTemplate, indent=4, sort_keys=True) + +def readPluginTemplate(plugin): + with open(plugin, 'r') as f: + return json.load(f) + +def isUrl(path): + if not path: + return False + + assert (isinstance(path, str)) + m = re.match(r"^http[s]?://\S+\w*", path.lower()) + if m: + return True + else: + return False + + +def buildStartCommand(options, args): + commandMap = {} + tempJVMCommand = DEFAULT_JVM + if options.jvmParameters: + tempJVMCommand = tempJVMCommand + " " + options.jvmParameters + + if options.remoteDebug: + tempJVMCommand = tempJVMCommand + " " + REMOTE_DEBUG_CONFIG + print 'local ip: ', getLocalIp() + + if options.loglevel: + tempJVMCommand = tempJVMCommand + " " + ("-Dloglevel=%s" % (options.loglevel)) + + if options.mode: + commandMap["mode"] = options.mode + + # jobResource 可能是 URL,也可能是本地文件路径(相对,绝对) + jobResource = args[0] + if not isUrl(jobResource): + jobResource = os.path.abspath(jobResource) + if jobResource.lower().startswith("file://"): + jobResource = jobResource[len("file://"):] + + jobParams = ("-Dlog.file.name=%s") % (jobResource[-20:].replace('/', '_').replace('.', '_')) + if options.params: + jobParams = jobParams + " " + options.params + + if options.jobid: + commandMap["jobid"] = options.jobid + + commandMap["jvm"] = tempJVMCommand + commandMap["params"] = jobParams + commandMap["job"] = jobResource + + return Template(ENGINE_COMMAND).substitute(**commandMap) + + +def printCopyright(): + print ''' +DataX (%s), From Alibaba ! +Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved. + +''' % DATAX_VERSION + sys.stdout.flush() + + +if __name__ == "__main__": + printCopyright() + parser = getOptionParser() + options, args = parser.parse_args(sys.argv[1:]) + if options.reader is not None and options.writer is not None: + generateJobConfigTemplate(options.reader,options.writer) + sys.exit(RET_STATE['OK']) + if len(args) != 1: + parser.print_help() + sys.exit(RET_STATE['FAIL']) + + startCommand = buildStartCommand(options, args) + # print startCommand + + child_process = subprocess.Popen(startCommand, shell=True) + register_signal() + (stdout, stderr) = child_process.communicate() + + sys.exit(child_process.returncode) diff --git a/core/src/main/bin/dxprof.py b/core/src/main/bin/dxprof.py new file mode 100644 index 000000000..181bf9008 --- /dev/null +++ b/core/src/main/bin/dxprof.py @@ -0,0 +1,191 @@ +#! 
/usr/bin/env python +# vim: set expandtab tabstop=4 shiftwidth=4 foldmethod=marker nu: + +import re +import sys +import time + +REG_SQL_WAKE = re.compile(r'Begin\s+to\s+read\s+record\s+by\s+Sql', re.IGNORECASE) +REG_SQL_DONE = re.compile(r'Finished\s+read\s+record\s+by\s+Sql', re.IGNORECASE) +REG_SQL_PATH = re.compile(r'from\s+(\w+)(\s+where|\s*$)', re.IGNORECASE) +REG_SQL_JDBC = re.compile(r'jdbcUrl:\s*\[(.+?)\]', re.IGNORECASE) +REG_SQL_UUID = re.compile(r'(\d+\-)+reader') +REG_COMMIT_UUID = re.compile(r'(\d+\-)+writer') +REG_COMMIT_WAKE = re.compile(r'begin\s+to\s+commit\s+blocks', re.IGNORECASE) +REG_COMMIT_DONE = re.compile(r'commit\s+blocks\s+ok', re.IGNORECASE) + +# {{{ function parse_timestamp() # +def parse_timestamp(line): + try: + ts = int(time.mktime(time.strptime(line[0:19], '%Y-%m-%d %H:%M:%S'))) + except: + ts = 0 + + return ts + +# }}} # + +# {{{ function parse_query_host() # +def parse_query_host(line): + ori = REG_SQL_JDBC.search(line) + if (not ori): + return '' + + ori = ori.group(1).split('?')[0] + off = ori.find('@') + if (off > -1): + ori = ori[off+1:len(ori)] + else: + off = ori.find('//') + if (off > -1): + ori = ori[off+2:len(ori)] + + return ori.lower() +# }}} # + +# {{{ function parse_query_table() # +def parse_query_table(line): + ori = REG_SQL_PATH.search(line) + return (ori and ori.group(1).lower()) or '' +# }}} # + +# {{{ function parse_reader_task() # +def parse_task(fname): + global LAST_SQL_UUID + global LAST_COMMIT_UUID + global DATAX_JOBDICT + global DATAX_JOBDICT_COMMIT + global UNIXTIME + LAST_SQL_UUID = '' + DATAX_JOBDICT = {} + LAST_COMMIT_UUID = '' + DATAX_JOBDICT_COMMIT = {} + + UNIXTIME = int(time.time()) + with open(fname, 'r') as f: + for line in f.readlines(): + line = line.strip() + + if (LAST_SQL_UUID and (LAST_SQL_UUID in DATAX_JOBDICT)): + DATAX_JOBDICT[LAST_SQL_UUID]['host'] = parse_query_host(line) + LAST_SQL_UUID = '' + + if line.find('CommonRdbmsReader$Task') > 0: + parse_read_task(line) + elif line.find('commit blocks') > 0: + parse_write_task(line) + else: + continue +# }}} # + +# {{{ function parse_read_task() # +def parse_read_task(line): + ser = REG_SQL_UUID.search(line) + if not ser: + return + + LAST_SQL_UUID = ser.group() + if REG_SQL_WAKE.search(line): + DATAX_JOBDICT[LAST_SQL_UUID] = { + 'stat' : 'R', + 'wake' : parse_timestamp(line), + 'done' : UNIXTIME, + 'host' : parse_query_host(line), + 'path' : parse_query_table(line) + } + elif ((LAST_SQL_UUID in DATAX_JOBDICT) and REG_SQL_DONE.search(line)): + DATAX_JOBDICT[LAST_SQL_UUID]['stat'] = 'D' + DATAX_JOBDICT[LAST_SQL_UUID]['done'] = parse_timestamp(line) +# }}} # + +# {{{ function parse_write_task() # +def parse_write_task(line): + ser = REG_COMMIT_UUID.search(line) + if not ser: + return + + LAST_COMMIT_UUID = ser.group() + if REG_COMMIT_WAKE.search(line): + DATAX_JOBDICT_COMMIT[LAST_COMMIT_UUID] = { + 'stat' : 'R', + 'wake' : parse_timestamp(line), + 'done' : UNIXTIME, + } + elif ((LAST_COMMIT_UUID in DATAX_JOBDICT_COMMIT) and REG_COMMIT_DONE.search(line)): + DATAX_JOBDICT_COMMIT[LAST_COMMIT_UUID]['stat'] = 'D' + DATAX_JOBDICT_COMMIT[LAST_COMMIT_UUID]['done'] = parse_timestamp(line) +# }}} # + +# {{{ function result_analyse() # +def result_analyse(): + def compare(a, b): + return b['cost'] - a['cost'] + + tasklist = [] + hostsmap = {} + statvars = {'sum' : 0, 'cnt' : 0, 'svr' : 0, 'max' : 0, 'min' : int(time.time())} + tasklist_commit = [] + statvars_commit = {'sum' : 0, 'cnt' : 0} + + for idx in DATAX_JOBDICT: + item = DATAX_JOBDICT[idx] + item['uuid'] = idx; + 
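        # cost is the elapsed seconds between the "Begin to read" and "Finished read"
        # log lines parsed for this reader task; done falls back to the current time
        # if the task never logged a finish line.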
item['cost'] = item['done'] - item['wake'] + tasklist.append(item); + + if (not (item['host'] in hostsmap)): + hostsmap[item['host']] = 1 + statvars['svr'] += 1 + + if (item['cost'] > -1 and item['cost'] < 864000): + statvars['sum'] += item['cost'] + statvars['cnt'] += 1 + statvars['max'] = max(statvars['max'], item['done']) + statvars['min'] = min(statvars['min'], item['wake']) + + for idx in DATAX_JOBDICT_COMMIT: + itemc = DATAX_JOBDICT_COMMIT[idx] + itemc['uuid'] = idx + itemc['cost'] = itemc['done'] - itemc['wake'] + tasklist_commit.append(itemc) + + if (itemc['cost'] > -1 and itemc['cost'] < 864000): + statvars_commit['sum'] += itemc['cost'] + statvars_commit['cnt'] += 1 + + ttl = (statvars['max'] - statvars['min']) or 1 + idx = float(statvars['cnt']) / (statvars['sum'] or ttl) + + tasklist.sort(compare) + for item in tasklist: + print '%s\t%s.%s\t%s\t%s\t% 4d\t% 2.1f%%\t% .2f' %(item['stat'], item['host'], item['path'], + time.strftime('%H:%M:%S', time.localtime(item['wake'])), + (('D' == item['stat']) and time.strftime('%H:%M:%S', time.localtime(item['done']))) or '--', + item['cost'], 100 * item['cost'] / ttl, idx * item['cost']) + + if (not len(tasklist) or not statvars['cnt']): + return + + print '\n--- DataX Profiling Statistics ---' + print '%d task(s) on %d server(s), Total elapsed %d second(s), %.2f second(s) per task in average' %(statvars['cnt'], + statvars['svr'], statvars['sum'], float(statvars['sum']) / statvars['cnt']) + print 'Actually cost %d second(s) (%s - %s), task concurrency: %.2f, tilt index: %.2f' %(ttl, + time.strftime('%H:%M:%S', time.localtime(statvars['min'])), + time.strftime('%H:%M:%S', time.localtime(statvars['max'])), + float(statvars['sum']) / ttl, idx * tasklist[0]['cost']) + + idx_commit = float(statvars_commit['cnt']) / (statvars_commit['sum'] or ttl) + tasklist_commit.sort(compare) + print '%d task(s) done odps comit, Total elapsed %d second(s), %.2f second(s) per task in average, tilt index: %.2f' % ( + statvars_commit['cnt'], + statvars_commit['sum'], float(statvars_commit['sum']) / statvars_commit['cnt'], + idx_commit * tasklist_commit[0]['cost']) + +# }}} # + +if (len(sys.argv) < 2): + print "Usage: %s filename" %(sys.argv[0]) + quit(1) +else: + parse_task(sys.argv[1]) + result_analyse() \ No newline at end of file diff --git a/core/src/main/conf/.secret.properties b/core/src/main/conf/.secret.properties new file mode 100755 index 000000000..3b295a175 --- /dev/null +++ b/core/src/main/conf/.secret.properties @@ -0,0 +1,3 @@ +#ds basicAuth config +auth.user= +auth.pass= \ No newline at end of file diff --git a/core/src/main/conf/core.json b/core/src/main/conf/core.json new file mode 100755 index 000000000..ce5d025ae --- /dev/null +++ b/core/src/main/conf/core.json @@ -0,0 +1,61 @@ + +{ + "entry": { + "jvm": "-Xms1G -Xmx1G", + "environment": {} + }, + "common": { + "column": { + "datetimeFormat": "yyyy-MM-dd HH:mm:ss", + "timeFormat": "HH:mm:ss", + "dateFormat": "yyyy-MM-dd", + "extraFormats":["yyyyMMdd"], + "timeZone": "GMT+8", + "encoding": "utf-8" + } + }, + "core": { + "dataXServer": { + "address": "http://localhost:7001/api", + "timeout": 10000, + "reportDataxLog": false, + "reportPerfLog": false + }, + "transport": { + "channel": { + "class": "com.alibaba.datax.core.transport.channel.memory.MemoryChannel", + "speed": { + "byte": 1048576, + "record": 10000 + }, + "flowControlInterval": 20, + "capacity": 512, + "byteCapacity": 67108864 + }, + "exchanger": { + "class": "com.alibaba.datax.core.plugin.BufferedRecordExchanger", + "bufferSize": 32 
+ } + }, + "container": { + "job": { + "reportInterval": 10000 + }, + "taskGroup": { + "channel": 5 + }, + "trace": { + "enable": "true" + } + + }, + "statistics": { + "collector": { + "plugin": { + "taskClass": "com.alibaba.datax.core.statistics.plugin.task.StdoutPluginCollector", + "maxDirtyNumber": 10 + } + } + } + } +} diff --git a/core/src/main/conf/logback.xml b/core/src/main/conf/logback.xml new file mode 100755 index 000000000..7a433ba97 --- /dev/null +++ b/core/src/main/conf/logback.xml @@ -0,0 +1,150 @@ + + + + + + + + + + UTF-8 + + %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n + + + + + + UTF-8 + ${log.dir}/${ymd}/${log.file.name}-${byMillionSecond}.log + false + + %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n + + + + + + UTF-8 + ${perf.dir}/${ymd}/${log.file.name}-${byMillionSecond}.log + false + + %msg%n + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/core/AbstractContainer.java b/core/src/main/java/com/alibaba/datax/core/AbstractContainer.java new file mode 100755 index 000000000..c4e09b757 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/AbstractContainer.java @@ -0,0 +1,35 @@ +package com.alibaba.datax.core; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import org.apache.commons.lang.Validate; + +/** + * 执行容器的抽象类,持有该容器全局的配置 configuration + */ +public abstract class AbstractContainer { + protected Configuration configuration; + + protected AbstractContainerCommunicator containerCommunicator; + + public AbstractContainer(Configuration configuration) { + Validate.notNull(configuration, "Configuration can not be null."); + + this.configuration = configuration; + } + + public Configuration getConfiguration() { + return configuration; + } + + public AbstractContainerCommunicator getContainerCommunicator() { + return containerCommunicator; + } + + public void setContainerCommunicator(AbstractContainerCommunicator containerCommunicator) { + this.containerCommunicator = containerCommunicator; + } + + public abstract void start(); + +} diff --git a/core/src/main/java/com/alibaba/datax/core/Engine.java b/core/src/main/java/com/alibaba/datax/core/Engine.java new file mode 100755 index 000000000..5f69a88fc --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/Engine.java @@ -0,0 +1,217 @@ +package com.alibaba.datax.core; + +import com.alibaba.datax.common.element.ColumnCast; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.spi.ErrorCode; +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.statistics.VMInfo; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.JobContainer; +import com.alibaba.datax.core.taskgroup.TaskGroupContainer; +import com.alibaba.datax.core.util.ConfigParser; +import com.alibaba.datax.core.util.ConfigurationValidate; +import com.alibaba.datax.core.util.ExceptionTracker; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.util.container.LoadUtil; +import org.apache.commons.cli.BasicParser; +import org.apache.commons.cli.CommandLine; +import org.apache.commons.cli.Options; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Arrays; +import 
java.util.List; +import java.util.Set; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * Engine是DataX入口类,该类负责初始化Job或者Task的运行容器,并运行插件的Job或者Task逻辑 + */ +public class Engine { + private static final Logger LOG = LoggerFactory.getLogger(Engine.class); + + private static String RUNTIME_MODE; + + /* check job model (job/task) first */ + public void start(Configuration allConf) { + + // 绑定column转换信息 + ColumnCast.bind(allConf); + + /** + * 初始化PluginLoader,可以获取各种插件配置 + */ + LoadUtil.bind(allConf); + + boolean isJob = !("taskGroup".equalsIgnoreCase(allConf + .getString(CoreConstant.DATAX_CORE_CONTAINER_MODEL))); + + AbstractContainer container; + long instanceId; + int taskGroupId = -1; + if (isJob) { + allConf.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_MODE, RUNTIME_MODE); + container = new JobContainer(allConf); + instanceId = allConf.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, 0); + + } else { + container = new TaskGroupContainer(allConf); + instanceId = allConf.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + taskGroupId = allConf.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + } + + //缺省打开perfTrace + boolean traceEnable = allConf.getBool(CoreConstant.DATAX_CORE_CONTAINER_TRACE_ENABLE, true); + boolean perfReportEnable = allConf.getBool(CoreConstant.DATAX_CORE_REPORT_DATAX_PERFLOG, true); + + int priority = 0; + try { + priority = Integer.parseInt(System.getenv("SKYNET_PRIORITY")); + }catch (NumberFormatException e){ + LOG.warn("prioriy set to 0, because NumberFormatException, the value is: "+System.getProperty("PROIORY")); + } + + Configuration jobInfoConfig = allConf.getConfiguration(CoreConstant.DATAX_JOB_JOBINFO); + //初始化PerfTrace + PerfTrace perfTrace = PerfTrace.getInstance(isJob, instanceId, taskGroupId, priority, traceEnable); + perfTrace.setJobInfo(jobInfoConfig); + perfTrace.setPerfReportEnalbe(perfReportEnable); + container.start(); + + + } + + + // 注意屏蔽敏感信息 + public static String filterJobConfiguration(final Configuration configuration) { + Configuration jobConfWithSetting = configuration.getConfiguration("job").clone(); + + Configuration jobContent = jobConfWithSetting.getConfiguration("content"); + + filterSensitiveConfiguration(jobContent); + + jobConfWithSetting.set("content",jobContent); + + return jobConfWithSetting.beautify(); + } + + public static Configuration filterSensitiveConfiguration(Configuration configuration){ + Set keys = configuration.getKeys(); + for (final String key : keys) { + boolean isSensitive = StringUtils.endsWithIgnoreCase(key, "password") + || StringUtils.endsWithIgnoreCase(key, "accessKey"); + if (isSensitive && configuration.get(key) instanceof String) { + configuration.set(key, configuration.getString(key).replaceAll(".", "*")); + } + } + return configuration; + } + + public static void entry(final String[] args) throws Throwable { + Options options = new Options(); + options.addOption("job", true, "Job config."); + options.addOption("jobid", true, "Job unique id."); + options.addOption("mode", true, "Job runtime mode."); + + BasicParser parser = new BasicParser(); + CommandLine cl = parser.parse(options, args); + + String jobPath = cl.getOptionValue("job"); + + // 如果用户没有明确指定jobid, 则 datax.py 会指定 jobid 默认值为-1 + String jobIdString = cl.getOptionValue("jobid"); + RUNTIME_MODE = cl.getOptionValue("mode"); + + Configuration configuration = ConfigParser.parse(jobPath); + + long jobId; + if (!"-1".equalsIgnoreCase(jobIdString)) { + jobId = Long.parseLong(jobIdString); + } else { + // only for dsc & ds & 
datax 3 update + String dscJobUrlPatternString = "/instance/(\\d{1,})/config.xml"; + String dsJobUrlPatternString = "/inner/job/(\\d{1,})/config"; + String dsTaskGroupUrlPatternString = "/inner/job/(\\d{1,})/taskGroup/"; + List patternStringList = Arrays.asList(dscJobUrlPatternString, + dsJobUrlPatternString, dsTaskGroupUrlPatternString); + jobId = parseJobIdFromUrl(patternStringList, jobPath); + } + + boolean isStandAloneMode = "standalone".equalsIgnoreCase(RUNTIME_MODE); + if (!isStandAloneMode && jobId == -1) { + // 如果不是 standalone 模式,那么 jobId 一定不能为-1 + throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, "非 standalone 模式必须在 URL 中提供有效的 jobId."); + } + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, jobId); + + //打印vmInfo + VMInfo vmInfo = VMInfo.getVmInfo(); + if (vmInfo != null) { + LOG.info(vmInfo.toString()); + } + + LOG.info("\n" + Engine.filterJobConfiguration(configuration) + "\n"); + + LOG.debug(configuration.toJSON()); + + ConfigurationValidate.doValidate(configuration); + Engine engine = new Engine(); + engine.start(configuration); + } + + + /** + * -1 表示未能解析到 jobId + * + * only for dsc & ds & datax 3 update + */ + private static long parseJobIdFromUrl(List patternStringList, String url) { + long result = -1; + for (String patternString : patternStringList) { + result = doParseJobIdFromUrl(patternString, url); + if (result != -1) { + return result; + } + } + return result; + } + + private static long doParseJobIdFromUrl(String patternString, String url) { + Pattern pattern = Pattern.compile(patternString); + Matcher matcher = pattern.matcher(url); + if (matcher.find()) { + return Long.parseLong(matcher.group(1)); + } + + return -1; + } + + public static void main(String[] args) throws Exception { + int exitCode = 0; + try { + Engine.entry(args); + } catch (Throwable e) { + exitCode = 1; + LOG.error("\n\n经DataX智能分析,该任务最可能的错误原因是:\n" + ExceptionTracker.trace(e)); + + if (e instanceof DataXException) { + DataXException tempException = (DataXException) e; + ErrorCode errorCode = tempException.getErrorCode(); + if (errorCode instanceof FrameworkErrorCode) { + FrameworkErrorCode tempErrorCode = (FrameworkErrorCode) errorCode; + exitCode = tempErrorCode.toExitValue(); + } + } + + System.exit(exitCode); + } + System.exit(exitCode); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/container/util/HookInvoker.java b/core/src/main/java/com/alibaba/datax/core/container/util/HookInvoker.java new file mode 100755 index 000000000..6e0ef1782 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/container/util/HookInvoker.java @@ -0,0 +1,91 @@ +package com.alibaba.datax.core.container.util; + +/** + * Created by xiafei.qiuxf on 14/12/17. 
+ */ + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.spi.Hook; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.JarLoader; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.io.FilenameFilter; +import java.util.HashMap; +import java.util.Iterator; +import java.util.Map; +import java.util.ServiceLoader; + +/** + * 扫描给定目录的所有一级子目录,每个子目录当作一个Hook的目录。 + * 对于每个子目录,必须符合ServiceLoader的标准目录格式,见http://docs.oracle.com/javase/6/docs/api/java/util/ServiceLoader.html。 + * 加载里头的jar,使用ServiceLoader机制调用。 + */ +public class HookInvoker { + + private static final Logger LOG = LoggerFactory.getLogger(HookInvoker.class); + private final Map msg; + private final Configuration conf; + + private File baseDir; + + public HookInvoker(String baseDirName, Configuration conf, Map msg) { + this.baseDir = new File(baseDirName); + this.conf = conf; + this.msg = msg; + } + + public void invokeAll() { + if (!baseDir.exists() || baseDir.isFile()) { + LOG.info("No hook invoked, because base dir not exists or is a file: " + baseDir.getAbsolutePath()); + return; + } + + String[] subDirs = baseDir.list(new FilenameFilter() { + @Override + public boolean accept(File dir, String name) { + return new File(dir, name).isDirectory(); + } + }); + + if (subDirs == null) { + throw DataXException.asDataXException(FrameworkErrorCode.HOOK_LOAD_ERROR, "获取HOOK子目录返回null"); + } + + for (String subDir : subDirs) { + doInvoke(new File(baseDir, subDir).getAbsolutePath()); + } + + } + + private void doInvoke(String path) { + ClassLoader oldClassLoader = Thread.currentThread().getContextClassLoader(); + try { + JarLoader jarLoader = new JarLoader(new String[]{path}); + Thread.currentThread().setContextClassLoader(jarLoader); + Iterator hookIt = ServiceLoader.load(Hook.class).iterator(); + if (!hookIt.hasNext()) { + LOG.warn("No hook defined under path: " + path); + } else { + Hook hook = hookIt.next(); + LOG.info("Invoke hook [{}], path: {}", hook.getName(), path); + hook.invoke(conf, msg); + } + } catch (Exception e) { + LOG.error("Exception when invoke hook", e); + throw DataXException.asDataXException( + CommonErrorCode.HOOK_INTERNAL_ERROR, "Exception when invoke hook", e); + } finally { + Thread.currentThread().setContextClassLoader(oldClassLoader); + } + } + + public static void main(String[] args) { + new HookInvoker("/Users/xiafei/workspace/datax3/target/datax/datax/hook", + null, new HashMap()).invokeAll(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/container/util/JobAssignUtil.java b/core/src/main/java/com/alibaba/datax/core/container/util/JobAssignUtil.java new file mode 100755 index 000000000..31ba60a4d --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/container/util/JobAssignUtil.java @@ -0,0 +1,177 @@ +package com.alibaba.datax.core.container.util; + +import com.alibaba.datax.common.constant.CommonConstant; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.Validate; +import org.apache.commons.lang3.StringUtils; + +import java.util.*; + +public final class JobAssignUtil { + private JobAssignUtil() { + } + + /** + * 公平的分配 task 到对应的 taskGroup 中。 + * 公平体现在:会考虑 task 中对资源负载作的 load 标识进行更均衡的作业分配操作。 + * TODO 具体文档举例说明 + */ + public static List assignFairly(Configuration 
configuration, int channelNumber, int channelsPerTaskGroup) { + Validate.isTrue(configuration != null, "框架获得的 Job 不能为 null."); + + List contentConfig = configuration.getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + Validate.isTrue(contentConfig.size() > 0, "框架获得的切分后的 Job 无内容."); + + Validate.isTrue(channelNumber > 0 && channelsPerTaskGroup > 0, + "每个channel的平均task数[averTaskPerChannel],channel数目[channelNumber],每个taskGroup的平均channel数[channelsPerTaskGroup]都应该为正数"); + + int taskGroupNumber = (int) Math.ceil(1.0 * channelNumber / channelsPerTaskGroup); + + Configuration aTaskConfig = contentConfig.get(0); + + String readerResourceMark = aTaskConfig.getString(CoreConstant.JOB_READER_PARAMETER + "." + + CommonConstant.LOAD_BALANCE_RESOURCE_MARK); + String writerResourceMark = aTaskConfig.getString(CoreConstant.JOB_WRITER_PARAMETER + "." + + CommonConstant.LOAD_BALANCE_RESOURCE_MARK); + + boolean hasLoadBalanceResourceMark = StringUtils.isNotBlank(readerResourceMark) || + StringUtils.isNotBlank(writerResourceMark); + + if (!hasLoadBalanceResourceMark) { + // fake 一个固定的 key 作为资源标识(在 reader 或者 writer 上均可,此处选择在 reader 上进行 fake) + for (Configuration conf : contentConfig) { + conf.set(CoreConstant.JOB_READER_PARAMETER + "." + + CommonConstant.LOAD_BALANCE_RESOURCE_MARK, "aFakeResourceMarkForLoadBalance"); + } + // 是为了避免某些插件没有设置 资源标识 而进行了一次随机打乱操作 + Collections.shuffle(contentConfig, new Random(System.currentTimeMillis())); + } + + LinkedHashMap> resourceMarkAndTaskIdMap = parseAndGetResourceMarkAndTaskIdMap(contentConfig); + List taskGroupConfig = doAssign(resourceMarkAndTaskIdMap, configuration, taskGroupNumber); + + // 调整 每个 taskGroup 对应的 Channel 个数(属于优化范畴) + adjustChannelNumPerTaskGroup(taskGroupConfig, channelNumber); + return taskGroupConfig; + } + + private static void adjustChannelNumPerTaskGroup(List taskGroupConfig, int channelNumber) { + int taskGroupNumber = taskGroupConfig.size(); + int avgChannelsPerTaskGroup = channelNumber / taskGroupNumber; + int remainderChannelCount = channelNumber % taskGroupNumber; + // 表示有 remainderChannelCount 个 taskGroup,其对应 Channel 个数应该为:avgChannelsPerTaskGroup + 1; + // (taskGroupNumber - remainderChannelCount)个 taskGroup,其对应 Channel 个数应该为:avgChannelsPerTaskGroup + + int i = 0; + for (; i < remainderChannelCount; i++) { + taskGroupConfig.get(i).set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, avgChannelsPerTaskGroup + 1); + } + + for (int j = 0; j < taskGroupNumber - remainderChannelCount; j++) { + taskGroupConfig.get(i + j).set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, avgChannelsPerTaskGroup); + } + } + + /** + * 根据task 配置,获取到: + * 资源名称 --> taskId(List) 的 map 映射关系 + */ + private static LinkedHashMap> parseAndGetResourceMarkAndTaskIdMap(List contentConfig) { + // key: resourceMark, value: taskId + LinkedHashMap> readerResourceMarkAndTaskIdMap = new LinkedHashMap>(); + LinkedHashMap> writerResourceMarkAndTaskIdMap = new LinkedHashMap>(); + + for (Configuration aTaskConfig : contentConfig) { + int taskId = aTaskConfig.getInt(CoreConstant.TASK_ID); + // 把 readerResourceMark 加到 readerResourceMarkAndTaskIdMap 中 + String readerResourceMark = aTaskConfig.getString(CoreConstant.JOB_READER_PARAMETER + "." 
+ CommonConstant.LOAD_BALANCE_RESOURCE_MARK); + if (readerResourceMarkAndTaskIdMap.get(readerResourceMark) == null) { + readerResourceMarkAndTaskIdMap.put(readerResourceMark, new LinkedList()); + } + readerResourceMarkAndTaskIdMap.get(readerResourceMark).add(taskId); + + // 把 writerResourceMark 加到 writerResourceMarkAndTaskIdMap 中 + String writerResourceMark = aTaskConfig.getString(CoreConstant.JOB_WRITER_PARAMETER + "." + CommonConstant.LOAD_BALANCE_RESOURCE_MARK); + if (writerResourceMarkAndTaskIdMap.get(writerResourceMark) == null) { + writerResourceMarkAndTaskIdMap.put(writerResourceMark, new LinkedList()); + } + writerResourceMarkAndTaskIdMap.get(writerResourceMark).add(taskId); + } + + if (readerResourceMarkAndTaskIdMap.size() >= writerResourceMarkAndTaskIdMap.size()) { + // 采用 reader 对资源做的标记进行 shuffle + return readerResourceMarkAndTaskIdMap; + } else { + // 采用 writer 对资源做的标记进行 shuffle + return writerResourceMarkAndTaskIdMap; + } + } + + + /** + * /** + * 需要实现的效果通过例子来说是: + *
+     * a 库上有表:0, 1, 2
+     * b 库上有表:3, 4
+     * c 库上有表:5, 6, 7
+     *
+     * 如果有 4个 taskGroup
+     * 则 assign 后的结果为:
+     * taskGroup-0: 0,  4,
+     * taskGroup-1: 3,  6,
+     * taskGroup-2: 5,  2,
+     * taskGroup-3: 1,  7
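+     *
+     * (发牌顺序:a→0, b→3, c→5, a→1, b→4, c→6, a→2, c→7,按 taskGroup-0..3 轮询落位,即得上述结果)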
+     *
+     * 
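+     * (若 reader/writer 均未设置 LOAD_BALANCE_RESOURCE_MARK,assignFairly 会先为所有 task 填充同一个 fake 资源标记并随机打乱顺序,再按上述方式发牌)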
+ */ + private static List doAssign(LinkedHashMap> resourceMarkAndTaskIdMap, Configuration jobConfiguration, int taskGroupNumber) { + List contentConfig = jobConfiguration.getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + + Configuration taskGroupTemplate = jobConfiguration.clone(); + taskGroupTemplate.remove(CoreConstant.DATAX_JOB_CONTENT); + + List result = new LinkedList(); + + List> taskGroupConfigList = new ArrayList>(taskGroupNumber); + for (int i = 0; i < taskGroupNumber; i++) { + taskGroupConfigList.add(new LinkedList()); + } + + int mapValueMaxLength = -1; + + List resourceMarks = new ArrayList(); + for (Map.Entry> entry : resourceMarkAndTaskIdMap.entrySet()) { + resourceMarks.add(entry.getKey()); + if (entry.getValue().size() > mapValueMaxLength) { + mapValueMaxLength = entry.getValue().size(); + } + } + + int taskGroupIndex = 0; + for (int i = 0; i < mapValueMaxLength; i++) { + for (String resourceMark : resourceMarks) { + if (resourceMarkAndTaskIdMap.get(resourceMark).size() > 0) { + int taskId = resourceMarkAndTaskIdMap.get(resourceMark).get(0); + taskGroupConfigList.get(taskGroupIndex % taskGroupNumber).add(contentConfig.get(taskId)); + taskGroupIndex++; + + resourceMarkAndTaskIdMap.get(resourceMark).remove(0); + } + } + } + + Configuration tempTaskGroupConfig; + for (int i = 0; i < taskGroupNumber; i++) { + tempTaskGroupConfig = taskGroupTemplate.clone(); + tempTaskGroupConfig.set(CoreConstant.DATAX_JOB_CONTENT, taskGroupConfigList.get(i)); + tempTaskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, i); + + result.add(tempTaskGroupConfig); + } + + return result; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/JobContainer.java b/core/src/main/java/com/alibaba/datax/core/job/JobContainer.java new file mode 100755 index 000000000..494509b22 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/JobContainer.java @@ -0,0 +1,923 @@ +package com.alibaba.datax.core.job; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.AbstractJobPlugin; +import com.alibaba.datax.common.plugin.JobPluginCollector; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.statistics.VMInfo; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.StrUtil; +import com.alibaba.datax.core.AbstractContainer; +import com.alibaba.datax.core.Engine; +import com.alibaba.datax.core.container.util.HookInvoker; +import com.alibaba.datax.core.container.util.JobAssignUtil; +import com.alibaba.datax.core.job.meta.ExecuteMode; +import com.alibaba.datax.core.job.scheduler.AbstractScheduler; +import com.alibaba.datax.core.job.scheduler.processinner.StandAloneScheduler; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.statistics.container.communicator.job.StandAloneJobContainerCommunicator; +import com.alibaba.datax.core.statistics.plugin.DefaultJobPluginCollector; +import com.alibaba.datax.core.util.ErrorRecordChecker; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.ClassLoaderSwapper; +import 
com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.util.container.LoadUtil; +import org.apache.commons.lang.StringUtils; +import org.apache.commons.lang.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.List; + +/** + * Created by jingxing on 14-8-24. + *

+ * job实例运行在jobContainer容器中,它是所有任务的master,负责初始化、拆分、调度、运行、回收、监控和汇报 + * 但它并不做实际的数据同步操作 + */ +public class JobContainer extends AbstractContainer { + private static final Logger LOG = LoggerFactory + .getLogger(JobContainer.class); + + private static final SimpleDateFormat dateFormat = new SimpleDateFormat( + "yyyy-MM-dd HH:mm:ss"); + + private ClassLoaderSwapper classLoaderSwapper = ClassLoaderSwapper + .newCurrentThreadClassLoaderSwapper(); + + private long jobId; + + private String readerPluginName; + + private String writerPluginName; + + /** + * reader和writer jobContainer的实例 + */ + private Reader.Job jobReader; + + private Writer.Job jobWriter; + + private Configuration userConf; + + private long startTimeStamp; + + private long endTimeStamp; + + private long startTransferTimeStamp; + + private long endTransferTimeStamp; + + private int needChannelNumber; + + private int totalStage = 1; + + private ErrorRecordChecker errorLimit; + + public JobContainer(Configuration configuration) { + super(configuration); + + errorLimit = new ErrorRecordChecker(configuration); + } + + /** + * jobContainer主要负责的工作全部在start()里面,包括init、prepare、split、scheduler、 + * post以及destroy和statistics + */ + @Override + public void start() { + LOG.info("DataX jobContainer starts job."); + + boolean hasException = false; + boolean isDryRun = false; + try { + this.startTimeStamp = System.currentTimeMillis(); + isDryRun = configuration.getBool(CoreConstant.DATAX_JOB_SETTING_DRYRUN, false); + if(isDryRun) { + LOG.info("jobContainer starts to do preCheck ..."); + this.preCheck(); + } else { + userConf = configuration.clone(); + LOG.debug("jobContainer starts to do preHandle ..."); + this.preHandle(); + + LOG.debug("jobContainer starts to do init ..."); + this.init(); + LOG.debug("jobContainer starts to do prepare ..."); + this.prepare(); + LOG.debug("jobContainer starts to do split ..."); + this.totalStage = this.split(); + LOG.debug("jobContainer starts to do schedule ..."); + this.schedule(); + LOG.debug("jobContainer starts to do post ..."); + this.post(); + + LOG.debug("jobContainer starts to do postHandle ..."); + this.postHandle(); + LOG.info("DataX jobId [{}] completed successfully.", this.jobId); + + this.invokeHooks(); + } + } catch (Throwable e) { + LOG.error("Exception when job run", e); + + hasException = true; + + if (e instanceof OutOfMemoryError) { + this.destroy(); + System.gc(); + } + + + if (super.getContainerCommunicator() == null) { + // 由于 containerCollector 是在 scheduler() 中初始化的,所以当在 scheduler() 之前出现异常时,需要在此处对 containerCollector 进行初始化 + + AbstractContainerCommunicator tempContainerCollector; + // standalone + tempContainerCollector = new StandAloneJobContainerCommunicator(configuration); + + super.setContainerCommunicator(tempContainerCollector); + } + + Communication communication = super.getContainerCommunicator().collect(); + // 汇报前的状态,不需要手动进行设置 + // communication.setState(State.FAILED); + communication.setThrowable(e); + communication.setTimestamp(this.endTimeStamp); + + Communication tempComm = new Communication(); + tempComm.setTimestamp(this.startTransferTimeStamp); + + Communication reportCommunication = CommunicationTool.getReportCommunication(communication, tempComm, this.totalStage); + super.getContainerCommunicator().report(reportCommunication); + + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } finally { + if(!isDryRun) { + + this.destroy(); + this.endTimeStamp = System.currentTimeMillis(); + if (!hasException) { + //最后打印cpu的平均消耗,GC的统计 + VMInfo vmInfo = 
VMInfo.getVmInfo(); + if (vmInfo != null) { + vmInfo.getDelta(false); + LOG.info(vmInfo.totalString()); + } + + LOG.info(PerfTrace.getInstance().summarizeNoException()); + this.logStatistics(); + } + } + } + } + + private void preCheck() { + this.preCheckInit(); + this.adjustChannelNumber(); + + if (this.needChannelNumber <= 0) { + this.needChannelNumber = 1; + } + this.preCheckReader(); + this.preCheckWriter(); + LOG.info("PreCheck通过"); + } + + private void preCheckInit() { + this.jobId = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, -1); + + if (this.jobId < 0) { + LOG.info("Set jobId = 0"); + this.jobId = 0; + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, + this.jobId); + } + + Thread.currentThread().setName("job-" + this.jobId); + + JobPluginCollector jobPluginCollector = new DefaultJobPluginCollector( + this.getContainerCommunicator()); + this.jobReader = this.preCheckReaderInit(jobPluginCollector); + this.jobWriter = this.preCheckWriterInit(jobPluginCollector); + } + + private Reader.Job preCheckReaderInit(JobPluginCollector jobPluginCollector) { + this.readerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_READER_NAME); + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + + Reader.Job jobReader = (Reader.Job) LoadUtil.loadJobPlugin( + PluginType.READER, this.readerPluginName); + + this.configuration.set(CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER + ".dryRun", true); + + // 设置reader的jobConfig + jobReader.setPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + // 设置reader的readerConfig + jobReader.setPeerPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + + jobReader.setJobPluginCollector(jobPluginCollector); + + classLoaderSwapper.restoreCurrentThreadClassLoader(); + return jobReader; + } + + + private Writer.Job preCheckWriterInit(JobPluginCollector jobPluginCollector) { + this.writerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_WRITER_NAME); + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + + Writer.Job jobWriter = (Writer.Job) LoadUtil.loadJobPlugin( + PluginType.WRITER, this.writerPluginName); + + this.configuration.set(CoreConstant.DATAX_JOB_CONTENT_WRITER_PARAMETER + ".dryRun", true); + + // 设置writer的jobConfig + jobWriter.setPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_WRITER_PARAMETER)); + // 设置reader的readerConfig + jobWriter.setPeerPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + + jobWriter.setPeerPluginName(this.readerPluginName); + jobWriter.setJobPluginCollector(jobPluginCollector); + + classLoaderSwapper.restoreCurrentThreadClassLoader(); + + return jobWriter; + } + + private void preCheckReader() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + LOG.info(String.format("DataX Reader.Job [%s] do preCheck work .", + this.readerPluginName)); + this.jobReader.preCheck(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + private void preCheckWriter() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + LOG.info(String.format("DataX Writer.Job [%s] do preCheck work .", + 
this.writerPluginName)); + this.jobWriter.preCheck(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + /** + * reader和writer的初始化 + */ + private void init() { + this.jobId = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, -1); + + if (this.jobId < 0) { + LOG.info("Set jobId = 0"); + this.jobId = 0; + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, + this.jobId); + } + + Thread.currentThread().setName("job-" + this.jobId); + + JobPluginCollector jobPluginCollector = new DefaultJobPluginCollector( + this.getContainerCommunicator()); + //必须先Reader ,后Writer + this.jobReader = this.initJobReader(jobPluginCollector); + this.jobWriter = this.initJobWriter(jobPluginCollector); + } + + private void prepare() { + this.prepareJobReader(); + this.prepareJobWriter(); + } + + private void preHandle() { + String handlerPluginTypeStr = this.configuration.getString( + CoreConstant.DATAX_JOB_PREHANDLER_PLUGINTYPE); + if(!StringUtils.isNotEmpty(handlerPluginTypeStr)){ + return; + } + PluginType handlerPluginType; + try { + handlerPluginType = PluginType.valueOf(handlerPluginTypeStr.toUpperCase()); + } catch (IllegalArgumentException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, + String.format("Job preHandler's pluginType(%s) set error, reason(%s)", handlerPluginTypeStr.toUpperCase(), e.getMessage())); + } + + String handlerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_PREHANDLER_PLUGINNAME); + + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + handlerPluginType, handlerPluginName)); + + AbstractJobPlugin handler = LoadUtil.loadJobPlugin( + handlerPluginType, handlerPluginName); + + JobPluginCollector jobPluginCollector = new DefaultJobPluginCollector( + this.getContainerCommunicator()); + handler.setJobPluginCollector(jobPluginCollector); + + //todo configuration的安全性,将来必须保证 + handler.preHandler(configuration); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + + LOG.info("After PreHandler: \n" + Engine.filterJobConfiguration(configuration) + "\n"); + } + + private void postHandle() { + String handlerPluginTypeStr = this.configuration.getString( + CoreConstant.DATAX_JOB_POSTHANDLER_PLUGINTYPE); + + if(!StringUtils.isNotEmpty(handlerPluginTypeStr)){ + return; + } + PluginType handlerPluginType; + try { + handlerPluginType = PluginType.valueOf(handlerPluginTypeStr.toUpperCase()); + } catch (IllegalArgumentException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, + String.format("Job postHandler's pluginType(%s) set error, reason(%s)", handlerPluginTypeStr.toUpperCase(), e.getMessage())); + } + + String handlerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_POSTHANDLER_PLUGINNAME); + + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + handlerPluginType, handlerPluginName)); + + AbstractJobPlugin handler = LoadUtil.loadJobPlugin( + handlerPluginType, handlerPluginName); + + JobPluginCollector jobPluginCollector = new DefaultJobPluginCollector( + this.getContainerCommunicator()); + handler.setJobPluginCollector(jobPluginCollector); + + handler.postHandler(configuration); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + + /** + * 执行reader和writer最细粒度的切分,需要注意的是,writer的切分结果要参照reader的切分结果, + * 达到切分后数目相等,才能满足1:1的通道模型,所以这里可以将reader和writer的配置整合到一起, + * 然后,为避免顺序给读写端带来长尾影响,将整合的结果shuffler掉 + */ + private int split() { + this.adjustChannelNumber(); + + if (this.needChannelNumber <= 0) { + 
this.needChannelNumber = 1; + } + + List readerTaskConfigs = this + .doReaderSplit(this.needChannelNumber); + int taskNumber = readerTaskConfigs.size(); + List writerTaskConfigs = this + .doWriterSplit(taskNumber); + + /** + * 输入是reader和writer的parameter list,输出是content下面元素的list + */ + List contentConfig = mergeReaderAndWriterTaskConfigs( + readerTaskConfigs, writerTaskConfigs); + + this.configuration.set(CoreConstant.DATAX_JOB_CONTENT, contentConfig); + + return contentConfig.size(); + } + + private void adjustChannelNumber() { + int needChannelNumberByByte = Integer.MAX_VALUE; + int needChannelNumberByRecord = Integer.MAX_VALUE; + + boolean isByteLimit = (this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_BYTE, 0) > 0); + if (isByteLimit) { + long globalLimitedByteSpeed = this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_BYTE, 10 * 1024 * 1024); + + // 在byte流控情况下,单个Channel流量最大值必须设置,否则报错! + Long channelLimitedByteSpeed = this.configuration + .getLong(CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_SPEED_BYTE); + if (channelLimitedByteSpeed == null || channelLimitedByteSpeed <= 0) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, + "在有总bps限速条件下,单个channel的bps值不能为空,也不能为非正数"); + } + + needChannelNumberByByte = + (int) (globalLimitedByteSpeed / channelLimitedByteSpeed); + needChannelNumberByByte = + needChannelNumberByByte > 0 ? needChannelNumberByByte : 1; + LOG.info("Job set Max-Byte-Speed to " + globalLimitedByteSpeed + " bytes."); + } + + boolean isRecordLimit = (this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_RECORD, 0)) > 0; + if (isRecordLimit) { + long globalLimitedRecordSpeed = this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_RECORD, 100000); + + Long channelLimitedRecordSpeed = this.configuration.getLong( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_SPEED_RECORD); + if (channelLimitedRecordSpeed == null || channelLimitedRecordSpeed <= 0) { + throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, + "在有总tps限速条件下,单个channel的tps值不能为空,也不能为非正数"); + } + + needChannelNumberByRecord = + (int) (globalLimitedRecordSpeed / channelLimitedRecordSpeed); + needChannelNumberByRecord = + needChannelNumberByRecord > 0 ? needChannelNumberByRecord : 1; + LOG.info("Job set Max-Record-Speed to " + globalLimitedRecordSpeed + " records."); + } + + // 取较小值 + this.needChannelNumber = needChannelNumberByByte < needChannelNumberByRecord ? 
+ needChannelNumberByByte : needChannelNumberByRecord; + + // 如果从byte或record上设置了needChannelNumber则退出 + if (this.needChannelNumber < Integer.MAX_VALUE) { + return; + } + + boolean isChannelLimit = (this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_CHANNEL, 0) > 0); + if (isChannelLimit) { + this.needChannelNumber = this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_CHANNEL); + + LOG.info("Job set Channel-Number to " + this.needChannelNumber + + " channels."); + + return; + } + + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, + "Job运行速度必须设置"); + } + + /** + * schedule首先完成的工作是把上一步reader和writer split的结果整合到具体taskGroupContainer中, + * 同时不同的执行模式调用不同的调度策略,将所有任务调度起来 + */ + private void schedule() { + /** + * 这里的全局speed和每个channel的速度设置为B/s + */ + int channelsPerTaskGroup = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, 5); + int taskNumber = this.configuration.getList( + CoreConstant.DATAX_JOB_CONTENT).size(); + + this.needChannelNumber = Math.min(this.needChannelNumber, taskNumber); + + /** + * 通过获取配置信息得到每个taskGroup需要运行哪些tasks任务 + */ + + List taskGroupConfigs = JobAssignUtil.assignFairly(this.configuration, + this.needChannelNumber, channelsPerTaskGroup); + + LOG.info("Scheduler starts [{}] taskGroups.", taskGroupConfigs.size()); + + ExecuteMode executeMode = null; + AbstractScheduler scheduler; + try { + executeMode = ExecuteMode.STANDALONE; + scheduler = initStandaloneScheduler(this.configuration); + + //设置 executeMode + for (Configuration taskGroupConfig : taskGroupConfigs) { + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_MODE, executeMode.getValue()); + } + + LOG.info("Running by {} Mode.", executeMode); + + this.startTransferTimeStamp = System.currentTimeMillis(); + + scheduler.schedule(taskGroupConfigs); + + this.endTransferTimeStamp = System.currentTimeMillis(); + } catch (Exception e) { + LOG.error("运行scheduler 模式[{}]出错.", executeMode); + this.endTransferTimeStamp = System.currentTimeMillis(); + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } + + /** + * 检查任务执行情况 + */ + this.checkLimit(); + } + + private AbstractScheduler initStandaloneScheduler(Configuration configuration) { + AbstractContainerCommunicator containerCommunicator = new StandAloneJobContainerCommunicator(configuration); + super.setContainerCommunicator(containerCommunicator); + + return new StandAloneScheduler(containerCommunicator); + } + + private void post() { + this.postJobWriter(); + this.postJobReader(); + } + + private void destroy() { + if (this.jobWriter != null) { + this.jobWriter.destroy(); + this.jobWriter = null; + } + if (this.jobReader != null) { + this.jobReader.destroy(); + this.jobReader = null; + } + } + + private void logStatistics() { + long totalCosts = (this.endTimeStamp - this.startTimeStamp) / 1000; + long transferCosts = (this.endTransferTimeStamp - this.startTransferTimeStamp) / 1000; + if (0L == transferCosts) { + transferCosts = 1L; + } + + if (super.getContainerCommunicator() == null) { + return; + } + + Communication communication = super.getContainerCommunicator().collect(); + communication.setTimestamp(this.endTimeStamp); + + Communication tempComm = new Communication(); + tempComm.setTimestamp(this.startTransferTimeStamp); + + Communication reportCommunication = CommunicationTool.getReportCommunication(communication, tempComm, this.totalStage); + + // 字节速率 + long byteSpeedPerSecond = communication.getLongCounter(CommunicationTool.READ_SUCCEED_BYTES) + / 
transferCosts; + + long recordSpeedPerSecond = communication.getLongCounter(CommunicationTool.READ_SUCCEED_RECORDS) + / transferCosts; + + reportCommunication.setLongCounter(CommunicationTool.BYTE_SPEED, byteSpeedPerSecond); + reportCommunication.setLongCounter(CommunicationTool.RECORD_SPEED, recordSpeedPerSecond); + + super.getContainerCommunicator().report(reportCommunication); + + LOG.info(String.format( + "\n" + "%-26s: %-18s\n" + "%-26s: %-18s\n" + "%-26s: %19s\n" + + "%-26s: %19s\n" + "%-26s: %19s\n" + "%-26s: %19s\n" + + "%-26s: %19s\n", + "任务启动时刻", + dateFormat.format(startTimeStamp), + + "任务结束时刻", + dateFormat.format(endTimeStamp), + + "任务总计耗时", + String.valueOf(totalCosts) + "s", + "任务平均流量", + StrUtil.stringify(byteSpeedPerSecond) + + "/s", + "记录写入速度", + String.valueOf(recordSpeedPerSecond) + + "rec/s", "读出记录总数", + String.valueOf(CommunicationTool.getTotalReadRecords(communication)), + "读写失败总数", + String.valueOf(CommunicationTool.getTotalErrorRecords(communication)) + )); + + + } + + /** + * reader job的初始化,返回Reader.Job + */ + private Reader.Job initJobReader( + JobPluginCollector jobPluginCollector) { + this.readerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_READER_NAME); + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + + Reader.Job jobReader = (Reader.Job) LoadUtil.loadJobPlugin( + PluginType.READER, this.readerPluginName); + + // 设置reader的jobConfig + jobReader.setPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + + // 设置reader的readerConfig + jobReader.setPeerPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_WRITER_PARAMETER)); + + jobReader.setJobPluginCollector(jobPluginCollector); + jobReader.init(); + + classLoaderSwapper.restoreCurrentThreadClassLoader(); + return jobReader; + } + + /** + * writer job的初始化,返回Writer.Job + */ + private Writer.Job initJobWriter( + JobPluginCollector jobPluginCollector) { + this.writerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_WRITER_NAME); + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + + Writer.Job jobWriter = (Writer.Job) LoadUtil.loadJobPlugin( + PluginType.WRITER, this.writerPluginName); + + // 设置writer的jobConfig + jobWriter.setPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_WRITER_PARAMETER)); + + // 设置reader的readerConfig + jobWriter.setPeerPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + + jobWriter.setPeerPluginName(this.readerPluginName); + jobWriter.setJobPluginCollector(jobPluginCollector); + jobWriter.init(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + + return jobWriter; + } + + private void prepareJobReader() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + LOG.info(String.format("DataX Reader.Job [%s] do prepare work .", + this.readerPluginName)); + this.jobReader.prepare(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + private void prepareJobWriter() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + LOG.info(String.format("DataX Writer.Job [%s] do prepare work .", + this.writerPluginName)); + this.jobWriter.prepare(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + 
} + + // TODO: 如果源头就是空数据 + private List doReaderSplit(int adviceNumber) { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + List readerSlicesConfigs = + this.jobReader.split(adviceNumber); + if (readerSlicesConfigs == null || readerSlicesConfigs.size() <= 0) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_SPLIT_ERROR, + "reader切分的task数目不能小于等于0"); + } + LOG.info("DataX Reader.Job [{}] splits to [{}] tasks.", + this.readerPluginName, readerSlicesConfigs.size()); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + return readerSlicesConfigs; + } + + private List doWriterSplit(int readerTaskNumber) { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + + List writerSlicesConfigs = this.jobWriter + .split(readerTaskNumber); + if (writerSlicesConfigs == null || writerSlicesConfigs.size() <= 0) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_SPLIT_ERROR, + "writer切分的task不能小于等于0"); + } + LOG.info("DataX Writer.Job [{}] splits to [{}] tasks.", + this.writerPluginName, writerSlicesConfigs.size()); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + + return writerSlicesConfigs; + } + + /** + * 按顺序整合reader和writer的配置,这里的顺序不能乱! 输入是reader、writer级别的配置,输出是一个完整task的配置 + */ + private List mergeReaderAndWriterTaskConfigs( + List readerTasksConfigs, + List writerTasksConfigs) { + if (readerTasksConfigs.size() != writerTasksConfigs.size()) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_SPLIT_ERROR, + String.format("reader切分的task数目[%d]不等于writer切分的task数目[%d].", + readerTasksConfigs.size(), writerTasksConfigs.size()) + ); + } + + List contentConfigs = new ArrayList(); + for (int i = 0; i < readerTasksConfigs.size(); i++) { + Configuration taskConfig = Configuration.newDefault(); + taskConfig.set(CoreConstant.JOB_READER_NAME, + this.readerPluginName); + taskConfig.set(CoreConstant.JOB_READER_PARAMETER, + readerTasksConfigs.get(i)); + taskConfig.set(CoreConstant.JOB_WRITER_NAME, + this.writerPluginName); + taskConfig.set(CoreConstant.JOB_WRITER_PARAMETER, + writerTasksConfigs.get(i)); + taskConfig.set(CoreConstant.TASK_ID, i); + contentConfigs.add(taskConfig); + } + + return contentConfigs; + } + + /** + * 这里比较复杂,分两步整合 1. tasks到channel 2. channel到taskGroup + * 合起来考虑,其实就是把tasks整合到taskGroup中,需要满足计算出的channel数,同时不能多起channel + *

+ * example: + *

+ * 前提条件: 切分后是1024个分表,假设用户要求总速率是1000M/s,每个channel的速率的3M/s, + * 每个taskGroup负责运行7个channel + *

+ * 计算: 总channel数为:1000M/s / 3M/s = + * 333个,为平均分配,计算可知有308个每个channel有3个tasks,而有25个每个channel有4个tasks, + * 需要的taskGroup数为:333 / 7 = + * 47...4,也就是需要48个taskGroup,其中47个各负责7个channel,另有1个负责剩余的4个channel + *

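+ * (验算:47 × 7 + 4 = 333,即 47 个 taskGroup 各负责 7 个 channel,剩余 4 个 channel 由另一个 taskGroup 负责,共 48 个)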
+ * 处理:我们先将这负责4个channel的taskGroup处理掉,逻辑是: + * 先按平均为3个tasks找4个channel,设置taskGroupId为0, + * 接下来就像发牌一样轮询分配task到剩下的包含平均channel数的taskGroup中 + *

+ * TODO delete it + * + * @param averTaskPerChannel + * @param channelNumber + * @param channelsPerTaskGroup + * @return 每个taskGroup独立的全部配置 + */ + @SuppressWarnings("serial") + private List distributeTasksToTaskGroup( + int averTaskPerChannel, int channelNumber, + int channelsPerTaskGroup) { + Validate.isTrue(averTaskPerChannel > 0 && channelNumber > 0 + && channelsPerTaskGroup > 0, + "每个channel的平均task数[averTaskPerChannel],channel数目[channelNumber],每个taskGroup的平均channel数[channelsPerTaskGroup]都应该为正数"); + List taskConfigs = this.configuration + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + int taskGroupNumber = channelNumber / channelsPerTaskGroup; + int leftChannelNumber = channelNumber % channelsPerTaskGroup; + if (leftChannelNumber > 0) { + taskGroupNumber += 1; + } + + /** + * 如果只有一个taskGroup,直接打标返回 + */ + if (taskGroupNumber == 1) { + final Configuration taskGroupConfig = this.configuration.clone(); + /** + * configure的clone不能clone出 + */ + taskGroupConfig.set(CoreConstant.DATAX_JOB_CONTENT, this.configuration + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT)); + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, + channelNumber); + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, 0); + return new ArrayList() { + { + add(taskGroupConfig); + } + }; + } + + List taskGroupConfigs = new ArrayList(); + /** + * 将每个taskGroup中content的配置清空 + */ + for (int i = 0; i < taskGroupNumber; i++) { + Configuration taskGroupConfig = this.configuration.clone(); + List taskGroupJobContent = taskGroupConfig + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + taskGroupJobContent.clear(); + taskGroupConfig.set(CoreConstant.DATAX_JOB_CONTENT, taskGroupJobContent); + + taskGroupConfigs.add(taskGroupConfig); + } + + int taskConfigIndex = 0; + int channelIndex = 0; + int taskGroupConfigIndex = 0; + + /** + * 先处理掉taskGroup包含channel数不是平均值的taskGroup + */ + if (leftChannelNumber > 0) { + Configuration taskGroupConfig = taskGroupConfigs.get(taskGroupConfigIndex); + for (; channelIndex < leftChannelNumber; channelIndex++) { + for (int i = 0; i < averTaskPerChannel; i++) { + List taskGroupJobContent = taskGroupConfig + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + taskGroupJobContent.add(taskConfigs.get(taskConfigIndex++)); + taskGroupConfig.set(CoreConstant.DATAX_JOB_CONTENT, + taskGroupJobContent); + } + } + + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, + leftChannelNumber); + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, + taskGroupConfigIndex++); + } + + /** + * 下面需要轮询分配,并打上channel数和taskGroupId标记 + */ + int equalDivisionStartIndex = taskGroupConfigIndex; + for (; taskConfigIndex < taskConfigs.size() + && equalDivisionStartIndex < taskGroupConfigs.size(); ) { + for (taskGroupConfigIndex = equalDivisionStartIndex; taskGroupConfigIndex < taskGroupConfigs + .size() && taskConfigIndex < taskConfigs.size(); taskGroupConfigIndex++) { + Configuration taskGroupConfig = taskGroupConfigs.get(taskGroupConfigIndex); + List taskGroupJobContent = taskGroupConfig + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + taskGroupJobContent.add(taskConfigs.get(taskConfigIndex++)); + taskGroupConfig.set( + CoreConstant.DATAX_JOB_CONTENT, taskGroupJobContent); + } + } + + for (taskGroupConfigIndex = equalDivisionStartIndex; + taskGroupConfigIndex < taskGroupConfigs.size(); ) { + Configuration taskGroupConfig = taskGroupConfigs.get(taskGroupConfigIndex); + 
taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, + channelsPerTaskGroup); + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, + taskGroupConfigIndex++); + } + + return taskGroupConfigs; + } + + private void postJobReader() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + LOG.info("DataX Reader.Job [{}] do post work.", + this.readerPluginName); + this.jobReader.post(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + private void postJobWriter() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + LOG.info("DataX Writer.Job [{}] do post work.", + this.writerPluginName); + this.jobWriter.post(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + /** + * 检查最终结果是否超出阈值,如果阈值设定小于1,则表示百分数阈值,大于1表示条数阈值。 + * + * @param + */ + private void checkLimit() { + Communication communication = super.getContainerCommunicator().collect(); + errorLimit.checkRecordLimit(communication); + errorLimit.checkPercentageLimit(communication); + } + + /** + * 调用外部hook + */ + private void invokeHooks() { + Communication comm = super.getContainerCommunicator().collect(); + HookInvoker invoker = new HookInvoker(CoreConstant.DATAX_HOME + "/hook", configuration, comm.getCounter()); + invoker.invokeAll(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/meta/ExecuteMode.java b/core/src/main/java/com/alibaba/datax/core/job/meta/ExecuteMode.java new file mode 100644 index 000000000..956f9c4b2 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/meta/ExecuteMode.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.core.job.meta; + +/** + * Created by liupeng on 15/12/21. + */ +public enum ExecuteMode { + STANDALONE("standalone"), ; + + String value; + + private ExecuteMode(String value) { + this.value = value; + } + + public String value() { + return this.value; + } + + public String getValue() { + return this.value; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/meta/State.java b/core/src/main/java/com/alibaba/datax/core/job/meta/State.java new file mode 100644 index 000000000..2a1dd227e --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/meta/State.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.core.job.meta; + +/** + * Created by liupeng on 15/12/21. 
+ */ +public enum State { + SUBMITTING(10), + WAITING(20), + RUNNING(30), + KILLING(40), + KILLED(50), + FAILED(60), + SUCCEEDED(70), ; + + int value; + + private State(int value) { + this.value = value; + } + + public int value() { + return this.value; + } + + public boolean isFinished() { + return this == KILLED || this == FAILED || this == SUCCEEDED; + } + + public boolean isRunning() { + return !this.isFinished(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/scheduler/AbstractScheduler.java b/core/src/main/java/com/alibaba/datax/core/job/scheduler/AbstractScheduler.java new file mode 100755 index 000000000..3dbc8f696 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/scheduler/AbstractScheduler.java @@ -0,0 +1,135 @@ +package com.alibaba.datax.core.job.scheduler; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.util.ErrorRecordChecker; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +public abstract class AbstractScheduler { + private static final Logger LOG = LoggerFactory + .getLogger(AbstractScheduler.class); + + private ErrorRecordChecker errorLimit; + + private AbstractContainerCommunicator containerCommunicator; + + private Long jobId; + + public Long getJobId() { + return jobId; + } + + public AbstractScheduler(AbstractContainerCommunicator containerCommunicator) { + this.containerCommunicator = containerCommunicator; + } + + public void schedule(List configurations) { + Validate.notNull(configurations, + "scheduler配置不能为空"); + int jobReportIntervalInMillSec = configurations.get(0).getInt( + CoreConstant.DATAX_CORE_CONTAINER_JOB_REPORTINTERVAL, 30000); + int jobSleepIntervalInMillSec = configurations.get(0).getInt( + CoreConstant.DATAX_CORE_CONTAINER_JOB_SLEEPINTERVAL, 10000); + + this.jobId = configurations.get(0).getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + + errorLimit = new ErrorRecordChecker(configurations.get(0)); + + /** + * 给 taskGroupContainer 的 Communication 注册 + */ + this.containerCommunicator.registerCommunication(configurations); + + int totalTasks = calculateTaskCount(configurations); + startAllTaskGroup(configurations); + + Communication lastJobContainerCommunication = new Communication(); + + long lastReportTimeStamp = System.currentTimeMillis(); + try { + while (true) { + /** + * step 1: collect job stat + * step 2: getReport info, then report it + * step 3: errorLimit do check + * step 4: dealSucceedStat(); + * step 5: dealKillingStat(); + * step 6: dealFailedStat(); + * step 7: refresh last job stat, and then sleep for next while + * + * above steps, some ones should report info to DS + * + */ + Communication nowJobContainerCommunication = this.containerCommunicator.collect(); + nowJobContainerCommunication.setTimestamp(System.currentTimeMillis()); + LOG.debug(nowJobContainerCommunication.toString()); + + //汇报周期 + long now = System.currentTimeMillis(); + if (now - lastReportTimeStamp > jobReportIntervalInMillSec) { + Communication 
reportCommunication = CommunicationTool + .getReportCommunication(nowJobContainerCommunication, lastJobContainerCommunication, totalTasks); + + this.containerCommunicator.report(reportCommunication); + lastReportTimeStamp = now; + lastJobContainerCommunication = nowJobContainerCommunication; + } + + errorLimit.checkRecordLimit(nowJobContainerCommunication); + + if (nowJobContainerCommunication.getState() == State.SUCCEEDED) { + LOG.info("Scheduler accomplished all tasks."); + break; + } + + if (isJobKilling(this.getJobId())) { + dealKillingStat(this.containerCommunicator, totalTasks); + } else if (nowJobContainerCommunication.getState() == State.FAILED) { + dealFailedStat(this.containerCommunicator, nowJobContainerCommunication.getThrowable()); + } + + Thread.sleep(jobSleepIntervalInMillSec); + } + } catch (InterruptedException e) { + // 以 failed 状态退出 + LOG.error("捕获到InterruptedException异常!", e); + + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } + + } + + protected abstract void startAllTaskGroup(List configurations); + + protected abstract void dealFailedStat(AbstractContainerCommunicator frameworkCollector, Throwable throwable); + + protected abstract void dealKillingStat(AbstractContainerCommunicator frameworkCollector, int totalTasks); + + private int calculateTaskCount(List configurations) { + int totalTasks = 0; + for (Configuration taskGroupConfiguration : configurations) { + totalTasks += taskGroupConfiguration.getListConfiguration( + CoreConstant.DATAX_JOB_CONTENT).size(); + } + return totalTasks; + } + +// private boolean isJobKilling(Long jobId) { +// Result jobInfo = DataxServiceUtil.getJobInfo(jobId); +// return jobInfo.getData() == State.KILLING.value(); +// } + + protected abstract boolean isJobKilling(Long jobId); +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/ProcessInnerScheduler.java b/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/ProcessInnerScheduler.java new file mode 100755 index 000000000..2bc6e64c9 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/ProcessInnerScheduler.java @@ -0,0 +1,60 @@ +package com.alibaba.datax.core.job.scheduler.processinner; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.scheduler.AbstractScheduler; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.taskgroup.TaskGroupContainer; +import com.alibaba.datax.core.taskgroup.runner.TaskGroupContainerRunner; +import com.alibaba.datax.core.util.FrameworkErrorCode; + +import java.util.List; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; + +public abstract class ProcessInnerScheduler extends AbstractScheduler { + + private ExecutorService taskGroupContainerExecutorService; + + public ProcessInnerScheduler(AbstractContainerCommunicator containerCommunicator) { + super(containerCommunicator); + } + + @Override + public void startAllTaskGroup(List configurations) { + this.taskGroupContainerExecutorService = Executors + .newFixedThreadPool(configurations.size()); + + for (Configuration taskGroupConfiguration : configurations) { + TaskGroupContainerRunner taskGroupContainerRunner = newTaskGroupContainerRunner(taskGroupConfiguration); + this.taskGroupContainerExecutorService.execute(taskGroupContainerRunner); + } + + 
this.taskGroupContainerExecutorService.shutdown(); + } + + @Override + public void dealFailedStat(AbstractContainerCommunicator frameworkCollector, Throwable throwable) { + this.taskGroupContainerExecutorService.shutdownNow(); + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_RUNTIME_ERROR, throwable); + } + + + @Override + public void dealKillingStat(AbstractContainerCommunicator frameworkCollector, int totalTasks) { + //通过进程退出返回码标示状态 + this.taskGroupContainerExecutorService.shutdownNow(); + throw DataXException.asDataXException(FrameworkErrorCode.KILLED_EXIT_VALUE, + "job killed status"); + } + + + private TaskGroupContainerRunner newTaskGroupContainerRunner( + Configuration configuration) { + TaskGroupContainer taskGroupContainer = new TaskGroupContainer(configuration); + + return new TaskGroupContainerRunner(taskGroupContainer); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/StandAloneScheduler.java b/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/StandAloneScheduler.java new file mode 100755 index 000000000..d87421b7a --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/StandAloneScheduler.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.core.job.scheduler.processinner; + +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; + +/** + * Created by hongjiao.hj on 2014/12/22. + */ +public class StandAloneScheduler extends ProcessInnerScheduler{ + + public StandAloneScheduler(AbstractContainerCommunicator containerCommunicator) { + super(containerCommunicator); + } + + @Override + protected boolean isJobKilling(Long jobId) { + return false; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/communication/Communication.java b/core/src/main/java/com/alibaba/datax/core/statistics/communication/Communication.java new file mode 100755 index 000000000..2a93a80bb --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/communication/Communication.java @@ -0,0 +1,281 @@ +package com.alibaba.datax.core.statistics.communication; + +import com.alibaba.datax.common.base.BaseObject; +import com.alibaba.datax.core.job.meta.State; +import org.apache.commons.lang.StringUtils; +import org.apache.commons.lang.Validate; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.concurrent.ConcurrentHashMap; + +/** + * DataX所有的状态及统计信息交互类,job、taskGroup、task等的消息汇报都走该类 + */ +public class Communication extends BaseObject implements Cloneable { + /** + * 所有的数值key-value对 * + */ + private Map counter; + + /** + * 运行状态 * + */ + private State state; + + /** + * 异常记录 * + */ + private Throwable throwable; + + /** + * 记录的timestamp * + */ + private long timestamp; + + /** + * task给job的信息 * + */ + Map> message; + + public Communication() { + this.init(); + } + + public synchronized void reset() { + this.init(); + } + + private void init() { + this.counter = new ConcurrentHashMap(); + this.state = State.RUNNING; + this.throwable = null; + this.message = new ConcurrentHashMap>(); + this.timestamp = System.currentTimeMillis(); + } + + public Map getCounter() { + return this.counter; + } + + public State getState() { + return this.state; + } + + public synchronized void setState(State state, boolean isForce) { + if (!isForce && this.state.equals(State.FAILED)) { + return; + } + + this.state = state; + } + + public synchronized void setState(State state) { + 
setState(state, false); + } + + public Throwable getThrowable() { + return this.throwable; + } + + public synchronized String getThrowableMessage() { + return this.throwable == null ? "" : this.throwable.getMessage(); + } + + public void setThrowable(Throwable throwable) { + setThrowable(throwable, false); + } + + public synchronized void setThrowable(Throwable throwable, boolean isForce) { + if (isForce) { + this.throwable = throwable; + } else { + this.throwable = this.throwable == null ? throwable : this.throwable; + } + } + + public long getTimestamp() { + return this.timestamp; + } + + public void setTimestamp(long timestamp) { + this.timestamp = timestamp; + } + + public Map> getMessage() { + return this.message; + } + + public List getMessage(final String key) { + return message.get(key); + } + + public synchronized void addMessage(final String key, final String value) { + Validate.isTrue(StringUtils.isNotBlank(key), "增加message的key不能为空"); + List valueList = this.message.get(key); + if (null == valueList) { + valueList = new ArrayList(); + this.message.put(key, valueList); + } + + valueList.add(value); + } + + public synchronized Long getLongCounter(final String key) { + Number value = this.counter.get(key); + + return value == null ? 0 : value.longValue(); + } + + public synchronized void setLongCounter(final String key, final long value) { + Validate.isTrue(StringUtils.isNotBlank(key), "设置counter的key不能为空"); + this.counter.put(key, value); + } + + public synchronized Double getDoubleCounter(final String key) { + Number value = this.counter.get(key); + + return value == null ? 0.0d : value.doubleValue(); + } + + public synchronized void setDoubleCounter(final String key, final double value) { + Validate.isTrue(StringUtils.isNotBlank(key), "设置counter的key不能为空"); + this.counter.put(key, value); + } + + public synchronized void increaseCounter(final String key, final long deltaValue) { + Validate.isTrue(StringUtils.isNotBlank(key), "增加counter的key不能为空"); + + long value = this.getLongCounter(key); + + this.counter.put(key, value + deltaValue); + } + + @Override + public Communication clone() { + Communication communication = new Communication(); + + /** + * clone counter + */ + if (this.counter != null) { + for (Map.Entry entry : this.counter.entrySet()) { + String key = entry.getKey(); + Number value = entry.getValue(); + if (value instanceof Long) { + communication.setLongCounter(key, (Long) value); + } else if (value instanceof Double) { + communication.setDoubleCounter(key, (Double) value); + } + } + } + + communication.setState(this.state, true); + communication.setThrowable(this.throwable, true); + communication.setTimestamp(this.timestamp); + + /** + * clone message + */ + if (this.message != null) { + for (final Map.Entry> entry : this.message.entrySet()) { + String key = entry.getKey(); + List value = new ArrayList() {{ + addAll(entry.getValue()); + }}; + communication.getMessage().put(key, value); + } + } + + return communication; + } + + public synchronized Communication mergeFrom(final Communication otherComm) { + if (otherComm == null) { + return this; + } + + /** + * counter的合并,将otherComm的值累加到this中,不存在的则创建 + * 同为long + */ + for (Entry entry : otherComm.getCounter().entrySet()) { + String key = entry.getKey(); + Number otherValue = entry.getValue(); + if (otherValue == null) { + continue; + } + + Number value = this.counter.get(key); + if (value == null) { + value = otherValue; + } else { + if (value instanceof Long && otherValue instanceof Long) { + value = value.longValue() 
+ otherValue.longValue(); + } else { + value = value.doubleValue() + value.doubleValue(); + } + } + + this.counter.put(key, value); + } + + // 合并state + mergeStateFrom(otherComm); + + /** + * 合并throwable,当this的throwable为空时, + * 才将otherComm的throwable合并进来 + */ + this.throwable = this.throwable == null ? otherComm.getThrowable() : this.throwable; + + /** + * timestamp是整个一次合并的时间戳,单独两两communication不作合并 + */ + + /** + * message的合并采取求并的方式,即全部累计在一起 + */ + for (Entry> entry : otherComm.getMessage().entrySet()) { + String key = entry.getKey(); + List valueList = this.message.get(key); + if (valueList == null) { + valueList = new ArrayList(); + this.message.put(key, valueList); + } + + valueList.addAll(entry.getValue()); + } + + return this; + } + + /** + * 合并state,优先级: (Failed | Killed) > Running > Success + * 这里不会出现 Killing 状态,killing 状态只在 Job 自身状态上才有. + */ + public synchronized State mergeStateFrom(final Communication otherComm) { + State retState = this.getState(); + if (otherComm == null) { + return retState; + } + + if (this.state == State.FAILED || otherComm.getState() == State.FAILED + || this.state == State.KILLED || otherComm.getState() == State.KILLED) { + retState = State.FAILED; + } else if (this.state.isRunning() || otherComm.state.isRunning()) { + retState = State.RUNNING; + } + + this.setState(retState); + return retState; + } + + public synchronized boolean isFinished(){ + return this.state == State.SUCCEEDED || this.state == State.FAILED + || this.state == State.KILLED; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/communication/CommunicationTool.java b/core/src/main/java/com/alibaba/datax/core/statistics/communication/CommunicationTool.java new file mode 100755 index 000000000..c7e0d0539 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/communication/CommunicationTool.java @@ -0,0 +1,261 @@ +package com.alibaba.datax.core.statistics.communication; + +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.util.StrUtil; +import com.alibaba.fastjson.JSON; +import org.apache.commons.lang.Validate; + +import java.text.DecimalFormat; +import java.util.HashMap; +import java.util.Map; + +/** + * 这里主要是业务层面的处理 + */ +public final class CommunicationTool { + public static final String STAGE = "stage"; + public static final String BYTE_SPEED = "byteSpeed"; + public static final String RECORD_SPEED = "recordSpeed"; + public static final String PERCENTAGE = "percentage"; + + public static final String READ_SUCCEED_RECORDS = "readSucceedRecords"; + public static final String READ_SUCCEED_BYTES = "readSucceedBytes"; + + public static final String READ_FAILED_RECORDS = "readFailedRecords"; + public static final String READ_FAILED_BYTES = "readFailedBytes"; + + public static final String WRITE_RECEIVED_RECORDS = "writeReceivedRecords"; + public static final String WRITE_RECEIVED_BYTES = "writeReceivedBytes"; + + public static final String WRITE_FAILED_RECORDS = "writeFailedRecords"; + public static final String WRITE_FAILED_BYTES = "writeFailedBytes"; + + private static final String TOTAL_READ_RECORDS = "totalReadRecords"; + private static final String TOTAL_READ_BYTES = "totalReadBytes"; + + private static final String TOTAL_ERROR_RECORDS = "totalErrorRecords"; + private static final String TOTAL_ERROR_BYTES = "totalErrorBytes"; + + private static final String WRITE_SUCCEED_RECORDS = "writeSucceedRecords"; + private static final String WRITE_SUCCEED_BYTES = "writeSucceedBytes"; + + public static final String 
WAIT_WRITER_TIME = "waitWriterTime"; + + public static final String WAIT_READER_TIME = "waitReaderTime"; + + public static Communication getReportCommunication(Communication now, Communication old, int totalStage) { + Validate.isTrue(now != null && old != null, + "为汇报准备的新旧metric不能为null"); + + long totalReadRecords = getTotalReadRecords(now); + long totalReadBytes = getTotalReadBytes(now); + now.setLongCounter(TOTAL_READ_RECORDS, totalReadRecords); + now.setLongCounter(TOTAL_READ_BYTES, totalReadBytes); + now.setLongCounter(TOTAL_ERROR_RECORDS, getTotalErrorRecords(now)); + now.setLongCounter(TOTAL_ERROR_BYTES, getTotalErrorBytes(now)); + now.setLongCounter(WRITE_SUCCEED_RECORDS, getWriteSucceedRecords(now)); + now.setLongCounter(WRITE_SUCCEED_BYTES, getWriteSucceedBytes(now)); + + long timeInterval = now.getTimestamp() - old.getTimestamp(); + long sec = timeInterval <= 1000 ? 1 : timeInterval / 1000; + long bytesSpeed = (totalReadBytes + - getTotalReadBytes(old)) / sec; + long recordsSpeed = (totalReadRecords + - getTotalReadRecords(old)) / sec; + + now.setLongCounter(BYTE_SPEED, bytesSpeed < 0 ? 0 : bytesSpeed); + now.setLongCounter(RECORD_SPEED, recordsSpeed < 0 ? 0 : recordsSpeed); + now.setDoubleCounter(PERCENTAGE, now.getLongCounter(STAGE) / (double) totalStage); + + if (old.getThrowable() != null) { + now.setThrowable(old.getThrowable()); + } + + return now; + } + + public static long getTotalReadRecords(final Communication communication) { + return communication.getLongCounter(READ_SUCCEED_RECORDS) + + communication.getLongCounter(READ_FAILED_RECORDS); + } + + public static long getTotalReadBytes(final Communication communication) { + return communication.getLongCounter(READ_SUCCEED_BYTES) + + communication.getLongCounter(READ_FAILED_BYTES); + } + + public static long getTotalErrorRecords(final Communication communication) { + return communication.getLongCounter(READ_FAILED_RECORDS) + + communication.getLongCounter(WRITE_FAILED_RECORDS); + } + + public static long getTotalErrorBytes(final Communication communication) { + return communication.getLongCounter(READ_FAILED_BYTES) + + communication.getLongCounter(WRITE_FAILED_BYTES); + } + + public static long getWriteSucceedRecords(final Communication communication) { + return communication.getLongCounter(WRITE_RECEIVED_RECORDS) - + communication.getLongCounter(WRITE_FAILED_RECORDS); + } + + public static long getWriteSucceedBytes(final Communication communication) { + return communication.getLongCounter(WRITE_RECEIVED_BYTES) - + communication.getLongCounter(WRITE_FAILED_BYTES); + } + + public static class Stringify { + private final static DecimalFormat df = new DecimalFormat("0.00"); + + public static String getSnapshot(final Communication communication) { + StringBuilder sb = new StringBuilder(); + sb.append("Total "); + sb.append(getTotal(communication)); + sb.append(" | "); + sb.append("Speed "); + sb.append(getSpeed(communication)); + sb.append(" | "); + sb.append("Error "); + sb.append(getError(communication)); + sb.append(" | "); + sb.append(" All Task WaitWriterTime "); + sb.append(PerfTrace.unitTime(communication.getLongCounter(WAIT_WRITER_TIME))); + sb.append(" | "); + sb.append(" All Task WaitReaderTime "); + sb.append(PerfTrace.unitTime(communication.getLongCounter(WAIT_READER_TIME))); + sb.append(" | "); + sb.append("Percentage "); + sb.append(getPercentage(communication)); + return sb.toString(); + } + + private static String getTotal(final Communication communication) { + return String.format("%d records, %d bytes", + 
communication.getLongCounter(TOTAL_READ_RECORDS), + communication.getLongCounter(TOTAL_READ_BYTES)); + } + + private static String getSpeed(final Communication communication) { + return String.format("%s/s, %d records/s", + StrUtil.stringify(communication.getLongCounter(BYTE_SPEED)), + communication.getLongCounter(RECORD_SPEED)); + } + + private static String getError(final Communication communication) { + return String.format("%d records, %d bytes", + communication.getLongCounter(TOTAL_ERROR_RECORDS), + communication.getLongCounter(TOTAL_ERROR_BYTES)); + } + + private static String getPercentage(final Communication communication) { + return df.format(communication.getDoubleCounter(PERCENTAGE) * 100) + "%"; + } + } + + public static class Jsonify { + @SuppressWarnings("rawtypes") + public static String getSnapshot(Communication communication) { + Validate.notNull(communication); + + Map state = new HashMap(); + + Pair pair = getTotalBytes(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getTotalRecords(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getSpeedRecord(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getSpeedByte(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getStage(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getErrorRecords(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getErrorBytes(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getErrorMessage(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getPercentage(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getWaitReaderTime(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getWaitWriterTime(communication); + state.put((String) pair.getKey(), pair.getValue()); + + return JSON.toJSONString(state); + } + + private static Pair getTotalBytes(final Communication communication) { + return new Pair("totalBytes", communication.getLongCounter(TOTAL_READ_BYTES)); + } + + private static Pair getTotalRecords(final Communication communication) { + return new Pair("totalRecords", communication.getLongCounter(TOTAL_READ_RECORDS)); + } + + private static Pair getSpeedByte(final Communication communication) { + return new Pair("speedBytes", communication.getLongCounter(BYTE_SPEED)); + } + + private static Pair getSpeedRecord(final Communication communication) { + return new Pair("speedRecords", communication.getLongCounter(RECORD_SPEED)); + } + + private static Pair getErrorRecords(final Communication communication) { + return new Pair("errorRecords", communication.getLongCounter(TOTAL_ERROR_RECORDS)); + } + + private static Pair getErrorBytes(final Communication communication) { + return new Pair("errorBytes", communication.getLongCounter(TOTAL_ERROR_BYTES)); + } + + private static Pair getStage(final Communication communication) { + return new Pair("stage", communication.getLongCounter(STAGE)); + } + + private static Pair getPercentage(final Communication communication) { + return new Pair("percentage", communication.getDoubleCounter(PERCENTAGE)); + } + + private static Pair getErrorMessage(final Communication communication) { + return new Pair("errorMessage", communication.getThrowableMessage()); + } + + private static Pair getWaitReaderTime(final Communication communication) { + return new Pair("waitReaderTime", 
communication.getLongCounter(CommunicationTool.WAIT_READER_TIME)); + } + + private static Pair getWaitWriterTime(final Communication communication) { + return new Pair("waitWriterTime", communication.getLongCounter(CommunicationTool.WAIT_WRITER_TIME)); + } + + static class Pair { + public Pair(final K key, final V value) { + this.key = key; + this.value = value; + } + + public K getKey() { + return key; + } + + public V getValue() { + return value; + } + + private K key; + + private V value; + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/communication/LocalTGCommunicationManager.java b/core/src/main/java/com/alibaba/datax/core/statistics/communication/LocalTGCommunicationManager.java new file mode 100755 index 000000000..163262f66 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/communication/LocalTGCommunicationManager.java @@ -0,0 +1,62 @@ +package com.alibaba.datax.core.statistics.communication; + +import com.alibaba.datax.core.job.meta.State; +import org.apache.commons.lang3.Validate; + +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +public final class LocalTGCommunicationManager { + private static Map taskGroupCommunicationMap = + new ConcurrentHashMap(); + + public static void registerTaskGroupCommunication( + int taskGroupId, Communication communication) { + taskGroupCommunicationMap.put(taskGroupId, communication); + } + + public static Communication getJobCommunication() { + Communication communication = new Communication(); + communication.setState(State.SUCCEEDED); + + for (Communication taskGroupCommunication : + taskGroupCommunicationMap.values()) { + communication.mergeFrom(taskGroupCommunication); + } + + return communication; + } + + /** + * 采用获取taskGroupId后再获取对应communication的方式, + * 防止map遍历时修改,同时也防止对map key-value对的修改 + * + * @return + */ + public static Set getTaskGroupIdSet() { + return taskGroupCommunicationMap.keySet(); + } + + public static Communication getTaskGroupCommunication(int taskGroupId) { + Validate.isTrue(taskGroupId >= 0, "taskGroupId不能小于0"); + + return taskGroupCommunicationMap.get(taskGroupId); + } + + public static void updateTaskGroupCommunication(final int taskGroupId, + final Communication communication) { + Validate.isTrue(taskGroupCommunicationMap.containsKey( + taskGroupId), String.format("taskGroupCommunicationMap中没有注册taskGroupId[%d]的Communication," + + "无法更新该taskGroup的信息", taskGroupId)); + taskGroupCommunicationMap.put(taskGroupId, communication); + } + + public static void clear() { + taskGroupCommunicationMap.clear(); + } + + public static Map getTaskGroupCommunicationMap() { + return taskGroupCommunicationMap; + } +} \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/AbstractCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/AbstractCollector.java new file mode 100755 index 000000000..a3d18a4a7 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/AbstractCollector.java @@ -0,0 +1,69 @@ +package com.alibaba.datax.core.statistics.container.collector; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; +import com.alibaba.datax.core.util.container.CoreConstant; + +import java.util.List; +import 
java.util.Map; +import java.util.concurrent.ConcurrentHashMap; + +public abstract class AbstractCollector { + private Map taskCommunicationMap = new ConcurrentHashMap(); + private Long jobId; + + public Map getTaskCommunicationMap() { + return taskCommunicationMap; + } + + public Long getJobId() { + return jobId; + } + + public void setJobId(Long jobId) { + this.jobId = jobId; + } + + public void registerTGCommunication(List taskGroupConfigurationList) { + for (Configuration config : taskGroupConfigurationList) { + int taskGroupId = config.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + LocalTGCommunicationManager.registerTaskGroupCommunication(taskGroupId, new Communication()); + } + } + + public void registerTaskCommunication(List taskConfigurationList) { + for (Configuration taskConfig : taskConfigurationList) { + int taskId = taskConfig.getInt(CoreConstant.TASK_ID); + this.taskCommunicationMap.put(taskId, new Communication()); + } + } + + public Communication collectFromTask() { + Communication communication = new Communication(); + communication.setState(State.SUCCEEDED); + + for (Communication taskCommunication : + this.taskCommunicationMap.values()) { + communication.mergeFrom(taskCommunication); + } + + return communication; + } + + public abstract Communication collectFromTaskGroup(); + + public Map getTGCommunicationMap() { + return LocalTGCommunicationManager.getTaskGroupCommunicationMap(); + } + + public Communication getTGCommunication(Integer taskGroupId) { + return LocalTGCommunicationManager.getTaskGroupCommunication(taskGroupId); + } + + public Communication getTaskCommunication(Integer taskId) { + return this.taskCommunicationMap.get(taskId); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/ProcessInnerCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/ProcessInnerCollector.java new file mode 100755 index 000000000..530794b56 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/ProcessInnerCollector.java @@ -0,0 +1,17 @@ +package com.alibaba.datax.core.statistics.container.collector; + +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; + +public class ProcessInnerCollector extends AbstractCollector { + + public ProcessInnerCollector(Long jobId) { + super.setJobId(jobId); + } + + @Override + public Communication collectFromTaskGroup() { + return LocalTGCommunicationManager.getJobCommunication(); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/AbstractContainerCommunicator.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/AbstractContainerCommunicator.java new file mode 100755 index 000000000..b7b159f6f --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/AbstractContainerCommunicator.java @@ -0,0 +1,88 @@ +package com.alibaba.datax.core.statistics.container.communicator; + + +import com.alibaba.datax.common.statistics.VMInfo; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.container.collector.AbstractCollector; +import com.alibaba.datax.core.statistics.container.report.AbstractReporter; +import com.alibaba.datax.core.util.container.CoreConstant; + 
+import java.util.List; +import java.util.Map; + +public abstract class AbstractContainerCommunicator { + private Configuration configuration; + private AbstractCollector collector; + private AbstractReporter reporter; + + private Long jobId; + + private VMInfo vmInfo = VMInfo.getVmInfo(); + private long lastReportTime = System.currentTimeMillis(); + + + public Configuration getConfiguration() { + return this.configuration; + } + + public AbstractCollector getCollector() { + return collector; + } + + public AbstractReporter getReporter() { + return reporter; + } + + public void setCollector(AbstractCollector collector) { + this.collector = collector; + } + + public void setReporter(AbstractReporter reporter) { + this.reporter = reporter; + } + + public Long getJobId() { + return jobId; + } + + public AbstractContainerCommunicator(Configuration configuration) { + this.configuration = configuration; + this.jobId = configuration.getLong(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + } + + + public abstract void registerCommunication(List configurationList); + + public abstract Communication collect(); + + public abstract void report(Communication communication); + + public abstract State collectState(); + + public abstract Communication getCommunication(Integer id); + + /** + * 当 实现是 TGContainerCommunicator 时,返回的 Map: key=taskId, value=Communication + * 当 实现是 JobContainerCommunicator 时,返回的 Map: key=taskGroupId, value=Communication + */ + public abstract Map getCommunicationMap(); + + public void resetCommunication(Integer id){ + Map map = getCommunicationMap(); + map.put(id, new Communication()); + } + + public void reportVmInfo(){ + long now = System.currentTimeMillis(); + //每5分钟打印一次 + if(now - lastReportTime >= 300000) { + //当前仅打印 + if (vmInfo != null) { + vmInfo.getDelta(true); + } + lastReportTime = now; + } + } +} \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/job/StandAloneJobContainerCommunicator.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/job/StandAloneJobContainerCommunicator.java new file mode 100755 index 000000000..1bce7318d --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/job/StandAloneJobContainerCommunicator.java @@ -0,0 +1,63 @@ +package com.alibaba.datax.core.statistics.container.communicator.job; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.statistics.container.collector.ProcessInnerCollector; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.statistics.container.report.ProcessInnerReporter; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; +import java.util.Map; + +public class StandAloneJobContainerCommunicator extends AbstractContainerCommunicator { + private static final Logger LOG = LoggerFactory + .getLogger(StandAloneJobContainerCommunicator.class); + + public StandAloneJobContainerCommunicator(Configuration configuration) { + super(configuration); + super.setCollector(new ProcessInnerCollector(configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID))); + super.setReporter(new ProcessInnerReporter()); + } + + 
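+    /**
+     * Standalone mode keeps all statistics in-process: the ProcessInnerCollector set
+     * up in the constructor registers one Communication per task group in
+     * LocalTGCommunicationManager, so the registration and collection below never
+     * leave the JVM.
+     */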
@Override + public void registerCommunication(List configurationList) { + super.getCollector().registerTGCommunication(configurationList); + } + + @Override + public Communication collect() { + return super.getCollector().collectFromTaskGroup(); + } + + @Override + public State collectState() { + return this.collect().getState(); + } + + /** + * 和 DistributeJobContainerCollector 的 report 实现一样 + */ + @Override + public void report(Communication communication) { + super.getReporter().reportJobCommunication(super.getJobId(), communication); + + LOG.info(CommunicationTool.Stringify.getSnapshot(communication)); + reportVmInfo(); + } + + @Override + public Communication getCommunication(Integer taskGroupId) { + return super.getCollector().getTGCommunication(taskGroupId); + } + + @Override + public Map getCommunicationMap() { + return super.getCollector().getTGCommunicationMap(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/AbstractTGContainerCommunicator.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/AbstractTGContainerCommunicator.java new file mode 100755 index 000000000..fbb5b6d88 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/AbstractTGContainerCommunicator.java @@ -0,0 +1,74 @@ +package com.alibaba.datax.core.statistics.container.communicator.taskgroup; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.container.collector.ProcessInnerCollector; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.Validate; + +import java.util.List; +import java.util.Map; + +/** + * 该类是用于处理 taskGroupContainer 的 communication 的收集汇报的父类 + * 主要是 taskCommunicationMap 记录了 taskExecutor 的 communication 属性 + */ +public abstract class AbstractTGContainerCommunicator extends AbstractContainerCommunicator { + + protected long jobId; + + /** + * 由于taskGroupContainer是进程内部调度 + * 其registerCommunication(),getCommunication(), + * getCommunications(),collect()等方法是一致的 + * 所有TG的Collector都是ProcessInnerCollector + */ + protected int taskGroupId; + + public AbstractTGContainerCommunicator(Configuration configuration) { + super(configuration); + this.jobId = configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + super.setCollector(new ProcessInnerCollector(this.jobId)); + this.taskGroupId = configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + } + + @Override + public void registerCommunication(List configurationList) { + super.getCollector().registerTaskCommunication(configurationList); + } + + @Override + public final Communication collect() { + return this.getCollector().collectFromTask(); + } + + @Override + public final State collectState() { + Communication communication = new Communication(); + communication.setState(State.SUCCEEDED); + + for (Communication taskCommunication : + super.getCollector().getTaskCommunicationMap().values()) { + communication.mergeStateFrom(taskCommunication); + } + + return communication.getState(); + } + + @Override + public final Communication getCommunication(Integer taskId) { + Validate.isTrue(taskId >= 0, "注册的taskId不能小于0"); + + return super.getCollector().getTaskCommunication(taskId); + } + + @Override 
+ public final Map getCommunicationMap() { + return super.getCollector().getTaskCommunicationMap(); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/StandaloneTGContainerCommunicator.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/StandaloneTGContainerCommunicator.java new file mode 100755 index 000000000..7852154df --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/StandaloneTGContainerCommunicator.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.core.statistics.container.communicator.taskgroup; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.container.report.ProcessInnerReporter; +import com.alibaba.datax.core.statistics.communication.Communication; + +public class StandaloneTGContainerCommunicator extends AbstractTGContainerCommunicator { + + public StandaloneTGContainerCommunicator(Configuration configuration) { + super(configuration); + super.setReporter(new ProcessInnerReporter()); + } + + @Override + public void report(Communication communication) { + super.getReporter().reportTGCommunication(super.taskGroupId, communication); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/report/AbstractReporter.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/report/AbstractReporter.java new file mode 100755 index 000000000..57f98587a --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/report/AbstractReporter.java @@ -0,0 +1,11 @@ +package com.alibaba.datax.core.statistics.container.report; + +import com.alibaba.datax.core.statistics.communication.Communication; + +public abstract class AbstractReporter { + + public abstract void reportJobCommunication(Long jobId, Communication communication); + + public abstract void reportTGCommunication(Integer taskGroupId, Communication communication); + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/report/ProcessInnerReporter.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/report/ProcessInnerReporter.java new file mode 100755 index 000000000..15cdccc98 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/report/ProcessInnerReporter.java @@ -0,0 +1,17 @@ +package com.alibaba.datax.core.statistics.container.report; + +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; + +public class ProcessInnerReporter extends AbstractReporter { + + @Override + public void reportJobCommunication(Long jobId, Communication communication) { + // do nothing + } + + @Override + public void reportTGCommunication(Integer taskGroupId, Communication communication) { + LocalTGCommunicationManager.updateTaskGroupCommunication(taskGroupId, communication); + } +} \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/DefaultJobPluginCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/DefaultJobPluginCollector.java new file mode 100755 index 000000000..a9571bd44 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/DefaultJobPluginCollector.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.core.statistics.plugin; + +import com.alibaba.datax.common.plugin.JobPluginCollector; +import 
com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.statistics.communication.Communication; + +import java.util.List; +import java.util.Map; + +/** + * Created by jingxing on 14-9-9. + */ +public final class DefaultJobPluginCollector implements JobPluginCollector { + private AbstractContainerCommunicator jobCollector; + + public DefaultJobPluginCollector(AbstractContainerCommunicator containerCollector) { + this.jobCollector = containerCollector; + } + + @Override + public Map> getMessage() { + Communication totalCommunication = this.jobCollector.collect(); + return totalCommunication.getMessage(); + } + + @Override + public List getMessage(String key) { + Communication totalCommunication = this.jobCollector.collect(); + return totalCommunication.getMessage(key); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/AbstractTaskPluginCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/AbstractTaskPluginCollector.java new file mode 100755 index 000000000..ada9687f2 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/AbstractTaskPluginCollector.java @@ -0,0 +1,77 @@ +package com.alibaba.datax.core.statistics.plugin.task; + +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.FrameworkErrorCode; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Created by jingxing on 14-9-11. 
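+ *
+ * Base class for per-task plugin collectors: collectMessage() stores task-level
+ * messages on the task's Communication, while collectDirtyRecord() turns a dirty
+ * record into READ_FAILED_* or WRITE_FAILED_* counters depending on whether the
+ * reporting plugin is a reader or a writer.
+ *
+ * Minimal usage sketch from inside a plugin task (the collector accessor name is
+ * illustrative, not defined in this file):
+ * <pre>
+ *   // after a record fails to be written, hand it to the collector
+ *   getTaskPluginCollector().collectDirtyRecord(record, e, "insert failed");
+ * </pre>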
+ */ +public abstract class AbstractTaskPluginCollector extends TaskPluginCollector { + private static final Logger LOG = LoggerFactory + .getLogger(AbstractTaskPluginCollector.class); + + private Communication communication; + + private Configuration configuration; + + private PluginType pluginType; + + public AbstractTaskPluginCollector(Configuration conf, Communication communication, + PluginType type) { + this.configuration = conf; + this.communication = communication; + this.pluginType = type; + } + + public Communication getCommunication() { + return communication; + } + + public Configuration getConfiguration() { + return configuration; + } + + public PluginType getPluginType() { + return pluginType; + } + + @Override + final public void collectMessage(String key, String value) { + this.communication.addMessage(key, value); + } + + @Override + public void collectDirtyRecord(Record dirtyRecord, Throwable t, + String errorMessage) { + + if (null == dirtyRecord) { + LOG.warn("脏数据record=null."); + return; + } + + if (this.pluginType.equals(PluginType.READER)) { + this.communication.increaseCounter( + CommunicationTool.READ_FAILED_RECORDS, 1); + this.communication.increaseCounter( + CommunicationTool.READ_FAILED_BYTES, dirtyRecord.getByteSize()); + } else if (this.pluginType.equals(PluginType.WRITER)) { + this.communication.increaseCounter( + CommunicationTool.WRITE_FAILED_RECORDS, 1); + this.communication.increaseCounter( + CommunicationTool.WRITE_FAILED_BYTES, dirtyRecord.getByteSize()); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + String.format("不知道的插件类型[%s].", this.pluginType)); + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/HttpPluginCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/HttpPluginCollector.java new file mode 100755 index 000000000..e479fe2c1 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/HttpPluginCollector.java @@ -0,0 +1,23 @@ +package com.alibaba.datax.core.statistics.plugin.task; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; + +/** + * Created by jingxing on 14-9-9. 
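+ *
+ * As committed here the class is a thin pass-through: collectDirtyRecord() simply
+ * delegates to the counters in AbstractTaskPluginCollector, and no HTTP reporting
+ * happens in this class.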
+ */ +public class HttpPluginCollector extends AbstractTaskPluginCollector { + public HttpPluginCollector(Configuration configuration, Communication Communication, + PluginType type) { + super(configuration, Communication, type); + } + + @Override + public void collectDirtyRecord(Record dirtyRecord, Throwable t, + String errorMessage) { + super.collectDirtyRecord(dirtyRecord, t, errorMessage); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/StdoutPluginCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/StdoutPluginCollector.java new file mode 100755 index 000000000..8b2a83781 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/StdoutPluginCollector.java @@ -0,0 +1,74 @@ +package com.alibaba.datax.core.statistics.plugin.task; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.statistics.plugin.task.util.DirtyRecord; +import com.alibaba.fastjson.JSON; + +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.HashMap; +import java.util.Map; +import java.util.concurrent.atomic.AtomicInteger; + +/** + * Created by jingxing on 14-9-9. + */ +public class StdoutPluginCollector extends AbstractTaskPluginCollector { + private static final Logger LOG = LoggerFactory + .getLogger(StdoutPluginCollector.class); + + private static final int DEFAULT_MAX_DIRTYNUM = 128; + + private AtomicInteger maxLogNum = new AtomicInteger(0); + + private AtomicInteger currentLogNum = new AtomicInteger(0); + + public StdoutPluginCollector(Configuration configuration, Communication communication, + PluginType type) { + super(configuration, communication, type); + maxLogNum = new AtomicInteger( + configuration.getInt( + CoreConstant.DATAX_CORE_STATISTICS_COLLECTOR_PLUGIN_MAXDIRTYNUM, + DEFAULT_MAX_DIRTYNUM)); + } + + private String formatDirty(final Record dirty, final Throwable t, + final String msg) { + Map msgGroup = new HashMap(); + + msgGroup.put("type", super.getPluginType().toString()); + if (StringUtils.isNotBlank(msg)) { + msgGroup.put("message", msg); + } + if (null != t && StringUtils.isNotBlank(t.getMessage())) { + msgGroup.put("exception", t.getMessage()); + } + if (null != dirty) { + msgGroup.put("record", DirtyRecord.asDirtyRecord(dirty) + .getColumns()); + } + + return JSON.toJSONString(msgGroup); + } + + @Override + public void collectDirtyRecord(Record dirtyRecord, Throwable t, + String errorMessage) { + int logNum = currentLogNum.getAndIncrement(); + if(logNum==0 && t!=null){ + LOG.error("", t); + } + if (maxLogNum.intValue() < 0 || currentLogNum.intValue() < maxLogNum.intValue()) { + LOG.error("脏数据: \n" + + this.formatDirty(dirtyRecord, t, errorMessage)); + } + + super.collectDirtyRecord(dirtyRecord, t, errorMessage); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/util/DirtyRecord.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/util/DirtyRecord.java new file mode 100755 index 000000000..fdc5d8215 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/util/DirtyRecord.java @@ -0,0 +1,151 @@ +package com.alibaba.datax.core.statistics.plugin.task.util; + +import 
com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.fastjson.JSON; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.ArrayList; +import java.util.Date; +import java.util.List; + +public class DirtyRecord implements Record { + private List columns = new ArrayList(); + + public static DirtyRecord asDirtyRecord(final Record record) { + DirtyRecord result = new DirtyRecord(); + for (int i = 0; i < record.getColumnNumber(); i++) { + result.addColumn(record.getColumn(i)); + } + + return result; + } + + @Override + public void addColumn(Column column) { + this.columns.add( + DirtyColumn.asDirtyColumn(column, this.columns.size())); + } + + @Override + public String toString() { + return JSON.toJSONString(this.columns); + } + + @Override + public void setColumn(int i, Column column) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public Column getColumn(int i) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public int getColumnNumber() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public int getByteSize() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public int getMemorySize() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + public List getColumns() { + return columns; + } + + public void setColumns(List columns) { + this.columns = columns; + } + +} + +class DirtyColumn extends Column { + private int index; + + public static Column asDirtyColumn(final Column column, int index) { + return new DirtyColumn(column, index); + } + + private DirtyColumn(Column column, int index) { + this(null == column ? null : column.getRawData(), + null == column ? Column.Type.NULL : column.getType(), + null == column ? 
0 : column.getByteSize(), index); + } + + public int getIndex() { + return index; + } + + public void setIndex(int index) { + this.index = index; + } + + @Override + public Long asLong() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public Double asDouble() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public String asString() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public Date asDate() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public Boolean asBoolean() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public BigDecimal asBigDecimal() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public BigInteger asBigInteger() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + private DirtyColumn(Object object, Type type, int byteSize, int index) { + super(object, type, byteSize); + this.setIndex(index); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskGroupContainer.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskGroupContainer.java new file mode 100755 index 000000000..5702185bf --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskGroupContainer.java @@ -0,0 +1,540 @@ +package com.alibaba.datax.core.taskgroup; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.statistics.VMInfo; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.AbstractContainer; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.statistics.container.communicator.taskgroup.StandaloneTGContainerCommunicator; +import com.alibaba.datax.core.statistics.plugin.task.AbstractTaskPluginCollector; +import com.alibaba.datax.core.taskgroup.runner.AbstractRunner; +import com.alibaba.datax.core.taskgroup.runner.ReaderRunner; +import com.alibaba.datax.core.taskgroup.runner.WriterRunner; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.exchanger.BufferedRecordExchanger; +import com.alibaba.datax.core.util.ClassUtil; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.util.container.LoadUtil; +import com.alibaba.fastjson.JSON; +import org.apache.commons.lang3.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; + +public class TaskGroupContainer extends AbstractContainer { + private static final Logger LOG = LoggerFactory + .getLogger(TaskGroupContainer.class); + + /** + * 当前taskGroup所属jobId + */ + private long jobId; + 
+ /** + * 当前taskGroupId + */ + private int taskGroupId; + + /** + * 使用的channel类 + */ + private String channelClazz; + + /** + * task收集器使用的类 + */ + private String taskCollectorClass; + + private TaskMonitor taskMonitor = TaskMonitor.getInstance(); + + public TaskGroupContainer(Configuration configuration) { + super(configuration); + + initCommunicator(configuration); + + this.jobId = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + this.taskGroupId = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + + this.channelClazz = this.configuration.getString( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CLASS); + this.taskCollectorClass = this.configuration.getString( + CoreConstant.DATAX_CORE_STATISTICS_COLLECTOR_PLUGIN_TASKCLASS); + } + + private void initCommunicator(Configuration configuration) { + super.setContainerCommunicator(new StandaloneTGContainerCommunicator(configuration)); + } + + public long getJobId() { + return jobId; + } + + public int getTaskGroupId() { + return taskGroupId; + } + + @Override + public void start() { + try { + /** + * 状态check时间间隔,较短,可以把任务及时分发到对应channel中 + */ + int sleepIntervalInMillSec = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_SLEEPINTERVAL, 100); + /** + * 状态汇报时间间隔,稍长,避免大量汇报 + */ + long reportIntervalInMillSec = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_REPORTINTERVAL, + 5000); + + // 获取channel数目 + int channelNumber = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL); + + int taskMaxRetryTimes = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASK_FAILOVER_MAXRETRYTIMES, 1); + + long taskRetryIntervalInMsec = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_TASK_FAILOVER_RETRYINTERVALINMSEC, 10000); + + long taskMaxWaitInMsec = this.configuration.getLong(CoreConstant.DATAX_CORE_CONTAINER_TASK_FAILOVER_MAXWAITINMSEC, 60000); + + List taskConfigs = this.configuration + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + + if(LOG.isDebugEnabled()) { + LOG.debug("taskGroup[{}]'s task configs[{}]", this.taskGroupId, + JSON.toJSONString(taskConfigs)); + } + + int taskCountInThisTaskGroup = taskConfigs.size(); + LOG.info(String.format( + "taskGroupId=[%d] start [%d] channels for [%d] tasks.", + this.taskGroupId, channelNumber, taskCountInThisTaskGroup)); + + this.containerCommunicator.registerCommunication(taskConfigs); + + Map taskConfigMap = buildTaskConfigMap(taskConfigs); //taskId与task配置 + List taskQueue = buildRemainTasks(taskConfigs); //待运行task列表 + Map taskFailedExecutorMap = new HashMap(); //taskId与上次失败实例 + List runTasks = new ArrayList(channelNumber); //正在运行task + Map taskStartTimeMap = new HashMap(); //任务开始时间 + + long lastReportTimeStamp = 0; + Communication lastTaskGroupContainerCommunication = new Communication(); + + while (true) { + //1.判断task状态 + boolean failedOrKilled = false; + Map communicationMap = containerCommunicator.getCommunicationMap(); + for(Map.Entry entry : communicationMap.entrySet()){ + Integer taskId = entry.getKey(); + Communication taskCommunication = entry.getValue(); + if(!taskCommunication.isFinished()){ + continue; + } + TaskExecutor taskExecutor = removeTask(runTasks, taskId); + + //上面从runTasks里移除了,因此对应在monitor里移除 + taskMonitor.removeTask(taskId); + + //失败,看task是否支持failover,重试次数未超过最大限制 + if(taskCommunication.getState() == State.FAILED){ + taskFailedExecutorMap.put(taskId, taskExecutor); + if(taskExecutor.supportFailOver() && 
taskExecutor.getAttemptCount() < taskMaxRetryTimes){ + taskExecutor.shutdown(); //关闭老的executor + containerCommunicator.resetCommunication(taskId); //将task的状态重置 + Configuration taskConfig = taskConfigMap.get(taskId); + taskQueue.add(taskConfig); //重新加入任务列表 + }else{ + failedOrKilled = true; + break; + } + }else if(taskCommunication.getState() == State.KILLED){ + failedOrKilled = true; + break; + }else if(taskCommunication.getState() == State.SUCCEEDED){ + Long taskStartTime = taskStartTimeMap.get(taskId); + if(taskStartTime != null){ + Long usedTime = System.currentTimeMillis() - taskStartTime; + LOG.info("taskGroup[{}] taskId[{}] is successed, used[{}]ms", + this.taskGroupId, taskId, usedTime); + //usedTime*1000*1000 转换成PerfRecord记录的ns,这里主要是简单登记,进行最长任务的打印。因此增加特定静态方法 + PerfRecord.addPerfRecord(taskGroupId, taskId, PerfRecord.PHASE.TASK_TOTAL,taskStartTime, usedTime * 1000L * 1000L); + taskStartTimeMap.remove(taskId); + taskConfigMap.remove(taskId); + } + } + } + + // 2.发现该taskGroup下taskExecutor的总状态失败则汇报错误 + if (failedOrKilled) { + lastTaskGroupContainerCommunication = reportTaskGroupCommunication( + lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_RUNTIME_ERROR, lastTaskGroupContainerCommunication.getThrowable()); + } + + //3.有任务未执行,且正在运行的任务数小于最大通道限制 + Iterator iterator = taskQueue.iterator(); + while(iterator.hasNext() && runTasks.size() < channelNumber){ + Configuration taskConfig = iterator.next(); + Integer taskId = taskConfig.getInt(CoreConstant.TASK_ID); + int attemptCount = 1; + TaskExecutor lastExecutor = taskFailedExecutorMap.get(taskId); + if(lastExecutor!=null){ + attemptCount = lastExecutor.getAttemptCount() + 1; + long now = System.currentTimeMillis(); + long failedTime = lastExecutor.getTimeStamp(); + if(now - failedTime < taskRetryIntervalInMsec){ //未到等待时间,继续留在队列 + continue; + } + if(!lastExecutor.isShutdown()){ //上次失败的task仍未结束 + if(now - failedTime > taskMaxWaitInMsec){ + markCommunicationFailed(taskId); + reportTaskGroupCommunication(lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + throw DataXException.asDataXException(CommonErrorCode.WAIT_TIME_EXCEED, "task failover等待超时"); + }else{ + lastExecutor.shutdown(); //再次尝试关闭 + continue; + } + }else{ + LOG.info("taskGroup[{}] taskId[{}] attemptCount[{}] has already shutdown", + this.taskGroupId, taskId, lastExecutor.getAttemptCount()); + } + } + Configuration taskConfigForRun = taskMaxRetryTimes > 1 ? 
taskConfig.clone() : taskConfig; + TaskExecutor taskExecutor = new TaskExecutor(taskConfigForRun, attemptCount); + taskStartTimeMap.put(taskId, System.currentTimeMillis()); + taskExecutor.doStart(); + + iterator.remove(); + runTasks.add(taskExecutor); + + //上面,增加task到runTasks列表,因此在monitor里注册。 + taskMonitor.registerTask(taskId, this.containerCommunicator.getCommunication(taskId)); + + taskFailedExecutorMap.remove(taskId); + LOG.info("taskGroup[{}] taskId[{}] attemptCount[{}] is started", + this.taskGroupId, taskId, attemptCount); + } + + //4.任务列表为空,executor已结束, 搜集状态为success--->成功 + if (taskQueue.isEmpty() && isAllTaskDone(runTasks) && containerCommunicator.collectState() == State.SUCCEEDED) { + // 成功的情况下,也需要汇报一次。否则在任务结束非常快的情况下,采集的信息将会不准确 + lastTaskGroupContainerCommunication = reportTaskGroupCommunication( + lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + + LOG.info("taskGroup[{}] completed it's tasks.", this.taskGroupId); + break; + } + + // 5.如果当前时间已经超出汇报时间的interval,那么我们需要马上汇报 + long now = System.currentTimeMillis(); + if (now - lastReportTimeStamp > reportIntervalInMillSec) { + lastTaskGroupContainerCommunication = reportTaskGroupCommunication( + lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + + lastReportTimeStamp = now; + + //taskMonitor对于正在运行的task,每reportIntervalInMillSec进行检查 + for(TaskExecutor taskExecutor:runTasks){ + taskMonitor.report(taskExecutor.getTaskId(),this.containerCommunicator.getCommunication(taskExecutor.getTaskId())); + } + + } + + Thread.sleep(sleepIntervalInMillSec); + } + + //6.最后还要汇报一次 + reportTaskGroupCommunication(lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + + } catch (Throwable e) { + Communication nowTaskGroupContainerCommunication = this.containerCommunicator.collect(); + + if (nowTaskGroupContainerCommunication.getThrowable() == null) { + nowTaskGroupContainerCommunication.setThrowable(e); + } + nowTaskGroupContainerCommunication.setState(State.FAILED); + this.containerCommunicator.report(nowTaskGroupContainerCommunication); + + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + }finally { + if(!PerfTrace.getInstance().isJob()){ + //最后打印cpu的平均消耗,GC的统计 + VMInfo vmInfo = VMInfo.getVmInfo(); + if (vmInfo != null) { + vmInfo.getDelta(false); + LOG.info(vmInfo.totalString()); + } + + LOG.info(PerfTrace.getInstance().summarizeNoException()); + } + } + } + + private Map buildTaskConfigMap(List configurations){ + Map map = new HashMap(); + for(Configuration taskConfig : configurations){ + int taskId = taskConfig.getInt(CoreConstant.TASK_ID); + map.put(taskId, taskConfig); + } + return map; + } + + private List buildRemainTasks(List configurations){ + List remainTasks = new LinkedList(); + for(Configuration taskConfig : configurations){ + remainTasks.add(taskConfig); + } + return remainTasks; + } + + private TaskExecutor removeTask(List taskList, int taskId){ + Iterator iterator = taskList.iterator(); + while(iterator.hasNext()){ + TaskExecutor taskExecutor = iterator.next(); + if(taskExecutor.getTaskId() == taskId){ + iterator.remove(); + return taskExecutor; + } + } + return null; + } + + private boolean isAllTaskDone(List taskList){ + for(TaskExecutor taskExecutor : taskList){ + if(!taskExecutor.isTaskFinished()){ + return false; + } + } + return true; + } + + private Communication reportTaskGroupCommunication(Communication lastTaskGroupContainerCommunication, int taskCount){ + Communication nowTaskGroupContainerCommunication = this.containerCommunicator.collect(); + 
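+        // Stamp the fresh snapshot before diffing it against the previous report, so
+        // CommunicationTool.getReportCommunication can derive byte/record speed from
+        // the interval between the two timestamps.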
nowTaskGroupContainerCommunication.setTimestamp(System.currentTimeMillis()); + Communication reportCommunication = CommunicationTool.getReportCommunication(nowTaskGroupContainerCommunication, + lastTaskGroupContainerCommunication, taskCount); + this.containerCommunicator.report(reportCommunication); + return reportCommunication; + } + + private void markCommunicationFailed(Integer taskId){ + Communication communication = containerCommunicator.getCommunication(taskId); + communication.setState(State.FAILED); + } + + /** + * TaskExecutor是一个完整task的执行器 + * 其中包括1:1的reader和writer + */ + class TaskExecutor { + private Configuration taskConfig; + + private int taskId; + + private int attemptCount; + + private Channel channel; + + private Thread readerThread; + + private Thread writerThread; + + private ReaderRunner readerRunner; + + private WriterRunner writerRunner; + + /** + * 该处的taskCommunication在多处用到: + * 1. channel + * 2. readerRunner和writerRunner + * 3. reader和writer的taskPluginCollector + */ + private Communication taskCommunication; + + public TaskExecutor(Configuration taskConf, int attemptCount) { + // 获取该taskExecutor的配置 + this.taskConfig = taskConf; + Validate.isTrue(null != this.taskConfig.getConfiguration(CoreConstant.JOB_READER) + && null != this.taskConfig.getConfiguration(CoreConstant.JOB_WRITER), + "[reader|writer]的插件参数不能为空!"); + + // 得到taskId + this.taskId = this.taskConfig.getInt(CoreConstant.TASK_ID); + this.attemptCount = attemptCount; + + /** + * 由taskId得到该taskExecutor的Communication + * 要传给readerRunner和writerRunner,同时要传给channel作统计用 + */ + this.taskCommunication = containerCommunicator + .getCommunication(taskId); + Validate.notNull(this.taskCommunication, + String.format("taskId[%d]的Communication没有注册过", taskId)); + this.channel = ClassUtil.instantiate(channelClazz, + Channel.class, configuration); + this.channel.setCommunication(this.taskCommunication); + + /** + * 生成writerThread + */ + writerRunner = (WriterRunner) generateRunner(PluginType.WRITER); + this.writerThread = new Thread(writerRunner, + String.format("%d-%d-%d-writer", + jobId, taskGroupId, this.taskId)); + //通过设置thread的contextClassLoader,即可实现同步和主程序不通的加载器 + this.writerThread.setContextClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.taskConfig.getString( + CoreConstant.JOB_WRITER_NAME))); + + /** + * 生成readerThread + */ + readerRunner = (ReaderRunner) generateRunner(PluginType.READER); + this.readerThread = new Thread(readerRunner, + String.format("%d-%d-%d-reader", + jobId, taskGroupId, this.taskId)); + /** + * 通过设置thread的contextClassLoader,即可实现同步和主程序不通的加载器 + */ + this.readerThread.setContextClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.taskConfig.getString( + CoreConstant.JOB_READER_NAME))); + } + + public void doStart() { + this.writerThread.start(); + + // reader没有起来,writer不可能结束 + if (!this.writerThread.isAlive() || this.taskCommunication.getState() == State.FAILED) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + this.taskCommunication.getThrowable()); + } + + this.readerThread.start(); + + // 这里reader可能很快结束 + if (!this.readerThread.isAlive() && this.taskCommunication.getState() == State.FAILED) { + // 这里有可能出现Reader线上启动即挂情况 对于这类情况 需要立刻抛出异常 + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + this.taskCommunication.getThrowable()); + } + + } + + private AbstractRunner generateRunner(PluginType pluginType) { + AbstractRunner newRunner = null; + TaskPluginCollector pluginCollector; + + switch (pluginType) { + case READER: + 
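+                    // Reader branch: load the reader's runner, give it only the reader's
+                    // parameter block, and wire a BufferedRecordExchanger so produced
+                    // records flow into this task's channel while dirty data goes to the
+                    // task plugin collector.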
newRunner = LoadUtil.loadPluginRunner(pluginType, + this.taskConfig.getString(CoreConstant.JOB_READER_NAME)); + newRunner.setJobConf(this.taskConfig.getConfiguration( + CoreConstant.JOB_READER_PARAMETER)); + + pluginCollector = ClassUtil.instantiate( + taskCollectorClass, AbstractTaskPluginCollector.class, + configuration, this.taskCommunication, + PluginType.READER); + + ((ReaderRunner) newRunner).setRecordSender( + new BufferedRecordExchanger(this.channel, pluginCollector)); + /** + * 设置taskPlugin的collector,用来处理脏数据和job/task通信 + */ + newRunner.setTaskPluginCollector(pluginCollector); + break; + case WRITER: + newRunner = LoadUtil.loadPluginRunner(pluginType, + this.taskConfig.getString(CoreConstant.JOB_WRITER_NAME)); + newRunner.setJobConf(this.taskConfig + .getConfiguration(CoreConstant.JOB_WRITER_PARAMETER)); + + pluginCollector = ClassUtil.instantiate( + taskCollectorClass, AbstractTaskPluginCollector.class, + configuration, this.taskCommunication, + PluginType.WRITER); + ((WriterRunner) newRunner).setRecordReceiver(new BufferedRecordExchanger( + this.channel, pluginCollector)); + /** + * 设置taskPlugin的collector,用来处理脏数据和job/task通信 + */ + newRunner.setTaskPluginCollector(pluginCollector); + break; + default: + throw DataXException.asDataXException(FrameworkErrorCode.ARGUMENT_ERROR, "Cant generateRunner for:" + pluginType); + } + + newRunner.setTaskGroupId(taskGroupId); + newRunner.setTaskId(this.taskId); + newRunner.setRunnerCommunication(this.taskCommunication); + + return newRunner; + } + + // 检查任务是否结束 + private boolean isTaskFinished() { + // 如果reader 或 writer没有完成工作,那么直接返回工作没有完成 + if (readerThread.isAlive() || writerThread.isAlive()) { + return false; + } + + if(taskCommunication==null || !taskCommunication.isFinished()){ + return false; + } + + return true; + } + + private int getTaskId(){ + return taskId; + } + + private long getTimeStamp(){ + return taskCommunication.getTimestamp(); + } + + private int getAttemptCount(){ + return attemptCount; + } + + private boolean supportFailOver(){ + return writerRunner.supportFailOver(); + } + + private void shutdown(){ + writerRunner.shutdown(); + readerRunner.shutdown(); + if(writerThread.isAlive()){ + writerThread.interrupt(); + } + if(readerThread.isAlive()){ + readerThread.interrupt(); + } + } + + private boolean isShutdown(){ + return !readerThread.isAlive() && !writerThread.isAlive(); + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskMonitor.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskMonitor.java new file mode 100644 index 000000000..d852c0e0c --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskMonitor.java @@ -0,0 +1,113 @@ +package com.alibaba.datax.core.taskgroup; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.concurrent.ConcurrentHashMap; + +/** + * Created by liqiang on 15/7/23. 
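+ *
+ * Watchdog for running tasks: report() compares the task's total read-record count
+ * with the last value it saw, and if the count has not moved within EXPIRED_TIME
+ * (172800s, i.e. 48 hours) the task's Communication is switched to FAILED with a
+ * TASK_HUNG_EXPIRED error, so a hung task cannot stall its task group forever.
+ *
+ * A rough sketch of the calling pattern used by TaskGroupContainer (simplified):
+ * <pre>
+ *   TaskMonitor monitor = TaskMonitor.getInstance();
+ *   monitor.registerTask(taskId, communication); // when the task executor starts
+ *   monitor.report(taskId, communication);       // on every report interval
+ *   monitor.removeTask(taskId);                  // once the task is finished
+ * </pre>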
+ */ +public class TaskMonitor { + + private static final Logger LOG = LoggerFactory.getLogger(TaskMonitor.class); + private static final TaskMonitor instance = new TaskMonitor(); + private static long EXPIRED_TIME = 172800 * 1000; + + private ConcurrentHashMap tasks = new ConcurrentHashMap(); + + private TaskMonitor() { + } + + public static TaskMonitor getInstance() { + return instance; + } + + public void registerTask(Integer taskid, Communication communication) { + //如果task已经finish,直接返回 + if (communication.isFinished()) { + return; + } + tasks.putIfAbsent(taskid, new TaskCommunication(taskid, communication)); + } + + public void removeTask(Integer taskid) { + tasks.remove(taskid); + } + + public void report(Integer taskid, Communication communication) { + //如果task已经finish,直接返回 + if (communication.isFinished()) { + return; + } + if (!tasks.containsKey(taskid)) { + LOG.warn("unexpected: taskid({}) missed.", taskid); + tasks.putIfAbsent(taskid, new TaskCommunication(taskid, communication)); + } else { + tasks.get(taskid).report(communication); + } + } + + public TaskCommunication getTaskCommunication(Integer taskid) { + return tasks.get(taskid); + } + + + public static class TaskCommunication { + private Integer taskid; + //记录最后更新的communication + private long lastAllReadRecords = -1; + //只有第一次,或者统计变更时才会更新TS + private long lastUpdateComunicationTS; + private long ttl; + + private TaskCommunication(Integer taskid, Communication communication) { + this.taskid = taskid; + lastAllReadRecords = CommunicationTool.getTotalReadRecords(communication); + ttl = System.currentTimeMillis(); + lastUpdateComunicationTS = ttl; + } + + public void report(Communication communication) { + + ttl = System.currentTimeMillis(); + //采集的数量增长,则变更当前记录, 优先判断这个条件,因为目的是不卡住,而不是expired + if (CommunicationTool.getTotalReadRecords(communication) > lastAllReadRecords) { + lastAllReadRecords = CommunicationTool.getTotalReadRecords(communication); + lastUpdateComunicationTS = ttl; + } else if (isExpired(lastUpdateComunicationTS)) { + communication.setState(State.FAILED); + communication.setTimestamp(ttl); + communication.setThrowable(DataXException.asDataXException(CommonErrorCode.TASK_HUNG_EXPIRED, + String.format("task(%s) hung expired [allReadRecord(%s), elased(%s)]", taskid, lastAllReadRecords, (ttl - lastUpdateComunicationTS)))); + } + + + } + + private boolean isExpired(long lastUpdateComunicationTS) { + return System.currentTimeMillis() - lastUpdateComunicationTS > EXPIRED_TIME; + } + + public Integer getTaskid() { + return taskid; + } + + public long getLastAllReadRecords() { + return lastAllReadRecords; + } + + public long getLastUpdateComunicationTS() { + return lastUpdateComunicationTS; + } + + public long getTtl() { + return ttl; + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/AbstractRunner.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/AbstractRunner.java new file mode 100755 index 000000000..4820698e2 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/AbstractRunner.java @@ -0,0 +1,113 @@ +package com.alibaba.datax.core.taskgroup.runner; + +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import 
org.apache.commons.lang.Validate; + +public abstract class AbstractRunner { + private AbstractTaskPlugin plugin; + + private Configuration jobConf; + + private Communication runnerCommunication; + + private int taskGroupId; + + private int taskId; + + public AbstractRunner(AbstractTaskPlugin taskPlugin) { + this.plugin = taskPlugin; + } + + public void destroy() { + if (this.plugin != null) { + this.plugin.destroy(); + } + } + + public State getRunnerState() { + return this.runnerCommunication.getState(); + } + + public AbstractTaskPlugin getPlugin() { + return plugin; + } + + public void setPlugin(AbstractTaskPlugin plugin) { + this.plugin = plugin; + } + + public Configuration getJobConf() { + return jobConf; + } + + public void setJobConf(Configuration jobConf) { + this.jobConf = jobConf; + this.plugin.setPluginJobConf(jobConf); + } + + public void setTaskPluginCollector(TaskPluginCollector pluginCollector) { + this.plugin.setTaskPluginCollector(pluginCollector); + } + + private void mark(State state) { + this.runnerCommunication.setState(state); + // 对 stage + 1 + this.runnerCommunication.setLongCounter(CommunicationTool.STAGE, + this.runnerCommunication.getLongCounter(CommunicationTool.STAGE) + 1); + } + + public void markRun() { + mark(State.RUNNING); + } + + public void markSuccess() { + mark(State.SUCCEEDED); + } + + public void markFail(final Throwable throwable) { + mark(State.FAILED); + this.runnerCommunication.setTimestamp(System.currentTimeMillis()); + this.runnerCommunication.setThrowable(throwable); + } + + /** + * @param taskGroupId the taskGroupId to set + */ + public void setTaskGroupId(int taskGroupId) { + this.taskGroupId = taskGroupId; + this.plugin.setTaskGroupId(taskGroupId); + } + + /** + * @return the taskGroupId + */ + public int getTaskGroupId() { + return taskGroupId; + } + + public int getTaskId() { + return taskId; + } + + public void setTaskId(int taskId) { + this.taskId = taskId; + this.plugin.setTaskId(taskId); + } + + public void setRunnerCommunication(final Communication runnerCommunication) { + Validate.notNull(runnerCommunication, + "插件的Communication不能为空"); + this.runnerCommunication = runnerCommunication; + } + + public Communication getRunnerCommunication() { + return runnerCommunication; + } + + public abstract void shutdown(); +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/ReaderRunner.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/ReaderRunner.java new file mode 100755 index 000000000..284d1b6b9 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/ReaderRunner.java @@ -0,0 +1,88 @@ +package com.alibaba.datax.core.taskgroup.runner; + +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Created by jingxing on 14-9-1. + *

+ * 单个slice的reader执行调用 + */ +public class ReaderRunner extends AbstractRunner implements Runnable { + + private static final Logger LOG = LoggerFactory + .getLogger(ReaderRunner.class); + + private RecordSender recordSender; + + public void setRecordSender(RecordSender recordSender) { + this.recordSender = recordSender; + } + + public ReaderRunner(AbstractTaskPlugin abstractTaskPlugin) { + super(abstractTaskPlugin); + } + + @Override + public void run() { + assert null != this.recordSender; + + Reader.Task taskReader = (Reader.Task) this.getPlugin(); + + //统计waitWriterTime,并且在finally才end。 + PerfRecord channelWaitWrite = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WAIT_WRITE_TIME); + try { + channelWaitWrite.start(); + + LOG.debug("task reader starts to do init ..."); + PerfRecord initPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_INIT); + initPerfRecord.start(); + taskReader.init(); + initPerfRecord.end(); + + LOG.debug("task reader starts to do prepare ..."); + PerfRecord preparePerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_PREPARE); + preparePerfRecord.start(); + taskReader.prepare(); + preparePerfRecord.end(); + + LOG.debug("task reader starts to read ..."); + PerfRecord dataPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_DATA); + dataPerfRecord.start(); + taskReader.startRead(recordSender); + recordSender.terminate(); + + dataPerfRecord.addCount(CommunicationTool.getTotalReadRecords(super.getRunnerCommunication())); + dataPerfRecord.addSize(CommunicationTool.getTotalReadBytes(super.getRunnerCommunication())); + dataPerfRecord.end(); + + LOG.debug("task reader starts to do post ..."); + PerfRecord postPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_POST); + postPerfRecord.start(); + taskReader.post(); + postPerfRecord.end(); + // automatic flush + // super.markSuccess(); 这里不能标记为成功,成功的标志由 writerRunner 来标志(否则可能导致 reader 先结束,而 writer 还没有结束的严重 bug) + } catch (Throwable e) { + LOG.error("Reader runner Received Exceptions:", e); + super.markFail(e); + } finally { + LOG.debug("task reader starts to do destroy ..."); + PerfRecord desPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_DESTROY); + desPerfRecord.start(); + super.destroy(); + desPerfRecord.end(); + + channelWaitWrite.end(super.getRunnerCommunication().getLongCounter(CommunicationTool.WAIT_WRITER_TIME)); + } + } + + public void shutdown(){ + recordSender.shutdown(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/TaskGroupContainerRunner.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/TaskGroupContainerRunner.java new file mode 100755 index 000000000..ab4af808b --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/TaskGroupContainerRunner.java @@ -0,0 +1,44 @@ +package com.alibaba.datax.core.taskgroup.runner; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.taskgroup.TaskGroupContainer; +import com.alibaba.datax.core.util.FrameworkErrorCode; + +public class TaskGroupContainerRunner implements Runnable { + + private TaskGroupContainer taskGroupContainer; + + private State state; + + public TaskGroupContainerRunner(TaskGroupContainer taskGroup) { + this.taskGroupContainer = taskGroup; + this.state = State.SUCCEEDED; + } + + @Override + public void run() { + try { + 
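+ // name the worker thread after its task group id so its log output is easy to attribute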
Thread.currentThread().setName( + String.format("taskGroup-%d", this.taskGroupContainer.getTaskGroupId())); + this.taskGroupContainer.start(); + this.state = State.SUCCEEDED; + } catch (Throwable e) { + this.state = State.FAILED; + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } + } + + public TaskGroupContainer getTaskGroupContainer() { + return taskGroupContainer; + } + + public State getState() { + return state; + } + + public void setState(State state) { + this.state = state; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/WriterRunner.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/WriterRunner.java new file mode 100755 index 000000000..8fa5d68be --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/WriterRunner.java @@ -0,0 +1,90 @@ +package com.alibaba.datax.core.taskgroup.runner; + +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import org.apache.commons.lang3.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Created by jingxing on 14-9-1. + *

+ * 单个slice的writer执行调用 + */ +public class WriterRunner extends AbstractRunner implements Runnable { + + private static final Logger LOG = LoggerFactory + .getLogger(WriterRunner.class); + + private RecordReceiver recordReceiver; + + public void setRecordReceiver(RecordReceiver receiver) { + this.recordReceiver = receiver; + } + + public WriterRunner(AbstractTaskPlugin abstractTaskPlugin) { + super(abstractTaskPlugin); + } + + @Override + public void run() { + Validate.isTrue(this.recordReceiver != null); + + Writer.Task taskWriter = (Writer.Task) this.getPlugin(); + //统计waitReadTime,并且在finally end + PerfRecord channelWaitRead = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WAIT_READ_TIME); + try { + channelWaitRead.start(); + LOG.debug("task writer starts to do init ..."); + PerfRecord initPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord.start(); + taskWriter.init(); + initPerfRecord.end(); + + LOG.debug("task writer starts to do prepare ..."); + PerfRecord preparePerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_PREPARE); + preparePerfRecord.start(); + taskWriter.prepare(); + preparePerfRecord.end(); + LOG.debug("task writer starts to write ..."); + + PerfRecord dataPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_DATA); + dataPerfRecord.start(); + taskWriter.startWrite(recordReceiver); + + dataPerfRecord.addCount(CommunicationTool.getTotalReadRecords(super.getRunnerCommunication())); + dataPerfRecord.addSize(CommunicationTool.getTotalReadBytes(super.getRunnerCommunication())); + dataPerfRecord.end(); + + LOG.debug("task writer starts to do post ..."); + PerfRecord postPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_POST); + postPerfRecord.start(); + taskWriter.post(); + postPerfRecord.end(); + + super.markSuccess(); + } catch (Throwable e) { + LOG.error("Writer Runner Received Exceptions:", e); + super.markFail(e); + } finally { + LOG.debug("task writer starts to do destroy ..."); + PerfRecord desPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_DESTROY); + desPerfRecord.start(); + super.destroy(); + desPerfRecord.end(); + channelWaitRead.end(super.getRunnerCommunication().getLongCounter(CommunicationTool.WAIT_READER_TIME)); + } + } + + public boolean supportFailOver(){ + Writer.Task taskWriter = (Writer.Task) this.getPlugin(); + return taskWriter.supportFailOver(); + } + + public void shutdown(){ + recordReceiver.shutdown(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/channel/Channel.java b/core/src/main/java/com/alibaba/datax/core/transport/channel/Channel.java new file mode 100755 index 000000000..f2845b804 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/channel/Channel.java @@ -0,0 +1,247 @@ +package com.alibaba.datax.core.transport.channel; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.transport.record.TerminateRecord; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Collection; + +/** + * Created by jingxing on 14-8-25. + *

+ * 统计和限速都在这里 + */ +public abstract class Channel { + + private static final Logger LOG = LoggerFactory.getLogger(Channel.class); + + protected int taskGroupId; + + protected int capacity; + + protected int byteCapacity; + + protected long byteSpeed; // bps: bytes/s + + protected long recordSpeed; // tps: records/s + + protected long flowControlInterval; + + protected volatile boolean isClosed = false; + + protected Configuration configuration = null; + + protected volatile long waitReaderTime = 0; + + protected volatile long waitWriterTime = 0; + + private static Boolean isFirstPrint = true; + + private Communication currentCommunication; + + private Communication lastCommunication = new Communication(); + + public Channel(final Configuration configuration) { + //channel的queue里默认record为1万条。原来为512条 + int capacity = configuration.getInt( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY, 2048); + long byteSpeed = configuration.getLong( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_SPEED_BYTE, 1024 * 1024); + long recordSpeed = configuration.getLong( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_SPEED_RECORD, 10000); + + if (capacity <= 0) { + throw new IllegalArgumentException(String.format( + "通道容量[%d]必须大于0.", capacity)); + } + + synchronized (isFirstPrint) { + if (isFirstPrint) { + Channel.LOG.info("Channel set byte_speed_limit to " + byteSpeed + + (byteSpeed <= 0 ? ", No bps activated." : ".")); + Channel.LOG.info("Channel set record_speed_limit to " + recordSpeed + + (recordSpeed <= 0 ? ", No tps activated." : ".")); + isFirstPrint = false; + } + } + + this.taskGroupId = configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + this.capacity = capacity; + this.byteSpeed = byteSpeed; + this.recordSpeed = recordSpeed; + this.flowControlInterval = configuration.getLong( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_FLOWCONTROLINTERVAL, 1000); + //channel的queue默认大小为8M,原来为64M + this.byteCapacity = configuration.getInt( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE, 8 * 1024 * 1024); + this.configuration = configuration; + } + + public void close() { + this.isClosed = true; + } + + public void open() { + this.isClosed = false; + } + + public boolean isClosed() { + return isClosed; + } + + public int getTaskGroupId() { + return this.taskGroupId; + } + + public int getCapacity() { + return capacity; + } + + public long getByteSpeed() { + return byteSpeed; + } + + public Configuration getConfiguration() { + return this.configuration; + } + + public void setCommunication(final Communication communication) { + this.currentCommunication = communication; + this.lastCommunication.reset(); + } + + public void push(final Record r) { + Validate.notNull(r, "record不能为空."); + this.doPush(r); + this.statPush(1L, r.getByteSize()); + } + + public void pushTerminate(final TerminateRecord r) { + Validate.notNull(r, "record不能为空."); + this.doPush(r); + +// // 对 stage + 1 +// currentCommunication.setLongCounter(CommunicationTool.STAGE, +// currentCommunication.getLongCounter(CommunicationTool.STAGE) + 1); + } + + public void pushAll(final Collection rs) { + Validate.notNull(rs); + Validate.noNullElements(rs); + this.doPushAll(rs); + this.statPush(rs.size(), this.getByteSize(rs)); + } + + public Record pull() { + Record record = this.doPull(); + this.statPull(1L, record.getByteSize()); + return record; + } + + public void pullAll(final Collection rs) { + Validate.notNull(rs); + this.doPullAll(rs); + this.statPull(rs.size(), this.getByteSize(rs)); + } + + protected abstract void 
doPush(Record r); + + protected abstract void doPushAll(Collection rs); + + protected abstract Record doPull(); + + protected abstract void doPullAll(Collection rs); + + public abstract int size(); + + public abstract boolean isEmpty(); + + public abstract void clear(); + + private long getByteSize(final Collection rs) { + long size = 0; + for (final Record each : rs) { + size += each.getByteSize(); + } + return size; + } + + private void statPush(long recordSize, long byteSize) { + currentCommunication.increaseCounter(CommunicationTool.READ_SUCCEED_RECORDS, + recordSize); + currentCommunication.increaseCounter(CommunicationTool.READ_SUCCEED_BYTES, + byteSize); + //在读的时候进行统计waitCounter即可,因为写(pull)的时候可能正在阻塞,但读的时候已经能读到这个阻塞的counter数 + + currentCommunication.setLongCounter(CommunicationTool.WAIT_READER_TIME, waitReaderTime); + currentCommunication.setLongCounter(CommunicationTool.WAIT_WRITER_TIME, waitWriterTime); + + boolean isChannelByteSpeedLimit = (this.byteSpeed > 0); + boolean isChannelRecordSpeedLimit = (this.recordSpeed > 0); + if (!isChannelByteSpeedLimit && !isChannelRecordSpeedLimit) { + return; + } + + long lastTimestamp = lastCommunication.getTimestamp(); + long nowTimestamp = System.currentTimeMillis(); + long interval = nowTimestamp - lastTimestamp; + if (interval - this.flowControlInterval >= 0) { + long byteLimitSleepTime = 0; + long recordLimitSleepTime = 0; + if (isChannelByteSpeedLimit) { + long currentByteSpeed = (CommunicationTool.getTotalReadBytes(currentCommunication) - + CommunicationTool.getTotalReadBytes(lastCommunication)) * 1000 / interval; + if (currentByteSpeed > this.byteSpeed) { + // 计算根据byteLimit得到的休眠时间 + byteLimitSleepTime = currentByteSpeed * interval / this.byteSpeed + - interval; + } + } + + if (isChannelRecordSpeedLimit) { + long currentRecordSpeed = (CommunicationTool.getTotalReadRecords(currentCommunication) - + CommunicationTool.getTotalReadRecords(lastCommunication)) * 1000 / interval; + if (currentRecordSpeed > this.recordSpeed) { + // 计算根据recordLimit得到的休眠时间 + recordLimitSleepTime = currentRecordSpeed * interval / this.recordSpeed + - interval; + } + } + + // 休眠时间取较大值 + long sleepTime = byteLimitSleepTime < recordLimitSleepTime ? 
+ recordLimitSleepTime : byteLimitSleepTime; + if (sleepTime > 0) { + try { + Thread.sleep(sleepTime); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + } + + lastCommunication.setLongCounter(CommunicationTool.READ_SUCCEED_BYTES, + currentCommunication.getLongCounter(CommunicationTool.READ_SUCCEED_BYTES)); + lastCommunication.setLongCounter(CommunicationTool.READ_FAILED_BYTES, + currentCommunication.getLongCounter(CommunicationTool.READ_FAILED_BYTES)); + lastCommunication.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, + currentCommunication.getLongCounter(CommunicationTool.READ_SUCCEED_RECORDS)); + lastCommunication.setLongCounter(CommunicationTool.READ_FAILED_RECORDS, + currentCommunication.getLongCounter(CommunicationTool.READ_FAILED_RECORDS)); + lastCommunication.setTimestamp(nowTimestamp); + } + } + + private void statPull(long recordSize, long byteSize) { + currentCommunication.increaseCounter( + CommunicationTool.WRITE_RECEIVED_RECORDS, recordSize); + currentCommunication.increaseCounter( + CommunicationTool.WRITE_RECEIVED_BYTES, byteSize); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/channel/memory/MemoryChannel.java b/core/src/main/java/com/alibaba/datax/core/transport/channel/memory/MemoryChannel.java new file mode 100755 index 000000000..e49c7878c --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/channel/memory/MemoryChannel.java @@ -0,0 +1,146 @@ +package com.alibaba.datax.core.transport.channel.memory; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.record.TerminateRecord; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; + +import java.util.Collection; +import java.util.concurrent.ArrayBlockingQueue; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.concurrent.locks.Condition; +import java.util.concurrent.locks.ReentrantLock; + +/** + * 内存Channel的具体实现,底层其实是一个ArrayBlockingQueue + * + */ +public class MemoryChannel extends Channel { + + private int bufferSize = 0; + + private AtomicInteger memoryBytes = new AtomicInteger(0); + + private ArrayBlockingQueue queue = null; + + private ReentrantLock lock; + + private Condition notInsufficient, notEmpty; + + public MemoryChannel(final Configuration configuration) { + super(configuration); + this.queue = new ArrayBlockingQueue(this.getCapacity()); + this.bufferSize = configuration.getInt(CoreConstant.DATAX_CORE_TRANSPORT_EXCHANGER_BUFFERSIZE); + + lock = new ReentrantLock(); + notInsufficient = lock.newCondition(); + notEmpty = lock.newCondition(); + } + + @Override + public void close() { + super.close(); + try { + this.queue.put(TerminateRecord.get()); + } catch (InterruptedException ex) { + Thread.currentThread().interrupt(); + } + } + + @Override + public void clear(){ + this.queue.clear(); + } + + @Override + protected void doPush(Record r) { + try { + long startTime = System.nanoTime(); + this.queue.put(r); + waitWriterTime += System.nanoTime() - startTime; + memoryBytes.addAndGet(r.getMemorySize()); + } catch (InterruptedException ex) { + Thread.currentThread().interrupt(); + } + } + + @Override + protected void doPushAll(Collection rs) { + try { + long startTime = System.nanoTime(); + lock.lockInterruptibly(); 
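+ // wait until the whole batch fits: both the channel byte budget (byteCapacity) and the queue's remaining capacity must be able to take it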
+ int bytes = getRecordBytes(rs); + while (memoryBytes.get() + bytes > this.byteCapacity || rs.size() > this.queue.remainingCapacity()) { + notInsufficient.await(200L, TimeUnit.MILLISECONDS); + } + this.queue.addAll(rs); + waitWriterTime += System.nanoTime() - startTime; + memoryBytes.addAndGet(bytes); + notEmpty.signalAll(); + } catch (InterruptedException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } finally { + lock.unlock(); + } + } + + @Override + protected Record doPull() { + try { + long startTime = System.nanoTime(); + Record r = this.queue.take(); + waitReaderTime += System.nanoTime() - startTime; + memoryBytes.addAndGet(-r.getMemorySize()); + return r; + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new IllegalStateException(e); + } + } + + @Override + protected void doPullAll(Collection rs) { + assert rs != null; + rs.clear(); + try { + long startTime = System.nanoTime(); + lock.lockInterruptibly(); + while (this.queue.drainTo(rs, bufferSize) <= 0) { + notEmpty.await(200L, TimeUnit.MILLISECONDS); + } + waitReaderTime += System.nanoTime() - startTime; + int bytes = getRecordBytes(rs); + memoryBytes.addAndGet(-bytes); + notInsufficient.signalAll(); + } catch (InterruptedException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } finally { + lock.unlock(); + } + } + + private int getRecordBytes(Collection rs){ + int bytes = 0; + for(Record r : rs){ + bytes += r.getMemorySize(); + } + return bytes; + } + + @Override + public int size() { + return this.queue.size(); + } + + @Override + public boolean isEmpty() { + return this.queue.isEmpty(); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/exchanger/BufferedRecordExchanger.java b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/BufferedRecordExchanger.java new file mode 100755 index 000000000..4ea4902dd --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/BufferedRecordExchanger.java @@ -0,0 +1,156 @@ +package com.alibaba.datax.core.transport.exchanger; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.record.TerminateRecord; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.Validate; + +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.atomic.AtomicInteger; + +public class BufferedRecordExchanger implements RecordSender, RecordReceiver { + + private final Channel channel; + + private final Configuration configuration; + + private final List buffer; + + private int bufferSize ; + + protected final int byteCapacity; + + private final AtomicInteger memoryBytes = new AtomicInteger(0); + + private int bufferIndex = 0; + + private static Class RECORD_CLASS; + + private volatile boolean shutdown = false; + + private final TaskPluginCollector pluginCollector; + + @SuppressWarnings("unchecked") + public BufferedRecordExchanger(final Channel channel, final TaskPluginCollector pluginCollector) { + 
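+ // per-task staging buffer: records accumulate locally and are pushed to the shared channel in batches of up to bufferSize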
assert null != channel; + assert null != channel.getConfiguration(); + + this.channel = channel; + this.pluginCollector = pluginCollector; + this.configuration = channel.getConfiguration(); + + this.bufferSize = configuration + .getInt(CoreConstant.DATAX_CORE_TRANSPORT_EXCHANGER_BUFFERSIZE); + this.buffer = new ArrayList(bufferSize); + + //channel的queue默认大小为8M,原来为64M + this.byteCapacity = configuration.getInt( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE, 8 * 1024 * 1024); + + try { + BufferedRecordExchanger.RECORD_CLASS = ((Class) Class + .forName(configuration.getString( + CoreConstant.DATAX_CORE_TRANSPORT_RECORD_CLASS, + "com.alibaba.datax.core.transport.record.DefaultRecord"))); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public Record createRecord() { + try { + return BufferedRecordExchanger.RECORD_CLASS.newInstance(); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public void sendToWriter(Record record) { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + + Validate.notNull(record, "record不能为空."); + + if (record.getMemorySize() > this.byteCapacity) { + this.pluginCollector.collectDirtyRecord(record, new Exception(String.format("单条记录超过大小限制,当前限制为:%s", this.byteCapacity))); + return; + } + + boolean isFull = (this.bufferIndex >= this.bufferSize || this.memoryBytes.get() + record.getMemorySize() > this.byteCapacity); + if (isFull) { + flush(); + } + + this.buffer.add(record); + this.bufferIndex++; + memoryBytes.addAndGet(record.getMemorySize()); + } + + @Override + public void flush() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + this.channel.pushAll(this.buffer); + this.buffer.clear(); + this.bufferIndex = 0; + this.memoryBytes.set(0); + } + + @Override + public void terminate() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + flush(); + this.channel.pushTerminate(TerminateRecord.get()); + } + + @Override + public Record getFromReader() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + boolean isEmpty = (this.bufferIndex >= this.buffer.size()); + if (isEmpty) { + receive(); + } + + Record record = this.buffer.get(this.bufferIndex++); + if (record instanceof TerminateRecord) { + record = null; + } + return record; + } + + @Override + public void shutdown(){ + shutdown = true; + try{ + buffer.clear(); + channel.clear(); + }catch(Throwable t){ + t.printStackTrace(); + } + } + + private void receive() { + this.channel.pullAll(this.buffer); + this.bufferIndex = 0; + this.bufferSize = this.buffer.size(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/exchanger/RecordExchanger.java b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/RecordExchanger.java new file mode 100755 index 000000000..ed236173a --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/RecordExchanger.java @@ -0,0 +1,99 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.core.transport.exchanger; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.record.TerminateRecord; + +public class RecordExchanger implements RecordSender, RecordReceiver { + + private Channel channel; + + private Configuration configuration; + + private static Class RECORD_CLASS; + + private volatile boolean shutdown = false; + + @SuppressWarnings("unchecked") + public RecordExchanger(final Channel channel) { + assert channel != null; + this.channel = channel; + this.configuration = channel.getConfiguration(); + try { + RecordExchanger.RECORD_CLASS = (Class) Class + .forName(configuration.getString( + CoreConstant.DATAX_CORE_TRANSPORT_RECORD_CLASS, + "com.alibaba.datax.core.transport.record.DefaultRecord")); + } catch (ClassNotFoundException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public Record getFromReader() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + Record record = this.channel.pull(); + return (record instanceof TerminateRecord ? 
null : record); + } + + @Override + public Record createRecord() { + try { + return RECORD_CLASS.newInstance(); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public void sendToWriter(Record record) { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + this.channel.push(record); + } + + @Override + public void flush() { + } + + @Override + public void terminate() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + this.channel.pushTerminate(TerminateRecord.get()); + } + + @Override + public void shutdown(){ + shutdown = true; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/record/DefaultRecord.java b/core/src/main/java/com/alibaba/datax/core/transport/record/DefaultRecord.java new file mode 100755 index 000000000..2598bc8c8 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/record/DefaultRecord.java @@ -0,0 +1,119 @@ +package com.alibaba.datax.core.transport.record; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.util.ClassSize; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.fastjson.JSON; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * Created by jingxing on 14-8-24. + */ + +public class DefaultRecord implements Record { + + private static final int RECORD_AVERGAE_COLUMN_NUMBER = 16; + + private List columns; + + private int byteSize; + + // 首先是Record本身需要的内存 + private int memorySize = ClassSize.DefaultRecordHead; + + public DefaultRecord() { + this.columns = new ArrayList(RECORD_AVERGAE_COLUMN_NUMBER); + } + + @Override + public void addColumn(Column column) { + columns.add(column); + incrByteSize(column); + } + + @Override + public Column getColumn(int i) { + if (i < 0 || i >= columns.size()) { + return null; + } + return columns.get(i); + } + + @Override + public void setColumn(int i, final Column column) { + if (i < 0) { + throw DataXException.asDataXException(FrameworkErrorCode.ARGUMENT_ERROR, + "不能给index小于0的column设置值"); + } + + if (i >= columns.size()) { + expandCapacity(i + 1); + } + + decrByteSize(getColumn(i)); + this.columns.set(i, column); + incrByteSize(getColumn(i)); + } + + @Override + public String toString() { + Map json = new HashMap(); + json.put("size", this.getColumnNumber()); + json.put("data", this.columns); + return JSON.toJSONString(json); + } + + @Override + public int getColumnNumber() { + return this.columns.size(); + } + + @Override + public int getByteSize() { + return byteSize; + } + + public int getMemorySize(){ + return memorySize; + } + + private void decrByteSize(final Column column) { + if (null == column) { + return; + } + + byteSize -= column.getByteSize(); + + //内存的占用是column对象的头 再加实际大小 + memorySize = memorySize - ClassSize.ColumnHead - column.getByteSize(); + } + + private void incrByteSize(final Column column) { + if (null == column) { + return; + } + + byteSize += column.getByteSize(); + + //内存的占用是column对象的头 再加实际大小 + memorySize = memorySize + ClassSize.ColumnHead + column.getByteSize(); + } + + private void expandCapacity(int totalSize) { + if (totalSize <= 0) { + return; + } + + int needToExpand = totalSize - columns.size(); + while (needToExpand-- > 0) { + this.columns.add(null); + } + } + +} 
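The record/exchanger API above is easiest to see from the plugin side. The fragment below is an illustrative sketch only, not part of this change set: it shows method bodies as they might appear in a Reader.Task and a Writer.Task, and it assumes the Column implementations StringColumn and LongColumn from the datax-common element package.

public void startRead(RecordSender recordSender) {
    Record record = recordSender.createRecord();     // allocates the configured record class (DefaultRecord by default)
    record.addColumn(new StringColumn("hello"));     // StringColumn/LongColumn are assumed to be available here
    record.addColumn(new LongColumn(42L));
    recordSender.sendToWriter(record);               // staged in the exchanger buffer; flushed when the buffer fills
    // ReaderRunner calls recordSender.terminate() after startRead returns, which
    // flushes the remaining buffer and pushes the TerminateRecord marker.
}

public void startWrite(RecordReceiver recordReceiver) {
    Record record;
    while ((record = recordReceiver.getFromReader()) != null) {  // null means TerminateRecord was reached
        // write the record to the target storage here
    }
}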
diff --git a/core/src/main/java/com/alibaba/datax/core/transport/record/TerminateRecord.java b/core/src/main/java/com/alibaba/datax/core/transport/record/TerminateRecord.java new file mode 100755 index 000000000..928609abd --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/record/TerminateRecord.java @@ -0,0 +1,48 @@ +package com.alibaba.datax.core.transport.record; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; + +/** + * 作为标示 生产者已经完成生产的标志 + * + */ +public class TerminateRecord implements Record { + private final static TerminateRecord SINGLE = new TerminateRecord(); + + private TerminateRecord() { + } + + public static TerminateRecord get() { + return SINGLE; + } + + @Override + public void addColumn(Column column) { + } + + @Override + public Column getColumn(int i) { + return null; + } + + @Override + public int getColumnNumber() { + return 0; + } + + @Override + public int getByteSize() { + return 0; + } + + @Override + public int getMemorySize() { + return 0; + } + + @Override + public void setColumn(int i, Column column) { + return; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ClassSize.java b/core/src/main/java/com/alibaba/datax/core/util/ClassSize.java new file mode 100644 index 000000000..1be49addf --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ClassSize.java @@ -0,0 +1,42 @@ +package com.alibaba.datax.core.util; + +/** + * Created by liqiang on 15/12/12. + */ +public class ClassSize { + + public static final int DefaultRecordHead; + public static final int ColumnHead; + + //objectHead的大小 + public static final int REFERENCE; + public static final int OBJECT; + public static final int ARRAY; + public static final int ARRAYLIST; + static { + //only 64位 + REFERENCE = 8; + + OBJECT = 2 * REFERENCE; + + ARRAY = align(3 * REFERENCE); + + // 16+8+24+16 + ARRAYLIST = align(OBJECT + align(REFERENCE) + align(ARRAY) + + (2 * Long.SIZE / Byte.SIZE)); + // 8+64+8 + DefaultRecordHead = align(align(REFERENCE) + ClassSize.ARRAYLIST + 2 * Integer.SIZE / Byte.SIZE); + //16+4 + ColumnHead = align(2 * REFERENCE + Integer.SIZE / Byte.SIZE); + } + + public static int align(int num) { + return (int)(align((long)num)); + } + + public static long align(long num) { + //The 7 comes from that the alignSize is 8 which is the number of bytes + //stored and sent together + return ((num + 7) >> 3) << 3; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ClassUtil.java b/core/src/main/java/com/alibaba/datax/core/util/ClassUtil.java new file mode 100755 index 000000000..0cf0d5617 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ClassUtil.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.core.util; + +import java.lang.reflect.Constructor; + +public final class ClassUtil { + + /** + * 通过反射构造类对象 + * + * @param className + * 反射的类名称 + * @param t + * 反射类的类型Class对象 + * @param args + * 构造参数 + * + * */ + @SuppressWarnings({ "rawtypes", "unchecked" }) + public static T instantiate(String className, Class t, + Object... 
args) { + try { + Constructor constructor = (Constructor) Class.forName(className) + .getConstructor(ClassUtil.toClassType(args)); + return (T) constructor.newInstance(args); + } catch (Exception e) { + throw new IllegalArgumentException(e); + } + } + + private static Class[] toClassType(Object[] args) { + Class[] clazzs = new Class[args.length]; + + for (int i = 0, length = args.length; i < length; i++) { + clazzs[i] = args[i].getClass(); + } + + return clazzs; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ConfigParser.java b/core/src/main/java/com/alibaba/datax/core/util/ConfigParser.java new file mode 100755 index 000000000..dadef752b --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ConfigParser.java @@ -0,0 +1,181 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.google.common.collect.Lists; +import org.apache.commons.io.FileUtils; +import org.apache.commons.lang.StringUtils; +import org.apache.http.client.methods.HttpGet; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.io.IOException; +import java.net.URL; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +public final class ConfigParser { + private static final Logger LOG = LoggerFactory.getLogger(ConfigParser.class); + /** + * 指定Job配置路径,ConfigParser会解析Job、Plugin、Core全部信息,并以Configuration返回 + */ + public static Configuration parse(final String jobPath) { + Configuration configuration = ConfigParser.parseJobConfig(jobPath); + + configuration.merge( + ConfigParser.parseCoreConfig(CoreConstant.DATAX_CONF_PATH), + false); + // todo config优化,只捕获需要的plugin + String readerPluginName = configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_READER_NAME); + String writerPluginName = configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_WRITER_NAME); + try { + configuration.merge(parsePluginConfig(Lists.newArrayList(readerPluginName, writerPluginName)), false); + }catch (Exception e){ + //吞掉异常,保持log干净。这里message足够。 + LOG.warn(String.format("插件[%s,%s]加载失败,1s后重试... 
Exception:%s ", readerPluginName, writerPluginName, e.getMessage())); + try { + Thread.sleep(1000); + } catch (InterruptedException e1) { + // + } + configuration.merge(parsePluginConfig(Lists.newArrayList(readerPluginName, writerPluginName)), false); + } + + return configuration; + } + + private static Configuration parseCoreConfig(final String path) { + return Configuration.from(new File(path)); + } + + public static Configuration parseJobConfig(final String path) { + String jobContent = getJobContent(path); + Configuration config = Configuration.from(jobContent); + + return SecretUtil.decryptSecretKey(config); + } + + private static String getJobContent(String jobResource) { + String jobContent; + + boolean isJobResourceFromHttp = jobResource.trim().toLowerCase().startsWith("http"); + + + if (isJobResourceFromHttp) { + //设置httpclient的 HTTP_TIMEOUT_INMILLIONSECONDS + Configuration coreConfig = ConfigParser.parseCoreConfig(CoreConstant.DATAX_CONF_PATH); + int httpTimeOutInMillionSeconds = coreConfig.getInt( + CoreConstant.DATAX_CORE_DATAXSERVER_TIMEOUT, 5000); + HttpClientUtil.setHttpTimeoutInMillionSeconds(httpTimeOutInMillionSeconds); + + HttpClientUtil httpClientUtil = new HttpClientUtil(); + try { + URL url = new URL(jobResource); + HttpGet httpGet = HttpClientUtil.getGetRequest(); + httpGet.setURI(url.toURI()); + + jobContent = httpClientUtil.executeAndGetWithFailedRetry(httpGet, 6, 1000l); + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, "获取作业配置信息失败:" + jobResource, e); + } + } else { + // jobResource 是本地文件绝对路径 + try { + jobContent = FileUtils.readFileToString(new File(jobResource)); + } catch (IOException e) { + throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, "获取作业配置信息失败:" + jobResource, e); + } + } + + if (jobContent == null) { + throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, "获取作业配置信息失败:" + jobResource); + } + return jobContent; + } + + private static Configuration parsePluginConfig(List wantPluginNames) { + Configuration configuration = Configuration.newDefault(); + + Set replicaCheckPluginSet = new HashSet(); + int complete = 0; + for (final String each : ConfigParser + .getDirAsList(CoreConstant.DATAX_PLUGIN_READER_HOME)) { + Configuration eachReaderConfig = ConfigParser.parseOnePluginConfig(each, "reader", replicaCheckPluginSet, wantPluginNames); + if(eachReaderConfig!=null) { + configuration.merge(eachReaderConfig, true); + complete += 1; + } + } + + for (final String each : ConfigParser + .getDirAsList(CoreConstant.DATAX_PLUGIN_WRITER_HOME)) { + Configuration eachWriterConfig = ConfigParser.parseOnePluginConfig(each, "writer", replicaCheckPluginSet, wantPluginNames); + if(eachWriterConfig!=null) { + configuration.merge(eachWriterConfig, true); + complete += 1; + } + } + + if (wantPluginNames != null && wantPluginNames.size() > 0 && wantPluginNames.size() != complete) { + throw DataXException.asDataXException(FrameworkErrorCode.PLUGIN_INIT_ERROR, "插件加载失败,未完成指定插件加载:" + wantPluginNames); + } + + return configuration; + } + + + public static Configuration parseOnePluginConfig(final String path, + final String type, + Set pluginSet, List wantPluginNames) { + String filePath = path + File.separator + "plugin.json"; + Configuration configuration = Configuration.from(new File(filePath)); + + String pluginPath = configuration.getString("path"); + String pluginName = configuration.getString("name"); + if(!pluginSet.contains(pluginName)) { + pluginSet.add(pluginName); + } else { + throw 
DataXException.asDataXException(FrameworkErrorCode.PLUGIN_INIT_ERROR, "插件加载失败,存在重复插件:" + filePath); + } + + //不是想要的插件,返回null + if (wantPluginNames != null && wantPluginNames.size() > 0 && !wantPluginNames.contains(pluginName)) { + return null; + } + + boolean isDefaultPath = StringUtils.isBlank(pluginPath); + if (isDefaultPath) { + configuration.set("path", path); + } + + Configuration result = Configuration.newDefault(); + + result.set( + String.format("plugin.%s.%s", type, pluginName), + configuration.getInternal()); + + return result; + } + + private static List getDirAsList(String path) { + List result = new ArrayList(); + + String[] paths = new File(path).list(); + if (null == paths) { + return result; + } + + for (final String each : paths) { + result.add(path + File.separator + each); + } + + return result; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ConfigurationValidate.java b/core/src/main/java/com/alibaba/datax/core/util/ConfigurationValidate.java new file mode 100755 index 000000000..bc15bcf14 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ConfigurationValidate.java @@ -0,0 +1,33 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang.Validate; + +/** + * Created by jingxing on 14-9-16. + * + * 对配置文件做整体检查 + */ +public class ConfigurationValidate { + public static void doValidate(Configuration allConfig) { + Validate.isTrue(allConfig!=null, ""); + + coreValidate(allConfig); + + pluginValidate(allConfig); + + jobValidate(allConfig); + } + + private static void coreValidate(Configuration allconfig) { + return; + } + + private static void pluginValidate(Configuration allConfig) { + return; + } + + private static void jobValidate(Configuration allConfig) { + return; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ErrorRecordChecker.java b/core/src/main/java/com/alibaba/datax/core/util/ErrorRecordChecker.java new file mode 100755 index 000000000..ad7f80f61 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ErrorRecordChecker.java @@ -0,0 +1,82 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang3.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * 检查任务是否到达错误记录限制。有检查条数(recordLimit)和百分比(percentageLimit)两种方式。 + * 1. errorRecord表示出错条数不能大于限制数,当超过时任务失败。比如errorRecord为0表示不容许任何脏数据。 + * 2. errorPercentage表示出错比例,在任务结束时校验。 + * 3. 
errorRecord优先级高于errorPercentage。 + */ +public final class ErrorRecordChecker { + private static final Logger LOG = LoggerFactory + .getLogger(ErrorRecordChecker.class); + + private Long recordLimit; + private Double percentageLimit; + + public ErrorRecordChecker(Configuration configuration) { + this(configuration.getLong(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_RECORD), + configuration.getDouble(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_PERCENT)); + } + + public ErrorRecordChecker(Long rec, Double percentage) { + recordLimit = rec; + percentageLimit = percentage; + + if (percentageLimit != null) { + Validate.isTrue(0.0 <= percentageLimit && percentageLimit <= 1.0, + "脏数据百分比限制应该在[0.0, 1.0]之间"); + } + + if (recordLimit != null) { + Validate.isTrue(recordLimit >= 0, + "脏数据条数现在应该为非负整数"); + + // errorRecord优先级高于errorPercentage. + percentageLimit = null; + } + } + + public void checkRecordLimit(Communication communication) { + if (recordLimit == null) { + return; + } + + long errorNumber = CommunicationTool.getTotalErrorRecords(communication); + if (recordLimit < errorNumber) { + LOG.debug( + String.format("Error-limit set to %d, error count check.", + recordLimit)); + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_DIRTY_DATA_LIMIT_EXCEED, + String.format("脏数据条数检查不通过,限制是[%d]条,但实际上捕获了[%d]条.", + recordLimit, errorNumber)); + } + } + + public void checkPercentageLimit(Communication communication) { + if (percentageLimit == null) { + return; + } + LOG.debug(String.format( + "Error-limit set to %f, error percent check.", percentageLimit)); + + long total = CommunicationTool.getTotalReadRecords(communication); + long error = CommunicationTool.getTotalErrorRecords(communication); + + if (total > 0 && ((double) error / (double) total) > percentageLimit) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_DIRTY_DATA_LIMIT_EXCEED, + String.format("脏数据百分比检查不通过,限制是[%f],但实际上捕获到[%f].", + percentageLimit, ((double) error / (double) total))); + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ExceptionTracker.java b/core/src/main/java/com/alibaba/datax/core/util/ExceptionTracker.java new file mode 100755 index 000000000..d06f6798c --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ExceptionTracker.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.core.util; + +import java.io.PrintWriter; +import java.io.StringWriter; + +public class ExceptionTracker { + public static final int STRING_BUFFER = 4096; + + public static String trace(Throwable ex) { + StringWriter sw = new StringWriter(STRING_BUFFER); + PrintWriter pw = new PrintWriter(sw); + ex.printStackTrace(pw); + return sw.toString(); + } +} \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/core/util/FrameworkErrorCode.java b/core/src/main/java/com/alibaba/datax/core/util/FrameworkErrorCode.java new file mode 100755 index 000000000..f50f79350 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/FrameworkErrorCode.java @@ -0,0 +1,68 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * TODO: 根据现有日志数据分析各类错误,进行细化。 + * + *

请不要格式化本类代码

+ */ +public enum FrameworkErrorCode implements ErrorCode { + + INSTALL_ERROR("Framework-00", "DataX引擎安装错误, 请联系您的运维解决 ."), + ARGUMENT_ERROR("Framework-01", "DataX引擎运行错误,该问题通常是由于内部编程错误引起,请联系DataX开发团队解决 ."), + RUNTIME_ERROR("Framework-02", "DataX引擎运行过程出错,具体原因请参看DataX运行结束时的错误诊断信息 ."), + CONFIG_ERROR("Framework-03", "DataX引擎配置错误,该问题通常是由于DataX安装错误引起,请联系您的运维解决 ."), + SECRET_ERROR("Framework-04", "DataX引擎加解密出错,该问题通常是由于DataX密钥配置错误引起,请联系您的运维解决 ."), + HOOK_LOAD_ERROR("Framework-05", "加载外部Hook出现错误,通常是由于DataX安装引起的"), + HOOK_FAIL_ERROR("Framework-06", "执行外部Hook出现错误"), + + PLUGIN_INSTALL_ERROR("Framework-10", "DataX插件安装错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 ."), + PLUGIN_NOT_FOUND("Framework-11", "DataX插件配置错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 ."), + PLUGIN_INIT_ERROR("Framework-12", "DataX插件初始化错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 ."), + PLUGIN_RUNTIME_ERROR("Framework-13", "DataX插件运行时出错, 具体原因请参看DataX运行结束时的错误诊断信息 ."), + PLUGIN_DIRTY_DATA_LIMIT_EXCEED("Framework-14", "DataX传输脏数据超过用户预期,该错误通常是由于源端数据存在较多业务脏数据导致,请仔细检查DataX汇报的脏数据日志信息, 或者您可以适当调大脏数据阈值 ."), + PLUGIN_SPLIT_ERROR("Framework-15", "DataX插件切分出错, 该问题通常是由于DataX各个插件编程错误引起,请联系DataX开发团队解决"), + KILL_JOB_TIMEOUT_ERROR("Framework-16", "kill 任务超时,请联系PE解决"), + START_TASKGROUP_ERROR("Framework-17", "taskGroup启动失败,请联系DataX开发团队解决"), + CALL_DATAX_SERVICE_FAILED("Framework-18", "请求 DataX Service 出错."), + CALL_REMOTE_FAILED("Framework-19", "远程调用失败"), + KILLED_EXIT_VALUE("Framework-143", "Job 收到了 Kill 命令."); + + private final String code; + + private final String description; + + private FrameworkErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } + + /** + * 通过 "Framework-143" 来标示 任务是 Killed 状态 + */ + public int toExitValue() { + if (this == FrameworkErrorCode.KILLED_EXIT_VALUE) { + return 143; + } else { + return 1; + } + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/HttpClientUtil.java b/core/src/main/java/com/alibaba/datax/core/util/HttpClientUtil.java new file mode 100755 index 000000000..bc1f93a94 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/HttpClientUtil.java @@ -0,0 +1,169 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.RetryUtil; +import org.apache.http.Consts; +import org.apache.http.HttpEntity; +import org.apache.http.HttpResponse; +import org.apache.http.HttpStatus; +import org.apache.http.auth.AuthScope; +import org.apache.http.auth.UsernamePasswordCredentials; +import org.apache.http.client.CredentialsProvider; +import org.apache.http.client.config.RequestConfig; +import org.apache.http.client.methods.*; +import org.apache.http.impl.client.BasicCredentialsProvider; +import org.apache.http.impl.client.CloseableHttpClient; +import org.apache.http.impl.client.HttpClientBuilder; +import org.apache.http.util.EntityUtils; + +import java.io.IOException; +import java.util.Properties; +import java.util.concurrent.Callable; +import java.util.concurrent.ThreadPoolExecutor; + + +public class HttpClientUtil { + + private static CredentialsProvider provider; + + private CloseableHttpClient httpClient; + + private volatile static HttpClientUtil clientUtil; + + //构建httpclient的时候一定要设置这两个参数。淘宝很多生产故障都由此引起 + private static int HTTP_TIMEOUT_INMILLIONSECONDS = 5000; + + private static final int POOL_SIZE = 20; + + private static ThreadPoolExecutor asyncExecutor = RetryUtil.createThreadPoolExecutor(); + + public static void setHttpTimeoutInMillionSeconds(int httpTimeoutInMillionSeconds) { + HTTP_TIMEOUT_INMILLIONSECONDS = httpTimeoutInMillionSeconds; + } + + public static synchronized HttpClientUtil getHttpClientUtil() { + if (null == clientUtil) { + synchronized (HttpClientUtil.class) { + if (null == clientUtil) { + clientUtil = new HttpClientUtil(); + } + } + } + return clientUtil; + } + + public HttpClientUtil() { + Properties prob = SecretUtil.getSecurityProperties(); + HttpClientUtil.setBasicAuth(prob.getProperty("auth.user"),prob.getProperty("auth.pass")); + initApacheHttpClient(); + } + + public void destroy() { + destroyApacheHttpClient(); + } + + public static void setBasicAuth(String username, String password) { + provider = new BasicCredentialsProvider(); + provider.setCredentials(AuthScope.ANY, + new UsernamePasswordCredentials(username,password)); + } + + // 创建包含connection pool与超时设置的client + private void initApacheHttpClient() { + RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(HTTP_TIMEOUT_INMILLIONSECONDS) + .setConnectTimeout(HTTP_TIMEOUT_INMILLIONSECONDS).setConnectionRequestTimeout(HTTP_TIMEOUT_INMILLIONSECONDS) + .setStaleConnectionCheckEnabled(true).build(); + + if(null == provider) { + httpClient = HttpClientBuilder.create().setMaxConnTotal(POOL_SIZE).setMaxConnPerRoute(POOL_SIZE) + .setDefaultRequestConfig(requestConfig).build(); + } else { + httpClient = HttpClientBuilder.create().setMaxConnTotal(POOL_SIZE).setMaxConnPerRoute(POOL_SIZE) + .setDefaultRequestConfig(requestConfig).setDefaultCredentialsProvider(provider).build(); + } + } + + private void destroyApacheHttpClient() { + try { + if (httpClient != null) { + 
httpClient.close(); + httpClient = null; + } + } catch (IOException e) { + e.printStackTrace(); + } + } + + public static HttpGet getGetRequest() { + return new HttpGet(); + } + + public static HttpPost getPostRequest() { + return new HttpPost(); + } + + public static HttpPut getPutRequest() { + return new HttpPut(); + } + + public static HttpDelete getDeleteRequest() { + return new HttpDelete(); + } + + public String executeAndGet(HttpRequestBase httpRequestBase) throws Exception { + HttpResponse response; + String entiStr = ""; + try { + response = httpClient.execute(httpRequestBase); + + if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) { + System.err.println("请求地址:" + httpRequestBase.getURI() + ", 请求方法:" + httpRequestBase.getMethod() + + ",STATUS CODE = " + response.getStatusLine().getStatusCode()); + + throw new Exception("Response Status Code : " + response.getStatusLine().getStatusCode()); + } else { + HttpEntity entity = response.getEntity(); + if (entity != null) { + entiStr = EntityUtils.toString(entity, Consts.UTF_8); + } else { + throw new Exception("Response Entity Is Null"); + } + } + } catch (Exception e) { + throw e; + } + + return entiStr; + } + + public String executeAndGetWithRetry(final HttpRequestBase httpRequestBase, final int retryTimes, final long retryInterval) { + try { + return RetryUtil.asyncExecuteWithRetry(new Callable() { + @Override + public String call() throws Exception { + return executeAndGet(httpRequestBase); + } + }, retryTimes, retryInterval, true, HTTP_TIMEOUT_INMILLIONSECONDS + 1000, asyncExecutor); + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, e); + } + } + + public String executeAndGetWithFailedRetry(final HttpRequestBase httpRequestBase, final int retryTimes, final long retryInterval){ + try { + return RetryUtil.asyncExecuteWithRetry(new Callable() { + @Override + public String call() throws Exception { + String result = executeAndGet(httpRequestBase); + if(result!=null && result.startsWith("{\"result\":-1")){ + throw DataXException.asDataXException(FrameworkErrorCode.CALL_REMOTE_FAILED, "远程接口返回-1,将重试"); + } + return result; + } + }, retryTimes, retryInterval, true, HTTP_TIMEOUT_INMILLIONSECONDS + 1000, asyncExecutor); + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, e); + } + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/SecretUtil.java b/core/src/main/java/com/alibaba/datax/core/util/SecretUtil.java new file mode 100755 index 000000000..a9f69479c --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/SecretUtil.java @@ -0,0 +1,440 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.container.CoreConstant; + +import org.apache.commons.codec.binary.Base64; +import org.apache.commons.lang.StringUtils; +import org.apache.commons.lang3.tuple.ImmutableTriple; +import org.apache.commons.lang3.tuple.Triple; + +import javax.crypto.Cipher; +import javax.crypto.SecretKey; +import javax.crypto.spec.SecretKeySpec; + +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.IOException; +import java.io.InputStream; +import java.security.Key; +import java.security.KeyFactory; +import java.security.KeyPair; +import java.security.KeyPairGenerator; +import java.security.interfaces.RSAPrivateKey; +import java.security.interfaces.RSAPublicKey; 
+import java.security.spec.PKCS8EncodedKeySpec; +import java.security.spec.X509EncodedKeySpec; +import java.util.HashMap; +import java.util.Map; +import java.util.Properties; + +/** + * Created by jingxing on 14/12/15. + */ +public class SecretUtil { + private static Properties properties; + + //RSA Key:keyVersion value:left:privateKey, right:publicKey, middle: type + //DESede Key: keyVersion value:left:keyContent, right:keyContent, middle: type + private static Map> versionKeyMap; + + private static final String ENCODING = "UTF-8"; + + public static final String KEY_ALGORITHM_RSA = "RSA"; + + public static final String KEY_ALGORITHM_3DES = "DESede"; + + private static final String CIPHER_ALGORITHM_3DES = "DESede/ECB/PKCS5Padding"; + + private static final Base64 base64 = new Base64(); + + /** + * BASE64加密 + * + * @param plaintextBytes + * @return + * @throws Exception + */ + public static String encryptBASE64(byte[] plaintextBytes) throws Exception { + return new String(base64.encode(plaintextBytes), ENCODING); + } + + /** + * BASE64解密 + * + * @param cipherText + * @return + * @throws Exception + */ + public static byte[] decryptBASE64(String cipherText) { + return base64.decode(cipherText); + } + + /** + * 加密
+ * @param data 裸的原始数据 + * @param key 经过base64加密的公钥(RSA)或者裸密钥(3DES) + * */ + public static String encrypt(String data, String key, String method) { + if (SecretUtil.KEY_ALGORITHM_RSA.equals(method)) { + return SecretUtil.encryptRSA(data, key); + } else if (SecretUtil.KEY_ALGORITHM_3DES.equals(method)) { + return SecretUtil.encrypt3DES(data, key); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("系统编程错误,不支持的加密类型", method)); + } + } + + /** + * 解密
+ * @param data 已经经过base64加密的密文 + * @param key 已经经过base64加密私钥(RSA)或者裸密钥(3DES) + * */ + public static String decrypt(String data, String key, String method) { + if (SecretUtil.KEY_ALGORITHM_RSA.equals(method)) { + return SecretUtil.decryptRSA(data, key); + } else if (SecretUtil.KEY_ALGORITHM_3DES.equals(method)) { + return SecretUtil.decrypt3DES(data, key); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("系统编程错误,不支持的加密类型", method)); + } + } + + /** + * 加密
+ * 用公钥加密 encryptByPublicKey + * + * @param data 裸的原始数据 + * @param key 经过base64加密的公钥 + * @return 结果也采用base64加密 + * @throws Exception + */ + public static String encryptRSA(String data, String key) { + try { + // 对公钥解密,公钥被base64加密过 + byte[] keyBytes = decryptBASE64(key); + + // 取得公钥 + X509EncodedKeySpec x509KeySpec = new X509EncodedKeySpec(keyBytes); + KeyFactory keyFactory = KeyFactory.getInstance(KEY_ALGORITHM_RSA); + Key publicKey = keyFactory.generatePublic(x509KeySpec); + + // 对数据加密 + Cipher cipher = Cipher.getInstance(keyFactory.getAlgorithm()); + cipher.init(Cipher.ENCRYPT_MODE, publicKey); + + return encryptBASE64(cipher.doFinal(data.getBytes(ENCODING))); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "rsa加密出错", e); + } + } + + /** + * 解密
+ * 用私钥解密 + * + * @param data 已经经过base64加密的密文 + * @param key 已经经过base64加密私钥 + * @return + * @throws Exception + */ + public static String decryptRSA(String data, String key) { + try { + // 对密钥解密 + byte[] keyBytes = decryptBASE64(key); + + // 取得私钥 + PKCS8EncodedKeySpec pkcs8KeySpec = new PKCS8EncodedKeySpec(keyBytes); + KeyFactory keyFactory = KeyFactory.getInstance(KEY_ALGORITHM_RSA); + Key privateKey = keyFactory.generatePrivate(pkcs8KeySpec); + + // 对数据解密 + Cipher cipher = Cipher.getInstance(keyFactory.getAlgorithm()); + cipher.init(Cipher.DECRYPT_MODE, privateKey); + + return new String(cipher.doFinal(decryptBASE64(data)), ENCODING); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "rsa解密出错", e); + } + } + + /** + * 初始化密钥 for RSA ALGORITHM + * + * @return + * @throws Exception + */ + public static String[] initKey() throws Exception { + KeyPairGenerator keyPairGen = KeyPairGenerator + .getInstance(KEY_ALGORITHM_RSA); + keyPairGen.initialize(1024); + + KeyPair keyPair = keyPairGen.generateKeyPair(); + + // 公钥 + RSAPublicKey publicKey = (RSAPublicKey) keyPair.getPublic(); + + // 私钥 + RSAPrivateKey privateKey = (RSAPrivateKey) keyPair.getPrivate(); + + String[] publicAndPrivateKey = { + encryptBASE64(publicKey.getEncoded()), + encryptBASE64(privateKey.getEncoded())}; + + return publicAndPrivateKey; + } + + /** + * 加密 DESede
+ * 用密钥加密 + * + * @param data 裸的原始数据 + * @param key 加密的密钥 + * @return 结果也采用base64加密 + * @throws Exception + */ + public static String encrypt3DES(String data, String key) { + try { + // 生成密钥 + SecretKey desKey = new SecretKeySpec(build3DesKey(key), + KEY_ALGORITHM_3DES); + // 对数据加密 + Cipher cipher = Cipher.getInstance(CIPHER_ALGORITHM_3DES); + cipher.init(Cipher.ENCRYPT_MODE, desKey); + return encryptBASE64(cipher.doFinal(data.getBytes(ENCODING))); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "3重DES加密出错", e); + } + } + + /** + * 解密
+ * 用密钥解密 + * + * @param data 已经经过base64加密的密文 + * @param key 解密的密钥 + * @return + * @throws Exception + */ + public static String decrypt3DES(String data, String key) { + try { + // 生成密钥 + SecretKey desKey = new SecretKeySpec(build3DesKey(key), + KEY_ALGORITHM_3DES); + // 对数据解密 + Cipher cipher = Cipher.getInstance(CIPHER_ALGORITHM_3DES); + cipher.init(Cipher.DECRYPT_MODE, desKey); + return new String(cipher.doFinal(decryptBASE64(data)), ENCODING); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "rsa解密出错", e); + } + } + + /** + * 根据字符串生成密钥字节数组 + * + * @param keyStr + * 密钥字符串 + * @return key 符合DESede标准的24byte数组 + */ + private static byte[] build3DesKey(String keyStr) { + try { + // 声明一个24位的字节数组,默认里面都是0,warn: 字符串0(48)和数组默认值0不一样,统一字符串0(48) + byte[] key = "000000000000000000000000".getBytes(ENCODING); + byte[] temp = keyStr.getBytes(ENCODING); + if (key.length > temp.length) { + // 如果temp不够24位,则拷贝temp数组整个长度的内容到key数组中 + System.arraycopy(temp, 0, key, 0, temp.length); + } else { + // 如果temp大于24位,则拷贝temp数组24个长度的内容到key数组中 + System.arraycopy(temp, 0, key, 0, key.length); + } + return key; + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "构建三重DES密匙出错", e); + } + } + + public static synchronized Properties getSecurityProperties() { + if (properties == null) { + InputStream secretStream; + try { + secretStream = new FileInputStream( + CoreConstant.DATAX_SECRET_PATH); + } catch (FileNotFoundException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + "DataX配置要求加解密,但无法找到密钥的配置文件"); + } + + properties = new Properties(); + try { + properties.load(secretStream); + secretStream.close(); + } catch (IOException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "读取加解密配置文件出错", e); + } + } + + return properties; + } + + + public static Configuration encryptSecretKey(Configuration configuration) { + String keyVersion = configuration + .getString(CoreConstant.DATAX_JOB_SETTING_KEYVERSION); + // 没有设置keyVersion,表示不用解密 + if (StringUtils.isBlank(keyVersion)) { + return configuration; + } + + Map> versionKeyMap = getPrivateKeyMap(); + + if (null == versionKeyMap.get(keyVersion)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("DataX配置的密钥版本为[%s],但在系统中没有配置,任务密钥配置错误,不存在您配置的密钥版本", keyVersion)); + } + + String key = versionKeyMap.get(keyVersion).getRight(); + String method = versionKeyMap.get(keyVersion).getMiddle(); + // keyVersion要求的私钥没有配置 + if (StringUtils.isBlank(key)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("DataX配置的密钥版本为[%s],但在系统中没有配置,可能是任务密钥配置错误,也可能是系统维护问题", keyVersion)); + } + + String tempEncrptedData = null; + for (String path : configuration.getSecretKeyPathSet()) { + tempEncrptedData = SecretUtil.encrypt(configuration.getString(path), key, method); + int lastPathIndex = path.lastIndexOf(".") + 1; + String lastPathKey = path.substring(lastPathIndex); + + String newPath = path.substring(0, lastPathIndex) + "*" + + lastPathKey; + configuration.set(newPath, tempEncrptedData); + configuration.remove(path); + } + + return configuration; + } + + public static Configuration decryptSecretKey(Configuration config) { + String keyVersion = config + .getString(CoreConstant.DATAX_JOB_SETTING_KEYVERSION); + // 没有设置keyVersion,表示不用解密 + if (StringUtils.isBlank(keyVersion)) { + return config; + } + + Map> versionKeyMap = getPrivateKeyMap(); + if (null == 
versionKeyMap.get(keyVersion)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("DataX配置的密钥版本为[%s],但在系统中没有配置,任务密钥配置错误,不存在您配置的密钥版本", keyVersion)); + } + String decryptKey = versionKeyMap.get(keyVersion).getLeft(); + String method = versionKeyMap.get(keyVersion).getMiddle(); + // keyVersion要求的私钥没有配置 + if (StringUtils.isBlank(decryptKey)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("DataX配置的密钥版本为[%s],但在系统中没有配置,可能是任务密钥配置错误,也可能是系统维护问题", keyVersion)); + } + + // 对包含*号key解密处理 + for (String key : config.getKeys()) { + int lastPathIndex = key.lastIndexOf(".") + 1; + String lastPathKey = key.substring(lastPathIndex); + if (lastPathKey.length() > 1 && lastPathKey.charAt(0) == '*' + && lastPathKey.charAt(1) != '*') { + Object value = config.get(key); + if (value instanceof String) { + String newKey = key.substring(0, lastPathIndex) + + lastPathKey.substring(1); + config.set(newKey, + SecretUtil.decrypt((String) value, decryptKey, method)); + config.addSecretKeyPath(newKey); + config.remove(key); + } + } + } + + return config; + } + + private static synchronized Map> getPrivateKeyMap() { + if (versionKeyMap == null) { + versionKeyMap = new HashMap>(); + Properties properties = SecretUtil.getSecurityProperties(); + + String[] serviceUsernames = new String[] { + CoreConstant.LAST_SERVICE_USERNAME, + CoreConstant.CURRENT_SERVICE_USERNAME }; + String[] servicePasswords = new String[] { + CoreConstant.LAST_SERVICE_PASSWORD, + CoreConstant.CURRENT_SERVICE_PASSWORD }; + + for (int i = 0; i < serviceUsernames.length; i++) { + String serviceUsername = properties + .getProperty(serviceUsernames[i]); + if (StringUtils.isNotBlank(serviceUsername)) { + String servicePassword = properties + .getProperty(servicePasswords[i]); + if (StringUtils.isNotBlank(servicePassword)) { + versionKeyMap.put(serviceUsername, ImmutableTriple.of( + servicePassword, SecretUtil.KEY_ALGORITHM_3DES, + servicePassword)); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, String.format( + "DataX配置要求加解密,但配置的密钥版本[%s]存在密钥为空的情况", + serviceUsername)); + } + } + } + + String[] keyVersions = new String[] { CoreConstant.LAST_KEYVERSION, + CoreConstant.CURRENT_KEYVERSION }; + String[] privateKeys = new String[] { CoreConstant.LAST_PRIVATEKEY, + CoreConstant.CURRENT_PRIVATEKEY }; + String[] publicKeys = new String[] { CoreConstant.LAST_PUBLICKEY, + CoreConstant.CURRENT_PUBLICKEY }; + for (int i = 0; i < keyVersions.length; i++) { + String keyVersion = properties.getProperty(keyVersions[i]); + if (StringUtils.isNotBlank(keyVersion)) { + String privateKey = properties.getProperty(privateKeys[i]); + String publicKey = properties.getProperty(publicKeys[i]); + if (StringUtils.isNotBlank(privateKey) + && StringUtils.isNotBlank(publicKey)) { + versionKeyMap.put(keyVersion, ImmutableTriple.of( + privateKey, SecretUtil.KEY_ALGORITHM_RSA, + publicKey)); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, String.format( + "DataX配置要求加解密,但配置的公私钥对存在为空的情况,版本[%s]", + keyVersion)); + } + } + } + } + if (versionKeyMap.size() <= 0) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "DataX配置要求加解密,但无法找到加解密配置"); + } + return versionKeyMap; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/container/ClassLoaderSwapper.java b/core/src/main/java/com/alibaba/datax/core/util/container/ClassLoaderSwapper.java new file mode 100755 index 000000000..b878cf090 --- /dev/null 
+++ b/core/src/main/java/com/alibaba/datax/core/util/container/ClassLoaderSwapper.java @@ -0,0 +1,41 @@ +package com.alibaba.datax.core.util.container; + +/** + * Created by jingxing on 14-8-29. + * + * 为避免jar冲突,比如hbase可能有多个版本的读写依赖jar包,JobContainer和TaskGroupContainer + * 就需要脱离当前classLoader去加载这些jar包,执行完成后,又退回到原来classLoader上继续执行接下来的代码 + */ +public final class ClassLoaderSwapper { + private ClassLoader storeClassLoader = null; + + private ClassLoaderSwapper() { + } + + public static ClassLoaderSwapper newCurrentThreadClassLoaderSwapper() { + return new ClassLoaderSwapper(); + } + + /** + * 保存当前classLoader,并将当前线程的classLoader设置为所给classLoader + * + * @param + * @return + */ + public ClassLoader setCurrentThreadClassLoader(ClassLoader classLoader) { + this.storeClassLoader = Thread.currentThread().getContextClassLoader(); + Thread.currentThread().setContextClassLoader(classLoader); + return this.storeClassLoader; + } + + /** + * 将当前线程的类加载器设置为保存的类加载 + * @return + */ + public ClassLoader restoreCurrentThreadClassLoader() { + ClassLoader classLoader = Thread.currentThread() + .getContextClassLoader(); + Thread.currentThread().setContextClassLoader(this.storeClassLoader); + return classLoader; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/container/CoreConstant.java b/core/src/main/java/com/alibaba/datax/core/util/container/CoreConstant.java new file mode 100755 index 000000000..f45df47a7 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/container/CoreConstant.java @@ -0,0 +1,168 @@ +package com.alibaba.datax.core.util.container; + +import org.apache.commons.lang.StringUtils; + +import java.io.File; + +/** + * Created by jingxing on 14-8-25. + */ +public class CoreConstant { + // --------------------------- 全局使用的变量(最好按照逻辑顺序,调整下成员变量顺序) + // -------------------------------- + + public static final String DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL = "core.container.taskGroup.channel"; + + public static final String DATAX_CORE_CONTAINER_MODEL = "core.container.model"; + + public static final String DATAX_CORE_CONTAINER_JOB_ID = "core.container.job.id"; + + public static final String DATAX_CORE_CONTAINER_TRACE_ENABLE = "core.container.trace.enable"; + + public static final String DATAX_CORE_CONTAINER_JOB_MODE = "core.container.job.mode"; + + public static final String DATAX_CORE_CONTAINER_JOB_REPORTINTERVAL = "core.container.job.reportInterval"; + + public static final String DATAX_CORE_CONTAINER_JOB_SLEEPINTERVAL = "core.container.job.sleepInterval"; + + public static final String DATAX_CORE_CONTAINER_TASKGROUP_ID = "core.container.taskGroup.id"; + + public static final String DATAX_CORE_CONTAINER_TASKGROUP_SLEEPINTERVAL = "core.container.taskGroup.sleepInterval"; + + public static final String DATAX_CORE_CONTAINER_TASKGROUP_REPORTINTERVAL = "core.container.taskGroup.reportInterval"; + + public static final String DATAX_CORE_CONTAINER_TASK_FAILOVER_MAXRETRYTIMES = "core.container.task.failOver.maxRetryTimes"; + + public static final String DATAX_CORE_CONTAINER_TASK_FAILOVER_RETRYINTERVALINMSEC = "core.container.task.failOver.retryIntervalInMsec"; + + public static final String DATAX_CORE_CONTAINER_TASK_FAILOVER_MAXWAITINMSEC = "core.container.task.failOver.maxWaitInMsec"; + + public static final String DATAX_CORE_DATAXSERVER_ADDRESS = "core.dataXServer.address"; + + public static final String DATAX_CORE_DATAXSERVER_TIMEOUT = "core.dataXServer.timeout"; + + public static final String DATAX_CORE_REPORT_DATAX_LOG = "core.dataXServer.reportDataxLog"; + + public static final 
String DATAX_CORE_REPORT_DATAX_PERFLOG = "core.dataXServer.reportPerfLog"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_CLASS = "core.transport.channel.class"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY = "core.transport.channel.capacity"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE = "core.transport.channel.byteCapacity"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_SPEED_BYTE = "core.transport.channel.speed.byte"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_SPEED_RECORD = "core.transport.channel.speed.record"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_FLOWCONTROLINTERVAL = "core.transport.channel.flowControlInterval"; + + public static final String DATAX_CORE_TRANSPORT_EXCHANGER_BUFFERSIZE = "core.transport.exchanger.bufferSize"; + + public static final String DATAX_CORE_TRANSPORT_RECORD_CLASS = "core.transport.record.class"; + + public static final String DATAX_CORE_STATISTICS_COLLECTOR_PLUGIN_TASKCLASS = "core.statistics.collector.plugin.taskClass"; + + public static final String DATAX_CORE_STATISTICS_COLLECTOR_PLUGIN_MAXDIRTYNUM = "core.statistics.collector.plugin.maxDirtyNumber"; + + public static final String DATAX_JOB_CONTENT_READER_NAME = "job.content[0].reader.name"; + + public static final String DATAX_JOB_CONTENT_READER_PARAMETER = "job.content[0].reader.parameter"; + + public static final String DATAX_JOB_CONTENT_WRITER_NAME = "job.content[0].writer.name"; + + public static final String DATAX_JOB_CONTENT_WRITER_PARAMETER = "job.content[0].writer.parameter"; + + public static final String DATAX_JOB_JOBINFO = "job.jobInfo"; + + public static final String DATAX_JOB_CONTENT = "job.content"; + + public static final String DATAX_JOB_SETTING_KEYVERSION = "job.setting.keyVersion"; + + public static final String DATAX_JOB_SETTING_SPEED_BYTE = "job.setting.speed.byte"; + + public static final String DATAX_JOB_SETTING_SPEED_RECORD = "job.setting.speed.record"; + + public static final String DATAX_JOB_SETTING_SPEED_CHANNEL = "job.setting.speed.channel"; + + public static final String DATAX_JOB_SETTING_ERRORLIMIT = "job.setting.errorLimit"; + + public static final String DATAX_JOB_SETTING_ERRORLIMIT_RECORD = "job.setting.errorLimit.record"; + + public static final String DATAX_JOB_SETTING_ERRORLIMIT_PERCENT = "job.setting.errorLimit.percentage"; + + public static final String DATAX_JOB_SETTING_DRYRUN = "job.setting.dryRun"; + + public static final String DATAX_JOB_PREHANDLER_PLUGINTYPE = "job.preHandler.pluginType"; + + public static final String DATAX_JOB_PREHANDLER_PLUGINNAME = "job.preHandler.pluginName"; + + public static final String DATAX_JOB_POSTHANDLER_PLUGINTYPE = "job.postHandler.pluginType"; + + public static final String DATAX_JOB_POSTHANDLER_PLUGINNAME = "job.postHandler.pluginName"; + // ----------------------------- 局部使用的变量 + public static final String JOB_WRITER = "reader"; + + public static final String JOB_READER = "reader"; + + public static final String JOB_READER_NAME = "reader.name"; + + public static final String JOB_READER_PARAMETER = "reader.parameter"; + + public static final String JOB_WRITER_NAME = "writer.name"; + + public static final String JOB_WRITER_PARAMETER = "writer.parameter"; + + public static final String TASK_ID = "taskId"; + + // ----------------------------- 安全模块变量 ------------------ + + public static final String LAST_KEYVERSION = "last.keyVersion"; + + public static final String LAST_PUBLICKEY = "last.publicKey"; + + public 
static final String LAST_PRIVATEKEY = "last.privateKey"; + + public static final String LAST_SERVICE_USERNAME = "last.service.username"; + + public static final String LAST_SERVICE_PASSWORD = "last.service.password"; + + public static final String CURRENT_KEYVERSION = "current.keyVersion"; + + public static final String CURRENT_PUBLICKEY = "current.publicKey"; + + public static final String CURRENT_PRIVATEKEY = "current.privateKey"; + + public static final String CURRENT_SERVICE_USERNAME = "current.service.username"; + + public static final String CURRENT_SERVICE_PASSWORD = "current.service.password"; + + // ----------------------------- 环境变量 --------------------------------- + + public static String DATAX_HOME = System.getProperty("datax.home"); + + public static String DATAX_CONF_PATH = StringUtils.join(new String[] { + DATAX_HOME, "conf", "core.json" }, File.separator); + + public static String DATAX_CONF_LOG_PATH = StringUtils.join(new String[] { + DATAX_HOME, "conf", "logback.xml" }, File.separator); + + public static String DATAX_SECRET_PATH = StringUtils.join(new String[] { + DATAX_HOME, "conf", ".secret.properties" }, File.separator); + + public static String DATAX_PLUGIN_HOME = StringUtils.join(new String[] { + DATAX_HOME, "plugin" }, File.separator); + + public static String DATAX_PLUGIN_READER_HOME = StringUtils.join( + new String[] { DATAX_HOME, "plugin", "reader" }, File.separator); + + public static String DATAX_PLUGIN_WRITER_HOME = StringUtils.join( + new String[] { DATAX_HOME, "plugin", "writer" }, File.separator); + + public static String DATAX_BIN_HOME = StringUtils.join(new String[] { + DATAX_HOME, "bin" }, File.separator); + + public static String DATAX_JOB_HOME = StringUtils.join(new String[] { + DATAX_HOME, "job" }, File.separator); + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/container/JarLoader.java b/core/src/main/java/com/alibaba/datax/core/util/container/JarLoader.java new file mode 100755 index 000000000..9fc113dc6 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/container/JarLoader.java @@ -0,0 +1,97 @@ +package com.alibaba.datax.core.util.container; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import org.apache.commons.lang.StringUtils; +import org.apache.commons.lang.Validate; + +import java.io.File; +import java.io.FileFilter; +import java.net.URL; +import java.net.URLClassLoader; +import java.util.ArrayList; +import java.util.List; + +/** + * 提供Jar隔离的加载机制,会把传入的路径、及其子路径、以及路径中的jar文件加入到class path。 + */ +public class JarLoader extends URLClassLoader { + public JarLoader(String[] paths) { + this(paths, JarLoader.class.getClassLoader()); + } + + public JarLoader(String[] paths, ClassLoader parent) { + super(getURLs(paths), parent); + } + + private static URL[] getURLs(String[] paths) { + Validate.isTrue(null != paths && 0 != paths.length, + "jar包路径不能为空."); + + List dirs = new ArrayList(); + for (String path : paths) { + dirs.add(path); + JarLoader.collectDirs(path, dirs); + } + + List urls = new ArrayList(); + for (String path : dirs) { + urls.addAll(doGetURLs(path)); + } + + return urls.toArray(new URL[0]); + } + + private static void collectDirs(String path, List collector) { + if (null == path || StringUtils.isBlank(path)) { + return; + } + + File current = new File(path); + if (!current.exists() || !current.isDirectory()) { + return; + } + + for (File child : current.listFiles()) { + if (!child.isDirectory()) { + continue; + } + + 
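                // child is a directory: record it and recurse, so jars placed in nested
                // sub-directories are also added to the class path.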
collector.add(child.getAbsolutePath()); + collectDirs(child.getAbsolutePath(), collector); + } + } + + private static List doGetURLs(final String path) { + Validate.isTrue(!StringUtils.isBlank(path), "jar包路径不能为空."); + + File jarPath = new File(path); + + Validate.isTrue(jarPath.exists() && jarPath.isDirectory(), + "jar包路径必须存在且为目录."); + + /* set filter */ + FileFilter jarFilter = new FileFilter() { + @Override + public boolean accept(File pathname) { + return pathname.getName().endsWith(".jar"); + } + }; + + /* iterate all jar */ + File[] allJars = new File(path).listFiles(jarFilter); + List jarURLs = new ArrayList(allJars.length); + + for (int i = 0; i < allJars.length; i++) { + try { + jarURLs.add(allJars[i].toURI().toURL()); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_INIT_ERROR, + "系统加载jar包出错", e); + } + } + + return jarURLs; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/container/LoadUtil.java b/core/src/main/java/com/alibaba/datax/core/util/container/LoadUtil.java new file mode 100755 index 000000000..30e926c38 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/container/LoadUtil.java @@ -0,0 +1,202 @@ +package com.alibaba.datax.core.util.container; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.AbstractJobPlugin; +import com.alibaba.datax.common.plugin.AbstractPlugin; +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.taskgroup.runner.AbstractRunner; +import com.alibaba.datax.core.taskgroup.runner.ReaderRunner; +import com.alibaba.datax.core.taskgroup.runner.WriterRunner; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import org.apache.commons.lang3.StringUtils; + +import java.util.HashMap; +import java.util.Map; + +/** + * Created by jingxing on 14-8-24. + *

+ * 插件加载器,大体上分reader、transformer(还未实现)和writer三中插件类型, + * reader和writer在执行时又可能出现Job和Task两种运行时(加载的类不同) + */ +public class LoadUtil { + private static final String pluginTypeNameFormat = "plugin.%s.%s"; + + private LoadUtil() { + } + + private enum ContainerType { + Job("Job"), Task("Task"); + private String type; + + private ContainerType(String type) { + this.type = type; + } + + public String value() { + return type; + } + } + + /** + * 所有插件配置放置在pluginRegisterCenter中,为区别reader、transformer和writer,还能区别 + * 具体pluginName,故使用pluginType.pluginName作为key放置在该map中 + */ + private static Configuration pluginRegisterCenter; + + /** + * jarLoader的缓冲 + */ + private static Map jarLoaderCenter = new HashMap(); + + /** + * 设置pluginConfigs,方便后面插件来获取 + * + * @param pluginConfigs + */ + public static void bind(Configuration pluginConfigs) { + pluginRegisterCenter = pluginConfigs; + } + + private static String generatePluginKey(PluginType pluginType, + String pluginName) { + return String.format(pluginTypeNameFormat, pluginType.toString(), + pluginName); + } + + private static Configuration getPluginConf(PluginType pluginType, + String pluginName) { + Configuration pluginConf = pluginRegisterCenter + .getConfiguration(generatePluginKey(pluginType, pluginName)); + + if (null == pluginConf) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_INSTALL_ERROR, + String.format("DataX不能找到插件[%s]的配置.", + pluginName)); + } + + return pluginConf; + } + + /** + * 加载JobPlugin,reader、writer都可能要加载 + * + * @param pluginType + * @param pluginName + * @return + */ + public static AbstractJobPlugin loadJobPlugin(PluginType pluginType, + String pluginName) { + Class clazz = LoadUtil.loadPluginClass( + pluginType, pluginName, ContainerType.Job); + + try { + AbstractJobPlugin jobPlugin = (AbstractJobPlugin) clazz + .newInstance(); + jobPlugin.setPluginConf(getPluginConf(pluginType, pluginName)); + return jobPlugin; + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + String.format("DataX找到plugin[%s]的Job配置.", + pluginName), e); + } + } + + /** + * 加载taskPlugin,reader、writer都可能加载 + * + * @param pluginType + * @param pluginName + * @return + */ + public static AbstractTaskPlugin loadTaskPlugin(PluginType pluginType, + String pluginName) { + Class clazz = LoadUtil.loadPluginClass( + pluginType, pluginName, ContainerType.Task); + + try { + AbstractTaskPlugin taskPlugin = (AbstractTaskPlugin) clazz + .newInstance(); + taskPlugin.setPluginConf(getPluginConf(pluginType, pluginName)); + return taskPlugin; + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + String.format("DataX不能找plugin[%s]的Task配置.", + pluginName), e); + } + } + + /** + * 根据插件类型、名字和执行时taskGroupId加载对应运行器 + * + * @param pluginType + * @param pluginName + * @return + */ + public static AbstractRunner loadPluginRunner(PluginType pluginType, String pluginName) { + AbstractTaskPlugin taskPlugin = LoadUtil.loadTaskPlugin(pluginType, + pluginName); + + switch (pluginType) { + case READER: + return new ReaderRunner(taskPlugin); + case WRITER: + return new WriterRunner(taskPlugin); + default: + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + String.format("插件[%s]的类型必须是[reader]或[writer]!", + pluginName)); + } + } + + /** + * 反射出具体plugin实例 + * + * @param pluginType + * @param pluginName + * @param pluginRunType + * @return + */ + @SuppressWarnings("unchecked") + private static synchronized Class loadPluginClass( + PluginType pluginType, 
String pluginName, + ContainerType pluginRunType) { + Configuration pluginConf = getPluginConf(pluginType, pluginName); + JarLoader jarLoader = LoadUtil.getJarLoader(pluginType, pluginName); + try { + return (Class) jarLoader + .loadClass(pluginConf.getString("class") + "$" + + pluginRunType.value()); + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, e); + } + } + + public static synchronized JarLoader getJarLoader(PluginType pluginType, + String pluginName) { + Configuration pluginConf = getPluginConf(pluginType, pluginName); + + JarLoader jarLoader = jarLoaderCenter.get(generatePluginKey(pluginType, + pluginName)); + if (null == jarLoader) { + String pluginPath = pluginConf.getString("path"); + if (StringUtils.isBlank(pluginPath)) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + String.format( + "%s插件[%s]路径非法!", + pluginType, pluginName)); + } + jarLoader = new JarLoader(new String[]{pluginPath}); + jarLoaderCenter.put(generatePluginKey(pluginType, pluginName), + jarLoader); + } + + return jarLoader; + } +} diff --git a/core/src/main/job/job.json b/core/src/main/job/job.json new file mode 100755 index 000000000..582065929 --- /dev/null +++ b/core/src/main/job/job.json @@ -0,0 +1,52 @@ +{ + "job": { + "setting": { + "speed": { + "byte":10485760 + }, + "errorLimit": { + "record": 0, + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19890604, + "type": "long" + }, + { + "value": "1989-06-04 00:00:00", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 100000 + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "encoding": "UTF-8" + } + } + } + ] + } +} diff --git a/core/src/main/log/datax.log b/core/src/main/log/datax.log new file mode 100755 index 000000000..e69de29bb diff --git a/core/src/main/script/Readme.md b/core/src/main/script/Readme.md new file mode 100755 index 000000000..341f3f88d --- /dev/null +++ b/core/src/main/script/Readme.md @@ -0,0 +1 @@ +some script here. \ No newline at end of file diff --git a/core/src/main/tmp/readme.txt b/core/src/main/tmp/readme.txt new file mode 100755 index 000000000..74b233ce5 --- /dev/null +++ b/core/src/main/tmp/readme.txt @@ -0,0 +1,4 @@ +If you are developing DataX Plugin, In your Plugin you can use this directory to store temporary resources . + +NOTE: +Each time install DataX, this directory will be cleaned up ! \ No newline at end of file diff --git a/core/src/test/java/com/alibaba/datax/core/EngineTest.java b/core/src/test/java/com/alibaba/datax/core/EngineTest.java new file mode 100755 index 000000000..e5c8dd315 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/EngineTest.java @@ -0,0 +1,62 @@ +package com.alibaba.datax.core; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.scaffold.base.CaseInitializer; +import com.alibaba.datax.core.util.ConfigParser; +import com.alibaba.datax.core.util.ExceptionTracker; +import com.alibaba.datax.core.util.container.LoadUtil; +import org.junit.Before; +import org.junit.Test; + +import java.io.File; +import java.io.FileWriter; + +/** + * Created by jingxing on 14-9-25. 
+ */ +public class EngineTest extends CaseInitializer { + private Configuration configuration; + + @Before + public void setUp() { + String path = EngineTest.class.getClassLoader().getResource(".") + .getFile(); + + this.configuration = ConfigParser.parse(path + File.separator + + "all.json"); + LoadUtil.bind(this.configuration); + } + + + public void test_entry() throws Throwable { + String jobConfig = this.configuration.toString(); + + String jobFile = "./job.json"; + FileWriter writer = new FileWriter(jobFile); + writer.write(jobConfig); + writer.flush(); + writer.close(); + String[] args = { "-job", jobFile, "-mode", "standalone" }; + + Engine.entry(args); + } + + @Test + public void testNN() { + try { + throwEE(); + } catch (Exception e) { + String tarce = ExceptionTracker.trace(e); + if(e instanceof NullPointerException) { + System.out.println(tarce); + } + } + } + + public static void throwEE() { + String aa = null; + aa.toString(); + //throw new NullPointerException(); + } + +} \ No newline at end of file diff --git a/core/src/test/java/com/alibaba/datax/core/constant/CoreConstantTest.java b/core/src/test/java/com/alibaba/datax/core/constant/CoreConstantTest.java new file mode 100755 index 000000000..9ebe86a8a --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/constant/CoreConstantTest.java @@ -0,0 +1,12 @@ +package com.alibaba.datax.core.constant; + +import com.alibaba.datax.core.util.container.CoreConstant; +import org.junit.Test; + +public class CoreConstantTest { + @Test + public void test() { + System.out.println(System.getProperties()); + System.out.println(CoreConstant.DATAX_PLUGIN_HOME); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/container/ClassLoaderSwapperTest.java b/core/src/test/java/com/alibaba/datax/core/container/ClassLoaderSwapperTest.java new file mode 100755 index 000000000..6c1d1ba47 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/container/ClassLoaderSwapperTest.java @@ -0,0 +1,23 @@ +package com.alibaba.datax.core.container; + +import com.alibaba.datax.core.util.container.ClassLoaderSwapper; +import org.junit.Assert; +import org.junit.Test; + +import java.net.URL; +import java.net.URLClassLoader; + +/** + * Created by jingxing on 14-9-4. 
+ */ +public class ClassLoaderSwapperTest { + @Test + public void test() { + ClassLoaderSwapper classLoaderSwapper = + ClassLoaderSwapper.newCurrentThreadClassLoaderSwapper(); + ClassLoader newClassLoader = new URLClassLoader(new URL[]{}); + classLoaderSwapper.setCurrentThreadClassLoader(newClassLoader); + Assert.assertTrue("", newClassLoader == + classLoaderSwapper.restoreCurrentThreadClassLoader()); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/container/JobAssignUtilTest.java b/core/src/test/java/com/alibaba/datax/core/container/JobAssignUtilTest.java new file mode 100755 index 000000000..f6b6e4383 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/container/JobAssignUtilTest.java @@ -0,0 +1,56 @@ +package com.alibaba.datax.core.container; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.container.util.JobAssignUtil; +import org.apache.commons.lang3.StringUtils; +import org.junit.Test; + +import java.io.File; +import java.util.List; + +public class JobAssignUtilTest { + @Test + public void test_01() { + Configuration configuration = Configuration.from(new File(JobAssignUtil.class.getResource("/job/job.json").getFile())); + configuration.set("job.content[0].taskId", 0); + configuration.set("job.content[1].taskId", 1); + System.out.println(configuration.beautify()); + int channelNumber = 3; + int channelsPerTaskGroup = 1; + List result = JobAssignUtil.assignFairly(configuration, channelNumber, channelsPerTaskGroup); + + System.out.println("==================================="); + for (Configuration conf : result) { + System.out.println(conf.beautify()); + System.out.println("----------------"); + } + + System.out.println(configuration); + } + + @Test + public void test_02() { + + String jobString = "{\"job\":{\"setting\":{},\"content\":[{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}},{\"reader\":{\"type\":\"fakereader\",\"parameter\":{}},\"writer\":{\"type\":\"fakewriter\",\"parameter\":{}}}]}}"; + + Configuration configuration = Configuration.from(jobString); + + int taskNumber = StringUtils.countMatches(jobString, "fakereader"); + System.out.println("taskNumber:" + taskNumber); + for (int i = 0; i < taskNumber; i++) { + configuration.set("job.content[" + i + "].taskId", i); + 
} +// System.out.println(configuration.beautify()); + int channelNumber = 13; + int channelsPerTaskGroup = 5; + List result = JobAssignUtil.assignFairly(configuration, channelNumber, channelsPerTaskGroup); + + System.out.println("==================================="); + for (Configuration conf : result) { + System.out.println(conf.beautify()); + System.out.println("----------------"); + } + +// System.out.println(configuration); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/container/JobContainerTest.java b/core/src/test/java/com/alibaba/datax/core/container/JobContainerTest.java new file mode 100755 index 000000000..7749f9a3e --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/container/JobContainerTest.java @@ -0,0 +1,369 @@ +package com.alibaba.datax.core.container; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.JobContainer; +import com.alibaba.datax.core.job.meta.ExecuteMode; +import com.alibaba.datax.core.scaffold.base.CaseInitializer; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.util.ConfigParser; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.util.container.LoadUtil; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; +import org.powermock.api.mockito.PowerMockito; + +import java.io.File; +import java.lang.reflect.Method; +import java.util.ArrayList; +import java.util.List; + +public class JobContainerTest extends CaseInitializer { + private Configuration configuration; + + @Before + public void setUp() { + String path = JobContainerTest.class.getClassLoader() + .getResource(".").getFile(); + + this.configuration = ConfigParser.parse(path + File.separator + + "all.json"); + LoadUtil.bind(this.configuration); + } + + /** + * standalone模式下点对点跑完全部流程 + */ + @Test + public void testStart() { + JobContainer jobContainer = new JobContainer( + this.configuration); + jobContainer.start(); + } + + @Test + public void testPreHandler() throws Exception { + JobContainer jobContainer = new JobContainer( + this.configuration); + + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("preHandle"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + + System.out.println(this.configuration.get("job.preHandler.test")); + Assert.assertEquals("writePreDone",this.configuration.get("job.preHandler.test")); + } + + @Test + public void testPostHandler() throws Exception { + JobContainer jobContainer = new JobContainer( + this.configuration); + + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("postHandle"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + + System.out.println(this.configuration.get("job.postHandler.test")); + Assert.assertEquals("writePostDone",this.configuration.get("job.postHandler.test")); + } + + @Test + public void testPreHandlerByReader() throws Exception { + + Configuration copyConfig = this.configuration.clone(); + copyConfig.set(CoreConstant.DATAX_JOB_PREHANDLER_PLUGINTYPE,"reader"); + copyConfig.set(CoreConstant.DATAX_JOB_PREHANDLER_PLUGINNAME,"fakereader"); + JobContainer jobContainer = new JobContainer( + copyConfig); + + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("preHandle"); + 
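        // preHandle() is not public on JobContainer, so the test flips accessibility and invokes it
        // reflectively; with the preHandler plugin set to "fakereader", the handler is expected to
        // write "readPreDone" into the configuration, which the assertion below checks.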
initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + + System.out.println(copyConfig.get("job.preHandler.test")); + Assert.assertEquals("readPreDone",copyConfig.get("job.preHandler.test")); + } + + @Test + public void testPostHandlerByReader() throws Exception { + + Configuration copyConfig = this.configuration.clone(); + copyConfig.set(CoreConstant.DATAX_JOB_POSTHANDLER_PLUGINTYPE,"reader"); + copyConfig.set(CoreConstant.DATAX_JOB_POSTHANDLER_PLUGINNAME,"fakereader"); + JobContainer jobContainer = new JobContainer( + copyConfig); + + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("postHandle"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + + System.out.println(copyConfig.get("job.postHandler.test")); + Assert.assertEquals("readPostDone",copyConfig.get("job.postHandler.test")); + } + + @Test + public void testInitNormal() throws Exception { + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, -2); + this.configuration.set("runMode", ExecuteMode.STANDALONE.getValue()); + JobContainer jobContainer = new JobContainer( + this.configuration); + + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("init"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + Assert.assertEquals("default job id = 0", 0l, this.configuration + .getLong(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID) + .longValue()); + } + + @SuppressWarnings("unchecked") + @Test + public void testMergeReaderAndWriterSlicesConfigs() throws Exception { + JobContainer jobContainer = new JobContainer( + this.configuration); + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("init"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + initMethod.setAccessible(false); + + int splitNumber = 100; + List readerSplitConfigurations = new ArrayList(); + List writerSplitConfigurations = new ArrayList(); + for (int i = 0; i < splitNumber; i++) { + Configuration readerOneConfig = Configuration.newDefault(); + List jdbcUrlArray = new ArrayList(); + jdbcUrlArray.add(String.format( + "jdbc:mysql://localhost:3305/db%04d", i)); + readerOneConfig.set("jdbcUrl", jdbcUrlArray); + + List tableArray = new ArrayList(); + tableArray.add(String.format("jingxing_%04d", i)); + readerOneConfig.set("table", tableArray); + + readerSplitConfigurations.add(readerOneConfig); + + Configuration writerOneConfig = Configuration.newDefault(); + List odpsUrlArray = new ArrayList(); + odpsUrlArray.add(String.format("odps://localhost:3305/db%04d", i)); + writerOneConfig.set("jdbcUrl", odpsUrlArray); + + List odpsTableArray = new ArrayList(); + odpsTableArray.add(String.format("jingxing_%04d", i)); + writerOneConfig.set("table", odpsTableArray); + + writerSplitConfigurations.add(writerOneConfig); + } + + initMethod = jobContainer.getClass().getDeclaredMethod( + "mergeReaderAndWriterTaskConfigs", List.class, List.class); + initMethod.setAccessible(true); + + List mergedConfigs = (List) initMethod + .invoke(jobContainer, readerSplitConfigurations, writerSplitConfigurations); + + Assert.assertEquals("merge number equals to split number", splitNumber, + mergedConfigs.size()); + for (Configuration sliceConfig : mergedConfigs) { + Assert.assertNotNull("reader name not null", + sliceConfig.getString(CoreConstant.JOB_READER_NAME)); + Assert.assertNotNull("reader name not null", + sliceConfig.getString(CoreConstant.JOB_READER_PARAMETER)); + Assert.assertNotNull("reader name not null", + 
sliceConfig.getString(CoreConstant.JOB_WRITER_NAME)); + Assert.assertNotNull("reader name not null", + sliceConfig.getString(CoreConstant.JOB_WRITER_PARAMETER)); + Assert.assertTrue("has slice id", + sliceConfig.getInt(CoreConstant.TASK_ID) >= 0); + } + } + + @Test(expected = Exception.class) + public void testMergeReaderAndWriterSlicesConfigsException() + throws Exception { + JobContainer jobContainer = new JobContainer( + this.configuration); + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("init"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + initMethod.setAccessible(false); + + int readerSplitNumber = 100; + int writerSplitNumber = readerSplitNumber + 1; + List readerSplitConfigurations = new ArrayList(); + List writerSplitConfigurations = new ArrayList(); + for (int i = 0; i < readerSplitNumber; i++) { + Configuration readerOneConfig = Configuration.newDefault(); + readerSplitConfigurations.add(readerOneConfig); + } + for (int i = 0; i < writerSplitNumber; i++) { + Configuration readerOneConfig = Configuration.newDefault(); + writerSplitConfigurations.add(readerOneConfig); + } + + initMethod = jobContainer.getClass().getDeclaredMethod( + "mergeReaderAndWriterSlicesConfigs", List.class, List.class); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, readerSplitConfigurations, writerSplitConfigurations); + } + + @Test + public void testDistributeTasksToTaskGroupContainer() throws Exception { + distributeTasksToTaskGroupContainerTest(333, 7); + + distributeTasksToTaskGroupContainerTest(6, 7); + distributeTasksToTaskGroupContainerTest(7, 7); + distributeTasksToTaskGroupContainerTest(8, 7); + + distributeTasksToTaskGroupContainerTest(1, 1); + distributeTasksToTaskGroupContainerTest(2, 1); + distributeTasksToTaskGroupContainerTest(1, 2); + + distributeTasksToTaskGroupContainerTest(1, 1025); + distributeTasksToTaskGroupContainerTest(1024, 1025); + } + + /** + * 分发测试函数,可根据不同的通道数、每个taskGroup平均包括的channel数得到最优的分发结果 + * 注意:默认的tasks是采用faker里切分出的1024个tasks + * + * @param channelNumber + * @param channelsPerTaskGroupContainer + * @throws Exception + */ + @SuppressWarnings("unchecked") + private void distributeTasksToTaskGroupContainerTest(int channelNumber, + int channelsPerTaskGroupContainer) throws Exception { + JobContainer jobContainer = new JobContainer( + this.configuration); + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("init"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + initMethod.setAccessible(false); + + initMethod = jobContainer.getClass().getDeclaredMethod("split"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + initMethod.setAccessible(false); + + int tasksNumber = this.configuration.getListConfiguration( + CoreConstant.DATAX_JOB_CONTENT).size(); + int averSlicesPerChannel = tasksNumber / channelNumber; + + initMethod = jobContainer.getClass().getDeclaredMethod( + "distributeTasksToTaskGroup", int.class, int.class, + int.class); + initMethod.setAccessible(true); + List taskGroupConfigs = (List) initMethod + .invoke(jobContainer, averSlicesPerChannel, + channelNumber, channelsPerTaskGroupContainer); + initMethod.setAccessible(false); + + Assert.assertEquals("task size check", channelNumber + / channelsPerTaskGroupContainer + + (channelNumber % channelsPerTaskGroupContainer > 0 ? 
1 : 0), + taskGroupConfigs.size()); + int sumSlices = 0; + for (Configuration taskGroupConfig : taskGroupConfigs) { + Assert.assertNotNull("have set taskGroupId", taskGroupConfig + .getInt(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID)); + int channelNo = taskGroupConfig + .getInt(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL); + Assert.assertNotNull("have set task channel number", channelNo); + int taskNumber = taskGroupConfig.getListConfiguration( + CoreConstant.DATAX_JOB_CONTENT).size(); + sumSlices += taskNumber; + Assert.assertTrue("task has average tasks", taskNumber + / channelNo == averSlicesPerChannel); + } + + Assert.assertEquals("slices equal to split sum", tasksNumber, sumSlices); + } + + @Test + public void testErrorLimitIgnoreCheck() throws Exception { + this.configuration.set(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT, -1); + JobContainer jobContainer = new JobContainer( + this.configuration); + + Communication communication = new Communication(); + communication.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, 100); + communication.setLongCounter(CommunicationTool.WRITE_RECEIVED_RECORDS, 100); +// LocalTaskGroupCommunicationManager.updateTaskGroupCommunication(0, communication); + + AbstractContainerCommunicator communicator = PowerMockito.mock(AbstractContainerCommunicator.class); + jobContainer.setContainerCommunicator(communicator); + PowerMockito.when(communicator.collect()).thenReturn(communication); + + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("checkLimit"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer, new Object[] {}); + initMethod.setAccessible(false); + } + + @Test(expected = Exception.class) + public void testErrorLimitPercentCheck() throws Exception { +// this.configuration.set(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT, 0.1); +// this.configuration.set(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_RECORD, null); + this.configuration.remove(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_RECORD); + this.configuration.set(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_PERCENT, 0.1); + JobContainer jobContainer = new JobContainer( + this.configuration); + + Communication communication = new Communication(); + communication.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, 100); + communication.setLongCounter(CommunicationTool.WRITE_RECEIVED_RECORDS, 80); + communication.setLongCounter(CommunicationTool.WRITE_FAILED_RECORDS, 20); +// LocalTaskGroupCommunicationManager.updateTaskGroupCommunication(0, communication); + + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("checkLimit"); + initMethod.setAccessible(true); + initMethod.invoke(jobContainer); + initMethod.setAccessible(false); + } + + @Test(expected = Exception.class) + public void testErrorLimitCountCheck() throws Exception { + this.configuration.remove(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_PERCENT); + this.configuration.set(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_RECORD, 1); + JobContainer jobContainer = new JobContainer( + this.configuration); + + Communication communication = new Communication(); + communication.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, 100); + communication.setLongCounter(CommunicationTool.WRITE_RECEIVED_RECORDS, 98); + communication.setLongCounter(CommunicationTool.WRITE_FAILED_RECORDS, 2); +// LocalTaskGroupCommunicationManager.updateTaskGroupCommunication(0, communication); + + Method initMethod = jobContainer.getClass() + .getDeclaredMethod("checkLimit"); + initMethod.setAccessible(true); + 
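        // errorLimit.record is configured as 1 while the communication above reports 2 failed
        // records; the test expects checkLimit() to raise an exception (expected = Exception.class).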
initMethod.invoke(jobContainer); + initMethod.setAccessible(false); + } + + @Test + public void testStartDryRun() { + String path = JobContainerTest.class.getClassLoader() + .getResource(".").getFile(); + + this.configuration = ConfigParser.parse(path + File.separator + + "dryRunAll.json"); + LoadUtil.bind(this.configuration); + + JobContainer jobContainer = new JobContainer( + this.configuration); + jobContainer.start(); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/container/LoadUtilTest.java b/core/src/test/java/com/alibaba/datax/core/container/LoadUtilTest.java new file mode 100755 index 000000000..e6b60ce4e --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/container/LoadUtilTest.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.core.container; + +import com.alibaba.datax.common.plugin.AbstractJobPlugin; +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import org.junit.Assert; +import org.junit.Test; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.core.util.container.LoadUtil; +import com.alibaba.datax.core.scaffold.ConfigurationProducer; +import com.alibaba.datax.core.scaffold.base.CaseInitializer; +import com.alibaba.fastjson.JSON; + +public class LoadUtilTest extends CaseInitializer { + + @Test + public void test() { + LoadUtil.bind(ConfigurationProducer.produce()); + AbstractJobPlugin jobPlugin = LoadUtil.loadJobPlugin( + PluginType.READER, "fakereader"); + System.out.println(JSON.toJSONString(jobPlugin)); + Assert.assertTrue(jobPlugin.getPluginName().equals("fakereader")); + + AbstractTaskPlugin taskPlugin = LoadUtil.loadTaskPlugin( + PluginType.READER, "fakereader"); + System.out.println(JSON.toJSONString(taskPlugin)); + Assert.assertTrue(taskPlugin.getPluginName().equals("fakereader")); + + } + +} diff --git a/core/src/test/java/com/alibaba/datax/core/container/TaskGroupContainerTest.java b/core/src/test/java/com/alibaba/datax/core/container/TaskGroupContainerTest.java new file mode 100755 index 000000000..9fff704c8 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/container/TaskGroupContainerTest.java @@ -0,0 +1,129 @@ +package com.alibaba.datax.core.container; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.faker.FakeExceptionReader; +import com.alibaba.datax.core.faker.FakeExceptionWriter; +import com.alibaba.datax.core.faker.FakeLongTimeWriter; +import com.alibaba.datax.core.faker.FakeOneReader; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.scaffold.base.CaseInitializer; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.taskgroup.TaskGroupContainer; +import com.alibaba.datax.core.util.ConfigParser; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.util.container.LoadUtil; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.io.File; +import java.util.ArrayList; +import java.util.List; + +public class TaskGroupContainerTest extends CaseInitializer { + private Configuration configuration; + private int taskNumber; + + @Before + public void setUp() { + String path = TaskGroupContainerTest.class.getClassLoader() + .getResource(".").getFile(); + + this.configuration = ConfigParser.parse(path + File.separator + + 
"all.json"); + LoadUtil.bind(this.configuration); + + int channelNumber = 5; + taskNumber = channelNumber + 3; + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, 0); + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, 1); + this.configuration.set( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_SLEEPINTERVAL, 200); + this.configuration.set( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_REPORTINTERVAL, 1000); + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, + channelNumber); + Configuration jobContent = this.configuration.getListConfiguration( + CoreConstant.DATAX_JOB_CONTENT).get(0); + List jobContents = new ArrayList(); + for (int i = 0; i < this.taskNumber; i++) { + Configuration newJobContent = jobContent.clone(); + newJobContent.set(CoreConstant.TASK_ID, i); + jobContents.add(newJobContent); + } + this.configuration.set(CoreConstant.DATAX_JOB_CONTENT, jobContents); + + LocalTGCommunicationManager.clear(); + LocalTGCommunicationManager.registerTaskGroupCommunication( + 1, new Communication()); + + } + + @Test + public void testStart() throws InterruptedException { + TaskGroupContainer taskGroupContainer = new TaskGroupContainer(this.configuration); + taskGroupContainer.start(); + + AbstractContainerCommunicator collector = taskGroupContainer.getContainerCommunicator(); + while (true) { + State totalTaskState = collector.collectState(); + if (totalTaskState.isRunning()) { + Thread.sleep(1000); + } else { + break; + } + } + + Communication totalTaskCommunication = collector.collect(); + List messages = totalTaskCommunication.getMessage("bazhen-reader"); + Assert.assertTrue(!messages.isEmpty()); + + messages = totalTaskCommunication.getMessage("bazhen-writer"); + Assert.assertTrue(!messages.isEmpty()); + + messages = totalTaskCommunication.getMessage("bazhen"); + Assert.assertNull(messages); + + State state = totalTaskCommunication.getState(); + + Assert.assertTrue("task finished", state.equals(State.SUCCEEDED)); + } + + @Test(expected = RuntimeException.class) + public void testReaderException() { + this.configuration.set("plugin.reader.fakereader.class", + FakeExceptionReader.class.getCanonicalName()); + TaskGroupContainer taskGroupContainer = new TaskGroupContainer(this.configuration); + taskGroupContainer.start(); + } + + @Test(expected = RuntimeException.class) + public void testWriterException() { + this.configuration.set("plugin.writer.fakewriter.class", + FakeExceptionWriter.class.getName()); + TaskGroupContainer taskGroupContainer = new TaskGroupContainer(this.configuration); + taskGroupContainer.start(); + } + + @Test + public void testLongTimeWriter() { + this.configuration.set("plugin.writer.fakewriter.class", + FakeOneReader.class.getName()); + this.configuration.set("plugin.writer.fakewriter.class", + FakeLongTimeWriter.class.getName()); + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, + 1); + Configuration jobContent = this.configuration.getListConfiguration( + CoreConstant.DATAX_JOB_CONTENT).get(0); + List jobContents = new ArrayList(); + jobContents.add(jobContent); + this.configuration.set(CoreConstant.DATAX_JOB_CONTENT, jobContents); + + TaskGroupContainer taskGroupContainer = new TaskGroupContainer(this.configuration); + taskGroupContainer.start(); + Assert.assertTrue(State.SUCCEEDED == + taskGroupContainer.getContainerCommunicator().collect().getState()); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/faker/FakeExceptionReader.java 
b/core/src/test/java/com/alibaba/datax/core/faker/FakeExceptionReader.java new file mode 100755 index 000000000..096685a4a --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/faker/FakeExceptionReader.java @@ -0,0 +1,75 @@ +package com.alibaba.datax.core.faker; + +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; + +import java.util.ArrayList; +import java.util.List; + +/** + * Created by jingxing on 14-9-12. + */ +public class FakeExceptionReader extends Reader { + public static final class Job extends Reader.Job { + @Override + public List split(int adviceNumber) { + Configuration jobParameter = this.getPluginJobConf(); + System.out.println(jobParameter); + + List splitConfigurationList = new ArrayList(); + for (int i = 0; i < 1024; i++) { + Configuration oneConfig = Configuration.newDefault(); + List jdbcUrlArray = new ArrayList(); + jdbcUrlArray.add(String.format( + "jdbc:mysql://localhost:3305/db%04d", i)); + oneConfig.set("jdbcUrl", jdbcUrlArray); + + List tableArray = new ArrayList(); + tableArray.add(String.format("jingxing_%04d", i)); + oneConfig.set("table", tableArray); + + splitConfigurationList.add(oneConfig); + } + + return splitConfigurationList; + } + + @Override + public void init() { + System.out.println("fake reader job initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake reader job destroyed!"); + } + } + + public static final class Task extends Reader.Task { + @Override + public void startRead(RecordSender lineSender) { + throw new RuntimeException("just for test"); + } + + @Override + public void prepare() { + System.out.println("fake reader task prepared!"); + } + + @Override + public void post() { + System.out.println("fake reader task posted!"); + } + + @Override + public void init() { + System.out.println("fake reader task initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake reader task destroyed!"); + } + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/faker/FakeExceptionWriter.java b/core/src/test/java/com/alibaba/datax/core/faker/FakeExceptionWriter.java new file mode 100755 index 000000000..812e2055c --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/faker/FakeExceptionWriter.java @@ -0,0 +1,76 @@ +package com.alibaba.datax.core.faker; + +import java.util.ArrayList; +import java.util.List; + +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; + +/** + * Created by jingxing on 14-9-12. 
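+ * A fake Writer whose Job splits into 1024 dummy ODPS configurations and whose Task throws a RuntimeException in startWrite(), used to exercise writer-failure handling in the task group tests.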
+ */ +public class FakeExceptionWriter extends Writer { + public static final class Job extends Writer.Job { + + @Override + public List split(int readerSlicesNumber) { + Configuration jobParameter = this.getPluginJobConf(); + System.out.println(jobParameter); + + List splitConfigurationList = new ArrayList(); + for(int i=0; i<1024; i++) { + Configuration oneConfig = Configuration.newDefault(); + List jdbcUrlArray = new ArrayList(); + jdbcUrlArray.add(String.format("odps://localhost:3305/db%04d", i)); + oneConfig.set("odpsUrl", jdbcUrlArray); + + List tableArray = new ArrayList(); + tableArray.add(String.format("odps_jingxing_%04d", i)); + oneConfig.set("table", tableArray); + + splitConfigurationList.add(oneConfig); + } + + return splitConfigurationList; + } + + @Override + public void init() { + System.out.println("fake writer job initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake writer job destroyed!"); + } + } + + public static final class Task extends Writer.Task { + + @Override + public void startWrite(RecordReceiver lineReceiver) { + throw new RuntimeException("just for test"); + } + + @Override + public void prepare() { + System.out.println("fake writer task prepared!"); + } + + @Override + public void post() { + System.out.println("fake writer task posted!"); + } + + @Override + public void init() { + System.out.println("fake writer task initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake writer task destroyed!"); + } + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/faker/FakeJobContainer.java b/core/src/test/java/com/alibaba/datax/core/faker/FakeJobContainer.java new file mode 100755 index 000000000..6809e6675 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/faker/FakeJobContainer.java @@ -0,0 +1,18 @@ +package com.alibaba.datax.core.faker; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.AbstractContainer; + +/** + * Created by jingxing on 14-9-25. + */ +public class FakeJobContainer extends AbstractContainer { + public FakeJobContainer(Configuration configuration) { + super(configuration); + } + + @Override + public void start() { + System.out.println("Fake Job start .."); + } +} \ No newline at end of file diff --git a/core/src/test/java/com/alibaba/datax/core/faker/FakeLongTimeWriter.java b/core/src/test/java/com/alibaba/datax/core/faker/FakeLongTimeWriter.java new file mode 100755 index 000000000..48c8207f9 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/faker/FakeLongTimeWriter.java @@ -0,0 +1,86 @@ +package com.alibaba.datax.core.faker; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; + +import java.util.ArrayList; +import java.util.List; + +/** + * Created by jingxing on 14/12/24. 
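+ * A fake Writer that drains all incoming records and then sleeps twice for 10 seconds in startWrite(), used to simulate a slow writer.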
+ */ +public class FakeLongTimeWriter extends Writer { + public static final class Job extends Writer.Job { + + @Override + public List split(int readerSlicesNumber) { + Configuration jobParameter = this.getPluginJobConf(); + System.out.println(jobParameter); + + List splitConfigurationList = new ArrayList(); + Configuration oneConfig = Configuration.newDefault(); + List jdbcUrlArray = new ArrayList(); + jdbcUrlArray.add(String.format("odps://localhost:3305/db%04d", 0)); + oneConfig.set("odpsUrl", jdbcUrlArray); + + List tableArray = new ArrayList(); + tableArray.add(String.format("odps_jingxing_%04d", 0)); + oneConfig.set("table", tableArray); + + splitConfigurationList.add(oneConfig); + + return splitConfigurationList; + } + + @Override + public void init() { + System.out.println("fake writer job initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake writer job destroyed!"); + } + } + + public static final class Task extends Writer.Task { + + @Override + public void startWrite(RecordReceiver lineReceiver) { + Record record = null; + while ((record = lineReceiver.getFromReader()) != null) { + + } + for(int i=0; i<2; i++) { + System.out.println("writer sleep 10s ..."); + try { + Thread.sleep(10000); + } catch (InterruptedException e) { + e.printStackTrace(); + } + } + } + + @Override + public void prepare() { + System.out.println("fake writer task prepared!"); + } + + @Override + public void post() { + System.out.println("fake writer task posted!"); + } + + @Override + public void init() { + System.out.println("fake writer task initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake writer task destroyed!"); + } + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/faker/FakeOneReader.java b/core/src/test/java/com/alibaba/datax/core/faker/FakeOneReader.java new file mode 100755 index 000000000..b97af5ea1 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/faker/FakeOneReader.java @@ -0,0 +1,82 @@ +package com.alibaba.datax.core.faker; + +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.FrameworkErrorCode; + +import java.util.ArrayList; +import java.util.List; + +/** + * Created by jingxing on 14/12/24. 
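+ * A fake Reader whose Job produces a single split and whose Task sends one LongColumn(1L) record ten times.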
+ */ +public class FakeOneReader extends Reader { + public static final class Job extends Reader.Job { + @Override + public List split(int adviceNumber) { + Configuration jobParameter = this.getPluginJobConf(); + System.out.println(jobParameter); + + List splitConfigurationList = new ArrayList(); + Configuration oneConfig = Configuration.newDefault(); + List jdbcUrlArray = new ArrayList(); + jdbcUrlArray.add(String.format( + "jdbc:mysql://localhost:3305/db%04d", 0)); + oneConfig.set("jdbcUrl", jdbcUrlArray); + + List tableArray = new ArrayList(); + tableArray.add(String.format("jingxing_%04d", 0)); + oneConfig.set("table", tableArray); + + splitConfigurationList.add(oneConfig); + + return splitConfigurationList; + } + + @Override + public void init() { + System.out.println("fake reader job initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake reader job destroyed!"); + } + } + + public static final class Task extends Reader.Task { + @Override + public void startRead(RecordSender lineSender) { + Record record = lineSender.createRecord(); + record.addColumn(new LongColumn(1L)); + + for (int i = 0; i < 10; i++) { + lineSender.sendToWriter(record); + } + } + + @Override + public void prepare() { + System.out.println("fake reader task prepared!"); + } + + @Override + public void post() { + System.out.println("fake reader task posted!"); + } + + @Override + public void init() { + System.out.println("fake reader task initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake reader task destroyed!"); + } + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/faker/FakeReader.java b/core/src/test/java/com/alibaba/datax/core/faker/FakeReader.java new file mode 100755 index 000000000..085b71aa0 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/faker/FakeReader.java @@ -0,0 +1,116 @@ +package com.alibaba.datax.core.faker; + +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.FrameworkErrorCode; + +import java.util.ArrayList; +import java.util.List; + +/** + * Created by jingxing on 14-9-2. 
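+ * A fake Reader with 1024 fake MySQL splits; its Task sends ten records and also exercises the dirty-record and message collectors, and its Job defines pre/post handlers.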
+ */ +public class FakeReader extends Reader { + public static final class Job extends Reader.Job { + @Override + public List split(int adviceNumber) { + Configuration jobParameter = this.getPluginJobConf(); + System.out.println(jobParameter); + + List splitConfigurationList = new ArrayList(); + for (int i = 0; i < 1024; i++) { + Configuration oneConfig = Configuration.newDefault(); + List jdbcUrlArray = new ArrayList(); + jdbcUrlArray.add(String.format( + "jdbc:mysql://localhost:3305/db%04d", i)); + oneConfig.set("jdbcUrl", jdbcUrlArray); + + List tableArray = new ArrayList(); + tableArray.add(String.format("jingxing_%04d", i)); + oneConfig.set("table", tableArray); + + splitConfigurationList.add(oneConfig); + } + + return splitConfigurationList; + } + + @Override + public void init() { + System.out.println("fake reader job initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake reader job destroyed!"); + } + + public void preHandler(Configuration jobConfiguration){ + jobConfiguration.set("job.preHandler.test","readPreDone"); + } + + public void postHandler(Configuration jobConfiguration){ + jobConfiguration.set("job.postHandler.test","readPostDone"); + } + } + + public static final class Task extends Reader.Task { + @Override + public void startRead(RecordSender lineSender) { + Record record = lineSender.createRecord(); + record.addColumn(new LongColumn(1L)); + + for (int i = 0; i < 10; i++) { + lineSender.sendToWriter(record); + } + + for (int i = 0; i < 10; i++) { + this.getTaskPluginCollector().collectDirtyRecord( + record, + DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "EXCEPTION MSG"), "ERROR MSG"); + } + + for (int i = 0; i < 10; i++) { + this.getTaskPluginCollector().collectDirtyRecord(record, + "ERROR MSG"); + } + + for (int i = 0; i < 10; i++) { + this.getTaskPluginCollector().collectDirtyRecord( + record, + DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "EXCEPTION MSG")); + } + + for (int i = 0; i < 10; i++) { + this.getTaskPluginCollector().collectMessage("bazhen-reader", + "bazhen"); + } + } + + @Override + public void prepare() { + System.out.println("fake reader task prepared!"); + } + + @Override + public void post() { + System.out.println("fake reader task posted!"); + } + + @Override + public void init() { + System.out.println("fake reader task initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake reader task destroyed!"); + } + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/faker/FakeWriter.java b/core/src/test/java/com/alibaba/datax/core/faker/FakeWriter.java new file mode 100755 index 000000000..44cb5ea69 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/faker/FakeWriter.java @@ -0,0 +1,100 @@ +package com.alibaba.datax.core.faker; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.core.util.FrameworkErrorCode; + +import java.util.ArrayList; +import java.util.List; + +/** + * Created by jingxing on 14-9-2. 
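+ * A fake Writer with 1024 fake ODPS splits; its Task reports every received record as dirty and emits ten collector messages, and its Job defines pre/post handlers.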
+ */ +public class FakeWriter extends Writer { + public static final class Job extends Writer.Job { + + @Override + public List split(int readerSlicesNumber) { + Configuration jobParameter = this.getPluginJobConf(); + System.out.println(jobParameter); + + List splitConfigurationList = new ArrayList(); + for (int i = 0; i < 1024; i++) { + Configuration oneConfig = Configuration.newDefault(); + List jdbcUrlArray = new ArrayList(); + jdbcUrlArray.add(String.format("odps://localhost:3305/db%04d", + i)); + oneConfig.set("odpsUrl", jdbcUrlArray); + + List tableArray = new ArrayList(); + tableArray.add(String.format("odps_jingxing_%04d", i)); + oneConfig.set("table", tableArray); + + splitConfigurationList.add(oneConfig); + } + + return splitConfigurationList; + } + + @Override + public void init() { + System.out.println("fake writer job initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake writer job destroyed!"); + } + + public void preHandler(Configuration jobConfiguration){ + jobConfiguration.set("job.preHandler.test","writePreDone"); + } + + public void postHandler(Configuration jobConfiguration){ + jobConfiguration.set("job.postHandler.test","writePostDone"); + } + } + + public static final class Task extends Writer.Task { + + @Override + public void startWrite(RecordReceiver lineReceiver) { + Record record = null; + + while ((record = lineReceiver.getFromReader()) != null) { + this.getTaskPluginCollector().collectDirtyRecord( + record, + DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "TEST"), "TEST"); + } + + for (int i = 0; i < 10; i++) { + this.getTaskPluginCollector().collectMessage("bazhen-writer", + "bazhen"); + } + } + + @Override + public void prepare() { + System.out.println("fake writer task prepared!"); + } + + @Override + public void post() { + System.out.println("fake writer task posted!"); + } + + @Override + public void init() { + System.out.println("fake writer task initialized!"); + } + + @Override + public void destroy() { + System.out.println("fake writer task destroyed!"); + } + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/scaffold/ColumnProducer.java b/core/src/test/java/com/alibaba/datax/core/scaffold/ColumnProducer.java new file mode 100755 index 000000000..5e73b8649 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/scaffold/ColumnProducer.java @@ -0,0 +1,30 @@ +package com.alibaba.datax.core.scaffold; + +import com.alibaba.datax.common.element.BoolColumn; +import com.alibaba.datax.common.element.BytesColumn; +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.DateColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.StringColumn; + +public class ColumnProducer { + public static Column produceLongColumn(int i) { + return new LongColumn(i); + } + + public static Column produceStringColumn(String s) { + return new StringColumn(s); + } + + public static Column produceDateColumn(long time) { + return new DateColumn(time); + } + + public static Column produceBytesColumn(byte[] bytes) { + return new BytesColumn(bytes); + } + + public static Column produceBoolColumn(boolean bool) { + return new BoolColumn(bool); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/scaffold/ConfigurationProducer.java b/core/src/test/java/com/alibaba/datax/core/scaffold/ConfigurationProducer.java new file mode 100755 index 000000000..427923ec0 --- /dev/null +++ 
b/core/src/test/java/com/alibaba/datax/core/scaffold/ConfigurationProducer.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.core.scaffold; + +import java.io.File; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.ConfigParser; + +public final class ConfigurationProducer { + + public static Configuration produce() { + String path = ConfigurationProducer.class.getClassLoader() + .getResource(".").getFile(); + return ConfigParser.parse(path + File.separator + "all.json"); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/scaffold/RecordProducer.java b/core/src/test/java/com/alibaba/datax/core/scaffold/RecordProducer.java new file mode 100755 index 000000000..0758d794a --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/scaffold/RecordProducer.java @@ -0,0 +1,25 @@ +package com.alibaba.datax.core.scaffold; + +import java.io.UnsupportedEncodingException; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.core.transport.record.DefaultRecord; + +public class RecordProducer { + public static Record produceRecord() { + + try { + Record record = new DefaultRecord(); + record.addColumn(ColumnProducer.produceLongColumn(1)); + record.addColumn(ColumnProducer.produceStringColumn("bazhen")); + record.addColumn(ColumnProducer.produceBoolColumn(true)); + record.addColumn(ColumnProducer.produceDateColumn(System + .currentTimeMillis())); + record.addColumn(ColumnProducer.produceBytesColumn("bazhen" + .getBytes("utf-8"))); + return record; + } catch (UnsupportedEncodingException e) { + throw new IllegalArgumentException(e); + } + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/scaffold/base/CaseInitializer.java b/core/src/test/java/com/alibaba/datax/core/scaffold/base/CaseInitializer.java new file mode 100755 index 000000000..5641dcf22 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/scaffold/base/CaseInitializer.java @@ -0,0 +1,39 @@ +package com.alibaba.datax.core.scaffold.base; + +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.StringUtils; +import org.junit.BeforeClass; + +import java.io.File; + +public class CaseInitializer { + @BeforeClass + public static void beforeClass() { + CoreConstant.DATAX_HOME = CaseInitializer.class.getClassLoader() + .getResource(".").getFile(); + + CoreConstant.DATAX_CONF_PATH = StringUtils.join(new String[]{ + CoreConstant.DATAX_HOME, "conf", "core.json"}, File.separator); + + CoreConstant.DATAX_CONF_LOG_PATH = StringUtils.join( + new String[] { CoreConstant.DATAX_HOME, "conf", "logback.xml" }, File.separator); + + CoreConstant.DATAX_PLUGIN_HOME = StringUtils.join( + new String[] { CoreConstant.DATAX_HOME, "plugin" }, File.separator); + + CoreConstant.DATAX_PLUGIN_READER_HOME = StringUtils.join( + new String[] { CoreConstant.DATAX_HOME, "plugin", "reader" }, File.separator); + + CoreConstant.DATAX_PLUGIN_WRITER_HOME = StringUtils.join( + new String[] { CoreConstant.DATAX_HOME, "plugin", "writer" }, File.separator); + + CoreConstant.DATAX_BIN_HOME = StringUtils.join(new String[] { + CoreConstant.DATAX_HOME, "bin" }, File.separator); + + CoreConstant.DATAX_JOB_HOME = StringUtils.join(new String[] { + CoreConstant.DATAX_HOME, "job" }, File.separator); + + CoreConstant.DATAX_SECRET_PATH = StringUtils.join(new String[] { + CoreConstant.DATAX_HOME, "conf", ".secret.properties"}, File.separator); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/scheduler/ErrorRecordLimitTest.java 
b/core/src/test/java/com/alibaba/datax/core/scheduler/ErrorRecordLimitTest.java new file mode 100755 index 000000000..98582c94e --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/scheduler/ErrorRecordLimitTest.java @@ -0,0 +1,57 @@ +package com.alibaba.datax.core.scheduler; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.util.ErrorRecordChecker; +import org.junit.Test; + +public class ErrorRecordLimitTest { + @Test(expected = DataXException.class) + public void testCheckRecordLimit() throws Exception { + ErrorRecordChecker errLimit = new ErrorRecordChecker(0L, 0.5); + errLimit.checkRecordLimit(new Communication() { + { + this.setLongCounter(CommunicationTool.WRITE_FAILED_RECORDS, 1); + } + }); + } + + @Test + public void testCheckRecordLimit2() throws Exception { + ErrorRecordChecker errLimit = new ErrorRecordChecker(1L, 0.5); + errLimit.checkRecordLimit(new Communication() { + { + this.setLongCounter(CommunicationTool.WRITE_FAILED_RECORDS, 1); + } + }); + } + + @Test + public void testCheckRecordLimit3() throws Exception { + // 百分数无效 + ErrorRecordChecker errLimit = new ErrorRecordChecker(1L, 0.05); + errLimit.checkPercentageLimit(new Communication() { + { + this.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, 100); + this.setLongCounter(CommunicationTool.WRITE_FAILED_RECORDS, 50); + } + }); + } + + @Test(expected = IllegalArgumentException.class) + public void testInvalidConstruction() throws Exception { + new ErrorRecordChecker(-1L, 0.1); + } + + @Test(expected = IllegalArgumentException.class) + public void testInvalidConstruction2() throws Exception { + new ErrorRecordChecker(0L, -0.1); + } + + @Test(expected = IllegalArgumentException.class) + public void testInvalidConstruction3() throws Exception { + new ErrorRecordChecker(0L, 1.1); + } + +} \ No newline at end of file diff --git a/core/src/test/java/com/alibaba/datax/core/scheduler/standalone/StandAloneSchedulerTest.java b/core/src/test/java/com/alibaba/datax/core/scheduler/standalone/StandAloneSchedulerTest.java new file mode 100755 index 000000000..b8e1d8df8 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/scheduler/standalone/StandAloneSchedulerTest.java @@ -0,0 +1,64 @@ +package com.alibaba.datax.core.scheduler.standalone; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.meta.ExecuteMode; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.job.scheduler.processinner.ProcessInnerScheduler; +import com.alibaba.datax.core.job.scheduler.processinner.StandAloneScheduler; +import com.alibaba.datax.core.scaffold.base.CaseInitializer; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; +import com.alibaba.datax.core.statistics.container.communicator.job.StandAloneJobContainerCommunicator; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang3.RandomUtils; +import org.junit.Test; +import org.powermock.api.mockito.PowerMockito; + +import java.util.ArrayList; +import java.util.List; + +import static org.mockito.Matchers.anyListOf; + +public class StandAloneSchedulerTest extends CaseInitializer { + + @Test + public void testSchedule() throws NoSuchFieldException, IllegalAccessException { + int taskNumber = 10; + List 
jobList = new ArrayList(); + + List internal = new ArrayList(); + int randomSize = 20; + int length = RandomUtils.nextInt(0, randomSize)+1; + for (int i = 0; i < length; i++) { + internal.add(Configuration.newDefault()); + } + + LocalTGCommunicationManager.clear(); + for (int i = 0; i < taskNumber; i++) { + Configuration configuration = Configuration.newDefault(); + configuration + .set(CoreConstant.DATAX_CORE_CONTAINER_JOB_REPORTINTERVAL, + 11); + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, 0); + configuration.set(CoreConstant.DATAX_JOB_CONTENT, internal); + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_MODE, ExecuteMode.STANDALONE.getValue()); + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, i); + jobList.add(configuration); + LocalTGCommunicationManager.registerTaskGroupCommunication(i,new Communication()); + } + + StandAloneJobContainerCommunicator standAloneJobContainerCommunicator = PowerMockito. + mock(StandAloneJobContainerCommunicator.class); + ProcessInnerScheduler scheduler = PowerMockito.spy(new StandAloneScheduler(standAloneJobContainerCommunicator)); + + PowerMockito.doNothing().when(scheduler).startAllTaskGroup(anyListOf(Configuration.class)); + + Communication communication = new Communication(); + communication.setState(State.SUCCEEDED); + PowerMockito.when(standAloneJobContainerCommunicator.collect()). + thenReturn(communication); + PowerMockito.doNothing().when(standAloneJobContainerCommunicator).report(communication); + + scheduler.schedule(jobList); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/scheduler/standalone/StandAloneTestJobCollector.java b/core/src/test/java/com/alibaba/datax/core/scheduler/standalone/StandAloneTestJobCollector.java new file mode 100755 index 000000000..2b919a320 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/scheduler/standalone/StandAloneTestJobCollector.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.core.scheduler.standalone; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.container.collector.AbstractCollector; + +import java.util.List; + +public class StandAloneTestJobCollector extends AbstractCollector { + + public void registerCommunication(List configurationList) { + System.out.println("register ok"); + } + + public void report(Communication communication) { + System.out.println("job report 2"); + } + + public Communication collect() { + return new Communication() {{ + this.setState(State.SUCCEEDED); + }}; + } + + @Override + public void registerTGCommunication(List taskGroupConfigurationList) { + + } + + @Override + public void registerTaskCommunication(List taskConfigurationList) { + + } + + @Override + public Communication collectFromTask() { + return null; + } + + @Override + public Communication collectFromTaskGroup() { + return null; + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/scheduler/standalone/StandAloneTestTaskGroupContainer.java b/core/src/test/java/com/alibaba/datax/core/scheduler/standalone/StandAloneTestTaskGroupContainer.java new file mode 100755 index 000000000..8a4e65cb1 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/scheduler/standalone/StandAloneTestTaskGroupContainer.java @@ -0,0 +1,23 @@ +package com.alibaba.datax.core.scheduler.standalone; + +import com.alibaba.datax.common.util.Configuration; +import 
com.alibaba.datax.core.taskgroup.TaskGroupContainer; + +/** + * Created by jingxing on 14-9-4. + */ +public class StandAloneTestTaskGroupContainer extends TaskGroupContainer { + public StandAloneTestTaskGroupContainer(Configuration configuration) { + super(configuration); + } + + @Override + public void start() { + try { + Thread.sleep(200); + } catch (InterruptedException e) { + e.printStackTrace(); + } + System.out.println("start standAlone test task container"); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/statistics/collector/ProcessInnerCollectorTest.java b/core/src/test/java/com/alibaba/datax/core/statistics/collector/ProcessInnerCollectorTest.java new file mode 100755 index 000000000..49fb0d575 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/statistics/collector/ProcessInnerCollectorTest.java @@ -0,0 +1,36 @@ +package com.alibaba.datax.core.statistics.collector; + +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; +import com.alibaba.datax.core.statistics.container.collector.ProcessInnerCollector; +import com.alibaba.datax.core.util.ReflectUtil; +import org.junit.Assert; +import org.junit.Test; + +import java.util.concurrent.ConcurrentHashMap; + +/** + * Created by hongjiao.hj on 2014/12/21. + */ +public class ProcessInnerCollectorTest { + @Test + public void testCollectFromTaskGroup() throws NoSuchFieldException, IllegalAccessException { + Integer taskGroupId_1 = 1; + Integer taskGroupId_2 = 2; + Communication communication_1 = new Communication(); + communication_1.setLongCounter("totalBytes",888); + Communication communication_2 = new Communication(); + communication_2.setLongCounter("totalBytes",112); + + ConcurrentHashMap taskGroupCommunicationMap = new ConcurrentHashMap(); + taskGroupCommunicationMap.put(taskGroupId_1,communication_1); + taskGroupCommunicationMap.put(taskGroupId_2,communication_2); + + ReflectUtil.setField(new LocalTGCommunicationManager(),"taskGroupCommunicationMap",taskGroupCommunicationMap); + + ProcessInnerCollector processInnerCollector = new ProcessInnerCollector(0L); + Communication comm = processInnerCollector.collectFromTaskGroup(); + Assert.assertTrue(comm.getLongCounter("totalBytes") == 1000); + System.out.println(comm); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/statistics/communication/CommunicationJsonifyTest.java b/core/src/test/java/com/alibaba/datax/core/statistics/communication/CommunicationJsonifyTest.java new file mode 100755 index 000000000..c1f2ee23a --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/statistics/communication/CommunicationJsonifyTest.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.core.statistics.communication; + +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.JSONObject; +import org.junit.Assert; +import org.junit.Test; + +public class CommunicationJsonifyTest { + @Test + public void testJsonGetSnapshot() { + Communication communication = new Communication(); + communication.setLongCounter(CommunicationTool.STAGE, 10); + communication.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, 100); + communication.setLongCounter(CommunicationTool.READ_SUCCEED_BYTES, 102400); + communication.setLongCounter(CommunicationTool.BYTE_SPEED, 10240); + communication.setLongCounter(CommunicationTool.RECORD_SPEED, 100); + communication.setDoubleCounter(CommunicationTool.PERCENTAGE, 0.1); + 
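+ // mark the communication as a running job and add writer-side counters before taking the JSON snapshot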
communication.setState(State.RUNNING); + communication.setLongCounter(CommunicationTool.WRITE_RECEIVED_RECORDS, 99); + communication.setLongCounter(CommunicationTool.WRITE_RECEIVED_BYTES, 102300); + + String jsonString = CommunicationTool.Jsonify.getSnapshot(communication); + JSONObject metricJson = JSON.parseObject(jsonString); + + Assert.assertEquals(communication.getLongCounter(CommunicationTool.RECORD_SPEED), + metricJson.getLong("speedRecords")); + Assert.assertEquals(communication.getDoubleCounter(CommunicationTool.PERCENTAGE), + metricJson.getDouble("percentage")); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/statistics/communication/CommunicationTest.java b/core/src/test/java/com/alibaba/datax/core/statistics/communication/CommunicationTest.java new file mode 100755 index 000000000..2f3fdc6c4 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/statistics/communication/CommunicationTest.java @@ -0,0 +1,122 @@ +package com.alibaba.datax.core.statistics.communication; + +import com.alibaba.datax.core.job.meta.State; +import org.junit.Assert; +import org.junit.Test; + +public class CommunicationTest { + @Test + public void commonTest() { + Communication comm = new Communication(); + + String longKey = "long"; + Assert.assertEquals(0, (long) comm.getLongCounter(longKey)); + long longValue = 100; + comm.setLongCounter(longKey, longValue); + String doubleKey = "double"; + Assert.assertEquals(0.0, comm.getDoubleCounter(doubleKey), 0.01); + double doubleValue = 0.01d; + comm.setDoubleCounter(doubleKey, doubleValue); + comm.setState(State.SUCCEEDED); + comm.setThrowable(new RuntimeException("runtime exception")); + long now = System.currentTimeMillis(); + comm.setTimestamp(now); + + Assert.assertEquals(longValue, (long) comm.getLongCounter("long")); + Assert.assertEquals(doubleValue, comm.getDoubleCounter("double"), 0.01); + Assert.assertTrue(State.SUCCEEDED.equals(comm.getState())); + Assert.assertTrue(comm.getThrowable() instanceof RuntimeException); + Assert.assertEquals(now, comm.getTimestamp()); + + comm.reset(); + Assert.assertTrue(State.RUNNING.equals(comm.getState())); + Assert.assertTrue(comm.getTimestamp() >= now); + + long delta = 5; + comm.increaseCounter(longKey, delta); + Assert.assertEquals(delta, (long) comm.getLongCounter(longKey)); + + String messageKey = "message"; + comm.addMessage(messageKey, "message1"); + comm.addMessage(messageKey, "message2"); + Assert.assertEquals(2, comm.getMessage(messageKey).size()); + } + + @Test + public void setStateTest() { + Communication comm = new Communication(); + Assert.assertTrue(State.RUNNING.equals(comm.getState())); + + comm.setState(State.SUCCEEDED); + Assert.assertTrue(State.SUCCEEDED.equals(comm.getState())); + + comm.setState(State.FAILED); + Assert.assertTrue(State.FAILED.equals(comm.getState())); + + comm.setState(State.SUCCEEDED); + Assert.assertTrue(State.FAILED.equals(comm.getState())); + } + + @Test + public void cloneTest() { + Communication comm0 = new Communication(); + String longKey = "long"; + long longValue = 5; + long timestamp = comm0.getTimestamp(); + comm0.setLongCounter(longKey, longValue); + comm0.setState(State.SUCCEEDED); + + Communication comm1 = comm0.clone(); + + Assert.assertEquals(longValue, (long) comm1.getLongCounter(longKey)); + Assert.assertEquals(timestamp, comm1.getTimestamp()); + Assert.assertTrue(comm0.getState().equals(comm1.getState())); + } + + @Test + public void mergeTest() { + Communication comm1 = new Communication(); + comm1.setLongCounter("long", 5); + 
comm1.setDoubleCounter("double", 5.1); + comm1.setState(State.SUCCEEDED); + comm1.setThrowable(new RuntimeException("run exception")); + comm1.addMessage("message", "message1"); + + Communication comm2 = new Communication(); + comm2.setLongCounter("long", 5); + comm2.setDoubleCounter("double", 5.1); + comm2.setState(State.FAILED); + comm2.setThrowable(new IllegalArgumentException("")); + comm2.addMessage("message", "message2"); + + Communication comm = comm1.mergeFrom(comm2); + Assert.assertEquals(10, (long) comm.getLongCounter("long")); + Assert.assertEquals(10.2, comm.getDoubleCounter("double"), 0.01); + Assert.assertTrue(State.FAILED.equals(comm.getState())); + Assert.assertTrue(comm.getThrowable() instanceof RuntimeException); + Assert.assertEquals(2, comm.getMessage("message").size()); + } + + @Test + public void testMergeStateFrom() { + Communication comm1 = new Communication(); + Communication comm2 = new Communication(); + + comm1.setState(State.FAILED, true); + comm2.setState(State.KILLED, true); + Assert.assertTrue(comm1.mergeStateFrom(comm2) == State.FAILED); + + comm1.setState(State.RUNNING, true); + comm2.setState(State.SUCCEEDED, true); + Assert.assertTrue(comm1.mergeStateFrom(comm2) == State.RUNNING); + + comm1.setState(State.FAILED, true); + comm2.setState(State.SUCCEEDED, true); + Assert.assertTrue(comm1.mergeStateFrom(comm2) == State.FAILED); + + comm1.setState(State.SUBMITTING, true); + comm2.setState(State.SUCCEEDED, true); + Assert.assertTrue(comm1.mergeStateFrom(comm2) == State.RUNNING); + + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/statistics/communication/LocalTaskGroupCommunicationTest.java b/core/src/test/java/com/alibaba/datax/core/statistics/communication/LocalTaskGroupCommunicationTest.java new file mode 100755 index 000000000..6323a544c --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/statistics/communication/LocalTaskGroupCommunicationTest.java @@ -0,0 +1,48 @@ +package com.alibaba.datax.core.statistics.communication; + +import com.alibaba.datax.core.job.meta.State; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +public class LocalTaskGroupCommunicationTest { + private final int taskGroupNumber = 5; + + @Before + public void setUp() { + LocalTGCommunicationManager.clear(); + for (int index = 0; index < taskGroupNumber; index++) { + LocalTGCommunicationManager.registerTaskGroupCommunication( + index, new Communication()); + } + } + + @Test + public void LocalCommunicationTest() { + Communication jobCommunication = + LocalTGCommunicationManager.getJobCommunication(); + Assert.assertTrue(jobCommunication.getState().equals(State.RUNNING)); + + for (int index : LocalTGCommunicationManager.getTaskGroupIdSet()) { + Communication communication = LocalTGCommunicationManager + .getTaskGroupCommunication(index); + communication.setState(State.SUCCEEDED); + LocalTGCommunicationManager.updateTaskGroupCommunication( + index, communication); + } + + jobCommunication = LocalTGCommunicationManager.getJobCommunication(); + Assert.assertTrue(jobCommunication.getState().equals(State.SUCCEEDED)); + } + + @Test(expected = IllegalArgumentException.class) + public void noTaskGroupIdForUpdate() { + LocalTGCommunicationManager.updateTaskGroupCommunication( + this.taskGroupNumber + 1, new Communication()); + } + + @Test(expected = IllegalArgumentException.class) + public void noTaskGroupIdForGet() { + LocalTGCommunicationManager.getTaskGroupCommunication(-1); + } +} diff --git 
a/core/src/test/java/com/alibaba/datax/core/statistics/reporter/ProcessInnerReporterTest.java b/core/src/test/java/com/alibaba/datax/core/statistics/reporter/ProcessInnerReporterTest.java new file mode 100755 index 000000000..52e016f29 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/statistics/reporter/ProcessInnerReporterTest.java @@ -0,0 +1,43 @@ +package com.alibaba.datax.core.statistics.reporter; + +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; +import com.alibaba.datax.core.statistics.container.report.ProcessInnerReporter; +import com.alibaba.datax.core.util.ReflectUtil; +import org.junit.Assert; +import org.junit.Test; + +import java.util.concurrent.ConcurrentHashMap; + + +public class ProcessInnerReporterTest { + + @Test + public void testReportJobCommunication() { + Long jobId = 0L; + Communication communication = new Communication(); + + ProcessInnerReporter processInnerReporter = new ProcessInnerReporter(); + processInnerReporter.reportJobCommunication(jobId,communication); + System.out.println("this function do noting"); + } + + @Test + public void testReportTGCommunication() throws NoSuchFieldException, IllegalAccessException { + Integer taskGroupId = 1; + Communication communication = new Communication(); + communication.setState(State.SUBMITTING); + + ConcurrentHashMap map = new ConcurrentHashMap(); + map.put(taskGroupId,communication); + + ReflectUtil.setField(new LocalTGCommunicationManager(),"taskGroupCommunicationMap",map); + ProcessInnerReporter processInnerReporter = new ProcessInnerReporter(); + + Communication updateCommunication = new Communication(); + updateCommunication.setState(State.WAITING); + processInnerReporter.reportTGCommunication(taskGroupId,updateCommunication); + Assert.assertEquals(map.get(taskGroupId).getState(),State.WAITING); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/taskgroup/TaskMonitorTest.java b/core/src/test/java/com/alibaba/datax/core/taskgroup/TaskMonitorTest.java new file mode 100644 index 000000000..f4fa89421 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/taskgroup/TaskMonitorTest.java @@ -0,0 +1,192 @@ +package com.alibaba.datax.core.taskgroup; + +import com.alibaba.datax.core.job.meta.State; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.lang.reflect.Field; +import java.util.concurrent.ConcurrentHashMap; + +public class TaskMonitorTest { + + TaskMonitor taskMonitor = TaskMonitor.getInstance(); + private ConcurrentHashMap tasks; + + @Before + public void setUp() throws Exception { + Field tasks = taskMonitor.getClass().getDeclaredField("tasks"); + tasks.setAccessible(true); + tasks.set(taskMonitor, new ConcurrentHashMap()); + this.tasks = (ConcurrentHashMap) tasks.get(taskMonitor); + } + + @Test + public void testNormal() throws Exception { + + //register task + long ttl = System.currentTimeMillis(); + + Communication communication1 = new Communication(); + + taskMonitor.registerTask(1, communication1); + + TaskMonitor.TaskCommunication taskCommunication1 = taskMonitor.getTaskCommunication(1); + + Assert.assertEquals(taskCommunication1.getLastAllReadRecords(), 0L); + Assert.assertEquals(this.tasks.size(), 1); + + 
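+ // registering the task should stamp its last-update timestamp and ttl with a time no earlier than the one captured above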
Assert.assertTrue(taskCommunication1.getLastUpdateComunicationTS() >= ttl); + Assert.assertTrue(taskCommunication1.getTtl() >= ttl); + + // report 没有任何变化的communication + + long oldTS = taskCommunication1.getLastUpdateComunicationTS(); + long oldTTL = taskCommunication1.getTtl(); + Thread.sleep(1000); + + taskMonitor.report(1, communication1); + + TaskMonitor.TaskCommunication taskCommunication1_1 = taskMonitor.getTaskCommunication(1); + + Assert.assertEquals(taskCommunication1_1.getLastAllReadRecords(), 0L); + Assert.assertEquals(taskCommunication1_1.getLastUpdateComunicationTS(), oldTS); + Assert.assertTrue(taskCommunication1_1.getTtl() > oldTTL); + + // report 已经finish的communication + Communication communication2 = new Communication(); + communication2.setState(State.KILLED); + + taskMonitor.registerTask(2, communication2); + Assert.assertEquals(this.tasks.size(), 1); + + + // report 另一个communication + Communication communication3 = new Communication(); + taskMonitor.registerTask(3, communication3); + + Assert.assertEquals(this.tasks.size(), 2); + System.out.println(this.tasks); + + //report communication + + ttl = System.currentTimeMillis(); + + communication1.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, 100); + communication3.setLongCounter(CommunicationTool.READ_FAILED_RECORDS, 10); + + taskMonitor.report(1, communication1); + taskMonitor.report(3, communication3); + + taskCommunication1 = taskMonitor.getTaskCommunication(1); + + Assert.assertEquals(taskCommunication1.getLastAllReadRecords(), 100L); + Assert.assertEquals(this.tasks.size(), 2); + + Assert.assertTrue(taskCommunication1.getLastUpdateComunicationTS() >= ttl); + Assert.assertTrue(taskCommunication1.getTtl() >= ttl); + + TaskMonitor.TaskCommunication taskCommunication3 = taskMonitor.getTaskCommunication(3); + + Assert.assertEquals(taskCommunication3.getLastAllReadRecords(), 10L); + Assert.assertEquals(this.tasks.size(), 2); + + Assert.assertTrue(taskCommunication3.getLastUpdateComunicationTS() >= ttl); + Assert.assertTrue(taskCommunication3.getTtl() >= ttl); + + //继续report + ttl = System.currentTimeMillis(); + + communication1.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, 1001); + communication3.setLongCounter(CommunicationTool.READ_FAILED_RECORDS, 101); + + taskMonitor.report(1, communication1); + taskMonitor.report(3, communication3); + + taskCommunication1 = taskMonitor.getTaskCommunication(1); + + Assert.assertEquals(taskCommunication1.getLastAllReadRecords(), 1001L); + Assert.assertEquals(this.tasks.size(), 2); + + Assert.assertTrue(taskCommunication1.getLastUpdateComunicationTS() >= ttl); + Assert.assertTrue(taskCommunication1.getTtl() >= ttl); + + taskCommunication3 = taskMonitor.getTaskCommunication(3); + + Assert.assertEquals(taskCommunication3.getLastAllReadRecords(), 101L); + Assert.assertEquals(this.tasks.size(), 2); + + Assert.assertTrue(taskCommunication3.getLastUpdateComunicationTS() >= ttl); + Assert.assertTrue(taskCommunication3.getTtl() >= ttl); + + // 设置EXPIRED_TIME + Field EXPIRED_TIME = taskMonitor.getClass().getDeclaredField("EXPIRED_TIME"); + EXPIRED_TIME.setAccessible(true); + EXPIRED_TIME.set(null, 1000); + + Thread.sleep(2000); + + //超时没有变更 + taskMonitor.report(1, communication1); + + System.out.println(communication1.getCounter()); + System.out.println(communication1.getThrowable()); + System.out.println(communication1.getThrowableMessage()); + System.out.println(communication1.getState()); + + Assert.assertTrue(communication1.getThrowableMessage().contains("任务hung住,Expired")); 
+ Assert.assertEquals(communication1.getState(), State.FAILED); + + // communicatio1 已经fail, communication3 在超时后进行变更,update正常 + ttl = System.currentTimeMillis(); + + communication1.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, 2001); + communication3.setLongCounter(CommunicationTool.READ_FAILED_RECORDS, 201); + + taskMonitor.report(1, communication1); + taskMonitor.report(3, communication3); + + + taskCommunication1 = taskMonitor.getTaskCommunication(1); + + Assert.assertEquals(taskCommunication1.getLastAllReadRecords(), 1001L); + Assert.assertEquals(this.tasks.size(), 2); + + Assert.assertTrue(communication1.getThrowableMessage().contains("任务hung住,Expired")); + Assert.assertEquals(communication1.getState(), State.FAILED); + + taskCommunication3 = taskMonitor.getTaskCommunication(3); + + Assert.assertEquals(taskCommunication3.getLastAllReadRecords(), 201L); + Assert.assertEquals(this.tasks.size(), 2); + + Assert.assertTrue(taskCommunication3.getLastUpdateComunicationTS() >= ttl); + Assert.assertTrue(taskCommunication3.getTtl() >= ttl); + + + //remove 1 + taskMonitor.removeTask(1); + Assert.assertEquals(this.tasks.size(), 1); + + //remove 3 + taskMonitor.removeTask(3); + Assert.assertEquals(this.tasks.size(), 0); + + // 没有register communication3 直接report + ttl = System.currentTimeMillis(); + + communication3.setLongCounter(CommunicationTool.READ_FAILED_RECORDS, 301); + + taskMonitor.report(3, communication3); + + taskCommunication3 = taskMonitor.getTaskCommunication(3); + + Assert.assertEquals(taskCommunication3.getLastAllReadRecords(), 301L); + Assert.assertEquals(this.tasks.size(), 1); + + Assert.assertTrue(taskCommunication3.getLastUpdateComunicationTS() >= ttl); + Assert.assertTrue(taskCommunication3.getTtl() >= ttl); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/transport/channel/memory/MemoryChannelTest.java b/core/src/test/java/com/alibaba/datax/core/transport/channel/memory/MemoryChannelTest.java new file mode 100755 index 000000000..90de8a805 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/transport/channel/memory/MemoryChannelTest.java @@ -0,0 +1,174 @@ +package com.alibaba.datax.core.transport.channel.memory; + +import java.util.ArrayList; +import java.util.List; + +import com.alibaba.datax.core.statistics.communication.Communication; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.scaffold.ConfigurationProducer; +import com.alibaba.datax.core.scaffold.RecordProducer; +import com.alibaba.datax.core.scaffold.base.CaseInitializer; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.record.TerminateRecord; +import com.alibaba.datax.core.util.container.CoreConstant; + +public class MemoryChannelTest extends CaseInitializer { + private Channel channel; + + @Before + public void before() { + System.out.println(ConfigurationProducer.produce().toJSON()); + Configuration configuration = ConfigurationProducer.produce(); + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, 0); + this.channel = new MemoryChannel(configuration); + this.channel.setCommunication(new Communication()); + } + + // 测试SEQ + @Test + public void test_seq() { + int capacity = 4; + + Record record = null; + for (int i = 0; i < capacity; i++) { + record = RecordProducer.produceRecord(); + record.setColumn(0, 
new LongColumn(i)); + this.channel.push(record); + } + + for (int i = 0; i < capacity; i++) { + record = this.channel.pull(); + System.out.println(record.getColumn(0).asLong()); + Assert.assertTrue(record.getColumn(0).asLong() == i); + } + + List records = new ArrayList(capacity); + for (int i = 0; i < capacity; i++) { + record = RecordProducer.produceRecord(); + record.setColumn(0, new LongColumn(i)); + records.add(record); + } + this.channel.pushAll(records); + + this.channel.pullAll(records); + System.out.println(records.size()); + for (int i = 0; i < capacity; i++) { + System.out.println(records.get(i).getColumn(0).asLong()); + Assert.assertTrue(records.get(i).getColumn(0).asLong() == i); + } + } + + @Test + public void test_Block() throws InterruptedException { + int tryCount = 100; + int capacity = ConfigurationProducer.produce().getInt( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY); + + System.out.println("capacity: " + capacity); + + Thread thread = new Thread(new Consumer(this.channel, tryCount * capacity)); + thread.start(); + + List records = new ArrayList(capacity); + for (int i = 0; i < capacity; i++) { + Record record = RecordProducer.produceRecord(); + record.setColumn(0, new LongColumn(i)); + records.add(record); + } + + for (int i = 0; i < tryCount; i++) { + this.channel.pushAll(records); + } + + Thread.sleep(5000L); + + List termindateRecords = new ArrayList(); + termindateRecords.add(TerminateRecord.get()); + this.channel.pushAll(termindateRecords); + + Thread.sleep(1000L); + + thread.join(); + + } + + @Test + public void test_BlockAndSeq() throws InterruptedException { + int tryCount = 100; + int capacity = ConfigurationProducer.produce().getInt( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY); + + System.out.println("capacity: " + capacity); + + Thread thread = new Thread(new Consumer(this.channel, tryCount * capacity)); + thread.start(); + + List records = new ArrayList(capacity); + for (int i = 0; i < capacity; i++) { + Record record = RecordProducer.produceRecord(); + record.setColumn(0, new LongColumn(i)); + records.add(record); + } + + for (int i = 0; i < tryCount; i++) { + this.channel.pushAll(records); + } + + Thread.sleep(5000L); + + this.channel.push(TerminateRecord.get()); + + Thread.sleep(1000L); + + thread.join(); + + } +} + +class Consumer implements Runnable { + + private Channel channel = null; + + private int needCapacity = 0; + + public Consumer(Channel channel, int needCapacity) { + this.channel = channel; + this.needCapacity = needCapacity; + return; + } + + @Override + public void run() { + List records = new ArrayList(); + + boolean isTermindate = false; + int counter = 0; + + while (true) { + this.channel.pullAll(records); + for (final Record each : records) { + if (each == TerminateRecord.get()) { + isTermindate = true; + break; + } + counter++; + continue; + } + + if (isTermindate) { + break; + } + } + + System.out.println(String.format("Need %d, Get %d .", needCapacity, + counter)); + Assert.assertTrue(counter == needCapacity); + } + +} diff --git a/core/src/test/java/com/alibaba/datax/core/transport/exchanger/RecordExchangerTest.java b/core/src/test/java/com/alibaba/datax/core/transport/exchanger/RecordExchangerTest.java new file mode 100755 index 000000000..455125919 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/transport/exchanger/RecordExchangerTest.java @@ -0,0 +1,293 @@ +package com.alibaba.datax.core.transport.exchanger; + +import com.alibaba.datax.common.element.Column; +import 
com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.scaffold.ColumnProducer; +import com.alibaba.datax.core.scaffold.ConfigurationProducer; +import com.alibaba.datax.core.scaffold.RecordProducer; +import com.alibaba.datax.core.scaffold.base.CaseInitializer; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.channel.memory.MemoryChannel; +import com.alibaba.datax.core.transport.record.DefaultRecord; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; +import org.mockito.ArgumentCaptor; + +import static org.mockito.Mockito.*; +import static org.powermock.api.mockito.PowerMockito.mock; + +public class RecordExchangerTest extends CaseInitializer { + + private Configuration configuration = null; + + @Before + public void before() { + this.configuration = ConfigurationProducer.produce(); + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, 1); + return; + } + + + @Test + public void testMemeroySize() throws Exception { + Column longColumn = ColumnProducer.produceLongColumn(1); + Column longColumn2 = new LongColumn("234567891"); + Column stringColumn= ColumnProducer.produceStringColumn("sringtest"); + Column boolColumn=ColumnProducer.produceBoolColumn(true); + Column dateColumn = ColumnProducer.produceDateColumn(System.currentTimeMillis()); + Column bytesColumn = ColumnProducer.produceBytesColumn("test".getBytes("utf-8")); + Assert.assertEquals(longColumn.getByteSize(),8); + Assert.assertEquals(longColumn2.getByteSize(),9); + Assert.assertEquals(stringColumn.getByteSize(),9); + Assert.assertEquals(boolColumn.getByteSize(),1); + Assert.assertEquals(dateColumn.getByteSize(),8); + Assert.assertEquals(bytesColumn.getByteSize(),4); + + Record record = new DefaultRecord(); + record.addColumn(longColumn); + record.addColumn(longColumn2); + record.addColumn(stringColumn); + record.addColumn(boolColumn); + record.addColumn(dateColumn); + record.addColumn(bytesColumn); + + Assert.assertEquals(record.getByteSize(),39); + // record classSize = 80 + // column classSize = 6*24 + Assert.assertEquals(record.getMemorySize(),263); + + } + + @Test + public void test_Exchanger() { + Channel channel = new MemoryChannel(configuration); + channel.setCommunication(new Communication()); + + int capacity = 10; + Record record = null; + RecordExchanger recordExchanger = new RecordExchanger(channel); + + for (int i = 0; i < capacity; i++) { + record = RecordProducer.produceRecord(); + record.setColumn(0, new LongColumn(i)); + recordExchanger.sendToWriter(record); + } + + System.out.println("byteSize=" + record.getByteSize()); + System.out.println("meorySize=" + record.getMemorySize()); + + channel.close(); + + int counter = 0; + while ((record = recordExchanger.getFromReader()) != null) { + System.out.println(record.getColumn(0).toString()); + Assert.assertTrue(record.getColumn(0).asLong() == counter); + counter++; + } + + Assert.assertTrue(capacity == counter); + } + + @Test + public void test_BufferExchanger() { + + Configuration configuration = ConfigurationProducer.produce(); + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, 1); + + Channel channel = new MemoryChannel(configuration); + 
channel.setCommunication(new Communication()); + + TaskPluginCollector pluginCollector = mock(TaskPluginCollector.class); + int capacity = 10; + Record record = null; + BufferedRecordExchanger recordExchanger = new BufferedRecordExchanger( + channel,pluginCollector); + + for (int i = 0; i < capacity; i++) { + record = RecordProducer.produceRecord(); + record.setColumn(0, new LongColumn(i)); + recordExchanger.sendToWriter(record); + } + + recordExchanger.flush(); + + channel.close(); + + int counter = 0; + while ((record = recordExchanger.getFromReader()) != null) { + System.out.println(record.getColumn(0).toString()); + Assert.assertTrue(record.getColumn(0).asLong() == counter); + counter++; + } + + System.out.println(String.format("Capacity: %d Counter: %d .", + capacity, counter)); + Assert.assertTrue(capacity == counter); + } + + @Test + public void test_BufferExchanger_单条超过buffer的脏数据() throws Exception { + + Configuration configuration = ConfigurationProducer.produce(); + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, 1); + + //测试单挑记录超过buffer大小 + configuration.set(CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE, 3); + + TaskPluginCollector pluginCollector = mock(TaskPluginCollector.class); + int capacity = 10; + Record record = null; + + Channel channel2 = new MemoryChannel(configuration); + channel2.setCommunication(new Communication()); + BufferedRecordExchanger recordExchanger2 = new BufferedRecordExchanger( + channel2,pluginCollector); + + for (int i = 0; i < capacity; i++) { + record = RecordProducer.produceRecord(); + record.setColumn(0, new LongColumn(i)); + recordExchanger2.sendToWriter(record); + } + + ArgumentCaptor rgArg = ArgumentCaptor.forClass(Record.class); + ArgumentCaptor eArg = ArgumentCaptor.forClass(Exception.class); + + verify(pluginCollector,times(10)).collectDirtyRecord(rgArg.capture(), eArg.capture()); + + recordExchanger2.flush(); + + channel2.close(); + + int counter = 0; + while ((record = recordExchanger2.getFromReader()) != null) { + System.out.println(record.getColumn(0).toString()); + Assert.assertTrue(record.getColumn(0).asLong() == counter); + counter++; + } + + System.out.println(String.format("Capacity: %d Counter: %d .", + capacity, counter)); + Assert.assertTrue(counter == 0); + + } + + @Test + public void test_BufferExchanger_不满32条到达buffer大小() throws Exception { + + Configuration configuration = ConfigurationProducer.produce(); + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, 1); + configuration.set(CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE, 500); + + TaskPluginCollector pluginCollector = mock(TaskPluginCollector.class); + final int capacity = 10; + Record record = null; + + //测试单挑记录超过buffer大小 + + Channel channel3 = new MemoryChannel(configuration); + channel3.setCommunication(new Communication()); + final BufferedRecordExchanger recordExchangerWriter = new BufferedRecordExchanger( + channel3,pluginCollector); + + final BufferedRecordExchanger recordExchangerReader = new BufferedRecordExchanger( + channel3,pluginCollector); + + final BufferedRecordExchanger spy1=spy(recordExchangerWriter); + + Thread t = new Thread(new Runnable() { + @Override + public void run() { + int counter = 0; + Record record; + while ((record = recordExchangerReader.getFromReader()) != null) { + System.out.println(record.getColumn(0).toString()); + Assert.assertTrue(record.getColumn(0).asLong() == counter); + counter++; + } + + System.out.println(String.format("Capacity: %d Counter: %d .", + capacity, counter)); + 
Assert.assertTrue(capacity == counter); + } + }); + t.start(); + + for (int i = 0; i < capacity; i++) { + record = RecordProducer.produceRecord(); + record.setColumn(0, new LongColumn(i)); + spy1.sendToWriter(record); + } + + spy1.flush(); + + channel3.close(); + + t.join(); + + verify(spy1,times(5)).flush(); + + } + + @Test + public void test_BufferExchanger_每条大小刚好是buffersize() throws Exception { + + Configuration configuration = ConfigurationProducer.produce(); + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, 1); + configuration.set(CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE, 229); + + TaskPluginCollector pluginCollector = mock(TaskPluginCollector.class); + final int capacity = 10; + Record record = null; + + //测试单挑记录超过buffer大小 + + Channel channel3 = new MemoryChannel(configuration); + channel3.setCommunication(new Communication()); + final BufferedRecordExchanger recordExchangerWriter = new BufferedRecordExchanger( + channel3,pluginCollector); + + final BufferedRecordExchanger recordExchangerReader = new BufferedRecordExchanger( + channel3,pluginCollector); + + final BufferedRecordExchanger spy1=spy(recordExchangerWriter); + + Thread t = new Thread(new Runnable() { + @Override + public void run() { + int counter = 0; + Record record; + while ((record = recordExchangerReader.getFromReader()) != null) { + System.out.println(record.getColumn(0).toString()); + Assert.assertTrue(record.getColumn(0).asLong() == counter); + counter++; + } + + System.out.println(String.format("Capacity: %d Counter: %d .", + capacity, counter)); + Assert.assertTrue(capacity == counter); + } + }); + t.start(); + + for (int i = 0; i < capacity; i++) { + record = RecordProducer.produceRecord(); + record.setColumn(0, new LongColumn(i)); + spy1.sendToWriter(record); + } + + spy1.flush(); + + channel3.close(); + + t.join(); + + verify(spy1,times(10)).flush(); + + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/transport/record/RecordTest.java b/core/src/test/java/com/alibaba/datax/core/transport/record/RecordTest.java new file mode 100755 index 000000000..7c588b365 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/transport/record/RecordTest.java @@ -0,0 +1,20 @@ +package com.alibaba.datax.core.transport.record; + +import org.junit.Assert; +import org.junit.Test; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.scaffold.RecordProducer; + +public class RecordTest { + @Test + public void test() { + Record record = RecordProducer.produceRecord(); + System.out.println(record.toString()); + + Configuration configuration = Configuration.from(record.toString()); + Assert.assertTrue(configuration.getInt("size") == 5); + Assert.assertTrue(configuration.getList("data").size() == 5); + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/util/ClassUtilTest.java b/core/src/test/java/com/alibaba/datax/core/util/ClassUtilTest.java new file mode 100755 index 000000000..af3ec99aa --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/util/ClassUtilTest.java @@ -0,0 +1,53 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.core.scaffold.base.CaseInitializer; +import org.junit.Assert; +import org.junit.Test; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.AbstractContainer; + +public final class ClassUtilTest extends CaseInitializer { + @Test + public void test() { + + Assert.assertTrue(ClassUtil.instantiate( + 
Dummy.class.getCanonicalName(), Dummy.class) != null); + + Dummy dummy = ClassUtil.instantiate(Dummy.class.getCanonicalName(), + Dummy.class); + Assert.assertTrue(dummy instanceof Dummy); + + String dataXServerJson = "{\n" + + "\t\"core\": {\n" + + "\t\t\"dataXServer\": {\n" + + "\t\t\t\"address\": \"http://localhost/test\",\n" + + "\t\t\t\"timeout\": 5000\n" + + "\t\t}\n" + + "\t}\n" + + "}"; + Assert.assertTrue(ClassUtil.instantiate( + DummyContainer.class.getCanonicalName(), DummyContainer.class, + Configuration.from(dataXServerJson)) instanceof DummyContainer); + + Assert.assertTrue(ClassUtil.instantiate( + DummyContainer.class.getCanonicalName(), DummyContainer.class, + Configuration.from(dataXServerJson)) instanceof DummyContainer); + } +} + +class DummyContainer extends AbstractContainer { + public DummyContainer(Configuration configuration) { + super(configuration); + } + + @Override + public void start() { + System.out.println(getConfiguration()); + } +} + +class Dummy { + public Dummy() { + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/util/ConfigParserTest.java b/core/src/test/java/com/alibaba/datax/core/util/ConfigParserTest.java new file mode 100755 index 000000000..87e2e2908 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/util/ConfigParserTest.java @@ -0,0 +1,162 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.scaffold.base.CaseInitializer; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.StringUtils; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.io.*; +import java.lang.reflect.InvocationTargetException; +import java.net.URISyntaxException; +import java.util.HashMap; +import java.util.Map; +import java.util.Properties; + +public class ConfigParserTest extends CaseInitializer { + private String jobPath; + + @Before + public void setUp() { + String path = ConfigParserTest.class.getClassLoader() + .getResource(".").getFile(); + this.jobPath = path + File.separator + + "job" + File.separator + "job.json"; + } + @Test + public void test() throws URISyntaxException { + Configuration configuration = ConfigParser.parse(jobPath); + System.out.println(configuration.toJSON()); + + Assert.assertTrue(configuration.getList("job.content").size() == 2); + Assert.assertTrue(configuration.getString("job.content[0].reader.name") + .equals("fakereader")); + Assert.assertTrue(configuration.getString("job.content[1].reader.name") + .equals("fakereader")); + Assert.assertTrue(configuration.getString("job.content[0].writer.name") + .equals("fakewriter")); + Assert.assertTrue(configuration.getString("job.content[1].writer.name") + .equals("fakewriter")); + + System.out.println(configuration.getConfiguration("plugin").toJSON()); + + configuration = configuration.getConfiguration("plugin"); + Assert.assertTrue(configuration.getString("reader.fakereader.name") + .equals("fakereader")); + Assert.assertTrue(configuration.getString("writer.fakewriter.name") + .equals("fakewriter")); + } + + @Test + public void secretTest() throws NoSuchMethodException, InvocationTargetException, IllegalAccessException { + String password = "password"; + String accessKey = "accessKey"; + String readerParamPath = + "job.content[0].reader.parameter"; + String writerParamPath = + "job.content[1].writer.parameter"; + + Map secretMap = getPublicKeyMap(); + String keyVersion = 
null; + for(String version : secretMap.keySet()) { + keyVersion = version; + break; + } + + Configuration config = ConfigParser.parse(jobPath); + config.set(CoreConstant.DATAX_JOB_SETTING_KEYVERSION, + keyVersion); + config.set(readerParamPath+".*password", + SecretUtil.encrypt(password, secretMap.get(keyVersion), SecretUtil.KEY_ALGORITHM_RSA)); + config.set(readerParamPath+".*long", 100); + config.set(writerParamPath+".*accessKey", + SecretUtil.encrypt(accessKey, secretMap.get(keyVersion), SecretUtil.KEY_ALGORITHM_RSA)); + config.set(writerParamPath+".*long", 200); + + config = SecretUtil.decryptSecretKey(config); + + Assert.assertTrue(password.equals( + config.getString(readerParamPath+".password"))); + Assert.assertTrue(config.isSecretPath( + readerParamPath+".password")); + Assert.assertTrue(config.get(readerParamPath+".*long") != null); + Assert.assertTrue(accessKey.equals( + config.getString(writerParamPath+".accessKey"))); + Assert.assertTrue(config.isSecretPath( + writerParamPath+".accessKey")); + Assert.assertTrue(config.get(writerParamPath+".*long") != null); + Assert.assertTrue(StringUtils.isBlank( + config.getString(readerParamPath+".*password"))); + Assert.assertTrue(StringUtils.isBlank( + config.getString(writerParamPath+".*accessKey"))); + } + + private Map getPublicKeyMap() { + Map versionKeyMap = + new HashMap(); + InputStream secretStream = null; + try { + secretStream = new FileInputStream( + CoreConstant.DATAX_SECRET_PATH); + } catch (FileNotFoundException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + "DataX配置要求加解密,但无法找到密钥的配置文件"); + } + + Properties properties = new Properties(); + try { + properties.load(secretStream); + secretStream.close(); + } catch (IOException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "读取加解密配置文件出错", e); + } + + String lastKeyVersion = properties.getProperty( + CoreConstant.LAST_KEYVERSION); + String lastPublicKey = properties.getProperty( + CoreConstant.LAST_PUBLICKEY); + String lastPrivateKey = properties.getProperty( + CoreConstant.LAST_PRIVATEKEY); + if(StringUtils.isNotBlank(lastKeyVersion)) { + if(StringUtils.isBlank(lastPublicKey) || + StringUtils.isBlank(lastPrivateKey)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + "DataX配置要求加解密,但上次配置的公私钥对存在为空的情况" + ); + } + + versionKeyMap.put(lastKeyVersion, lastPublicKey); + } + + String currentKeyVersion = properties.getProperty( + CoreConstant.CURRENT_KEYVERSION); + String currentPublicKey = properties.getProperty( + CoreConstant.CURRENT_PUBLICKEY); + String currentPrivateKey = properties.getProperty( + CoreConstant.CURRENT_PRIVATEKEY); + if(StringUtils.isNotBlank(currentKeyVersion)) { + if(StringUtils.isBlank(currentPublicKey) || + StringUtils.isBlank(currentPrivateKey)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + "DataX配置要求加解密,但当前配置的公私钥对存在为空的情况"); + } + + versionKeyMap.put(currentKeyVersion, currentPublicKey); + } + + if(versionKeyMap.size() <= 0) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + "DataX配置要求加解密,但无法找到公私钥"); + } + + return versionKeyMap; + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/util/HttpClientUtilTest.java b/core/src/test/java/com/alibaba/datax/core/util/HttpClientUtilTest.java new file mode 100755 index 000000000..9a103290f --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/util/HttpClientUtilTest.java @@ -0,0 +1,175 @@ +package com.alibaba.datax.core.util; + +import 
com.alibaba.datax.common.exception.DataXException; +import org.apache.http.HttpEntity; +import org.apache.http.HttpStatus; +import org.apache.http.StatusLine; +import org.apache.http.client.methods.CloseableHttpResponse; +import org.apache.http.client.methods.HttpGet; +import org.apache.http.client.methods.HttpRequestBase; +import org.apache.http.impl.client.CloseableHttpClient; +import org.junit.Assert; +import org.junit.Test; +import org.mockito.Mockito; +import org.mockito.invocation.InvocationOnMock; +import org.mockito.stubbing.Answer; +import org.powermock.api.mockito.PowerMockito; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.ByteArrayInputStream; +import java.io.InputStream; +import java.net.URI; +import java.net.URISyntaxException; +import java.text.SimpleDateFormat; +import java.util.Date; + +import static org.mockito.Matchers.any; + +public class HttpClientUtilTest { + + private static final Logger LOGGER = LoggerFactory.getLogger(HttpClientUtilTest.class); + + @Test + public void testExecuteAndGet() throws Exception { + HttpGet httpGet = HttpClientUtil.getGetRequest(); + httpGet.setURI(new URI("http://127.0.0.1:8080")); + + CloseableHttpClient httpClient = PowerMockito.mock(CloseableHttpClient.class); + CloseableHttpResponse response = PowerMockito.mock(CloseableHttpResponse.class); + + StatusLine statusLine = PowerMockito.mock(StatusLine.class); + PowerMockito.when(statusLine.getStatusCode()).thenReturn(HttpStatus.SC_BAD_REQUEST); + PowerMockito.when(response.getStatusLine()).thenReturn(statusLine); + PowerMockito.when(httpClient.execute(any(HttpRequestBase.class))).thenReturn(response); + + try { + HttpClientUtil client = HttpClientUtil.getHttpClientUtil(); + ReflectUtil.setField(client, "httpClient", httpClient); + client.executeAndGet(httpGet); + } catch (Exception e) { + LOGGER.info("msg:" + e.getMessage(), e); + Assert.assertNotNull(e); + Assert.assertEquals("Response Status Code : 400", e.getMessage()); + + } + + try { + PowerMockito.when(statusLine.getStatusCode()).thenReturn(HttpStatus.SC_OK); + PowerMockito.when(response.getEntity()).thenReturn(null); + HttpClientUtil client = HttpClientUtil.getHttpClientUtil(); + ReflectUtil.setField(client, "httpClient", httpClient); + client.executeAndGet(httpGet); + } catch (Exception e) { + LOGGER.info("msg:" + e.getMessage(), e); + Assert.assertNotNull(e); + Assert.assertEquals("Response Entity Is Null", e.getMessage()); + + } + + InputStream is = new ByteArrayInputStream("abc".getBytes()); + HttpEntity entity = PowerMockito.mock(HttpEntity.class); + PowerMockito.when(response.getEntity()).thenReturn(entity); + PowerMockito.when(entity.getContent()).thenReturn(is); + HttpClientUtil client = HttpClientUtil.getHttpClientUtil(); + ReflectUtil.setField(client, "httpClient", httpClient); + String result = client.executeAndGet(httpGet); + Assert.assertEquals("abc", result); + LOGGER.info("result:" + result); + + } + + + @Test + public void testExecuteAndGetWithRetry() throws Exception { + String url = "http://127.0.0.1/:8080"; + HttpRequestBase httpRequestBase = new HttpGet(url); + + HttpClientUtil httpClientUtil = PowerMockito.spy(HttpClientUtil.getHttpClientUtil()); + + + PowerMockito.doAnswer(new Answer() { + @Override + public Object answer(InvocationOnMock invocationOnMock) throws Throwable { + System.out.println("失败第1次"); + return new Exception("失败第1次"); + } + }).doAnswer(new Answer() { + @Override + public Object answer(InvocationOnMock invocationOnMock) throws Throwable { + 
System.out.println("失败第2次"); + return new Exception("失败第2次"); + } + }).doAnswer(new Answer() { + @Override + public Object answer(InvocationOnMock invocationOnMock) throws Throwable { + System.out.println("失败第3次"); + return new Exception("失败第3次"); + } + }).doAnswer(new Answer() { + @Override + public Object answer(InvocationOnMock invocationOnMock) throws Throwable { + System.out.println("失败第4次"); + return new Exception("失败第4次"); + } + }) + .doReturn("成功") + .when(httpClientUtil).executeAndGet(any(HttpRequestBase.class)); + + + String str = httpClientUtil.executeAndGetWithRetry(httpRequestBase, 5, 1000l); + Assert.assertEquals(str, "成功"); + + try { + httpClientUtil.executeAndGetWithRetry(httpRequestBase, 2, 1000l); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + } + httpClientUtil.destroy(); + } + + /** + * 和测试方法一样:testExecuteAndGetWithRetry(),只是换了一种写法,直接采用 Mockito 进行验证的。 + */ + @Test + public void testExecuteAndGetWithRetry2() throws Exception { + String url = "http://127.0.0.1/:8080"; + HttpRequestBase httpRequestBase = new HttpGet(url); + + HttpClientUtil httpClientUtil = Mockito.spy(HttpClientUtil.getHttpClientUtil()); + + Mockito.doThrow(new Exception("one")).doThrow(new Exception("two")).doThrow(new Exception("three")).doReturn("成功").when(httpClientUtil).executeAndGet(httpRequestBase); + + String str = httpClientUtil.executeAndGetWithRetry(httpRequestBase, 4, 1000l); + Assert.assertEquals(str, "成功"); + + + Mockito.reset(httpClientUtil); + + Mockito.doThrow(new Exception("one")).doThrow(new Exception("two")).doThrow(new Exception("three")).doReturn("成功").when(httpClientUtil).executeAndGet(httpRequestBase); + try { + httpClientUtil.executeAndGetWithRetry(httpRequestBase, 2, 1000l); + } catch (Exception e) { + Assert.assertTrue(e instanceof DataXException); + Assert.assertTrue(e.getMessage().contains("two")); + } + httpClientUtil.destroy(); + } + +// 单独运行可以成功 +// private String url = "http://aBadAddress:8080/"; +// +// @Rule +// public ExpectedException expectedException = ExpectedException.none(); +// +// @Test +// public void testExecuteAndGetWithRetry_exception() throws Exception { +// HttpRequestBase httpRequestBase = new HttpGet(url); +// +// HttpClientUtil httpClientUtil = HttpClientUtil.getHttpClientUtil(); +// +// expectedException.expect(UnknownHostException.class); +// httpClientUtil.executeAndGetWithRetry(httpRequestBase, 3, 1000L); +// } + +} diff --git a/core/src/test/java/com/alibaba/datax/core/util/ReflectUtil.java b/core/src/test/java/com/alibaba/datax/core/util/ReflectUtil.java new file mode 100755 index 000000000..111377ce2 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/util/ReflectUtil.java @@ -0,0 +1,43 @@ +package com.alibaba.datax.core.util; + +import java.lang.reflect.Field; + +/** + * Created by hongjiao.hj on 2014/12/17. 
+ */ +public class ReflectUtil { + + public static void setField(Object targetObj, String name, Object obj) throws NoSuchFieldException, IllegalAccessException { + //Class clazz = targetObj.getClass(); + Field field = getDeclaredField(targetObj,name); + field.setAccessible(true); + field.set(targetObj, obj); + } + + public static Object getField(Object targetObj, String name, Object obj) throws NoSuchFieldException, IllegalAccessException { + Class clazz = targetObj.getClass(); + Field field = clazz.getDeclaredField(name); + field.setAccessible(true); + return field.get(targetObj); + } + + + private static Field getDeclaredField(Object object, String fieldName){ + Field field = null ; + + Class clazz = object.getClass() ; + + for(; clazz != Object.class ; clazz = clazz.getSuperclass()) { + try { + field = clazz.getDeclaredField(fieldName) ; + return field ; + } catch (Exception e) { + //这里什么都不要做!并且这里的异常必须这样写,不能抛出去。 + //如果这里的异常打印或者往外抛,则就不会执行clazz = clazz.getSuperclass(),最后就不会进入到父类中了 + + } + } + + return null; + } +} diff --git a/core/src/test/java/com/alibaba/datax/core/util/SecretUtilTest.java b/core/src/test/java/com/alibaba/datax/core/util/SecretUtilTest.java new file mode 100755 index 000000000..f0c4d03f5 --- /dev/null +++ b/core/src/test/java/com/alibaba/datax/core/util/SecretUtilTest.java @@ -0,0 +1,107 @@ +package com.alibaba.datax.core.util; + +import java.lang.reflect.Field; + +import org.junit.Assert; +import org.junit.Test; + +/** + * Created by jingxing on 14/12/15. + */ +public class SecretUtilTest { + + @Test + public void testRsa() throws Exception { + String[] keys = SecretUtil.initKey(); + String publicKey = keys[0]; + String privateKey = keys[1]; + System.out.println("publicKey:" + publicKey); + System.out.println("privateKey:" + privateKey); + String data = "阿里巴巴DataX"; + + System.out.println("【加密前】:" + data); + + // 加密 + String cipherText = SecretUtil.encrypt(data, publicKey, + SecretUtil.KEY_ALGORITHM_RSA); + System.out.println("【加密后】:" + cipherText); + + // 解密 + String plainText = SecretUtil.decrypt(cipherText, privateKey, + SecretUtil.KEY_ALGORITHM_RSA); + System.out.println("【解密后】:" + plainText); + + Assert.assertTrue(plainText.equals(data)); + } + + @Test + public void testDes() throws Exception { + String keyContent = "datax&cdp&dsc"; + System.out.println("keyContent:" + keyContent); + String data = "阿里巴巴DataX"; + + System.out.println("【加密前】:" + data); + + // 加密 + String cipherText = SecretUtil.encrypt(data, keyContent, + SecretUtil.KEY_ALGORITHM_3DES); + System.out.println("【加密后】:" + cipherText); + + // 解密 + String plainText = SecretUtil.decrypt(cipherText, keyContent, + SecretUtil.KEY_ALGORITHM_3DES); + System.out.println("【解密后】:" + plainText); + + Assert.assertTrue(plainText.equals(data)); + } + + @Test + public void testPythonSwitchJava() { + String data = "rootroot"; + String key = "abcabcabcabcabcabcabcabc"; + String cipherText = SecretUtil.encrypt3DES(data, key); + System.out.println(String.format( + "data[%s] : key[%s] -> cipherText[%s]", data, key, cipherText)); + Assert.assertTrue("svj4x04Oaq6WhrfZVsSRqA==".equals(cipherText)); + Assert.assertTrue(data.equals(SecretUtil.decrypt3DES(cipherText, key))); + + data = "root"; + key = "abcabcabcabcabcabcabcabc"; + cipherText = SecretUtil.encrypt3DES(data, key); + System.out.println(String.format( + "data[%s] : key[%s] -> cipherText[%s]", data, key, cipherText)); + Assert.assertTrue("0Y08MKGhNIw=".equals(cipherText)); + Assert.assertTrue(data.equals(SecretUtil.decrypt3DES(cipherText, key))); + + data = 
"rootroot"; + key = "abc"; + cipherText = SecretUtil.encrypt3DES(data, key); + System.out.println(String.format( + "data[%s] : key[%s] -> cipherText[%s]", data, key, cipherText)); + Assert.assertTrue("dUTw4gQQ30qtMDBX0lTpmg==".equals(cipherText)); + Assert.assertTrue(data.equals(SecretUtil.decrypt3DES(cipherText, key))); + + data = "root"; + key = "abc"; + cipherText = SecretUtil.encrypt3DES(data, key); + System.out.println(String.format( + "data[%s] : key[%s] -> cipherText[%s]", data, key, cipherText)); + Assert.assertTrue("TRc4s8bf9Yg=".equals(cipherText)); + Assert.assertTrue(data.equals(SecretUtil.decrypt3DES(cipherText, key))); + + data = "rootrootrootroot"; + key = "abc"; + cipherText = SecretUtil.encrypt3DES(data, key); + System.out.println(String.format( + "data[%s] : key[%s] -> cipherText[%s]", data, key, cipherText)); + Assert.assertTrue("dUTw4gQQ30p1RPDiBBDfSq0wMFfSVOma".equals(cipherText)); + Assert.assertTrue(data.equals(SecretUtil.decrypt3DES(cipherText, key))); + } + + @Test + public void test() throws Exception { + this.testRsa(); + System.out.println("\n\n"); + this.testDes(); + } +} diff --git a/core/src/test/resources/all.json b/core/src/test/resources/all.json new file mode 100755 index 000000000..909708640 --- /dev/null +++ b/core/src/test/resources/all.json @@ -0,0 +1,163 @@ + +{ + "entry": { + "jvm": "-Xms1G -Xmx1G", + "environment": { + "PATH": "/home/admin", + "DATAX_HOME": "/home/admin" + } + }, + "common": { + "column": { + "datetimeFormat": "yyyy-MM-dd HH:mm:ss", + "timeFormat": "HH:mm:ss", + "dateFormat": "yyyy-MM-dd", + "extraFormats":["yyyyMMdd"], + "timeZone": "GMT+8", + "encoding": "utf-8" + } + }, + "core": { + "dataXServer": { + "address": "http://localhost/", + "timeout": 10000 + }, + "transport": { + "channel": { + "class": "com.alibaba.datax.core.transport.channel.memory.MemoryChannel", + "speed": { + "byte": 1048576, + "record": 10000 + }, + "capacity": 32 + }, + "exchanger": { + "class": "com.alibaba.datax.core.plugin.BufferedRecordExchanger", + "bufferSize": 32 + } + }, + "container": { + "job": { + "reportInterval": 1000 + }, + "taskGroup": { + "channel": 3 + } + }, + "statistics": { + "collector": { + "plugin": { + "taskClass": "com.alibaba.datax.core.statistics.plugin.task.StdoutPluginCollector", + "maxDirtyNumber": 1000 + } + } + } + }, + "plugin": { + "reader": { + "mysqlreader": { + "name": "fakereader", + "class": "com.alibaba.datax.plugins.reader.fakereader.FakeReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + }, + "oraclereader": { + "name": "oraclereader", + "class": "com.alibaba.datax.plugins.reader.oraclereader.OracleReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + }, + "fakereader": { + "name": "fakereader", + "class": "com.alibaba.datax.core.faker.FakeReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." 
+ }, + "developer": "someBody,bug reported to : someBody@someSite" + } + }, + "writer": { + "fakewriter": { + "name": "fakewriter", + "class": "com.alibaba.datax.core.faker.FakeWriter", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + } + }, + "transformer": { + "groovyTranformer": {} + } + }, + "job": { + "setting": { + "speed": { + "byte": 104857600 + }, + "errorLimit": { + "record": null, + "percentage": null + } + }, + "preHandler":{ + "pluginType":"writer", + "pluginName":"fakewriter" + }, + "postHandler":{ + "pluginType":"writer", + "pluginName":"fakewriter" + }, + "content": [ + { + "reader": { + "name": "fakereader", + "parameter": { + "jdbcUrl": [ + [ + "jdbc:mysql://localhost:3305/db1", + "jdbc:mysql://localhost:3306/db1" + ], + [ + "jdbc:mysql://localhost:3305/db2", + "jdbc:mysql://localhost:3306/db2" + ] + ], + "table": [ + "bazhen_[0-15]", + "bazhen_[15-31]" + ] + } + }, + "writer": { + "name": "fakewriter", + "parameter": { + "column": [ + { + "type": "string", + "name": "id" + }, + { + "type": "int", + "name": "age" + } + ], + "encode": "utf-8" + } + } + } + ] + } +} diff --git a/core/src/test/resources/conf/.secret.properties b/core/src/test/resources/conf/.secret.properties new file mode 100755 index 000000000..826d840bd --- /dev/null +++ b/core/src/test/resources/conf/.secret.properties @@ -0,0 +1,11 @@ +#ds basicAuth config +auth.user=datax +auth.pass=datax +last.keyVersion= +last.publicKey= +last.privateKey= +current.keyVersion=201412091312 +current.publicKey=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDE7hkiZukRMdrb3pCsjG149BObagDjn7y4B+qde1pET9RxbzXP9bhmS63C4e8WzKmYbMd6o8paCtYn+S7R60vYFjoEwgb4p3aGPC8sp5AkpBeQXAADFZFJ7zxN1se3LMa2/UKxIjO9OA3aSieAczKd3ChmGo4Vi3hFIrecTG7oXQIDAQAB +current.privateKey=MIICeAIBADANBgkqhkiG9w0BAQEFAASCAmIwggJeAgEAAoGBAMTuGSJm6REx2tvekKyMbXj0E5tqAOOfvLgH6p17WkRP1HFvNc/1uGZLrcLh7xbMqZhsx3qjyloK1if5LtHrS9gWOgTCBvindoY8LyynkCSkF5BcAAMVkUnvPE3Wx7csxrb9QrEiM704DdpKJ4BzMp3cKGYajhWLeEUit5xMbuhdAgMBAAECgYEAn+CFe059LT6CZjpczhj7z1SojmYS7rmCZw3WRaAdepQs7yLQV1MwL6yFF1CB4MqrbVny4PgUkeF2V+GPR1F1sj2emYdBDBqmTj8gn5aF8xXvfHUl1uH3VnIyDQHnMG0FTSH0f1QgEv7L4WkpuLM1lcBpegkJCcuTcMt+fm3GoEECQQDvDxPyoB2K1PMLzta9aO4ZaaBLS/ggdVaPFcqlDdY0hDSuquHeWqTfvPwI3yZkc6nyhUvQNiKTKqasucSXnTGlAkEA0uK75OWQkGmuLyPwOSGTbFhW6Z0DzvtdhO1TtIBZ3y8DGr8G1a/kzNYWcEQjOK50Ula7jLymSA/lO79Dx9kuWQJBAN6bmMSu6rOT9rsBIaABLO6HGFflZym6eh8FeM1X5CbFEVWxFGD84Vji34LXYSXbOt711xIMxwdpiQmAdxuDqm0CQHKxQ5VS0RPplgUnW5AG1cH4LZSyg46/oPYZiQvDPp2mWN7kA9iV6C8LRHrcY/eA0dyyNSBuvVS16GtdM4TudkkCQQCyo2hFVn/zSbZYV02LPR47IkN2dEkNTr8j4dXcPqAy3rkx18Me86RwraHBJ0TLs6mMbRBRo5AimpH47Z9iVDdb + + diff --git a/core/src/test/resources/conf/core.json b/core/src/test/resources/conf/core.json new file mode 100755 index 000000000..64aede6b9 --- /dev/null +++ b/core/src/test/resources/conf/core.json @@ -0,0 +1,51 @@ +{ + "entry": { + "jvm": "-Xms1G -Xmx1G", + "environment": { + "PATH": "/home/admin", + "DATAX_HOME": "/home/admin" + } + }, + "common": { + "column": { + "datetimeFormat": "yyyy-MM-dd HH:mm:ss", + "timeFormat": "HH:mm:ss", + "dateFormat": "yyyy-MM-dd", + "extraFormats":["yyyyMMdd"], + "timeZone": "GMT+8", + "encoding": "utf-8" + } + }, + "core": { + "channel": { + "class": "com.alibaba.datax.core.MemoryChannel", + "speed": 1000000, + "capacity": 128, + "queue": { + "class": "com.alibaba.datax.core.transport.channel.memory.DoubleQueue", + "timeOut": 30 + } + }, + 
"container": { + "job": { + }, + "taskGroup": { + }, + "model": "master", + "id": 1 + }, + "statistics": { + "collector": { + "plugin": { + "maxDirtyNumber": 1000 + } + } + }, + "plugin": { + "exchanger": { + "class": "com.alibaba.datax.core.plugin.BufferedRecordExchanger", + "bufferSize": 32 + } + } + } +} \ No newline at end of file diff --git a/core/src/test/resources/conf/logback.xml b/core/src/test/resources/conf/logback.xml new file mode 100755 index 000000000..f2f73ea84 --- /dev/null +++ b/core/src/test/resources/conf/logback.xml @@ -0,0 +1,148 @@ + + + + + + + + + + UTF-8 + + %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n + + + + + + UTF-8 + ${log.dir}/${ymd}/${log.file.name}-${byMillionSecond}.log + false + + %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n + + + + + + UTF-8 + ${perf.dir}/${ymd}/${log.file.name}-${byMillionSecond}.log + false + + %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/core/src/test/resources/dryRunAll.json b/core/src/test/resources/dryRunAll.json new file mode 100644 index 000000000..83000b2d5 --- /dev/null +++ b/core/src/test/resources/dryRunAll.json @@ -0,0 +1,157 @@ + +{ + "entry": { + "jvm": "-Xms1G -Xmx1G", + "environment": { + "PATH": "/home/admin", + "DATAX_HOME": "/home/admin" + } + }, + "common": { + "column": { + "datetimeFormat": "yyyy-MM-dd HH:mm:ss", + "timeFormat": "HH:mm:ss", + "dateFormat": "yyyy-MM-dd", + "extraFormats":["yyyyMMdd"], + "timeZone": "GMT+8", + "encoding": "utf-8" + } + }, + "core": { + "dataXServer": { + "address": "http://localhost/", + "timeout": 10000 + }, + "transport": { + "channel": { + "class": "com.alibaba.datax.core.transport.channel.memory.MemoryChannel", + "speed": { + "byte": 1048576, + "record": 10000 + }, + "capacity": 32 + }, + "exchanger": { + "class": "com.alibaba.datax.core.plugin.BufferedRecordExchanger", + "bufferSize": 32 + } + }, + "container": { + "job": { + "reportInterval": 1000 + }, + "taskGroup": { + "channel": 3 + } + }, + "statistics": { + "collector": { + "plugin": { + "taskClass": "com.alibaba.datax.core.statistics.plugin.task.StdoutPluginCollector", + "maxDirtyNumber": 1000 + } + } + } + }, + "plugin": { + "reader": { + "mysqlreader": { + "name": "fakereader", + "class": "com.alibaba.datax.plugins.reader.fakereader.FakeReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + }, + "oraclereader": { + "name": "oraclereader", + "class": "com.alibaba.datax.plugins.reader.oraclereader.OracleReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + }, + "fakereader": { + "name": "fakereader", + "class": "com.alibaba.datax.core.faker.FakeReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." 
+ }, + "developer": "someBody,bug reported to : someBody@someSite" + } + }, + "writer": { + "fakewriter": { + "name": "fakewriter", + "class": "com.alibaba.datax.core.faker.FakeWriter", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" + } + }, + "transformer": { + "groovyTranformer": {} + } + }, + "job": { + "setting": { + "dryRun": true, + "speed": { + "byte": 104857600 + }, + "errorLimit": { + "record": null, + "percentage": null + } + }, + "content": [ + { + "reader": { + "name": "fakereader", + "parameter": { + "jdbcUrl": [ + [ + "jdbc:mysql://localhost:3305/db1", + "jdbc:mysql://localhost:3306/db1" + ], + [ + "jdbc:mysql://localhost:3305/db2", + "jdbc:mysql://localhost:3306/db2" + ] + ], + "table": [ + "bazhen_[0-15]", + "bazhen_[15-31]" + ] + } + }, + "writer": { + "name": "fakewriter", + "parameter": { + "column": [ + { + "type": "string", + "name": "id" + }, + { + "type": "int", + "name": "age" + } + ], + "encode": "utf-8", + "hbase-conf": "/home/hbase/hbase-conf.xml" + } + } + } + ] + } +} diff --git a/core/src/test/resources/job/job.json b/core/src/test/resources/job/job.json new file mode 100755 index 000000000..b857fdcb7 --- /dev/null +++ b/core/src/test/resources/job/job.json @@ -0,0 +1,27 @@ +{ + "job": { + "setting": {}, + "content": [ + { + "reader": { + "name": "fakereader", + "parameter": {} + }, + "writer": { + "name": "fakewriter", + "parameter": {} + } + }, + { + "reader": { + "name": "fakereader", + "parameter": {} + }, + "writer": { + "name": "fakewriter", + "parameter": {} + } + } + ] + } +} \ No newline at end of file diff --git a/core/src/test/resources/plugin/reader/fakereader/FakePluginer.jar b/core/src/test/resources/plugin/reader/fakereader/FakePluginer.jar new file mode 100755 index 000000000..3a409c85a Binary files /dev/null and b/core/src/test/resources/plugin/reader/fakereader/FakePluginer.jar differ diff --git a/core/src/test/resources/plugin/reader/fakereader/plugin.json b/core/src/test/resources/plugin/reader/fakereader/plugin.json new file mode 100755 index 000000000..95d544aa9 --- /dev/null +++ b/core/src/test/resources/plugin/reader/fakereader/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "fakereader", + "class": "com.alibaba.datax.core.faker.FakeReader", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." + }, + "developer": "someBody,bug reported to : someBody@someSite" +} \ No newline at end of file diff --git a/core/src/test/resources/plugin/writer/fakewriter/FakePluginer.jar b/core/src/test/resources/plugin/writer/fakewriter/FakePluginer.jar new file mode 100755 index 000000000..3a409c85a Binary files /dev/null and b/core/src/test/resources/plugin/writer/fakewriter/FakePluginer.jar differ diff --git a/core/src/test/resources/plugin/writer/fakewriter/plugin.json b/core/src/test/resources/plugin/writer/fakewriter/plugin.json new file mode 100755 index 000000000..f91c496dd --- /dev/null +++ b/core/src/test/resources/plugin/writer/fakewriter/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "fakewriter", + "class": "com.alibaba.datax.core.faker.FakeWriter", + "description": { + "useScene": "only for performance test.", + "mechanism": "Produce Record from memory.", + "warn": "Never use it in your real job." 
+ }, + "developer": "someBody,bug reported to : someBody@someSite" +} \ No newline at end of file diff --git a/datax-all.iml b/datax-all.iml new file mode 100644 index 000000000..eefe946ea --- /dev/null +++ b/datax-all.iml @@ -0,0 +1,13 @@ + + + + + + + + + + + + + \ No newline at end of file diff --git a/drdsreader/doc/drdsreader.md b/drdsreader/doc/drdsreader.md new file mode 100644 index 000000000..264fbe0ee --- /dev/null +++ b/drdsreader/doc/drdsreader.md @@ -0,0 +1,348 @@ + +# DrdsReader 插件文档 + + +___ + + +## 1 快速介绍 + +DrdsReader插件实现了从DRDS(分布式RDS)读取数据。在底层实现上,DrdsReader通过JDBC连接远程DRDS数据库,并执行相应的sql语句将数据从DRDS库中SELECT出来。 + +DRDS的插件目前DataX只适配了Mysql引擎的场景,DRDS对于DataX而言,就是一套分布式Mysql数据库,并且大部分通信协议遵守Mysql使用场景。 + +## 2 实现原理 + +简而言之,DrdsReader通过JDBC连接器连接到远程的DRDS数据库,并根据用户配置的信息生成查询SELECT SQL语句并发送到远程DRDS数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,DrdsReader将其拼接为SQL语句发送到DRDS数据库。不同于普通的Mysql数据库,DRDS作为分布式数据库系统,无法适配所有Mysql的协议,包括复杂的Join等语句,DRDS暂时无法支持。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从DRDS数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + //设置传输速度,单位为byte/s,DataX运行会尽可能达到该速度但是不超过它. + "byte": 1048576 + } + //出错限制 + "errorLimit": { + //出错的record条数上限,当大于该值即报错。 + "record": 0, + //出错的record百分比上限 1.0表示100%,0.02表示2% + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "drdsReader", + "parameter": { + // 数据库连接用户名 + "username": "root", + // 数据库连接密码 + "password": "root", + "column": [ + "id","name" + ], + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:mysql://127.0.0.1:3306/database" + ] + } + ] + } + }, + "writer": { + //writer类型 + "name": "streamwriter", + //是否打印内容 + "parameter": { + "print":true, + } + } + } + ] + } +} + +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + }, + "content": [ + { + "reader": { + "name": "drdsreader", + "parameter": { + "username": "root", + "password": "root", + "where": "", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:drds://localhost:3306/database"] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述.注意,jdbcUrl必须包含在connection配置单元中。DRDSReader中关于jdbcUrl中JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照Mysql官方规范,并可以填写连接附件控制信息。具体请参看[mysql官方文档](http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html)。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取需要抽取的表。注意,由于DRDS本身就是分布式数据源,因此填写多张表无意义。系统对多表不做校验。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用*代表默认使用所有列配置,例如['*']。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照Mysql SQL语法格式: + ["id", "\`table\`", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3", "true"] + id为普通列名,\`table\`为包含保留字的列名,1为整型数字常量,'bazhen.csy'为字符串常量,null为空指针,to_char(a + 1)为表达式,2.3为浮点数,true为布尔值。 + + column必须由用户显式指定同步的列集合,不允许为空! + + * 必选:是 <br />
+ + * 默认值:无
+ +* **where** + + * 描述:筛选条件,DrdsReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。例如在做测试时,可以将where条件指定为limit 10;在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate 。
+ + where条件可以有效地进行业务增量同步。where条件不配置或者为空,视作全表同步数据。 + + * 必选:否 <br />
+ + * 默认值:无
+ +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置项来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table、column这些配置项,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id <br />
+ + `当用户配置querySql时,drdsReader直接忽略table、column、where条件的配置`。 + + * 必选:否
+ + * 默认值:无
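+
+下面给出一个仅作示意的 reader 配置片段(其中的表名、字段名、时间条件均为假设,并非固定写法),演示 column 与 where 配合做增量抽取的用法。按照上文的拼接规则,DrdsReader 会据此生成形如 select id,name,gmt_create from order_info where gmt_create > '2014-06-01 00:00:00' 的查询:
+
+```json
+{
+    "name": "drdsreader",
+    "parameter": {
+        "username": "root",
+        "password": "root",
+        "column": ["id", "name", "gmt_create"],
+        "where": "gmt_create > '2014-06-01 00:00:00'",
+        "connection": [
+            {
+                "table": ["order_info"],
+                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/database"]
+            }
+        ]
+    }
+}
+```
+
+如果筛选逻辑复杂到 where 无法表达,再改用 querySql;注意此时 table、column、where 的配置会被忽略。
+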
+ + +### 3.3 类型转换 + +目前DrdsReader支持大部分DRDS类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出DrdsReader针对DRDS类型转换列表: + + +| DataX 内部类型| DRDS 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, mediumint, int, bigint| +| Double |float, double, decimal| +| String |varchar, char, tinytext, text, mediumtext, longtext | +| Date |date, datetime, timestamp, time, year | +| Boolean |bit, bool | +| Bytes |tinyblob, mediumblob, blob, longblob, varbinary | + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 +* `类似Mysql,tinyint(1)视作整形`。 +* `类似Mysql,bit类型读取目前是未定义状态。` + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + + CREATE TABLE `tc_biz_vertical_test_0000` ( + `biz_order_id` bigint(20) NOT NULL COMMENT 'id', + `key_value` varchar(4000) NOT NULL COMMENT 'Key-value的内容', + `gmt_create` datetime NOT NULL COMMENT '创建时间', + `gmt_modified` datetime NOT NULL COMMENT '修改时间', + `attribute_cc` int(11) DEFAULT NULL COMMENT '防止并发修改的标志', + `value_type` int(11) NOT NULL DEFAULT '0' COMMENT '类型', + `buyer_id` bigint(20) DEFAULT NULL COMMENT 'buyerid', + `seller_id` bigint(20) DEFAULT NULL COMMENT 'seller_id', + PRIMARY KEY (`biz_order_id`,`value_type`), + KEY `idx_biz_vertical_gmtmodified` (`gmt_modified`) + ) ENGINE=InnoDB DEFAULT CHARSET=gbk COMMENT='tc_biz_vertical' + + +单行记录类似于: + + biz_order_id: 888888888 + key_value: ;orderIds:20148888888,2014888888813800; + gmt_create: 2011-09-24 11:07:20 + gmt_modified: 2011-10-24 17:56:34 + attribute_cc: 1 + value_type: 3 + buyer_id: 8888888 + seller_id: 1 + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 24核 Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz + 2. mem: 48GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* DRDS数据库机器参数为: + 1. cpu: 32核 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz + 2. mem: 256GB + 3. net: 千兆双网卡 + 4. disc: BTWL419303E2800RGN INTEL SSDSC2BB800G4 D2010370 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + + +| 通道数| 是否按照主键切分| DataX速度(Rec/s)| DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------|--------| --------|--------|--------|--------|--------|--------| + + +说明: + +1. 这里的单表,主键类型为 bigint(20),范围为:190247559466810-570722244711460,从主键范围划分看,数据分布均匀。 +2. 
对单表如果没有安装主键切分,那么配置通道个数不会提升速度,效果与1个通道一样。 + + +#### 4.2.2 分表测试报告(2个分库,每个分库16张分表,共计32张分表) + + +| 通道数| DataX速度(Rec/s)|DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------| --------|--------|--------|--------|--------|--------| + + + +## 5 约束限制 + + +### 5.1 一致性视图问题 + +DRDS本身属于分布式数据库,对外无法提供一致性的多库多表视图,不同于Mysql等单库单表同步,DRDSReader无法抽取同一个时间切片的分库分表快照信息,也就是说DataX DrdsReader抽取底层不同的分表将获取不同的分表快照,无法保证强一致性。 + + +### 5.2 数据库编码问题 + +DRDS本身的编码设置非常灵活,包括指定编码到库、表、字段级别,甚至可以均不同编码。优先级从高到低为字段、表、库、实例。我们不推荐数据库用户设置如此混乱的编码,最好在库级别就统一到UTF-8。 + +DrdsReader底层使用JDBC进行数据抽取,JDBC天然适配各类编码,并在底层进行了编码转换。因此DrdsReader不需用户指定编码,可以自动获取编码并转码。 + +对于DRDS底层写入编码和其设定的编码不一致的混乱情况,DrdsReader对此无法识别,对此也无法提供解决方案,对于这类情况,`导出有可能为乱码`。 + +### 5.3 增量数据同步 + +DrdsReader使用JDBC SELECT语句完成数据抽取工作,因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种: + +* 数据库在线应用写入数据库时,填充modify字段为更改时间戳,包括新增、更新、删除(逻辑删)。对于这类应用,DrdsReader只需要WHERE条件跟上一同步阶段时间戳即可。 +* 对于新增流水型数据,DrdsReader可以WHERE条件后跟上一阶段最大自增ID即可。 + +对于业务上无字段区分新增、修改数据情况,DrdsReader也无法进行增量数据同步,只能同步全量数据。 + +### 5.4 Sql安全性 + +DrdsReader提供querySql语句交给用户自己实现SELECT抽取语句,DrdsReader本身对querySql不做任何安全性校验。这块交由DataX用户方自己保证。 + +## 6 FAQ + +*** + +**Q: DrdsReader同步报错,报错信息为XXX** + + A: 网络或者权限问题,请使用DRDS命令行测试: + + mysql -u -p -h -D -e "select * from <表名>" + +如果上述命令也报错,那可以证实是环境问题,请联系你的DBA。 + +*** + +**Q: 我想同步DRDS增量数据,怎么配置?** + + A: DrdsReader必须业务支持增量字段DataX才能同步增量,例如在淘宝大部分业务表中,通过gmt_modified字段表征这条记录的最新修改时间,那么DataX DrdsReader只需要配置where条件为 + +``` + "where": "Date(add_time) = '2014-06-01'" +``` + +*** + + + diff --git a/drdsreader/drdsreader.iml b/drdsreader/drdsreader.iml new file mode 100644 index 000000000..1f713bbe7 --- /dev/null +++ b/drdsreader/drdsreader.iml @@ -0,0 +1,46 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/drdsreader/pom.xml b/drdsreader/pom.xml new file mode 100755 index 000000000..0677e0fc8 --- /dev/null +++ b/drdsreader/pom.xml @@ -0,0 +1,82 @@ + + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + drdsreader + drdsreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + mysql + mysql-connector-java + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/drdsreader/src/main/assembly/package.xml b/drdsreader/src/main/assembly/package.xml new file mode 100755 index 000000000..2d170236a --- /dev/null +++ b/drdsreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/drdsreader + + + target/ + + drdsreader-0.0.1-SNAPSHOT.jar + + plugin/reader/drdsreader + + + + + + false + plugin/reader/drdsreader/libs + runtime + + + diff --git a/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReader.java b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReader.java new file mode 100755 index 000000000..0e6d33013 --- /dev/null +++ b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReader.java @@ -0,0 +1,150 @@ +package com.alibaba.datax.plugin.reader.drdsreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import 
com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.TableExpandUtil; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +public class DrdsReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.MySql; + private static final Logger LOG = LoggerFactory.getLogger(DrdsReader.class); + + public static class Job extends Reader.Job { + + private Configuration originalConfig = null; + private CommonRdbmsReader.Job commonRdbmsReaderJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + int fetchSize = this.originalConfig.getInt(Constant.FETCH_SIZE, + Integer.MIN_VALUE); + this.originalConfig.set(Constant.FETCH_SIZE, fetchSize); + this.validateConfiguration(); + + this.commonRdbmsReaderJob = new CommonRdbmsReader.Job( + DATABASE_TYPE); + this.commonRdbmsReaderJob.init(this.originalConfig); + } + + @Override + public List split(int adviceNumber) { + return DrdsReaderSplitUtil.doSplit(this.originalConfig, + adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderJob.destroy(this.originalConfig); + } + + private void validateConfiguration() { + // do not splitPk + String splitPk = originalConfig.getString(Key.SPLIT_PK, null); + if (null != splitPk) { + LOG.warn("由于您读取数据库是drds, 所以您不需要配置 splitPk. 如果您不想看到这条提醒,请移除您源头表中配置的 splitPk."); + this.originalConfig.remove(Key.SPLIT_PK); + } + + List conns = this.originalConfig.getList( + Constant.CONN_MARK, Object.class); + if (null == conns || conns.size() != 1) { + throw DataXException.asDataXException( + DBUtilErrorCode.REQUIRED_VALUE, + "您未配置读取数据库jdbcUrl的信息. 正确的配置方式是给 jdbcUrl 配置上您需要读取的连接. 请检查您的配置并作出修改."); + } + Configuration connConf = Configuration + .from(conns.get(0).toString()); + connConf.getNecessaryValue(Key.JDBC_URL, + DBUtilErrorCode.REQUIRED_VALUE); + + // only one jdbcUrl + List jdbcUrls = connConf + .getList(Key.JDBC_URL, String.class); + if (null == jdbcUrls || jdbcUrls.size() != 1) { + throw DataXException.asDataXException( + DBUtilErrorCode.ILLEGAL_VALUE, + "您的jdbcUrl配置信息有误, 因为您配置读取数据库jdbcUrl的数量不正确. 正确的配置方式是配置且只配置 1 个目的 jdbcUrl. 请检查您的配置并作出修改."); + } + // if have table,only one + List tables = connConf.getList(Key.TABLE, String.class); + if (null != tables && tables.size() != 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的jdbcUrl配置信息有误. 由于您读取数据库是drds,配置读取源表数目错误. 正确的配置方式是配置且只配置 1 个目的 table. 请检查您的配置并作出修改."); + + } + if (null != tables && tables.size() == 1) { + List expandedTables = TableExpandUtil.expandTableConf( + DATABASE_TYPE, tables); + if (null == expandedTables || expandedTables.size() != 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的jdbcUrl配置信息有误. 由于您读取数据库是drds,配置读取源表数目错误. 正确的配置方式是配置且只配置 1 个目的 table. 请检查您的配置并作出修改."); + } + } + + // if have querySql,only one + List querySqls = connConf.getList(Key.QUERY_SQL, + String.class); + if (null != querySqls && querySqls.size() != 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的querySql配置信息有误. 
由于您读取数据库是drds, 配置读取querySql数目错误. 正确的配置方式是配置且只配置 1 个 querySql. 请检查您的配置并作出修改."); + } + + // warn:other checking about table,querySql in common + } + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderTask; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderTask = new CommonRdbmsReader.Task( + DATABASE_TYPE,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderTask.init(this.readerSliceConfig); + + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig.getInt(Constant.FETCH_SIZE); + + this.commonRdbmsReaderTask.startRead(this.readerSliceConfig, + recordSender, super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderTask.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderTask.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderErrorCode.java b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderErrorCode.java new file mode 100755 index 000000000..91b3afd49 --- /dev/null +++ b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.reader.drdsreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum DrdsReaderErrorCode implements ErrorCode { + GET_TOPOLOGY_FAILED("DrdsReader-01", "获取 drds 表的拓扑结构失败."),; + + private final String code; + private final String description; + + private DrdsReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } +} diff --git a/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderSplitUtil.java b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderSplitUtil.java new file mode 100755 index 000000000..fefd698f7 --- /dev/null +++ b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderSplitUtil.java @@ -0,0 +1,121 @@ +package com.alibaba.datax.plugin.reader.drdsreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.reader.util.SingleTableSplitUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.sql.ResultSet; +import java.util.*; + +public class DrdsReaderSplitUtil { + + private static final Logger LOG = LoggerFactory + .getLogger(DrdsReaderSplitUtil.class); + + public static List doSplit(Configuration originalSliceConfig, + int adviceNumber) { + boolean isTableMode = originalSliceConfig.getBool(Constant.IS_TABLE_MODE).booleanValue(); + int tableNumber = originalSliceConfig.getInt(Constant.TABLE_NUMBER_MARK); + + if (isTableMode && tableNumber == 1) { + //需要先把内层的 table,connection 先放到外层 + String table = originalSliceConfig.getString(String.format("%s[0].%s[0]", Constant.CONN_MARK, Key.TABLE)).trim(); + originalSliceConfig.set(Key.TABLE, table); + + //注意:这里的 jdbcUrl 不是从数组中获取的,因为之前的 master init 方法已经进行过预处理 + String jdbcUrl = originalSliceConfig.getString(String.format("%s[0].%s", Constant.CONN_MARK, Key.JDBC_URL)).trim(); + + originalSliceConfig.set(Key.JDBC_URL, DataBaseType.DRDS.appendJDBCSuffixForReader(jdbcUrl)); + + originalSliceConfig.remove(Constant.CONN_MARK); + return doDrdsReaderSplit(originalSliceConfig); + } else { + throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, "您的配置信息中的表(table)的配置有误. 因为Drdsreader 只需要读取一张逻辑表,后台会通过DRDS Proxy自动获取实际对应物理表的数据. 
请检查您的配置并作出修改."); + } + } + + private static List doDrdsReaderSplit(Configuration originalSliceConfig) { + List splittedConfigurations = new ArrayList(); + + Map> topology = getTopology(originalSliceConfig); + if (null == topology || topology.isEmpty()) { + throw DataXException.asDataXException(DrdsReaderErrorCode.GET_TOPOLOGY_FAILED, + "获取 drds 表拓扑结构失败, 拓扑结构不能为空."); + } else { + String table = originalSliceConfig.getString(Key.TABLE).trim(); + String column = originalSliceConfig.getString(Key.COLUMN).trim(); + String where = originalSliceConfig.getString(Key.WHERE, null); + // 不能带英语分号结尾 + String sql = SingleTableSplitUtil + .buildQuerySql(column, table, where); + // 根据拓扑拆分任务 + for (Map.Entry> entry : topology.entrySet()) { + String group = entry.getKey(); + StringBuilder sqlbuilder = new StringBuilder(); + sqlbuilder.append("/*+TDDL({'extra':{'MERGE_UNION':'false'},'type':'direct',"); + sqlbuilder.append("'vtab':'").append(table).append("',"); + sqlbuilder.append("'dbid':'").append(group).append("',"); + sqlbuilder.append("'realtabs':["); + Iterator it = entry.getValue().iterator(); + while (it.hasNext()) { + String realTable = it.next(); + sqlbuilder.append('\'').append(realTable).append('\''); + if (it.hasNext()) { + sqlbuilder.append(','); + } + } + sqlbuilder.append("]})*/"); + sqlbuilder.append(sql); + Configuration param = originalSliceConfig.clone(); + param.set(Key.QUERY_SQL, sqlbuilder.toString()); + splittedConfigurations.add(param); + } + + return splittedConfigurations; + } + } + + + private static Map> getTopology(Configuration configuration) { + Map> topology = new HashMap>(); + + String jdbcURL = configuration.getString(Key.JDBC_URL); + String username = configuration.getString(Key.USERNAME); + String password = configuration.getString(Key.PASSWORD); + String logicTable = configuration.getString(Key.TABLE).trim(); + + Connection conn = null; + ResultSet rs = null; + try { + conn = DBUtil.getConnection(DataBaseType.DRDS, jdbcURL, username, password); + rs = DBUtil.query(conn, "SHOW TOPOLOGY " + logicTable); + while (DBUtil.asyncResultSetNext(rs)) { + String groupName = rs.getString("GROUP_NAME"); + String tableName = rs.getString("TABLE_NAME"); + List tables = topology.get(groupName); + if (tables == null) { + tables = new ArrayList(); + topology.put(groupName, tables); + } + tables.add(tableName); + } + + return topology; + } catch (Exception e) { + throw DataXException.asDataXException(DrdsReaderErrorCode.GET_TOPOLOGY_FAILED, + String.format("获取 drds 表拓扑结构失败.根据您的配置, datax获取不到拓扑信息。相关上下文信息:表:%s, jdbcUrl:%s . 请联系 drds 管理员处理.", logicTable, jdbcURL), e); + } finally { + DBUtil.closeDBResources(rs, null, conn); + } + } + +} + diff --git a/drdsreader/src/main/resources/plugin.json b/drdsreader/src/main/resources/plugin.json new file mode 100755 index 000000000..eaa86d5a4 --- /dev/null +++ b/drdsreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "drdsreader", + "class": "com.alibaba.datax.plugin.reader.drdsreader.DrdsReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. 
warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/drdsreader/src/main/resources/plugin_job_template.json b/drdsreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..cf008227d --- /dev/null +++ b/drdsreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,11 @@ +{ + "name": "drdsreader", + "parameter": { + "jdbcUrl": "", + "username": "", + "password": "", + "table": "", + "column": [], + "where": "" + } +} \ No newline at end of file diff --git a/drdswriter/doc/drdswriter.md b/drdswriter/doc/drdswriter.md new file mode 100644 index 000000000..c3c02c2ec --- /dev/null +++ b/drdswriter/doc/drdswriter.md @@ -0,0 +1,386 @@ +# DataX DRDSWriter + + +--- + + +## 1 快速介绍 + +DRDSWriter 插件实现了写入数据到 DRDS 的目的表的功能。在底层实现上, DRDSWriter 通过 JDBC 连接远程 DRDS 数据库的 Proxy,并执行相应的 replace into ... 的 sql 语句将数据写入 DRDS,特别注意执行的 Sql 语句是 replace into,为了避免数据重复写入,需要你的表具备主键或者唯一性索引(Unique Key)。 + +DRDSWriter 面向ETL开发工程师,他们使用 DRDSWriter 从数仓导入数据到 DRDS。同时 DRDSWriter 亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +DRDSWriter 通过 DataX 框架获取 Reader 生成的协议数据,通过 `replace into...`(没有遇到主键/唯一性索引冲突时,与 insert into 行为一致,冲突时会用新行替换原有行所有字段) 的语句写入数据到 DRDS。DRDSWriter 累积一定数据,提交给 DRDS 的 Proxy,该 Proxy 内部决定数据是写入一张还是多张表以及多张表写入时如何路由数据。 +
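+下面用一段仅作示意的 SQL(表名、字段均为假设)说明 replace into 的行为:没有主键/唯一键冲突时,它与 insert into 等价;发生冲突时,原有行会被新行整体替换,新行未给出的字段将写为该列的默认值。
+
+```
+-- 假设 t_user 以 id 为主键
+-- 第一次执行:id=1 不存在,等价于 insert
+REPLACE INTO t_user (id, name) VALUES (1, 'DataX');
+-- 第二次执行:id=1 已存在,原有行被整行替换
+REPLACE INTO t_user (id, name) VALUES (1, 'DataX_new');
+```
+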
+ + 注意:整个任务至少需要具备 replace into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 DRDS 导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "DRDSWriter", + "parameter": { + "writeMode": "insert", + "username": "root", + "password": "root", + "column": [ + "id", + "name" + ], + "preSql": [ + "delete from test" + ], + "connection": [ + { + "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk", + "table": [ + "test" + ] + } + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息。作业运行时,DataX 会在你提供的 jdbcUrl 后面追加如下属性:yearIsDateType=false&zeroDateTimeBehavior=convertToNull&rewriteBatchedStatements=true + + 注意:1、在一个数据库上只能配置一个 jdbcUrl 值 + 2、一个DRDS 写入任务仅能配置一个 jdbcUrl + 3、jdbcUrl按照Mysql/DRDS官方规范,并可以填写连接附加控制信息,比如想指定连接编码为 gbk ,则在 jdbcUrl 后面追加属性 useUnicode=true&characterEncoding=gbk。具体请参看 Mysql/DRDS官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。 只能配置一个DRDS 的表名称。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"] + + **column配置项必须指定,不能留空!** + + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、此处 column 不能配置任何常量值 + + * 必选:是
+ + * 默认值:无
+ +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。比如你想在导入数据前清空数据表中的数据,那么可以配置为:`"preSql":["delete from yourTableName"]`
+ + * 必选:否
+ + * 默认值:无
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
+ +* **writeMode** + + * 描述:写入模式,默认为 replace;当前支持 replace 与 insert ignore 两种取值(insert ignore 的写入性能对比见第 4 节测试报告),可以不配置。本节末尾附有一个同时指定 writeMode 与 batchSize 的配置片段。
+ + * 必选:否
+ + * 默认值:replace
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与DRDS的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:
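结合上面 writeMode 与 batchSize 的说明,下面给出一个仅作示意的 writer 参数片段(库地址、表名、账号、取值均为示例,需按实际环境替换),用来展示这两个参数在任务配置中的位置:

```json
{
    "writer": {
        "name": "drdswriter",
        "parameter": {
            "writeMode": "insert ignore",
            "batchSize": 128,
            "username": "root",
            "password": "root",
            "column": ["id", "name"],
            "connection": [
                {
                    "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk",
                    "table": ["test"]
                }
            ]
        }
    }
}
```

其中 writeMode 取 insert ignore 时,遇到主键/唯一键冲突的行会被跳过而不会替换原有数据;batchSize 取 128 参考了第 4 节测试报告中的建议,实际可按单行数据大小调整。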
+ +### 3.3 类型转换 + +类似 MysqlWriter ,目前 DRDSWriter 支持大部分 Mysql 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出 DRDSWriter 针对 Mysql 类型转换列表: + + +| DataX 内部类型| Mysql 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, mediumint, int, bigint, year| +| Double |float, double, decimal| +| String |varchar, char, tinytext, text, mediumtext, longtext | +| Date |date, datetime, timestamp, time | +| Boolean |bit, bool | +| Bytes |tinyblob, mediumblob, blob, longblob, varbinary | + + + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + +``` +CREATE TABLE `t_job` ( + `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT '实例id', + `project_id` bigint(20) NOT NULL COMMENT '所属资源组,外键', + `pipeline_name` varchar(128) DEFAULT NULL COMMENT '所属资源组,以后变为非null', + `execute_name` varchar(512) DEFAULT NULL COMMENT '执行脚本', + `context` text NOT NULL COMMENT 'job的配置信息', + `trace_id` varchar(128) DEFAULT NULL COMMENT '外界标志', + `submit_user` varchar(128) DEFAULT NULL COMMENT '提交的用户', + `submit_time` datetime DEFAULT NULL COMMENT '提交时间', + `submit_ip` varchar(64) DEFAULT NULL COMMENT '实例提交的客户端ip', + `start_time` datetime DEFAULT NULL COMMENT '开始执行时间', + `end_user` varchar(128) DEFAULT NULL COMMENT '结束的用户', + `end_time` datetime DEFAULT NULL COMMENT '结束时间', + `log_url` varchar(4096) DEFAULT NULL COMMENT '实例运行log的地址', + `execute_id` varchar(256) DEFAULT NULL COMMENT '执行id', + `execute_ip` varchar(64) DEFAULT NULL COMMENT '执行所在机器ip', + `state` tinyint(3) unsigned DEFAULT '255' COMMENT '实例状态,0-success,1-submit,2-init,3-run,4-fail,5-kill,255-unknown', + `stage` float DEFAULT '0' COMMENT '实例运行进度', + `total_records` bigint(20) unsigned DEFAULT '0' COMMENT '实例总条数', + `total_bytes` bigint(20) unsigned DEFAULT '0' COMMENT '实例总bytes数', + `speed_records` bigint(20) unsigned DEFAULT '0' COMMENT '实例运行速度', + `speed_bytes` bigint(20) unsigned DEFAULT '0' COMMENT '实例运行bytes速度', + `error_records` bigint(20) unsigned DEFAULT '0' COMMENT '实例错误总条数', + `error_bytes` bigint(20) unsigned DEFAULT '0' COMMENT '实例错误总bytes数', + `error_message` text COMMENT '实例错误信息', + PRIMARY KEY (`id`), + KEY `idx_project` (`project_id`), + KEY `idx_pipeline_name` (`pipeline_name`), + KEY `idx_trace_id` (`trace_id`), + KEY `idx_submit_time` (`submit_time`) +) ENGINE=InnoDB AUTO_INCREMENT=645299 DEFAULT CHARSET=utf8 COMMENT='实例的信息表' +``` + +单行记录类似于: + +``` + id: 100605 + project_id: 112 +pipeline_name: jcs_project_128105 + execute_name: NULL + context: {"configuration":{"reader":{"parameter":{"*password":"NTdjMknTWnAlCCyaG3EVQg==","column":"`pugId`,`gmtCreated`,`gmtModified`,`gmtAuthTime`,`gmtLeaveTime`,`merchantId`,`token`,`mac`,`os`,`oauth`,`oauthInfo`","database":"witown","error-limit":"1","ip":"rdsyi7biayi7bia.mysql.rds.aliyuncs.com","port":"3306","table":"wi_pug_106","username":"rnd","where":""},"plugin":"mysql"},"writer":{"parameter":{"*access-key":"+oN7h69a9T64z1fas0CNDWmVeSsIF4i5a8s8HA5HNjo=","access-id":"IlUQ8E7i3CFFbGax","error-limit":"1","partition":"","project":"witown_rds","table":"wi_pug_106"},"plugin":"odps"}},"type":"datax"} + trace_id: NULL + submit_user: 128105 + submit_time: 2014-12-12 11:36:27 + submit_ip: 127.0.0.1 + start_time: 2014-12-12 11:36:27 + end_user: NULL + end_time: 2014-12-12 11:36:41 + log_url: /20141212/cdp/11-36-27/hwdbvm3nju4qcgadwgurv445/ + execute_id: T3_0000184404 + execute_ip: oxsgateway04.cloud.et1 + state: 4 + stage: 0 +total_records: 544 + total_bytes: 42819 +speed_records: 0 + speed_bytes: 0 +error_records: 0 + error_bytes: NULL +error_message: Code:[OdpsWriter-01], Description:[您配置的值不合法.]. 
- 数据源读取的列数是:11 大于目的端的列数是:10 , DataX 不允许这种行为, 请检查您的配置. +``` + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 24核 Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz + 2. mem: 96GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* DRDS数据库机器参数为: + 1. 两个实例,每个实例上8个分库,每个分库上一张分表,共计16个分库16张分表 + 2. mem: 1200M + 3. 磁盘: 40960M + 4. mysql类型: MySQL5.6 + 5. 最大连接数: 300 + 6. 最大IOPS: 600 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + + +### 4.2 测试报告 + +压测基于jdbcUrl里添加了选项`rewriteBatchedStatements=true`
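若需显式指定该选项,可在 jdbcUrl 末尾追加该属性,下面是一个仅作示意的连接配置片段(地址、库名、表名均为示例值,仅说明选项的拼接位置):

```json
{
    "connection": [
        {
            "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk&rewriteBatchedStatements=true",
            "table": ["test"]
        }
    ]
}
```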
+ +#### 4.2.1 replace测试报告 + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DRDS QPS| 备注 | +|--------|--------| --------|--------|--------|--------|--------|--------| +| 1 | 128 | 260 | 0.39 | 0.34 | 0.03 | 2 | | +| 1 | 512 | 256 | 0.36 | 0.38 | 0.03 | 1 | | +| 1 | 1024 | 256 | 0.35 | max:0.81 min:0.002 | 0.36 | 0 | 波动很大 | +| 4 | 128 | 1011 | 1.22 | 1.75 | 0.14 | 7 | | +| 4 | 512 | 1024 | 1.27 | 1.84 | 0.06 | 1 | | +| 4 | 1024 | 1024 | 1.3 | max:2.82 min:0.76 | 0.23 | 1 | | +| 8 | 128 | 1190 | 1.6 | 2.26 | 0.15 | 6 | | +| 8 | 512 | | | | | | IndexOutOfBoundsException报错 | + +初步定位报错是drds那边通过sql去获取表元信息的时候,返回结果为空 + +#### 4.2.2 insert ignore测试报告 + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DRDS QPS| 备注 | +|--------|--------| --------|--------|--------|--------|--------|--------| +| 1 | 128 | 2444 | 3.43 | 3.41 | 0.13 | 12 | | +| 1 | 512 | 2099 | 3.00 | 2.99 | 0.13 | 3 | | +| 1 | 1024 | 2048 | 2.91 | 3.06 | 0.17 | 1 | | +| 4 | 128 | 7923 | 8.75 | 10.7 | 0.17 | 26 | | +| 4 | 512 | 8662 | 10.05 | 11.92 | 0.17 | 7 | | +| 4 | 1024 | 8211 | 10.20 | 12.74 | 0.29 | 3 | | +| 8 | 128 | | | | | | 抛错IndexOutOfBoundsException | + +#### 4.2.3 单条小数据数据量较小情况(insert ignore插入模式) +``` +建表语句 +CREATE TABLE `tddl_users` ( + `id` bigint(20) NOT NULL AUTO_INCREMENT, + `gmt_create` datetime NOT NULL, + `gmt_modified` datetime NOT NULL, + `name` varchar(200) NOT NULL, + `address` varchar(500) NOT NULL, + PRIMARY KEY (`id`) +) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=gbk; + +数据值 +sun,aaa,2015-04-21 00:00:00,2015-04-21 01:00:00,kkk +``` + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DRDS QPS| 备注 | +|--------|--------| --------|--------|--------|--------|--------|--------| +| 1 | 1 | 108 | 0.005 | 0.042 | 0.08 | 110 | | +| 1 | 32 | 2611 | 0.117 | 0.188 | 0.20 | 42 | | +| 1 | 128 | 8012 | 0.36 | 0.531 | 0.4 | 37 | | +| 1 | 512 | 16025 | 0.718 | 1.12 | 0.15 | 27 | | +| 1 | 1024 | 19046 | 0.85 | 1.2 | 0.30 | 13 | | +| 1 | 40960 | 22595 | 1.01 | 1.31 | 1.27 | 1 | | + +#### 4.2.4 性能小节 + +单条数据量较大情况下: + +1. batchSize的增加对性能影响不大,且太多会导致网络波动太大,建议batchSize设置为128 + +2. 并发对写入性能影响很大,但到了8个并发就会抛错IndexOutOfBoundsException,`感觉是datax内部代码有问题` + +3. insert ignore的性能比replace好很多,用户可以根据业务场景来选取insert ignore + +4. 以上性能均是在rewriteBatchedStatements=true情况下测试的,需要datax把该选项加入到默认配置中 + +单条数据量较小情况下 + +5. 
batchSize设置大一点影响还是蛮大的 + +## 5 约束限制 + + +## FAQ + +*** + +**Q: DRDSWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** + +**Q: 上面第二种方法可以避免对线上数据造成影响,那我具体怎样操作?** + +A: 可以配置临时表导入 diff --git a/drdswriter/drdswriter.iml b/drdswriter/drdswriter.iml new file mode 100644 index 000000000..1f713bbe7 --- /dev/null +++ b/drdswriter/drdswriter.iml @@ -0,0 +1,46 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/drdswriter/pom.xml b/drdswriter/pom.xml new file mode 100755 index 000000000..6873c8fbc --- /dev/null +++ b/drdswriter/pom.xml @@ -0,0 +1,83 @@ + + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + drdswriter + drdswriter + jar + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + mysql + mysql-connector-java + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/drdswriter/src/main/assembly/package.xml b/drdswriter/src/main/assembly/package.xml new file mode 100755 index 000000000..f3c893ac3 --- /dev/null +++ b/drdswriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/drdswriter + + + target/ + + drdswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/drdswriter + + + + + + false + plugin/writer/drdswriter/libs + runtime + + + diff --git a/drdswriter/src/main/java/com/alibaba/datax/plugin/writer/drdswriter/DrdsWriter.java b/drdswriter/src/main/java/com/alibaba/datax/plugin/writer/drdswriter/DrdsWriter.java new file mode 100755 index 000000000..b2bf0ac4e --- /dev/null +++ b/drdswriter/src/main/java/com/alibaba/datax/plugin/writer/drdswriter/DrdsWriter.java @@ -0,0 +1,97 @@ +package com.alibaba.datax.plugin.writer.drdswriter; + + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + +public class DrdsWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.DRDS; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private String DEFAULT_WRITEMODE = "replace"; + private String INSERT_IGNORE_WRITEMODE = "insert ignore"; + private CommonRdbmsWriter.Job commonRdbmsWriterJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + String writeMode = this.originalConfig.getString(Key.WRITE_MODE, DEFAULT_WRITEMODE); + if (!DEFAULT_WRITEMODE.equalsIgnoreCase(writeMode) && + !INSERT_IGNORE_WRITEMODE.equalsIgnoreCase(writeMode)) { + throw 
DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, + String.format("写入模式(writeMode)配置错误. DRDSWriter只支持两种写入模式为:[%s, %s], 但是您配置的写入模式为:%s. 请检查您的配置并作出修改.", + DEFAULT_WRITEMODE, INSERT_IGNORE_WRITEMODE, writeMode)); + } + + this.originalConfig.set(Key.WRITE_MODE, writeMode); + this.commonRdbmsWriterJob = new CommonRdbmsWriter.Job(DATABASE_TYPE); + this.commonRdbmsWriterJob.init(this.originalConfig); + } + + // 对于 Drds 而言,只会暴露一张逻辑表,所以直接在 Master 做 pre,post 操作 + @Override + public void prepare() { + this.commonRdbmsWriterJob.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterJob.split(this.originalConfig, mandatoryNumber); + } + + // 一般来说,是需要推迟到 task 中进行post 的执行(单表情况例外) + @Override + public void post() { + this.commonRdbmsWriterJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterTask; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterTask = new CommonRdbmsWriter.Task(DATABASE_TYPE); + this.commonRdbmsWriterTask.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterTask.prepare(this.writerSliceConfig); + } + + //TODO 改用连接池,确保每次获取的连接都是可用的(注意:连接可能需要每次都初始化其 session) + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterTask.startWrite(recordReceiver, this.writerSliceConfig, + super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterTask.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterTask.destroy(this.writerSliceConfig); + } + + } +} diff --git a/drdswriter/src/main/resources/plugin.json b/drdswriter/src/main/resources/plugin.json new file mode 100755 index 000000000..ad0db036d --- /dev/null +++ b/drdswriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "drdswriter", + "class": "com.alibaba.datax.plugin.writer.drdswriter.DrdsWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/drdswriter/src/main/resources/plugin_job_template.json b/drdswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..93fcd4efe --- /dev/null +++ b/drdswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "drdswriter", + "parameter": { + "jdbcUrl": "", + "username": "", + "password": "", + "table": "", + "column": [], + "writeMode": "", + "preSql": [], + "postSql": [] + } +} \ No newline at end of file diff --git a/ftpreader/doc/ftpreader.md b/ftpreader/doc/ftpreader.md new file mode 100644 index 000000000..3af6fb3a6 --- /dev/null +++ b/ftpreader/doc/ftpreader.md @@ -0,0 +1,292 @@ +# DataX FtpReader 说明 + + +------------ + +## 1 快速介绍 + +FtpReader提供了读取远程FTP文件系统数据存储的能力。在底层实现上,FtpReader获取远程FTP文件数据,并转换为DataX传输协议传递给Writer。 + +**本地文件内容存放的是一张逻辑意义上的二维表,例如CSV格式的文本信息。** + + +## 2 功能与限制 + +FtpReader实现了从远程FTP文件读取数据并转为DataX协议的功能,远程FTP文件本身是无结构化数据存储,对于DataX而言,FtpReader实现上类比TxtFileReader,有诸多相似之处。目前FtpReader支持功能如下: + +1. 支持且仅支持读取TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 
支持多种类型数据读取(使用String表示),支持列裁剪,支持列常量 + +4. 支持递归读取、支持文件名过滤。 + +5. 支持文本压缩,现有压缩格式为zip、lzo、lzop、tgz、bzip2。 + +6. 多个File可以支持并发读取。 + +我们暂时不能做到: + +1. 单个File支持多线程并发读取,这里涉及到单个File内部切分算法。二期考虑支持。 + +2. 单个File在压缩情况下,从技术上无法支持多线程并发读取。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "setting": {}, + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "ftpReader", + "parameter": { + "protocol": "sftp", + "host": "10.101.86.94", + "port": 22, + "username": "xx", + "password": "xxx", + "path": [ + "/home/hanfa.shf/ftpReaderTest/data" + ], + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "index": 2, + "type": "double" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "date", + "format": "yyyy.MM.dd" + } + ], + "encoding": "UTF-8", + "fieldDelimiter": "," + } + }, + "writer": { + "name": "ftpWriter", + "parameter": { + "path": "/home/hanfa.shf/ftpReaderTest/result", + "fileName": "shihf", + "writeMode": "truncate", + "format": "yyyy-MM-dd" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **protocol** + + * 描述:ftp服务器协议,目前支持传输协议有ftp和sftp。
+ + * 必选:是
+ + * 默认值:无
+ +* **host** + + * 描述:ftp服务器地址。
+ + * 必选:是
+ + * 默认值:无
+ +* **port** + + * 描述:ftp服务器端口。
+ + * 必选:否
+ + * 默认值:若传输协议是sftp协议,默认值是22;若传输协议是标准ftp协议,默认值是21
+ +* **timeout** + + * 描述:连接ftp服务器连接超时时间,单位毫秒。
+ + * 必选:否
+ + * 默认值:60000(1分钟)
+* **connectPattern** + + * 描述:连接模式(主动模式或者被动模式)。该参数只在传输协议是标准ftp协议时使用,值只能为:PORT (主动),PASV(被动)。两种模式主要的不同是数据连接建立的不同。对于Port模式,是客户端在本地打开一个端口等服务器去连接建立数据连接,而Pasv模式就是服务器打开一个端口等待客户端去建立一个数据连接。
+ + * 必选:否
+ + * 默认值:PASV
+ +* **username** + + * 描述:ftp服务器访问用户名。
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:ftp服务器访问密码。
+ + * 必选:是
+ + * 默认值:无
+ +* **path** + + * 描述:远程FTP文件系统的路径信息,注意这里可以支持填写多个路径。
+ + 当指定单个远程FTP文件,FtpReader暂时只能使用单线程进行数据抽取。二期考虑在非压缩文件情况下针对单个File可以进行多线程并发读取。 + + 当指定多个远程FTP文件,FtpReader支持使用多线程进行数据抽取。线程并发数通过通道数指定。 + + 当指定通配符,FtpReader尝试遍历出多个文件信息。例如: 指定/*代表读取/目录下所有的文件,指定/bazhen/\*代表读取bazhen目录下游所有的文件。**FtpReader目前只支持\*作为文件通配符。** + + **特别需要注意的是,DataX会将一个作业下同步的所有Text File视作同一张数据表。用户必须自己保证所有的File能够适配同一套schema信息。读取文件用户必须保证为类CSV格式,并且提供给DataX权限可读。** + + **特别需要注意的是,如果Path指定的路径下没有符合匹配的文件抽取,DataX将报错。** + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:读取字段列表,type指定源数据的类型,index指定当前列来自于文本第几列(以0开始),value指定当前类型为常量,不从源头文件读取数据,而是根据value值自动生成对应的列。
+ + 默认情况下,用户可以全部按照String类型读取数据,配置如下: + + ```json + "column": ["*"] + ``` + + 用户可以指定Column字段信息,配置如下: + + ```json + { + "type": "long", + "index": 0 //从远程FTP文件文本第一列获取int字段 + }, + { + "type": "string", + "value": "alibaba" //从FtpReader内部生成alibaba的字符串字段作为当前字段 + } + ``` + + 对于用户指定Column信息,type必须填写,index/value必须选择其一。 + + * 必选:是
+ + * 默认值:全部按照string类型读取
+ +* **fieldDelimiter** + + * 描述:读取的字段分隔符
+ + * 必选:是
+ + * 默认值:,
+ +* **compress** + + * 描述:文本压缩类型,默认不填写意味着没有压缩。支持压缩类型为gzip、bzip2。
+ + * 必选:否
+ + * 默认值:没有压缩
+ +* **encoding** + + * 描述:读取文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+ +* **skipHeader** + + * 描述:类CSV格式文件可能存在表头为标题情况,需要跳过。默认不跳过。
+ + * 必选:否
+ + * 默认值:false
+ +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如如果用户配置: nullFormat:"\N",那么如果源头数据是"\N",DataX视作null字段。 + + * 必选:否
+ + * 默认值:\N
+ +* **maxTraversalLevel** + + * 描述:允许遍历文件夹的最大层数。
+ + * 必选:否
+ + * 默认值:100
+ + +### 3.3 类型转换 + +远程FTP文件本身不提供数据类型,该类型是DataX FtpReader定义: + +| DataX 内部类型| 远程FTP文件 数据类型 | +| -------- | ----- | +| +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* 远程FTP文件 Long是指远程FTP文件文本中使用整形的字符串表示形式,例如"19901219"。 +* 远程FTP文件 Double是指远程FTP文件文本中使用Double的字符串表示形式,例如"3.1415"。 +* 远程FTP文件 Boolean是指远程FTP文件文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* 远程FTP文件 Date是指远程FTP文件文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + + +## 4 性能报告 + + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + diff --git a/ftpreader/ftpreader.iml b/ftpreader/ftpreader.iml new file mode 100644 index 000000000..47f2ef4d8 --- /dev/null +++ b/ftpreader/ftpreader.iml @@ -0,0 +1,32 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/ftpreader/pom.xml b/ftpreader/pom.xml new file mode 100755 index 000000000..28121bafd --- /dev/null +++ b/ftpreader/pom.xml @@ -0,0 +1,92 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + ftpreader + ftpreader + FtpReader提供了读取指定ftp服务器文件功能,并可以根据用户配置的类型进行类型转换,建议开发、测试环境使用。 + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + com.jcraft + jsch + 0.1.51 + + + commons-net + commons-net + 3.3 + + + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/ftpreader/src/main/assembly/package.xml b/ftpreader/src/main/assembly/package.xml new file mode 100755 index 000000000..4c340a4cb --- /dev/null +++ b/ftpreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/ftpreader + + + target/ + + ftpreader-0.0.1-SNAPSHOT.jar + + plugin/reader/ftpreader + + + + + + false + plugin/reader/ftpreader/libs + runtime + + + diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Constant.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Constant.java new file mode 100755 index 000000000..15019fdb5 --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Constant.java @@ -0,0 +1,14 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + + +public class Constant { + public static final String SOURCE_FILES = "sourceFiles"; + + public static final int DEFAULT_FTP_PORT = 21; + public static final int DEFAULT_SFTP_PORT = 22; + public static final int DEFAULT_TIMEOUT = 60000; + public static final int DEFAULT_MAX_TRAVERSAL_LEVEL = 100; + public static final String DEFAULT_FTP_CONNECT_PATTERN = "PASV"; + + +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpHelper.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpHelper.java new file mode 100644 index 000000000..f8b3f56f2 --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpHelper.java @@ -0,0 +1,107 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import java.io.InputStream; +import java.util.HashSet; +import java.util.List; + +public abstract class FtpHelper { + /** + * + * @Title: LoginFtpServer + * @Description: 与ftp服务器建立连接 + * 
@param @param host + * @param @param username + * @param @param password + * @param @param port + * @param @param timeout + * @param @param connectMode + * @return void + * @throws + */ + public abstract void loginFtpServer(String host, String username, String password, int port, int timeout,String connectMode) ; + /** + * + * @Title: LogoutFtpServer + * todo 方法名首字母 + * @Description: 断开与ftp服务器的连接 + * @param + * @return void + * @throws + */ + public abstract void logoutFtpServer(); + /** + * + * @Title: isDirExist + * @Description: 判断指定路径是否是目录 + * @param @param directoryPath + * @param @return + * @return boolean + * @throws + */ + public abstract boolean isDirExist(String directoryPath); + /** + * + * @Title: isFileExist + * @Description: 判断指定路径是否是文件 + * @param @param filePath + * @param @return + * @return boolean + * @throws + */ + public abstract boolean isFileExist(String filePath); + /** + * + * @Title: isSymbolicLink + * @Description: 判断指定路径是否是软链接 + * @param @param filePath + * @param @return + * @return boolean + * @throws + */ + public abstract boolean isSymbolicLink(String filePath); + /** + * + * @Title: getListFiles + * @Description: 递归获取指定路径下符合条件的所有文件绝对路径 + * @param @param directoryPath + * @param @param parentLevel 父目录的递归层数(首次为0) + * @param @param maxTraversalLevel 允许的最大递归层数 + * @param @return + * @return HashSet + * @throws + */ + public abstract HashSet getListFiles(String directoryPath, int parentLevel, int maxTraversalLevel); + + /** + * + * @Title: getInputStream + * @Description: 获取指定路径的输入流 + * @param @param filePath + * @param @return + * @return InputStream + * @throws + */ + public abstract InputStream getInputStream(String filePath); + + /** + * + * @Title: getAllFiles + * @Description: 获取指定路径列表下符合条件的所有文件的绝对路径 + * @param @param srcPaths 路径列表 + * @param @param parentLevel 父目录的递归层数(首次为0) + * @param @param maxTraversalLevel 允许的最大递归层数 + * @param @return + * @return HashSet + * @throws + */ + public HashSet getAllFiles(List srcPaths, int parentLevel, int maxTraversalLevel){ + HashSet sourceAllFiles = new HashSet(); + if (!srcPaths.isEmpty()) { + for (String eachPath : srcPaths) { + sourceAllFiles.addAll(getListFiles(eachPath, parentLevel, maxTraversalLevel)); + } + } + return sourceAllFiles; + } + +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReader.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReader.java new file mode 100644 index 000000000..3802a5f69 --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReader.java @@ -0,0 +1,244 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.InputStream; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; + +/** + * + * @ClassName: FtpFileReader + * @date 2015年7月6日 上午9:24:57 + * + */ +public class FtpReader extends Reader { + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration originConfig = null; + + private List path = null; + + private HashSet sourceFiles; + + // ftp链接参数 + private String protocol; + private String host; + private 
int port; + private String username; + private String password; + private int timeout; + private String connectPattern; + private int maxTraversalLevel; + + private FtpHelper ftpHelper = null; + + @Override + public void init() { + this.originConfig = this.getPluginJobConf(); + this.sourceFiles = new HashSet(); + + this.validateParameter(); + UnstructuredStorageReaderUtil.validateParameter(this.originConfig); + + if ("sftp".equals(protocol)) { + //sftp协议 + this.port = originConfig.getInt(Key.PORT, Constant.DEFAULT_SFTP_PORT); + this.ftpHelper = new SftpHelper(); + } else if ("ftp".equals(protocol)) { + // ftp 协议 + this.port = originConfig.getInt(Key.PORT, Constant.DEFAULT_FTP_PORT); + this.ftpHelper = new StandardFtpHelper(); + } + ftpHelper.loginFtpServer(host, username, password, port, timeout, connectPattern); + + } + + private void validateParameter() { + //todo 常量 + this.protocol = this.originConfig.getNecessaryValue(Key.PROTOCOL, FtpReaderErrorCode.REQUIRED_VALUE); + boolean ptrotocolTag = "ftp".equals(this.protocol) || "sftp".equals(this.protocol); + if (!ptrotocolTag) { + throw DataXException.asDataXException(FtpReaderErrorCode.ILLEGAL_VALUE, + String.format("仅支持 ftp和sftp 传输协议 , 不支持您配置的传输协议: [%s]", protocol)); + } + this.host = this.originConfig.getNecessaryValue(Key.HOST, FtpReaderErrorCode.REQUIRED_VALUE); + this.username = this.originConfig.getNecessaryValue(Key.USERNAME, FtpReaderErrorCode.REQUIRED_VALUE); + this.password = this.originConfig.getNecessaryValue(Key.PASSWORD, FtpReaderErrorCode.REQUIRED_VALUE); + this.timeout = originConfig.getInt(Key.TIMEOUT, Constant.DEFAULT_TIMEOUT); + this.maxTraversalLevel = originConfig.getInt(Key.MAXTRAVERSALLEVEL, Constant.DEFAULT_MAX_TRAVERSAL_LEVEL); + + // only support connect pattern + this.connectPattern = this.originConfig.getUnnecessaryValue(Key.CONNECTPATTERN, Constant.DEFAULT_FTP_CONNECT_PATTERN, null); + boolean connectPatternTag = "PORT".equals(connectPattern) || "PASV".equals(connectPattern); + if (!connectPatternTag) { + throw DataXException.asDataXException(FtpReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的ftp传输模式: [%s]", connectPattern)); + }else{ + this.originConfig.set(Key.CONNECTPATTERN, connectPattern); + } + + //path check + String pathInString = this.originConfig.getNecessaryValue(Key.PATH, FtpReaderErrorCode.REQUIRED_VALUE); + if (!pathInString.startsWith("[") && !pathInString.endsWith("]")) { + path = new ArrayList(); + path.add(pathInString); + } else { + path = this.originConfig.getList(Key.PATH, String.class); + if (null == path || path.size() == 0) { + throw DataXException.asDataXException(FtpReaderErrorCode.REQUIRED_VALUE, "您需要指定待读取的源目录或文件"); + } + for (String eachPath : path) { + if(!eachPath.startsWith("/")){ + String message = String.format("请检查参数path:[%s],需要配置为绝对路径", eachPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.ILLEGAL_VALUE, message); + } + } + } + + } + + @Override + public void prepare() { + LOG.debug("prepare() begin..."); + + this.sourceFiles = ftpHelper.getAllFiles(path, 0, maxTraversalLevel); + + LOG.info(String.format("您即将读取的文件数为: [%s]", this.sourceFiles.size())); + } + + @Override + public void post() { + } + + @Override + public void destroy() { + ftpHelper.logoutFtpServer(); + } + + // warn: 如果源目录为空会报错,拖空目录意图=>空文件显示指定此意图 + @Override + public List split(int adviceNumber) { + LOG.debug("split() begin..."); + List readerSplitConfigs = new ArrayList(); + + // warn:每个slice拖且仅拖一个文件, + // int splitNumber = adviceNumber; + int splitNumber = 
this.sourceFiles.size(); + if (0 == splitNumber) { + throw DataXException.asDataXException(FtpReaderErrorCode.EMPTY_DIR_EXCEPTION, + String.format("未能找到待读取的文件,请确认您的配置项path: %s", this.originConfig.getString(Key.PATH))); + } + + List> splitedSourceFiles = this.splitSourceFiles(new ArrayList(this.sourceFiles), splitNumber); + for (List files : splitedSourceFiles) { + Configuration splitedConfig = this.originConfig.clone(); + splitedConfig.set(Constant.SOURCE_FILES, files); + readerSplitConfigs.add(splitedConfig); + } + LOG.debug("split() ok and end..."); + return readerSplitConfigs; + } + + private List> splitSourceFiles(final List sourceList, int adviceNumber) { + List> splitedList = new ArrayList>(); + int averageLength = sourceList.size() / adviceNumber; + averageLength = averageLength == 0 ? 1 : averageLength; + + for (int begin = 0, end = 0; begin < sourceList.size(); begin = end) { + end = begin + averageLength; + if (end > sourceList.size()) { + end = sourceList.size(); + } + splitedList.add(sourceList.subList(begin, end)); + } + return splitedList; + } + + } + + public static class Task extends Reader.Task { + private static Logger LOG = LoggerFactory.getLogger(Task.class); + + private String host; + private int port; + private String username; + private String password; + private String protocol; + private int timeout; + private String connectPattern; + + private Configuration readerSliceConfig; + private List sourceFiles; + + private FtpHelper ftpHelper = null; + + @Override + public void init() {//连接重试 + /* for ftp connection */ + this.readerSliceConfig = this.getPluginJobConf(); + this.host = readerSliceConfig.getString(Key.HOST); + this.protocol = readerSliceConfig.getString(Key.PROTOCOL); + this.username = readerSliceConfig.getString(Key.USERNAME); + this.password = readerSliceConfig.getString(Key.PASSWORD); + this.timeout = readerSliceConfig.getInt(Key.TIMEOUT, Constant.DEFAULT_TIMEOUT); + + this.sourceFiles = this.readerSliceConfig.getList(Constant.SOURCE_FILES, String.class); + + if ("sftp".equals(protocol)) { + //sftp协议 + this.port = readerSliceConfig.getInt(Key.PORT, Constant.DEFAULT_SFTP_PORT); + this.ftpHelper = new SftpHelper(); + } else if ("ftp".equals(protocol)) { + // ftp 协议 + this.port = readerSliceConfig.getInt(Key.PORT, Constant.DEFAULT_FTP_PORT); + this.connectPattern = readerSliceConfig.getString(Key.CONNECTPATTERN, Constant.DEFAULT_FTP_CONNECT_PATTERN);// 默认为被动模式 + this.ftpHelper = new StandardFtpHelper(); + } + ftpHelper.loginFtpServer(host, username, password, port, timeout, connectPattern); + + } + + @Override + public void prepare() { + + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + ftpHelper.logoutFtpServer(); + } + + @Override + public void startRead(RecordSender recordSender) { + LOG.debug("start read source files..."); + for (String fileName : this.sourceFiles) { + LOG.info(String.format("reading file : [%s]", fileName)); + InputStream inputStream = null; + + inputStream = ftpHelper.getInputStream(fileName); + + UnstructuredStorageReaderUtil.readFromStream(inputStream, fileName, this.readerSliceConfig, + recordSender, this.getTaskPluginCollector()); + recordSender.flush(); + } + + LOG.debug("end read source files..."); + } + + } +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReaderErrorCode.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReaderErrorCode.java new file mode 100755 index 000000000..8c71a2061 --- /dev/null +++ 
b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReaderErrorCode.java @@ -0,0 +1,52 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-20. + */ +public enum FtpReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("FtpReader-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("FtpReader-01", "您填写的参数值不合法."), + MIXED_INDEX_VALUE("FtpReader-02", "您的列信息配置同时包含了index,value."), + NO_INDEX_VALUE("FtpReader-03","您明确的配置列信息,但未填写相应的index,value."), + + FILE_NOT_EXISTS("FtpReader-04", "您配置的目录文件路径不存在或者没有权限读取."), + OPEN_FILE_WITH_CHARSET_ERROR("FtpReader-05", "您配置的文件编码和实际文件编码不符合."), + OPEN_FILE_ERROR("FtpReader-06", "您配置的文件在打开时异常."), + READ_FILE_IO_ERROR("FtpReader-07", "您配置的文件在读取时出现IO异常."), + SECURITY_NOT_ENOUGH("FtpReader-08", "您缺少权限执行相应的文件操作."), + CONFIG_INVALID_EXCEPTION("FtpReader-09", "您的参数配置错误."), + RUNTIME_EXCEPTION("FtpReader-10", "出现运行时异常, 请联系我们"), + EMPTY_DIR_EXCEPTION("FtpReader-11", "您尝试读取的文件目录为空."), + + FAIL_LOGIN("FtpReader-12", "登录失败,无法与ftp服务器建立连接."), + FAIL_DISCONNECT("FtpReader-13", "关闭ftp连接失败,无法与ftp服务器断开连接."), + COMMAND_FTP_IO_EXCEPTION("FtpReader-14", "与ftp服务器连接异常."), + OUT_MAX_DIRECTORY_LEVEL("FtpReader-15", "超出允许的最大目录层数."), + LINK_FILE("FtpReader-16", "您尝试读取的文件为链接文件."),; + + private final String code; + private final String description; + + private FtpReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Key.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Key.java new file mode 100755 index 000000000..cdbd043cd --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Key.java @@ -0,0 +1,13 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +public class Key { + public static final String PROTOCOL = "protocol"; + public static final String HOST = "host"; + public static final String USERNAME = "username"; + public static final String PASSWORD = "password"; + public static final String PORT = "port"; + public static final String TIMEOUT = "timeout"; + public static final String CONNECTPATTERN = "connectPattern"; + public static final String PATH = "path"; + public static final String MAXTRAVERSALLEVEL = "maxTraversalLevel"; +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/SftpHelper.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/SftpHelper.java new file mode 100644 index 000000000..cbf53520b --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/SftpHelper.java @@ -0,0 +1,247 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import java.io.InputStream; +import java.util.HashSet; +import java.util.Properties; +import java.util.Vector; + +import org.apache.commons.io.IOUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.reader.ftpreader.FtpReader.Job; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import com.jcraft.jsch.ChannelSftp; +import com.jcraft.jsch.JSch; +import 
com.jcraft.jsch.JSchException; +import com.jcraft.jsch.Session; +import com.jcraft.jsch.SftpATTRS; +import com.jcraft.jsch.SftpException; +import com.jcraft.jsch.ChannelSftp.LsEntry; + +public class SftpHelper extends FtpHelper { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + Session session = null; + ChannelSftp channelSftp = null; + @Override + public void loginFtpServer(String host, String username, String password, int port, int timeout, + String connectMode) { + JSch jsch = new JSch(); // 创建JSch对象 + try { + session = jsch.getSession(username, host, port); + // 根据用户名,主机ip,端口获取一个Session对象 + // 如果服务器连接不上,则抛出异常 + if (session == null) { + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, + "session is null,无法通过sftp与服务器建立链接,请检查主机名和用户名是否正确."); + } + + session.setPassword(password); // 设置密码 + Properties config = new Properties(); + config.put("StrictHostKeyChecking", "no"); + session.setConfig(config); // 为Session对象设置properties + session.setTimeout(timeout); // 设置timeout时间 + session.connect(); // 通过Session建立链接 + + channelSftp = (ChannelSftp) session.openChannel("sftp"); // 打开SFTP通道 + channelSftp.connect(); // 建立SFTP通道的连接 + + //设置命令传输编码 + //String fileEncoding = System.getProperty("file.encoding"); + //channelSftp.setFilenameEncoding(fileEncoding); + } catch (JSchException e) { + if(null != e.getCause()){ + String cause = e.getCause().toString(); + String unknownHostException = "java.net.UnknownHostException: " + host; + String illegalArgumentException = "java.lang.IllegalArgumentException: port out of range:" + port; + String wrongPort = "java.net.ConnectException: Connection refused"; + if (unknownHostException.equals(cause)) { + String message = String.format("请确认ftp服务器地址是否正确,无法连接到地址为: [%s] 的ftp服务器", host); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } else if (illegalArgumentException.equals(cause) || wrongPort.equals(cause) ) { + String message = String.format("请确认连接ftp服务器端口是否正确,错误的端口: [%s] ", port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } + }else { + if("Auth fail".equals(e.getMessage())){ + String message = String.format("与ftp服务器建立连接失败,请检查用户名和密码是否正确: [%s]", + "message:host =" + host + ",username = " + username + ",port =" + port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message); + }else{ + String message = String.format("与ftp服务器建立连接失败 : [%s]", + "message:host =" + host + ",username = " + username + ",port =" + port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } + } + } + + } + + @Override + public void logoutFtpServer() { + if (channelSftp != null) { + channelSftp.disconnect(); + } + if (session != null) { + session.disconnect(); + } + } + + @Override + public boolean isDirExist(String directoryPath) { + try { + SftpATTRS sftpATTRS = channelSftp.lstat(directoryPath); + return sftpATTRS.isDir(); + } catch (SftpException e) { + if (e.getMessage().toLowerCase().equals("no such file")) { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + String message = String.format("进入目录:[%s]时发生I/O异常,请确认与ftp服务器的连接正常", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + + 
@Override + public boolean isFileExist(String filePath) { + boolean isExitFlag = false; + try { + SftpATTRS sftpATTRS = channelSftp.lstat(filePath); + if(sftpATTRS.getSize() >= 0){ + isExitFlag = true; + } + } catch (SftpException e) { + if (e.getMessage().toLowerCase().equals("no such file")) { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } else { + String message = String.format("获取文件:[%s] 属性时发生I/O异常,请确认与ftp服务器的连接正常", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + return isExitFlag; + } + + @Override + public boolean isSymbolicLink(String filePath) { + try { + SftpATTRS sftpATTRS = channelSftp.lstat(filePath); + return sftpATTRS.isLink(); + } catch (SftpException e) { + if (e.getMessage().toLowerCase().equals("no such file")) { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } else { + String message = String.format("获取文件:[%s] 属性时发生I/O异常,请确认与ftp服务器的连接正常", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + } + + HashSet sourceFiles = new HashSet(); + @Override + public HashSet getListFiles(String directoryPath, int parentLevel, int maxTraversalLevel) { + if(parentLevel < maxTraversalLevel){ + String parentPath = null;// 父级目录,以'/'结尾 + int pathLen = directoryPath.length(); + if (directoryPath.contains("*") || directoryPath.contains("?")) {//*和?的限制 + // path是正则表达式 + String subPath = UnstructuredStorageReaderUtil.getRegexPathParentPath(directoryPath); + if (isDirExist(subPath)) { + parentPath = subPath; + } else { + String message = String.format("不能进入目录:[%s]," + "请确认您的配置项path:[%s]存在,且配置的用户有权限进入", subPath, + directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + + } else if (isDirExist(directoryPath)) { + // path是目录 + if (directoryPath.charAt(pathLen - 1) == IOUtils.DIR_SEPARATOR) { + parentPath = directoryPath; + } else { + parentPath = directoryPath + IOUtils.DIR_SEPARATOR; + } + } else if(isSymbolicLink(directoryPath)){ + //path是链接文件 + String message = String.format("文件:[%s]是链接文件,当前不支持链接文件的读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.LINK_FILE, message); + }else if (isFileExist(directoryPath)) { + // path指向具体文件 + sourceFiles.add(directoryPath); + return sourceFiles; + } else { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + + try { + Vector vector = channelSftp.ls(directoryPath); + for (int i = 0; i < vector.size(); i++) { + LsEntry le = (LsEntry) vector.get(i); + String strName = le.getFilename(); + String filePath = parentPath + strName; + + if (isDirExist(filePath)) { + // 是子目录 + if (!(strName.equals(".") || strName.equals(".."))) { + //递归处理 + getListFiles(filePath, parentLevel+1, maxTraversalLevel); + } + } else if(isSymbolicLink(filePath)){ + //是链接文件 + String message = String.format("文件:[%s]是链接文件,当前不支持链接文件的读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.LINK_FILE, message); + }else if 
(isFileExist(filePath)) { + // 是文件 + sourceFiles.add(filePath); + } else { + String message = String.format("请确认path:[%s]存在,且配置的用户有权限读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + + } // end for vector + } catch (SftpException e) { + String message = String.format("获取path:[%s] 下文件列表时发生I/O异常,请确认与ftp服务器的连接正常", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + + return sourceFiles; + }else{ + //超出最大递归层数 + String message = String.format("获取path:[%s] 下文件列表时超出最大层数,请确认路径[%s]下不存在软连接文件", directoryPath, directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.OUT_MAX_DIRECTORY_LEVEL, message); + } + } + + @Override + public InputStream getInputStream(String filePath) { + try { + return channelSftp.get(filePath); + } catch (SftpException e) { + String message = String.format("读取文件 : [%s] 时出错,请确认文件:[%s]存在且配置的用户有权限读取", filePath, filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.OPEN_FILE_ERROR, message); + } + } + +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/StandardFtpHelper.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/StandardFtpHelper.java new file mode 100644 index 000000000..0ff6b5938 --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/StandardFtpHelper.java @@ -0,0 +1,229 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import java.io.IOException; +import java.io.InputStream; +import java.net.UnknownHostException; +import java.util.HashSet; + +import org.apache.commons.io.IOUtils; +import org.apache.commons.net.ftp.FTP; +import org.apache.commons.net.ftp.FTPClient; +import org.apache.commons.net.ftp.FTPClientConfig; +import org.apache.commons.net.ftp.FTPFile; +import org.apache.commons.net.ftp.FTPReply; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.reader.ftpreader.FtpReader.Job; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; + +public class StandardFtpHelper extends FtpHelper { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + FTPClient ftpClient = null; + + @Override + public void loginFtpServer(String host, String username, String password, int port, int timeout, + String connectMode) { + ftpClient = new FTPClient(); + try { + // 连接 + ftpClient.connect(host, port); + // 登录 + ftpClient.login(username, password); + ftpClient.configure(new FTPClientConfig(FTPClientConfig.SYST_UNIX)); + ftpClient.setConnectTimeout(timeout); + ftpClient.setDataTimeout(timeout); + if ("PASV".equals(connectMode)) { + ftpClient.enterRemotePassiveMode(); + ftpClient.enterLocalPassiveMode(); + } else if ("PORT".equals(connectMode)) { + ftpClient.enterLocalActiveMode(); + // ftpClient.enterRemoteActiveMode(host, port); + } + int reply = ftpClient.getReplyCode(); + if (!FTPReply.isPositiveCompletion(reply)) { + ftpClient.disconnect(); + String message = String.format("与ftp服务器建立连接失败,请检查用户名和密码是否正确: [%s]", + "message:host =" + host + ",username = " + username + ",port =" + port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message); + } + //设置命令传输编码 + String fileEncoding = System.getProperty("file.encoding"); + 
ftpClient.setControlEncoding(fileEncoding); + } catch (UnknownHostException e) { + String message = String.format("请确认ftp服务器地址是否正确,无法连接到地址为: [%s] 的ftp服务器", host); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } catch (IllegalArgumentException e) { + String message = String.format("请确认连接ftp服务器端口是否正确,错误的端口: [%s] ", port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } catch (Exception e) { + String message = String.format("与ftp服务器建立连接失败 : [%s]", + "message:host =" + host + ",username = " + username + ",port =" + port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } + + } + + @Override + public void logoutFtpServer() { + if (ftpClient.isConnected()) { + try { + //todo ftpClient.completePendingCommand();//打开流操作之后必须,原因还需要深究 + ftpClient.logout(); + } catch (IOException e) { + String message = "与ftp服务器断开连接失败"; + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_DISCONNECT, message, e); + }finally { + if(ftpClient.isConnected()){ + try { + ftpClient.disconnect(); + } catch (IOException e) { + String message = "与ftp服务器断开连接失败"; + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_DISCONNECT, message, e); + } + } + + } + } + } + + @Override + public boolean isDirExist(String directoryPath) { + try { + return ftpClient.changeWorkingDirectory(new String(directoryPath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + } catch (IOException e) { + String message = String.format("进入目录:[%s]时发生I/O异常,请确认与ftp服务器的连接正常", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + + @Override + public boolean isFileExist(String filePath) { + boolean isExitFlag = false; + try { + FTPFile[] ftpFiles = ftpClient.listFiles(new String(filePath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + if (ftpFiles.length == 1 && ftpFiles[0].isFile()) { + isExitFlag = true; + } + } catch (IOException e) { + String message = String.format("获取文件:[%s] 属性时发生I/O异常,请确认与ftp服务器的连接正常", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + return isExitFlag; + } + + @Override + public boolean isSymbolicLink(String filePath) { + boolean isExitFlag = false; + try { + FTPFile[] ftpFiles = ftpClient.listFiles(new String(filePath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + if (ftpFiles.length == 1 && ftpFiles[0].isSymbolicLink()) { + isExitFlag = true; + } + } catch (IOException e) { + String message = String.format("获取文件:[%s] 属性时发生I/O异常,请确认与ftp服务器的连接正常", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + return isExitFlag; + } + + HashSet sourceFiles = new HashSet(); + @Override + public HashSet getListFiles(String directoryPath, int parentLevel, int maxTraversalLevel) { + if(parentLevel < maxTraversalLevel){ + String parentPath = null;// 父级目录,以'/'结尾 + int pathLen = directoryPath.length(); + if (directoryPath.contains("*") || directoryPath.contains("?")) { + // path是正则表达式 + String subPath = UnstructuredStorageReaderUtil.getRegexPathParentPath(directoryPath); + if (isDirExist(subPath)) { + parentPath = subPath; + } else { + String message = String.format("不能进入目录:[%s]," + "请确认您的配置项path:[%s]存在,且配置的用户有权限进入", subPath, + directoryPath); + 
LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + } else if (isDirExist(directoryPath)) { + // path是目录 + if (directoryPath.charAt(pathLen - 1) == IOUtils.DIR_SEPARATOR) { + parentPath = directoryPath; + } else { + parentPath = directoryPath + IOUtils.DIR_SEPARATOR; + } + } else if (isFileExist(directoryPath)) { + // path指向具体文件 + sourceFiles.add(directoryPath); + return sourceFiles; + } else if(isSymbolicLink(directoryPath)){ + //path是链接文件 + String message = String.format("文件:[%s]是链接文件,当前不支持链接文件的读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.LINK_FILE, message); + }else { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + + try { + FTPFile[] fs = ftpClient.listFiles(new String(directoryPath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + for (FTPFile ff : fs) { + String strName = ff.getName(); + String filePath = parentPath + strName; + if (ff.isDirectory()) { + if (!(strName.equals(".") || strName.equals(".."))) { + //递归处理 + getListFiles(filePath, parentLevel+1, maxTraversalLevel); + } + } else if (ff.isFile()) { + // 是文件 + sourceFiles.add(filePath); + } else if(ff.isSymbolicLink()){ + //是链接文件 + String message = String.format("文件:[%s]是链接文件,当前不支持链接文件的读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.LINK_FILE, message); + }else { + String message = String.format("请确认path:[%s]存在,且配置的用户有权限读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + } // end for FTPFile + } catch (IOException e) { + String message = String.format("获取path:[%s] 下文件列表时发生I/O异常,请确认与ftp服务器的连接正常", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + return sourceFiles; + + } else{ + //超出最大递归层数 + String message = String.format("获取path:[%s] 下文件列表时超出最大层数,请确认路径[%s]下不存在软连接文件", directoryPath, directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.OUT_MAX_DIRECTORY_LEVEL, message); + } + } + + @Override + public InputStream getInputStream(String filePath) { + try { + return ftpClient.retrieveFileStream(new String(filePath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + } catch (IOException e) { + String message = String.format("读取文件 : [%s] 时出错,请确认文件:[%s]存在且配置的用户有权限读取", filePath, filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.OPEN_FILE_ERROR, message); + } + } + +} diff --git a/ftpreader/src/main/resources/plugin-template.json b/ftpreader/src/main/resources/plugin-template.json new file mode 100755 index 000000000..9680aec67 --- /dev/null +++ b/ftpreader/src/main/resources/plugin-template.json @@ -0,0 +1,38 @@ +{ + "name": "ftpreader", + "parameter": { + "host": "", + "port": "", + "username": "", + "password": "", + "protocol": "", + "path": [ + "" + ], + "encoding": "UTF-8", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "index": 2, + "type": "double" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "date", + "format": "yyyy.MM.dd" + } + ], + "fieldDelimiter": "," + } +} diff --git a/ftpreader/src/main/resources/plugin.json b/ftpreader/src/main/resources/plugin.json new file mode 100755 index 
000000000..ce5ce26b9 --- /dev/null +++ b/ftpreader/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "ftpreader", + "class": "com.alibaba.datax.plugin.reader.ftpreader.FtpReader", + "description": "useScene: test. mechanism: use datax framework to transport data from txt file. warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} + diff --git a/ftpreader/src/main/resources/plugin_job_template.json b/ftpreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..88f429f3e --- /dev/null +++ b/ftpreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,19 @@ +{ + "name": "ftpreader", + "parameter": { + "host": "", + "protocol": "sftp", + "port":"", + "username": "", + "password": "", + "path": [], + "column": [ + { + "index": 0, + "type": "" + } + ], + "fieldDelimiter": ",", + "encoding": "UTF-8" + } +} \ No newline at end of file diff --git a/hbasereader/doc/hbasereader.md b/hbasereader/doc/hbasereader.md new file mode 100644 index 000000000..4a6c77364 --- /dev/null +++ b/hbasereader/doc/hbasereader.md @@ -0,0 +1,347 @@ + +# HbaseReader 插件文档 + + +___ + + + +## 1 快速介绍 + +HbaseReader 插件实现了从 Hbase 读取数据。在底层实现上,HbaseReader 通过 HBase 的 Java 客户端连接远程 HBase 服务,并通过 Scan 方式读取数据。典型示例如下: + + + Scan scan = new Scan(); + scan.setStartRow(startKey); + scan.setStopRow(endKey); + + ResultScanner resultScanner = table.getScanner(scan); + for(Result r:resultScanner){ + System.out.println(new String(r.getRow())); + for(KeyValue kv:r.raw()){ + System.out.println(new String(kv.getValue())); + } + } + + +HbaseReader 需要特别注意如下几点: + +1、HbaseReader 中有一个必填配置项是:hbaseConfig,需要你联系 HBase PE,将hbase-site.xml 中与连接 HBase 相关的配置项提取出来,以 json 格式填入。 + +2、HbaseReader 中的 mode 配置项,必须填写且值只能为:normal 或者 multiVersion。当值为 normal 时,会把 HBase 中的表,当成普通二维表进行读取;当值为 multiVersion 时,会把每一个 cell 中的值,读成 DataX 中的一个 Record,Record 中的格式是: + +| 第0列 | 第1列 | 第2列 | 第3列 | +| --------| ---------------- |----- |----- | +| rowKey | column:qualifier| timestamp | value | + + +## 2 实现原理 + +简而言之,HbaseReader 通过 HBase 的 Java 客户端,通过 HTable, Scan, ResultScanner 等 API,读取你指定 rowkey 范围内的数据,并将读取的数据使用 DataX 自定义的数据类型拼装为抽象的数据集,并传递给下游 Writer 处理。 + + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从 HBase 抽取数据到本地的作业:(normal 模式) + +``` +{ + "job": { + "setting": { + "speed": { + //设置传输速度,单位为byte/s,DataX运行会尽可能达到该速度但是不超过它. + "byte": 1048576 + } + //出错限制 + "errorLimit": { + //出错的record条数上限,当大于该值即报错。 + "record": 0, + //出错的record百分比上限 1.0表示100%,0.02表示2% + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "hbasereader", + "parameter": { + "hbaseConfig": "hbase-site 文件中与连接相关的配置项,以 json 格式填写", + "table": "hbase_test_table", + "encoding": "utf8", + "mode": "normal", + "column": [ + { + "name": "rowkey", + "type": "string" + }, + { + "name": "fb:comm_result_code", + "type": "string" + }, + { + "name": "fb:exchange_amount", + "type": "string" + }, + { + "name": "fb:exchange_status", + "type": "string" + } + ], + "range": { + "startRowkey": "", + "endRowkey": "" + }, + "isBinaryRowkey": true + } + }, + "writer": { + //writer类型 + "name": "streamwriter", + //是否打印内容 + "parameter": { + "print": true + } + } + } + ] + } +} + +``` + +* 配置一个从 HBase 抽取数据到本地的作业:( multiVersion 模式) + +``` + +TODO + +``` + + +### 3.2 参数说明 + +* **hbaseConfig** + + * 描述:每个HBase集群提供给DataX客户端连接 的配置信息存放在hbase-site.xml,请联系你的HBase DBA提供配置信息,并转换为JSON格式填写如下:{"key1":"value1","key2":"value2"}。比如:{"hbase.zookeeper.quorum":"????","hbase.zookeeper.property.clientPort":"????"} 这样的形式。注意:如果是手写json,那么需要把双引号 转义为\" + + * 必选:是
+ + * 默认值:无
+ +* **mode** + + * 描述:读取模式,取值为 normal 或 multiVersion:normal 模式把 HBase 中的表当作普通二维表读取;multiVersion 模式把每一个 cell 读成一条记录,依次包含 rowKey、column:qualifier、timestamp、value 四列(参见第 1 节的说明)。
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:要读取的 hbase 表名(大小写敏感)
+ + * 必选:是
+ + * 默认值:无
+ +* **encoding** + + * 描述:编码方式,UTF-8 或 GBK,在把二进制存储的 HBase byte[] 转为 String 时使用。
+ + * 必选:否
+ + * 默认值:UTF-8
+ + +* **column** + + * 描述:normal 模式下要读取的 hbase 字段列表。每个元素为一个 Map:name 指定要读取的列,格式为 列族:列名,或填固定值 rowkey 表示读取行键;type 指定该列读出后的类型,可填 string、binarystring、bytes、boolean、short、int、long、float、double、date;type 为 date 时还需配置 format 指定时间格式(如 yyyy-MM-dd HH:mm:ss)。 + + 支持列裁剪,即可以只挑选部分列进行导出。 + + 支持列换序,即列可以不按照表 schema 信息的顺序进行导出。 + + 支持常量列:不配置 name 而配置 value,该列不从 HBase 读取,直接以 value 的值填充。具体写法参见下面的配置示意。 + + * 必选:是
+ + * 默认值:无
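+
+结合上面的说明,下面给出一段 normal 模式下 column 配置的示意(其中列族、列名、常量值均为假设的示例值):
+
+```
+"column": [
+    { "name": "rowkey",         "type": "string" },
+    { "name": "cf0:user_name",  "type": "string" },
+    { "name": "cf0:gmt_create", "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
+    { "value": "shanghai",      "type": "string" }
+]
+```
+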
+ +* **startRowkey** + + * 描述:指定开始读取的 rowkey,需要配置在 range 配置项内部,并与 isBinaryRowkey 配合使用。 + + 不配置或配置为空字符串时,表示从表的起始 region 开始读取。 + + * 必选:否
+ + * 默认值:空
+ +* **endRowkey** + + * 描述:指定结束读取的 rowkey,需要配置在 range 配置项内部,并与 isBinaryRowkey 配合使用。 + + 不配置或配置为空字符串时,表示一直读取到表的最后一个 region;注意 startRowkey 不得大于 endRowkey。 + + * 必选:否
+ + * 默认值:无
+ +* **isBinaryRowkey** + + * 描述:指定 DataX 把配置的 startRowkey、endRowkey 转换为内部二进制时采用的 API:值为 true 时采用 Bytes.toBytesBinary(rowkey),值为 false 时采用 Bytes.toBytes(rowkey)。 + + 注意:该项同样需要配置在 range 配置项内部;一旦配置了 range,则必须配置 isBinaryRowkey。具体写法参见下面的 range 配置示意。 + + * 必选:否
+ + * 默认值:无
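+
+按照插件代码(HbaseUtil.doPretreatment)的校验逻辑,startRowkey、endRowkey、isBinaryRowkey 都应配置在 range 内部,示意如下(rowkey 取值仅为示例):
+
+```
+"range": {
+    "startRowkey": "aaa",
+    "endRowkey": "zzz",
+    "isBinaryRowkey": false
+}
+```
+
+注意:3.1 节样例把 isBinaryRowkey 写在了 range 外层,而从代码校验看,配置在与 hbaseConfig 平级的位置会直接报错,请以 range 内部的写法为准。
+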
+ + +### 3.3 类型转换(TODO) + +目前 HbaseReader 支持大部分 HBase 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出 HbaseReader 针对 HBase 类型转换列表: + + +| DataX 内部类型| HBase 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, mediumint, int, bigint| +| Double |float, double, decimal| +| String |varchar, char, tinytext, text, mediumtext, longtext | +| Date |date, datetime, timestamp, time, year | +| Boolean |bit, bool | +| Bytes |tinyblob, mediumblob, blob, longblob, varbinary | + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 +* `bit DataX属于未定义行为`。 + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + + TODO + + +单行记录类似于: + + biz_order_id: 888888888 + key_value: ;orderIds:20148888888,2014888888813800; + gmt_create: 2011-09-24 11:07:20 + gmt_modified: 2011-10-24 17:56:34 + attribute_cc: 1 + value_type: 3 + buyer_id: 8888888 + seller_id: 1 + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 24核 Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz + 2. mem: 48GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* Mysql数据库机器参数为: + 1. cpu: 32核 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz + 2. mem: 256GB + 3. net: 千兆双网卡 + 4. disc: BTWL419303E2800RGN INTEL SSDSC2BB800G4 D2010370 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + + +| 通道数| 是否按照主键切分| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡进入流量(MB/s)|DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------|--------| --------|--------|--------|--------|--------|--------| +|1| 否 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|1| 是 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|4| 否 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|4| 是 | 329733 | 32.60 | 58| 0.8 | 60| 0.76 | +|8| 否 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|8| 是 | 549556 | 54.33 | 115| 1.46 | 120| 0.78 | + +说明: + +1. 这里的单表,主键类型为 bigint(20),范围为:190247559466810-570722244711460,从主键范围划分看,数据分布均匀。 +2. 对单表如果没有安装主键切分,那么配置通道个数不会提升速度,效果与1个通道一样。 + + +#### 4.2.2 分表测试报告(2个分库,每个分库16张分表,共计32张分表) + + +| 通道数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡进入流量(MB/s)|DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------| --------|--------|--------|--------|--------|--------| +|1| 202241 | 20.06 | 31.5| 1.0 | 32 | 1.1 | +|4| 726358 | 72.04 | 123.9 | 3.1 | 132 | 3.6 | +|8|1074405 | 106.56| 197 | 5.5 | 205| 5.1| +|16| 1227892 | 121.79 | 229.2 | 8.1 | 233 | 7.3 | + +## 5 约束限制 + +### 5.1 ? + +主备同步问题指Mysql使用主从灾备,备库从主库不间断通过binlog恢复数据。由于主备数据同步存在一定的时间差,特别在于某些特定情况,例如网络延迟等问题,导致备库同步恢复的数据与主库有较大差别,导致从备库同步的数据不是一份当前时间的完整镜像。 + +针对这个问题,我们提供了preSql功能,该功能待补充。 + +### 5.2 ? + +Mysql在数据存储划分中属于RDBMS系统,对外可以提供强一致性数据查询接口。例如当一次同步任务启动运行过程中,当该库存在其他数据写入方写入数据时,MysqlReader完全不会获取到写入更新数据,这是由于数据库本身的快照特性决定的。关于数据库快照特性,请参看[MVCC Wikipedia](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) + +上述是在MysqlReader单线程模型下数据同步一致性的特性,由于MysqlReader可以根据用户配置信息使用了并发数据抽取,因此不能严格保证数据一致性:当MysqlReader根据splitPk进行数据切分后,会先后启动多个并发任务完成数据同步。由于多个并发任务相互之间不属于同一个读事务,同时多个并发任务存在时间间隔。因此这份数据并不是`完整的`、`一致的`数据快照信息。 + +针对多线程的一致性快照需求,在技术上目前无法实现,只能从工程角度解决,工程化的方式存在取舍,我们提供几个解决思路给用户,用户可以自行选择: + +1. 使用单线程同步,即不再进行数据切片。缺点是速度比较慢,但是能够很好保证一致性。 + +2. 关闭其他数据写入方,保证当前数据为静态数据,例如,锁表、关闭备库同步等等。缺点是可能影响在线业务。 + +### 5.3 ? + +Mysql本身的编码设置非常灵活,包括指定编码到库、表、字段级别,甚至可以均不同编码。优先级从高到低为字段、表、库、实例。我们不推荐数据库用户设置如此混乱的编码,最好在库级别就统一到UTF-8。 + +MysqlReader底层使用JDBC进行数据抽取,JDBC天然适配各类编码,并在底层进行了编码转换。因此MysqlReader不需用户指定编码,可以自动获取编码并转码。 + +对于Mysql底层写入编码和其设定的编码不一致的混乱情况,MysqlReader对此无法识别,对此也无法提供解决方案,对于这类情况,`导出有可能为乱码`。 + +### 5.4 ? 
+ +MysqlReader使用JDBC SELECT语句完成数据抽取工作,因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种: + +* 数据库在线应用写入数据库时,填充modify字段为更改时间戳,包括新增、更新、删除(逻辑删)。对于这类应用,MysqlReader只需要WHERE条件跟上一同步阶段时间戳即可。 +* 对于新增流水型数据,MysqlReader可以WHERE条件后跟上一阶段最大自增ID即可。 + +对于业务上无字段区分新增、修改数据情况,MysqlReader也无法进行增量数据同步,只能同步全量数据。 + + +## 6 FAQ + +*** + +**Q: ??同步报错,报错信息为XXX** + + A: 网络或者权限问题,请使用 HBase shell 命令行测试: + + TODO + +如果上述命令也报错,那可以证实是环境问题,请联系你的 PE。 diff --git a/hbasereader/hbasereader.iml b/hbasereader/hbasereader.iml new file mode 100644 index 000000000..8933a9f15 --- /dev/null +++ b/hbasereader/hbasereader.iml @@ -0,0 +1,83 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/hbasereader/pom.xml b/hbasereader/pom.xml new file mode 100755 index 000000000..50567df39 --- /dev/null +++ b/hbasereader/pom.xml @@ -0,0 +1,89 @@ + + + 4.0.0 + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + + hbasereader + hbasereader + 0.0.1-SNAPSHOT + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + org.apache.hbase + hbase + 0.94.27 + + + org.apache.hadoop + hadoop-core + 0.20.205.0 + + + org.apache.zookeeper + zookeeper + 3.3.2 + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/hbasereader/src/main/assembly/package.xml b/hbasereader/src/main/assembly/package.xml new file mode 100755 index 000000000..51ff86f4a --- /dev/null +++ b/hbasereader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/hbasereader + + + target/ + + hbasereader-0.0.1-SNAPSHOT.jar + + plugin/reader/hbasereader + + + + + + false + plugin/reader/hbasereader/libs + runtime + + + diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/ColumnType.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/ColumnType.java new file mode 100755 index 000000000..a6245722b --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/ColumnType.java @@ -0,0 +1,43 @@ +package com.alibaba.datax.plugin.reader.hbasereader; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.Arrays; + +/** + * 只对 normal 模式读取时有用,多版本读取时,不存在列类型的 + */ +public enum ColumnType { + STRING("string"), + BINARY_STRING("binarystring"), + BYTES("bytes"), + BOOLEAN("boolean"), + SHORT("short"), + INT("int"), + LONG("long"), + FLOAT("float"), + DOUBLE("double"), + DATE("date"),; + + private String typeName; + + ColumnType(String typeName) { + this.typeName = typeName; + } + + public static ColumnType getByTypeName(String typeName) { + for (ColumnType columnType : values()) { + if (columnType.typeName.equalsIgnoreCase(typeName)) { + return columnType; + } + } + + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, + String.format("Hbasereader 不支持该类型:%s, 目前支持的类型是:%s", typeName, Arrays.asList(values()))); + } + + @Override + public String toString() { + return this.typeName; + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/Constant.java 
b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/Constant.java new file mode 100755 index 000000000..53973a08b --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/Constant.java @@ -0,0 +1,10 @@ +package com.alibaba.datax.plugin.reader.hbasereader; + +public final class Constant { + public static final String RANGE = "range"; + + public static final String ROWKEY_FLAG = "rowkey"; + + public static final int DEFAULT_SCAN_CACHE = 256; + +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HTableManager.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HTableManager.java new file mode 100755 index 000000000..1747a38c1 --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HTableManager.java @@ -0,0 +1,27 @@ +package com.alibaba.datax.plugin.reader.hbasereader; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hbase.client.HBaseAdmin; +import org.apache.hadoop.hbase.client.HTable; + +import java.io.IOException; + +public final class HTableManager { + + public static HTable createHTable(Configuration config, String tableName) + throws IOException { + + return new HTable(config, tableName); + } + + public static HBaseAdmin createHBaseAdmin(Configuration config) + throws IOException { + return new HBaseAdmin(config); + } + + public static void closeHTable(HTable hTable) throws IOException { + if (hTable != null) { + hTable.close(); + } + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseColumnCell.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseColumnCell.java new file mode 100755 index 000000000..24f052bb0 --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseColumnCell.java @@ -0,0 +1,124 @@ +package com.alibaba.datax.plugin.reader.hbasereader; + +import com.alibaba.datax.common.base.BaseObject; +import com.alibaba.datax.plugin.reader.hbasereader.util.HbaseUtil; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.hadoop.hbase.util.Bytes; + +/** + * 描述 hbasereader 插件中,column 配置中的一个单元项实体 + */ +public class HbaseColumnCell extends BaseObject { + private ColumnType columnType; + + // columnName 格式为:列族:列名 + private String columnName; + + private byte[] cf; + private byte[] qualifier; + + //对于常量类型,其常量值放到 columnValue 里 + private String columnValue; + + //当配置了 columnValue 时,isConstant=true(这个成员变量是用于方便使用本类的地方判断是否是常量类型字段) + private boolean isConstant; + + // 只在类型是时间类型时,才会设置该值,无默认值。形式如:yyyy-MM-dd HH:mm:ss + private String dateformat; + + private HbaseColumnCell(Builder builder) { + this.columnType = builder.columnType; + + //columnName 和 columnValue 必须有一个为 null + Validate.isTrue(builder.columnName == null || builder.columnValue == null, "Hbasereader 中,column 不能同时配置 列名称 和 列值,二者选其一."); + + //columnName 和 columnValue 不能都为 null + Validate.isTrue(builder.columnName != null || builder.columnValue != null, "Hbasereader 中,column 需要配置 列名称 或者 列值, 二者选其一."); + + if (builder.columnName != null) { + this.isConstant = false; + this.columnName = builder.columnName; + + // 如果 columnName 不是 rowkey,则必须配置为:列族:列名 格式 + if (!HbaseUtil.isRowkeyColumn(this.columnName)) { + + String promptInfo = "Hbasereader 中, column 的列配置格式应该是:列族:列名. 
您配置的列错误:" + this.columnName; + String[] cfAndQualifier = this.columnName.split(":"); + Validate.isTrue(cfAndQualifier != null && cfAndQualifier.length == 2 + && StringUtils.isNotBlank(cfAndQualifier[0]) + && StringUtils.isNotBlank(cfAndQualifier[1]), promptInfo); + + this.cf = Bytes.toBytes(cfAndQualifier[0].trim()); + this.qualifier = Bytes.toBytes(cfAndQualifier[1].trim()); + } + } else { + this.isConstant = true; + this.columnValue = builder.columnValue; + } + + if (builder.dateformat != null) { + this.dateformat = builder.dateformat; + } + } + + public ColumnType getColumnType() { + return columnType; + } + + public String getColumnName() { + return columnName; + } + + public byte[] getCf() { + return cf; + } + + public byte[] getQualifier() { + return qualifier; + } + + public String getDateformat() { + return dateformat; + } + + public String getColumnValue() { + return columnValue; + } + + public boolean isConstant() { + return isConstant; + } + + // 内部 builder 类 + public static class Builder { + private ColumnType columnType; + private String columnName; + private String columnValue; + + private String dateformat; + + public Builder(ColumnType columnType) { + this.columnType = columnType; + } + + public Builder columnName(String columnName) { + this.columnName = columnName; + return this; + } + + public Builder columnValue(String columnValue) { + this.columnValue = columnValue; + return this; + } + + public Builder dateformat(String dateformat) { + this.dateformat = dateformat; + return this; + } + + public HbaseColumnCell build() { + return new HbaseColumnCell(this); + } + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseColumnConfig.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseColumnConfig.java new file mode 100755 index 000000000..bf679793e --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseColumnConfig.java @@ -0,0 +1,21 @@ +package com.alibaba.datax.plugin.reader.hbasereader; + +import java.util.Arrays; + +public class HbaseColumnConfig { + public String[] columnTypes = null; + public String[] columnFamilyAndQualifiers = null; + + public HbaseColumnConfig() { + } + + @Override + public String toString() { + if (null != columnTypes && null != columnFamilyAndQualifiers) { + return "columnTypes:" + Arrays.asList(columnTypes) + "\n" + + "columnNames:" + Arrays.toString(columnFamilyAndQualifiers); + } else { + return null; + } + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseReader.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseReader.java new file mode 100755 index 000000000..ffe775848 --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseReader.java @@ -0,0 +1,130 @@ +package com.alibaba.datax.plugin.reader.hbasereader; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.hbasereader.util.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +public class HbaseReader extends Reader { + public static class Job extends Reader.Job { + private static Logger LOG = LoggerFactory.getLogger(Job.class); + private static boolean IS_DEBUG = LOG.isDebugEnabled(); + + 
private Configuration originalConfig; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + HbaseUtil.doPretreatment(this.originalConfig); + + if (IS_DEBUG) { + LOG.debug("After init(), now originalConfig is:\n{}\n", this.originalConfig); + } + } + + @Override + public void prepare() { + } + + @Override + public List split(int adviceNumber) { + return HbaseSplitUtil.split(this.originalConfig); + } + + + @Override + public void post() { + + } + + @Override + public void destroy() { + } + + } + + public static class Task extends Reader.Task { + private Configuration taskConfig; + private static Logger LOG = LoggerFactory.getLogger(Task.class); + private HbaseAbstractTask hbaseTaskProxy; + + @Override + public void init() { + this.taskConfig = super.getPluginJobConf(); + + String mode = this.taskConfig.getString(Key.MODE); + ModeType modeType = ModeType.getByTypeName(mode); + + switch (modeType) { + case Normal: + this.hbaseTaskProxy = new NormalTask(this.taskConfig); + break; + case MultiVersionFixedColumn: + this.hbaseTaskProxy = new MultiVersionFixedColumnTask(this.taskConfig); + break; + case MultiVersionDynamicColumn: + this.hbaseTaskProxy = new MultiVersionDynamicColumnTask(this.taskConfig); + break; + default: + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 不支持此类模式:" + modeType); + } + } + + @Override + public void prepare() { + try { + this.hbaseTaskProxy.prepare(); + } catch (Exception e) { + throw DataXException.asDataXException(HbaseReaderErrorCode.PREPAR_READ_ERROR, e); + } + } + + @Override + public void startRead(RecordSender recordSender) { + Record record = recordSender.createRecord(); + boolean fetchOK; + while (true) { + try { + fetchOK = this.hbaseTaskProxy.fetchLine(record); + } catch (Exception e) { + LOG.info("Exception", e); + super.getTaskPluginCollector().collectDirtyRecord(record, e); + record = recordSender.createRecord(); + continue; + } + if (fetchOK) { + recordSender.sendToWriter(record); + record = recordSender.createRecord(); + } else { + break; + } + } + recordSender.flush(); + } + + @Override + public void post() { + super.post(); + } + + @Override + public void destroy() { + if (this.hbaseTaskProxy != null) { + try { + this.hbaseTaskProxy.close(); + } catch (Exception e) { + // + } + } + } + + + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseReaderErrorCode.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseReaderErrorCode.java new file mode 100755 index 000000000..99275e4bf --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/HbaseReaderErrorCode.java @@ -0,0 +1,37 @@ +package com.alibaba.datax.plugin.reader.hbasereader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum HbaseReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("HbaseReader-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("HbaseReader-01", "您配置的值不合法."), + PREPAR_READ_ERROR("HbaseReader-02", "准备读取 Hbase 时出错."), + SPLIT_ERROR("HbaseReader-03", "切分 Hbase 表时出错."), + INIT_TABLE_ERROR("HbaseReader-04", "初始化 Hbase 抽取表时出错."), + + ; + + private final String code; + private final String description; + + private HbaseReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String 
toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/Key.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/Key.java new file mode 100755 index 000000000..42e4b6f00 --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/Key.java @@ -0,0 +1,50 @@ +package com.alibaba.datax.plugin.reader.hbasereader; + +public final class Key { + + public final static String HBASE_CONFIG = "hbaseConfig"; + + /** + * mode 可以取 normal 或者 multiVersionFixedColumn 或者 multiVersionDynamicColumn 三个值,无默认值。 + *

+ * normal 配合 column(Map 结构的)使用 + *

+ * multiVersionFixedColumn 配合 maxVersion,tetradType, column(List 结构的)使用 + *

+ * multiVersionDynamicColumn 配合 maxVersion,tetradType, columnFamily(List 结构的)使用 + */ + public final static String MODE = "mode"; + + /** + * 配合 mode = multiVersion 时使用,指明需要读取的版本个数。无默认值 + * -1 表示去读全部版本 + * 不能为0,1 + * >1 表示最多读取对应个数的版本数(不能超过 Integer 的最大值) + */ + public final static String MAX_VERSION = "maxVersion"; + + /** + * 多版本情况下,必须配置 四元组的类型(rowkey,column,timestamp,value) + */ + public final static String TETRAD_TYPE = "tetradType"; + + /** + * 默认为 utf8 + */ + public final static String ENCODING = "encoding"; + + public final static String TABLE = "table"; + + public final static String COLUMN_FAMILY = "columnFamily"; + + public final static String COLUMN = "column"; + + public final static String START_ROWKEY = "startRowkey"; + + public final static String END_ROWKEY = "endRowkey"; + + public final static String IS_BINARY_ROWKEY = "isBinaryRowkey"; + + public final static String SCAN_CACHE = "scanCache"; + +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/ModeType.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/ModeType.java new file mode 100644 index 000000000..3348d0752 --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/ModeType.java @@ -0,0 +1,28 @@ +package com.alibaba.datax.plugin.reader.hbasereader; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.Arrays; + +public enum ModeType { + Normal("normal"), + MultiVersionFixedColumn("multiVersionFixedColumn"), + MultiVersionDynamicColumn("multiVersionDynamicColumn"),; + + private String mode; + + ModeType(String mode) { + this.mode = mode.toLowerCase(); + } + + public static ModeType getByTypeName(String modeName) { + for (ModeType modeType : values()) { + if (modeType.mode.equalsIgnoreCase(modeName)) { + return modeType; + } + } + + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, + String.format("Hbasereader 不支持该 mode 类型:%s, 目前支持的 mode 类型是:%s", modeName, Arrays.asList(values()))); + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/HbaseAbstractTask.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/HbaseAbstractTask.java new file mode 100755 index 000000000..8061f73dc --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/HbaseAbstractTask.java @@ -0,0 +1,89 @@ +package com.alibaba.datax.plugin.reader.hbasereader.util; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.hbasereader.Constant; +import com.alibaba.datax.plugin.reader.hbasereader.HTableManager; +import com.alibaba.datax.plugin.reader.hbasereader.Key; +import org.apache.hadoop.hbase.client.HTable; +import org.apache.hadoop.hbase.client.Result; +import org.apache.hadoop.hbase.client.ResultScanner; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; + +public abstract class HbaseAbstractTask { + private final static Logger LOG = LoggerFactory.getLogger(HbaseAbstractTask.class); + + private int scanCache; + private byte[] startKey = null; + private byte[] endKey = null; + + protected HTable htable; + protected String encoding; + protected Result lastResult = null; + protected Scan scan; + protected ResultScanner resultScanner; + + public HbaseAbstractTask(Configuration configuration) { + 
this.htable = HbaseUtil.initHtable(configuration); + + this.encoding = configuration.getString(Key.ENCODING); + + this.scanCache = configuration.getInt(Key.SCAN_CACHE, Constant.DEFAULT_SCAN_CACHE); + + this.startKey = HbaseUtil.convertInnerStartRowkey(configuration); + this.endKey = HbaseUtil.convertInnerEndRowkey(configuration); + } + + public abstract boolean fetchLine(Record record) throws Exception; + + public abstract void initScan(Scan scan); + + public void prepare() throws Exception { + this.scan = new Scan(); + scan.setCacheBlocks(false); + + this.scan.setStartRow(startKey); + this.scan.setStopRow(endKey); + + LOG.info("The task set startRowkey=[{}], endRowkey=[{}].", Bytes.toStringBinary(this.startKey), Bytes.toStringBinary(this.endKey)); + + initScan(this.scan); + + this.scan.setCaching(this.scanCache); + this.resultScanner = this.htable.getScanner(this.scan); + } + + + public void close() throws IOException { + if (this.resultScanner != null) { + this.resultScanner.close(); + } + HTableManager.closeHTable(this.htable); + } + + protected Result getNextHbaseRow() throws IOException { + Result result; + try { + result = resultScanner.next(); + } catch (IOException e) { + if (lastResult != null) { + scan.setStartRow(lastResult.getRow()); + } + resultScanner = this.htable.getScanner(scan); + result = resultScanner.next(); + if (lastResult != null && Bytes.equals(lastResult.getRow(), result.getRow())) { + result = resultScanner.next(); + } + } + + lastResult = result; + + // may be null + return result; + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/HbaseSplitUtil.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/HbaseSplitUtil.java new file mode 100755 index 000000000..59e8ac2ab --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/HbaseSplitUtil.java @@ -0,0 +1,149 @@ +package com.alibaba.datax.plugin.reader.hbasereader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.hbasereader.HTableManager; +import com.alibaba.datax.plugin.reader.hbasereader.HbaseReaderErrorCode; +import com.alibaba.datax.plugin.reader.hbasereader.Key; +import org.apache.hadoop.hbase.HConstants; +import org.apache.hadoop.hbase.client.HTable; +import org.apache.hadoop.hbase.util.Bytes; +import org.apache.hadoop.hbase.util.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +public final class HbaseSplitUtil { + private final static Logger LOG = LoggerFactory.getLogger(HbaseSplitUtil.class); + + + public static List split(Configuration configuration) { + byte[] startRowkeyByte = HbaseUtil.convertUserStartRowkey(configuration); + byte[] endRowkeyByte = HbaseUtil.convertUserEndRowkey(configuration); + + /* 如果用户配置了 startRowkey 和 endRowkey,需要确保:startRowkey <= endRowkey */ + if (startRowkeyByte.length != 0 && endRowkeyByte.length != 0 + && Bytes.compareTo(startRowkeyByte, endRowkeyByte) > 0) { + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 中 startRowkey 不得大于 endRowkey."); + } + + HTable htable = HbaseUtil.initHtable(configuration); + + List resultConfigurations; + + try { + Pair regionRanges = htable.getStartEndKeys(); + if (null == regionRanges) { + throw DataXException.asDataXException(HbaseReaderErrorCode.SPLIT_ERROR, "获取源头 Hbase 表的 rowkey 范围失败."); + } + + resultConfigurations 
= HbaseSplitUtil.doSplit(configuration, startRowkeyByte, endRowkeyByte, + regionRanges); + + LOG.info("HBaseReader split job into {} tasks.", resultConfigurations.size()); + + return resultConfigurations; + } catch (Exception e) { + throw DataXException.asDataXException(HbaseReaderErrorCode.SPLIT_ERROR, "切分源头 Hbase 表失败.", e); + } finally { + try { + HTableManager.closeHTable(htable); + } catch (Exception e) { + // + } + } + } + + + private static List doSplit(Configuration config, byte[] startRowkeyByte, + byte[] endRowkeyByte, Pair regionRanges) { + + List configurations = new ArrayList(); + + for (int i = 0; i < regionRanges.getFirst().length; i++) { + + byte[] regionStartKey = regionRanges.getFirst()[i]; + byte[] regionEndKey = regionRanges.getSecond()[i]; + + // 当前的region为最后一个region + // 如果最后一个region的start Key大于用户指定的userEndKey,则最后一个region,应该不包含在内 + // 注意如果用户指定userEndKey为"",则此判断应该不成立。userEndKey为""表示取得最大的region + if (Bytes.compareTo(regionEndKey, HConstants.EMPTY_BYTE_ARRAY) == 0 + && (endRowkeyByte.length != 0 && (Bytes.compareTo( + regionStartKey, endRowkeyByte) > 0))) { + continue; + } + + // 如果当前的region不是最后一个region, + // 用户配置的userStartKey大于等于region的endkey,则这个region不应该含在内 + if ((Bytes.compareTo(regionEndKey, HConstants.EMPTY_BYTE_ARRAY) != 0) + && (Bytes.compareTo(startRowkeyByte, regionEndKey) >= 0)) { + continue; + } + + // 如果用户配置的userEndKey小于等于 region的startkey,则这个region不应该含在内 + // 注意如果用户指定的userEndKey为"",则次判断应该不成立。userEndKey为""表示取得最大的region + if (endRowkeyByte.length != 0 + && (Bytes.compareTo(endRowkeyByte, regionStartKey) <= 0)) { + continue; + } + + Configuration p = config.clone(); + + String thisStartKey = getStartKey(startRowkeyByte, regionStartKey); + + String thisEndKey = getEndKey(endRowkeyByte, regionEndKey); + + p.set(Key.START_ROWKEY, thisStartKey); + p.set(Key.END_ROWKEY, thisEndKey); + + LOG.debug("startRowkey:[{}], endRowkey:[{}] .", thisStartKey, thisEndKey); + + configurations.add(p); + } + + return configurations; + } + + private static String getEndKey(byte[] endRowkeyByte, byte[] regionEndKey) { + if (endRowkeyByte == null) {// 由于之前处理过,所以传入的userStartKey不可能为null + throw new IllegalArgumentException("userEndKey should not be null!"); + } + + byte[] tempEndRowkeyByte; + + if (endRowkeyByte.length == 0) { + tempEndRowkeyByte = regionEndKey; + } else if (Bytes.compareTo(regionEndKey, HConstants.EMPTY_BYTE_ARRAY) == 0) { + // 为最后一个region + tempEndRowkeyByte = endRowkeyByte; + } else { + if (Bytes.compareTo(endRowkeyByte, regionEndKey) > 0) { + tempEndRowkeyByte = regionEndKey; + } else { + tempEndRowkeyByte = endRowkeyByte; + } + } + + return Bytes.toStringBinary(tempEndRowkeyByte); + } + + private static String getStartKey(byte[] startRowkeyByte, byte[] regionStarKey) { + if (startRowkeyByte == null) {// 由于之前处理过,所以传入的userStartKey不可能为null + throw new IllegalArgumentException( + "userStartKey should not be null!"); + } + + byte[] tempStartRowkeyByte; + + if (Bytes.compareTo(startRowkeyByte, regionStarKey) < 0) { + tempStartRowkeyByte = regionStarKey; + } else { + tempStartRowkeyByte = startRowkeyByte; + } + + return Bytes.toStringBinary(tempStartRowkeyByte); + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/HbaseUtil.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/HbaseUtil.java new file mode 100755 index 000000000..065c37278 --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/HbaseUtil.java @@ -0,0 +1,362 @@ +package 
com.alibaba.datax.plugin.reader.hbasereader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.hbasereader.*; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.TypeReference; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hbase.HConstants; +import org.apache.hadoop.hbase.client.HBaseAdmin; +import org.apache.hadoop.hbase.client.HTable; +import org.apache.hadoop.hbase.util.Bytes; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.Charset; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +public final class HbaseUtil { + private static Logger LOG = LoggerFactory.getLogger(HbaseUtil.class); + + private static final String META_SCANNER_CACHING = "100"; + + private static final int TETRAD_TYPE_COUNT = 4; + + public static void doPretreatment(Configuration originalConfig) { + originalConfig.getNecessaryValue(Key.HBASE_CONFIG, + HbaseReaderErrorCode.REQUIRED_VALUE); + + originalConfig.getNecessaryValue(Key.TABLE, HbaseReaderErrorCode.REQUIRED_VALUE); + + String mode = HbaseUtil.dealMode(originalConfig); + originalConfig.set(Key.MODE, mode); + + String encoding = originalConfig.getString(Key.ENCODING, "utf-8"); + if (!Charset.isSupported(encoding)) { + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, String.format("Hbasereader 不支持您所配置的编码:[%s]", encoding)); + } + originalConfig.set(Key.ENCODING, encoding); + + // 此处增强一个检查:isBinaryRowkey 配置不能出现在与 hbaseConfig 等配置平级地位 + Boolean isBinaryRowkey = originalConfig.getBool(Key.IS_BINARY_ROWKEY); + if (isBinaryRowkey != null) { + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, String.format("%s 不能配置在此处,它应该配置在 range 里面.", Key.IS_BINARY_ROWKEY)); + } + + // 处理 range 的配置,将 range 内的配置值提取到与 hbaseConfig 等配置项平级地位,方便后续获取值 + boolean hasConfiguredRange = false; + String startRowkey = originalConfig.getString(Constant.RANGE + "." + Key.START_ROWKEY); + + //此处判断需要谨慎:如果有 key range.startRowkey 但是没有值,得到的 startRowkey 是空字符串,而不是 null + if (startRowkey != null && startRowkey.length() != 0) { + hasConfiguredRange = true; + originalConfig.set(Key.START_ROWKEY, startRowkey); + } + + String endRowkey = originalConfig.getString(Constant.RANGE + "." + Key.END_ROWKEY); + //此处判断需要谨慎:如果有 key range.endRowkey 但是没有值,得到的 endRowkey 是空字符串,而不是 null + if (endRowkey != null && endRowkey.length() != 0) { + hasConfiguredRange = true; + originalConfig.set(Key.END_ROWKEY, endRowkey); + } + + // 如果配置了 range, 就必须要配置 isBinaryRowkey + if (hasConfiguredRange) { + isBinaryRowkey = originalConfig.getBool(Constant.RANGE + "." 
+ Key.IS_BINARY_ROWKEY); + if (isBinaryRowkey == null) { + throw DataXException.asDataXException(HbaseReaderErrorCode.REQUIRED_VALUE, "您需要在 range 内配置 isBinaryRowkey 项,用于告诉 DataX 把您填写的 rowkey 转换为内部的二进制时,采用那个 API(值为 true 时,采用Bytes.toBytesBinary(String rowKey),值为 false 时,采用Bytes.toBytes(String rowKey))"); + } + + originalConfig.set(Key.IS_BINARY_ROWKEY, isBinaryRowkey); + } + } + + /** + * 对模式以及与模式进行配对的配置进行检查 + */ + private static String dealMode(Configuration originalConfig) { + String mode = originalConfig.getString(Key.MODE); + + ModeType modeType = ModeType.getByTypeName(mode); + switch (modeType) { + case Normal: { + // normal 模式不需要配置 maxVersion,需要配置 column,并且 column 格式为 Map 风格 + String maxVersion = originalConfig.getString(Key.MAX_VERSION); + Validate.isTrue(maxVersion == null, "您配置的是 normal 模式读取 hbase 中的数据,所以不能配置无关项:maxVersion"); + + List column = originalConfig.getList(Key.COLUMN, Map.class); + + if (column == null || column.isEmpty()) { + throw DataXException.asDataXException(HbaseReaderErrorCode.REQUIRED_VALUE, "您配置的是 normal 模式读取 hbase 中的数据,所以必须配置 column,其形式为:column:[{\"name\": \"cf0:column0\",\"type\": \"string\"},{\"name\": \"cf1:column1\",\"type\": \"long\"}]"); + } + + // 通过 parse 进行 column 格式的进一步检查 + HbaseUtil.parseColumnOfNormalMode(column); + break; + } + case MultiVersionFixedColumn: { + // multiVersionFixedColumn 模式需要配置 maxVersion 和 column,并且 column 格式为 List 风格 + checkMaxVersion(originalConfig, mode); + + checkTetradType(originalConfig, mode); + + List columns = originalConfig.getList(Key.COLUMN, String.class); + if (columns == null || columns.isEmpty()) { + throw DataXException.asDataXException(HbaseReaderErrorCode.REQUIRED_VALUE, "您配置的是 multiVersionFixedColumn 模式读取 hbase 中的数据,所以必须配置 column,其形式为: column:[\"cf0:column0\",\"cf1:column1\"]"); + } + + // 检查配置的 column 格式是否包含cf:qualifier + for (String column : columns) { + if (StringUtils.isBlank(column) || column.split(":").length != 2) { + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, String.format("您配置的是 multiVersionFixedColumn 模式读取 hbase 中的数据,但是您配置的列格式[%s]不正确,每一个列元素应该配置为 列族:列名 的形式, 如 column:[\"cf0:column0\",\"cf1:column1\"]", column)); + } + } + + // 检查多版本固定列读取时,不能配置 columnFamily + List columnFamilies = originalConfig.getList(Key.COLUMN_FAMILY, String.class); + if (columnFamilies != null) { + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, "您配置的是 multiVersionFixedColumn 模式读取 hbase 中的数据,所以不能配置 columnFamily"); + } + + break; + } + case MultiVersionDynamicColumn: { + // multiVersionDynamicColumn 模式需要配置 maxVersion 和 columnFamily,并且 columnFamily 格式为 List 风格 + checkMaxVersion(originalConfig, mode); + + checkTetradType(originalConfig, mode); + + List columnFamilies = originalConfig.getList(Key.COLUMN_FAMILY, String.class); + if (columnFamilies == null || columnFamilies.isEmpty()) { + throw DataXException.asDataXException(HbaseReaderErrorCode.REQUIRED_VALUE, "您配置的是 multiVersionDynamicColumn 模式读取 hbase 中的数据,所以必须配置 columnFamily,其形式为:columnFamily:[\"cf0\",\"cf1\"]"); + } + + // 检查多版本动态列读取时,不能配置 column + List columns = originalConfig.getList(Key.COLUMN, String.class); + if (columns != null) { + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, "您配置的是 multiVersionDynamicColumn 模式读取 hbase 中的数据,所以不能配置 column"); + } + + break; + } + default: + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 不支持此类模式:" + modeType); + } + + return mode; + } + + // 检查 maxVersion 是否存在,并且值是否合法 + private static void checkMaxVersion(Configuration 
configuration, String mode) { + Integer maxVersion = configuration.getInt(Key.MAX_VERSION); + Validate.notNull(maxVersion, String.format("您配置的是 %s 模式读取 hbase 中的数据,所以必须配置:maxVersion", mode)); + + boolean isMaxVersionValid = maxVersion == -1 || maxVersion > 1; + Validate.isTrue(isMaxVersionValid, String.format("您配置的是 %s 模式读取 hbase 中的数据,但是配置的 maxVersion 值错误. maxVersion规定:-1为读取全部版本,不能配置为0或者1(因为0或者1,我们认为用户是想用 normal 模式读取数据,而非 %s 模式读取,二者差别大),大于1则表示读取最新的对应个数的版本", mode, mode)); + } + + public static org.apache.hadoop.conf.Configuration getHbaseConf(String hbaseConf) { + if (StringUtils.isBlank(hbaseConf)) { + throw DataXException.asDataXException(HbaseReaderErrorCode.REQUIRED_VALUE, "读 Hbase 时需要配置 hbaseConfig,其内容为 Hbase 连接信息,请联系 Hbase PE 获取该信息."); + } + org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration(); + + try { + Map map = JSON.parseObject(hbaseConf, + new TypeReference>() { + }); + + // / 用户配置的 key-value 对 来表示 hbaseConf + Validate.isTrue(map != null, "hbaseConfig 不能为空 Map 结构!"); + for (Map.Entry entry : map.entrySet()) { + conf.set(entry.getKey(), entry.getValue()); + } + return conf; + } catch (Exception e) { + // 用户配置的 hbase 配置文件路径 + LOG.warn("尝试把您配置的 hbaseConfig: {} \n 当成 json 解析时遇到错误:", e); + LOG.warn("现在尝试把您配置的 hbaseConfig: {} \n 当成文件路径进行解析.", hbaseConf); + conf.addResource(new Path(hbaseConf)); + + LOG.warn("您配置的 hbaseConfig 是文件路径, 是不推荐的行为:因为当您的这个任务迁移到其他机器运行时,很可能出现该路径不存在的错误. 建议您把此项配置改成标准的 Hbase 连接信息,请联系 Hbase PE 获取该信息."); + return conf; + } + } + + private static void checkTetradType(Configuration configuration, String mode) { + List userConfiguredTetradTypes = configuration.getList(Key.TETRAD_TYPE, String.class); + if (userConfiguredTetradTypes == null || userConfiguredTetradTypes.isEmpty()) { + throw DataXException.asDataXException(HbaseReaderErrorCode.REQUIRED_VALUE, String.format("您配置的是 %s 模式读取 hbase 中的数据,但是缺失了 tetradType 配置项. tetradType规定:是长度为 4 的数组,用于指定四元组读取的类型,如:tetradType:[\"bytes\",\"string\",\"long\",\"bytes\"]", mode)); + } + + if (userConfiguredTetradTypes.size() != TETRAD_TYPE_COUNT) { + throw DataXException.asDataXException(HbaseReaderErrorCode.REQUIRED_VALUE, String.format("您配置的是 %s 模式读取 hbase 中的数据,但是 tetradType 配置项元素个数错误. tetradType规定:是长度为 4 的数组,用于指定四元组读取的类型,如:tetradType:[\"bytes\",\"string\",\"long\",\"bytes\"]", mode)); + } + + + String rowkeyType = userConfiguredTetradTypes.get(0); + String columnNameType = userConfiguredTetradTypes.get(1); + String timestampType = userConfiguredTetradTypes.get(2); + String valueType = userConfiguredTetradTypes.get(3); + + ColumnType.getByTypeName(rowkeyType); + ColumnType.getByTypeName(columnNameType); + + if (!"long".equalsIgnoreCase(timestampType)) { + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, String.format("您配置的是 %s 模式读取 hbase 中的数据,但是 tetradType 配置项元素类型错误. tetradType规定:第三项描述 timestamp 类型只能为 long,而您配置的值是:[%s]", mode, timestampType)); + } + + if ("date".equalsIgnoreCase(rowkeyType) || "date".equalsIgnoreCase(columnNameType) + || "date".equalsIgnoreCase(valueType)) { + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, String.format("您配置的是 %s 模式读取 hbase 中的数据,但是 tetradType 配置项元素类型错误. 
tetradType规定:不支持 date 类型", mode)); + } + + ColumnType.getByTypeName(valueType); + } + + public static byte[] convertUserStartRowkey(Configuration configuration) { + String startRowkey = configuration.getString(Key.START_ROWKEY); + if (StringUtils.isBlank(startRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } else { + boolean isBinaryRowkey = configuration.getBool(Key.IS_BINARY_ROWKEY); + return HbaseUtil.stringToBytes(startRowkey, isBinaryRowkey); + } + } + + public static byte[] convertUserEndRowkey(Configuration configuration) { + String endRowkey = configuration.getString(Key.END_ROWKEY); + if (StringUtils.isBlank(endRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } else { + boolean isBinaryRowkey = configuration.getBool(Key.IS_BINARY_ROWKEY); + return HbaseUtil.stringToBytes(endRowkey, isBinaryRowkey); + } + } + + /** + * 注意:convertUserStartRowkey 和 convertInnerStartRowkey,前者会受到 isBinaryRowkey 的影响,只用于第一次对用户配置的 String 类型的 rowkey 转为二进制时使用。而后者约定:切分时得到的二进制的 rowkey 回填到配置中时采用 + */ + public static byte[] convertInnerStartRowkey(Configuration configuration) { + String startRowkey = configuration.getString(Key.START_ROWKEY); + if (StringUtils.isBlank(startRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } + + return Bytes.toBytesBinary(startRowkey); + } + + public static byte[] convertInnerEndRowkey(Configuration configuration) { + String endRowkey = configuration.getString(Key.END_ROWKEY); + if (StringUtils.isBlank(endRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } + + return Bytes.toBytesBinary(endRowkey); + } + + /** + * 每次都获取一个新的HTable 注意:HTable 本身是线程不安全的 + */ + public static HTable initHtable(com.alibaba.datax.common.util.Configuration configuration) { + String hbaseConnConf = configuration.getString(Key.HBASE_CONFIG); + String tableName = configuration.getString(Key.TABLE); + HBaseAdmin admin = null; + try { + org.apache.hadoop.conf.Configuration conf = HbaseUtil.getHbaseConf(hbaseConnConf); + conf.set("hbase.meta.scanner.caching", META_SCANNER_CACHING); + + HTable htable = HTableManager.createHTable(conf, tableName); + admin = HTableManager.createHBaseAdmin(conf); + check(admin, htable); + + return htable; + } catch (Exception e) { + throw DataXException.asDataXException(HbaseReaderErrorCode.INIT_TABLE_ERROR, e); + } finally { + if (admin != null) { + try { + admin.close(); + } catch (IOException e) { + // ignore it + } + } + } + } + + + private static void check(HBaseAdmin admin, HTable htable) throws DataXException, IOException { + if (!admin.isMasterRunning()) { + throw new IllegalStateException("HBase master 没有运行, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if (!admin.tableExists(htable.getTableName())) { + throw new IllegalStateException("HBase源头表" + Bytes.toString(htable.getTableName()) + + "不存在, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if (!admin.isTableAvailable(htable.getTableName()) || !admin.isTableEnabled(htable.getTableName())) { + throw new IllegalStateException("HBase源头表" + Bytes.toString(htable.getTableName()) + + " 不可用, 请检查您的配置 或者 联系 Hbase 管理员."); + } + } + + private static byte[] stringToBytes(String rowkey, boolean isBinaryRowkey) { + if (isBinaryRowkey) { + return Bytes.toBytesBinary(rowkey); + } else { + return Bytes.toBytes(rowkey); + } + } + + public static boolean isRowkeyColumn(String columnName) { + return Constant.ROWKEY_FLAG.equalsIgnoreCase(columnName); + } + + /** + * 用于解析 Normal 模式下的列配置 + */ + public static List parseColumnOfNormalMode(List column) { + List hbaseColumnCells = new ArrayList(); + + HbaseColumnCell oneColumnCell; + + for (Map aColumn : column) { + 
ColumnType type = ColumnType.getByTypeName(aColumn.get("type")); + String columnName = aColumn.get("name"); + String columnValue = aColumn.get("value"); + String dateformat = aColumn.get("format"); + + if (type == ColumnType.DATE) { + Validate.notNull(dateformat, "Hbasereader 在 normal 方式读取时,其列配置中,如果类型为时间,则必须指定时间格式. 形如:yyyy-MM-dd HH:mm:ss,特别注意不能随意小写时间格式,那样可能导致时间转换错误!"); + + Validate.isTrue(StringUtils.isNotBlank(columnName) || StringUtils.isNotBlank(columnValue), "Hbasereader 在 normal 方式读取时则要么是 type + name + format 的组合,要么是type + value + format 的组合. 而您的配置非这两种组合,请检查并修改."); + + oneColumnCell = new HbaseColumnCell + .Builder(type) + .columnName(columnName) + .columnValue(columnValue) + .dateformat(dateformat) + .build(); + } else { + Validate.isTrue(dateformat == null, "Hbasereader 在 normal 方式读取时, 其列配置中,如果类型不为时间,则不需要指定时间格式."); + + Validate.isTrue(StringUtils.isNotBlank(columnName) || StringUtils.isNotBlank(columnValue), "Hbasereader 在 normal 方式读取时,其列配置中,如果类型不是时间,则要么是 type + name 的组合,要么是type + value 的组合. 而您的配置非这两种组合,请检查并修改."); + + oneColumnCell = new HbaseColumnCell + .Builder(type) + .columnName(columnName) + .columnValue(columnValue) + .build(); + } + + hbaseColumnCells.add(oneColumnCell); + } + + return hbaseColumnCells; + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/MultiVersionDynamicColumnTask.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/MultiVersionDynamicColumnTask.java new file mode 100644 index 000000000..070924d55 --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/MultiVersionDynamicColumnTask.java @@ -0,0 +1,27 @@ +package com.alibaba.datax.plugin.reader.hbasereader.util; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.hbasereader.Key; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.util.List; + +public class MultiVersionDynamicColumnTask extends MultiVersionTask { + private List columnFamilies = null; + + public MultiVersionDynamicColumnTask(Configuration configuration){ + super(configuration); + + this.columnFamilies = configuration.getList(Key.COLUMN_FAMILY, String.class); + } + + @Override + public void initScan(Scan scan) { + for (String columnFamily : columnFamilies) { + scan.addFamily(Bytes.toBytes(columnFamily.trim())); + } + + super.setMaxVersions(scan); + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/MultiVersionFixedColumnTask.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/MultiVersionFixedColumnTask.java new file mode 100644 index 000000000..5826e3acd --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/MultiVersionFixedColumnTask.java @@ -0,0 +1,27 @@ +package com.alibaba.datax.plugin.reader.hbasereader.util; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.hbasereader.Key; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.util.List; + +public class MultiVersionFixedColumnTask extends MultiVersionTask { + private List column = null; + + public MultiVersionFixedColumnTask(Configuration configuration) { + super(configuration); + + this.column = configuration.getList(Key.COLUMN, String.class); + } + + @Override + public void initScan(Scan scan) { + for (String aColumn : this.column) { + String[] cfAndQualifier = 
aColumn.split(":"); + scan.addColumn(Bytes.toBytes(cfAndQualifier[0].trim()), Bytes.toBytes(cfAndQualifier[1].trim())); + } + super.setMaxVersions(scan); + } +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/MultiVersionTask.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/MultiVersionTask.java new file mode 100755 index 000000000..4ff1c7d8f --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/MultiVersionTask.java @@ -0,0 +1,141 @@ +package com.alibaba.datax.plugin.reader.hbasereader.util; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.hbasereader.ColumnType; +import com.alibaba.datax.plugin.reader.hbasereader.HbaseReaderErrorCode; +import com.alibaba.datax.plugin.reader.hbasereader.Key; +import org.apache.hadoop.hbase.KeyValue; +import org.apache.hadoop.hbase.client.Result; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.io.UnsupportedEncodingException; +import java.util.ArrayList; +import java.util.List; + +public abstract class MultiVersionTask extends HbaseAbstractTask { + private static byte[] COLON_BYTE; + + private int maxVersion; + private List kvList = new ArrayList(); + private int currentReadPosition = 0; + + // 四元组的类型 + private ColumnType rowkeyReadoutType = null; + private ColumnType columnReadoutType = null; + private ColumnType timestampReadoutType = null; + private ColumnType valueReadoutType = null; + + public MultiVersionTask(Configuration configuration) { + super(configuration); + + this.maxVersion = configuration.getInt(Key.MAX_VERSION); + List userConfiguredTetradTypes = configuration.getList(Key.TETRAD_TYPE, String.class); + + this.rowkeyReadoutType = ColumnType.getByTypeName(userConfiguredTetradTypes.get(0)); + this.columnReadoutType = ColumnType.getByTypeName(userConfiguredTetradTypes.get(1)); + this.timestampReadoutType = ColumnType.getByTypeName(userConfiguredTetradTypes.get(2)); + this.valueReadoutType = ColumnType.getByTypeName(userConfiguredTetradTypes.get(3)); + + try { + MultiVersionTask.COLON_BYTE = ":".getBytes("utf8"); + } catch (UnsupportedEncodingException e) { + throw DataXException.asDataXException(HbaseReaderErrorCode.PREPAR_READ_ERROR, "系统内部获取 列族与列名冒号分隔符的二进制时失败.", e); + } + } + + private void convertKVToLine(KeyValue keyValue, Record record) throws Exception { + byte[] rawRowkey = keyValue.getRow(); + + long timestamp = keyValue.getTimestamp(); + + byte[] cfAndQualifierName = Bytes.add(keyValue.getFamily(), MultiVersionTask.COLON_BYTE, keyValue.getQualifier()); + + record.addColumn(convertBytesToAssignType(this.rowkeyReadoutType, rawRowkey)); + + record.addColumn(convertBytesToAssignType(this.columnReadoutType, cfAndQualifierName)); + + // 直接忽略了用户配置的 timestamp 的类型 + record.addColumn(new LongColumn(timestamp)); + + record.addColumn(convertBytesToAssignType(this.valueReadoutType, keyValue.getValue())); + } + + private Column convertBytesToAssignType(ColumnType columnType, byte[] byteArray) throws UnsupportedEncodingException { + Column column; + switch (columnType) { + case BOOLEAN: + column = new BoolColumn(byteArray == null ? null : Bytes.toBoolean(byteArray)); + break; + case SHORT: + column = new LongColumn(byteArray == null ? 
null : String.valueOf(Bytes.toShort(byteArray))); + break; + case INT: + column = new LongColumn(byteArray == null ? null : Bytes.toInt(byteArray)); + break; + case LONG: + column = new LongColumn(byteArray == null ? null : Bytes.toLong(byteArray)); + break; + case BYTES: + column = new BytesColumn(byteArray); + break; + case FLOAT: + column = new DoubleColumn(byteArray == null ? null : Bytes.toFloat(byteArray)); + break; + case DOUBLE: + column = new DoubleColumn(byteArray == null ? null : Bytes.toDouble(byteArray)); + break; + case STRING: + column = new StringColumn(byteArray == null ? null : new String(byteArray, super.encoding)); + break; + case BINARY_STRING: + column = new StringColumn(byteArray == null ? null : Bytes.toStringBinary(byteArray)); + break; + + default: + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 不支持您配置的列类型:" + columnType); + } + + return column; + } + + @Override + public boolean fetchLine(Record record) throws Exception { + Result result; + if (this.kvList.size() == this.currentReadPosition) { + result = super.getNextHbaseRow(); + if (result == null) { + return false; + } + + this.kvList = result.list(); + if (this.kvList == null) { + return false; + } + + this.currentReadPosition = 0; + } + + try { + KeyValue keyValue = this.kvList.get(this.currentReadPosition); + convertKVToLine(keyValue, record); + } catch (Exception e) { + throw e; + } finally { + this.currentReadPosition++; + } + + return true; + } + + public void setMaxVersions(Scan scan) { + if (this.maxVersion == -1 || this.maxVersion == Integer.MAX_VALUE) { + scan.setMaxVersions(); + } else { + scan.setMaxVersions(this.maxVersion); + } + } + +} diff --git a/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/NormalTask.java b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/NormalTask.java new file mode 100755 index 000000000..d46218cc4 --- /dev/null +++ b/hbasereader/src/main/java/com/alibaba/datax/plugin/reader/hbasereader/util/NormalTask.java @@ -0,0 +1,160 @@ +package com.alibaba.datax.plugin.reader.hbasereader.util; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.hbasereader.ColumnType; +import com.alibaba.datax.plugin.reader.hbasereader.HbaseColumnCell; +import com.alibaba.datax.plugin.reader.hbasereader.HbaseReaderErrorCode; +import com.alibaba.datax.plugin.reader.hbasereader.Key; +import org.apache.commons.lang3.time.DateUtils; +import org.apache.hadoop.hbase.client.Result; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.util.List; +import java.util.Map; + +public class NormalTask extends HbaseAbstractTask { + private List column; + private List hbaseColumnCells; + + public NormalTask(Configuration configuration) { + super(configuration); + + this.column = configuration.getList(Key.COLUMN, Map.class); + this.hbaseColumnCells = HbaseUtil.parseColumnOfNormalMode(this.column); + } + + @Override + public boolean fetchLine(Record record) throws Exception { + Result result = super.getNextHbaseRow(); + + if (null == result) { + return false; + } + super.lastResult = result; + + try { + byte[] hbaseColumnValue; + String columnName; + ColumnType columnType; + + byte[] cf; + byte[] qualifier; + + for (HbaseColumnCell cell : this.hbaseColumnCells) { + columnType = cell.getColumnType(); + if (cell.isConstant()) { + // 
对常量字段的处理 + fillRecordWithConstantValue(record, cell); + } else { + // 根据列名称获取值 + columnName = cell.getColumnName(); + + if (HbaseUtil.isRowkeyColumn(columnName)) { + hbaseColumnValue = result.getRow(); + } else { + cf = cell.getCf(); + qualifier = cell.getQualifier(); + hbaseColumnValue = result.getValue(cf, qualifier); + } + + doFillRecord(hbaseColumnValue, columnType, super.encoding, cell.getDateformat(), record); + } + } + } catch (Exception e) { + // 注意,这里catch的异常,期望是byte数组转换失败的情况。而实际上,string的byte数组,转成整数类型是不容易报错的。但是转成double类型容易报错。 + + record.setColumn(0, new StringColumn(Bytes.toStringBinary(result.getRow()))); + + throw e; + } + + return true; + } + + @Override + public void initScan(Scan scan) { + boolean isConstant; + boolean isRowkeyColumn; + for (HbaseColumnCell cell : this.hbaseColumnCells) { + isConstant = cell.isConstant(); + isRowkeyColumn = HbaseUtil.isRowkeyColumn(cell.getColumnName()); + + if (!isConstant && !isRowkeyColumn) { + this.scan.addColumn(cell.getCf(), cell.getQualifier()); + } + } + } + + + protected void doFillRecord(byte[] byteArray, ColumnType columnType, String encoding, String dateformat, Record record) throws Exception { + switch (columnType) { + case BOOLEAN: + record.addColumn(new BoolColumn(byteArray == null ? null : Bytes.toBoolean(byteArray))); + break; + case SHORT: + record.addColumn(new LongColumn(byteArray == null ? null : String.valueOf(Bytes.toShort(byteArray)))); + break; + case INT: + record.addColumn(new LongColumn(byteArray == null ? null : Bytes.toInt(byteArray))); + break; + case LONG: + record.addColumn(new LongColumn(byteArray == null ? null : Bytes.toLong(byteArray))); + break; + case BYTES: + record.addColumn(new BytesColumn(byteArray == null ? null : byteArray)); + break; + case FLOAT: + record.addColumn(new DoubleColumn(byteArray == null ? null : Bytes.toFloat(byteArray))); + break; + case DOUBLE: + record.addColumn(new DoubleColumn(byteArray == null ? null : Bytes.toDouble(byteArray))); + break; + case STRING: + record.addColumn(new StringColumn(byteArray == null ? null : new String(byteArray, encoding))); + break; + case BINARY_STRING: + record.addColumn(new StringColumn(byteArray == null ? null : Bytes.toStringBinary(byteArray))); + break; + case DATE: + String dateValue = Bytes.toStringBinary(byteArray); + record.addColumn(byteArray == null ? 
null : new DateColumn(DateUtils.parseDate(dateValue, new String[]{dateformat}))); + break; + default: + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 不支持您配置的列类型:" + columnType); + } + } + + // 注意:常量列,不支持 binaryString 类型 + private void fillRecordWithConstantValue(Record record, HbaseColumnCell cell) throws Exception { + String constantValue = cell.getColumnValue(); + ColumnType columnType = cell.getColumnType(); + switch (columnType) { + case BOOLEAN: + record.addColumn(new BoolColumn(constantValue)); + break; + case SHORT: + case INT: + case LONG: + record.addColumn(new LongColumn(constantValue)); + break; + case BYTES: + record.addColumn(new BytesColumn(constantValue.getBytes("utf-8"))); + break; + case FLOAT: + case DOUBLE: + record.addColumn(new DoubleColumn(constantValue)); + break; + case STRING: + record.addColumn(new StringColumn(constantValue)); + break; + case DATE: + record.addColumn(new DateColumn(DateUtils.parseDate(constantValue, new String[]{cell.getDateformat()}))); + break; + default: + throw DataXException.asDataXException(HbaseReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 常量列不支持您配置的列类型:" + columnType); + } + } +} diff --git a/hbasereader/src/main/resources/plugin.json b/hbasereader/src/main/resources/plugin.json new file mode 100755 index 000000000..20a101a0a --- /dev/null +++ b/hbasereader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "hbasereader", + "class": "com.alibaba.datax.plugin.reader.hbasereader.HbaseReader", + "description": "useScene: prod. mechanism: Scan to read data.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/hbasereader/src/main/resources/plugin_job_template.json b/hbasereader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..1a67cbd0d --- /dev/null +++ b/hbasereader/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "hbasereader", + "parameter": { + "hbaseConfig": "", + "table": "", + "encoding": "", + "mode": "", + "column": [], + "range": { + "startRowkey": "", + "endRowkey": "" + }, + "isBinaryRowkey": true + } +} \ No newline at end of file diff --git a/hdfsreader/doc/hdfsreader.md b/hdfsreader/doc/hdfsreader.md new file mode 100644 index 000000000..4bcc366b2 --- /dev/null +++ b/hdfsreader/doc/hdfsreader.md @@ -0,0 +1,257 @@ +# DataX HdfsReader 插件文档 + + +------------ + +## 1 快速介绍 + +HdfsReader提供了读取分布式文件系统数据存储的能力。在底层实现上,HdfsReader获取分布式文件系统上文件的数据,并转换为DataX传输协议传递给Writer。 + +**目前HdfsReader仅支持textfile和orcfile两种格式的文件,且文件内容存放的必须是一张逻辑意义上的二维表。** + +**HdfsReader需要Jdk1.7及以上版本的支持。** + + +## 2 功能与限制 + +HdfsReader实现了从Hadoop分布式文件系统Hdfs中读取文件数据并转为DataX协议的功能。textfile是Hive建表时默认使用的存储格式,数据不做压缩,本质上textfile就是以文本的形式将数据存放在hdfs中,对于DataX而言,HdfsReader实现上类比TxtFileReader,有诸多相似之处。orcfile,它的全名是Optimized Row Columnar file,是对RCFile做了优化。据官方文档介绍,这种文件格式可以提供一种高效的方法来存储Hive数据。HdfsReader利用Hive提供的OrcSerde类,读取解析orcfile文件的数据。目前HdfsReader支持的功能如下: + +1. 支持textfile和orcfile两种格式的文件,且要求文件内容存放的是一张逻辑意义上的二维表。 + +2. 支持多种类型数据读取(使用String表示),支持列裁剪,支持列常量 + +3. 支持递归读取、支持正则表达式("*"和"?")。 + +4. 支持orcfile数据压缩,目前支持SNAPPY,ZLIB两种压缩方式。 + +5. 多个File可以支持并发读取。 + +我们暂时不能做到: + +1. 
单个File支持多线程并发读取,这里涉及到单个File内部切分算法。二期考虑支持。 + + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 3 + } + }, + "content": [ + { + "reader": { + "name": "hdfsreader", + "parameter": { + "path": "/user/hive/warehouse/mytable01/*", + "defaultFS": "hdfs://10.101.169.107:9000", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "type": "string", + "value": "hello" + }, + { + "index": 2, + "type": "double" + } + ], + "fileType": "orc", + "encoding": "UTF-8", + "fieldDelimiter": "," + } + + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": true + } + } + } + ] + } +} +``` + +### 3.2 参数说明(各个配置项值前后不允许有空格) + +* **path** + + * 描述:要读取的文件路径,如果要读取多个文件,可以使用正则表达式"*",注意这里可以支持填写多个路径。。
  当指定单个Hdfs文件时,HdfsReader暂时只能使用单线程进行数据抽取。二期考虑在非压缩文件情况下,针对单个File支持多线程并发读取。

  当指定多个Hdfs文件时,HdfsReader支持使用多线程进行数据抽取,线程并发数通过通道数指定。

  当指定通配符时,HdfsReader会尝试遍历出多个文件信息。例如:指定/*代表读取/目录下所有的文件,指定/bazhen/\*代表读取bazhen目录下所有的文件。HdfsReader目前只支持"*"和"?"作为文件通配符。

  **特别需要注意的是,DataX会将一个作业下同步的所有文件视作同一张数据表,用户必须自己保证所有的File能够适配同一套schema信息,并且保证DataX拥有读取权限。**

  * 必选:是
+ + * 默认值:无
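
  作为补充,下面是一个示意性的 path 配置片段(其中的目录均为假设的示例路径),展示同时配置多个路径与通配符的写法:

  ```json
  "path": [
      "/user/hive/warehouse/mytable01/*",
      "/user/hive/warehouse/mytable02/part-*"
  ]
  ```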
+ +* **defaultFS** + + * 描述:Hadoop hdfs文件系统namenode节点地址。
+ + + **特别需要注意的是,目前HdfsReader不支持Kerberos等认证,所以用户需要保证DATAX有权限访问该节点** + + + * 必选:是
+ + * 默认值:无
+ +* **fileType** + + * 描述:文件的类型,目前只支持用户配置为"text"或"orc"。
+ + text表示textfile文件格式 + + orc表示orcfile文件格式 + + **特别需要注意的是,HdfsReader能够自动识别文件是orcfile、textfile或者还是其它类型的文件,但该项是必填项,HdfsReader则会只读取用户配置的类型的文件,忽略路径下其他格式的文件** + + **另外需要注意的是,由于textfile和orcfile是两种完全不同的文件格式,所以HdfsReader对这两种文件的解析方式也存在差异,这种差异导致hive支持的复杂复合类型(比如map,array,struct,union)在转换为DataX支持的String类型时,转换的结果格式略有差异,比如以map类型为例:** + + orcfile map类型经hdfsreader解析转换成datax支持的string类型后,结果为"{job=80, team=60, person=70}" + + textfile map类型经hdfsreader解析转换成datax支持的string类型后,结果为"job:80,team:60,person:70" + + 从上面的转换结果可以看出,数据本身没有变化,但是表示的格式略有差异,所以如果用户配置的文件路径中要同步的字段在Hive中是复合类型的话,建议配置统一的文件格式。 + + **如果需要统一复合类型解析出来的格式,我们建议用户在hive客户端将textfile格式的表导成orcfile格式的表** + + * 必选:是
+ + * 默认值:无
+ + +* **column** + + * 描述:读取字段列表,type指定源数据的类型,index指定当前列来自于文本第几列(以0开始),value指定当前类型为常量,不从源头文件读取数据,而是根据value值自动生成对应的列。
  默认情况下,用户可以将全部字段按照String类型读取,配置如下:

  ```json
  "column": ["*"]
  ```

  用户也可以指定Column字段信息,配置如下:

  ```json
  {
    "type": "long",
    "index": 0    // 从HDFS文件文本第一列获取long类型字段
  },
  {
    "type": "string",
    "value": "alibaba"  // HdfsReader内部生成值为alibaba的字符串常量字段作为当前字段
  }
  ```

  对于用户指定的Column信息,type必须填写,index/value必须选择其一。

  * 必选:是
+ + * 默认值:全部按照string类型读取
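
  结合上面的说明,下面给出一个示意性的完整column配置(字段下标与日期格式均为假设值);对于date类型的列,可以额外配置format指定日期解析格式:

  ```json
  "column": [
      { "index": 0, "type": "long" },
      { "index": 1, "type": "string" },
      { "index": 2, "type": "date", "format": "yyyy-MM-dd" },
      { "type": "string", "value": "hello" }
  ]
  ```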
+ +* **fieldDelimiter** + + * 描述:读取的字段分隔符
  **另外需要注意的是,HdfsReader在读取textfile数据时需要指定字段分隔符,如果不指定则默认为',';HdfsReader在读取orcfile时,用户无需指定字段分隔符。**

  * 必选:否
+ + * 默认值:,
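
  例如,读取以制表符分隔的textfile时,可以按如下方式配置(仅为示意):

  ```json
  "fieldDelimiter": "\t"
  ```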
+ + +* **encoding** + + * 描述:读取文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+ + +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如如果用户配置: nullFormat:"\\N",那么如果源头数据是"\N",DataX视作null字段。 + + * 必选:否
+ + * 默认值:无
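
  对应上面的例子,一个示意性的配置片段如下,此时源头数据中的"\N"会被DataX视作null:

  ```json
  "nullFormat": "\\N"
  ```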
+ + +### 3.3 类型转换 + +由于textfile和orcfile文件表的元数据信息由Hive维护并存放在Hive自己维护的数据库(如mysql)中,目前HdfsReader不支持对Hive元数 + +据数据库进行访问查询,因此用户在进行类型转换的时候,必须指定数据类型,如果用户配置的column为"*",则所有column默认转换为 + +string类型。HdfsReader提供了类型转换的建议表如下: + +| DataX 内部类型| Hive表 数据类型 | +| -------- | ----- | +| Long |TINYINT,SMALLINT,INT,BIGINT| +| Double |FLOAT,DOUBLE| +| String |String,CHAR,VARCHAR,STRUCT,MAP,ARRAY,UNION,BINARY| +| Boolean |BOOLEAN| +| Date |Date,TIMESTAMP| + +其中: + +* Long是指Hdfs文件文本中使用整形的字符串表示形式,例如"123456789"。 +* Double是指Hdfs文件文本中使用Double的字符串表示形式,例如"3.1415"。 +* Boolean是指Hdfs文件文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* Date是指Hdfs文件文本中使用Date的字符串表示形式,例如"2014-12-31"。 + +特别提醒: + +* Hive支持的数据类型TIMESTAMP可以精确到纳秒级别,所以textfile、orcfile中TIMESTAMP存放的数据类似于"2015-08-21 22:40:47.397898389",如果转换的类型配置为DataX的Date,转换之后会导致纳秒部分丢失,所以如果需要保留纳秒部分的数据,请配置转换类型为DataX的String类型。 + + +### 3.4 按分区读取 + +Hive在建表的时候,可以指定分区partition,例如创建分区partition(day="20150820",hour="09"),对应的hdfs文件系统中,相应的表的目录下则会多出/20150820和/09两个目录,且/20150820是/09的父目录。了解了分区都会列成相应的目录结构,在按照某个分区读取某个表所有数据时,则只需配置好json中path的值即可。 + +比如需要读取表名叫mytable01下分区day为20150820这一天的所有数据,则配置如下: + +```json +"path": "/user/hive/warehouse/mytable01/20150820/*" +``` + + +## 4 性能报告 + + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + diff --git a/hdfsreader/hdfsreader.iml b/hdfsreader/hdfsreader.iml new file mode 100644 index 000000000..8c7164a5b --- /dev/null +++ b/hdfsreader/hdfsreader.iml @@ -0,0 +1,170 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/hdfsreader/pom.xml b/hdfsreader/pom.xml new file mode 100644 index 000000000..59156d99e --- /dev/null +++ b/hdfsreader/pom.xml @@ -0,0 +1,123 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + hdfsreader + com.alibaba.datax + 0.0.1-SNAPSHOT + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + org.apache.hadoop + hadoop-hdfs + 2.6.0 + + + org.apache.hadoop + hadoop-common + 2.6.0 + + + org.apache.hadoop + hadoop-yarn-common + 2.6.0 + + + org.apache.hadoop + hadoop-mapreduce-client-core + 2.6.0 + + + + org.apache.hive + hive-exec + 1.2.0 + + + org.apache.hive + hive-serde + 1.2.0 + + + org.apache.hive + hive-service + 1.2.0 + + + org.apache.hive + hive-common + 1.2.0 + + + org.apache.hive.hcatalog + hive-hcatalog-core + 1.2.0 + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/hdfsreader/src/main/assembly/package.xml b/hdfsreader/src/main/assembly/package.xml new file mode 100644 index 000000000..3f1393b76 --- /dev/null +++ b/hdfsreader/src/main/assembly/package.xml @@ -0,0 +1,49 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/hdfsreader + + + target/ + + hdfsreader-0.0.1-SNAPSHOT.jar + + plugin/reader/hdfsreader + + + + + + + + + + + + + + + + + + + + false + plugin/reader/hdfsreader/libs + runtime + + + diff --git 
a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Constant.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Constant.java new file mode 100644 index 000000000..5d1887b6a --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Constant.java @@ -0,0 +1,10 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +/** + * Created by mingya.wmy on 2015/8/14. + */ +public class Constant { + public static final String SOURCE_FILES = "sourceFiles"; + public static final String TEXT = "TEXT"; + public static final String ORC = "ORC"; +} diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/DFSUtil.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/DFSUtil.java new file mode 100644 index 000000000..ee84dd8c4 --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/DFSUtil.java @@ -0,0 +1,523 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.nio.ByteBuffer; +import java.text.SimpleDateFormat; +import java.util.*; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderErrorCode; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.fs.*; +import org.apache.hadoop.hive.ql.io.orc.*; +import org.apache.hadoop.hive.serde2.objectinspector.StructField; +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.compress.CompressionCodec; +import org.apache.hadoop.io.compress.CompressionCodecFactory; +import org.apache.hadoop.io.compress.CompressionInputStream; +import org.apache.hadoop.mapred.*; +import org.apache.hadoop.mapred.RecordReader; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Created by mingya.wmy on 2015/8/12. 
+ */ +public class DFSUtil { + private static final Logger LOG = LoggerFactory.getLogger(HdfsReader.Job.class); + + private org.apache.hadoop.conf.Configuration hadoopConf = null; + + private static final int DIRECTORY_SIZE_GUESS = 16 * 1024; + + private String specifiedFileType = null; + + public DFSUtil(String defaultFS){ + hadoopConf = new org.apache.hadoop.conf.Configuration(); + hadoopConf.set("fs.defaultFS", defaultFS); + } + + + /** + * @Title: getAllFiles + * @Description: 获取指定路径列表下符合条件的所有文件的绝对路径 + * @param @param srcPaths 路径列表 + * @param @return + * @return HashSet + * @throws + */ + public HashSet getAllFiles(List srcPaths, String specifiedFileType){ + + this.specifiedFileType = specifiedFileType; + + if (!srcPaths.isEmpty()) { + for (String eachPath : srcPaths) { + getHDFSAllFiles(eachPath); + } + } + return sourceHDFSAllFilesList; + } + + private HashSet sourceHDFSAllFilesList = new HashSet(); + + public HashSet getHDFSAllFiles(String hdfsPath){ + + try { + FileSystem hdfs = FileSystem.get(hadoopConf); + //判断hdfsPath是否包含正则符号 + if(hdfsPath.contains("*") || hdfsPath.contains("?")){ + Path path = new Path(hdfsPath); + FileStatus stats[] = hdfs.globStatus(path); + for(FileStatus f : stats){ + if(f.isFile()){ + + addSourceFileByType(f.getPath().toString()); + } + else if(f.isDirectory()){ + getHDFSALLFiles_NO_Regex(f.getPath().toString(), hdfs); + } + } + } + else{ + getHDFSALLFiles_NO_Regex(hdfsPath, hdfs); + } + + return sourceHDFSAllFilesList; + + }catch (IOException e){ + String message = String.format("无法读取路径[%s]下的所有文件,请确认您的配置项path是否正确," + + "是否有读写权限,网络是否已断开!", hdfsPath); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.PATH_CONFIG_ERROR, message); + } + } + + private HashSet getHDFSALLFiles_NO_Regex(String path,FileSystem hdfs) throws IOException{ + + // 获取要读取的文件的根目录 + Path listFiles = new Path(path); + + // If the network disconnected, this method will retry 45 times + // each time the retry interval for 20 seconds + // 获取要读取的文件的根目录的所有二级子文件目录 + FileStatus stats[] = hdfs.listStatus(listFiles); + + for (FileStatus f : stats) { + // 判断是不是目录,如果是目录,递归调用 + if (f.isDirectory()) { + getHDFSALLFiles_NO_Regex(f.getPath().toString(),hdfs); + } + else if(f.isFile()){ + + addSourceFileByType(f.getPath().toString()); + } + else{ + String message = String.format("该路径[%s]文件类型既不是目录也不是文件,插件自动忽略。" + , f.getPath().toString()); + LOG.info(message); + } + } + return sourceHDFSAllFilesList; + } + + // 根据用户指定的文件类型,将指定的文件类型的路径加入sourceHDFSAllFilesList + private void addSourceFileByType(String filePath){ + HdfsFileType type = checkHdfsFileType(filePath); + + if(type.toString().contains(specifiedFileType.toUpperCase())){ + sourceHDFSAllFilesList.add(filePath); + } + else{ + String message = String.format("文件[%s]的类型与用户配置的fileType类型不一致," + + "请确认您配置的目录下面所有文件的类型均为[%s]" + , filePath, this.specifiedFileType); + LOG.error(message); + throw DataXException.asDataXException( + HdfsReaderErrorCode.FILE_TYPE_UNSUPPORT, message); + } + } + + + public InputStream getInputStream(String filepath){ + InputStream inputStream = null; + Path path = new Path(filepath); + try{ + FileSystem fs = FileSystem.get(hadoopConf); + inputStream = fs.open(path); + return inputStream; + }catch (IOException e) { + e.printStackTrace(); + } + return null; + } + + public BufferedReader getBufferedReader(String filepath, HdfsFileType fileType, String encoding){ + try { + FileSystem fs = FileSystem.get(hadoopConf); + Path path = new Path(filepath); + FSDataInputStream in = null; + + CompressionInputStream 
cin = null; + BufferedReader br = null; + + if (fileType.equals(HdfsFileType.COMPRESSED_TEXT)) { + CompressionCodecFactory factory = new CompressionCodecFactory(hadoopConf); + CompressionCodec codec = factory.getCodec(path); + if (codec == null) { + String message = String.format( + "Can't find any suitable CompressionCodec to this file:%value", + path.toString()); + throw DataXException.asDataXException(HdfsReaderErrorCode.CONFIG_INVALID_EXCEPTION, message); + } + //If the network disconnected, this method will retry 45 times + //each time the retry interval for 20 seconds + in = fs.open(path); + cin = codec.createInputStream(in); + br = new BufferedReader(new InputStreamReader(cin, encoding)); + } else { + //If the network disconnected, this method will retry 45 times + // each time the retry interval for 20 seconds + in = fs.open(path); + br = new BufferedReader(new InputStreamReader(in, encoding)); + } + return br; + }catch (Exception e){ + e.printStackTrace(); + } + return null; + } + + public void orcFileStartRead(String sourceOrcFilePath, Configuration readerSliceConfig, + RecordSender recordSender, TaskPluginCollector taskPluginCollector){ + + List columnConfigs = readerSliceConfig.getListConfiguration(Key.COLUMN); + String nullFormat = readerSliceConfig.getString(Key.NULL_FORMAT); + String allColumns = ""; + String allColumnTypes = ""; + boolean isReadAllColumns = false; + int columnIndexMax = -1; + // 判断是否读取所有列 + if (null == columnConfigs || columnConfigs.size() == 0) { + int allColumnsCount = getAllColumnsCount(sourceOrcFilePath); + columnIndexMax = allColumnsCount-1; + isReadAllColumns = true; + } + else { + columnIndexMax = getMaxIndex(columnConfigs); + } + for(int i=0; i<=columnIndexMax; i++){ + allColumns += "col"; + allColumnTypes += "string"; + if(i!=columnIndexMax){ + allColumns += ","; + allColumnTypes += ":"; + } + } + if(columnIndexMax>=0) { + JobConf conf = new JobConf(hadoopConf); + Path orcFilePath = new Path(sourceOrcFilePath); + Properties p = new Properties(); + p.setProperty("columns", allColumns); + p.setProperty("columns.types", allColumnTypes); + try { + OrcSerde serde = new OrcSerde(); + serde.initialize(conf, p); + StructObjectInspector inspector = (StructObjectInspector) serde.getObjectInspector(); + InputFormat in = new OrcInputFormat(); + FileInputFormat.setInputPaths(conf, orcFilePath.toString()); + + //If the network disconnected, will retry 45 times, each time the retry interval for 20 seconds + //Each file as a split + InputSplit[] splits = in.getSplits(conf, 1); + + RecordReader reader = in.getRecordReader(splits[0], conf, Reporter.NULL); + Object key = reader.createKey(); + Object value = reader.createValue(); + // 获取列信息 + List fields = inspector.getAllStructFieldRefs(); + + List recordFields = null; + while (reader.next(key, value)) { + recordFields = new ArrayList(); + + for(int i=0; i<=columnIndexMax; i++){ + Object field = inspector.getStructFieldData(value, fields.get(i)); + recordFields.add(field); + } + transportOneRecord(columnConfigs, recordFields, recordSender, + taskPluginCollector, isReadAllColumns,nullFormat); + } + reader.close(); + }catch (Exception e){ + String message = String.format("从orcfile文件路径[%s]中读取数据发生异常,请联系系统管理员。" + , sourceOrcFilePath); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.READ_FILE_ERROR, message); + } + } else { + String message = String.format("请确认您所读取的列配置正确!"); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.BAD_CONFIG_VALUE, message); + } + 
} + + private Record transportOneRecord(List columnConfigs, List recordFields + , RecordSender recordSender, TaskPluginCollector taskPluginCollector, boolean isReadAllColumns, String nullFormat){ + Record record = recordSender.createRecord(); + Column columnGenerated = null; + try { + if(isReadAllColumns){ + // 读取所有列,创建都为String类型的column + for(Object recordField :recordFields){ + String columnValue = null; + if(recordField != null){ + columnValue = recordField.toString(); + } + columnGenerated = new StringColumn(columnValue); + record.addColumn(columnGenerated); + } + } + else { + for (Configuration columnConfig : columnConfigs) { + String columnType = columnConfig + .getNecessaryValue(Key.TYPE, HdfsReaderErrorCode.CONFIG_INVALID_EXCEPTION); + Integer columnIndex = columnConfig.getInt(Key.INDEX); + String columnConst = columnConfig.getString(Key.VALUE); + + String columnValue = null; + + if (null != columnIndex) { + if (null != recordFields.get(columnIndex)) + columnValue = recordFields.get(columnIndex).toString(); + } else { + columnValue = columnConst; + } + Type type = Type.valueOf(columnType.toUpperCase()); + // it's all ok if nullFormat is null + if (columnValue.equals(nullFormat)) { + columnValue = null; + } + switch (type) { + case STRING: + columnGenerated = new StringColumn(columnValue); + break; + case LONG: + try { + columnGenerated = new LongColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "LONG")); + } + break; + case DOUBLE: + try { + columnGenerated = new DoubleColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "DOUBLE")); + } + break; + case BOOLEAN: + try { + columnGenerated = new BoolColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "BOOLEAN")); + } + + break; + case DATE: + try { + if (columnValue == null) { + Date date = null; + columnGenerated = new DateColumn(date); + } else { + String formatString = columnConfig.getString(Key.FORMAT); + if (StringUtils.isNotBlank(formatString)) { + // 用户自己配置的格式转换 + SimpleDateFormat format = new SimpleDateFormat( + formatString); + columnGenerated = new DateColumn( + format.parse(columnValue)); + } else { + // 框架尝试转换 + columnGenerated = new DateColumn( + new StringColumn(columnValue) + .asDate()); + } + } + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "DATE")); + } + break; + default: + String errorMessage = String.format( + "您配置的列类型暂不支持 : [%s]", columnType); + LOG.error(errorMessage); + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.NOT_SUPPORT_TYPE, + errorMessage); + } + + record.addColumn(columnGenerated); + } + } + recordSender.sendToWriter(record); + } catch (IllegalArgumentException iae) { + taskPluginCollector + .collectDirtyRecord(record, iae.getMessage()); + } catch (IndexOutOfBoundsException ioe) { + taskPluginCollector + .collectDirtyRecord(record, ioe.getMessage()); + } catch (Exception e) { + if (e instanceof DataXException) { + throw (DataXException) e; + } + // 每一种转换失败都是脏数据处理,包括数字格式 & 日期格式 + taskPluginCollector.collectDirtyRecord(record, e.getMessage()); + } + + return record; + } + + private int getAllColumnsCount(String filePath){ + int columnsCount = 0; + final String colFinal = "_col"; + Path path = new Path(filePath); + try { + Reader 
reader = OrcFile.createReader(path, OrcFile.readerOptions(hadoopConf)); + String type_struct = reader.getObjectInspector().getTypeName(); + columnsCount = (type_struct.length() - type_struct.replace(colFinal, "").length()) + / colFinal.length(); + return columnsCount; + }catch (IOException e){ + String message = "读取orcfile column列数失败,请联系系统管理员"; + throw DataXException.asDataXException(HdfsReaderErrorCode.READ_FILE_ERROR, message); + } + } + + private int getMaxIndex(List columnConfigs){ + int maxIndex = -1; + for (Configuration columnConfig : columnConfigs) { + Integer columnIndex = columnConfig.getInt(Key.INDEX); + if (columnIndex != null && columnIndex < 0) { + String message = String.format("您column中配置的index不能小于0,请修改为正确的index"); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.CONFIG_INVALID_EXCEPTION, message); + } else if (columnIndex != null && columnIndex > maxIndex) { + maxIndex = columnIndex; + } + } + return maxIndex; + } + + private static enum Type { + STRING, LONG, BOOLEAN, DOUBLE, DATE, + } + + public HdfsFileType checkHdfsFileType(String filepath){ + + Path path = new Path(filepath); + + try { + FileSystem fs = FileSystem.get(hadoopConf); + + // figure out the size of the file using the option or filesystem + long size = fs.getFileStatus(path).getLen(); + + //read last bytes into buffer to get PostScript + int readSize = (int) Math.min(size, DIRECTORY_SIZE_GUESS); + FSDataInputStream file = fs.open(path); + file.seek(size - readSize); + ByteBuffer buffer = ByteBuffer.allocate(readSize); + file.readFully(buffer.array(), buffer.arrayOffset() + buffer.position(), + buffer.remaining()); + + //read the PostScript + //get length of PostScript + int psLen = buffer.get(readSize - 1) & 0xff; + HdfsFileType type = checkType(file, path, psLen, buffer, hadoopConf); + + return type; + }catch (Exception e){ + String message = String.format("检查文件[%s]类型失败,请检查您的文件是否合法。" + , filepath); + throw DataXException.asDataXException(HdfsReaderErrorCode.READ_FILE_ERROR, message); + } + } + + /** + * Check the file type + * @param in the file being read + * @param path the filename for error messages + * @param psLen the postscript length + * @param buffer the tail of the file + * @throws IOException + */ + private HdfsFileType checkType(FSDataInputStream in, + Path path, + int psLen, + ByteBuffer buffer, + org.apache.hadoop.conf.Configuration hadoopConf) throws IOException { + int len = OrcFile.MAGIC.length(); + if (psLen < len + 1) { + String message = String.format("Malformed ORC file [%s]. Invalid postscript length [%s]" + , path, psLen); + LOG.error(message); + throw DataXException.asDataXException( + HdfsReaderErrorCode.MALFORMED_ORC_ERROR, message); + } + int offset = buffer.arrayOffset() + buffer.position() + buffer.limit() - 1 + - len; + byte[] array = buffer.array(); + // now look for the magic string at the end of the postscript. + if (Text.decode(array, offset, len).equals(OrcFile.MAGIC)) { + return HdfsFileType.ORC; + } + else{ + // If it isn't there, this may be the 0.11.0 version of ORC. 
+ // Read the first 3 bytes of the file to check for the header + in.seek(0); + byte[] header = new byte[len]; + in.readFully(header, 0, len); + // if it isn't there, this isn't an ORC file + if (Text.decode(header, 0 , len).equals(OrcFile.MAGIC)) { + return HdfsFileType.ORC; + } + else{ + in.seek(0); + switch (in.readShort()) { + case 0x5345: + if (in.readByte() == 'Q') { + return HdfsFileType.SEQ; + } + default: + in.seek(0); + CompressionCodecFactory compressionCodecFactory = new CompressionCodecFactory(hadoopConf); + CompressionCodec codec = compressionCodecFactory.getCodec(path); + if (null == codec) + return HdfsFileType.TEXT; + else { + return HdfsFileType.COMPRESSED_TEXT; + } + } + } + + } + } + +} \ No newline at end of file diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsFileType.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsFileType.java new file mode 100644 index 000000000..793489ae2 --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsFileType.java @@ -0,0 +1,8 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +/** + * Created by mingya.wmy on 2015/8/22. + */ +public enum HdfsFileType { + TEXT, COMPRESSED_TEXT, ORC, SEQ, +} diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReader.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReader.java new file mode 100644 index 000000000..677c0f3fa --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReader.java @@ -0,0 +1,298 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import org.apache.commons.io.Charsets; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.BufferedReader; +import java.nio.charset.UnsupportedCharsetException; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; + +public class HdfsReader extends Reader { + + /** + * Job 中的方法仅执行一次,Task 中方法会由框架启动多个 Task 线程并行执行。 + *

+ * 整个 Reader 执行流程是: + *

+     * Job类init-->prepare-->split
+     *
+     * Task类init-->prepare-->startRead-->post-->destroy
+     * Task类init-->prepare-->startRead-->post-->destroy
+     *
+     * Job类post-->destroy
+     * 
+ */ + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + private Configuration readerOriginConfig = null; + private String defaultFS = null; + private String encoding = null; + private HashSet sourceFiles; + private String specifiedFileType = null; + private DFSUtil dfsUtil = null; + private List path = null; + + @Override + public void init() { + + LOG.info("init() begin..."); + this.readerOriginConfig = super.getPluginJobConf(); + this.validate(); + dfsUtil = new DFSUtil(defaultFS); + LOG.info("init() ok and end..."); + + } + + private void validate(){ + defaultFS = this.readerOriginConfig.getNecessaryValue(Key.DEFAULT_FS, + HdfsReaderErrorCode.DEFAULT_FS_NOT_FIND_ERROR); + if (StringUtils.isBlank(defaultFS)) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.PATH_NOT_FIND_ERROR, "您需要指定 defaultFS"); + } + + // path check + String pathInString = this.readerOriginConfig.getNecessaryValue(Key.PATH, HdfsReaderErrorCode.REQUIRED_VALUE); + if (!pathInString.startsWith("[") && !pathInString.endsWith("]")) { + path = new ArrayList(); + path.add(pathInString); + } else { + path = this.readerOriginConfig.getList(Key.PATH, String.class); + if (null == path || path.size() == 0) { + throw DataXException.asDataXException(HdfsReaderErrorCode.REQUIRED_VALUE, "您需要指定待读取的源目录或文件"); + } + for (String eachPath : path) { + if(!eachPath.startsWith("/")){ + String message = String.format("请检查参数path:[%s],需要配置为绝对路径", eachPath); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.ILLEGAL_VALUE, message); + } + } + } + + specifiedFileType = this.readerOriginConfig.getNecessaryValue(Key.FILETYPE, HdfsReaderErrorCode.REQUIRED_VALUE); + if( !specifiedFileType.equalsIgnoreCase("ORC") && + !specifiedFileType.equalsIgnoreCase("TEXT")){ + String message = "HdfsReader插件目前只支持ORC和TEXT两种格式的文件," + + "如果您需要指定读取的文件类型,请将filetype选项的值配置为ORC或者TEXT"; + throw DataXException.asDataXException( + HdfsReaderErrorCode.FILE_TYPE_ERROR, message); + } + + encoding = this.readerOriginConfig.getString(Key.ENCODING, "UTF-8"); + + try { + Charsets.toCharset(encoding); + } catch (UnsupportedCharsetException uce) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持的编码格式 : [%s]", encoding), uce); + } catch (Exception e) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.ILLEGAL_VALUE, + String.format("运行配置异常 : %s", e.getMessage()), e); + } + + // validate the Columns + validateColumns(); + + } + + private void validateColumns(){ + + // 检测是column 是否为 ["*"] 若是则填为空 + List column = this.readerOriginConfig + .getListConfiguration(Key.COLUMN); + if (null != column + && 1 == column.size() + && ("\"*\"".equals(column.get(0).toString()) || "'*'" + .equals(column.get(0).toString()))) { + readerOriginConfig + .set(Key.COLUMN, new ArrayList()); + } else { + // column: 1. 
index type 2.value type 3.when type is Data, may have format + List columns = this.readerOriginConfig + .getListConfiguration(Key.COLUMN); + + if (null == columns || columns.size() == 0) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 columns"); + } + + if (null != columns && columns.size() != 0) { + for (Configuration eachColumnConf : columns) { + eachColumnConf.getNecessaryValue(Key.TYPE, HdfsReaderErrorCode.REQUIRED_VALUE); + Integer columnIndex = eachColumnConf.getInt(Key.INDEX); + String columnValue = eachColumnConf.getString(Key.VALUE); + + if (null == columnIndex && null == columnValue) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnValue) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + + } + } + } + } + + @Override + public void prepare() { + + LOG.info("prepare()"); + this.sourceFiles = dfsUtil.getAllFiles(path, specifiedFileType); + LOG.info(String.format("您即将读取的文件数为: [%s]", this.sourceFiles.size())); + LOG.info("待读取的所有文件绝对路径如下:"); + for(String filePath :sourceFiles){ + LOG.info(String.format("[%s]", filePath)); + } + } + + @Override + public List split(int adviceNumber) { + + LOG.info("split() begin..."); + List readerSplitConfigs = new ArrayList(); + // warn:每个slice拖且仅拖一个文件, + // int splitNumber = adviceNumber; + int splitNumber = this.sourceFiles.size(); + if (0 == splitNumber) { + throw DataXException.asDataXException(HdfsReaderErrorCode.EMPTY_DIR_EXCEPTION, + String.format("未能找到待读取的文件,请确认您的配置项path: %s", this.readerOriginConfig.getString(Key.PATH))); + } + + List> splitedSourceFiles = this.splitSourceFiles(new ArrayList(this.sourceFiles), splitNumber); + for (List files : splitedSourceFiles) { + Configuration splitedConfig = this.readerOriginConfig.clone(); + splitedConfig.set(Constant.SOURCE_FILES, files); + readerSplitConfigs.add(splitedConfig); + } + + return readerSplitConfigs; + } + + + private List> splitSourceFiles(final List sourceList, int adviceNumber) { + List> splitedList = new ArrayList>(); + int averageLength = sourceList.size() / adviceNumber; + averageLength = averageLength == 0 ? 
1 : averageLength; + + for (int begin = 0, end = 0; begin < sourceList.size(); begin = end) { + end = begin + averageLength; + if (end > sourceList.size()) { + end = sourceList.size(); + } + splitedList.add(sourceList.subList(begin, end)); + } + return splitedList; + } + + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + } + + public static class Task extends Reader.Task { + + private static Logger LOG = LoggerFactory.getLogger(Reader.Task.class); + private Configuration taskConfig; + private List sourceFiles; + private String defaultFS; + private HdfsFileType fileType; + private String specifiedFileType; + private String encoding; + private DFSUtil dfsUtil = null; + + @Override + public void init() { + + this.taskConfig = super.getPluginJobConf(); + this.sourceFiles = this.taskConfig.getList(Constant.SOURCE_FILES, String.class); + this.defaultFS = this.taskConfig.getNecessaryValue(Key.DEFAULT_FS, + HdfsReaderErrorCode.DEFAULT_FS_NOT_FIND_ERROR); + this.specifiedFileType = this.taskConfig.getNecessaryValue(Key.FILETYPE, HdfsReaderErrorCode.REQUIRED_VALUE); + this.encoding = this.taskConfig.getString(Key.ENCODING, "UTF-8"); + this.dfsUtil = new DFSUtil(defaultFS); + } + + @Override + public void prepare() { + + } + + @Override + public void startRead(RecordSender recordSender) { + + LOG.info("read start"); + for (String sourceFile : this.sourceFiles) { + LOG.info(String.format("reading file : [%s]", sourceFile)); + fileType = dfsUtil.checkHdfsFileType(sourceFile); + + if((fileType.equals(HdfsFileType.TEXT) || fileType.equals(HdfsFileType.COMPRESSED_TEXT)) + &&(this.specifiedFileType.equalsIgnoreCase(Constant.TEXT))) { + + BufferedReader bufferedReader = dfsUtil.getBufferedReader(sourceFile, fileType, encoding); + UnstructuredStorageReaderUtil.doReadFromStream(bufferedReader, sourceFile, + this.taskConfig, recordSender, this.getTaskPluginCollector()); + }else if(fileType.equals(HdfsFileType.ORC) + && (this.specifiedFileType.equalsIgnoreCase(Constant.ORC))){ + + dfsUtil.orcFileStartRead(sourceFile, this.taskConfig, + recordSender, this.getTaskPluginCollector()); + }else { + + String message = String.format("文件[%s]的类型与用户配置的fileType类型不一致," + + "请确认您配置的目录下面所有文件的类型均为[%s]" + , sourceFile, this.specifiedFileType); + LOG.error(message); + throw DataXException.asDataXException( + HdfsReaderErrorCode.FILE_TYPE_UNSUPPORT, message); + } + + if(recordSender != null){ + recordSender.flush(); + } + } + + LOG.info("end read source files..."); + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + } + +} \ No newline at end of file diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReaderErrorCode.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReaderErrorCode.java new file mode 100644 index 000000000..4b1af3da9 --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReaderErrorCode.java @@ -0,0 +1,44 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum HdfsReaderErrorCode implements ErrorCode { + BAD_CONFIG_VALUE("HdfsReader-00", "您配置的值不合法."), + PATH_NOT_FIND_ERROR("HdfsReader-01", "您未配置path值"), + DEFAULT_FS_NOT_FIND_ERROR("HdfsReader-02", "您未配置defaultFS值"), + ILLEGAL_VALUE("HdfsReader-03", "值错误"), + CONFIG_INVALID_EXCEPTION("HdfsReader-04", "参数配置错误"), + REQUIRED_VALUE("HdfsReader-05", "您缺失了必须填写的参数值."), + NO_INDEX_VALUE("HdfsReader-06","没有 Index" ), + 
MIXED_INDEX_VALUE("HdfsReader-07","index 和 value 混合" ), + EMPTY_DIR_EXCEPTION("HdfsReader-08", "您尝试读取的文件目录为空."), + PATH_CONFIG_ERROR("HdfsReader-09", "您配置的path格式有误"), + READ_FILE_ERROR("HdfsReader-10", "读取文件出错"), + MALFORMED_ORC_ERROR("HdfsReader-10", "ORCFILE格式异常"), + FILE_TYPE_ERROR("HdfsReader-11", "文件类型配置错误"), + FILE_TYPE_UNSUPPORT("HdfsReader-12", "文件类型目前不支持"),; + + private final String code; + private final String description; + + private HdfsReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} \ No newline at end of file diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Key.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Key.java new file mode 100644 index 000000000..c9b8ae350 --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Key.java @@ -0,0 +1,18 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +public final class Key { + + /** + * 此处声明插件用到的需要插件使用者提供的配置项 + */ + public final static String PATH = "path"; + public static final String COLUMN = "column"; + public final static String DEFAULT_FS = "defaultFS"; + public final static String ENCODING = "encoding"; + public static final String TYPE = "type"; + public static final String INDEX = "index"; + public static final String VALUE = "value"; + public static final String FORMAT = "format"; + public static final String FILETYPE = "fileType"; + public static final String NULL_FORMAT = "nullFormat"; +} diff --git a/hdfsreader/src/main/resources/plugin.json b/hdfsreader/src/main/resources/plugin.json new file mode 100644 index 000000000..f3f5c7277 --- /dev/null +++ b/hdfsreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "hdfsreader", + "class": "com.alibaba.datax.plugin.reader.hdfsreader.HdfsReader", + "description": "useScene: test. mechanism: use datax framework to transport data from hdfs. 
warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/hdfsreader/src/main/resources/plugin_job_template.json b/hdfsreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..d73427d30 --- /dev/null +++ b/hdfsreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,11 @@ +{ + "name": "hdfsreader", + "parameter": { + "path": "", + "defaultFS": "", + "column": [], + "fileType": "orc", + "encoding": "UTF-8", + "fieldDelimiter": "," + } +} \ No newline at end of file diff --git a/hdfswriter/doc/hdfswriter.md b/hdfswriter/doc/hdfswriter.md new file mode 100644 index 000000000..eb7122459 --- /dev/null +++ b/hdfswriter/doc/hdfswriter.md @@ -0,0 +1,355 @@ +# DataX HdfsWriter 插件文档 + + +------------ + +## 1 快速介绍 + +HdfsWriter提供向HDFS文件系统指定路径中写入TEXTFile文件和ORCFile文件,文件内容可与hive中表关联。 + + +有关Hive的资料,可以参看http://baike.corp.taobao.com/index.php/Hive%E5%9F%BA%E7%A1%80%E6%96%87%E6%A1%A3 + +## 2 功能与限制 + +* (1)、目前HdfsWriter仅支持textfile和orcfile两种格式的文件,且文件内容存放的必须是一张逻辑意义上的二维表; +* (2)、由于HDFS是文件系统,不存在schema的概念,因此不支持对部分列写入; +* (3)、目前仅支持与以下Hive数据类型: +数值型:TINYINT,SMALLINT,INT,BIGINT,FLOAT,DOUBLE +字符串类型:STRING,VARCHAR,CHAR +布尔类型:BOOLEAN +时间类型:DATE,TIMESTAMP +**目前不支持:decimal、binary、arrays、maps、structs、union类型**; +* (4)、对于Hive分区表目前仅支持一次写入单个分区; +* (5)、对于textfile需用户保证写入hdfs文件的分隔符**与在Hive上创建表时的分隔符一致**,从而实现写入hdfs数据与Hive表字段关联; +* (6)、HdfsWriter实现过程是:首先根据用户指定的path,创建一个hdfs文件系统上不存在的临时目录,创建规则:path_随机;然后将读取的文件写入这个临时目录;全部写入后再将这个临时目录下的文件移动到用户指定目录(在创建文件时保证文件名不重复); 最后删除临时目录。如果在中间过程发生网络中断等情况造成无法与hdfs建立连接,需要用户手动删除已经写入的文件和临时目录。 +* (7)、目前插件中Hive版本为1.1.1,Hadoop版本为2.5.0(Apache[为适配JDK1.6],在Hadoop 2.6.0 和Hive 1.2.0测试环境中写入正常;其它版本需后期进一步测试; +* (8)、目前HdfsWriter不支持Kerberos等认证,所以用户需要保证DATAX有权限访问该hdfs节点,并且对指定的目录有写的权限 + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "setting": {}, + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "txtfilereader", + "parameter": { + "path": ["/Users/shf/workplace/txtWorkplace/job/dataorcfull.txt"], + "encoding": "UTF-8", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "long" + }, + { + "index": 2, + "type": "long" + }, + { + "index": 3, + "type": "long" + }, + { + "index": 4, + "type": "DOUBLE" + }, + { + "index": 5, + "type": "DOUBLE" + }, + { + "index": 6, + "type": "STRING" + }, + { + "index": 7, + "type": "STRING" + }, + { + "index": 8, + "type": "STRING" + }, + { + "index": 9, + "type": "BOOLEAN" + }, + { + "index": 10, + "type": "date" + }, + { + "index": 11, + "type": "date" + } + ], + "fieldDelimiter": "\t" + } + }, + "writer": { + "name": "hdfswriter", + "parameter": { + "defaultFS": "hdfs://10.101.204.12:9000", + "fileType": "orc", + "path": "/user/hive/warehouse/writerorc.db/orcfull", + "fileName": "qiran", + "column": [ + { + "name": "col1", + "type": "TINYINT" + }, + { + "name": "col2", + "type": "SMALLINT" + }, + { + "name": "col3", + "type": "INT" + }, + { + "name": "col4", + "type": "BIGINT" + }, + { + "name": "col5", + "type": "FLOAT" + }, + { + "name": "col6", + "type": "DOUBLE" + }, + { + "name": "col7", + "type": "STRING" + }, + { + "name": "col8", + "type": "VARCHAR" + }, + { + "name": "col9", + "type": "CHAR" + }, + { + "name": "col10", + "type": "BOOLEAN" + }, + { + "name": "col11", + "type": "date" + }, + { + "name": "col12", + "type": "TIMESTAMP" + } + ], + "writeMode": "truncate", + "fieldDelimiter": "\t", + "compress":"NONE" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* 
**defaultFS** + + * 描述:Hadoop hdfs文件系统namenode节点地址。格式:hdfs://ip:端口;例如:hdfs://10.101.204.12:9000
+ + + **特别需要注意的是,目前HdfsWriter不支持Kerberos等认证,所以用户需要保证DATAX有权限访问该节点** + + + * 必选:是
+ + * 默认值:无
+ +* **fileType** + + * 描述:文件的类型,目前只支持用户配置为"text"或"orc"。
+ + text表示textfile文件格式 + + orc表示orcfile文件格式 + + * 必选:是
+ + * 默认值:无
+* **path** + + * 描述:存储到Hadoop hdfs文件系统的路径信息,HdfsWriter会根据并发配置在Path目录下写入多个文件。为与hive表关联,请填写hive表在hdfs上的存储路径。例:Hive上设置的数据仓库的存储路径为:/user/hive/warehouse/ ,已建立数据库:test,表:hello;则对应的存储路径为:/user/hive/warehouse/test.db/hello
+ + * 必选:是
+ + * 默认值:无
+ +* **fileName** + + * 描述:HdfsWriter写入时的文件名,实际执行时会在该文件名后添加随机的后缀作为每个线程写入实际文件名。
+ + * 必选:是
+ + * 默认值:无
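
  结合path一节中的Hive示例,下面是一个示意性的配置片段(其中库表路径沿用上文示例,文件名为假设值),实际写入时会在fileName后追加随机后缀作为每个线程的实际文件名:

  ```json
  "path": "/user/hive/warehouse/test.db/hello",
  "fileName": "hello_datax"
  ```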
+* **column** + + * 描述:写入数据的字段,不支持对部分列写入。为与hive中表关联,需要指定表中所有字段名和字段类型,其中:name指定字段名,type指定字段类型。
+ + 用户可以指定Column字段信息,配置如下: + + ```json + "column": + [ + { + "name": "userName", + "type": "string" + }, + { + "name": "age", + "type": "long" + } + ] + ``` + + * 必选:是
+ + * 默认值:无
+* **writeMode** + + * 描述:hdfswriter写入前数据清理处理模式:
+ + * append,写入前不做任何处理,DataX hdfswriter直接使用filename写入,并保证文件名不冲突。 + * nonConflict,如果目录下有fileName前缀的文件,直接报错。 + + * 必选:是
+ + * 默认值:无
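
  例如,希望在目标目录下已存在fileName前缀文件时直接报错,可以如下配置(仅为示意):

  ```json
  "writeMode": "nonConflict"
  ```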
+ +* **fieldDelimiter** + + * 描述:hdfswriter写入时的字段分隔符,**需要用户保证与创建的Hive表的字段分隔符一致,否则无法在Hive表中查到数据**
+ + * 必选:是
+ + * 默认值:无
+ +* **compress** + + * 描述:hdfs文件压缩类型,默认不填写意味着没有压缩。其中:text类型文件支持压缩类型有gzip、bzip2;orc类型文件支持的压缩类型有NONE、SNAPPY(需要用户安装SnappyCodec)。
+ + * 必选:否
+ + * 默认值:无压缩
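
  下面是一个示意性的片段,展示text类型文件配置gzip压缩的写法;若fileType为orc,则compress可配置为NONE或SNAPPY:

  ```json
  "fileType": "text",
  "compress": "gzip"
  ```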
+ +* **encoding** + + * 描述:写文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8,**慎重修改**
+ + +### 3.3 类型转换 + +目前 HdfsWriter 支持大部分 Hive 类型,请注意检查你的类型。 + +下面列出 HdfsWriter 针对 Hive 数据类型转换列表: + +| DataX 内部类型| HIVE 数据类型 | +| -------- | ----- | +| Long |TINYINT,SMALLINT,INT,BIGINT | +| Double |FLOAT,DOUBLE | +| String |STRING,VARCHAR,CHAR | +| Boolean |BOOLEAN | +| Date |DATE,TIMESTAMP | + + +## 4 配置步骤 +* 步骤一、在Hive中创建数据库、表 +Hive数据库在HDFS上存储配置,在hive安装目录下 conf/hive-site.xml文件中配置,默认值为:/user/hive/warehouse +如下所示: + +```xml + + hive.metastore.warehouse.dir + /user/hive/warehouse + location of default database for the warehouse + +``` +Hive建库/建表语法 参考 [Hive操作手册]( https://cwiki.apache.org/confluence/display/Hive/LanguageManual) + +例: +(1)建立存储为textfile文件类型的表 +```json +create database IF NOT EXISTS hdfswriter; +use hdfswriter; +create table text_table( +col1 TINYINT, +col2 SMALLINT, +col3 INT, +col4 BIGINT, +col5 FLOAT, +col6 DOUBLE, +col7 STRING, +col8 VARCHAR(10), +col9 CHAR(10), +col10 BOOLEAN, +col11 date, +col12 TIMESTAMP +) +row format delimited +fields terminated by "\t" +STORED AS TEXTFILE; +``` +text_table在hdfs上存储路径为:/user/hive/warehouse/hdfswriter.db/text_table/ + +(2)建立存储为orcfile文件类型的表 +```json +create database IF NOT EXISTS hdfswriter; +use hdfswriter; +create table orc_table( +col1 TINYINT, +col2 SMALLINT, +col3 INT, +col4 BIGINT, +col5 FLOAT, +col6 DOUBLE, +col7 STRING, +col8 VARCHAR(10), +col9 CHAR(10), +col10 BOOLEAN, +col11 date, +col12 TIMESTAMP +) +ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' +STORED AS ORC; +``` +orc_table在hdfs上存储路径为:/user/hive/warehouse/hdfswriter.db/orc_table/ + +* 步骤二、根据步骤一的配置信息配置HdfsWriter作业 + +## 5 约束限制 + +略 + +## 6 FAQ + +略 diff --git a/hdfswriter/hdfswriter.iml b/hdfswriter/hdfswriter.iml new file mode 100644 index 000000000..3b92a594e --- /dev/null +++ b/hdfswriter/hdfswriter.iml @@ -0,0 +1,163 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/hdfswriter/pom.xml b/hdfswriter/pom.xml new file mode 100644 index 000000000..b3f49cbd2 --- /dev/null +++ b/hdfswriter/pom.xml @@ -0,0 +1,135 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + hdfswriter + hdfswriter + HdfsWriter提供了写入HDFS功能。 + jar + + 1.1.1 + 2.5.0 + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + org.apache.hadoop + hadoop-hdfs + ${hadoop.version} + + + org.apache.hadoop + hadoop-common + ${hadoop.version} + + + org.apache.hadoop + hadoop-yarn-common + ${hadoop.version} + + + org.apache.hadoop + hadoop-mapreduce-client-core + ${hadoop.version} + + + + org.apache.hive + hive-exec + ${hive.version} + + + org.apache.hive + hive-serde + ${hive.version} + + + org.apache.hive + hive-service + ${hive.version} + + + org.apache.hive + hive-common + ${hive.version} + + + org.apache.hive.hcatalog + hive-hcatalog-core + ${hive.version} + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + + junit + junit + test + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/hdfswriter/src/main/assembly/package.xml 
b/hdfswriter/src/main/assembly/package.xml new file mode 100755 index 000000000..316e6b196 --- /dev/null +++ b/hdfswriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/hdfswriter + + + target/ + + hdfswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/hdfswriter + + + + + + false + plugin/writer/hdfswriter/libs + runtime + + + diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Constant.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Constant.java new file mode 100755 index 000000000..3e4fa52f9 --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Constant.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +public class Constant { + + public static final String DEFAULT_ENCODING = "UTF-8"; + public static final String DEFAULT_NULL_FORMAT = "\\N"; +} diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsHelper.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsHelper.java new file mode 100644 index 000000000..60f686dad --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsHelper.java @@ -0,0 +1,515 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.google.common.collect.Lists; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.MutablePair; +import org.apache.hadoop.fs.*; +import org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat; +import org.apache.hadoop.hive.ql.io.orc.OrcSerde; +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory; +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector; +import org.apache.hadoop.io.NullWritable; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.compress.CompressionCodec; +import org.apache.hadoop.mapred.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import java.io.IOException; +import java.text.SimpleDateFormat; +import java.util.*; + +public class HdfsHelper { + public static final Logger LOG = LoggerFactory.getLogger(HdfsWriter.Job.class); + public FileSystem fileSystem = null; + public JobConf conf = null; + + public void getFileSystem(String defaultFS){ + org.apache.hadoop.conf.Configuration hadoopConf = new org.apache.hadoop.conf.Configuration(); + hadoopConf.set("fs.defaultFS", defaultFS); + conf = new JobConf(hadoopConf); + try { + fileSystem = FileSystem.get(conf); + } catch (IOException e) { + String message = String.format("获取FileSystem时发生网络IO异常,请检查您的网络是否正常!HDFS地址:[%s]", + "message:defaultFS =" + defaultFS); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e); + }catch (Exception e) { + String message = String.format("获取FileSystem失败,请检查HDFS地址是否正确: [%s]", + "message:defaultFS =" + defaultFS); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e); + } + + if(null == fileSystem || null == conf){ + String message = 
String.format("获取FileSystem失败,请检查HDFS地址是否正确: [%s]", + "message:defaultFS =" + defaultFS); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, message); + } + } + + /** + *获取指定目录先的文件列表 + * @param dir + * @return + * 拿到的是文件全路径, + */ + public String[] hdfsDirList(String dir){ + Path path = new Path(dir); + String[] files = null; + try { + FileStatus[] status = fileSystem.listStatus(path); + files = new String[status.length]; + for(int i=0;i tmpFiles, HashSet endFiles){ + Path tmpFilesParent = null; + if(tmpFiles.size() != endFiles.size()){ + String message = String.format("临时目录下文件名个数与目标文件名个数不一致!"); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.HDFS_RENAME_FILE_ERROR, message); + }else{ + try{ + for (Iterator it1=tmpFiles.iterator(),it2=endFiles.iterator();it1.hasNext()&&it2.hasNext();){ + String srcFile = it1.next().toString(); + String dstFile = it2.next().toString(); + Path srcFilePah = new Path(srcFile); + Path dstFilePah = new Path(dstFile); + if(tmpFilesParent == null){ + tmpFilesParent = srcFilePah.getParent(); + } + LOG.info(String.format("start rename file [%s] to file [%s].", srcFile,dstFile)); + boolean renameTag = false; + long fileLen = fileSystem.getFileStatus(srcFilePah).getLen(); + if(fileLen>0){ + renameTag = fileSystem.rename(srcFilePah,dstFilePah); + if(!renameTag){ + String message = String.format("重命名文件[%s]失败,请检查您的网络是否正常!", srcFile); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.HDFS_RENAME_FILE_ERROR, message); + } + LOG.info(String.format("finish rename file [%s] to file [%s].", srcFile,dstFile)); + }else{ + LOG.info(String.format("文件[%s]内容为空,请检查写入是否正常!", srcFile)); + } + } + }catch (Exception e) { + String message = String.format("重命名文件时发生异常,请检查您的网络是否正常!"); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e); + }finally { + deleteDir(tmpFilesParent); + } + } + } + + //关闭FileSystem + public void closeFileSystem(){ + try { + fileSystem.close(); + } catch (IOException e) { + String message = String.format("关闭FileSystem时发生IO异常,请检查您的网络是否正常!"); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e); + } + } + + + //textfile格式文件 + public FSDataOutputStream getOutputStream(String path){ + Path storePath = new Path(path); + FSDataOutputStream fSDataOutputStream = null; + try { + fSDataOutputStream = fileSystem.create(storePath); + } catch (IOException e) { + String message = String.format("Create an FSDataOutputStream at the indicated Path[%s] failed: [%s]", + "message:path =" + path); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e); + } + return fSDataOutputStream; + } + + /** + * 写textfile类型文件 + * @param lineReceiver + * @param config + * @param fileName + * @param taskPluginCollector + */ + public void textFileStartWrite(RecordReceiver lineReceiver, Configuration config, String fileName, + TaskPluginCollector taskPluginCollector){ + char fieldDelimiter = config.getChar(Key.FIELD_DELIMITER); + List columns = config.getListConfiguration(Key.COLUMN); + String compress = config.getString(Key.COMPRESS,null); + + SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddHHmm"); + String attempt = "attempt_"+dateFormat.format(new Date())+"_0001_m_000000_0"; + Path outputPath = new Path(fileName); + //todo 需要进一步确定TASK_ATTEMPT_ID + conf.set(JobContext.TASK_ATTEMPT_ID, attempt); + FileOutputFormat 
outFormat = new TextOutputFormat(); + outFormat.setOutputPath(conf, outputPath); + outFormat.setWorkOutputPath(conf, outputPath); + if(null != compress) { + Class codecClass = getCompressCodec(compress); + if (null != codecClass) { + outFormat.setOutputCompressorClass(conf, codecClass); + } + } + try { + RecordWriter writer = outFormat.getRecordWriter(fileSystem, conf, outputPath.toString(), Reporter.NULL); + Record record = null; + while ((record = lineReceiver.getFromReader()) != null) { + MutablePair transportResult = transportOneRecord(record, fieldDelimiter, columns, taskPluginCollector); + if (!transportResult.getRight()) { + writer.write(NullWritable.get(),transportResult.getLeft()); + } + } + writer.close(Reporter.NULL); + } catch (Exception e) { + String message = String.format("写文件文件[%s]时发生IO异常,请检查您的网络是否正常!", fileName); + LOG.error(message); + Path path = new Path(fileName); + deleteDir(path.getParent()); + throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e); + } + } + + public static MutablePair transportOneRecord( + Record record, char fieldDelimiter, List columnsConfiguration, TaskPluginCollector taskPluginCollector) { + MutablePair, Boolean> transportResultList = transportOneRecord(record,columnsConfiguration,taskPluginCollector); + //保存<转换后的数据,是否是脏数据> + MutablePair transportResult = new MutablePair(); + transportResult.setRight(false); + if(null != transportResultList){ + Text recordResult = new Text(StringUtils.join(transportResultList.getLeft(), fieldDelimiter)); + transportResult.setRight(transportResultList.getRight()); + transportResult.setLeft(recordResult); + } + return transportResult; + } + + public Class getCompressCodec(String compress){ + Class codecClass = null; + if(null == compress){ + codecClass = null; + }else if("GZIP".equalsIgnoreCase(compress)){ + codecClass = org.apache.hadoop.io.compress.GzipCodec.class; + }else if ("BZIP2".equalsIgnoreCase(compress)) { + codecClass = org.apache.hadoop.io.compress.BZip2Codec.class; + }else if("SNAPPY".equalsIgnoreCase(compress)){ + //todo 等需求明确后支持 需要用户安装SnappyCodec + codecClass = org.apache.hadoop.io.compress.SnappyCodec.class; + // org.apache.hadoop.hive.ql.io.orc.ZlibCodec.class not public + //codecClass = org.apache.hadoop.hive.ql.io.orc.ZlibCodec.class; + }else { + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("目前不支持您配置的 compress 模式 : [%s]", compress)); + } + return codecClass; + } + + /** + * 写orcfile类型文件 + * @param lineReceiver + * @param config + * @param fileName + * @param taskPluginCollector + */ + public void orcFileStartWrite(RecordReceiver lineReceiver, Configuration config, String fileName, + TaskPluginCollector taskPluginCollector){ + List columns = config.getListConfiguration(Key.COLUMN); + String compress = config.getString(Key.COMPRESS, null); + List columnNames = getColumnNames(columns); + List columnTypeInspectors = getColumnTypeInspectors(columns); + StructObjectInspector inspector = (StructObjectInspector)ObjectInspectorFactory + .getStandardStructObjectInspector(columnNames, columnTypeInspectors); + + OrcSerde orcSerde = new OrcSerde(); + + FileOutputFormat outFormat = new OrcOutputFormat(); + if(!"NONE".equalsIgnoreCase(compress) && null != compress ) { + Class codecClass = getCompressCodec(compress); + if (null != codecClass) { + outFormat.setOutputCompressorClass(conf, codecClass); + } + } + try { + RecordWriter writer = outFormat.getRecordWriter(fileSystem, conf, fileName, Reporter.NULL); + Record record = null; + while 
((record = lineReceiver.getFromReader()) != null) { + MutablePair, Boolean> transportResult = transportOneRecord(record,columns,taskPluginCollector); + if (!transportResult.getRight()) { + writer.write(NullWritable.get(), orcSerde.serialize(transportResult.getLeft(), inspector)); + } + } + writer.close(Reporter.NULL); + } catch (Exception e) { + String message = String.format("写文件文件[%s]时发生IO异常,请检查您的网络是否正常!", fileName); + LOG.error(message); + Path path = new Path(fileName); + deleteDir(path.getParent()); + throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e); + } + } + + public List getColumnNames(List columns){ + List columnNames = Lists.newArrayList(); + for (Configuration eachColumnConf : columns) { + columnNames.add(eachColumnConf.getString(Key.NAME)); + } + return columnNames; + } + + /** + * 根据writer配置的字段类型,构建inspector + * @param columns + * @return + */ + public List getColumnTypeInspectors(List columns){ + List columnTypeInspectors = Lists.newArrayList(); + for (Configuration eachColumnConf : columns) { + SupportHiveDataType columnType = SupportHiveDataType.valueOf(eachColumnConf.getString(Key.TYPE).toUpperCase()); + ObjectInspector objectInspector = null; + switch (columnType) { + case TINYINT: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Byte.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case SMALLINT: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Short.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case INT: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Integer.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case BIGINT: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Long.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case FLOAT: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Float.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case DOUBLE: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Double.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case TIMESTAMP: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(java.sql.Timestamp.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case DATE: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(java.sql.Date.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case STRING: + case VARCHAR: + case CHAR: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case BOOLEAN: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Boolean.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + default: + throw DataXException + .asDataXException( + HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 字段名:[%s], 字段类型:[%d]. 
请修改表中该字段的类型或者不同步该字段.", + eachColumnConf.getString(Key.NAME), + eachColumnConf.getString(Key.TYPE))); + } + + columnTypeInspectors.add(objectInspector); + } + return columnTypeInspectors; + } + + public OrcSerde getOrcSerde(Configuration config){ + String fieldDelimiter = config.getString(Key.FIELD_DELIMITER); + String compress = config.getString(Key.COMPRESS); + String encoding = config.getString(Key.ENCODING); + + OrcSerde orcSerde = new OrcSerde(); + Properties properties = new Properties(); + properties.setProperty("orc.bloom.filter.columns", fieldDelimiter); + properties.setProperty("orc.compress", compress); + properties.setProperty("orc.encoding.strategy", encoding); + + orcSerde.initialize(conf, properties); + return orcSerde; + } + + public static MutablePair, Boolean> transportOneRecord( + Record record,List columnsConfiguration, + TaskPluginCollector taskPluginCollector){ + + MutablePair, Boolean> transportResult = new MutablePair, Boolean>(); + transportResult.setRight(false); + List recordList = Lists.newArrayList(); + int recordLength = record.getColumnNumber(); + if (0 != recordLength) { + Column column; + for (int i = 0; i < recordLength; i++) { + column = record.getColumn(i); + //todo as method + String rowData = column.getRawData().toString(); + if (null != column.getRawData() && StringUtils.isNotBlank(rowData)) { + SupportHiveDataType columnType = SupportHiveDataType.valueOf( + columnsConfiguration.get(i).getString(Key.TYPE).toUpperCase()); + //根据writer端类型配置做类型转换 + try { + switch (columnType) { + case TINYINT: + recordList.add(Byte.valueOf(rowData)); + break; + case SMALLINT: + recordList.add(Short.valueOf(rowData)); + break; + case INT: + recordList.add(Integer.valueOf(rowData)); + break; + case BIGINT: + recordList.add(column.asLong()); + break; + case FLOAT: + recordList.add(Float.valueOf(rowData)); + break; + case DOUBLE: + recordList.add(column.asDouble()); + break; + case STRING: + case VARCHAR: + case CHAR: + recordList.add(column.asString()); + break; + case BOOLEAN: + recordList.add(column.asBoolean()); + break; + case DATE: + recordList.add(new java.sql.Date(column.asDate().getTime())); + break; + case TIMESTAMP: + recordList.add(new java.sql.Timestamp(column.asDate().getTime())); + break; + default: + throw DataXException + .asDataXException( + HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 字段名:[%s], 字段类型:[%d]. 
请修改表中该字段的类型或者不同步该字段.", + columnsConfiguration.get(i).getString(Key.NAME), + columnsConfiguration.get(i).getString(Key.TYPE))); + } + } catch (Exception e) { + // warn: 此处认为脏数据 + String message = String.format( + "字段类型转换错误:你目标字段为[%s]类型,实际字段值为[%s].", + columnsConfiguration.get(i).getString(Key.TYPE), column.getRawData().toString()); + taskPluginCollector.collectDirtyRecord(record, message); + transportResult.setRight(true); + break; + } + }else { + // warn: it's all ok if nullFormat is null + recordList.add(null); + } + } + } + transportResult.setLeft(recordList); + return transportResult; + } +} diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriter.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriter.java new file mode 100644 index 000000000..8c57eb141 --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriter.java @@ -0,0 +1,373 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.writer.Constant; +import com.google.common.collect.Sets; +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; + + +public class HdfsWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration writerSliceConfig = null; + + private String defaultFS; + private String path; + private String fileType; + private String fileName; + private List columns; + private String writeMode; + private String fieldDelimiter; + private String compress; + private String encoding; + private HashSet tmpFiles = new HashSet();//临时文件全路径 + private HashSet endFiles = new HashSet();//最终文件全路径 + + private HdfsHelper hdfsHelper = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.validateParameter(); + + //创建textfile存储 + hdfsHelper = new HdfsHelper(); + + hdfsHelper.getFileSystem(defaultFS); + } + + private void validateParameter() { + this.defaultFS = this.writerSliceConfig.getNecessaryValue(Key.DEFAULT_FS, HdfsWriterErrorCode.REQUIRED_VALUE); + //fileType check + this.fileType = this.writerSliceConfig.getNecessaryValue(Key.FILE_TYPE, HdfsWriterErrorCode.REQUIRED_VALUE); + if( !fileType.equalsIgnoreCase("ORC") && !fileType.equalsIgnoreCase("TEXT")){ + String message = "HdfsWriter插件目前只支持ORC和TEXT两种格式的文件,请将filetype选项的值配置为ORC或者TEXT"; + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, message); + } + //path + this.path = this.writerSliceConfig.getNecessaryValue(Key.PATH, HdfsWriterErrorCode.REQUIRED_VALUE); + if(!path.startsWith("/")){ + String message = String.format("请检查参数path:[%s],需要配置为绝对路径", path); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, message); + }else if(path.contains("*") || path.contains("?")){ + String message = String.format("请检查参数path:[%s],不能包含*,?等特殊字符", path); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, message); + } + //fileName + this.fileName = 
this.writerSliceConfig.getNecessaryValue(Key.FILE_NAME, HdfsWriterErrorCode.REQUIRED_VALUE); + //columns check + this.columns = this.writerSliceConfig.getListConfiguration(Key.COLUMN); + if (null == columns || columns.size() == 0) { + throw DataXException.asDataXException(HdfsWriterErrorCode.REQUIRED_VALUE, "您需要指定 columns"); + }else{ + for (Configuration eachColumnConf : columns) { + eachColumnConf.getNecessaryValue(Key.NAME, HdfsWriterErrorCode.COLUMN_REQUIRED_VALUE); + eachColumnConf.getNecessaryValue(Key.TYPE, HdfsWriterErrorCode.COLUMN_REQUIRED_VALUE); + } + } + //writeMode check + this.writeMode = this.writerSliceConfig.getNecessaryValue(Key.WRITE_MODE, HdfsWriterErrorCode.REQUIRED_VALUE); + writeMode = writeMode.toLowerCase().trim(); + Set supportedWriteModes = Sets.newHashSet("append", "nonconflict"); + if (!supportedWriteModes.contains(writeMode)) { + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("仅支持append, nonConflict两种模式, 不支持您配置的 writeMode 模式 : [%s]", + writeMode)); + } + this.writerSliceConfig.set(Key.WRITE_MODE, writeMode); + //fieldDelimiter check + this.fieldDelimiter = this.writerSliceConfig.getString(Key.FIELD_DELIMITER,null); + if(null == fieldDelimiter){ + throw DataXException.asDataXException(HdfsWriterErrorCode.REQUIRED_VALUE, + String.format("您提供配置文件有误,[%s]是必填参数.", Key.FIELD_DELIMITER)); + }else if(1 != fieldDelimiter.length()){ + // warn: if have, length must be one + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", fieldDelimiter)); + } + //compress check + this.compress = this.writerSliceConfig.getString(Key.COMPRESS,null); + if(fileType.equalsIgnoreCase("TEXT")){ + Set textSupportedCompress = Sets.newHashSet("GZIP", "BZIP2"); + if(null == compress ){ + this.writerSliceConfig.set(Key.COMPRESS, null); + }else { + compress = compress.toUpperCase().trim(); + if(!textSupportedCompress.contains(compress) ){ + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("目前TEXT FILE仅支持GZIP、BZIP2 两种压缩, 不支持您配置的 compress 模式 : [%s]", + compress)); + } + } + }else if(fileType.equalsIgnoreCase("ORC")){ + Set orcSupportedCompress = Sets.newHashSet("NONE", "SNAPPY"); + if(null == compress){ + this.writerSliceConfig.set(Key.COMPRESS, "NONE"); + }else { + compress = compress.toUpperCase().trim(); + if(!orcSupportedCompress.contains(compress)){ + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("目前ORC FILE仅支持SNAPPY压缩, 不支持您配置的 compress 模式 : [%s]", + compress)); + } + } + + } + // encoding check + this.encoding = this.writerSliceConfig.getString(Key.ENCODING,Constant.DEFAULT_ENCODING); + try { + encoding = encoding.trim(); + this.writerSliceConfig.set(Key.ENCODING, encoding); + Charsets.toCharset(encoding); + } catch (Exception e) { + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的编码格式:[%s]", encoding), e); + } + } + + @Override + public void prepare() { + //若路径已经存在,检查path是否是目录 + if(hdfsHelper.isPathexists(path)){ + if(!hdfsHelper.isPathDir(path)){ + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("您配置的path: [%s] 不是一个合法的目录, 请您注意文件重名, 不合法目录名等情况.", + path)); + } + //根据writeMode对目录下文件进行处理 + Path[] existFilePaths = hdfsHelper.hdfsDirList(path,fileName); + boolean isExistFile = false; + if(existFilePaths.length > 0){ + isExistFile = true; + } + /** + if ("truncate".equals(writeMode) && isExistFile ) { + 
LOG.info(String.format("由于您配置了writeMode truncate, 开始清理 [%s] 下面以 [%s] 开头的内容", + path, fileName)); + hdfsHelper.deleteFiles(existFilePaths); + } else + */ + if ("append".equalsIgnoreCase(writeMode)) { + LOG.info(String.format("由于您配置了writeMode append, 写入前不做清理工作, [%s] 目录下写入相应文件名前缀 [%s] 的文件", + path, fileName)); + } else if ("nonconflict".equalsIgnoreCase(writeMode) && isExistFile) { + LOG.info(String.format("由于您配置了writeMode nonConflict, 开始检查 [%s] 下面的内容", path)); + List allFiles = new ArrayList(); + for (Path eachFile : existFilePaths) { + allFiles.add(eachFile.toString()); + } + LOG.error(String.format("冲突文件列表为: [%s]", StringUtils.join(allFiles, ","))); + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("由于您配置了writeMode nonConflict,但您配置的path: [%s] 目录不为空, 下面存在其他文件或文件夹.", path)); + } + }else{ + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("您配置的path: [%s] 不存在, 请先在hive端创建对应的数据库和表.", path)); + } + } + + @Override + public void post() { + hdfsHelper.renameFile(tmpFiles, endFiles); + } + + @Override + public void destroy() { + hdfsHelper.closeFileSystem(); + } + + @Override + public List split(int mandatoryNumber) { + LOG.info("begin do split..."); + List writerSplitConfigs = new ArrayList(); + String filePrefix = fileName; + + Set allFiles = new HashSet(); + + //获取该路径下的所有已有文件列表 + if(hdfsHelper.isPathexists(path)){ + allFiles.addAll(Arrays.asList(hdfsHelper.hdfsDirList(path))); + } + + String fileSuffix; + //临时存放路径 + String storePath = buildTmpFilePath(this.path); + //最终存放路径 + String endStorePath = buildFilePath(); + this.path = endStorePath; + for (int i = 0; i < mandatoryNumber; i++) { + // handle same file name + + Configuration splitedTaskConfig = this.writerSliceConfig.clone(); + String fullFileName = null; + String endFullFileName = null; + + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + + fullFileName = String.format("%s%s%s__%s", defaultFS, storePath, filePrefix, fileSuffix); + endFullFileName = String.format("%s%s%s__%s", defaultFS, endStorePath, filePrefix, fileSuffix); + + while (allFiles.contains(endFullFileName)) { + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + fullFileName = String.format("%s%s%s__%s", defaultFS, storePath, filePrefix, fileSuffix); + endFullFileName = String.format("%s%s%s__%s", defaultFS, endStorePath, filePrefix, fileSuffix); + } + allFiles.add(endFullFileName); + + //设置临时文件全路径和最终文件全路径 + if("GZIP".equalsIgnoreCase(this.compress)){ + this.tmpFiles.add(fullFileName + ".gz"); + this.endFiles.add(endFullFileName + ".gz"); + }else if("BZIP2".equalsIgnoreCase(compress)){ + this.tmpFiles.add(fullFileName + ".bz2"); + this.endFiles.add(endFullFileName + ".bz2"); + }else{ + this.tmpFiles.add(fullFileName); + this.endFiles.add(endFullFileName); + } + + splitedTaskConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME, + fullFileName); + + LOG.info(String.format("splited write file name:[%s]", + fullFileName)); + + writerSplitConfigs.add(splitedTaskConfig); + } + LOG.info("end do split."); + return writerSplitConfigs; + } + + private String buildFilePath() { + boolean isEndWithSeparator = false; + switch (IOUtils.DIR_SEPARATOR) { + case IOUtils.DIR_SEPARATOR_UNIX: + isEndWithSeparator = this.path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR)); + break; + case IOUtils.DIR_SEPARATOR_WINDOWS: + isEndWithSeparator = this.path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR_WINDOWS)); + break; + default: + break; + } + if 
(!isEndWithSeparator) { + this.path = this.path + IOUtils.DIR_SEPARATOR; + } + return this.path; + } + + /** + * 创建临时目录 + * @param userPath + * @return + */ + private String buildTmpFilePath(String userPath) { + String tmpFilePath; + boolean isEndWithSeparator = false; + switch (IOUtils.DIR_SEPARATOR) { + case IOUtils.DIR_SEPARATOR_UNIX: + isEndWithSeparator = userPath.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR)); + break; + case IOUtils.DIR_SEPARATOR_WINDOWS: + isEndWithSeparator = userPath.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR_WINDOWS)); + break; + default: + break; + } + String tmpSuffix; + tmpSuffix = UUID.randomUUID().toString().replace('-', '_'); + if (!isEndWithSeparator) { + tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR); + }else if("/".equals(userPath)){ + tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR); + }else{ + tmpFilePath = String.format("%s__%s%s", userPath.substring(0,userPath.length()-1), tmpSuffix, IOUtils.DIR_SEPARATOR); + } + while(hdfsHelper.isPathexists(tmpFilePath)){ + tmpSuffix = UUID.randomUUID().toString().replace('-', '_'); + if (!isEndWithSeparator) { + tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR); + }else if("/".equals(userPath)){ + tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR); + }else{ + tmpFilePath = String.format("%s__%s%s", userPath.substring(0,userPath.length()-1), tmpSuffix, IOUtils.DIR_SEPARATOR); + } + } + return tmpFilePath; + } + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + + private Configuration writerSliceConfig; + + private String defaultFS; + private String fileType; + private String fileName; + + private HdfsHelper hdfsHelper = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + + this.defaultFS = this.writerSliceConfig.getString(Key.DEFAULT_FS); + this.fileType = this.writerSliceConfig.getString(Key.FILE_TYPE); + this.fileName = this.writerSliceConfig.getString(Key.FILE_NAME); + + hdfsHelper = new HdfsHelper(); + hdfsHelper.getFileSystem(defaultFS); + } + + @Override + public void prepare() { + + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + LOG.info("begin do write..."); + LOG.info(String.format("write to file : [%s]", this.fileName)); + if(fileType.equalsIgnoreCase("TEXT")){ + //写TEXT FILE + hdfsHelper.textFileStartWrite(lineReceiver,this.writerSliceConfig, this.fileName, + this.getTaskPluginCollector()); + }else if(fileType.equalsIgnoreCase("ORC")){ + //写ORC FILE + hdfsHelper.orcFileStartWrite(lineReceiver,this.writerSliceConfig, this.fileName, + this.getTaskPluginCollector()); + } + + LOG.info("end do write"); + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + } +} diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriterErrorCode.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriterErrorCode.java new file mode 100644 index 000000000..4dedfd4f9 --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriterErrorCode.java @@ -0,0 +1,44 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by shf on 15/10/8. 
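+ *
+ * Maps every failure category of HdfsWriter (missing or illegal parameters, charset problems,
+ * write-time IO errors, HDFS connection errors and the final rename step) to a stable code such
+ * as "HdfsWriter-01" plus a readable description, so the framework can report errors uniformly.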
+ */ +public enum HdfsWriterErrorCode implements ErrorCode { + + CONFIG_INVALID_EXCEPTION("HdfsWriter-00", "您的参数配置错误."), + REQUIRED_VALUE("HdfsWriter-01", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("HdfsWriter-02", "您填写的参数值不合法."), + WRITER_FILE_WITH_CHARSET_ERROR("HdfsWriter-03", "您配置的编码未能正常写入."), + Write_FILE_IO_ERROR("HdfsWriter-04", "您配置的文件在写入时出现IO异常."), + WRITER_RUNTIME_EXCEPTION("HdfsWriter-05", "出现运行时异常, 请联系我们."), + CONNECT_HDFS_IO_ERROR("HdfsWriter-06", "与HDFS建立连接时出现IO异常."), + COLUMN_REQUIRED_VALUE("HdfsWriter-07", "您column配置中缺失了必须填写的参数值."), + HDFS_RENAME_FILE_ERROR("HdfsWriter-08", "将文件移动到配置路径失败."); + + private final String code; + private final String description; + + private HdfsWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } + +} diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Key.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Key.java new file mode 100644 index 000000000..dd7438fc6 --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Key.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +/** + * Created by shf on 15/10/8. + */ +public class Key { + // must have + public static final String PATH = "path"; + //must have + public final static String DEFAULT_FS = "defaultFS"; + //must have + public final static String FILE_TYPE = "fileType"; + // must have + public static final String FILE_NAME = "fileName"; + // must have for column + public static final String COLUMN = "column"; + public static final String NAME = "name"; + public static final String TYPE = "type"; + public static final String DATE_FORMAT = "dateFormat"; + // must have + public static final String WRITE_MODE = "writeMode"; + // must have + public static final String FIELD_DELIMITER = "fieldDelimiter"; + // not must, default UTF-8 + public static final String ENCODING = "encoding"; + // not must, default no compress + public static final String COMPRESS = "compress"; + // not must, not default \N + public static final String NULL_FORMAT = "nullFormat"; + + +} diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/SupportHiveDataType.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/SupportHiveDataType.java new file mode 100644 index 000000000..b7949302c --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/SupportHiveDataType.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +public enum SupportHiveDataType { + TINYINT, + SMALLINT, + INT, + BIGINT, + FLOAT, + DOUBLE, + + TIMESTAMP, + DATE, + + STRING, + VARCHAR, + CHAR, + + BOOLEAN +} diff --git a/hdfswriter/src/main/resources/plugin.json b/hdfswriter/src/main/resources/plugin.json new file mode 100755 index 000000000..384f4a7a8 --- /dev/null +++ b/hdfswriter/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "hdfswriter", + "class": "com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter", + "description": "useScene: prod. 
mechanism: via FileSystem connect HDFS write data concurrent.", + "developer": "alibaba" +} + diff --git a/hdfswriter/src/main/resources/plugin_job_template.json b/hdfswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..08b4ab62e --- /dev/null +++ b/hdfswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "hdfswriter", + "parameter": { + "defaultFS": "", + "fileType": "", + "path": "", + "fileName": "", + "column": [], + "writeMode": "", + "fieldDelimiter": "", + "compress":"" + } +} \ No newline at end of file diff --git a/mongodbreader/doc/mongodbreader.md b/mongodbreader/doc/mongodbreader.md new file mode 100644 index 000000000..b716b8d38 --- /dev/null +++ b/mongodbreader/doc/mongodbreader.md @@ -0,0 +1,149 @@ +### Datax MongoDBReader +#### 1 快速介绍 + +MongoDBReader 插件利用 MongoDB 的java客户端MongoClient进行MongoDB的读操作。最新版本的Mongo已经将DB锁的粒度从DB级别降低到document级别,配合上MongoDB强大的索引功能,基本可以达到高性能的读取MongoDB的需求。 + +#### 2 实现原理 + +MongoDBReader通过Datax框架从MongoDB并行的读取数据,通过主控的JOB程序按照指定的规则对MongoDB中的数据进行分片,并行读取,然后将MongoDB支持的类型通过逐一判断转换成Datax支持的类型。 + +#### 3 功能说明 +* 该示例从ODPS读一份数据到MongoDB。 + + { + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "mongodbreader", + "parameter": { + "address": ["127.0.0.1:27017"], + "userName": "", + "userPassword": "", + "dbName": "tag_per_data", + "collectionName": "tag_data12", + "column": [ + { + "name": "unique_id", + "type": "string" + }, + { + "name": "sid", + "type": "string" + }, + { + "name": "user_id", + "type": "string" + }, + { + "name": "auction_id", + "type": "string" + }, + { + "name": "content_type", + "type": "string" + }, + { + "name": "pool_type", + "type": "string" + }, + { + "name": "frontcat_id", + "type": "Array", + "spliter": "" + }, + { + "name": "categoryid", + "type": "Array", + "spliter": "" + }, + { + "name": "gmt_create", + "type": "string" + }, + { + "name": "taglist", + "type": "Array", + "spliter": " " + }, + { + "name": "property", + "type": "string" + }, + { + "name": "scorea", + "type": "int" + }, + { + "name": "scoreb", + "type": "int" + }, + { + "name": "scorec", + "type": "int" + } + ] + } + }, + "writer": { + "name": "odpswriter", + "parameter": { + "project": "tb_ai_recommendation", + "table": "jianying_tag_datax_read_test01", + "column": [ + "unique_id", + "sid", + "user_id", + "auction_id", + "content_type", + "pool_type", + "frontcat_id", + "categoryid", + "gmt_create", + "taglist", + "property", + "scorea", + "scoreb" + ], + "accessId": "**************", + "accessKey": "********************", + "truncate": true, + "odpsServer": "http://service-corp.odps.aliyun-inc.com/api", + "tunnelServer": "http://dt-corp.odps.aliyun-inc.com", + "accountType": "aliyun" + } + } + } + ] + } + } +#### 4 参数说明 + +* address: MongoDB的数据地址信息,因为MonogDB可能是个集群,则ip端口信息需要以Json数组的形式给出。【必填】 +* userName:MongoDB的用户名。【选填】 +* userPassword: MongoDB的密码。【选填】 +* collectionName: MonogoDB的集合名。【必填】 +* column:MongoDB的文档列名。【必填】 +* name:Column的名字。【必填】 +* type:Column的类型。【选填】 +* splitter:因为MongoDB支持数组类型,但是Datax框架本身不支持数组类型,所以mongoDB读出来的数组类型要通过这个分隔符合并成字符串。【选填】 + +#### 5 类型转换 + +| DataX 内部类型| MongoDB 数据类型 | +| -------- | ----- | +| Long | int, Long | +| Double | double | +| String | string, array | +| Date | date | +| Boolean | boolean | +| Bytes | bytes | + + +#### 6 性能报告 +#### 7 测试报告 \ No newline at end of file diff --git a/mongodbreader/mongodbreader.iml b/mongodbreader/mongodbreader.iml new file mode 100644 index 000000000..2d482f6d0 --- /dev/null +++ 
b/mongodbreader/mongodbreader.iml @@ -0,0 +1,31 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/mongodbreader/pom.xml b/mongodbreader/pom.xml new file mode 100644 index 000000000..6f9736a34 --- /dev/null +++ b/mongodbreader/pom.xml @@ -0,0 +1,88 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + mongodbreader + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + + + + org.mongodb + mongo-java-driver + 3.0.3 + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/mongodbreader/src/main/assembly/package.xml b/mongodbreader/src/main/assembly/package.xml new file mode 100644 index 000000000..a7e967f90 --- /dev/null +++ b/mongodbreader/src/main/assembly/package.xml @@ -0,0 +1,36 @@ + + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/mongodbreader + + + target/ + + mongodbreader-0.0.1-SNAPSHOT.jar + + plugin/reader/mongodbreader + + + + + + false + plugin/reader/mongodbreader/libs + runtime + + + diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/KeyConstant.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/KeyConstant.java new file mode 100644 index 000000000..d70966476 --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/KeyConstant.java @@ -0,0 +1,67 @@ +package com.alibaba.datax.plugin.reader.mongodbreader; + +/** + * Created by jianying.wcj on 2015/3/17 0017. 
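+ *
+ * Central definition of the configuration key names that MongoDBReader looks up in its job JSON
+ * ("address", "userName", "userPassword", "dbName", "collectionName", "column", "skipCount",
+ * "batchSize", ...) together with the small isArrayType() helper, so the key strings are not
+ * scattered through the reader code.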
+ */ +public class KeyConstant { + /** + * 数组类型 + */ + public static final String ARRAY_TYPE = "array"; + /** + * mongodb 的 host 地址 + */ + public static final String MONGO_ADDRESS = "address"; + /** + * mongodb 的用户名 + */ + public static final String MONGO_USER_NAME = "userName"; + /** + * mongodb 密码 + */ + public static final String MONGO_USER_PASSWORD = "userPassword"; + /** + * mongodb 数据库名 + */ + public static final String MONGO_DB_NAME = "dbName"; + /** + * mongodb 集合名 + */ + public static final String MONGO_COLLECTION_NAME = "collectionName"; + /** + * mongodb 的列 + */ + public static final String MONGO_COLUMN = "column"; + /** + * 每个列的名字 + */ + public static final String COLUMN_NAME = "name"; + /** + * 每个列的类型 + */ + public static final String COLUMN_TYPE = "type"; + /** + * 列分隔符 + */ + public static final String COLUMN_SPLITTER = "splitter"; + /** + * 跳过的列数 + */ + public static final String SKIP_COUNT = "skipCount"; + /** + * 批量获取的记录数 + */ + public static final String BATCH_SIZE = "batchSize"; + /** + * MongoDB的idmeta + */ + public static final String MONGO_PRIMIARY_ID_META = "_id"; + /** + * 判断是否为数组类型 + * @param type 数据类型 + * @return + */ + public static boolean isArrayType(String type) { + return ARRAY_TYPE.equals(type); + } +} diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReader.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReader.java new file mode 100644 index 000000000..c5ae37d6a --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReader.java @@ -0,0 +1,173 @@ +package com.alibaba.datax.plugin.reader.mongodbreader; +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.mongodbreader.util.CollectionSplitUtil; +import com.alibaba.datax.plugin.reader.mongodbreader.util.MongoUtil; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.JSONArray; +import com.alibaba.fastjson.JSONObject; +import com.google.common.base.Joiner; +import com.google.common.base.Strings; +import com.mongodb.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.lang.reflect.Array; +import java.util.*; + +/** + * Created by jianying.wcj on 2015/3/19 0019. 
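+ *
+ * Job.split() delegates to CollectionSplitUtil, which cuts the collection into
+ * (skipCount, batchSize) intervals, one per task. Each Task then pages through its interval
+ * sorted by "_id" with skip()/limit() and a fixed pageSize of 1000, converting every document
+ * field into the matching DataX Column type before handing the record to the framework.
+ *
+ * A minimal reader configuration, with placeholder values, looks like:
+ * <pre>{@code
+ * "reader": {
+ *   "name": "mongodbreader",
+ *   "parameter": {
+ *     "address": ["127.0.0.1:27017"],
+ *     "dbName": "mydb",
+ *     "collectionName": "mycol",
+ *     "column": [ { "name": "field1", "type": "string" } ]
+ *   }
+ * }
+ * }</pre>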
+ */ +public class MongoDBReader extends Reader { + + public static class Job extends Reader.Job { + + private Configuration originalConfig = null; + + private MongoClient mongoClient; + + private String userName = null; + private String password = null; + + @Override + public List split(int adviceNumber) { + return CollectionSplitUtil.doSplit(originalConfig,adviceNumber,mongoClient); + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + this.userName = originalConfig.getString(KeyConstant.MONGO_USER_NAME); + this.password = originalConfig.getString(KeyConstant.MONGO_USER_PASSWORD); + String database = originalConfig.getString(KeyConstant.MONGO_DB_NAME); + if(!Strings.isNullOrEmpty(this.userName) && !Strings.isNullOrEmpty(this.password)) { + this.mongoClient = MongoUtil.initCredentialMongoClient(originalConfig,userName,password,database); + } else { + this.mongoClient = MongoUtil.initMongoClient(originalConfig); + } + } + + @Override + public void destroy() { + + } + } + + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + + private MongoClient mongoClient; + + private String userName = null; + private String password = null; + + private String database = null; + private String collection = null; + + private JSONArray mongodbColumnMeta = null; + private Long batchSize = null; + /** + * 用来控制每个task取值的offset + */ + private Long skipCount = null; + /** + * 每页数据的大小 + */ + private int pageSize = 1000; + + @Override + public void startRead(RecordSender recordSender) { + + if(batchSize == null || + mongoClient == null || database == null || + collection == null || mongodbColumnMeta == null) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE, + MongoDBReaderErrorCode.ILLEGAL_VALUE.getDescription()); + } + DB db = mongoClient.getDB(database); + DBCollection col = db.getCollection(this.collection); + DBObject obj = new BasicDBObject(); + obj.put(KeyConstant.MONGO_PRIMIARY_ID_META,1); + + long pageCount = batchSize / pageSize; + long modCount = batchSize % pageSize; + + for(int i = 0; i <= pageCount; i++) { + skipCount += i * pageSize; + if(modCount == 0 && i == pageCount) { + break; + } + if (i == pageCount) { + pageCount = modCount; + } + DBCursor dbCursor = col.find().sort(obj).skip(skipCount.intValue()).limit(pageSize); + while (dbCursor.hasNext()) { + DBObject item = dbCursor.next(); + Record record = recordSender.createRecord(); + Iterator columnItera = mongodbColumnMeta.iterator(); + while (columnItera.hasNext()) { + JSONObject column = (JSONObject)columnItera.next(); + Object tempCol = item.get(column.getString(KeyConstant.COLUMN_NAME)); + if (tempCol == null) { + continue; + } + if (tempCol instanceof Double) { + record.addColumn(new DoubleColumn((Double) tempCol)); + } else if (tempCol instanceof Boolean) { + record.addColumn(new BoolColumn((Boolean) tempCol)); + } else if (tempCol instanceof Date) { + record.addColumn(new DateColumn((Date) tempCol)); + } else if (tempCol instanceof Integer) { + record.addColumn(new LongColumn((Integer) tempCol)); + }else if (tempCol instanceof Long) { + record.addColumn(new LongColumn((Long) tempCol)); + } else { + if(KeyConstant.isArrayType(column.getString(KeyConstant.COLUMN_TYPE))) { + String splitter = column.getString(KeyConstant.COLUMN_SPLITTER); + if(Strings.isNullOrEmpty(splitter)) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE, + MongoDBReaderErrorCode.ILLEGAL_VALUE.getDescription()); + } else { + ArrayList 
array = (ArrayList)tempCol; + String tempArrayStr = Joiner.on(splitter).join(array); + record.addColumn(new StringColumn(tempArrayStr)); + } + } else { + record.addColumn(new StringColumn(tempCol.toString())); + } + } + } + recordSender.sendToWriter(record); + } + } + } + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.userName = readerSliceConfig.getString(KeyConstant.MONGO_USER_NAME); + this.password = readerSliceConfig.getString(KeyConstant.MONGO_USER_PASSWORD); + this.database = readerSliceConfig.getString(KeyConstant.MONGO_DB_NAME); + if(!Strings.isNullOrEmpty(userName) && !Strings.isNullOrEmpty(password)) { + mongoClient = MongoUtil.initCredentialMongoClient(readerSliceConfig,userName,password,database); + } else { + mongoClient = MongoUtil.initMongoClient(readerSliceConfig); + } + + this.collection = readerSliceConfig.getString(KeyConstant.MONGO_COLLECTION_NAME); + this.mongodbColumnMeta = JSON.parseArray(readerSliceConfig.getString(KeyConstant.MONGO_COLUMN)); + this.batchSize = readerSliceConfig.getLong(KeyConstant.BATCH_SIZE); + this.skipCount = readerSliceConfig.getLong(KeyConstant.SKIP_COUNT); + } + + @Override + public void destroy() { + + } + } +} diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReaderErrorCode.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReaderErrorCode.java new file mode 100644 index 000000000..4b3780c26 --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReaderErrorCode.java @@ -0,0 +1,33 @@ +package com.alibaba.datax.plugin.reader.mongodbreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by jianying.wcj on 2015/3/19 0019. + */ +public enum MongoDBReaderErrorCode implements ErrorCode { + + ILLEGAL_VALUE("ILLEGAL_PARAMETER_VALUE","参数不合法"), + ILLEGAL_ADDRESS("ILLEGAL_ADDRESS","不合法的Mongo地址"), + UNEXCEPT_EXCEPTION("UNEXCEPT_EXCEPTION","未知异常"); + + private final String code; + + private final String description; + + private MongoDBReaderErrorCode(String code,String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return code; + } + + @Override + public String getDescription() { + return description; + } +} + diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/CollectionSplitUtil.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/CollectionSplitUtil.java new file mode 100644 index 000000000..3ff758ed9 --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/CollectionSplitUtil.java @@ -0,0 +1,77 @@ +package com.alibaba.datax.plugin.reader.mongodbreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.mongodbreader.KeyConstant; +import com.alibaba.datax.plugin.reader.mongodbreader.MongoDBReaderErrorCode; +import com.google.common.base.Strings; +import com.mongodb.*; + +import java.util.ArrayList; +import java.util.List; + +/** + * Created by jianying.wcj on 2015/3/19 0019. 
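+ *
+ * Splits a collection purely by document count: doSplitInterval() computes
+ * batchSize = totalCount / adviceNumber and gives task i the offset i * batchSize, enlarging the
+ * last interval so the remainder of the division is not dropped (e.g. totalCount 100 with
+ * adviceNumber 6 gives batchSize 16 and a remainder of 4 that the last task picks up).
+ * doSplit() then clones the job configuration once per interval and stores the offset and size
+ * under the skipCount and batchSize keys.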
+ */ +public class CollectionSplitUtil { + + public static List doSplit( + Configuration originalSliceConfig,int adviceNumber,MongoClient mongoClient) { + + List confList = new ArrayList(); + + String dbName = originalSliceConfig.getString(KeyConstant.MONGO_DB_NAME); + + String collectionName = originalSliceConfig.getString(KeyConstant.MONGO_COLLECTION_NAME); + + if(Strings.isNullOrEmpty(dbName) || Strings.isNullOrEmpty(collectionName) || mongoClient == null) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE, + MongoDBReaderErrorCode.ILLEGAL_VALUE.getDescription()); + } + + DB db = mongoClient.getDB(dbName); + DBCollection collection = db.getCollection(collectionName); + + List countInterval = doSplitInterval(adviceNumber,collection); + for(Entry interval : countInterval) { + Configuration conf = originalSliceConfig.clone(); + conf.set(KeyConstant.SKIP_COUNT,interval.interval); + conf.set(KeyConstant.BATCH_SIZE,interval.batchSize); + confList.add(conf); + } + return confList; + } + + private static List doSplitInterval(int adviceNumber,DBCollection collection) { + + List intervalCountList = new ArrayList(); + + long totalCount = collection.count(); + if(totalCount < 0) { + return intervalCountList; + } + // 100 6 => 16 mod 4 + long batchSize = totalCount/adviceNumber; + for(int i = 0; i < adviceNumber; i++) { + Entry entry = new Entry(); + /** + * 这个判断确认不会丢失最后一页数据, + * 因为 totalCount/adviceNumber 不整除时,如果不做判断会丢失最后一页 + */ + if(i == (adviceNumber - 1)) { + entry.batchSize = batchSize + adviceNumber; + } else { + entry.batchSize = batchSize; + } + entry.interval = batchSize * i; + intervalCountList.add(entry); + } + return intervalCountList; + } + +} + +class Entry { + Long interval; + Long batchSize; +} diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/MongoUtil.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/MongoUtil.java new file mode 100644 index 000000000..51d4bea7a --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/MongoUtil.java @@ -0,0 +1,99 @@ +package com.alibaba.datax.plugin.reader.mongodbreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.mongodbreader.KeyConstant; +import com.alibaba.datax.plugin.reader.mongodbreader.MongoDBReaderErrorCode; +import com.mongodb.MongoClient; +import com.mongodb.MongoCredential; +import com.mongodb.ServerAddress; + +import java.net.UnknownHostException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +/** + * Created by jianying.wcj on 2015/3/17 0017. 
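+ *
+ * Builds the MongoClient used by MongoDBReader: initMongoClient() connects without
+ * authentication, while initCredentialMongoClient() additionally creates a MongoCredential from
+ * the configured user name, password and database. Both parse the "address" list (entries of the
+ * form "host:port") into ServerAddress objects and translate connection failures into
+ * MongoDBReaderErrorCode exceptions.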
+ */ +public class MongoUtil { + + public static MongoClient initMongoClient(Configuration conf) { + + List addressList = conf.getList(KeyConstant.MONGO_ADDRESS); + if(addressList == null || addressList.size() <= 0) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE,"不合法参数"); + } + try { + return new MongoClient(parseServerAddress(addressList)); + } catch (UnknownHostException e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_ADDRESS,"不合法的地址"); + } catch (NumberFormatException e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE,"不合法参数"); + } catch (Exception e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.UNEXCEPT_EXCEPTION,"未知异常"); + } + } + + public static MongoClient initCredentialMongoClient(Configuration conf,String userName,String password,String database) { + + List addressList = conf.getList(KeyConstant.MONGO_ADDRESS); + if(!isHostPortPattern(addressList)) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE,"不合法参数"); + } + try { + MongoCredential credential = MongoCredential.createCredential(userName, database, password.toCharArray()); + return new MongoClient(parseServerAddress(addressList), Arrays.asList(credential)); + + } catch (UnknownHostException e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_ADDRESS,"不合法的地址"); + } catch (NumberFormatException e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE,"不合法参数"); + } catch (Exception e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.UNEXCEPT_EXCEPTION,"未知异常"); + } + } + /** + * 判断地址类型是否符合要求 + * @param addressList + * @return + */ + private static boolean isHostPortPattern(List addressList) { + boolean isMatch = false; + for(Object address : addressList) { + String regex = "([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+):([0-9]+)"; + if(((String)address).matches(regex)) { + isMatch = true; + } + } + return isMatch; + } + /** + * 转换为mongo地址协议 + * @param rawAddressList + * @return + */ + private static List parseServerAddress(List rawAddressList) throws UnknownHostException{ + List addressList = new ArrayList(); + for(Object address : rawAddressList) { + String[] tempAddress = ((String)address).split(":"); + try { + ServerAddress sa = new ServerAddress(tempAddress[0],Integer.valueOf(tempAddress[1])); + addressList.add(sa); + } catch (Exception e) { + throw new UnknownHostException(); + } + } + return addressList; + } + + public static void main(String[] args) { + try { + ArrayList hostAddress = new ArrayList(); + hostAddress.add("127.0.0.1:27017"); + System.out.println(MongoUtil.isHostPortPattern(hostAddress)); + } catch (Exception e) { + e.printStackTrace(); + } + } +} diff --git a/mongodbreader/src/main/resources/plugin.json b/mongodbreader/src/main/resources/plugin.json new file mode 100644 index 000000000..1bb1d262d --- /dev/null +++ b/mongodbreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "mongodbreader", + "class": "com.alibaba.datax.plugin.reader.mongodbreader.MongoDBReader", + "description": "useScene: prod. 
mechanism: via mongoclient connect mongodb reader data concurrent.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/mongodbreader/src/main/resources/plugin_job_template.json b/mongodbreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..367936139 --- /dev/null +++ b/mongodbreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,11 @@ +{ + "name": "mongodbreader", + "parameter": { + "address": [], + "userName": "", + "userPassword": "", + "dbName": "", + "collectionName": "", + "column": [] + } +} \ No newline at end of file diff --git a/mongodbwriter/doc/mongodbwriter.md b/mongodbwriter/doc/mongodbwriter.md new file mode 100644 index 000000000..0e5b79524 --- /dev/null +++ b/mongodbwriter/doc/mongodbwriter.md @@ -0,0 +1,157 @@ +### Datax MongoDBWriter +#### 1 快速介绍 + +MongoDBWriter 插件利用 MongoDB 的java客户端MongoClient进行MongoDB的写操作。最新版本的Mongo已经将DB锁的粒度从DB级别降低到document级别,配合上MongoDB强大的索引功能,基本可以满足数据源向MongoDB写入数据的需求,针对数据更新的需求,通过配置业务主键的方式也可以实现。 + +#### 2 实现原理 + +MongoDBWriter通过Datax框架获取Reader生成的数据,然后将Datax支持的类型通过逐一判断转换成MongoDB支持的类型。其中一个值得指出的点就是Datax本身不支持数组类型,但是MongoDB支持数组类型,并且数组类型的索引还是蛮强大的。为了使用MongoDB的数组类型,则可以通过参数的特殊配置,将字符串可以转换成MongoDB中的数组。类型转换之后,就可以依托于Datax框架并行的写入MongoDB。 + +#### 3 功能说明 +* 该示例从ODPS读一份数据到MongoDB。 + + { + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "odpsreader", + "parameter": { + "accessId": "********", + "accessKey": "*********", + "project": "tb_ai_recommendation", + "table": "jianying_tag_datax_test", + "column": [ + "unique_id", + "sid", + "user_id", + "auction_id", + "content_type", + "pool_type", + "frontcat_id", + "categoryid", + "gmt_create", + "taglist", + "property", + "scorea", + "scoreb" + ], + "splitMode": "record", + "odpsServer": "http://service-corp.odps.aliyun-inc.com/api" + } + }, + "writer": { + "name": "mongodbwriter", + "parameter": { + "address": [ + "127.0.0.1:27017" + ], + "userName": "", + "userPassword": "", + "dbName": "tag_per_data", + "collectionName": "tag_data", + "column": [ + { + "name": "unique_id", + "type": "string" + }, + { + "name": "sid", + "type": "string" + }, + { + "name": "user_id", + "type": "string" + }, + { + "name": "auction_id", + "type": "string" + }, + { + "name": "content_type", + "type": "string" + }, + { + "name": "pool_type", + "type": "string" + }, + { + "name": "frontcat_id", + "type": "Array", + "splitter": " " + }, + { + "name": "categoryid", + "type": "Array", + "splitter": " " + }, + { + "name": "gmt_create", + "type": "string" + }, + { + "name": "taglist", + "type": "Array", + "splitter": " " + }, + { + "name": "property", + "type": "string" + }, + { + "name": "scorea", + "type": "int" + }, + { + "name": "scoreb", + "type": "int" + }, + { + "name": "scorec", + "type": "int" + } + ], + "upsertInfo": { + "isUpsert": "true", + "upsertKey": "unique_id" + } + } + } + } + ] + } + } + +#### 4 参数说明 + +* address: MongoDB的数据地址信息,因为MonogDB可能是个集群,则ip端口信息需要以Json数组的形式给出。【必填】 +* userName:MongoDB的用户名。【选填】 +* userPassword: MongoDB的密码。【选填】 +* collectionName: MonogoDB的集合名。【必填】 +* column:MongoDB的文档列名。【必填】 +* name:Column的名字。【必填】 +* type:Column的类型。【选填】 +* splitter:特殊分隔符,当且仅当要处理的字符串要用分隔符分隔为字符数组时,才使用这个参数,通过这个参数指定的分隔符,将字符串分隔存储到MongoDB的数组中。【选填】 +* upsertInfo:指定了传输数据时更新的信息。【选填】 +* isUpsert:当设置为true时,表示针对相同的upsertKey做更新操作。【选填】 +* upsertKey:upsertKey指定了没行记录的业务主键。用来做更新时使用。【选填】 + +#### 5 类型转换 + +| DataX 内部类型| MongoDB 数据类型 | +| -------- | ----- | +| Long | int, Long | +| Double | double | +| String | string, array | +| Date | date | +| 
Boolean | boolean | +| Bytes | bytes | + + +#### 6 性能报告 +#### 7 测试报告 \ No newline at end of file diff --git a/mongodbwriter/mongodbwriter.iml b/mongodbwriter/mongodbwriter.iml new file mode 100644 index 000000000..1bc9c81c6 --- /dev/null +++ b/mongodbwriter/mongodbwriter.iml @@ -0,0 +1,52 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/mongodbwriter/pom.xml b/mongodbwriter/pom.xml new file mode 100644 index 000000000..0504f1373 --- /dev/null +++ b/mongodbwriter/pom.xml @@ -0,0 +1,92 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + mongodbwriter + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + + + + org.mongodb + mongo-java-driver + 3.0.3 + + + com.google.guava + guava + 16.0.1 + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/mongodbwriter/src/main/assembly/package.xml b/mongodbwriter/src/main/assembly/package.xml new file mode 100644 index 000000000..9225be35e --- /dev/null +++ b/mongodbwriter/src/main/assembly/package.xml @@ -0,0 +1,36 @@ + + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/mongodbwriter + + + target/ + + mongodbwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/mongodbwriter + + + + + + false + plugin/writer/mongodbwriter/libs + runtime + + + diff --git a/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/KeyConstant.java b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/KeyConstant.java new file mode 100644 index 000000000..8207d7212 --- /dev/null +++ b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/KeyConstant.java @@ -0,0 +1,79 @@ +package com.alibaba.datax.plugin.writer.mongodbwriter; + +/** + * Created by jianying.wcj on 2015/3/17 0017. 
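+ *
+ * Configuration key names used by MongoDBWriter. They mirror the reader side ("address",
+ * "userName", "userPassword", "dbName", "collectionName", "column") and add the write-specific
+ * keys "itemtype", "upsertInfo", "isUpsert" and "upsertKey", plus the isArrayType() and
+ * isValueTrue() helpers.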
+ */ +public class KeyConstant { + /** + * mongodb 的 host 地址 + */ + public static final String MONGO_ADDRESS = "address"; + /** + * 数组类型 + */ + public static final String ARRAY_TYPE = "array"; + /** + * mongodb 的用户名 + */ + public static final String MONGO_USER_NAME = "userName"; + /** + * mongodb 密码 + */ + public static final String MONGO_USER_PASSWORD = "userPassword"; + /** + * mongodb 数据库名 + */ + public static final String MONGO_DB_NAME = "dbName"; + /** + * mongodb 集合名 + */ + public static final String MONGO_COLLECTION_NAME = "collectionName"; + /** + * mongodb 的列 + */ + public static final String MONGO_COLUMN = "column"; + /** + * 每个列的名字 + */ + public static final String COLUMN_NAME = "name"; + /** + * 每个列的类型 + */ + public static final String COLUMN_TYPE = "type"; + /** + * 数组中每个元素的类型 + */ + public static final String ITEM_TYPE = "itemtype"; + /** + * 列分隔符 + */ + public static final String COLUMN_SPLITTER = "splitter"; + /** + * 数据更新列信息 + */ + public static final String UPSERT_INFO = "upsertInfo"; + /** + * 有相同的记录是否覆盖,默认为false + */ + public static final String IS_UPSERT = "isUpsert"; + /** + * 指定用来判断是否覆盖的 业务主键 + */ + public static final String UNIQUE_KEY = "upsertKey"; + /** + * 判断是否为数组类型 + * @param type 数据类型 + * @return + */ + public static boolean isArrayType(String type) { + return ARRAY_TYPE.equals(type); + } + /** + * 判断一个值是否为true + * @param value + * @return + */ + public static boolean isValueTrue(String value){ + return "true".equals(value); + } +} diff --git a/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriter.java b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriter.java new file mode 100644 index 000000000..692d9a522 --- /dev/null +++ b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriter.java @@ -0,0 +1,309 @@ +package com.alibaba.datax.plugin.writer.mongodbwriter; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.writer.Key; +import com.alibaba.datax.plugin.writer.mongodbwriter.util.MongoUtil; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.JSONArray; +import com.alibaba.fastjson.JSONObject; +import com.google.common.base.Strings; +import com.mongodb.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +/** + * Created by jianying.wcj on 2015/3/17 0017. 
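+ *
+ * Job.split() simply clones the writer configuration once per channel. Each Task buffers up to
+ * 1000 records, converts every DataX Column into the corresponding BSON value (honouring the
+ * configured column type, the array "splitter" and the optional "itemtype"), and then either
+ * inserts the batch directly or, when upsertInfo.isUpsert is "true", replays it as an unordered
+ * bulk replaceOne() keyed on the configured upsertKey.
+ *
+ * A typical upsert configuration, with placeholder values, looks like:
+ * <pre>{@code
+ * "upsertInfo": { "isUpsert": "true", "upsertKey": "unique_id" }
+ * }</pre>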
+ */ +public class MongoDBWriter extends Writer{ + + public static class Job extends Writer.Job { + + private Configuration originalConfig = null; + + @Override + public List split(int mandatoryNumber) { + List configList = new ArrayList(); + for(int i = 0; i < mandatoryNumber; i++) { + configList.add(this.originalConfig.clone()); + } + return configList; + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + } + + @Override + public void prepare() { + super.prepare(); + } + + @Override + public void destroy() { + + } + } + + public static class Task extends Writer.Task { + + private static final Logger logger = LoggerFactory.getLogger(Task.class); + private Configuration writerSliceConfig; + private MongoClient mongoClient; + + private String userName = null; + private String password = null; + + private String database = null; + private String collection = null; + private Integer batchSize = null; + private JSONArray mongodbColumnMeta = null; + private JSONObject upsertInfoMeta = null; + private static int BATCH_SIZE = 1000; + + @Override + public void prepare() { + super.prepare(); + //获取presql配置,并执行 + String preSql = writerSliceConfig.getString(Key.PRE_SQL); + if(Strings.isNullOrEmpty(preSql)) { + return; + } + Configuration conConf = Configuration.from(preSql); + if(Strings.isNullOrEmpty(database) || Strings.isNullOrEmpty(collection) + || mongoClient == null || mongodbColumnMeta == null || batchSize == null) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE, + MongoDBWriterErrorCode.ILLEGAL_VALUE.getDescription()); + } + DB db = mongoClient.getDB(database); + DBCollection col = db.getCollection(this.collection); + String type = conConf.getString("type"); + if (Strings.isNullOrEmpty(type)){ + return; + } + if (type.equals("drop")){ + col.drop(); + } else if (type.equals("remove")){ + String json = conConf.getString("json"); + BasicDBObject query; + if (Strings.isNullOrEmpty(json)) { + query = new BasicDBObject(); + List items = conConf.getList("item", Object.class); + for (Object con : items) { + Configuration _conf = Configuration.from(con.toString()); + if (Strings.isNullOrEmpty(_conf.getString("condition"))) { + query.put(_conf.getString("name"), _conf.get("value")); + } else { + query.put(_conf.getString("name"), + new BasicDBObject(_conf.getString("condition"), _conf.get("value"))); + } + } +// and { "pv" : { "$gt" : 200 , "$lt" : 3000} , "pid" : { "$ne" : "xxx"}} +// or { "$or" : [ { "age" : { "$gt" : 27}} , { "age" : { "$lt" : 15}}]} + } else { + query = (BasicDBObject) com.mongodb.util.JSON.parse(json); + } + col.remove(query); + } + if(logger.isDebugEnabled()) { + logger.debug("After job prepare(), originalConfig now is:[\n{}\n]", writerSliceConfig.toJSON()); + } + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + if(Strings.isNullOrEmpty(database) || Strings.isNullOrEmpty(collection) + || mongoClient == null || mongodbColumnMeta == null || batchSize == null) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE, + MongoDBWriterErrorCode.ILLEGAL_VALUE.getDescription()); + } + DB db = mongoClient.getDB(database); + DBCollection col = db.getCollection(this.collection); + List writerBuffer = new ArrayList(this.batchSize); + Record record = null; + while((record = lineReceiver.getFromReader()) != null) { + writerBuffer.add(record); + if(writerBuffer.size() >= this.batchSize) { + doBatchInsert(col,writerBuffer,mongodbColumnMeta); + writerBuffer.clear(); + } + } + 
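+ // flush any records still sitting in the buffer once the reader channel is drained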
if(!writerBuffer.isEmpty()) { + doBatchInsert(col,writerBuffer,mongodbColumnMeta); + writerBuffer.clear(); + } + } + + private void doBatchInsert(DBCollection collection,List writerBuffer, JSONArray columnMeta) { + + List dataList = new ArrayList(); + + for(Record record : writerBuffer) { + + BasicDBObject data = new BasicDBObject(); + + for(int i = 0; i < record.getColumnNumber(); i++) { + + String type = columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_TYPE); + if (Strings.isNullOrEmpty(record.getColumn(i).asString())) { + if (KeyConstant.isArrayType(type.toLowerCase())) { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), new Object[0]); + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), record.getColumn(i).asString()); + } + continue; + } + if (Column.Type.INT.name().equalsIgnoreCase(type)){ + //配置文件中的type是没有用的,除了int,其他均按照保存时Column的类型进行处理 + try { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), + Integer.parseInt(String.valueOf(record.getColumn(i).getRawData()))); + } catch (Exception e) { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),record.getColumn(i).asString()); + e.printStackTrace(); + } + } else if(record.getColumn(i) instanceof StringColumn){ + //处理数组类型 + try { + if(KeyConstant.isArrayType(type.toLowerCase())) { + String splitter = columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_SPLITTER); + if (Strings.isNullOrEmpty(splitter)) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE, + MongoDBWriterErrorCode.ILLEGAL_VALUE.getDescription()); + } + String itemType = columnMeta.getJSONObject(i).getString(KeyConstant.ITEM_TYPE); + if (itemType != null && !itemType.isEmpty()) { + //如果数组指定类型不为空,将其转换为指定类型 + String[] item = record.getColumn(i).asString().split(splitter); + if (itemType.equalsIgnoreCase(Column.Type.DOUBLE.name())) { + ArrayList list = new ArrayList(); + for (String s : item) { + list.add(Double.parseDouble(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Double[0])); + } else if (itemType.equalsIgnoreCase(Column.Type.INT.name())) { + ArrayList list = new ArrayList(); + for (String s : item) { + list.add(Integer.parseInt(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Integer[0])); + } else if (itemType.equalsIgnoreCase(Column.Type.LONG.name())) { + ArrayList list = new ArrayList(); + for (String s : item) { + list.add(Long.parseLong(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Long[0])); + } else if (itemType.equalsIgnoreCase(Column.Type.BOOL.name())) { + ArrayList list = new ArrayList(); + for (String s : item) { + list.add(Boolean.parseBoolean(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Boolean[0])); + } else if (itemType.equalsIgnoreCase(Column.Type.BYTES.name())) { + ArrayList list = new ArrayList(); + for (String s : item) { + list.add(Byte.parseByte(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Byte[0])); + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), record.getColumn(i).asString().split(splitter)); + } + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), record.getColumn(i).asString().split(splitter)); + } + } else if(type.toLowerCase().equalsIgnoreCase("json")) { + 
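+ // parse the string with the Mongo JSON util; if parsing fails, the catch below falls back to storing the raw string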
//如果是json类型,将其进行转换 + Object mode = com.mongodb.util.JSON.parse(record.getColumn(i).asString()); + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),JSON.toJSON(mode)); + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), record.getColumn(i).asString()); + } + } catch (Exception e) { + e.printStackTrace(); + //发生异常就按照默认类型存数 + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), record.getColumn(i).asString()); + } + } else if(record.getColumn(i) instanceof LongColumn) { + + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),record.getColumn(i).asLong()); + + } else if(record.getColumn(i) instanceof DateColumn) { + + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),record.getColumn(i).asDate()); + + } else if(record.getColumn(i) instanceof DoubleColumn) { + + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),record.getColumn(i).asDouble()); + + } else if(record.getColumn(i) instanceof BoolColumn) { + + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),record.getColumn(i).asBoolean()); + + } else if(record.getColumn(i) instanceof BytesColumn) { + + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),record.getColumn(i).asBytes()); + + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),record.getColumn(i).asString()); + } + } + dataList.add(data); + } + /** + * 如果存在重复的值覆盖 + */ + if(this.upsertInfoMeta != null && + this.upsertInfoMeta.getString(KeyConstant.IS_UPSERT) != null && + KeyConstant.isValueTrue(this.upsertInfoMeta.getString(KeyConstant.IS_UPSERT))) { + BulkWriteOperation bulkUpsert = collection.initializeUnorderedBulkOperation(); + String uniqueKey = this.upsertInfoMeta.getString(KeyConstant.UNIQUE_KEY); + if(!Strings.isNullOrEmpty(uniqueKey)) { + for(DBObject data : dataList) { + BasicDBObject query = new BasicDBObject(); + if(uniqueKey != null) { + query.put(uniqueKey,data.get(uniqueKey)); + } + bulkUpsert.find(query).upsert().replaceOne(data); + } + bulkUpsert.execute(); + } else { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE, + MongoDBWriterErrorCode.ILLEGAL_VALUE.getDescription()); + } + } else { + collection.insert(dataList); + } + } + + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.userName = writerSliceConfig.getString(KeyConstant.MONGO_USER_NAME); + this.password = writerSliceConfig.getString(KeyConstant.MONGO_USER_PASSWORD); + this.database = writerSliceConfig.getString(KeyConstant.MONGO_DB_NAME); + if(!Strings.isNullOrEmpty(userName) && !Strings.isNullOrEmpty(password)) { + this.mongoClient = MongoUtil.initCredentialMongoClient(this.writerSliceConfig,userName,password,database); + } else { + this.mongoClient = MongoUtil.initMongoClient(this.writerSliceConfig); + } + this.collection = writerSliceConfig.getString(KeyConstant.MONGO_COLLECTION_NAME); + this.batchSize = BATCH_SIZE; + this.mongodbColumnMeta = JSON.parseArray(writerSliceConfig.getString(KeyConstant.MONGO_COLUMN)); + this.upsertInfoMeta = JSON.parseObject(writerSliceConfig.getString(KeyConstant.UPSERT_INFO)); + } + + @Override + public void destroy() { + mongoClient.close(); + } + } + +} diff --git a/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriterErrorCode.java b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriterErrorCode.java new file mode 100644 
index 000000000..b3a19e4a3 --- /dev/null +++ b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriterErrorCode.java @@ -0,0 +1,33 @@ +package com.alibaba.datax.plugin.writer.mongodbwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by jianying.wcj on 2015/3/17 0017. + */ +public enum MongoDBWriterErrorCode implements ErrorCode { + + ILLEGAL_VALUE("ILLEGAL_PARAMETER_VALUE","参数不合法"), + ILLEGAL_ADDRESS("ILLEGAL_ADDRESS","不合法的Mongo地址"), + JSONCAST_EXCEPTION("JSONCAST_EXCEPTION","json类型转换异常"), + UNEXCEPT_EXCEPTION("UNEXCEPT_EXCEPTION","未知异常"); + + private final String code; + + private final String description; + + private MongoDBWriterErrorCode(String code,String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return code; + } + + @Override + public String getDescription() { + return description; + } +} diff --git a/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/util/MongoUtil.java b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/util/MongoUtil.java new file mode 100644 index 000000000..726bd1fed --- /dev/null +++ b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/util/MongoUtil.java @@ -0,0 +1,103 @@ +package com.alibaba.datax.plugin.writer.mongodbwriter.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.mongodbwriter.KeyConstant; +import com.alibaba.datax.plugin.writer.mongodbwriter.MongoDBWriterErrorCode; +import com.google.common.base.Splitter; +import com.google.common.base.Strings; +import com.mongodb.MongoClient; +import com.mongodb.MongoCredential; +import com.mongodb.ServerAddress; + +import java.net.UnknownHostException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.Map; +import java.util.regex.Pattern; + +/** + * Created by jianying.wcj on 2015/3/17 0017. 
+ */ +public class MongoUtil { + + public static MongoClient initMongoClient(Configuration conf) { + + List addressList = conf.getList(KeyConstant.MONGO_ADDRESS); + if(addressList == null || addressList.size() <= 0) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE,"不合法参数"); + } + try { + return new MongoClient(parseServerAddress(addressList)); + } catch (UnknownHostException e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_ADDRESS,"不合法的地址"); + } catch (NumberFormatException e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE,"不合法参数"); + } catch (Exception e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.UNEXCEPT_EXCEPTION,"未知异常"); + } + } + + public static MongoClient initCredentialMongoClient(Configuration conf,String userName,String password,String database) { + + List addressList = conf.getList(KeyConstant.MONGO_ADDRESS); + if(!isHostPortPattern(addressList)) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE,"不合法参数"); + } + try { + MongoCredential credential = MongoCredential.createCredential(userName, database, password.toCharArray()); + return new MongoClient(parseServerAddress(addressList), Arrays.asList(credential)); + + } catch (UnknownHostException e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_ADDRESS,"不合法的地址"); + } catch (NumberFormatException e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE,"不合法参数"); + } catch (Exception e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.UNEXCEPT_EXCEPTION,"未知异常"); + } + } + /** + * 判断地址类型是否符合要求 + * @param addressList + * @return + */ + private static boolean isHostPortPattern(List addressList) { + boolean isMatch = false; + for(Object address : addressList) { + String regex = "([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+):([0-9]+)"; + if(((String)address).matches(regex)) { + isMatch = true; + } + } + return isMatch; + } + /** + * 转换为mongo地址协议 + * @param rawAddressList + * @return + */ + private static List parseServerAddress(List rawAddressList) throws UnknownHostException{ + List addressList = new ArrayList(); + for(Object address : rawAddressList) { + String[] tempAddress = ((String)address).split(":"); + try { + ServerAddress sa = new ServerAddress(tempAddress[0],Integer.valueOf(tempAddress[1])); + addressList.add(sa); + } catch (Exception e) { + throw new UnknownHostException(); + } + } + return addressList; + } + + public static void main(String[] args) { + try { + ArrayList hostAddress = new ArrayList(); + hostAddress.add("127.0.0.1:27017"); + System.out.println(MongoUtil.isHostPortPattern(hostAddress)); + } catch (Exception e) { + e.printStackTrace(); + } + } +} diff --git a/mongodbwriter/src/main/resources/plugin.json b/mongodbwriter/src/main/resources/plugin.json new file mode 100644 index 000000000..daed8c2f1 --- /dev/null +++ b/mongodbwriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "mongodbwriter", + "class": "com.alibaba.datax.plugin.writer.mongodbwriter.MongoDBWriter", + "description": "useScene: prod. 
mechanism: via mongoclient connect mongodb write data concurrent.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/mongodbwriter/src/main/resources/plugin_job_template.json b/mongodbwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..d4ba4bf1f --- /dev/null +++ b/mongodbwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "mongodbwriter", + "parameter": { + "address": [], + "userName": "", + "userPassword": "", + "dbName": "", + "collectionName": "", + "column": [], + "upsertInfo": { + "isUpsert": "", + "upsertKey": "" + } + } +} \ No newline at end of file diff --git a/mysqlreader/doc/mysqlreader.md b/mysqlreader/doc/mysqlreader.md new file mode 100644 index 000000000..f2618554e --- /dev/null +++ b/mysqlreader/doc/mysqlreader.md @@ -0,0 +1,377 @@ + +# MysqlReader 插件文档 + + +___ + + + +## 1 快速介绍 + +MysqlReader插件实现了从Mysql读取数据。在底层实现上,MysqlReader通过JDBC连接远程Mysql数据库,并执行相应的sql语句将数据从mysql库中SELECT出来。 + +**不同于其他关系型数据库,MysqlReader不支持FetchSize.** + +## 2 实现原理 + +简而言之,MysqlReader通过JDBC连接器连接到远程的Mysql数据库,并根据用户配置的信息生成查询SELECT SQL语句,然后发送到远程Mysql数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,MysqlReader将其拼接为SQL语句发送到Mysql数据库;对于用户配置querySql信息,MysqlReader直接将其发送到Mysql数据库。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从Mysql数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + //设置传输速度,单位为byte/s,DataX运行会尽可能达到该速度但是不超过它. + "byte": 1048576 + } + //出错限制 + "errorLimit": { + //出错的record条数上限,当大于该值即报错。 + "record": 0, + //出错的record百分比上限 1.0表示100%,0.02表示2% + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "mysqlreader", + "parameter": { + // 数据库连接用户名 + "username": "root", + // 数据库连接密码 + "password": "root", + "checkSlave":true, + "column": [ + "id","name" + ], + //切分主键 + "splitPk": "db_id", + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:mysql://127.0.0.1:3306/database" + ] + } + ] + } + }, + "writer": { + //writer类型 + "name": "streamwriter", + //是否打印内容 + "parameter": { + "print":true, + } + } + } + ] + } +} + +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + "speed": { + "channel":1 + } + }, + "content": [ + { + "reader": { + "name": "mysqlreader", + "parameter": { + "username": "root", + "password": "root", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:mysql://bad_ip:3306/database", + "jdbc:mysql://127.0.0.1:bad_port/database", + "jdbc:mysql://127.0.0.1:3306/database" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述,并支持一个库填写多个连接地址。之所以使用JSON数组描述连接信息,是因为阿里集团内部支持多个IP探测,如果配置了多个,MysqlReader可以依次探测ip的可连接性,直到选择一个合法的IP。如果全部连接失败,MysqlReader报错。 注意,jdbcUrl必须包含在connection配置单元中。对于阿里集团外部使用情况,JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照Mysql官方规范,并可以填写连接附件控制信息。具体请参看[Mysql官方文档](http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html)。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取的需要同步的表。使用JSON的数组描述,因此支持多张表同时抽取。当配置为多张表时,用户自己需保证多张表是同一schema结构,MysqlReader不予检查表是否同一逻辑表。注意,table必须包含在connection配置单元中。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用*代表默认使用所有列配置,例如['*']。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照Mysql SQL语法格式: + ["id", "\`table\`", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3" , "true"] + id为普通列名,\`table\`为包含保留字的列名,1为整型数字常量,'bazhen.csy'为字符串常量,null为空指针,to_char(a + 1)为表达式,2.3为浮点数,true为布尔值。 + + * 必选:是
+ + * 默认值:无
+ +* **splitPk** + + * 描述:MysqlReader进行数据抽取时,如果指定splitPk,表示用户希望使用splitPk代表的字段进行数据分片,DataX因此会启动并发任务进行数据同步,这样可以大大提高数据同步的效能。 + + 推荐使用表主键作为splitPk,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + + 目前splitPk仅支持整型、字符串型数据切分,`不支持浮点、日期等其他类型`。如果用户指定其他非支持类型,MysqlReader将报错! + + 如果splitPk不填写,包括不提供splitPk或者splitPk值为空,DataX视作使用单通道同步该表数据。 + + * 必选:否
+ + * 默认值:空
+ +* **where** + + * 描述:筛选条件,MysqlReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。例如在做测试时,可以将where条件指定为limit 10;在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate。
+ + where条件可以有效地进行业务增量同步(与splitPk配合的写法可参考本节末尾的示例片段)。如果不填写where语句,包括不提供where的key或者value,DataX均视作同步全量数据。 + + * 必选:否
+ + * 默认值:无
+ +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置项来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table,column这些配置项,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id
+ + `当用户配置querySql时,MysqlReader直接忽略table、column、where条件的配置`,querySql优先级大于table、column、where选项。 + + * 必选:否
+ + * 默认值:无
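下面给出一个仅作示意的 reader parameter 配置片段,演示多张同构分表、splitPk 切分、where 增量条件以及在 jdbcUrl 上附加连接控制属性的配合写法。其中 biz_order_00、biz_order_01、gmt_create、$bizdate 等表名、字段名与调度变量均为假设的示例值,请按实际情况替换:

```
{
    "name": "mysqlreader",
    "parameter": {
        "username": "root",
        "password": "root",
        "column": ["id", "name", "gmt_create"],
        "splitPk": "id",
        "where": "gmt_create > $bizdate",
        "connection": [
            {
                "table": ["biz_order_00", "biz_order_01"],
                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/database?useUnicode=true&characterEncoding=utf8"]
            }
        ]
    }
}
```

如 5.4 节所述,这类带时间戳或自增 ID 的 where 条件即可用于增量数据抽取。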
+ + +### 3.3 类型转换 + +目前MysqlReader支持大部分Mysql类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出MysqlReader针对Mysql类型转换列表: + + +| DataX 内部类型| Mysql 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, mediumint, int, bigint| +| Double |float, double, decimal| +| String |varchar, char, tinytext, text, mediumtext, longtext, year | +| Date |date, datetime, timestamp, time | +| Boolean |bit, bool | +| Bytes |tinyblob, mediumblob, blob, longblob, varbinary | + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 +* `tinyint(1) DataX视作为整形`。 +* `year DataX视作为字符串类型` +* `bit DataX属于未定义行为`。 + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + + CREATE TABLE `tc_biz_vertical_test_0000` ( + `biz_order_id` bigint(20) NOT NULL COMMENT 'id', + `key_value` varchar(4000) NOT NULL COMMENT 'Key-value的内容', + `gmt_create` datetime NOT NULL COMMENT '创建时间', + `gmt_modified` datetime NOT NULL COMMENT '修改时间', + `attribute_cc` int(11) DEFAULT NULL COMMENT '防止并发修改的标志', + `value_type` int(11) NOT NULL DEFAULT '0' COMMENT '类型', + `buyer_id` bigint(20) DEFAULT NULL COMMENT 'buyerid', + `seller_id` bigint(20) DEFAULT NULL COMMENT 'seller_id', + PRIMARY KEY (`biz_order_id`,`value_type`), + KEY `idx_biz_vertical_gmtmodified` (`gmt_modified`) + ) ENGINE=InnoDB DEFAULT CHARSET=gbk COMMENT='tc_biz_vertical' + + +单行记录类似于: + + biz_order_id: 888888888 + key_value: ;orderIds:20148888888,2014888888813800; + gmt_create: 2011-09-24 11:07:20 + gmt_modified: 2011-10-24 17:56:34 + attribute_cc: 1 + value_type: 3 + buyer_id: 8888888 + seller_id: 1 + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 24核 Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz + 2. mem: 48GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* Mysql数据库机器参数为: + 1. cpu: 32核 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz + 2. mem: 256GB + 3. net: 千兆双网卡 + 4. disc: BTWL419303E2800RGN INTEL SSDSC2BB800G4 D2010370 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + + +| 通道数| 是否按照主键切分| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡进入流量(MB/s)|DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------|--------| --------|--------|--------|--------|--------|--------| +|1| 否 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|1| 是 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|4| 否 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|4| 是 | 329733 | 32.60 | 58| 0.8 | 60| 0.76 | +|8| 否 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|8| 是 | 549556 | 54.33 | 115| 1.46 | 120| 0.78 | + +说明: + +1. 这里的单表,主键类型为 bigint(20),范围为:190247559466810-570722244711460,从主键范围划分看,数据分布均匀。 +2. 
对单表如果没有安装主键切分,那么配置通道个数不会提升速度,效果与1个通道一样。 + + +#### 4.2.2 分表测试报告(2个分库,每个分库16张分表,共计32张分表) + + +| 通道数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡进入流量(MB/s)|DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------| --------|--------|--------|--------|--------|--------| +|1| 202241 | 20.06 | 31.5| 1.0 | 32 | 1.1 | +|4| 726358 | 72.04 | 123.9 | 3.1 | 132 | 3.6 | +|8|1074405 | 106.56| 197 | 5.5 | 205| 5.1| +|16| 1227892 | 121.79 | 229.2 | 8.1 | 233 | 7.3 | + +## 5 约束限制 + +### 5.1 主备同步数据恢复问题 + +主备同步问题指Mysql使用主从灾备,备库从主库不间断通过binlog恢复数据。由于主备数据同步存在一定的时间差,特别在于某些特定情况,例如网络延迟等问题,导致备库同步恢复的数据与主库有较大差别,导致从备库同步的数据不是一份当前时间的完整镜像。 + +针对这个问题,我们提供了preSql功能,该功能待补充。 + +### 5.2 一致性约束 + +Mysql在数据存储划分中属于RDBMS系统,对外可以提供强一致性数据查询接口。例如当一次同步任务启动运行过程中,当该库存在其他数据写入方写入数据时,MysqlReader完全不会获取到写入更新数据,这是由于数据库本身的快照特性决定的。关于数据库快照特性,请参看[MVCC Wikipedia](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) + +上述是在MysqlReader单线程模型下数据同步一致性的特性,由于MysqlReader可以根据用户配置信息使用了并发数据抽取,因此不能严格保证数据一致性:当MysqlReader根据splitPk进行数据切分后,会先后启动多个并发任务完成数据同步。由于多个并发任务相互之间不属于同一个读事务,同时多个并发任务存在时间间隔。因此这份数据并不是`完整的`、`一致的`数据快照信息。 + +针对多线程的一致性快照需求,在技术上目前无法实现,只能从工程角度解决,工程化的方式存在取舍,我们提供几个解决思路给用户,用户可以自行选择: + +1. 使用单线程同步,即不再进行数据切片。缺点是速度比较慢,但是能够很好保证一致性。 + +2. 关闭其他数据写入方,保证当前数据为静态数据,例如,锁表、关闭备库同步等等。缺点是可能影响在线业务。 + +### 5.3 数据库编码问题 + +Mysql本身的编码设置非常灵活,包括指定编码到库、表、字段级别,甚至可以均不同编码。优先级从高到低为字段、表、库、实例。我们不推荐数据库用户设置如此混乱的编码,最好在库级别就统一到UTF-8。 + +MysqlReader底层使用JDBC进行数据抽取,JDBC天然适配各类编码,并在底层进行了编码转换。因此MysqlReader不需用户指定编码,可以自动获取编码并转码。 + +对于Mysql底层写入编码和其设定的编码不一致的混乱情况,MysqlReader对此无法识别,对此也无法提供解决方案,对于这类情况,`导出有可能为乱码`。 + +### 5.4 增量数据同步 + +MysqlReader使用JDBC SELECT语句完成数据抽取工作,因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种: + +* 数据库在线应用写入数据库时,填充modify字段为更改时间戳,包括新增、更新、删除(逻辑删)。对于这类应用,MysqlReader只需要WHERE条件跟上一同步阶段时间戳即可。 +* 对于新增流水型数据,MysqlReader可以WHERE条件后跟上一阶段最大自增ID即可。 + +对于业务上无字段区分新增、修改数据情况,MysqlReader也无法进行增量数据同步,只能同步全量数据。 + +### 5.5 Sql安全性 + +MysqlReader提供querySql语句交给用户自己实现SELECT抽取语句,MysqlReader本身对querySql不做任何安全性校验。这块交由DataX用户方自己保证。 + +## 6 FAQ + +*** + +**Q: MysqlReader同步报错,报错信息为XXX** + + A: 网络或者权限问题,请使用mysql命令行测试: + + mysql -u -p -h -D -e "select * from <表名>" + +如果上述命令也报错,那可以证实是环境问题,请联系你的DBA。 + + diff --git a/mysqlreader/mysqlreader.iml b/mysqlreader/mysqlreader.iml new file mode 100644 index 000000000..1f713bbe7 --- /dev/null +++ b/mysqlreader/mysqlreader.iml @@ -0,0 +1,46 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/mysqlreader/pom.xml b/mysqlreader/pom.xml new file mode 100755 index 000000000..fe74d9776 --- /dev/null +++ b/mysqlreader/pom.xml @@ -0,0 +1,79 @@ + + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + mysqlreader + mysqlreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + mysql + mysql-connector-java + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/mysqlreader/src/main/assembly/package.xml b/mysqlreader/src/main/assembly/package.xml new file mode 100755 index 000000000..3b35d9381 --- /dev/null +++ b/mysqlreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/mysqlreader + + + target/ + + 
mysqlreader-0.0.1-SNAPSHOT.jar + + plugin/reader/mysqlreader + + + + + + false + plugin/reader/mysqlreader/libs + runtime + + + diff --git a/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReader.java b/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReader.java new file mode 100755 index 000000000..9dfff9c18 --- /dev/null +++ b/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReader.java @@ -0,0 +1,97 @@ +package com.alibaba.datax.plugin.reader.mysqlreader; + +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +public class MysqlReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.MySql; + + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + private Configuration originalConfig = null; + private CommonRdbmsReader.Job commonRdbmsReaderJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + Integer userConfigedFetchSize = this.originalConfig.getInt(Constant.FETCH_SIZE); + if (userConfigedFetchSize != null) { + LOG.warn("对 mysqlreader 不需要配置 fetchSize, mysqlreader 将会忽略这项配置. 如果您不想再看到此警告,请去除fetchSize 配置."); + } + + this.originalConfig.set(Constant.FETCH_SIZE, Integer.MIN_VALUE); + + this.commonRdbmsReaderJob = new CommonRdbmsReader.Job(DATABASE_TYPE); + this.commonRdbmsReaderJob.init(this.originalConfig); + } + + @Override + public void preCheck(){ + init(); + this.commonRdbmsReaderJob.preCheck(this.originalConfig,DATABASE_TYPE); + + } + + @Override + public List split(int adviceNumber) { + return this.commonRdbmsReaderJob.split(this.originalConfig, adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderTask; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderTask = new CommonRdbmsReader.Task(DATABASE_TYPE,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderTask.init(this.readerSliceConfig); + + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig.getInt(Constant.FETCH_SIZE); + + this.commonRdbmsReaderTask.startRead(this.readerSliceConfig, recordSender, + super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderTask.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderTask.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReaderErrorCode.java b/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReaderErrorCode.java new file mode 100755 index 000000000..de9525e9d --- /dev/null +++ 
b/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReaderErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.reader.mysqlreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum MysqlReaderErrorCode implements ErrorCode { + ; + + private final String code; + private final String description; + + private MysqlReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/mysqlreader/src/main/resources/TODO.txt b/mysqlreader/src/main/resources/TODO.txt new file mode 100755 index 000000000..b9e54f0a1 --- /dev/null +++ b/mysqlreader/src/main/resources/TODO.txt @@ -0,0 +1,3 @@ +报错的异常,可重复跑的? +对所有必填参数,进行校验。注意:conf.getString(path) 如果 path 不存在,是直接返回 null,而不是报错。 +字符串主键切分 diff --git a/mysqlreader/src/main/resources/plugin.json b/mysqlreader/src/main/resources/plugin.json new file mode 100755 index 000000000..6a8227b8e --- /dev/null +++ b/mysqlreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "mysqlreader", + "class": "com.alibaba.datax.plugin.reader.mysqlreader.MysqlReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/mysqlreader/src/main/resources/plugin_job_template.json b/mysqlreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..68b25b32c --- /dev/null +++ b/mysqlreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "mysqlreader", + "parameter": { + "username": "", + "password": "", + "column": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ], + "where": "" + } +} \ No newline at end of file diff --git a/mysqlwriter/doc/mysqlwriter.md b/mysqlwriter/doc/mysqlwriter.md new file mode 100644 index 000000000..4fa3e1a58 --- /dev/null +++ b/mysqlwriter/doc/mysqlwriter.md @@ -0,0 +1,359 @@ +# DataX MysqlWriter + + +--- + + +## 1 快速介绍 + +MysqlWriter 插件实现了写入数据到 Mysql 主库的目的表的功能。在底层实现上, MysqlWriter 通过 JDBC 连接远程 Mysql 数据库,并执行相应的 insert into ... 或者 ( replace into ...) 的 sql 语句将数据写入 Mysql,内部会分批次提交入库,需要数据库本身采用 innodb 引擎。 + +MysqlWriter 面向ETL开发工程师,他们使用 MysqlWriter 从数仓导入数据到 Mysql。同时 MysqlWriter 亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +MysqlWriter 通过 DataX 框架获取 Reader 生成的协议数据,根据你配置的 `writeMode` 生成 + + +* `insert into...`(当主键/唯一性索引冲突时会写不进去冲突的行) + +##### 或者 + +* `replace into...`(没有遇到主键/唯一性索引冲突时,与 insert into 行为一致,冲突时会用新行替换原有行所有字段) 的语句写入数据到 Mysql。出于性能考虑,采用了 `PreparedStatement + Batch`,并且设置了:`rewriteBatchedStatements=true`,将数据缓冲到线程上下文 Buffer 中,当 Buffer 累计到预定阈值时,才发起写入请求。 + +
+ + 注意:目的表所在数据库必须是主库才能写入数据;整个任务至少需要具备 insert/replace into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 Mysql 导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "mysqlwriter", + "parameter": { + "writeMode": "insert", + "username": "root", + "password": "root", + "column": [ + "id", + "name" + ], + "session": [ + "set session sql_mode='ANSI'" + ], + "preSql": [ + "delete from test" + ], + "connection": [ + { + "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk", + "table": [ + "test" + ] + } + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息。作业运行时,DataX 会在你提供的 jdbcUrl 后面追加如下属性:yearIsDateType=false&zeroDateTimeBehavior=convertToNull&rewriteBatchedStatements=true + + 注意:1、在一个数据库上只能配置一个 jdbcUrl 值。这与 MysqlReader 支持多个备库探测不同,因为此处不支持同一个数据库存在多个主库的情况(双主导入数据情况) + 2、jdbcUrl按照Mysql官方规范,并可以填写连接附加控制信息,比如想指定连接编码为 gbk ,则在 jdbcUrl 后面追加属性 useUnicode=true&characterEncoding=gbk。具体请参看 Mysql官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。支持写入一个或者多个表。当配置为多张表时,必须确保所有表结构保持一致。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"]。 + + **column配置项必须指定,不能留空!** + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、 column 不能配置任何常量值 + + * 必选:是
+ + * 默认值:无
+ +* **session** + + * 描述: DataX在获取Mysql连接时,执行session指定的SQL语句,修改当前connection session属性 + + * 必须: 否 + + * 默认值: 空 + +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。如果 Sql 中有你需要操作到的表名称,请使用 `@table` 表示,这样在实际执行 Sql 语句时,会对变量按照实际表名称进行替换。比如你的任务是要写入到目的端的100个同构分表(表名称为:datax_00,datax01, ... datax_98,datax_99),并且你希望导入数据前,先对表中数据进行删除操作,那么你可以这样配置:`"preSql":["delete from 表名"]`,效果是:在执行到每个表写入数据前,会先执行对应的 delete from 对应表名称
+ + * 必选:否
+ + * 默认值:无
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
+ +* **writeMode** + + * 描述:控制写入数据到目标表采用 `insert into` 或者 `replace into` 语句
+ + * 必选:是
+ + * 默认值:insert
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与Mysql的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:1024
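下面给出一个仅作示意的 writer parameter 配置片段,演示 writeMode、batchSize 与 preSql 中 @table 占位符的配合使用。其中 datax_00、datax_01 等表名与字段均为假设的示例值;preSql 会在向每张表写入前将 @table 替换为实际表名再执行:

```json
{
    "name": "mysqlwriter",
    "parameter": {
        "writeMode": "replace",
        "username": "root",
        "password": "root",
        "column": ["id", "name"],
        "batchSize": 2048,
        "preSql": ["delete from @table"],
        "connection": [
            {
                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=utf8",
                "table": ["datax_00", "datax_01"]
            }
        ]
    }
}
```

结合 4.2 的测试结果,batchSize 在 512 以上即有明显的吞吐提升;但如 batchSize 参数说明所述,设置过大可能导致 DataX 运行进程 OOM,需结合内存情况权衡。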
+ + +### 3.3 类型转换 + +类似 MysqlReader ,目前 MysqlWriter 支持大部分 Mysql 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出 MysqlWriter 针对 Mysql 类型转换列表: + + +| DataX 内部类型| Mysql 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, mediumint, int, bigint, year| +| Double |float, double, decimal| +| String |varchar, char, tinytext, text, mediumtext, longtext | +| Date |date, datetime, timestamp, time | +| Boolean |bit, bool | +| Bytes |tinyblob, mediumblob, blob, longblob, varbinary | + + * `bit类型目前是未定义类型转换` + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + + CREATE TABLE `datax_mysqlwriter_perf_00` ( + `biz_order_id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'id', + `key_value` varchar(4000) NOT NULL COMMENT 'Key-value的内容', + `gmt_create` datetime NOT NULL COMMENT '创建时间', + `gmt_modified` datetime NOT NULL COMMENT '修改时间', + `attribute_cc` int(11) DEFAULT NULL COMMENT '防止并发修改的标志', + `value_type` int(11) NOT NULL DEFAULT '0' COMMENT '类型', + `buyer_id` bigint(20) DEFAULT NULL COMMENT 'buyerid', + `seller_id` bigint(20) DEFAULT NULL COMMENT 'seller_id', + PRIMARY KEY (`biz_order_id`,`value_type`), + KEY `idx_biz_vertical_gmtmodified` (`gmt_modified`) + ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='datax perf test' + + +单行记录类似于: + + key_value: ;orderIds:20148888888,2014888888813800; + gmt_create: 2011-09-24 11:07:20 + gmt_modified: 2011-10-24 17:56:34 + attribute_cc: 1 + value_type: 3 + buyer_id: 8888888 + seller_id: 1 + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 24核 Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz + 2. mem: 48GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* Mysql数据库机器参数为: + 1. cpu: 32核 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz + 2. mem: 256GB + 3. net: 千兆双网卡 + 4. disc: BTWL419303E2800RGN INTEL SSDSC2BB800G4 D2010370 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DB网卡进入流量(MB/s)|DB运行负载|DB TPS| +|--------|--------| --------|--------|--------|--------|--------|--------|--------| +|1| 128 | 5319 | 0.260 | 0.580 | 0.05 | 0.620| 0.5 | 50 | +|1| 512 | 14285 | 0.697 | 1.6 | 0.12 | 1.6 | 0.6 | 28 | +|1| 1024 | 17241 | 0.842 | 1.9 | 0.20 | 1.9 | 0.6 | 16| +|1| 2048 | 31250 | 1.49 | 2.8 | 0.15 | 3.0| 0.8 | 15 | +|1| 4096 | 31250 | 1.49 | 3.5 | 0.20 | 3.6| 0.8 | 8 | +|4| 128 | 11764 | 0.574 | 1.5 | 0.21 | 1.6| 0.8 | 112 | +|4| 512 | 30769 | 1.47 | 3.5 | 0.3 | 3.6 | 0.9 | 88 | +|4| 1024 | 50000 | 2.38 | 5.4 | 0.3 | 5.5 | 1.0 | 66 | +|4| 2048 | 66666 | 3.18 | 7.0 | 0.3 | 7.1| 1.37 | 46 | +|4| 4096 | 80000 | 3.81 | 7.3| 0.5 | 7.3| 1.40 | 26 | +|8| 128 | 17777 | 0.868 | 2.9 | 0.28 | 2.9| 0.8 | 200 | +|8| 512 | 57142 | 2.72 | 8.5 | 0.5 | 8.5| 0.70 | 159 | +|8| 1024 | 88888 | 4.24 | 12.2 | 0.9 | 12.4 | 1.0 | 108 | +|8| 2048 | 133333 | 6.36 | 14.7 | 0.9 | 14.7 | 1.0 | 81 | +|8| 4096 | 166666 | 7.95 | 19.5 | 0.9 | 19.5 | 3.0 | 45 | +|16| 128 | 32000 | 1.53 | 3.3 | 0.6 | 3.4 | 0.88 | 401 | +|16| 512 | 106666 | 5.09 | 16.1| 0.9 | 16.2 | 2.16 | 260 | +|16| 1024 | 173913 | 8.29 | 22.1| 1.5 | 22.2 | 4.5 | 200 | +|16| 2048 | 228571 | 10.90 | 28.6 | 1.61 | 28.7 | 4.60 | 128 | +|16| 4096 | 246153 | 11.74 | 31.1| 1.65 | 31.2| 4.66 | 57 | +|32| 1024 | 246153 | 11.74 | 30.5| 3.17 | 30.7 | 12.10 | 270 | + + +说明: + +1. 这里的单表,主键类型为 bigint(20),自增。 +2. batchSize 和 通道个数,对性能影响较大。 +3. 
16通道,4096批量提交时,出现 full gc 2次。 + + +#### 4.2.2 分表测试报告(2个分库,每个分库4张分表,共计8张分表) + + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DB网卡进入流量(MB/s)|DB运行负载|DB TPS| +|--------|--------| --------|--------|--------|--------|--------|--------|--------| +|8| 128 | 26764 | 1.28 | 2.9 | 0.5 | 3.0| 0.8 | 209 | +|8| 512 | 95180 | 4.54 | 10.5 | 0.7 | 10.9 | 0.8 | 188 | +|8| 1024 | 94117 | 4.49 | 12.3 | 0.6 | 12.4 | 1.09 | 120 | +|8| 2048 | 133333 | 6.36 | 19.4 | 0.9 | 19.5| 1.35 | 85 | +|8| 4096 | 191692 | 9.14 | 22.1 | 1.0 | 22.2| 1.45 | 45 | + + +#### 4.2.3 分表测试报告(2个分库,每个分库8张分表,共计16张分表) + + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DB网卡进入流量(MB/s)|DB运行负载|DB TPS| +|--------|--------| --------|--------|--------|--------|--------|--------|--------| +|16| 128 | 50124 | 2.39 | 5.6 | 0.40 | 6.0| 2.42 | 378 | +|16| 512 | 155084 | 7.40 | 18.6 | 1.30 | 18.9| 2.82 | 325 | +|16| 1024 | 177777 | 8.48 | 24.1 | 1.43 | 25.5| 3.5 | 233 | +|16| 2048 | 289382 | 13.8 | 33.1 | 2.5 | 33.5| 4.5 | 150 | +|16| 4096 | 326451 | 15.52 | 33.7 | 1.5 | 33.9| 4.3 | 80 | + +#### 4.2.4 性能测试小结 +1. 批量提交行数(batchSize)对性能影响很大,当 `batchSize>=512` 之后,单线程写入速度能达到每秒写入一万行 +2. 在 `batchSize>=512` 的基础上,随着通道数的增加(通道数<32),速度呈线性比增加。 +3. `通常不建议写入数据库时,通道个数 >32` + + +## 5 约束限制 + + + + +## FAQ + +*** + +**Q: MysqlWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** + +**Q: 上面第二种方法可以避免对线上数据造成影响,那我具体怎样操作?** + +A: 可以配置临时表导入 diff --git a/mysqlwriter/mysqlwriter.iml b/mysqlwriter/mysqlwriter.iml new file mode 100644 index 000000000..1f713bbe7 --- /dev/null +++ b/mysqlwriter/mysqlwriter.iml @@ -0,0 +1,46 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/mysqlwriter/pom.xml b/mysqlwriter/pom.xml new file mode 100755 index 000000000..3eaee3561 --- /dev/null +++ b/mysqlwriter/pom.xml @@ -0,0 +1,78 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + mysqlwriter + mysqlwriter + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + mysql + mysql-connector-java + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/mysqlwriter/src/main/assembly/package.xml b/mysqlwriter/src/main/assembly/package.xml new file mode 100755 index 000000000..03883c7be --- /dev/null +++ b/mysqlwriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/mysqlwriter + + + target/ + + mysqlwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/mysqlwriter + + + + + + false + plugin/writer/mysqlwriter/libs + runtime + + + diff --git a/mysqlwriter/src/main/java/com/alibaba/datax/plugin/writer/mysqlwriter/MysqlWriter.java b/mysqlwriter/src/main/java/com/alibaba/datax/plugin/writer/mysqlwriter/MysqlWriter.java new file mode 100755 index 000000000..9d2c82ee7 --- /dev/null +++ 
b/mysqlwriter/src/main/java/com/alibaba/datax/plugin/writer/mysqlwriter/MysqlWriter.java @@ -0,0 +1,101 @@ +package com.alibaba.datax.plugin.writer.mysqlwriter; + +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + + +//TODO writeProxy +public class MysqlWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.MySql; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private CommonRdbmsWriter.Job commonRdbmsWriterJob; + + @Override + public void preCheck(){ + this.init(); + this.commonRdbmsWriterJob.writerPreCheck(this.originalConfig, DATABASE_TYPE); + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + this.commonRdbmsWriterJob = new CommonRdbmsWriter.Job(DATABASE_TYPE); + this.commonRdbmsWriterJob.init(this.originalConfig); + } + + // 一般来说,是需要推迟到 task 中进行pre 的执行(单表情况例外) + @Override + public void prepare() { + //实跑先不支持 权限 检验 + //this.commonRdbmsWriterJob.privilegeValid(this.originalConfig, DATABASE_TYPE); + this.commonRdbmsWriterJob.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterJob.split(this.originalConfig, mandatoryNumber); + } + + // 一般来说,是需要推迟到 task 中进行post 的执行(单表情况例外) + @Override + public void post() { + this.commonRdbmsWriterJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterTask; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterTask = new CommonRdbmsWriter.Task(DATABASE_TYPE); + this.commonRdbmsWriterTask.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterTask.prepare(this.writerSliceConfig); + } + + //TODO 改用连接池,确保每次获取的连接都是可用的(注意:连接可能需要每次都初始化其 session) + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterTask.startWrite(recordReceiver, this.writerSliceConfig, + super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterTask.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterTask.destroy(this.writerSliceConfig); + } + + @Override + public boolean supportFailOver(){ + String writeMode = writerSliceConfig.getString(Key.WRITE_MODE); + return "replace".equalsIgnoreCase(writeMode); + } + + } + + +} diff --git a/mysqlwriter/src/main/java/com/alibaba/datax/plugin/writer/mysqlwriter/MysqlWriterErrorCode.java b/mysqlwriter/src/main/java/com/alibaba/datax/plugin/writer/mysqlwriter/MysqlWriterErrorCode.java new file mode 100755 index 000000000..e94adbfe9 --- /dev/null +++ b/mysqlwriter/src/main/java/com/alibaba/datax/plugin/writer/mysqlwriter/MysqlWriterErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.writer.mysqlwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum MysqlWriterErrorCode implements ErrorCode { + ; + + private final String code; + private final String describe; + + private MysqlWriterErrorCode(String 
code, String describe) { + this.code = code; + this.describe = describe; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.describe; + } + + @Override + public String toString() { + return String.format("Code:[%s], Describe:[%s]. ", this.code, + this.describe); + } +} diff --git a/mysqlwriter/src/main/resources/plugin.json b/mysqlwriter/src/main/resources/plugin.json new file mode 100755 index 000000000..e2b62538a --- /dev/null +++ b/mysqlwriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "mysqlwriter", + "class": "com.alibaba.datax.plugin.writer.mysqlwriter.MysqlWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute insert sql. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/mysqlwriter/src/main/resources/plugin_job_template.json b/mysqlwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..1edb50048 --- /dev/null +++ b/mysqlwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,17 @@ +{ + "name": "mysqlwriter", + "parameter": { + "username": "", + "password": "", + "writeMode": "", + "column": [], + "session": [], + "preSql": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ] + } +} \ No newline at end of file diff --git a/ocswriter/doc/ocswriter.md b/ocswriter/doc/ocswriter.md new file mode 100644 index 000000000..885ac697a --- /dev/null +++ b/ocswriter/doc/ocswriter.md @@ -0,0 +1,168 @@ +# DataX OCSWriter 适用memcached客户端写入ocs +--- +## 1 快速介绍 +### 1.1 OCS简介 +开放缓存服务( Open Cache Service,简称OCS)是基于内存的缓存服务,支持海量小数据的高速访问。OCS可以极大缓解对后端存储的压力,提高网站或应用的响应速度。OCS支持Key-Value的数据结构,兼容Memcached协议的客户端都可与OCS通信。
+ +OCS 支持即开即用的方式快速部署;对于动态Web、APP应用,可通过缓存服务减轻对数据库的压力,从而提高网站整体的响应速度。
+ +与本地MemCache相同之处在于OCS兼容Memcached协议,与用户环境兼容,可直接用于OCS服务 不同之处在于硬件和数据部署在云端,有完善的基础设施、网络安全保障、系统维护服务。所有的这些服务,都不需要投资,只需根据使用量进行付费即可。 +### 1.2 OCSWriter简介 +OCSWriter是DataX实现的,基于Memcached协议的数据写入OCS通道。 +## 2 功能说明 +### 2.1 配置样例 +* 这里使用一份从内存产生的数据导入到OCS。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "ocswriter", + "parameter": { + "proxy": "10.101.72.137", + "port": "11211", + "userName": "user", + "password": "******", + "writeMode": "set|add|replace|append|prepend", + "writeFormat": "text|binary", + "fieldDelimiter": "\u0001", + "expireTime": 1000, + "indexes": "0,2", + "batchSize": 1000 + } + } + } + ] + } +} +``` + +### 2.2 参数说明 + +* **proxy** + + * 描述:OCS机器的ip或host。 + * 必选:是 + +* **port** + + * 描述:OCS的连接域名,默认为11211 + * 必选:否 + * 默认值:11211 + +* **username** + + * 描述:OCS连接的访问账号。 + * 必选:是 + +* **password** + + * 描述:OCS连接的访问密码 + * 必选:是 + +* **writeMode** + + * 描述: OCSWriter写入方式,具体为: + * set: 存储这个数据,如果已经存在则覆盖 + * add: 存储这个数据,当且仅当这个key不存在的时候 + * replace: 存储这个数据,当且仅当这个key存在 + * append: 将数据存放在已存在的key对应的内容的后面,忽略exptime + * prepend: 将数据存放在已存在的key对应的内容的前面,忽略exptime + * 必选:是 + +* **writeFormat** + + * 描述: OCSWriter写出数据格式,目前支持两类数据写入方式: + * text: 将源端数据序列化为文本格式,其中第一个字段作为OCS写入的KEY,后续所有字段序列化为STRING类型,使用用户指定的fieldDelimiter作为间隔符,将文本拼接为完整的字符串再写入OCS。 + * binary: 将源端数据作为二进制直接写入,这类场景为未来做扩展使用,目前不支持。如果填写binary将会报错! + * 必选:否 + * 默认值:text + +* **expireTime** + + * 描述: OCS值缓存失效时间,目前MemCache支持两类过期时间, + + * Unix时间(自1970.1.1开始到现在的秒数),该时间指定了到未来某个时刻数据失效。 + * 相对当前时间的秒数,该时间指定了从现在开始多长时间后数据失效。 + **注意:如果过期时间的秒数大于60*60*24*30(即30天),则服务端认为是Unix时间。** + * 单位:秒 + * 必选:否 + * 默认值:0【0表示永久有效】 + +* **indexes** + + * 描述: 用数据的第几列当做ocs的key + * 必选:否 + * 默认值:0 + +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与OCS的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况[memcached版本暂不支持批量写]。 + * 必选:否 + * 默认值:256 + +* **fieldDelimiter** + * 描述:写入ocs的key和value分隔符。比如:key=tom\u0001boston, value=28\u0001lawer\u0001male\u0001married + * 必选:否 + * 默认值:\u0001 + +## 3 性能报告 +### 3.1 datax机器配置 +``` +CPU:16核、内存:24GB、网卡:单网卡1000mbps +``` +### 3.2 任务资源配置 +``` +-Xms8g -Xmx8g -XX:+HeapDumpOnOutOfMemoryError +``` +### 3.3 测试报告 +| 单条数据大小 | 通道并发数 | TPS | 通道流量 | 出口流量 | 备注 | +| :--------: | :--------:| :--: | :--: | :--: | :--: | +| 1KB | 1 | 579 tps | 583.31KB/s | 648.63KB/s | 无 | +| 1KB | 10 | 6006 tps | 5.87MB/s | 6.73MB/s | 无 | +| 1KB | 100 | 49916 tps | 48.56MB/s | 55.55MB/s | 无 | +| 10KB | 1 | 438 tps | 4.62MB/s | 5.07MB/s | 无 | +| 10KB | 10 | 4313 tps | 45.57MB/s | 49.51MB/s | 无 | +| 10KB | 100 | 10713 tps | 112.80MB/s | 123.01MB/s | 无 | +| 100KB | 1 | 275 tps | 26.09MB/s | 144.90KB/s | 无。数据冗余大,压缩比高。 | +| 100KB | 10 | 2492 tps | 236.33MB/s | 1.30MB/s | 无 | +| 100KB | 100 | 3187 tps | 302.17MB/s | 1.77MB/s | 无 | + +### 3.4 性能测试小结 +1. 单条数据小于10KB时建议开启100并发。 +2. 
不建议10KB以上的数据写入ocs。 diff --git a/ocswriter/ocswriter.iml b/ocswriter/ocswriter.iml new file mode 100644 index 000000000..a64d9e2c5 --- /dev/null +++ b/ocswriter/ocswriter.iml @@ -0,0 +1,64 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/ocswriter/pom.xml b/ocswriter/pom.xml new file mode 100644 index 000000000..0afc74db6 --- /dev/null +++ b/ocswriter/pom.xml @@ -0,0 +1,94 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + ocswriter + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + com.alibaba.datax + datax-core + ${datax-project-version} + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + org.testng + testng + 6.8.8 + test + + + org.easymock + easymock + 3.3.1 + test + + + com.google.code.simple-spring-memcached + spymemcached + 2.8.1 + + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + 3.2 + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + + \ No newline at end of file diff --git a/ocswriter/src/main/assembly/package.xml b/ocswriter/src/main/assembly/package.xml new file mode 100644 index 000000000..804456e6f --- /dev/null +++ b/ocswriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/ocswriter + + + target/ + + ocswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/ocswriter + + + + + + false + plugin/writer/ocswriter/libs + runtime + + + diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/Key.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/Key.java new file mode 100644 index 000000000..00bdea8d2 --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/Key.java @@ -0,0 +1,26 @@ +package com.alibaba.datax.plugin.writer.ocswriter; + +/** + * Time: 2015-05-06 19:54 + */ +public final class Key { + public final static String USER = "username"; + + public final static String PASSWORD = "password"; + + public final static String PROXY = "proxy"; + + public final static String PORT = "port"; + + public final static String WRITE_MODE = "writeMode"; + + public final static String WRITE_FORMAT = "writeFormat"; + + public final static String FIELD_DELIMITER = "fieldDelimiter"; + + public final static String EXPIRE_TIME = "expireTime"; + + public final static String BATCH_SIZE = "batchSize"; + + public final static String INDEXES = "indexes"; +} diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/OcsWriter.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/OcsWriter.java new file mode 100644 index 000000000..6b46f38b9 --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/OcsWriter.java @@ -0,0 +1,321 @@ +package com.alibaba.datax.plugin.writer.ocswriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import 
com.alibaba.datax.plugin.writer.ocswriter.utils.ConfigurationChecker; +import com.alibaba.datax.plugin.writer.ocswriter.utils.OcsWriterErrorCode; +import com.google.common.annotations.VisibleForTesting; +import net.spy.memcached.AddrUtil; +import net.spy.memcached.ConnectionFactoryBuilder; +import net.spy.memcached.MemcachedClient; +import net.spy.memcached.auth.AuthDescriptor; +import net.spy.memcached.auth.PlainCallbackHandler; +import net.spy.memcached.internal.OperationFuture; +import org.apache.commons.lang3.StringUtils; + +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.concurrent.Callable; +import java.util.concurrent.TimeUnit; + +/** + * Time: 2015-05-06 16:01 + * 1,控制速度 + * 2、MemcachedClient长连接长时间不写入数据可能会被server强行关闭 + * 3、以后支持json格式的value + */ +public class OcsWriter extends Writer { + + public static class Job extends Writer.Job { + private Configuration configuration; + + @Override + public void init() { + this.configuration = super.getPluginJobConf(); + //参数有效性检查 + ConfigurationChecker.check(this.configuration); + } + + @Override + public void prepare() { + super.prepare(); + } + + @Override + public List split(int mandatoryNumber) { + ArrayList configList = new ArrayList(); + for (int i = 0; i < mandatoryNumber; i++) { + configList.add(this.configuration.clone()); + } + return configList; + } + + @Override + public void destroy() { + } + } + + public static class Task extends Writer.Task { + + private Configuration configuration; + private MemcachedClient client; + private Set indexesFromUser = new HashSet(); + private String delimiter; + private int expireTime; + //private int batchSize; + private ConfigurationChecker.WRITE_MODE writeMode; + private TaskPluginCollector taskPluginCollector; + + @Override + public void init() { + this.configuration = this.getPluginJobConf(); + this.taskPluginCollector = super.getTaskPluginCollector(); + } + + @Override + public void prepare() { + super.prepare(); + + //如果用户不配置,默认为第0列 + String indexStr = this.configuration.getString(Key.INDEXES, "0"); + for (String index : indexStr.split(",")) { + indexesFromUser.add(Integer.parseInt(index)); + } + + //如果用户不配置,默认为\u0001 + delimiter = this.configuration.getString(Key.FIELD_DELIMITER, "\u0001"); + expireTime = this.configuration.getInt(Key.EXPIRE_TIME, 0); + //todo 此版本不支持批量提交,待ocswriter发布新版本client后支持。batchSize = this.configuration.getInt(Key.BATCH_SIZE, 100); + writeMode = ConfigurationChecker.WRITE_MODE.valueOf(this.configuration.getString(Key.WRITE_MODE)); + + String proxy = this.configuration.getString(Key.PROXY); + //默认端口为11211 + String port = this.configuration.getString(Key.PORT, "11211"); + String username = this.configuration.getString(Key.USER); + String password = this.configuration.getString(Key.PASSWORD); + AuthDescriptor ad = new AuthDescriptor(new String[]{"PLAIN"}, new PlainCallbackHandler(username, password)); + + try { + client = getMemcachedConn(proxy, port, ad); + } catch (Exception e) { + //异常不能吃掉,直接抛出,便于定位 + throw DataXException.asDataXException(OcsWriterErrorCode.OCS_INIT_ERROR, String.format("初始化ocs客户端失败"), e); + } + } + + /** + * 建立ocs客户端连接 + * 重试9次,间隔时间指数增长 + */ + private MemcachedClient getMemcachedConn(final String proxy, final String port, final AuthDescriptor ad) throws Exception { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public MemcachedClient call() throws Exception { + return new MemcachedClient( + new 
ConnectionFactoryBuilder().setProtocol(ConnectionFactoryBuilder.Protocol.BINARY) + .setAuthDescriptor(ad) + .build(), + AddrUtil.getAddresses(proxy + ":" + port)); + } + }, 9, 1000L, true); + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + Record record; + String key; + String value; + while ((record = lineReceiver.getFromReader()) != null) { + try { + key = buildKey(record); + value = buildValue(record); + switch (writeMode) { + case set: + case replace: + case add: + commitWithRetry(key, value); + break; + case append: + case prepend: + commit(key, value); + break; + default: + //没有default,因为参数检查的时候已经判断,不可能出现5中模式之外的模式 + } + } catch (Exception e) { + this.taskPluginCollector.collectDirtyRecord(record, e); + } + } + } + + /** + * 没有重试的commit + */ + private void commit(final String key, final String value) { + OperationFuture future; + switch (writeMode) { + case set: + future = client.set(key, expireTime, value); + break; + case add: + //幂等原则:相同的输入得到相同的输出,不管调用多少次。 + //所以add和replace是幂等的。 + future = client.add(key, expireTime, value); + break; + case replace: + future = client.replace(key, expireTime, value); + break; + //todo 【注意】append和prepend重跑任务不能支持幂等,使用需谨慎,不需要重试 + case append: + future = client.append(0L, key, value); + break; + case prepend: + future = client.prepend(0L, key, value); + break; + default: + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("不支持的写入模式%s", writeMode.toString())); + //因为前面参数校验的时候已经判断,不可能存在5中操作之外的类型。 + } + //【注意】getStatus()返回为null有可能是因为get()超时导致,此种情况当做脏数据处理。但有可能数据已经成功写入ocs。 + if (future == null || future.getStatus() == null || !future.getStatus().isSuccess()) { + throw DataXException.asDataXException(OcsWriterErrorCode.COMMIT_FAILED, "提交数据到ocs失败"); + } + } + + /** + * 提交数据到ocs,有重试机制 + */ + private void commitWithRetry(final String key, final String value) throws Exception { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Object call() throws Exception { + commit(key, value); + return null; + } + }, 3, 1000L, false); + } + + /** + * 构建value + * 如果有二进制字段当做脏数据处理 + * 如果col为null,当做脏数据处理 + */ + private String buildValue(Record record) { + ArrayList valueList = new ArrayList(); + int colNum = record.getColumnNumber(); + for (int i = 0; i < colNum; i++) { + Column col = record.getColumn(i); + if (col != null) { + String value; + Column.Type type = col.getType(); + switch (type) { + case STRING: + case BOOL: + case DOUBLE: + case LONG: + case DATE: + value = col.asString(); + //【注意】value字段中如果有分隔符,当做脏数据处理 + if (value != null && value.contains(delimiter)) { + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("数据中包含分隔符:%s", value)); + } + break; + default: + //目前不支持二进制,如果遇到二进制,则当做脏数据处理 + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("不支持的数据格式:%s", type.toString())); + } + valueList.add(value); + } else { + //如果取到的列为null,需要当做脏数据处理 + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("record中不存在第%s个字段", i)); + } + } + return StringUtils.join(valueList, delimiter); + } + + /** + * 构建key + * 构建数据为空时当做脏数据处理 + */ + private String buildKey(Record record) { + ArrayList keyList = new ArrayList(); + for (int index : indexesFromUser) { + Column col = record.getColumn(index); + if (col == null) { + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("不存在第%s列", index)); + } + Column.Type type = col.getType(); + String value; + switch (type) { + case STRING: + case 
BOOL: + case DOUBLE: + case LONG: + case DATE: + value = col.asString(); + if (value != null && value.contains(delimiter)) { + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("主键中包含分隔符:%s", value)); + } + keyList.add(value); + break; + default: + //目前不支持二进制,如果遇到二进制,则当做脏数据处理 + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("不支持的数据格式:%s", type.toString())); + } + } + String rtn = StringUtils.join(keyList, delimiter); + if (StringUtils.isBlank(rtn)) { + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("构建主键为空,请检查indexes的配置")); + } + return rtn; + } + + /** + * shutdown中会有数据异步提交,需要重试。 + */ + @Override + public void destroy() { + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Object call() throws Exception { + if (client == null || client.shutdown(10000L, TimeUnit.MILLISECONDS)) { + return null; + } else { + throw DataXException.asDataXException(OcsWriterErrorCode.SHUTDOWN_FAILED, "关闭ocsClient失败"); + } + } + }, 8, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OcsWriterErrorCode.SHUTDOWN_FAILED, "关闭ocsClient失败", e); + } + } + + /** + * 以下为测试使用 + */ + @VisibleForTesting + public String buildValue_test(Record record) { + return this.buildValue(record); + } + + @VisibleForTesting + public String buildKey_test(Record record) { + return this.buildKey(record); + } + + @VisibleForTesting + public void setIndexesFromUser(HashSet indexesFromUser) { + this.indexesFromUser = indexesFromUser; + } + + } +} diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/CommonUtils.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/CommonUtils.java new file mode 100644 index 000000000..cf1f9630a --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/CommonUtils.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.plugin.writer.ocswriter.utils; + +/** + * Time: 2015-05-12 15:02 + */ +public class CommonUtils { + + public static void sleepInMs(long time) { + try{ + Thread.sleep(time); + } catch (InterruptedException e) { + // + } + } +} diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/ConfigurationChecker.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/ConfigurationChecker.java new file mode 100644 index 000000000..f0bf42714 --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/ConfigurationChecker.java @@ -0,0 +1,147 @@ +package com.alibaba.datax.plugin.writer.ocswriter.utils; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.ocswriter.Key; +import com.google.common.annotations.VisibleForTesting; + +import net.spy.memcached.AddrUtil; +import net.spy.memcached.ConnectionFactoryBuilder; +import net.spy.memcached.MemcachedClient; +import net.spy.memcached.auth.AuthDescriptor; +import net.spy.memcached.auth.PlainCallbackHandler; + +import org.apache.commons.lang3.EnumUtils; +import org.apache.commons.lang3.StringUtils; + + +/** + * Time: 2015-05-07 16:48 + */ +public class ConfigurationChecker { + + public static void check(Configuration config) { + paramCheck(config); + hostReachableCheck(config); + } + + public enum WRITE_MODE { + set, + add, + replace, + append, + prepend + } + + private enum WRITE_FORMAT { + text + } + + /** + * 参数有效性基本检查 + */ + private 
static void paramCheck(Configuration config) { + String proxy = config.getString(Key.PROXY); + if (StringUtils.isBlank(proxy)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("ocs服务地址%s不能设置为空", Key.PROXY)); + } + String user = config.getString(Key.USER); + if (StringUtils.isBlank(user)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("访问ocs的用户%s不能设置为空", Key.USER)); + } + String password = config.getString(Key.PASSWORD); + if (StringUtils.isBlank(password)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("访问ocs的用户%s不能设置为空", Key.PASSWORD)); + } + + String port = config.getString(Key.PORT, "11211"); + if (StringUtils.isBlank(port)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("ocs端口%s不能设置为空", Key.PORT)); + } + + String indexes = config.getString(Key.INDEXES, "0"); + if (StringUtils.isBlank(indexes)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("当做key的列编号%s不能为空", Key.INDEXES)); + } + for (String index : indexes.split(",")) { + try { + if (Integer.parseInt(index) < 0) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("列编号%s必须为逗号分隔的非负整数", Key.INDEXES)); + } + } catch (NumberFormatException e) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("列编号%s必须为逗号分隔的非负整数", Key.INDEXES)); + } + } + + String writerMode = config.getString(Key.WRITE_MODE); + if (StringUtils.isBlank(writerMode)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("操作方式%s不能为空", Key.WRITE_MODE)); + } + if (!EnumUtils.isValidEnum(WRITE_MODE.class, writerMode.toLowerCase())) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("不支持操作方式%s,仅支持%s", writerMode, StringUtils.join(WRITE_MODE.values(), ","))); + } + + String writerFormat = config.getString(Key.WRITE_FORMAT, "text"); + if (StringUtils.isBlank(writerFormat)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("写入格式%s不能为空", Key.WRITE_FORMAT)); + } + if (!EnumUtils.isValidEnum(WRITE_FORMAT.class, writerFormat.toLowerCase())) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("不支持写入格式%s,仅支持%s", writerFormat, StringUtils.join(WRITE_FORMAT.values(), ","))); + } + + int expireTime = config.getInt(Key.EXPIRE_TIME, 0); + if (expireTime < 0) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("数据过期时间设置%s不能小于0", Key.EXPIRE_TIME)); + } + + int batchSiz = config.getInt(Key.BATCH_SIZE, 100); + if (batchSiz <= 0) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("批量写入大小设置%s必须大于0", Key.BATCH_SIZE)); + } + //fieldDelimiter不需要检查,默认为\u0001 + } + + /** + * 检查ocs服务器网络是否可达 + */ + private static void hostReachableCheck(Configuration config) { + String proxy = config.getString(Key.PROXY); + String port = config.getString(Key.PORT); + String username = config.getString(Key.USER); + String password = config.getString(Key.PASSWORD); + AuthDescriptor ad = new AuthDescriptor(new String[] { "PLAIN" }, + new PlainCallbackHandler(username, password)); + try { + MemcachedClient client = new MemcachedClient( + new ConnectionFactoryBuilder() + .setProtocol( + ConnectionFactoryBuilder.Protocol.BINARY) + .setAuthDescriptor(ad).build(), + 
AddrUtil.getAddresses(proxy + ":" + port)); + client.get("for_check_connectivity"); + client.getVersions(); + if (client.getAvailableServers().isEmpty()) { + throw new RuntimeException( + "没有可用的Servers: getAvailableServers() -> is empty"); + } + client.shutdown(); + } catch (Exception e) { + throw DataXException.asDataXException( + OcsWriterErrorCode.HOST_UNREACHABLE, + String.format("OCS[%s]服务不可用", proxy), e); + } + } + + /** + * 以下为测试使用 + */ + @VisibleForTesting + public static void paramCheck_test(Configuration configuration) { + paramCheck(configuration); + } + + @VisibleForTesting + public static void hostReachableCheck_test(Configuration configuration) { + hostReachableCheck(configuration); + } +} diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/OcsWriterErrorCode.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/OcsWriterErrorCode.java new file mode 100644 index 000000000..e67e746a5 --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/OcsWriterErrorCode.java @@ -0,0 +1,34 @@ +package com.alibaba.datax.plugin.writer.ocswriter.utils; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Time: 2015-05-22 16:45 + */ +public enum OcsWriterErrorCode implements ErrorCode { + REQUIRED_VALUE("OcsWriterErrorCode-000", "参数不能为空"), + ILLEGAL_PARAM_VALUE("OcsWriterErrorCode-001", "参数不合法"), + HOST_UNREACHABLE("OcsWriterErrorCode-002", "服务不可用"), + OCS_INIT_ERROR("OcsWriterErrorCode-003", "初始化ocs client失败"), + DIRTY_RECORD("OcsWriterErrorCode-004", "脏数据"), + SHUTDOWN_FAILED("OcsWriterErrorCode-005", "关闭ocs client失败"), + COMMIT_FAILED("OcsWriterErrorCode-006", "提交数据到ocs失败"); + + private final String code; + private final String description; + + private OcsWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return null; + } + + @Override + public String getDescription() { + return null; + } +} diff --git a/ocswriter/src/main/resources/plugin.json b/ocswriter/src/main/resources/plugin.json new file mode 100644 index 000000000..4874911a4 --- /dev/null +++ b/ocswriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "ocswriter", + "class": "com.alibaba.datax.plugin.writer.ocswriter.OcsWriter", + "description": "set|add|replace|append|prepend record into ocs.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/ocswriter/src/main/resources/plugin_job_template.json b/ocswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..d62f3c96c --- /dev/null +++ b/ocswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "ocswriter", + "parameter": { + "proxy": "", + "port": "", + "userName": "", + "password": "", + "writeMode": "", + "writeFormat": "", + "fieldDelimiter": "", + "expireTime": "", + "indexes": "", + "batchSize": "" + } +} \ No newline at end of file diff --git a/odpsreader/doc/odpsreader.md b/odpsreader/doc/odpsreader.md new file mode 100644 index 000000000..6c04d2632 --- /dev/null +++ b/odpsreader/doc/odpsreader.md @@ -0,0 +1,349 @@ +# DataX ODPSReader + + +--- + + +## 1 快速介绍 +ODPSReader 实现了从 ODPS读取数据的功能,有关ODPS请参看(http://wiki.aliyun-inc.com/projects/apsara/wiki/odps)。 在底层实现上,ODPSReader 根据你配置的 源头项目 / 表 / 分区 / 表字段 等信息,通过 `Tunnel` 从 ODPS 系统中读取数据。 + +
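为便于理解“通过 Tunnel 批量读取”的过程,下面给出一段示意性的 Java 代码(仅为理解用的草图,省略了 ODPSReader 内部的切分、重试与类型转换逻辑;其中 projectName、tableName、accessId、accessKey、odpsServer 均为占位值,并非真实配置):

```
// 示意代码:通过 ODPS Tunnel SDK 的 DownloadSession 顺序读取一张非分区表
import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordReader;
import com.aliyun.odps.tunnel.TableTunnel;

public class TunnelReadDemo {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("accessId", "accessKey"));
        odps.setEndpoint("odpsServer");          // 对应配置项 odpsServer
        odps.setDefaultProject("projectName");   // 对应配置项 project

        TableTunnel tunnel = new TableTunnel(odps);
        TableTunnel.DownloadSession session =
                tunnel.createDownloadSession("projectName", "tableName");

        long count = session.getRecordCount();
        RecordReader reader = session.openRecordReader(0, count);   // 读取 [0, count) 范围的记录
        Record record;
        while ((record = reader.read()) != null) {
            // ODPSReader 在这里把每条 ODPS Record 转换为 DataX 内部 Record 并发送给 Writer
        }
        reader.close();
    }
}
```

真实实现中,Job 端会先创建 DownloadSession 并按记录数或分区切分出若干读取范围,各个 Task 再按各自的 start/count 打开 RecordReader 读取。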
+ + 注意 1、如果你需要使用ODPSReader/Writer插件,由于 AccessId/AccessKey 解密的需要,请务必使用 JDK 1.6.32 及以上版本。JDK 安装事项,请联系 PE 处理 + 2、ODPSReader 不是通过 ODPS SQL (select ... from ... where ... )来抽取数据的 + 3、注意区分你要读取的表是线上环境还是线下环境 + 4、目前 DataX3 依赖的 SDK 版本是: + + com.aliyun.odps + odps-sdk-core-internal + 0.13.2 + + + +## 2 实现原理 +ODPSReader 支持读取分区表、非分区表,不支持读取虚拟视图。当要读取分区表时,需要指定出具体的分区配置,比如读取 t0 表,其分区为 pt=1,ds=hangzhou 那么你需要在配置中配置该值。当要读取非分区表时,你不能提供分区配置。表字段可以依序指定全部列,也可以指定部分列,或者调整列顺序,或者指定常量字段,但是表字段中不能指定分区列(分区列不是表字段)。 + + 注意:要特别注意 odpsServer、project、table、accessId、accessKey 的配置,因为直接影响到是否能够加载到你需要读取数据的表。很多权限问题都出现在这里。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份读出 ODPS 数据然后打印到屏幕的配置样板。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "odpsreader", + "parameter": { + "accessId": "accessId", + "accessKey": "accessKey", + "project": "targetProjectName", + "table": "tableName", + "partition": [ + "pt=1,ds=hangzhou" + ], + "column": [ + "customer_id", + "nickname" + ], + "packageAuthorizedProject": "yourCurrentProjectName", + "splitMode": "record", + "odpsServer": "odpsServer" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "fieldDelimiter": "\t", + "print": "true" + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +## 参数 + +* **accessId** + * 描述:ODPS系统登录ID
+ + * 必选:是
+ + * 默认值:无
+ +* **accessKey** + * 描述:ODPS系统登录Key
+ + * 必选:是
+ + * 默认值:无
+ +* **project** + + * 描述:读取数据表所在的 ODPS 项目名称(大小写不敏感)
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:读取数据表的表名称(大小写不敏感)
+ + * 必选:是
+ + * 默认值:无
+ +* **partition** + + * 描述:读取数据所在的分区信息,支持linux shell通配符,包括 * 表示0个或多个字符,?代表任意一个字符。例如现在有分区表 test,其存在 pt=1,ds=hangzhou pt=1,ds=shanghai pt=2,ds=hangzhou pt=2,ds=beijing 四个分区,如果你想读取 pt=1,ds=shanghai 这个分区的数据,那么你应该配置为: `"partition":["pt=1,ds=shanghai"]`; 如果你想读取 pt=1下的所有分区,那么你应该配置为: `"partition":["pt=1,ds=* "]`;如果你想读取整个 test 表的所有分区的数据,那么你应该配置为: `"partition":["pt=*,ds=*"]`
+ + * 必选:如果表为分区表,则必填。如果表为非分区表,则不能填写
+ + * 默认值:无
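上面 partition 配置所说的通配符匹配,大致可以用下面这段示意代码来理解(仅为草图:真实实现是先用 OdpsUtil.formatPartitions 规整分区字符串,再用 FilterUtil.filterByRegulars 做筛选,细节可能与此不同;示例中的分区值取自上文的 test 表):

```
// 示意代码:把用户配置的分区(含 * 与 ?)展开成实际存在的分区
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionWildcardDemo {
    // * 匹配 0 个或多个字符,? 匹配任意一个字符,其余字符按字面匹配
    static String toRegex(String userPartition) {
        return userPartition.replace(".", "\\.")
                .replace("*", ".*")
                .replace("?", ".");
    }

    public static void main(String[] args) {
        List<String> allPartitions = Arrays.asList(
                "pt=1,ds=hangzhou", "pt=1,ds=shanghai",
                "pt=2,ds=hangzhou", "pt=2,ds=beijing");
        String userConfigured = "pt=1,ds=*";    // 读取 pt=1 下的所有分区

        List<String> matched = new ArrayList<String>();
        String regex = toRegex(userConfigured);
        for (String partition : allPartitions) {
            if (partition.matches(regex)) {
                matched.add(partition);
            }
        }
        System.out.println(matched);    // 输出 [pt=1,ds=hangzhou, pt=1,ds=shanghai]
    }
}
```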
+ +* **column** + + * 描述:读取 odps 源头表的列信息。例如现在有表 test,其字段为:id,name,age 如果你想依次读取 id,name,age 那么你应该配置为: `"column":["id","name","age"]` 或者配置为:`"column"=["*"]` 这里 * 表示依次读取表的每个字段,但是我们不推荐你配置抽取字段为 * ,因为当你的表字段顺序调整、类型变更或者个数增减,你的任务就会存在源头表列和目的表列不能对齐的风险,会直接导致你的任务运行结果不正确甚至运行失败。如果你想依次读取 name,id 那么你应该配置为: `"coulumn":["name","id"]` 如果你想在源头抽取的字段中添加常量字段(以适配目标表的字段顺序),比如你想抽取的每一行数据值为 age 列对应的值,name列对应的值,常量日期值1988-08-08 08:08:08,id 列对应的值 那么你应该配置为:`"column":["age","name","'1988-08-08 08:08:08'","id"]` 即常量列首尾用符号`'` 包住即可,我们内部实现上识别常量是通过检查你配置的每一个字段,如果发现有字段首尾都有`'`,则认为其是常量字段,其实际值为去除`'` 之后的值。 + + 注意:ODPSReader 抽取数据表不是通过 ODPS 的 Select SQL 语句,所以不能在字段上指定函数,也不能指定分区字段名称(分区字段不属于表字段) + + * 必选:是
+ + * 默认值:无
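上面 column 配置中“常量列首尾加单引号”的约定,可以用下面的示意代码来理解(仅为草图,与 ODPSReader 内部 parseColumns 判断常量列的思路一致,但并非其源码):

```
// 示意代码:区分普通列与常量列(首尾都是单引号即视为常量列)
import java.util.Arrays;
import java.util.List;

public class ConstantColumnDemo {
    static boolean isConstantColumn(String column) {
        return column.length() >= 2
                && column.startsWith("'") && column.endsWith("'");
    }

    public static void main(String[] args) {
        List<String> userConfiguredColumns =
                Arrays.asList("age", "name", "'1988-08-08 08:08:08'", "id");
        for (String column : userConfiguredColumns) {
            if (isConstantColumn(column)) {
                // 常量列:实际发送给 Writer 的值是去掉首尾单引号后的内容
                System.out.println("constant -> " + column.substring(1, column.length() - 1));
            } else {
                System.out.println("normal   -> " + column);
            }
        }
    }
}
```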
+ +* **odpsServer** + + * 描述:源头表 所在 ODPS 系统的server 地址
+ + * 必选:是
+ + * 默认值:无
+ +* **tunnelServer** + + * 描述:源头表 所在 ODPS 系统的tunnel 地址
+ + * 必选:是,如果地址是对内的(含有"-inc")则可以不填
+ + * 默认值:无
+ +* **splitMode** + + * 描述:读取源头表时切分所需要的模式。默认值为 record,可不填,表示根据切分份数,按照记录数进行切分。如果你的任务目的端为 Mysql,并且是 Mysql 的多个表,那么根据现在 DataX 结构,你的源头表必须是分区表,并且每个分区依次对应目的端 Mysql 的多个分表,则此时应该配置为`"splitMode":"partition"`
+ + * 必选:否
+ + * 默认值:record
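splitMode 对切分份数的影响,可以用下面的示意代码粗略理解(仅为草图,真实逻辑见本仓库的 OdpsSplitUtil.doSplit;其中 adviceNum、partitionNumber 均为假设的示例值):

```
// 示意代码:record 与 partition 两种 splitMode 下任务切分份数的差别
public class SplitModeDemo {
    public static void main(String[] args) {
        int adviceNum = 10;        // 框架建议的切分数(与 channel 数相关)
        int partitionNumber = 4;   // partition 配置展开后实际要读取的分区个数

        // splitMode = partition:每个分区对应一个读取任务
        System.out.println("partition mode tasks = " + partitionNumber);

        // splitMode = record:先把 adviceNum 均摊到每个分区,再在分区内部按记录数范围继续切分
        int splitsPerPartition = (int) Math.ceil(1.0 * adviceNum / partitionNumber);
        System.out.println("record mode, splits per partition = " + splitsPerPartition);
    }
}
```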
+ +* **accountProvider** [待定] + + * 描述:读取时使用的 ODPS 账号类型。目前支持 aliyun/taobao 两种类型。默认为 aliyun,可不填
+ + * 必选:否
+ + * 默认值:aliyun
+ +* **packageAuthorizedProject** + + * 描述:被package授权的project,即用户当前所在project
+ + * 必选:否
+ + * 默认值:无
+ +* **isCompress** + + * 描述:是否压缩读取,bool类型: "true"表示压缩, "false"标示不压缩
+ + * 必选:否
+ + * 默认值:"false" : 不压缩
+ +### 3.3 类型转换 + +下面列出 ODPSReader 读出类型与 DataX 内部类型的转换关系: + + +| ODPS 数据类型| DataX 内部类型 | +| -------- | ----- | +| BIGINT | Long | +| DOUBLE | Double | +| STRING | String | +| DATETIME | Date | +| Boolean | Bool | + + +## 4 性能报告(线上环境实测) + +### 4.1 环境准备 + +#### 4.1.1 数据特征 + +建表语句: + + use cdo_datasync; + create table datax3_odpswriter_perf_10column_1kb_00( + s_0 string, + bool_1 boolean, + bi_2 bigint, + dt_3 datetime, + db_4 double, + s_5 string, + s_6 string, + s_7 string, + s_8 string, + s_9 string + )PARTITIONED by (pt string,year string); + +单行记录类似于: + + s_0 : 485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&* + bool_1 : true + bi_2 : 1696248667889 + dt_3 : 2013-07-0600: 00: 00 + db_4 : 3.141592653578 + s_5 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 + s_6 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209 + s_7 : 100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209 + s_8 : 100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209 + s_9 : 12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu : 24 Core Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz cache 15.36MB + 2. mem : 50GB + 3. net : 千兆双网卡 + 4. jvm : -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + 5. disc: DataX 数据不落磁盘,不统计此项 + +* 任务配置为: +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "odpsreader", + "parameter": { + "accessId": "******************************", + "accessKey": "*****************************", + "column": [ + "*" + ], + "partition": [ + "pt=20141010000000,year=2014" + ], + "odpsServer": "http://service.odps.aliyun-inc.com/api", + "project": "cdo_datasync", + "table": "datax3_odpswriter_perf_10column_1kb_00", + "tunnelServer": "http://dt.odps.cm11.aliyun-inc.com" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "column": [ + { + "value": "485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&*" + }, + { + "value": "true", + "type": "bool" + }, + { + "value": "1696248667889", + "type": "long" + }, + { + "type": "date", + "value": "2013-07-06 00:00:00", + "dateFormat": "yyyy-mm-dd hh:mm:ss" + }, + { + "value": "3.141592653578", + "type": "double" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209" + }, + { + "value": "100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209" + }, + { + "value": "12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + } + ] + } + } + } + ] + } +} +``` + +### 4.2 测试报告 + + +| 并发任务数| DataX速度(Rec/s)|DataX流量(MB/S)|网卡流量(MB/S)|DataX运行负载| +|--------| --------|--------|--------|--------| +|1|117507|50.20|53.7|0.62| +|2|232976|99.54|108.1|0.99| +|4|387382|165.51|181.3|1.98| +|5|426054|182.03|202.2|2.35| +|6|434793|185.76|204.7|2.77| +|8|495904|211.87|230.2|2.86| +|16|501596|214.31|234.7|2.84| +|32|501577|214.30|234.7|2.99| +|64|501625|214.32|234.7|3.22| + +说明: + +1. OdpsReader 影响速度最主要的是channel数目,这里到达8时已经打满网卡,过多调大反而会影响系统性能。 +2. 
channel数目的选择,可以考虑odps表文件组织,可尝试合并小文件再进行同步调优。 + + +## 5 约束限制 + + + + +## FAQ(待补充) + +*** + +**Q: 你来问** + +A: 我来答。 + +*** + diff --git a/odpsreader/odpsreader.iml b/odpsreader/odpsreader.iml new file mode 100644 index 000000000..811ad6ab5 --- /dev/null +++ b/odpsreader/odpsreader.iml @@ -0,0 +1,50 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/odpsreader/pom.xml b/odpsreader/pom.xml new file mode 100755 index 000000000..1cf43b32a --- /dev/null +++ b/odpsreader/pom.xml @@ -0,0 +1,118 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + odpsreader + odpsreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + com.aliyun.odps + odps-sdk-core + 0.19.3-public + + + + org.mockito + mockito-core + 1.8.5 + test + + + org.powermock + powermock-api-mockito + 1.4.10 + test + + + org.powermock + powermock-module-junit4 + 1.4.10 + test + + + + org.mockito + mockito-core + 1.8.5 + test + + + org.powermock + powermock-api-mockito + 1.4.10 + test + + + + org.powermock + powermock-module-junit4 + 1.4.10 + test + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/odpsreader/src/main/assembly/package.xml b/odpsreader/src/main/assembly/package.xml new file mode 100755 index 000000000..db659a179 --- /dev/null +++ b/odpsreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/odpsreader + + + target/ + + odpsreader-0.0.1-SNAPSHOT.jar + + plugin/reader/odpsreader + + + + + + false + plugin/reader/odpsreader/libs + runtime + + + diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ColumnType.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ColumnType.java new file mode 100644 index 000000000..eb674a7f6 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ColumnType.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +public enum ColumnType { + PARTITION, NORMAL, CONSTANT, UNKNOWN, ; + + @Override + public String toString() { + switch (this) { + case PARTITION: + return "partition"; + case NORMAL: + return "normal"; + case CONSTANT: + return "constant"; + default: + return "unknown"; + } + } + + public static ColumnType asColumnType(String columnTypeString) { + if ("partition".equals(columnTypeString)) { + return PARTITION; + } else if ("normal".equals(columnTypeString)) { + return NORMAL; + } else if ("constant".equals(columnTypeString)) { + return CONSTANT; + } else { + return UNKNOWN; + } + } +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Constant.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Constant.java new file mode 100755 index 000000000..c3c674ddd --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Constant.java @@ -0,0 +1,35 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +public class Constant { + + public final static String START_INDEX = "startIndex"; + + public final static String STEP_COUNT = 
"stepCount"; + + public final static String SESSION_ID = "sessionId"; + + public final static String IS_PARTITIONED_TABLE = "isPartitionedTable"; + + public static final String DEFAULT_SPLIT_MODE = "record"; + + public static final String PARTITION_SPLIT_MODE = "partition"; + + public static final String DEFAULT_ACCOUNT_TYPE = "aliyun"; + + public static final String TAOBAO_ACCOUNT_TYPE = "taobao"; + + // 常量字段用COLUMN_CONSTANT_FLAG 首尾包住即可 + public final static String COLUMN_CONSTANT_FLAG = "'"; + + /** + * 以下是获取accesskey id 需要用到的常量值 + */ + public static final String SKYNET_ACCESSID = "SKYNET_ACCESSID"; + + public static final String SKYNET_ACCESSKEY = "SKYNET_ACCESSKEY"; + + public static final String PARTITION_COLUMNS = "partitionColumns"; + + public static final String PARSED_COLUMNS = "parsedColumns"; + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Key.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Key.java new file mode 100755 index 000000000..9537cb939 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Key.java @@ -0,0 +1,34 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +public class Key { + + public final static String ACCESS_ID = "accessId"; + + public final static String ACCESS_KEY = "accessKey"; + + public static final String PROJECT = "project"; + + public final static String TABLE = "table"; + + public final static String PARTITION = "partition"; + + public final static String ODPS_SERVER = "odpsServer"; + + // 线上环境不需要填写,线下环境必填 + public final static String TUNNEL_SERVER = "tunnelServer"; + + public final static String COLUMN = "column"; + + // 当值为:partition 则只切分到分区;当值为:record,则当按照分区切分后达不到adviceNum时,继续按照record切分 + public final static String SPLIT_MODE = "splitMode"; + + // 账号类型,默认为aliyun,也可能为taobao等其他类型 + public final static String ACCOUNT_TYPE = "accountType"; + + public final static String PACKAGE_AUTHORIZED_PROJECT = "packageAuthorizedProject"; + + public final static String IS_COMPRESS = "isCompress"; + + public final static String MAX_RETRY_TIME = "maxRetryTime"; + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReader.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReader.java new file mode 100755 index 000000000..f5cf10ca2 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReader.java @@ -0,0 +1,390 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.FilterUtil; +import com.alibaba.datax.plugin.reader.odpsreader.util.IdAndKeyUtil; +import com.alibaba.datax.plugin.reader.odpsreader.util.OdpsSplitUtil; +import com.alibaba.datax.plugin.reader.odpsreader.util.OdpsUtil; +import com.aliyun.odps.*; +import com.aliyun.odps.tunnel.TableTunnel.DownloadSession; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.MutablePair; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; + +public class OdpsReader extends Reader { + public static class Job extends Reader.Job { + private 
static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + private static boolean IS_DEBUG = LOG.isDebugEnabled(); + + private Configuration originalConfig; + private Odps odps; + private Table table; + + public void preCheck() { + this.init(); + } + + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + //如果用户没有配置accessId/accessKey,尝试从环境变量获取 + String accountType = originalConfig.getString(Key.ACCOUNT_TYPE, Constant.DEFAULT_ACCOUNT_TYPE); + if (Constant.DEFAULT_ACCOUNT_TYPE.equalsIgnoreCase(accountType)) { + this.originalConfig = IdAndKeyUtil.parseAccessIdAndKey(this.originalConfig); + } + + //检查必要的参数配置 + OdpsUtil.checkNecessaryConfig(this.originalConfig); + //重试次数的配置检查 + OdpsUtil.dealMaxRetryTime(this.originalConfig); + + //确定切分模式 + dealSplitMode(this.originalConfig); + + this.odps = OdpsUtil.initOdps(this.originalConfig); + String tableName = this.originalConfig.getString(Key.TABLE); + String projectName = this.originalConfig.getString(Key.PROJECT); + + this.table = OdpsUtil.getTable(this.odps, projectName, tableName); + this.originalConfig.set(Constant.IS_PARTITIONED_TABLE, + OdpsUtil.isPartitionedTable(table)); + + boolean isVirtualView = this.table.isVirtualView(); + if (isVirtualView) { + throw DataXException.asDataXException(OdpsReaderErrorCode.VIRTUAL_VIEW_NOT_SUPPORT, + String.format("源头表:%s 是虚拟视图,DataX 不支持读取虚拟视图.", tableName)); + } + + this.dealPartition(this.table); + this.dealColumn(this.table); + } + + private void dealSplitMode(Configuration originalConfig) { + String splitMode = originalConfig.getString(Key.SPLIT_MODE, Constant.DEFAULT_SPLIT_MODE).trim(); + if (splitMode.equalsIgnoreCase(Constant.DEFAULT_SPLIT_MODE) || + splitMode.equalsIgnoreCase(Constant.PARTITION_SPLIT_MODE)) { + originalConfig.set(Key.SPLIT_MODE, splitMode); + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.SPLIT_MODE_ERROR, + String.format("您所配置的 splitMode:%s 不正确. splitMode 仅允许配置为 record 或者 partition.", splitMode)); + } + } + + /** + * 对分区的配置处理。最终效果是所有正则配置,完全展开成实际对应的分区配置。正则规则如下: + *

+ *

    + *
1. 如果是分区表,则必须配置分区:可以配置为*,表示整表读取;也可以配置为分别列出要读取的叶子分区. 分区配置中,不能在分区所表示的数组中配置多个*,因为那样就是多次读取全表,无意义. (TODO: 未来会支持一些常用的分区正则筛选配置.)
 + * 2. 如果是非分区表,则不能配置分区值.
+ */ + private void dealPartition(Table table) { + List userConfiguredPartitions = this.originalConfig.getList( + Key.PARTITION, String.class); + + boolean isPartitionedTable = this.originalConfig.getBool(Constant.IS_PARTITIONED_TABLE); + List partitionColumns = new ArrayList(); + + if (isPartitionedTable) { + // 分区表,需要配置分区 + if (null == userConfiguredPartitions || userConfiguredPartitions.isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区信息没有配置.由于源头表:%s 为分区表, 所以您需要配置其抽取的表的分区信息. 格式形如:pt=hello,ds=hangzhou,请您参考此格式修改该配置项.", + table.getName())); + } else { + List allPartitions = OdpsUtil.getTableAllPartitions(table); + + if (null == allPartitions || allPartitions.isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区信息配置错误.源头表:%s 虽然为分区表, 但其实际分区值并不存在. 请确认源头表已经生成该分区,再进行数据抽取.", + table.getName())); + } + + List parsedPartitions = expandUserConfiguredPartition( + allPartitions, userConfiguredPartitions); + + if (null == parsedPartitions || parsedPartitions.isEmpty()) { + throw DataXException.asDataXException( + OdpsReaderErrorCode.PARTITION_ERROR, + String.format( + "分区配置错误,根据您所配置的分区没有匹配到源头表中的分区. 源头表所有分区是:[\n%s\n], 您配置的分区是:[\n%s\n]. 请您根据实际情况在作出修改. ", + StringUtils.join(allPartitions, "\n"), + StringUtils.join(userConfiguredPartitions, "\n"))); + } + this.originalConfig.set(Key.PARTITION, parsedPartitions); + + for (Column column : table.getSchema() + .getPartitionColumns()) { + partitionColumns.add(column.getName()); + } + } + } else { + // 非分区表,则不能配置分区 + if (null != userConfiguredPartitions + && !userConfiguredPartitions.isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区配置错误,源头表:%s 为非分区表, 您不能配置分区. 请您删除该配置项. ", table.getName())); + } + } + + this.originalConfig.set(Constant.PARTITION_COLUMNS, partitionColumns); + if (isPartitionedTable) { + LOG.info("{源头表:{} 的所有分区列是:[{}]}", table.getName(), + StringUtils.join(partitionColumns, ",")); + } + } + + private List expandUserConfiguredPartition( + List allPartitions, List userConfiguredPartitions) { + // 对odps 本身的所有分区进行特殊字符的处理 + List allStandardPartitions = OdpsUtil + .formatPartitions(allPartitions); + + // 对用户自身配置的所有分区进行特殊字符的处理 + List allStandardUserConfiguredPartitions = OdpsUtil + .formatPartitions(userConfiguredPartitions); + + /** + * 对配置的分区级数(深度)进行检查 + * (1)先检查用户配置的分区级数,自身级数是否相等 + * (2)检查用户配置的分区级数是否与源头表的的分区级数一样 + */ + String firstPartition = allStandardUserConfiguredPartitions.get(0); + int firstPartitionDepth = firstPartition.split(",").length; + + String comparedPartition = null; + int comparedPartitionDepth = -1; + for (int i = 1, len = allStandardUserConfiguredPartitions.size(); i < len; i++) { + comparedPartition = allStandardUserConfiguredPartitions.get(i); + comparedPartitionDepth = comparedPartition.split(",").length; + if (comparedPartitionDepth != firstPartitionDepth) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区配置错误, 您所配置的分区级数和该表的实际情况不一致, 比如分区:[%s] 是 %s 级分区, 而分区:[%s] 是 %s 级分区. DataX 是通过英文逗号判断您所配置的分区级数的. 正确的格式形如\"pt=${bizdate}, type=0\" ,请您参考示例修改该配置项. 
", + firstPartition, firstPartitionDepth, comparedPartition, comparedPartitionDepth)); + } + } + + int tableOriginalPartitionDepth = allStandardPartitions.get(0).split(",").length; + if (firstPartitionDepth != tableOriginalPartitionDepth) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区配置错误, 您所配置的分区:%s 的级数:%s 与您要读取的 ODPS 源头表的分区级数:%s 不相等. DataX 是通过英文逗号判断您所配置的分区级数的.正确的格式形如\"pt=${bizdate}, type=0\" ,请您参考示例修改该配置项.", + firstPartition, firstPartitionDepth, tableOriginalPartitionDepth)); + } + + List retPartitions = FilterUtil.filterByRegulars(allStandardPartitions, + allStandardUserConfiguredPartitions); + + return retPartitions; + } + + private void dealColumn(Table table) { + // 用户配置的 column 之前已经确保其不为空 + List userConfiguredColumns = this.originalConfig.getList( + Key.COLUMN, String.class); + + List allColumns = OdpsUtil.getTableAllColumns(table); + List allNormalColumns = OdpsUtil + .getTableOriginalColumnNameList(allColumns); + + StringBuilder columnMeta = new StringBuilder(); + for (Column column : allColumns) { + columnMeta.append(column.getName()).append(":").append(column.getType()).append(","); + } + columnMeta.setLength(columnMeta.length() - 1); + + LOG.info("源头表:{} 的所有字段是:[{}]", table.getName(), columnMeta.toString()); + + if (1 == userConfiguredColumns.size() + && "*".equals(userConfiguredColumns.get(0))) { + LOG.warn("这是一条警告信息,您配置的 ODPS 读取的列为*,这是不推荐的行为,因为当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错. 建议您把所有需要抽取的列都配置上. "); + this.originalConfig.set(Key.COLUMN, allNormalColumns); + } + + userConfiguredColumns = this.originalConfig.getList( + Key.COLUMN, String.class); + + /** + * warn: 字符串常量需要与表原生字段tableOriginalColumnNameList 分开存放 demo: + * ["id","'id'","name"] + */ + List allPartitionColumns = this.originalConfig.getList( + Constant.PARTITION_COLUMNS, String.class); + List> parsedColumns = OdpsUtil + .parseColumns(allNormalColumns, allPartitionColumns, + userConfiguredColumns); + + this.originalConfig.set(Constant.PARSED_COLUMNS, parsedColumns); + + StringBuilder sb = new StringBuilder(); + sb.append("[ "); + for (int i = 0, len = parsedColumns.size(); i < len; i++) { + Pair pair = parsedColumns.get(i); + sb.append(String.format(" %s : %s", pair.getLeft(), + pair.getRight())); + if (i != len - 1) { + sb.append(","); + } + } + sb.append(" ]"); + LOG.info("parsed column details: {} .", sb.toString()); + } + + + @Override + public void prepare() { + } + + @Override + public List split(int adviceNumber) { + return OdpsSplitUtil.doSplit(this.originalConfig, this.odps, adviceNumber); + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + } + + public static class Task extends Reader.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + private Configuration readerSliceConf; + + private String tunnelServer; + private Odps odps = null; + private Table table = null; + private String projectName = null; + private String tableName = null; + private boolean isPartitionedTable; + private String sessionId; + private boolean isCompress; + + @Override + public void init() { + this.readerSliceConf = super.getPluginJobConf(); + this.tunnelServer = this.readerSliceConf.getString( + Key.TUNNEL_SERVER, null); + + this.odps = OdpsUtil.initOdps(this.readerSliceConf); + this.projectName = this.readerSliceConf.getString(Key.PROJECT); + this.tableName = this.readerSliceConf.getString(Key.TABLE); + this.table = OdpsUtil.getTable(this.odps, projectName, tableName); + this.isPartitionedTable = this.readerSliceConf 
+ .getBool(Constant.IS_PARTITIONED_TABLE); + this.sessionId = this.readerSliceConf.getString(Constant.SESSION_ID, null); + + + + this.isCompress = this.readerSliceConf.getBool(Key.IS_COMPRESS, false); + + // sessionId 为空的情况是:切分级别只到 partition 的情况 + if (StringUtils.isBlank(this.sessionId)) { + DownloadSession session = OdpsUtil.createMasterSessionForPartitionedTable(odps, + tunnelServer, projectName, tableName, this.readerSliceConf.getString(Key.PARTITION)); + this.sessionId = session.getId(); + } + + LOG.info("sessionId:{}", this.sessionId); + } + + @Override + public void prepare() { + } + + @Override + public void startRead(RecordSender recordSender) { + DownloadSession downloadSession = null; + String partition = this.readerSliceConf.getString(Key.PARTITION); + + if (this.isPartitionedTable) { + downloadSession = OdpsUtil.getSlaveSessionForPartitionedTable(this.odps, this.sessionId, + this.tunnelServer, this.projectName, this.tableName, partition); + } else { + downloadSession = OdpsUtil.getSlaveSessionForNonPartitionedTable(this.odps, this.sessionId, + this.tunnelServer, this.projectName, this.tableName); + } + + long start = this.readerSliceConf.getLong(Constant.START_INDEX, 0); + long count = this.readerSliceConf.getLong(Constant.STEP_COUNT, + downloadSession.getRecordCount()); + + if (count > 0) { + LOG.info(String.format( + "Begin to read ODPS table:%s, partition:%s, startIndex:%s, count:%s.", + this.tableName, partition, start, count)); + } else if (count == 0) { + LOG.warn(String.format("源头表:%s 的分区:%s 没有内容可抽取, 请您知晓.", + this.tableName, partition)); + return; + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.READ_DATA_FAIL, + String.format("源头表:%s 的分区:%s 读取行数为负数, 请联系 ODPS 管理员查看表状态!", + this.tableName, partition)); + } + + TableSchema tableSchema = this.table.getSchema(); + Set allColumns = new HashSet(); + allColumns.addAll(tableSchema.getColumns()); + allColumns.addAll(tableSchema.getPartitionColumns()); + + Map columnTypeMap = new HashMap(); + for (Column column : allColumns) { + columnTypeMap.put(column.getName(), column.getType()); + } + + try { + List parsedColumnsTmp = this.readerSliceConf + .getListConfiguration(Constant.PARSED_COLUMNS); + List> parsedColumns = new ArrayList>(); + for (int i = 0; i < parsedColumnsTmp.size(); i++) { + Configuration eachColumnConfig = parsedColumnsTmp.get(i); + String columnName = eachColumnConfig.getString("left"); + ColumnType columnType = ColumnType + .asColumnType(eachColumnConfig.getString("right")); + parsedColumns.add(new MutablePair( + columnName, columnType)); + + } + ReaderProxy readerProxy = new ReaderProxy(recordSender, downloadSession, + columnTypeMap, parsedColumns, partition, this.isPartitionedTable, + start, count, this.isCompress); + + readerProxy.doRead(); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.READ_DATA_FAIL, + String.format("源头表:%s 的分区:%s 读取失败, 请联系 ODPS 管理员查看错误详情.", this.tableName, partition), e); + } + + } + + + @Override + public void post() { + } + + @Override + public void destroy() { + } + + } +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReaderErrorCode.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReaderErrorCode.java new file mode 100755 index 000000000..cdda6ac86 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReaderErrorCode.java @@ -0,0 +1,60 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +import 
com.alibaba.datax.common.spi.ErrorCode; + +public enum OdpsReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("OdpsReader-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("OdpsReader-01", "您配置的值不合法."), + CREATE_DOWNLOADSESSION_FAIL("OdpsReader-03", "创建 ODPS 的 downloadSession 失败."), + GET_DOWNLOADSESSION_FAIL("OdpsReader-04", "获取 ODPS 的 downloadSession 失败."), + READ_DATA_FAIL("OdpsReader-05", "读取 ODPS 源头表失败."), + GET_ID_KEY_FAIL("OdpsReader-06", "获取 accessId/accessKey 失败."), + + ODPS_READ_EXCEPTION("OdpsReader-07", "读取 odps 异常"), + OPEN_RECORD_READER_FAILED("OdpsReader-08", "打开 recordReader 失败."), + + ODPS_PROJECT_NOT_FOUNT("OdpsReader-10", "您配置的值不合法, odps project 不存在."), //ODPS-0420111: Project not found + + ODPS_TABLE_NOT_FOUNT("OdpsReader-12", "您配置的值不合法, odps table 不存在."), // ODPS-0130131:Table not found + + ODPS_ACCESS_KEY_ID_NOT_FOUND("OdpsReader-13", "您配置的值不合法, odps accessId,accessKey 不存在."), //ODPS-0410051:Invalid credentials - accessKeyId not found + + ODPS_ACCESS_KEY_INVALID("OdpsReader-14", "您配置的值不合法, odps accessKey 错误."), //ODPS-0410042:Invalid signature value - User signature dose not match + + ODPS_ACCESS_DENY("OdpsReader-15", "拒绝访问, 您不在 您配置的 project 中."), //ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn't exist in project + + + + SPLIT_MODE_ERROR("OdpsReader-30", "splitMode配置错误."), + + ACCOUNT_TYPE_ERROR("OdpsReader-31", "odps 账号类型错误."), + + VIRTUAL_VIEW_NOT_SUPPORT("OdpsReader-32", "Datax 不支持 读取虚拟视图."), + + PARTITION_ERROR("OdpsReader-33", "分区配置错误."), + + ; + private final String code; + private final String description; + + private OdpsReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ReaderProxy.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ReaderProxy.java new file mode 100755 index 000000000..8e069ef56 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ReaderProxy.java @@ -0,0 +1,281 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.plugin.reader.odpsreader.util.OdpsUtil; +import com.aliyun.odps.OdpsType; +import com.aliyun.odps.data.Record; +import com.aliyun.odps.data.RecordReader; +import com.aliyun.odps.tunnel.TableTunnel; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.ParseException; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +public class ReaderProxy { + private static final Logger LOG = LoggerFactory + .getLogger(ReaderProxy.class); + private static boolean IS_DEBUG = LOG.isDebugEnabled(); + + private RecordSender recordSender; + private TableTunnel.DownloadSession downloadSession; + private Map columnTypeMap; + private List> parsedColumns; + private String partition; + private boolean isPartitionTable; + + private long start; + private long count; + private boolean isCompress; + + public ReaderProxy(RecordSender recordSender, TableTunnel.DownloadSession downloadSession, + Map columnTypeMap, + List> parsedColumns, String partition, + boolean isPartitionTable, long start, long count, boolean isCompress) { + this.recordSender = recordSender; + this.downloadSession = downloadSession; + this.columnTypeMap = columnTypeMap; + this.parsedColumns = parsedColumns; + this.partition = partition; + this.isPartitionTable = isPartitionTable; + this.start = start; + this.count = count; + this.isCompress = isCompress; + } + + // warn: odps 分区列和正常列不能重名, 所有列都不不区分大小写 + public void doRead() { + try { + LOG.info("start={}, count={}",start, count); + //RecordReader recordReader = downloadSession.openRecordReader(start, count, isCompress); + RecordReader recordReader = OdpsUtil.getRecordReader(downloadSession, start, count, isCompress); + + Record odpsRecord; + Map partitionMap = this + .parseCurrentPartitionValue(); + + int retryTimes = 1; + while (true) { + try { + odpsRecord = recordReader.read(); + } catch(Exception e) { + //odps read 异常后重试10次 + LOG.warn("warn : odps read exception: {}", e.getMessage()); + if(retryTimes < 10) { + try { + Thread.sleep(2000); + } catch (InterruptedException ignored) { + } + recordReader = downloadSession.openRecordReader(start, count, isCompress); + LOG.warn("odps-read-exception, 重试第{}次", retryTimes); + retryTimes++; + continue; + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_READ_EXCEPTION, e); + } + } + //记录已经读取的点 + start++; + count--; + + if (odpsRecord != null) { + + com.alibaba.datax.common.element.Record dataXRecord = recordSender + .createRecord(); + // warn: for PARTITION||NORMAL columnTypeMap's key + // sets(columnName) is big than parsedColumns's left + // sets(columnName), always contain + for (Pair pair : this.parsedColumns) { + String columnName = pair.getLeft(); + switch (pair.getRight()) { + case PARTITION: + String partitionColumnValue = this + .getPartitionColumnValue(partitionMap, + columnName); + 
this.odpsColumnToDataXField(odpsRecord, dataXRecord, + this.columnTypeMap.get(columnName), + partitionColumnValue, true); + break; + case NORMAL: + this.odpsColumnToDataXField(odpsRecord, dataXRecord, + this.columnTypeMap.get(columnName), columnName, + false); + break; + case CONSTANT: + dataXRecord.addColumn(new StringColumn(columnName)); + break; + default: + break; + } + } + recordSender.sendToWriter(dataXRecord); + } else { + break; + } + } + //fixed, 避免recordReader.close失败,跟鸣天确认过,可以不用关闭RecordReader + try { + recordReader.close(); + } catch (Exception e) { + LOG.warn("recordReader close exception", e); + } + } catch (DataXException e) { + throw e; + } catch (Exception e) { + // warn: if dirty + throw DataXException.asDataXException( + OdpsReaderErrorCode.READ_DATA_FAIL, e); + } + } + + private Map parseCurrentPartitionValue() { + Map partitionMap = new HashMap(); + if (this.isPartitionTable) { + String[] splitedPartition = this.partition.split(","); + for (String eachPartition : splitedPartition) { + String[] partitionDetail = eachPartition.split("="); + // warn: check partition like partition=1 + if (2 != partitionDetail.length) { + throw DataXException + .asDataXException( + OdpsReaderErrorCode.ILLEGAL_VALUE, + String.format( + "您的分区 [%s] 解析出现错误,解析后正确的配置方式类似为 [ pt=1,dt=1 ].", + eachPartition)); + } + // warn: translate to lower case, it's more comfortable to + // compare whit user's input columns + String partitionName = partitionDetail[0].toLowerCase(); + String partitionValue = partitionDetail[1]; + partitionMap.put(partitionName, partitionValue); + } + } + if (IS_DEBUG) { + LOG.debug(String.format("partition value details: %s", + com.alibaba.fastjson.JSON.toJSONString(partitionMap))); + } + return partitionMap; + } + + private String getPartitionColumnValue(Map partitionMap, + String partitionColumnName) { + // warn: to lower case + partitionColumnName = partitionColumnName.toLowerCase(); + // it's will never happen, but add this checking + if (!partitionMap.containsKey(partitionColumnName)) { + String errorMessage = String.format( + "表所有分区信息为: %s 其中找不到 [%s] 对应的分区值.", + com.alibaba.fastjson.JSON.toJSONString(partitionMap), + partitionColumnName); + throw DataXException.asDataXException( + OdpsReaderErrorCode.READ_DATA_FAIL, errorMessage); + } + return partitionMap.get(partitionColumnName); + } + + /** + * TODO warn: odpsRecord 的 String 可能获取出来的是 binary + * + * warn: there is no dirty data in reader plugin, so do not handle dirty + * data with TaskPluginCollector + * + * warn: odps only support BIGINT && String partition column actually + * + * @param odpsRecord + * every line record of odps table + * @param dataXRecord + * every datax record, to be send to writer. 
method getXXX() case sensitive + * @param type + * odps column type + * @param columnNameValue + * for partition column it's column value, for normal column it's + * column name + * @param isPartitionColumn + * true means partition column and false means normal column + * */ + private void odpsColumnToDataXField(Record odpsRecord, + com.alibaba.datax.common.element.Record dataXRecord, OdpsType type, + String columnNameValue, boolean isPartitionColumn) { + switch (type) { + case BIGINT: { + if (isPartitionColumn) { + dataXRecord.addColumn(new LongColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new LongColumn(odpsRecord + .getBigint(columnNameValue))); + } + break; + } + case BOOLEAN: { + if (isPartitionColumn) { + dataXRecord.addColumn(new BoolColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new BoolColumn(odpsRecord + .getBoolean(columnNameValue))); + } + break; + } + case DATETIME: { + if (isPartitionColumn) { + try { + dataXRecord.addColumn(new DateColumn(ColumnCast + .string2Date(new StringColumn(columnNameValue)))); + } catch (ParseException e) { + LOG.error(String.format("", this.partition)); + String errMessage = String.format( + "您读取分区 [%s] 出现日期转换异常, 日期的字符串表示为 [%s].", + this.partition, columnNameValue); + LOG.error(errMessage); + throw DataXException.asDataXException( + OdpsReaderErrorCode.READ_DATA_FAIL, errMessage, e); + } + } else { + dataXRecord.addColumn(new DateColumn(odpsRecord + .getDatetime(columnNameValue))); + } + + break; + } + case DOUBLE: { + if (isPartitionColumn) { + dataXRecord.addColumn(new DoubleColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new DoubleColumn(odpsRecord + .getDouble(columnNameValue))); + } + break; + } + case DECIMAL: { + if(isPartitionColumn) { + dataXRecord.addColumn(new DoubleColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new DoubleColumn(odpsRecord.getDecimal(columnNameValue))); + } + break; + } + case STRING: { + if (isPartitionColumn) { + dataXRecord.addColumn(new StringColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new StringColumn(odpsRecord + .getString(columnNameValue))); + } + break; + } + default: + throw DataXException + .asDataXException( + OdpsReaderErrorCode.ILLEGAL_VALUE, + String.format( + "DataX 抽取 ODPS 数据不支持字段类型为:[%s]. 目前支持抽取的字段类型有:bigint, boolean, datetime, double, decimal, string. " + + "您可以选择不抽取 DataX 不支持的字段或者联系 ODPS 管理员寻求帮助.", + type)); + } + } + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/DESCipher.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/DESCipher.java new file mode 100644 index 000000000..dad82d501 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/DESCipher.java @@ -0,0 +1,355 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.alibaba.datax.plugin.reader.odpsreader.util; + +import javax.crypto.Cipher; +import javax.crypto.SecretKey; +import javax.crypto.SecretKeyFactory; +import javax.crypto.spec.DESKeySpec; +import java.security.SecureRandom; + +/** + *   * DES加解密,支持与delphi交互(字符串编码需统一为UTF-8) + * + *   * + * + *   * @author wym + * + *    + */ + +public class DESCipher { + + /** + *   * 密钥 + * + *    + */ + + public static final String KEY = "u4Gqu4Z8"; + + private final static String DES = "DES"; + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @param key + * + *   * 密钥,长度必须是8的倍数 + * + *   * @return 密文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] encrypt(byte[] src, byte[] key) throws Exception { + + // DES算法要求有一个可信任的随机数源 + + SecureRandom sr = new SecureRandom(); + + // 从原始密匙数据创建DESKeySpec对象 + + DESKeySpec dks = new DESKeySpec(key); + + // 创建一个密匙工厂,然后用它把DESKeySpec转换成 + + // 一个SecretKey对象 + + SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(DES); + + SecretKey securekey = keyFactory.generateSecret(dks); + + // Cipher对象实际完成加密操作 + + Cipher cipher = Cipher.getInstance(DES); + + // 用密匙初始化Cipher对象 + + cipher.init(Cipher.ENCRYPT_MODE, securekey, sr); + + // 现在,获取数据并加密 + + // 正式执行加密操作 + + return cipher.doFinal(src); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @param key + * + *   * 密钥,长度必须是8的倍数 + * + *   * @return 明文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] decrypt(byte[] src, byte[] key) throws Exception { + + // DES算法要求有一个可信任的随机数源 + + SecureRandom sr = new SecureRandom(); + + // 从原始密匙数据创建一个DESKeySpec对象 + + DESKeySpec dks = new DESKeySpec(key); + + // 创建一个密匙工厂,然后用它把DESKeySpec对象转换成 + + // 一个SecretKey对象 + + SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(DES); + + SecretKey securekey = keyFactory.generateSecret(dks); + + // Cipher对象实际完成解密操作 + + Cipher cipher = Cipher.getInstance(DES); + + // 用密匙初始化Cipher对象 + + cipher.init(Cipher.DECRYPT_MODE, securekey, sr); + + // 现在,获取数据并解密 + + // 正式执行解密操作 + + return cipher.doFinal(src); + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @return 密文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] encrypt(byte[] src) throws Exception { + + return encrypt(src, KEY.getBytes()); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @return 明文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] decrypt(byte[] src) throws Exception { + + return decrypt(src, KEY.getBytes()); + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字符串) + * + *   * @return 密文(16进制字符串) + * + *   * @throws Exception + * + *    + */ + + public final static String encrypt(String src) { + + try { + + return byte2hex(encrypt(src.getBytes(), KEY.getBytes())); + + } catch (Exception e) { + + e.printStackTrace(); + + } + + return null; + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字符串) + * + *   * @return 明文(字符串) + * + *   * @throws Exception + * + *    + */ + + public final static String decrypt(String src) { + try { + + return new String(decrypt(hex2byte(src.getBytes()), KEY.getBytes())); + + } catch (Exception e) { + + e.printStackTrace(); + + } + + return null; + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @return 密文(16进制字符串) + * + *   * @throws Exception + * + *    + */ + + public static String 
encryptToString(byte[] src) throws Exception { + + return encrypt(new String(src)); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @return 明文(字符串) + * + *   * @throws Exception + * + *    + */ + + public static String decryptToString(byte[] src) throws Exception { + + return decrypt(new String(src)); + + } + + public static String byte2hex(byte[] b) { + + String hs = ""; + + String stmp = ""; + + for (int n = 0; n < b.length; n++) { + + stmp = (Integer.toHexString(b[n] & 0XFF)); + + if (stmp.length() == 1) + + hs = hs + "0" + stmp; + + else + + hs = hs + stmp; + + } + + return hs.toUpperCase(); + + } + + public static byte[] hex2byte(byte[] b) { + + if ((b.length % 2) != 0) + + throw new IllegalArgumentException("长度不是偶数"); + + byte[] b2 = new byte[b.length / 2]; + + for (int n = 0; n < b.length; n += 2) { + + String item = new String(b, n, 2); + + b2[n / 2] = (byte) Integer.parseInt(item, 16); + + } + return b2; + + } + + /* + * public static void main(String[] args) { try { String src = "cheetah"; + * String crypto = DESCipher.encrypt(src); System.out.println("密文[" + src + + * "]:" + crypto); System.out.println("解密后:" + DESCipher.decrypt(crypto)); } + * catch (Exception e) { e.printStackTrace(); } } + */ +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/IdAndKeyUtil.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/IdAndKeyUtil.java new file mode 100644 index 000000000..faa90a987 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/IdAndKeyUtil.java @@ -0,0 +1,85 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.alibaba.datax.plugin.reader.odpsreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.odpsreader.Constant; +import com.alibaba.datax.plugin.reader.odpsreader.Key; +import com.alibaba.datax.plugin.reader.odpsreader.OdpsReaderErrorCode; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Map; + +public class IdAndKeyUtil { + private static Logger LOG = LoggerFactory.getLogger(IdAndKeyUtil.class); + + public static Configuration parseAccessIdAndKey(Configuration originalConfig) { + String accessId = originalConfig.getString(Key.ACCESS_ID); + String accessKey = originalConfig.getString(Key.ACCESS_KEY); + + // 只要 accessId,accessKey 二者配置了一个,就理解为是用户本意是要直接手动配置其 accessid/accessKey + if (StringUtils.isNotBlank(accessId) || StringUtils.isNotBlank(accessKey)) { + LOG.info("Try to get accessId/accessKey from your config."); + //通过如下语句,进行检查是否确实配置了 + accessId = originalConfig.getNecessaryValue(Key.ACCESS_ID, OdpsReaderErrorCode.REQUIRED_VALUE); + accessKey = originalConfig.getNecessaryValue(Key.ACCESS_KEY, OdpsReaderErrorCode.REQUIRED_VALUE); + //检查完毕,返回即可 + return originalConfig; + } else { + Map envProp = System.getenv(); + return getAccessIdAndKeyFromEnv(originalConfig, envProp); + } + } + + private static Configuration getAccessIdAndKeyFromEnv(Configuration originalConfig, + Map envProp) { + String accessId = null; + String accessKey = null; + + String skynetAccessID = envProp.get(Constant.SKYNET_ACCESSID); + String skynetAccessKey = envProp.get(Constant.SKYNET_ACCESSKEY); + + if (StringUtils.isNotBlank(skynetAccessID) + || StringUtils.isNotBlank(skynetAccessKey)) { + /** + * 环境变量中,如果存在SKYNET_ACCESSID/SKYNET_ACCESSKEy(只要有其中一个变量,则认为一定是两个都存在的!), + * 则使用其值作为odps的accessId/accessKey(会解密) + */ + + LOG.info("Try to get accessId/accessKey from environment."); + accessId = skynetAccessID; + accessKey = DESCipher.decrypt(skynetAccessKey); + if (StringUtils.isNotBlank(accessKey)) { + originalConfig.set(Key.ACCESS_ID, accessId); + originalConfig.set(Key.ACCESS_KEY, accessKey); + LOG.info("Get accessId/accessKey from environment variables successfully."); + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.GET_ID_KEY_FAIL, + String.format("从环境变量中获取accessId/accessKey 失败, accessId=[%s]", accessId)); + } + } else { + // 无处获取(既没有配置在作业中,也没用在环境变量中) + throw DataXException.asDataXException(OdpsReaderErrorCode.GET_ID_KEY_FAIL, + "无法获取到accessId/accessKey. 它们既不存在于您的配置中,也不存在于环境变量中."); + } + + return originalConfig; + } +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsExceptionMsg.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsExceptionMsg.java new file mode 100644 index 000000000..35ac82214 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsExceptionMsg.java @@ -0,0 +1,18 @@ +package com.alibaba.datax.plugin.reader.odpsreader.util; + +/** + * Created by hongjiao.hj on 2015/6/9. 
+ */ +public class OdpsExceptionMsg { + + public static final String ODPS_PROJECT_NOT_FOUNT = "ODPS-0420111: Project not found"; + + public static final String ODPS_TABLE_NOT_FOUNT = "ODPS-0130131:Table not found"; + + public static final String ODPS_ACCESS_KEY_ID_NOT_FOUND = "ODPS-0410051:Invalid credentials - accessKeyId not found"; + + public static final String ODPS_ACCESS_KEY_INVALID = "ODPS-0410042:Invalid signature value - User signature dose not match"; + + public static final String ODPS_ACCESS_DENY = "ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn't exist in project"; + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsSplitUtil.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsSplitUtil.java new file mode 100755 index 000000000..b7f4f1aaf --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsSplitUtil.java @@ -0,0 +1,168 @@ +package com.alibaba.datax.plugin.reader.odpsreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RangeSplitUtil; +import com.alibaba.datax.plugin.reader.odpsreader.Constant; +import com.alibaba.datax.plugin.reader.odpsreader.Key; +import com.alibaba.datax.plugin.reader.odpsreader.OdpsReaderErrorCode; +import com.aliyun.odps.Odps; +import com.aliyun.odps.tunnel.TableTunnel.DownloadSession; +import org.apache.commons.lang3.tuple.ImmutablePair; +import org.apache.commons.lang3.tuple.Pair; + +import java.util.ArrayList; +import java.util.List; + +public final class OdpsSplitUtil { + + public static List doSplit(Configuration originalConfig, Odps odps, + int adviceNum) { + boolean isPartitionedTable = originalConfig.getBool(Constant.IS_PARTITIONED_TABLE); + if (isPartitionedTable) { + // 分区表 + return splitPartitionedTable(odps, originalConfig, adviceNum); + } else { + // 非分区表 + return splitForNonPartitionedTable(odps, adviceNum, originalConfig); + } + + } + + private static List splitPartitionedTable(Odps odps, Configuration originalConfig, + int adviceNum) { + List splittedConfigs = new ArrayList(); + + List partitions = originalConfig.getList(Key.PARTITION, + String.class); + + if (null == partitions || partitions.isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ILLEGAL_VALUE, + "您所配置的分区不能为空白."); + } + + //splitMode 默认为 record + String splitMode = originalConfig.getString(Key.SPLIT_MODE); + Configuration tempConfig = null; + if (partitions.size() > adviceNum || Constant.PARTITION_SPLIT_MODE.equals(splitMode)) { + // 此时不管 splitMode 是什么,都不需要再进行切分了 + // 注意:此处没有把 sessionId 设置到 config 中去,所以后续在 task 中获取 sessionId 时,需要针对这种情况重新创建 sessionId + for (String onePartition : partitions) { + tempConfig = originalConfig.clone(); + tempConfig.set(Key.PARTITION, onePartition); + splittedConfigs.add(tempConfig); + } + + return splittedConfigs; + } else { + // 还需要计算对每个分区,切分份数等信息 + int eachPartitionShouldSplittedNumber = calculateEachPartitionShouldSplittedNumber( + adviceNum, partitions.size()); + + for (String onePartition : partitions) { + List configs = splitOnePartition(odps, + onePartition, eachPartitionShouldSplittedNumber, + originalConfig); + splittedConfigs.addAll(configs); + } + + return splittedConfigs; + } + } + + private static int calculateEachPartitionShouldSplittedNumber( + int adviceNumber, int partitionNumber) { + double tempNum = 1.0 * adviceNumber / partitionNumber; + + return (int) 
Math.ceil(tempNum); + } + + private static List splitForNonPartitionedTable(Odps odps, + int adviceNum, Configuration sliceConfig) { + List params = new ArrayList(); + + String tunnelServer = sliceConfig.getString(Key.TUNNEL_SERVER); + String tableName = sliceConfig.getString(Key.TABLE); + + String projectName = sliceConfig.getString(Key.PROJECT); + + DownloadSession session = OdpsUtil.createMasterSessionForNonPartitionedTable(odps, + tunnelServer, projectName, tableName); + + String id = session.getId(); + long count = session.getRecordCount(); + + List> splitResult = splitRecordCount(count, adviceNum); + + for (Pair pair : splitResult) { + Configuration iParam = sliceConfig.clone(); + iParam.set(Constant.SESSION_ID, id); + iParam.set(Constant.START_INDEX, pair.getLeft().longValue()); + iParam.set(Constant.STEP_COUNT, pair.getRight().longValue()); + + params.add(iParam); + } + + return params; + } + + private static List splitOnePartition(Odps odps, + String onePartition, int adviceNum, Configuration sliceConfig) { + List params = new ArrayList(); + + String tunnelServer = sliceConfig.getString(Key.TUNNEL_SERVER); + String tableName = sliceConfig.getString(Key.TABLE); + + String projectName = sliceConfig.getString(Key.PROJECT); + + DownloadSession session = OdpsUtil.createMasterSessionForPartitionedTable(odps, + tunnelServer, projectName, tableName, onePartition); + + String id = session.getId(); + long count = session.getRecordCount(); + + List> splitResult = splitRecordCount(count, adviceNum); + + for (Pair pair : splitResult) { + Configuration iParam = sliceConfig.clone(); + iParam.set(Key.PARTITION, onePartition); + iParam.set(Constant.SESSION_ID, id); + iParam.set(Constant.START_INDEX, pair.getLeft().longValue()); + iParam.set(Constant.STEP_COUNT, pair.getRight().longValue()); + + params.add(iParam); + } + + return params; + } + + /** + * Pair left: startIndex, right: stepCount + */ + private static List> splitRecordCount(long recordCount, int adviceNum) { + if(recordCount<0){ + throw new IllegalArgumentException("切分的 recordCount 不能为负数.recordCount=" + recordCount); + } + + if(adviceNum<1){ + throw new IllegalArgumentException("切分的 adviceNum 不能为负数.adviceNum=" + adviceNum); + } + + List> result = new ArrayList>(); + // 为了适配 RangeSplitUtil 的处理逻辑,起始值从0开始计算 + if (recordCount == 0) { + result.add(ImmutablePair.of(0L, 0L)); + return result; + } + + long[] tempResult = RangeSplitUtil.doLongSplit(0L, recordCount - 1, adviceNum); + + tempResult[tempResult.length - 1]++; + + for (int i = 0; i < tempResult.length - 1; i++) { + result.add(ImmutablePair.of(tempResult[i], (tempResult[i + 1] - tempResult[i]))); + } + return result; + } + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsUtil.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsUtil.java new file mode 100755 index 000000000..f441cbb10 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsUtil.java @@ -0,0 +1,377 @@ +package com.alibaba.datax.plugin.reader.odpsreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.reader.odpsreader.ColumnType; +import com.alibaba.datax.plugin.reader.odpsreader.Constant; +import com.alibaba.datax.plugin.reader.odpsreader.Key; +import com.alibaba.datax.plugin.reader.odpsreader.OdpsReaderErrorCode; +import com.aliyun.odps.*; +import 
com.aliyun.odps.account.Account; +import com.aliyun.odps.account.AliyunAccount; +import com.aliyun.odps.data.RecordReader; +import com.aliyun.odps.tunnel.TableTunnel; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.MutablePair; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; +import java.util.concurrent.Callable; + +public final class OdpsUtil { + private static final Logger LOG = LoggerFactory.getLogger(OdpsUtil.class); + + public static int MAX_RETRY_TIME = 10; + + public static void checkNecessaryConfig(Configuration originalConfig) { + originalConfig.getNecessaryValue(Key.ODPS_SERVER, OdpsReaderErrorCode.REQUIRED_VALUE); + + originalConfig.getNecessaryValue(Key.PROJECT, OdpsReaderErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.TABLE, OdpsReaderErrorCode.REQUIRED_VALUE); + + if (null == originalConfig.getList(Key.COLUMN) || + originalConfig.getList(Key.COLUMN, String.class).isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.REQUIRED_VALUE, "datax获取不到源表的列信息, 由于您未配置读取源头表的列信息. datax无法知道该抽取表的哪些字段的数据 " + + "正确的配置方式是给 column 配置上您需要读取的列名称,用英文逗号分隔."); + } + + } + + public static void dealMaxRetryTime(Configuration originalConfig) { + int maxRetryTime = originalConfig.getInt(Key.MAX_RETRY_TIME, + OdpsUtil.MAX_RETRY_TIME); + if (maxRetryTime < 1 || maxRetryTime > OdpsUtil.MAX_RETRY_TIME) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ILLEGAL_VALUE, "您所配置的maxRetryTime 值错误. 该值不能小于1, 且不能大于 " + OdpsUtil.MAX_RETRY_TIME + + ". 推荐的配置方式是给maxRetryTime 配置1-11之间的某个值. 请您检查配置并做出相应修改."); + } + MAX_RETRY_TIME = maxRetryTime; + } + + public static Odps initOdps(Configuration originalConfig) { + String odpsServer = originalConfig.getString(Key.ODPS_SERVER); + + String accessId = originalConfig.getString(Key.ACCESS_ID); + String accessKey = originalConfig.getString(Key.ACCESS_KEY); + String project = originalConfig.getString(Key.PROJECT); + + String packageAuthorizedProject = originalConfig.getString(Key.PACKAGE_AUTHORIZED_PROJECT); + + String defaultProject; + if(StringUtils.isBlank(packageAuthorizedProject)) { + defaultProject = project; + } else { + defaultProject = packageAuthorizedProject; + } + + String accountType = originalConfig.getString(Key.ACCOUNT_TYPE, + Constant.DEFAULT_ACCOUNT_TYPE); + + Account account = null; + if (accountType.equalsIgnoreCase(Constant.DEFAULT_ACCOUNT_TYPE)) { + account = new AliyunAccount(accessId, accessKey); + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.ACCOUNT_TYPE_ERROR, + String.format("不支持的账号类型:[%s]. 
账号类型目前仅支持aliyun, taobao.", accountType)); + } + + Odps odps = new Odps(account); + boolean isPreCheck = originalConfig.getBool("dryRun", false); + if(isPreCheck) { + odps.getRestClient().setConnectTimeout(3); + odps.getRestClient().setReadTimeout(3); + odps.getRestClient().setRetryTimes(2); + } + odps.setDefaultProject(defaultProject); + odps.setEndpoint(odpsServer); + + return odps; + } + + public static Table getTable(Odps odps, String projectName, String tableName) { + final Table table = odps.tables().get(projectName, tableName); + try { + //通过这种方式检查表是否存在,失败重试。重试策略:每秒钟重试一次,最大重试3次 + return RetryUtil.executeWithRetry(new Callable() { + @Override + public Table call() throws Exception { + table.reload(); + return table; + } + }, 3, 1000, false); + } catch (Exception e) { + throwDataXExceptionWhenReloadTable(e, tableName); + } + return table; + } + + public static boolean isPartitionedTable(Table table) { + return getPartitionDepth(table) > 0; + } + + public static int getPartitionDepth(Table table) { + TableSchema tableSchema = table.getSchema(); + + return tableSchema.getPartitionColumns().size(); + } + + public static List getTableAllPartitions(Table table) { + List tableAllPartitions = table.getPartitions(); + + List retPartitions = new ArrayList(); + + if (null != tableAllPartitions) { + for (Partition partition : tableAllPartitions) { + retPartitions.add(partition.getPartitionSpec().toString()); + } + } + + return retPartitions; + } + + public static List getTableAllColumns(Table table) { + TableSchema tableSchema = table.getSchema(); + return tableSchema.getColumns(); + } + + + public static List getTableOriginalColumnNameList( + List columns) { + List tableOriginalColumnNameList = new ArrayList(); + + for (Column column : columns) { + tableOriginalColumnNameList.add(column.getName()); + } + + return tableOriginalColumnNameList; + } + + public static String formatPartition(String partition) { + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ILLEGAL_VALUE, + "您所配置的分区不能为空白."); + } else { + return partition.trim().replaceAll(" *= *", "=") + .replaceAll(" */ *", ",").replaceAll(" *, *", ",") + .replaceAll("'", ""); + } + } + + public static List formatPartitions(List partitions) { + if (null == partitions || partitions.isEmpty()) { + return Collections.emptyList(); + } else { + List formattedPartitions = new ArrayList(); + for (String partition : partitions) { + formattedPartitions.add(formatPartition(partition)); + + } + return formattedPartitions; + } + } + + public static List> parseColumns( + List allNormalColumns, List allPartitionColumns, + List userConfiguredColumns) { + List> parsededColumns = new ArrayList>(); + // warn: upper & lower case + for (String column : userConfiguredColumns) { + MutablePair pair = new MutablePair(); + + // if constant column + if (OdpsUtil.checkIfConstantColumn(column)) { + // remove first and last ' + pair.setLeft(column.substring(1, column.length() - 1)); + pair.setRight(ColumnType.CONSTANT); + parsededColumns.add(pair); + continue; + } + + // if normal column, warn: in o d p s normal columns can not + // repeated in partitioning columns + int index = OdpsUtil.indexOfIgnoreCase(allNormalColumns, column); + if (0 <= index) { + pair.setLeft(allNormalColumns.get(index)); + pair.setRight(ColumnType.NORMAL); + parsededColumns.add(pair); + continue; + } + + // if partition column + index = OdpsUtil.indexOfIgnoreCase(allPartitionColumns, column); + if (0 <= index) { + 
pair.setLeft(allPartitionColumns.get(index)); + pair.setRight(ColumnType.PARTITION); + parsededColumns.add(pair); + continue; + } + // not exist column + throw DataXException.asDataXException( + OdpsReaderErrorCode.ILLEGAL_VALUE, + String.format("源头表的列配置错误. 您所配置的列 [%s] 不存在.", column)); + + } + return parsededColumns; + } + + private static int indexOfIgnoreCase(List columnCollection, + String column) { + int index = -1; + for (int i = 0; i < columnCollection.size(); i++) { + if (columnCollection.get(i).equalsIgnoreCase(column)) { + index = i; + break; + } + } + return index; + } + + public static boolean checkIfConstantColumn(String column) { + if (column.length() >= 2 && column.startsWith(Constant.COLUMN_CONSTANT_FLAG) && + column.endsWith(Constant.COLUMN_CONSTANT_FLAG)) { + return true; + } else { + return false; + } + } + + public static TableTunnel.DownloadSession createMasterSessionForNonPartitionedTable(Odps odps, String tunnelServer, + final String projectName, final String tableName) { + + final TableTunnel tunnel = new TableTunnel(odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tunnel.setEndpoint(tunnelServer); + } + + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.DownloadSession call() throws Exception { + return tunnel.createDownloadSession( + projectName, tableName); + } + }, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.CREATE_DOWNLOADSESSION_FAIL, e); + } + } + + public static TableTunnel.DownloadSession getSlaveSessionForNonPartitionedTable(Odps odps, final String sessionId, + String tunnelServer, final String projectName, final String tableName) { + + final TableTunnel tunnel = new TableTunnel(odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tunnel.setEndpoint(tunnelServer); + } + + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.DownloadSession call() throws Exception { + return tunnel.getDownloadSession( + projectName, tableName, sessionId); + } + }, MAX_RETRY_TIME ,1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.GET_DOWNLOADSESSION_FAIL, e); + } + } + + public static TableTunnel.DownloadSession createMasterSessionForPartitionedTable(Odps odps, String tunnelServer, + final String projectName, final String tableName, String partition) { + + final TableTunnel tunnel = new TableTunnel(odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tunnel.setEndpoint(tunnelServer); + } + + final PartitionSpec partitionSpec = new PartitionSpec(partition); + + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.DownloadSession call() throws Exception { + return tunnel.createDownloadSession( + projectName, tableName, partitionSpec); + } + }, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.CREATE_DOWNLOADSESSION_FAIL, e); + } + } + + public static TableTunnel.DownloadSession getSlaveSessionForPartitionedTable(Odps odps, final String sessionId, + String tunnelServer, final String projectName, final String tableName, String partition) { + + final TableTunnel tunnel = new TableTunnel(odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tunnel.setEndpoint(tunnelServer); + } + + final PartitionSpec partitionSpec = new PartitionSpec(partition); + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.DownloadSession call() 
throws Exception { + return tunnel.getDownloadSession( + projectName, tableName, partitionSpec, sessionId); + } + }, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.GET_DOWNLOADSESSION_FAIL, e); + } + } + + + + public static RecordReader getRecordReader(final TableTunnel.DownloadSession downloadSession, final long start, final long count, + final boolean isCompress) { + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public RecordReader call() throws Exception { + return downloadSession.openRecordReader(start, count, isCompress); + } + }, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.OPEN_RECORD_READER_FAILED, + "open RecordReader失败. 请联系 ODPS 管理员处理.", e); + } + } + + /** + * table.reload() 方法抛出的 odps 异常 转化为更清晰的 datax 异常 抛出 + */ + public static void throwDataXExceptionWhenReloadTable(Exception e, String tableName) { + if(e.getMessage() != null) { + if(e.getMessage().contains(OdpsExceptionMsg.ODPS_PROJECT_NOT_FOUNT)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_PROJECT_NOT_FOUNT, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [project] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_TABLE_NOT_FOUNT)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_TABLE_NOT_FOUNT, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [table] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_KEY_ID_NOT_FOUND)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_ACCESS_KEY_ID_NOT_FOUND, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [accessId] [accessKey]是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_KEY_INVALID)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_ACCESS_KEY_INVALID, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [accessKey] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_DENY)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_ACCESS_DENY, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [accessId] [accessKey] [project]是否匹配.", tableName), e); + } + } + throw DataXException.asDataXException(OdpsReaderErrorCode.ILLEGAL_VALUE, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 project,table,accessId,accessKey,odpsServer等值.", tableName), e); + } +} diff --git a/odpsreader/src/main/resources/TODO.txt b/odpsreader/src/main/resources/TODO.txt new file mode 100755 index 000000000..aaf8e64eb --- /dev/null +++ b/odpsreader/src/main/resources/TODO.txt @@ -0,0 +1,6 @@ +1、哪些方法需要重试? 风险何在? +2、哪些对象是非线程安全的? +3、切分后为 null 的任务,框架怎么处理? +4、读取分区的配置方式? +5、内部的解密key 的逻辑 +6、表名称大小写敏感性的问题? 
\ No newline at end of file diff --git a/odpsreader/src/main/resources/plugin.json b/odpsreader/src/main/resources/plugin.json new file mode 100755 index 000000000..2d441acf6 --- /dev/null +++ b/odpsreader/src/main/resources/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "odpsreader", + "class": "com.alibaba.datax.plugin.reader.odpsreader.OdpsReader", + "description": { + "useScene": "prod.", + "mechanism": "TODO", + "warn": "TODO" + }, + "developer": "alibaba" +} \ No newline at end of file diff --git a/odpsreader/src/main/resources/plugin_job_template.json b/odpsreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..6eddf0cd2 --- /dev/null +++ b/odpsreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "odpsreader", + "parameter": { + "accessId": "", + "accessKey": "", + "project": "", + "table": "", + "partition": [], + "column": [], + "packageAuthorizedProject": "", + "splitMode": "", + "odpsServer": "" + } +} \ No newline at end of file diff --git a/odpswriter/doc/odpswriter.md b/odpswriter/doc/odpswriter.md new file mode 100644 index 000000000..229405d66 --- /dev/null +++ b/odpswriter/doc/odpswriter.md @@ -0,0 +1,338 @@ +# DataX ODPS写入 + + +--- + + +## 1 快速介绍 + +ODPSWriter插件用于实现往ODPS插入或者更新数据,主要提供给etl开发同学将业务数据导入odps,适合于TB,GB数量级的数据传输,如果需要传输PB量级的数据,请选择dt task工具 (http://odps.alibaba-inc.com/download/DTTask_User_Manuals.pdf?spm=0.0.0.0.TWkq8m&file=DTTask_User_Manuals.pdf). + + + +## 2 实现原理 + +在底层实现上,ODPSWriter是通过DT Tunnel写入ODPS系统的,有关ODPS的更多技术细节请参看 ODPS主站 http://odps.alibaba-inc.com/ 和ODPS产品文档 http://odps.alibaba-inc.com/doc/ + +目前 DataX3 依赖的 SDK 版本是: + + + com.aliyun.odps + odps-sdk-core-internal + 0.13.2 + + + +注意: **如果你需要使用ODPSReader/Writer插件,请务必使用JDK 1.6-32及以上版本** +使用java -version查看Java版本号 + +## 3 功能说明 + +### 3.1 配置样例 +* 这里使用一份从内存产生到ODPS导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": {"byte": 1048576} + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 100000 + } + }, + "writer": { + "name": "odpswriter", + "parameter": { + "project": "chinan_test", + "table": "odps_write_test00_partitioned", + "partition":"school=SiChuan-School,class=1", + "column": ["id","name"], + "accessId": "**b7**", + "accessKey": "***dv**yk**mm", + "truncate": true, + "odpsServer": "http://service.odpsstg.aliyun-inc.com/stgnew/", + "tunnelServer": "http://tunnel.odpsstg.aliyun-inc.com", + "accountType": "aliyun" + } + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + + +* **accessId** + * 描述:ODPS系统登录ID
+ * 必选:是
+ * 默认值:无
+ +* **accessKey** + * 描述:ODPS系统登录Key
+ * 必选:是
+ * 默认值:无
+ +* **project** + + * 描述:ODPS表所属的project,注意:Project只能是字母+数字组合,请填写英文名称。在云端等用户看到的ODPS项目中文名只是显示名,请务必填写底层真实的Project英文标识名。
+ * 必选:是
+ * 默认值:无
+ +* **table** + + * 描述:写入数据的表名,不能填写多张表,因为DataX不支持同时导入多张表。
+ * 必选:是
+ * 默认值:无
+ +* **partition** + + * 描述:需要写入数据表的分区信息,必须指定到最后一级分区。把数据写入一个三级分区表,必须配置到最后一级分区,例如pt=20150101/type=1/biz=2。 +
+ * 必选:**如果是分区表,该选项必填,如果非分区表,该选项不可填写。** + * 默认值:空
+ +* **column** + + * 描述:需要导入的字段列表。当导入全部字段时,可以配置为"column": ["*"];当只需要写入部分ODPS列时,填写对应的列名即可,例如"column": ["id", "name"]。ODPSWriter支持列筛选、列换序,例如表有a,b,c三个字段,用户只同步c,b两个字段,可以配置成["c","b"],导入过程中字段a会自动补空,设置为null(可参考本节末尾的示例片段)。
+ * 必选:否
+ * 默认值:无
+ +* **truncate** + * 描述:ODPSWriter通过配置"truncate": true,保证写入的幂等性,即当出现写入失败再次运行时,ODPSWriter将清理前述数据,并导入新数据,这样可以保证每次重跑之后的数据都保持一致。
+ + **truncate选项不是原子操作!ODPS SQL无法做到原子性。因此当多个任务同时对同一个Table/Partition执行清理操作时,可能出现并发时序问题,请务必注意!**针对这类问题,我们建议尽量避免多个作业同时对同一份分区执行DDL操作,或者在多个并发作业启动前,提前创建好分区。 + + * 必选:是
+ * 默认值:无
+ +* **odpsServer** + + * 描述:ODPS的server,线上地址为 http://service.odps.aliyun-inc.com/api 日常地址:http://service-corp.odps.aliyun-inc.com/api
+ * 必选:是
+ * 默认值:无
+ +* **tunnelServer** + + * 描述:ODPS的tunnelserver,线上地址为 http://dt.odps.aliyun-inc.com 日常地址:http://dt-corp.odps.aliyun-inc.com
+ * 必选:是,如果地址是对内的(含有"-inc")则可以不填
+ * 默认值:无
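+
+* 下面补充一个仅作示意的 writer.parameter 配置片段,演示向三级分区表写入并只同步部分列时 partition、column、truncate 的配合方式。其中 project、table、分区值、列名均为假设的示例值,实际使用时请替换为您自己的配置:
+
+```json
+{
+    "name": "odpswriter",
+    "parameter": {
+        "accessId": "********",
+        "accessKey": "********",
+        "project": "example_project",
+        "table": "example_partitioned_table",
+        "partition": "pt=20150101/type=1/biz=2",
+        "column": ["c", "b"],
+        "truncate": true,
+        "odpsServer": "http://service.odps.aliyun-inc.com/api",
+        "accountType": "aliyun"
+    }
+}
+```
+
+* 按上述配置,Reader 输出的第一、二列会分别写入目的表的 c、b 两个字段,其余字段补空为 null;truncate 为 true 时,每次运行前会先清理 pt=20150101/type=1/biz=2 该分区的已有数据,以保证重跑幂等。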
+ + +### 3.3 类型转换 + +类似ODPSReader,目前ODPSWriter支持大部分ODPS类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出ODPSWriter针对ODPS类型转换列表: + + +| DataX 内部类型| ODPS 数据类型 | +| -------- | ----- | +| Long |bigint | +| Double |double | +| String |string | +| Date |datetime | +| Boolean |bool | + + + + +## 4 插件特点 + +### 4.1 关于列筛选的问题 + +* ODPS本身不支持列筛选、重排序、补空等等,但是DataX ODPSWriter完成了上述需求,支持列筛选、重排序、补空。例如需要导入的字段列表,当导入全部字段时,可以配置为"column": ["*"],odps表有a,b,c三个字段,用户只同步c,b两个字段,在列配置中可以写成"column": ["c","b"],表示会把reader的第一列和第二列导入odps的c字段和b字段,而odps表中新插入纪的录的a字段会被置为null. + +### 4.2 列配置错误的处理 + +* 为了保证写入数据的可靠性,避免多余列数据丢失造成数据质量故障。对于写入多余的列,ODPSWriter将报错。例如ODPS表字段为a,b,c,但是ODPSWriter写入的字段为多于3列的话ODPSWriter将报错。 + +### 4.3 分区配置注意事项 + +* ODPSWriter只提供 **写入到最后一级分区** 功能,不支持写入按照某个字段进行分区路由等功能。假设表一共有3级分区,那么在分区配置中就必须指明写入到某个三级分区,例如把数据写入一个表的第三级分区,可以配置为 pt=20150101/type=1/biz=2,但是不能配置为pt=20150101/type=1或者pt=20150101。 + +### 4.4 任务重跑和failover +* ODPSWriter通过配置"truncate": true,保证写入的幂等性,即当出现写入失败再次运行时,ODPSWriter将清理前述数据,并导入新数据,这样可以保证每次重跑之后的数据都保持一致。如果在运行过程中因为其他的异常导致了任务中断,是不能保证数据的原子性的,数据不会回滚也不会自动重跑,需要用户利用幂等性这一特点重跑去确保保证数据的完整性。**truncate为true的情况下,会将指定分区\表的数据全部清理,请谨慎使用!** + + + +## 5 性能报告(线上环境实测) + +### 5.1 环境准备 + +#### 5.1.1 数据特征 + +建表语句: + + use cdo_datasync; + create table datax3_odpswriter_perf_10column_1kb_00( + s_0 string, + bool_1 boolean, + bi_2 bigint, + dt_3 datetime, + db_4 double, + s_5 string, + s_6 string, + s_7 string, + s_8 string, + s_9 string + )PARTITIONED by (pt string,year string); + +单行记录类似于: + + s_0 : 485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&* + bool_1 : true + bi_2 : 1696248667889 + dt_3 : 2013-07-0600: 00: 00 + db_4 : 3.141592653578 + s_5 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 + s_6 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209 + s_7 : 100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209 + s_8 : 100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209 + s_9 : 12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 + +#### 5.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu : 24 Core Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz cache 15.36MB + 2. mem : 50GB + 3. net : 千兆双网卡 + 4. jvm : -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + 5. 
disc: DataX 数据不落磁盘,不统计此项 + +* 任务配置为: +``` +{ + "job": { + "setting": { + "speed": { + "channel": "1,2,4,5,6,8,16,32,64" + } + }, + "content": [ + { + "reader": { + "name": "odpsreader", + "parameter": { + "accessId": "******************************", + "accessKey": "*****************************", + "column": [ + "*" + ], + "partition": [ + "pt=20141010000000,year=2014" + ], + "odpsServer": "http://service.odps.aliyun-inc.com/api", + "project": "cdo_datasync", + "table": "datax3_odpswriter_perf_10column_1kb_00", + "tunnelServer": "http://dt.odps.cm11.aliyun-inc.com" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "column": [ + { + "value": "485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&*" + }, + { + "value": "true", + "type": "bool" + }, + { + "value": "1696248667889", + "type": "long" + }, + { + "type": "date", + "value": "2013-07-06 00:00:00", + "dateFormat": "yyyy-mm-dd hh:mm:ss" + }, + { + "value": "3.141592653578", + "type": "double" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209" + }, + { + "value": "100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209" + }, + { + "value": "12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + } + ] + } + } + } + ] + } +} +``` + +### 5.2 测试报告 + + +| 并发任务数|blockSizeInMB| DataX速度(Rec/s)|DataX流量(MB/S)|网卡流量(MB/S)|DataX运行负载| +|--------| --------|--------|--------|--------|--------| +|1|32|30303|13.03|14.5|0.12| +|1|64|38461|16.54|16.5|0.44| +|1|128|46454|20.55|26.7|0.47| +|1|256|52631|22.64|26.7|0.47| +|1|512|58823|25.30|28.7|0.44| +|4|32|114816|49.38|55.3|0.75| +|4|64|147577|63.47|71.3|0.82| +|4|128|177744|76.45|83.2|0.97| +|4|256|173913|74.80|80.1|1.01| +|4|512|200000|86.02|95.1|1.41| +|8|32|204480|87.95|92.7|1.16| +|8|64|294224|126.55|135.3|1.65| +|8|128|365475|157.19|163.7|2.89| +|8|256|394713|169.83|176.7|2.72| +|8|512|241691|103.95|125.7|2.29| +|16|32|420838|181.01|198.0|2.56| +|16|64|458144|197.05|217.4|2.85| +|16|128|443219|190.63|210.5|3.29| +|16|256|315235|135.58|140.0|0.95| +|16|512|OOM||||| + +说明: + +1. OdpsWriter 影响速度的是channel 和 blockSizeInMB。blockSizeInMB 取`32` 和 `64`时,速度比较稳定,过分大的 blockSizeInMB 可能造成速度波动以及内存OOM。 +2. channel 和 blockSizeInMB 对速度的影响都很明显,建议综合考虑配合选择。 +3. channel 数目的选择,可以综合考虑源端数据特征进行选择,对于StreamReader,在16个channel时将网卡打满。 + + +## 6 FAQ +#### 1 导数据到 odps 的日志中有以下报错,该怎么处理呢?"ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn‘t exist in project example_dev“ + +解决办法 :找ODPS Prject 的 owner给用户的云账号授权,授权语句: +grant Describe,Select,Alter,Update on table [tableName] to user XXX + +#### 2 可以导入数据到odps的视图吗? 
+目前不支持通过视图到数据到odps,视图是ODPS非实体化数据存储对象,技术上无法向视图导入数据。 diff --git a/odpswriter/odpswriter.iml b/odpswriter/odpswriter.iml new file mode 100644 index 000000000..8f28870d8 --- /dev/null +++ b/odpswriter/odpswriter.iml @@ -0,0 +1,51 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/odpswriter/pom.xml b/odpswriter/pom.xml new file mode 100755 index 000000000..4e633fd18 --- /dev/null +++ b/odpswriter/pom.xml @@ -0,0 +1,100 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + odpswriter + odpswriter + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.aliyun.odps + odps-sdk-core + 0.19.3-public + + + + + commons-httpclient + commons-httpclient + 3.1 + + + + + org.mockito + mockito-core + 1.8.5 + test + + + org.powermock + powermock-api-mockito + 1.4.10 + test + + + org.powermock + powermock-module-junit4 + 1.4.10 + test + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/odpswriter/src/main/assembly/package.xml b/odpswriter/src/main/assembly/package.xml new file mode 100755 index 000000000..0ef0b43b1 --- /dev/null +++ b/odpswriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/odpswriter + + + target/ + + odpswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/odpswriter + + + + + + false + plugin/writer/odpswriter/libs + runtime + + + diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Constant.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Constant.java new file mode 100755 index 000000000..22bcc16cb --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Constant.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + + +public class Constant { + public static final String SKYNET_ACCESSID = "SKYNET_ACCESSID"; + + public static final String SKYNET_ACCESSKEY = "SKYNET_ACCESSKEY"; + + public static final String DEFAULT_ACCOUNT_TYPE = "aliyun"; + + public static final String TAOBAO_ACCOUNT_TYPE = "taobao"; + + public static final String COLUMN_POSITION = "columnPosition"; + +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Key.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Key.java new file mode 100755 index 000000000..f578d72d9 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Key.java @@ -0,0 +1,34 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + + +public final class Key { + + public final static String ODPS_SERVER = "odpsServer"; + + public final static String TUNNEL_SERVER = "tunnelServer"; + + public final static String ACCESS_ID = "accessId"; + + public final static String ACCESS_KEY = "accessKey"; + + public final static String PROJECT = "project"; + + public final static String TABLE = "table"; + + public final static String PARTITION = "partition"; + + public final static String COLUMN = "column"; + + public final static String TRUNCATE = "truncate"; + + public final static String MAX_RETRY_TIME = "maxRetryTime"; + + public final static String 
BLOCK_SIZE_IN_MB = "blockSizeInMB"; + + //boolean 类型,default:false + public final static String EMPTY_AS_NULL = "emptyAsNull"; + + public final static String ACCOUNT_TYPE = "accountType"; + + public final static String IS_COMPRESS = "isCompress"; +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriter.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriter.java new file mode 100755 index 000000000..60deb5dd3 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriter.java @@ -0,0 +1,356 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.ListUtil; +import com.alibaba.datax.plugin.writer.odpswriter.util.IdAndKeyUtil; +import com.alibaba.datax.plugin.writer.odpswriter.util.OdpsUtil; + +import com.aliyun.odps.Odps; +import com.aliyun.odps.Table; +import com.aliyun.odps.TableSchema; +import com.aliyun.odps.tunnel.TableTunnel; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.atomic.AtomicLong; + +/** + * 已修改为:每个 task 各自创建自己的 upload,拥有自己的 uploadId,并在 task 中完成对对应 block 的提交。 + */ +public class OdpsWriter extends Writer { + + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + private static final boolean IS_DEBUG = LOG.isDebugEnabled(); + + private Configuration originalConfig; + private Odps odps; + private Table table; + + private String projectName; + private String tableName; + private String tunnelServer; + private String partition; + private String accountType; + private boolean truncate; + private String uploadId; + private TableTunnel.UploadSession masterUpload; + private int blockSizeInMB; + + public void preCheck() { + this.init(); + this.doPreCheck(); + } + + public void doPreCheck() { + //检查accessId,accessKey配置 + if (Constant.DEFAULT_ACCOUNT_TYPE + .equalsIgnoreCase(this.accountType)) { + this.originalConfig = IdAndKeyUtil.parseAccessIdAndKey(this.originalConfig); + String accessId = this.originalConfig.getString(Key.ACCESS_ID); + String accessKey = this.originalConfig.getString(Key.ACCESS_KEY); + if (IS_DEBUG) { + LOG.debug("accessId:[{}], accessKey:[{}] .", accessId, + accessKey); + } + LOG.info("accessId:[{}] .", accessId); + } + // init odps config + this.odps = OdpsUtil.initOdpsProject(this.originalConfig); + + //检查表等配置是否正确 + this.table = OdpsUtil.getTable(odps,this.projectName,this.tableName); + + //检查列信息是否正确 + List allColumns = OdpsUtil.getAllColumns(this.table.getSchema()); + LOG.info("allColumnList: {} .", StringUtils.join(allColumns, ',')); + dealColumn(this.originalConfig, allColumns); + + //检查分区信息是否正确 + OdpsUtil.preCheckPartition(this.odps, this.table, this.partition, this.truncate); + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + OdpsUtil.checkNecessaryConfig(this.originalConfig); + OdpsUtil.dealMaxRetryTime(this.originalConfig); + + this.projectName = 
this.originalConfig.getString(Key.PROJECT); + this.tableName = this.originalConfig.getString(Key.TABLE); + this.tunnelServer = this.originalConfig.getString(Key.TUNNEL_SERVER, null); + + //check isCompress + this.originalConfig.getBool(Key.IS_COMPRESS, false); + + this.partition = OdpsUtil.formatPartition(this.originalConfig + .getString(Key.PARTITION, "")); + this.originalConfig.set(Key.PARTITION, this.partition); + + this.accountType = this.originalConfig.getString(Key.ACCOUNT_TYPE, + Constant.DEFAULT_ACCOUNT_TYPE); + if (!Constant.DEFAULT_ACCOUNT_TYPE.equalsIgnoreCase(this.accountType) && + !Constant.TAOBAO_ACCOUNT_TYPE.equalsIgnoreCase(this.accountType)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ACCOUNT_TYPE_ERROR, + String.format("账号类型错误,因为你的账号 [%s] 不是datax目前支持的账号类型,目前仅支持aliyun, taobao账号,请修改您的账号信息.", accountType)); + } + this.originalConfig.set(Key.ACCOUNT_TYPE, this.accountType); + + this.truncate = this.originalConfig.getBool(Key.TRUNCATE); + + boolean emptyAsNull = this.originalConfig.getBool(Key.EMPTY_AS_NULL, false); + this.originalConfig.set(Key.EMPTY_AS_NULL, emptyAsNull); + if (emptyAsNull) { + LOG.warn("这是一条需要注意的信息 由于您的作业配置了写入 ODPS 的目的表时emptyAsNull=true, 所以 DataX将会把长度为0的空字符串作为 java 的 null 写入 ODPS."); + } + + this.blockSizeInMB = this.originalConfig.getInt(Key.BLOCK_SIZE_IN_MB, 64); + if(this.blockSizeInMB < 8) { + this.blockSizeInMB = 8; + } + this.originalConfig.set(Key.BLOCK_SIZE_IN_MB, this.blockSizeInMB); + LOG.info("blockSizeInMB={}.", this.blockSizeInMB); + + if (IS_DEBUG) { + LOG.debug("After master init(), job config now is: [\n{}\n] .", + this.originalConfig.toJSON()); + } + } + + @Override + public void prepare() { + String accessId = null; + String accessKey = null; + if (Constant.DEFAULT_ACCOUNT_TYPE + .equalsIgnoreCase(this.accountType)) { + this.originalConfig = IdAndKeyUtil.parseAccessIdAndKey(this.originalConfig); + accessId = this.originalConfig.getString(Key.ACCESS_ID); + accessKey = this.originalConfig.getString(Key.ACCESS_KEY); + if (IS_DEBUG) { + LOG.debug("accessId:[{}], accessKey:[{}] .", accessId, + accessKey); + } + LOG.info("accessId:[{}] .", accessId); + } + + // init odps config + this.odps = OdpsUtil.initOdpsProject(this.originalConfig); + + //检查表等配置是否正确 + this.table = OdpsUtil.getTable(odps,this.projectName,this.tableName); + + OdpsUtil.dealTruncate(this.odps, this.table, this.partition, this.truncate); + } + + /** + * 此处主要是对 uploadId进行设置,以及对 blockId 的开始值进行设置。 + *
+ * 对 blockId 需要同时设置开始值以及下一个 blockId 的步长值(INTERVAL_STEP)。 + */ + @Override + public List split(int mandatoryNumber) { + List configurations = new ArrayList(); + + // 此处获取到 masterUpload 只是为了拿到 RecordSchema,以完成对 column 的处理 + TableTunnel tableTunnel = new TableTunnel(this.odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tableTunnel.setEndpoint(tunnelServer); + } + + this.masterUpload = OdpsUtil.createMasterTunnelUpload( + tableTunnel, this.projectName, this.tableName, this.partition); + this.uploadId = this.masterUpload.getId(); + LOG.info("Master uploadId:[{}].", this.uploadId); + + TableSchema schema = this.masterUpload.getSchema(); + List allColumns = OdpsUtil.getAllColumns(schema); + LOG.info("allColumnList: {} .", StringUtils.join(allColumns, ',')); + + dealColumn(this.originalConfig, allColumns); + + for (int i = 0; i < mandatoryNumber; i++) { + Configuration tempConfig = this.originalConfig.clone(); + + configurations.add(tempConfig); + } + + if (IS_DEBUG) { + LOG.debug("After master split, the job config now is:[\n{}\n].", this.originalConfig); + } + + this.masterUpload = null; + + return configurations; + } + + private void dealColumn(Configuration originalConfig, List allColumns) { + //之前已经检查了userConfiguredColumns 一定不为空 + List userConfiguredColumns = originalConfig.getList(Key.COLUMN, String.class); + if (1 == userConfiguredColumns.size() && "*".equals(userConfiguredColumns.get(0))) { + userConfiguredColumns = allColumns; + originalConfig.set(Key.COLUMN, allColumns); + } else { + //检查列是否重复,大小写不敏感(所有写入,都是不允许写入段的列重复的) + ListUtil.makeSureNoValueDuplicate(userConfiguredColumns, false); + + //检查列是否存在,大小写不敏感 + ListUtil.makeSureBInA(allColumns, userConfiguredColumns, false); + } + + List columnPositions = OdpsUtil.parsePosition(allColumns, userConfiguredColumns); + originalConfig.set(Constant.COLUMN_POSITION, columnPositions); + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + } + + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory + .getLogger(Task.class); + + private static final boolean IS_DEBUG = LOG.isDebugEnabled(); + + private Configuration sliceConfig; + private Odps odps; + + private String projectName; + private String tableName; + private String tunnelServer; + private String partition; + private boolean emptyAsNull; + private boolean isCompress; + + private TableTunnel.UploadSession managerUpload; + private TableTunnel.UploadSession workerUpload; + + private String uploadId = null; + private List blocks; + private int blockSizeInMB; + + private Integer failoverState = 0; //0 未failover 1准备failover 2已提交,不能failover + private byte[] lock = new byte[0]; + + @Override + public void init() { + this.sliceConfig = super.getPluginJobConf(); + + this.projectName = this.sliceConfig.getString(Key.PROJECT); + this.tableName = this.sliceConfig.getString(Key.TABLE); + this.tunnelServer = this.sliceConfig.getString(Key.TUNNEL_SERVER, null); + this.partition = OdpsUtil.formatPartition(this.sliceConfig + .getString(Key.PARTITION, "")); + this.sliceConfig.set(Key.PARTITION, this.partition); + + this.emptyAsNull = this.sliceConfig.getBool(Key.EMPTY_AS_NULL); + this.blockSizeInMB = this.sliceConfig.getInt(Key.BLOCK_SIZE_IN_MB); + this.isCompress = this.sliceConfig.getBool(Key.IS_COMPRESS, false); + if (this.blockSizeInMB < 1 || this.blockSizeInMB > 512) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ILLEGAL_VALUE, + String.format("您配置的blockSizeInMB:%s 参数错误. 正确的配置是[1-512]之间的整数. 
请修改此参数的值为该区间内的数值", this.blockSizeInMB)); + } + + if (IS_DEBUG) { + LOG.debug("After init in task, sliceConfig now is:[\n{}\n].", this.sliceConfig); + } + + } + + @Override + public void prepare() { + this.odps = OdpsUtil.initOdpsProject(this.sliceConfig); + + TableTunnel tableTunnel = new TableTunnel(this.odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tableTunnel.setEndpoint(tunnelServer); + } + + this.managerUpload = OdpsUtil.createMasterTunnelUpload(tableTunnel, this.projectName, + this.tableName, this.partition); + this.uploadId = this.managerUpload.getId(); + LOG.info("task uploadId:[{}].", this.uploadId); + + this.workerUpload = OdpsUtil.getSlaveTunnelUpload(tableTunnel, this.projectName, + this.tableName, this.partition, uploadId); + } + + @Override + public void startWrite(RecordReceiver recordReceiver) { + blocks = new ArrayList(); + + AtomicLong blockId = new AtomicLong(0); + + List columnPositions = this.sliceConfig.getList(Constant.COLUMN_POSITION, + Integer.class); + + try { + TaskPluginCollector taskPluginCollector = super.getTaskPluginCollector(); + + OdpsWriterProxy proxy = new OdpsWriterProxy(this.workerUpload, this.blockSizeInMB, blockId, + columnPositions, taskPluginCollector, this.emptyAsNull, this.isCompress); + + com.alibaba.datax.common.element.Record dataXRecord = null; + + PerfRecord blockClose = new PerfRecord(super.getTaskGroupId(),super.getTaskId(), PerfRecord.PHASE.ODPS_BLOCK_CLOSE); + blockClose.start(); + long blockCloseUsedTime = 0; + while ((dataXRecord = recordReceiver.getFromReader()) != null) { + blockCloseUsedTime += proxy.writeOneRecord(dataXRecord, blocks); + } + + blockCloseUsedTime += proxy.writeRemainingRecord(blocks); + blockClose.end(blockCloseUsedTime); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.WRITER_RECORD_FAIL, "写入 ODPS 目的表失败. 
请联系 ODPS 管理员处理.", e); + } + } + + @Override + public void post() { + synchronized (lock){ + if(failoverState==0){ + failoverState = 2; + LOG.info("Slave which uploadId=[{}] begin to commit blocks:[\n{}\n].", this.uploadId, + StringUtils.join(blocks, ",")); + OdpsUtil.masterCompleteBlocks(this.managerUpload, blocks.toArray(new Long[0])); + LOG.info("Slave which uploadId=[{}] commit blocks ok.", this.uploadId); + }else{ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + } + } + + @Override + public void destroy() { + } + + @Override + public boolean supportFailOver(){ + synchronized (lock){ + if(failoverState==0){ + failoverState = 1; + return true; + } + return false; + } + } + } +} \ No newline at end of file diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterErrorCode.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterErrorCode.java new file mode 100755 index 000000000..02020c046 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterErrorCode.java @@ -0,0 +1,66 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum OdpsWriterErrorCode implements ErrorCode { + REQUIRED_VALUE("OdpsWriter-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("OdpsWriter-01", "您配置的值不合法."), + UNSUPPORTED_COLUMN_TYPE("OdpsWriter-02", "DataX 不支持写入 ODPS 的目的表的此种数据类型."), + + TABLE_TRUNCATE_ERROR("OdpsWriter-03", "清空 ODPS 目的表时出错."), + CREATE_MASTER_UPLOAD_FAIL("OdpsWriter-04", "创建 ODPS 的 uploadSession 失败."), + GET_SLAVE_UPLOAD_FAIL("OdpsWriter-05", "获取 ODPS 的 uploadSession 失败."), + GET_ID_KEY_FAIL("OdpsWriter-06", "获取 accessId/accessKey 失败."), + GET_PARTITION_FAIL("OdpsWriter-07", "获取 ODPS 目的表的所有分区失败."), + + ADD_PARTITION_FAILED("OdpsWriter-08", "添加分区到 ODPS 目的表失败."), + WRITER_RECORD_FAIL("OdpsWriter-09", "写入数据到 ODPS 目的表失败."), + + COMMIT_BLOCK_FAIL("OdpsWriter-10", "提交 block 到 ODPS 目的表失败."), + RUN_SQL_FAILED("OdpsWriter-11", "执行 ODPS Sql 失败."), + CHECK_IF_PARTITIONED_TABLE_FAILED("OdpsWriter-12", "检查 ODPS 目的表:%s 是否为分区表失败."), + + RUN_SQL_ODPS_EXCEPTION("OdpsWriter-13", "执行 ODPS Sql 时抛出异常, 可重试"), + + ACCOUNT_TYPE_ERROR("OdpsWriter-30", "账号类型错误."), + + PARTITION_ERROR("OdpsWriter-31", "分区配置错误."), + + COLUMN_NOT_EXIST("OdpsWriter-32", "用户配置的列不存在."), + + ODPS_PROJECT_NOT_FOUNT("OdpsWriter-100", "您配置的值不合法, odps project 不存在."), //ODPS-0420111: Project not found + + ODPS_TABLE_NOT_FOUNT("OdpsWriter-101", "您配置的值不合法, odps table 不存在"), // ODPS-0130131:Table not found + + ODPS_ACCESS_KEY_ID_NOT_FOUND("OdpsWriter-102", "您配置的值不合法, odps accessId,accessKey 不存在"), //ODPS-0410051:Invalid credentials - accessKeyId not found + + ODPS_ACCESS_KEY_INVALID("OdpsWriter-103", "您配置的值不合法, odps accessKey 错误"), //ODPS-0410042:Invalid signature value - User signature dose not match; + + ODPS_ACCESS_DENY("OdpsWriter-104", "拒绝访问, 您不在 您配置的 project 中") //ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn't exist in project + + ; + + private final String code; + private final String description; + + private OdpsWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterProxy.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterProxy.java new file mode 100755 index 000000000..000a1d8e7 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterProxy.java @@ -0,0 +1,190 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.plugin.writer.odpswriter.util.OdpsUtil; + +import com.aliyun.odps.OdpsType; +import com.aliyun.odps.TableSchema; + +import com.aliyun.odps.data.Record; + +import com.aliyun.odps.tunnel.TableTunnel; + +import com.aliyun.odps.tunnel.TunnelException; +import com.aliyun.odps.tunnel.io.ProtobufRecordPack; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.List; +import java.util.concurrent.atomic.AtomicLong; + +public class OdpsWriterProxy { + private static final Logger LOG = LoggerFactory + .getLogger(OdpsWriterProxy.class); + + private volatile boolean printColumnLess;// 是否打印对于源头字段数小于 ODPS 目的表的行的日志 + + private TaskPluginCollector taskPluginCollector; + + private TableTunnel.UploadSession slaveUpload; + private TableSchema schema; + private int maxBufferSize; + private ProtobufRecordPack protobufRecordPack; + private int protobufCapacity; + private AtomicLong blockId; + + private List columnPositions; + private List tableOriginalColumnTypeList; + private boolean emptyAsNull; + private boolean isCompress; + + public OdpsWriterProxy(TableTunnel.UploadSession slaveUpload, int blockSizeInMB, + AtomicLong blockId, List columnPositions, + TaskPluginCollector taskPluginCollector, boolean emptyAsNull, boolean isCompress) + throws IOException, TunnelException { + this.slaveUpload = slaveUpload; + this.schema = this.slaveUpload.getSchema(); + this.tableOriginalColumnTypeList = OdpsUtil + .getTableOriginalColumnTypeList(this.schema); + + this.blockId = blockId; + this.columnPositions = columnPositions; + this.taskPluginCollector = taskPluginCollector; + this.emptyAsNull = emptyAsNull; + this.isCompress = isCompress; + + // 初始化与 buffer 区相关的值 + this.maxBufferSize = (blockSizeInMB - 4) * 1024 * 1024; + this.protobufCapacity = blockSizeInMB * 1024 * 1024; + this.protobufRecordPack = new ProtobufRecordPack(this.schema, null, this.protobufCapacity); + printColumnLess = true; + + } + + public long writeOneRecord( + com.alibaba.datax.common.element.Record dataXRecord, + List blocks) throws Exception { + + Record record = dataxRecordToOdpsRecord(dataXRecord); + + if (null == record) { + return 0; + } + protobufRecordPack.append(record); + + if (protobufRecordPack.getTotalBytes() >= maxBufferSize) { + long startTimeInNs = System.nanoTime(); + OdpsUtil.slaveWriteOneBlock(this.slaveUpload, + protobufRecordPack, blockId.get(), this.isCompress); + LOG.info("write block {} ok.", blockId.get()); + blocks.add(blockId.get()); + protobufRecordPack.reset(); + this.blockId.incrementAndGet(); + return System.nanoTime() - startTimeInNs; + } + return 0; + } + + public long writeRemainingRecord(List blocks) throws Exception { + // complete protobuf stream, then write to http + if (protobufRecordPack.getTotalBytes() != 0) { + long startTimeInNs = System.nanoTime(); + OdpsUtil.slaveWriteOneBlock(this.slaveUpload, + 
protobufRecordPack, blockId.get(), this.isCompress); + LOG.info("write block {} ok.", blockId.get()); + + blocks.add(blockId.get()); + // reset the buffer for next block + protobufRecordPack.reset(); + return System.nanoTime() - startTimeInNs; + } + return 0; + } + + public Record dataxRecordToOdpsRecord( + com.alibaba.datax.common.element.Record dataXRecord) throws Exception { + int sourceColumnCount = dataXRecord.getColumnNumber(); + Record odpsRecord = slaveUpload.newRecord(); + + int userConfiguredColumnNumber = this.columnPositions.size(); +//todo + if (sourceColumnCount > userConfiguredColumnNumber) { + throw DataXException + .asDataXException( + OdpsWriterErrorCode.ILLEGAL_VALUE, + String.format( + "亲,配置中的源表的列个数和目的端表不一致,源表中您配置的列数是:%s 大于目的端的列数是:%s , 这样会导致源头数据无法正确导入目的端, 请检查您的配置并修改.", + sourceColumnCount, + userConfiguredColumnNumber)); + } else if (sourceColumnCount < userConfiguredColumnNumber) { + if (printColumnLess) { + LOG.warn( + "源表的列个数小于目的表的列个数,源表列数是:{} 目的表列数是:{} , 数目不匹配. DataX 会把目的端多出的列的值设置为空值. 如果这个默认配置不符合您的期望,请保持源表和目的表配置的列数目保持一致.", + sourceColumnCount, userConfiguredColumnNumber); + } + printColumnLess = false; + } + + int currentIndex; + int sourceIndex = 0; + try { + com.alibaba.datax.common.element.Column columnValue; + + for (; sourceIndex < sourceColumnCount; sourceIndex++) { + currentIndex = columnPositions.get(sourceIndex); + OdpsType type = this.tableOriginalColumnTypeList + .get(currentIndex); + columnValue = dataXRecord.getColumn(sourceIndex); + + if (columnValue == null) { + continue; + } + // for compatible dt lib, "" as null + if(this.emptyAsNull && columnValue instanceof StringColumn && "".equals(columnValue.asString())){ + continue; + } + + switch (type) { + case STRING: + odpsRecord.setString(currentIndex, columnValue.asString()); + break; + case BIGINT: + odpsRecord.setBigint(currentIndex, columnValue.asLong()); + break; + case BOOLEAN: + odpsRecord.setBoolean(currentIndex, columnValue.asBoolean()); + break; + case DATETIME: + odpsRecord.setDatetime(currentIndex, columnValue.asDate()); + break; + case DOUBLE: + odpsRecord.setDouble(currentIndex, columnValue.asDouble()); + break; + case DECIMAL: + odpsRecord.setDecimal(currentIndex, columnValue.asBigDecimal()); + String columnStr = columnValue.asString(); + if(columnStr != null && columnStr.indexOf(".") >= 36) { + throw new Exception("Odps decimal 类型的整数位个数不能超过35"); + } + default: + break; + } + } + + return odpsRecord; + } catch (Exception e) { + String message = String.format( + "写入 ODPS 目的表时遇到了脏数据, 因为源端第[%s]个字段, 具体值[%s] 的数据不符合 ODPS 对应字段的格式要求,请检查该数据并作出修改 或者您可以增大阀值,忽略这条记录.", sourceIndex, + dataXRecord.getColumn(sourceIndex)); + this.taskPluginCollector.collectDirtyRecord(dataXRecord, e, + message); + + return null; + } + + } +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/DESCipher.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/DESCipher.java new file mode 100755 index 000000000..bf7f5a883 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/DESCipher.java @@ -0,0 +1,355 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.plugin.writer.odpswriter.util; + +import javax.crypto.Cipher; +import javax.crypto.SecretKey; +import javax.crypto.SecretKeyFactory; +import javax.crypto.spec.DESKeySpec; +import java.security.SecureRandom; + +/** + *   * DES加解密,支持与delphi交互(字符串编码需统一为UTF-8) + * + *   * + * + *   * @author wym + * + *    + */ + +public class DESCipher { + + /** + *   * 密钥 + * + *    + */ + + public static final String KEY = "u4Gqu4Z8"; + + private final static String DES = "DES"; + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @param key + * + *   * 密钥,长度必须是8的倍数 + * + *   * @return 密文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] encrypt(byte[] src, byte[] key) throws Exception { + + // DES算法要求有一个可信任的随机数源 + + SecureRandom sr = new SecureRandom(); + + // 从原始密匙数据创建DESKeySpec对象 + + DESKeySpec dks = new DESKeySpec(key); + + // 创建一个密匙工厂,然后用它把DESKeySpec转换成 + + // 一个SecretKey对象 + + SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(DES); + + SecretKey securekey = keyFactory.generateSecret(dks); + + // Cipher对象实际完成加密操作 + + Cipher cipher = Cipher.getInstance(DES); + + // 用密匙初始化Cipher对象 + + cipher.init(Cipher.ENCRYPT_MODE, securekey, sr); + + // 现在,获取数据并加密 + + // 正式执行加密操作 + + return cipher.doFinal(src); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @param key + * + *   * 密钥,长度必须是8的倍数 + * + *   * @return 明文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] decrypt(byte[] src, byte[] key) throws Exception { + + // DES算法要求有一个可信任的随机数源 + + SecureRandom sr = new SecureRandom(); + + // 从原始密匙数据创建一个DESKeySpec对象 + + DESKeySpec dks = new DESKeySpec(key); + + // 创建一个密匙工厂,然后用它把DESKeySpec对象转换成 + + // 一个SecretKey对象 + + SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(DES); + + SecretKey securekey = keyFactory.generateSecret(dks); + + // Cipher对象实际完成解密操作 + + Cipher cipher = Cipher.getInstance(DES); + + // 用密匙初始化Cipher对象 + + cipher.init(Cipher.DECRYPT_MODE, securekey, sr); + + // 现在,获取数据并解密 + + // 正式执行解密操作 + + return cipher.doFinal(src); + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @return 密文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] encrypt(byte[] src) throws Exception { + + return encrypt(src, KEY.getBytes()); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @return 明文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] decrypt(byte[] src) throws Exception { + + return decrypt(src, KEY.getBytes()); + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字符串) + * + *   * @return 密文(16进制字符串) + * + *   * @throws Exception + * + *    + */ + + public final static String encrypt(String src) { + + try { + + return byte2hex(encrypt(src.getBytes(), KEY.getBytes())); + + } catch (Exception e) { + + e.printStackTrace(); + + } + + return null; + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字符串) + * + *   * @return 明文(字符串) + * + *   * @throws Exception + * 
+ *    + */ + + public final static String decrypt(String src) { + try { + + return new String(decrypt(hex2byte(src.getBytes()), KEY.getBytes())); + + } catch (Exception e) { + + e.printStackTrace(); + + } + + return null; + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @return 密文(16进制字符串) + * + *   * @throws Exception + * + *    + */ + + public static String encryptToString(byte[] src) throws Exception { + + return encrypt(new String(src)); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @return 明文(字符串) + * + *   * @throws Exception + * + *    + */ + + public static String decryptToString(byte[] src) throws Exception { + + return decrypt(new String(src)); + + } + + public static String byte2hex(byte[] b) { + + String hs = ""; + + String stmp = ""; + + for (int n = 0; n < b.length; n++) { + + stmp = (Integer.toHexString(b[n] & 0XFF)); + + if (stmp.length() == 1) + + hs = hs + "0" + stmp; + + else + + hs = hs + stmp; + + } + + return hs.toUpperCase(); + + } + + public static byte[] hex2byte(byte[] b) { + + if ((b.length % 2) != 0) + + throw new IllegalArgumentException("长度不是偶数"); + + byte[] b2 = new byte[b.length / 2]; + + for (int n = 0; n < b.length; n += 2) { + + String item = new String(b, n, 2); + + b2[n / 2] = (byte) Integer.parseInt(item, 16); + + } + return b2; + + } + + /* + * public static void main(String[] args) { try { String src = "cheetah"; + * String crypto = DESCipher.encrypt(src); System.out.println("密文[" + src + + * "]:" + crypto); System.out.println("解密后:" + DESCipher.decrypt(crypto)); } + * catch (Exception e) { e.printStackTrace(); } } + */ +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/IdAndKeyUtil.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/IdAndKeyUtil.java new file mode 100755 index 000000000..95e4b56b5 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/IdAndKeyUtil.java @@ -0,0 +1,85 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.alibaba.datax.plugin.writer.odpswriter.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.odpswriter.Constant; +import com.alibaba.datax.plugin.writer.odpswriter.Key; +import com.alibaba.datax.plugin.writer.odpswriter.OdpsWriterErrorCode; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Map; + +public class IdAndKeyUtil { + private static Logger LOG = LoggerFactory.getLogger(IdAndKeyUtil.class); + + public static Configuration parseAccessIdAndKey(Configuration originalConfig) { + String accessId = originalConfig.getString(Key.ACCESS_ID); + String accessKey = originalConfig.getString(Key.ACCESS_KEY); + + // 只要 accessId,accessKey 二者配置了一个,就理解为是用户本意是要直接手动配置其 accessid/accessKey + if (StringUtils.isNotBlank(accessId) || StringUtils.isNotBlank(accessKey)) { + LOG.info("Try to get accessId/accessKey from your config."); + //通过如下语句,进行检查是否确实配置了 + accessId = originalConfig.getNecessaryValue(Key.ACCESS_ID, OdpsWriterErrorCode.REQUIRED_VALUE); + accessKey = originalConfig.getNecessaryValue(Key.ACCESS_KEY, OdpsWriterErrorCode.REQUIRED_VALUE); + //检查完毕,返回即可 + return originalConfig; + } else { + Map envProp = System.getenv(); + return getAccessIdAndKeyFromEnv(originalConfig, envProp); + } + } + + private static Configuration getAccessIdAndKeyFromEnv(Configuration originalConfig, + Map envProp) { + String accessId = null; + String accessKey = null; + + String skynetAccessID = envProp.get(Constant.SKYNET_ACCESSID); + String skynetAccessKey = envProp.get(Constant.SKYNET_ACCESSKEY); + + if (StringUtils.isNotBlank(skynetAccessID) + || StringUtils.isNotBlank(skynetAccessKey)) { + /** + * 环境变量中,如果存在SKYNET_ACCESSID/SKYNET_ACCESSKEy(只要有其中一个变量,则认为一定是两个都存在的!), + * 则使用其值作为odps的accessId/accessKey(会解密) + */ + + LOG.info("Try to get accessId/accessKey from environment."); + accessId = skynetAccessID; + accessKey = DESCipher.decrypt(skynetAccessKey); + if (StringUtils.isNotBlank(accessKey)) { + originalConfig.set(Key.ACCESS_ID, accessId); + originalConfig.set(Key.ACCESS_KEY, accessKey); + LOG.info("Get accessId/accessKey from environment variables successfully."); + } else { + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_ID_KEY_FAIL, + String.format("从环境变量中获取accessId/accessKey 失败, accessId=[%s]", accessId)); + } + } else { + // 无处获取(既没有配置在作业中,也没用在环境变量中) + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_ID_KEY_FAIL, + "无法获取到accessId/accessKey. 它们既不存在于您的配置中,也不存在于环境变量中."); + } + + return originalConfig; + } +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsExceptionMsg.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsExceptionMsg.java new file mode 100644 index 000000000..d613eefda --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsExceptionMsg.java @@ -0,0 +1,18 @@ +package com.alibaba.datax.plugin.writer.odpswriter.util; + +/** + * Created by hongjiao.hj on 2015/6/9. 
+ */ +public class OdpsExceptionMsg { + + public static final String ODPS_PROJECT_NOT_FOUNT = "ODPS-0420111: Project not found"; + + public static final String ODPS_TABLE_NOT_FOUNT = "ODPS-0130131:Table not found"; + + public static final String ODPS_ACCESS_KEY_ID_NOT_FOUND = "ODPS-0410051:Invalid credentials - accessKeyId not found"; + + public static final String ODPS_ACCESS_KEY_INVALID = "ODPS-0410042:Invalid signature value - User signature dose not match"; + + public static final String ODPS_ACCESS_DENY = "ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn't exist in project"; + +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsUtil.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsUtil.java new file mode 100755 index 000000000..272c23423 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsUtil.java @@ -0,0 +1,585 @@ +package com.alibaba.datax.plugin.writer.odpswriter.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.writer.odpswriter.Constant; +import com.alibaba.datax.plugin.writer.odpswriter.Key; + +import com.alibaba.datax.plugin.writer.odpswriter.OdpsWriterErrorCode; +import com.aliyun.odps.*; +import com.aliyun.odps.account.Account; +import com.aliyun.odps.account.AliyunAccount; +import com.aliyun.odps.task.SQLTask; +import com.aliyun.odps.tunnel.TableTunnel; + +import com.aliyun.odps.tunnel.io.ProtobufRecordPack; +import com.aliyun.odps.tunnel.io.TunnelRecordWriter; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; +import java.util.concurrent.Callable; + +public class OdpsUtil { + private static final Logger LOG = LoggerFactory.getLogger(OdpsUtil.class); + + public static int MAX_RETRY_TIME = 10; + + public static void checkNecessaryConfig(Configuration originalConfig) { + originalConfig.getNecessaryValue(Key.ODPS_SERVER, + OdpsWriterErrorCode.REQUIRED_VALUE); + + originalConfig.getNecessaryValue(Key.PROJECT, + OdpsWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.TABLE, + OdpsWriterErrorCode.REQUIRED_VALUE); + + if (null == originalConfig.getList(Key.COLUMN) || + originalConfig.getList(Key.COLUMN, String.class).isEmpty()) { + throw DataXException.asDataXException(OdpsWriterErrorCode.REQUIRED_VALUE, "您未配置写入 ODPS 目的表的列信息. " + + "正确的配置方式是给datax的 column 项配置上您需要读取的列名称,用英文逗号分隔 例如: \"column\": [\"id\",\"name\"]."); + } + + // getBool 内部要求,值只能为 true,false 的字符串(大小写不敏感),其他一律报错,不再有默认配置 + Boolean truncate = originalConfig.getBool(Key.TRUNCATE); + if (null == truncate) { + throw DataXException.asDataXException(OdpsWriterErrorCode.REQUIRED_VALUE, "[truncate]是必填配置项, 意思是写入 ODPS 目的表前是否清空表/分区. " + + "请您增加 truncate 的配置,根据业务需要选择上true 或者 false."); + } + } + + public static void dealMaxRetryTime(Configuration originalConfig) { + int maxRetryTime = originalConfig.getInt(Key.MAX_RETRY_TIME, + OdpsUtil.MAX_RETRY_TIME); + if (maxRetryTime < 1 || maxRetryTime > OdpsUtil.MAX_RETRY_TIME) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ILLEGAL_VALUE, "您所配置的maxRetryTime 值错误. 该值不能小于1, 且不能大于 " + OdpsUtil.MAX_RETRY_TIME + + ". 推荐的配置方式是给maxRetryTime 配置1-11之间的某个值. 
请您检查配置并做出相应修改."); + } + MAX_RETRY_TIME = maxRetryTime; + } + + public static String formatPartition(String partitionString) { + if (null == partitionString) { + return null; + } + + return partitionString.trim().replaceAll(" *= *", "=").replaceAll(" */ *", ",") + .replaceAll(" *, *", ",").replaceAll("'", ""); + } + + + public static Odps initOdpsProject(Configuration originalConfig) { + String accountType = originalConfig.getString(Key.ACCOUNT_TYPE); + String accessId = originalConfig.getString(Key.ACCESS_ID); + String accessKey = originalConfig.getString(Key.ACCESS_KEY); + + String odpsServer = originalConfig.getString(Key.ODPS_SERVER); + String project = originalConfig.getString(Key.PROJECT); + + Account account; + if (accountType.equalsIgnoreCase(Constant.DEFAULT_ACCOUNT_TYPE)) { + account = new AliyunAccount(accessId, accessKey); + } else { + throw DataXException.asDataXException(OdpsWriterErrorCode.ACCOUNT_TYPE_ERROR, + String.format("不支持的账号类型:[%s]. 账号类型目前仅支持aliyun, taobao.", accountType)); + } + + Odps odps = new Odps(account); + boolean isPreCheck = originalConfig.getBool("dryRun", false); + if(isPreCheck) { + odps.getRestClient().setConnectTimeout(3); + odps.getRestClient().setReadTimeout(3); + odps.getRestClient().setRetryTimes(2); + } + odps.setDefaultProject(project); + odps.setEndpoint(odpsServer); + + return odps; + } + + public static Table getTable(Odps odps, String projectName, String tableName) { + final Table table = odps.tables().get(projectName, tableName); + try { + //通过这种方式检查表是否存在,失败重试。重试策略:每秒钟重试一次,最大重试3次 + return RetryUtil.executeWithRetry(new Callable
<Table>
() { + @Override + public Table call() throws Exception { + table.reload(); + return table; + } + }, 3, 1000, false); + } catch (Exception e) { + throwDataXExceptionWhenReloadTable(e, tableName); + } + return table; + } + + public static List listOdpsPartitions(Table table) { + List parts = new ArrayList(); + try { + List partitions = table.getPartitions(); + for(Partition partition : partitions) { + parts.add(partition.getPartitionSpec().toString()); + } + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_PARTITION_FAIL, String.format("获取 ODPS 目的表:%s 的所有分区失败. 请联系 ODPS 管理员处理.", + table.getName()), e); + } + return parts; + } + + public static boolean isPartitionedTable(Table table) { + //必须要是非分区表才能 truncate 整个表 + List partitionKeys; + try { + partitionKeys = table.getSchema().getPartitionColumns(); + if (null != partitionKeys && !partitionKeys.isEmpty()) { + return true; + } + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.CHECK_IF_PARTITIONED_TABLE_FAILED, + String.format("检查 ODPS 目的表:%s 是否为分区表失败, 请联系 ODPS 管理员处理.", table.getName()), e); + } + return false; + } + + + public static void truncateNonPartitionedTable(Odps odps, Table tab) { + String truncateNonPartitionedTableSql = "truncate table " + tab.getName() + ";"; + + try { + runSqlTaskWithRetry(odps, truncateNonPartitionedTableSql, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.TABLE_TRUNCATE_ERROR, + String.format(" 清空 ODPS 目的表:%s 失败, 请联系 ODPS 管理员处理.", tab.getName()), e); + } + } + + public static void truncatePartition(Odps odps, Table table, String partition) { + if (isPartitionExist(table, partition)) { + dropPart(odps, table, partition); + } + addPart(odps, table, partition); + } + + private static boolean isPartitionExist(Table table, String partition) { + // check if exist partition 返回值不为 null + List odpsParts = OdpsUtil.listOdpsPartitions(table); + + int j = 0; + for (; j < odpsParts.size(); j++) { + if (odpsParts.get(j).replaceAll("'", "").equals(partition)) { + break; + } + } + + return j != odpsParts.size(); + } + + public static void addPart(Odps odps, Table table, String partition) { + String partSpec = getPartSpec(partition); + // add if not exists partition + StringBuilder addPart = new StringBuilder(); + addPart.append("alter table ").append(table.getName()).append(" add IF NOT EXISTS partition(") + .append(partSpec).append(");"); + try { + runSqlTaskWithRetry(odps, addPart.toString(), MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ADD_PARTITION_FAILED, + String.format("添加 ODPS 目的表的分区失败. 错误发生在添加 ODPS 的项目:%s 的表:%s 的分区:%s. 请联系 ODPS 管理员处理.", + table.getProject(), table.getName(), partition), e); + } + } + + + public static TableTunnel.UploadSession createMasterTunnelUpload(final TableTunnel tunnel, final String projectName, + final String tableName, final String partition) { + if(StringUtils.isBlank(partition)) { + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.UploadSession call() throws Exception { + return tunnel.createUploadSession(projectName, tableName); + } + }, MAX_RETRY_TIME, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.CREATE_MASTER_UPLOAD_FAIL, + "创建TunnelUpload失败. 
请联系 ODPS 管理员处理.", e); + } + } else { + final PartitionSpec partitionSpec = new PartitionSpec(partition); + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.UploadSession call() throws Exception { + return tunnel.createUploadSession(projectName, tableName, partitionSpec); + } + }, MAX_RETRY_TIME, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.CREATE_MASTER_UPLOAD_FAIL, + "创建TunnelUpload失败. 请联系 ODPS 管理员处理.", e); + } + } + } + + public static TableTunnel.UploadSession getSlaveTunnelUpload(final TableTunnel tunnel, final String projectName, final String tableName, + final String partition, final String uploadId) { + + if(StringUtils.isBlank(partition)) { + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.UploadSession call() throws Exception { + return tunnel.getUploadSession(projectName, tableName, uploadId); + } + }, MAX_RETRY_TIME, 1000L, true); + + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_SLAVE_UPLOAD_FAIL, + "获取TunnelUpload失败. 请联系 ODPS 管理员处理.", e); + } + } else { + final PartitionSpec partitionSpec = new PartitionSpec(partition); + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.UploadSession call() throws Exception { + return tunnel.getUploadSession(projectName, tableName, partitionSpec, uploadId); + } + }, MAX_RETRY_TIME, 1000L, true); + + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_SLAVE_UPLOAD_FAIL, + "获取TunnelUpload失败. 请联系 ODPS 管理员处理.", e); + } + } + } + + + private static void dropPart(Odps odps, Table table, String partition) { + String partSpec = getPartSpec(partition); + StringBuilder dropPart = new StringBuilder(); + dropPart.append("alter table ").append(table.getName()) + .append(" drop IF EXISTS partition(").append(partSpec) + .append(");"); + try { + runSqlTaskWithRetry(odps, dropPart.toString(), MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ADD_PARTITION_FAILED, + String.format("Drop ODPS 目的表分区失败. 错误发生在项目:%s 的表:%s 的分区:%s .请联系 ODPS 管理员处理.", + table.getProject(), table.getName(), partition), e); + } + } + + private static String getPartSpec(String partition) { + StringBuilder partSpec = new StringBuilder(); + String[] parts = partition.split(","); + for (int i = 0; i < parts.length; i++) { + String part = parts[i]; + String[] kv = part.split("="); + if (kv.length != 2) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ILLEGAL_VALUE, + String.format("ODPS 目的表自身的 partition:%s 格式不对. 
正确的格式形如: pt=1,ds=hangzhou", partition)); + } + partSpec.append(kv[0]).append("="); + partSpec.append("'").append(kv[1].replace("'", "")).append("'"); + if (i != parts.length - 1) { + partSpec.append(","); + } + } + return partSpec.toString(); + } + + /** + * 该方法只有在 sql 为幂等的才可以使用,且odps抛出异常时候才会进行重试 + * + * @param odps odps + * @param query 执行sql + * @throws Exception + */ + public static void runSqlTaskWithRetry(final Odps odps, final String query, int retryTimes, + long sleepTimeInMilliSecond, boolean exponential) throws Exception { + for(int i = 0; i < retryTimes; i++) { + try { + runSqlTask(odps, query); + return; + } catch (DataXException e) { + if (OdpsWriterErrorCode.RUN_SQL_ODPS_EXCEPTION.equals(e.getErrorCode())) { + LOG.debug("Exception when calling callable", e); + if (i + 1 < retryTimes && sleepTimeInMilliSecond > 0) { + long timeToSleep; + if (exponential) { + timeToSleep = sleepTimeInMilliSecond * (long) Math.pow(2, i); + if(timeToSleep >= 128 * 1000) { + timeToSleep = 128 * 1000; + } + } else { + timeToSleep = sleepTimeInMilliSecond; + if(timeToSleep >= 128 * 1000) { + timeToSleep = 128 * 1000; + } + } + + try { + Thread.sleep(timeToSleep); + } catch (InterruptedException ignored) { + } + } else { + throw e; + } + } else { + throw e; + } + } catch (Exception e) { + throw e; + } + } + } + + public static void runSqlTask(Odps odps, String query) { + if (StringUtils.isBlank(query)) { + return; + } + + String taskName = "datax_odpswriter_trunacte_" + UUID.randomUUID().toString().replace('-', '_'); + + LOG.info("Try to start sqlTask:[{}] to run odps sql:[\n{}\n] .", taskName, query); + + //todo:biz_id set (目前ddl先不做) + Instance instance; + Instance.TaskStatus status; + try { + instance = SQLTask.run(odps, odps.getDefaultProject(), query, taskName, null, null); + instance.waitForSuccess(); + status = instance.getTaskStatus().get(taskName); + if (!Instance.TaskStatus.Status.SUCCESS.equals(status.getStatus())) { + throw DataXException.asDataXException(OdpsWriterErrorCode.RUN_SQL_FAILED, + String.format("ODPS 目的表在运行 ODPS SQL失败, 返回值为:%s. 请联系 ODPS 管理员处理. SQL 内容为:[\n%s\n].", instance.getTaskResults().get(taskName), + query)); + } + } catch (DataXException e) { + throw e; + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.RUN_SQL_ODPS_EXCEPTION, + String.format("ODPS 目的表在运行 ODPS SQL 时抛出异常, 请联系 ODPS 管理员处理. SQL 内容为:[\n%s\n].", query), e); + } + } + + public static void masterCompleteBlocks(final TableTunnel.UploadSession masterUpload, final Long[] blocks) { + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Void call() throws Exception { + masterUpload.commit(blocks); + return null; + } + }, MAX_RETRY_TIME, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.COMMIT_BLOCK_FAIL, + String.format("ODPS 目的表在提交 block:[\n%s\n] 时失败, uploadId=[%s]. 
请联系 ODPS 管理员处理.", StringUtils.join(blocks, ","), masterUpload.getId()), e); + } + } + + public static void slaveWriteOneBlock(final TableTunnel.UploadSession slaveUpload, final ProtobufRecordPack protobufRecordPack, + final long blockId, final boolean isCompress) { + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Void call() throws Exception { + TunnelRecordWriter tunnelRecordWriter = (TunnelRecordWriter)slaveUpload.openRecordWriter(blockId, isCompress); + tunnelRecordWriter.write(protobufRecordPack); + tunnelRecordWriter.close(); + return null; + } + }, MAX_RETRY_TIME, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.WRITER_RECORD_FAIL, + String.format("ODPS 目的表写 block:%s 失败, uploadId=[%s]. 请联系 ODPS 管理员处理.", blockId, slaveUpload.getId()), e); + } + + } + + public static List parsePosition(List allColumnList, + List userConfiguredColumns) { + List retList = new ArrayList(); + + boolean hasColumn; + for (String col : userConfiguredColumns) { + hasColumn = false; + for (int i = 0, len = allColumnList.size(); i < len; i++) { + if (allColumnList.get(i).equalsIgnoreCase(col)) { + retList.add(i); + hasColumn = true; + break; + } + } + if (!hasColumn) { + throw DataXException.asDataXException(OdpsWriterErrorCode.COLUMN_NOT_EXIST, + String.format("ODPS 目的表的列配置错误. 由于您所配置的列:%s 不存在,会导致datax无法正常插入数据,请检查该列是否存在,如果存在请检查大小写等配置.", col)); + } + } + return retList; + } + + public static List getAllColumns(TableSchema schema) { + if (null == schema) { + throw new IllegalArgumentException("parameter schema can not be null."); + } + + List allColumns = new ArrayList(); + + List columns = schema.getColumns(); + OdpsType type; + for(Column column: columns) { + allColumns.add(column.getName()); + type = column.getType(); + if (type == OdpsType.ARRAY || type == OdpsType.MAP) { + throw DataXException.asDataXException(OdpsWriterErrorCode.UNSUPPORTED_COLUMN_TYPE, + String.format("DataX 写入 ODPS 表不支持该字段类型:[%s]. 目前支持抽取的字段类型有:bigint, boolean, datetime, double, string. " + + "您可以选择不抽取 DataX 不支持的字段或者联系 ODPS 管理员寻求帮助.", + type)); + } + } + return allColumns; + } + + public static List getTableOriginalColumnTypeList(TableSchema schema) { + List tableOriginalColumnTypeList = new ArrayList(); + + List columns = schema.getColumns(); + for (Column column : columns) { + tableOriginalColumnTypeList.add(column.getType()); + } + + return tableOriginalColumnTypeList; + } + + public static void dealTruncate(Odps odps, Table table, String partition, boolean truncate) { + boolean isPartitionedTable = OdpsUtil.isPartitionedTable(table); + + if (truncate) { + //需要 truncate + if (isPartitionedTable) { + //分区表 + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, String.format("您没有配置分区信息,因为你配置的表是分区表:%s 如果需要进行 truncate 操作,必须指定需要清空的具体分区. 请修改分区配置,格式形如 pt=${bizdate} .", + table.getName())); + } else { + LOG.info("Try to truncate partition=[{}] in table=[{}].", partition, table.getName()); + OdpsUtil.truncatePartition(odps, table, partition); + } + } else { + //非分区表 + if (StringUtils.isNotBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, String.format("分区信息配置错误,你的ODPS表是非分区表:%s 进行 truncate 操作时不需要指定具体分区值. 
请检查您的分区配置,删除该配置项的值.", + table.getName())); + } else { + LOG.info("Try to truncate table:[{}].", table.getName()); + OdpsUtil.truncateNonPartitionedTable(odps, table); + } + } + } else { + //不需要 truncate + if (isPartitionedTable) { + //分区表 + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, + String.format("您的目的表是分区表,写入分区表:%s 时必须指定具体分区值. 请修改您的分区配置信息,格式形如 格式形如 pt=${bizdate}.", table.getName())); + } else { + boolean isPartitionExists = OdpsUtil.isPartitionExist(table, partition); + if (!isPartitionExists) { + LOG.info("Try to add partition:[{}] in table:[{}].", partition, + table.getName()); + OdpsUtil.addPart(odps, table, partition); + } + } + } else { + //非分区表 + if (StringUtils.isNotBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, + String.format("您的目的表是非分区表,写入非分区表:%s 时不需要指定具体分区值. 请删除分区配置信息", table.getName())); + } + } + } + } + + + /** + * 检查odpswriter 插件的分区信息 + * + * @param odps + * @param table + * @param partition + * @param truncate + */ + public static void preCheckPartition(Odps odps, Table table, String partition, boolean truncate) { + boolean isPartitionedTable = OdpsUtil.isPartitionedTable(table); + + if (truncate) { + //需要 truncate + if (isPartitionedTable) { + //分区表 + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, String.format("您没有配置分区信息,因为你配置的表是分区表:%s 如果需要进行 truncate 操作,必须指定需要清空的具体分区. 请修改分区配置,格式形如 pt=${bizdate} .", + table.getName())); + } + } else { + //非分区表 + if (StringUtils.isNotBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, String.format("分区信息配置错误,你的ODPS表是非分区表:%s 进行 truncate 操作时不需要指定具体分区值. 请检查您的分区配置,删除该配置项的值.", + table.getName())); + } + } + } else { + //不需要 truncate + if (isPartitionedTable) { + //分区表 + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, + String.format("您的目的表是分区表,写入分区表:%s 时必须指定具体分区值. 请修改您的分区配置信息,格式形如 格式形如 pt=${bizdate}.", table.getName())); + } + } else { + //非分区表 + if (StringUtils.isNotBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, + String.format("您的目的表是非分区表,写入非分区表:%s 时不需要指定具体分区值. 请删除分区配置信息", table.getName())); + } + } + } + } + + /** + * table.reload() 方法抛出的 odps 异常 转化为更清晰的 datax 异常 抛出 + */ + public static void throwDataXExceptionWhenReloadTable(Exception e, String tableName) { + if(e.getMessage() != null) { + if(e.getMessage().contains(OdpsExceptionMsg.ODPS_PROJECT_NOT_FOUNT)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_PROJECT_NOT_FOUNT, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 [project] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_TABLE_NOT_FOUNT)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_TABLE_NOT_FOUNT, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 [table] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_KEY_ID_NOT_FOUND)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_ACCESS_KEY_ID_NOT_FOUND, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 [accessId] [accessKey]是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_KEY_INVALID)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_ACCESS_KEY_INVALID, + String.format("加载 ODPS 目的表:%s 失败. 
" + + "请检查您配置的 ODPS 目的表的 [accessKey] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_DENY)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_ACCESS_DENY, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 [accessId] [accessKey] [project]是否匹配.", tableName), e); + } + } + throw DataXException.asDataXException(OdpsWriterErrorCode.ILLEGAL_VALUE, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 project,table,accessId,accessKey,odpsServer等值.", tableName), e); + } + +} diff --git a/odpswriter/src/main/resources/TODO.txt b/odpswriter/src/main/resources/TODO.txt new file mode 100755 index 000000000..d05f28f1d --- /dev/null +++ b/odpswriter/src/main/resources/TODO.txt @@ -0,0 +1,3 @@ +2、truncate 必填项 +3、对于非分区表,允许出现 partition 配置的 Key,Value 为空。 但是需要提示用户 +4、2w blockid 问题? \ No newline at end of file diff --git a/odpswriter/src/main/resources/plugin.json b/odpswriter/src/main/resources/plugin.json new file mode 100755 index 000000000..d867129e8 --- /dev/null +++ b/odpswriter/src/main/resources/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "odpswriter", + "class": "com.alibaba.datax.plugin.writer.odpswriter.OdpsWriter", + "description": { + "useScene": "prod.", + "mechanism": "TODO", + "warn": "TODO" + }, + "developer": "alibaba" +} \ No newline at end of file diff --git a/odpswriter/src/main/resources/plugin_job_template.json b/odpswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..3570f9eba --- /dev/null +++ b/odpswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "odpswriter", + "parameter": { + "project": "", + "table": "", + "partition":"", + "column": [], + "accessId": "", + "accessKey": "", + "truncate": true, + "odpsServer": "", + "tunnelServer": "" + } +} \ No newline at end of file diff --git a/oraclereader/doc/oraclereader.md b/oraclereader/doc/oraclereader.md new file mode 100644 index 000000000..d5a5d110c --- /dev/null +++ b/oraclereader/doc/oraclereader.md @@ -0,0 +1,347 @@ + +# OracleReader 插件文档 + + +___ + + +## 1 快速介绍 + +OracleReader插件实现了从Oracle读取数据。在底层实现上,OracleReader通过JDBC连接远程Oracle数据库,并执行相应的sql语句将数据从Oracle库中SELECT出来。 + +## 2 实现原理 + +简而言之,OracleReader通过JDBC连接器连接到远程的Oracle数据库,并根据用户配置的信息生成查询SELECT SQL语句并发送到远程Oracle数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,OracleReader将其拼接为SQL语句发送到Oracle数据库;对于用户配置querySql信息,Oracle直接将其发送到Oracle数据库。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从Oracle数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + //设置传输速度 byte/s 尽量逼近这个速度但是不高于它. 
+ "byte": 1048576 + }, + //出错限制 + "errorLimit": { + //先选择record + "record": 0, + //百分比 1表示100% + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "oraclereader", + "parameter": { + // 数据库连接用户名 + "username": "root", + // 数据库连接密码 + "password": "root", + "column": [ + "id","name" + ], + //切分主键 + "splitPk": "db_id", + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]" + ] + } + ] + } + }, + "writer": { + //writer类型 + "name": "streamwriter", + // 是否打印内容 + "parameter": { + "print": true + } + } + } + ] + } +} + +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + "speed": 1048576 + }, + "content": [ + { + "reader": { + "name": "oraclereader", + "parameter": { + "username": "root", + "password": "root", + "where": "", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "visible": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述,并支持一个库填写多个连接地址。之所以使用JSON数组描述连接信息,是因为阿里集团内部支持多个IP探测,如果配置了多个,OracleReader可以依次探测ip的可连接性,直到选择一个合法的IP。如果全部连接失败,OracleReader报错。 注意,jdbcUrl必须包含在connection配置单元中。对于阿里集团外部使用情况,JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照Oracle官方规范,并可以填写连接附件控制信息。具体请参看[Oracle官方文档](http://www.oracle.com/technetwork/database/enterprise-edition/documentation/index.html)。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取的需要同步的表。使用JSON的数组描述,因此支持多张表同时抽取。当配置为多张表时,用户自己需保证多张表是同一schema结构,OracleReader不予检查表是否同一逻辑表。注意,table必须包含在connection配置单元中。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用*代表默认使用所有列配置,例如['*']。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照JSON格式: + ["id", "`table`", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3" , "true"] + id为普通列名,\`table\`为包含保留在的列名,1为整形数字常量,'bazhen.csy'为字符串常量,null为空指针,to_char(a + 1)为表达式,2.3为浮点数,true为布尔值。 + + Column必须显示填写,不允许为空! + + * 必选:是
+ + * 默认值:无
+ +* **splitPk** + + * 描述:OracleReader进行数据抽取时,如果指定splitPk,表示用户希望使用splitPk代表的字段进行数据分片,DataX因此会启动并发任务进行数据同步,这样可以大大提供数据同步的效能。 + + 推荐splitPk用户使用表主键,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + + 目前splitPk仅支持整形、字符串型数据切分,`不支持浮点、日期等其他类型`。如果用户指定其他非支持类型,OracleReader将报错! + + splitPk如果不填写,将视作用户不对单表进行切分,OracleReader使用单通道同步全量数据。 + + * 必选:否
+ + * 默认值:无
+ +* **where** + + * 描述:筛选条件,OracleReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。例如在做测试时,可以将where条件指定为limit 10;在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate 。
 + + where条件可以有效地进行业务增量同步。 + + * 必选:否
+ + * 默认值:无
+ +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置型来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table,column这些配置型,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id
+ + `当用户配置querySql时,OracleReader直接忽略table、column、where条件的配置`。 + + * 必选:否
+ + * 默认值:无
+ +* **fetchSize** + + * 描述:该配置项定义了插件和数据库服务器端每次批量数据获取条数,该值决定了DataX和服务器端的网络交互次数,能够较大的提升数据抽取性能。
+ + `注意,该值过大(>2048)可能造成DataX进程OOM。`。 + + * 必选:否
+ + * 默认值:1024
+ +* **session** + + * 描述:控制写入数据的时间格式,时区等的配置,如果表中有时间字段,配置该值以明确告知写入 oracle 的时间格式。通常配置的参数为:NLS_DATE_FORMAT,NLS_TIME_FORMAT。其配置的值为 json 格式,例如: +``` +"session": [ + "alter session set NLS_DATE_FORMAT='yyyy-mm-dd hh24:mi:ss'", + "alter session set NLS_TIMESTAMP_FORMAT='yyyy-mm-dd hh24:mi:ss'", + "alter session set NLS_TIMESTAMP_TZ_FORMAT='yyyy-mm-dd hh24:mi:ss'", + "alter session set TIME_ZONE='US/Pacific'" + ] +``` + `(注意"是 " 的转义字符串)`。 + + * 必选:否
+ + * 默认值:无
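+
+下面再给出一个把 where、fetchSize、session 组合使用的 reader 参数片段,仅作示意:其中的表名 db_info、字段名、${bizdate} 变量以及 session 语句均为假设值,请按实际业务替换。
+
+```json
+"reader": {
+    "name": "oraclereader",
+    "parameter": {
+        "username": "root",
+        "password": "root",
+        "column": ["db_id", "gmt_create"],
+        "splitPk": "db_id",
+        "where": "gmt_create > '${bizdate}'",
+        "fetchSize": 1024,
+        "session": [
+            "alter session set NLS_DATE_FORMAT='yyyy-mm-dd hh24:mi:ss'"
+        ],
+        "connection": [
+            {
+                "table": ["db_info"],
+                "jdbcUrl": ["jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]"]
+            }
+        ]
+    }
+}
+```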
+ + +### 3.3 类型转换 + +目前OracleReader支持大部分Oracle类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出OracleReader针对Oracle类型转换列表: + + +| DataX 内部类型| Oracle 数据类型 | +| -------- | ----- | +| Long |NUMBER,RAWID,INTEGER,INT,SMALLINT| +| Double |NUMERIC,DECIMAL,FLOAT,DOUBLE PRECISION,REAL| +| String |LONG,CHAR,NCHAR,VARCHAR,VARCHAR2,NVARCHAR2,CLOB,NCLOB,CHARACTER,CHARACTER VARYING,CHAR VARYING,NATIONAL CHARACTER,NATIONAL CHAR,NATIONAL CHARACTER VARYING,NATIONAL CHAR VARYING,NCHAR VARYING | +| Date |TIMESTAMP,DATE | +| Boolean |bit, bool | +| Bytes |BLOB,BFILE,RAW,LONG RAW | + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 + + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 + +为了模拟线上真实数据,我们设计两个Oracle数据表,分别为: + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + +* Oracle数据库机器参数为: + +### 4.2 测试报告 + +#### 4.2.1 表1测试报告 + + +| 并发任务数| DataX速度(Rec/s)|DataX流量|网卡流量|DataX运行负载|DB运行负载| +|--------| --------|--------|--------|--------|--------| +|1| DataX 统计速度(Rec/s)|DataX统计流量|网卡流量|DataX运行负载|DB运行负载| + +## 5 约束限制 + +### 5.1 主备同步数据恢复问题 + +主备同步问题指Oracle使用主从灾备,备库从主库不间断通过binlog恢复数据。由于主备数据同步存在一定的时间差,特别在于某些特定情况,例如网络延迟等问题,导致备库同步恢复的数据与主库有较大差别,导致从备库同步的数据不是一份当前时间的完整镜像。 + +针对这个问题,我们提供了preSql功能,该功能待补充。 + +### 5.2 一致性约束 + +Oracle在数据存储划分中属于RDBMS系统,对外可以提供强一致性数据查询接口。例如当一次同步任务启动运行过程中,当该库存在其他数据写入方写入数据时,OracleReader完全不会获取到写入更新数据,这是由于数据库本身的快照特性决定的。关于数据库快照特性,请参看[MVCC Wikipedia](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) + +上述是在OracleReader单线程模型下数据同步一致性的特性,由于OracleReader可以根据用户配置信息使用了并发数据抽取,因此不能严格保证数据一致性:当OracleReader根据splitPk进行数据切分后,会先后启动多个并发任务完成数据同步。由于多个并发任务相互之间不属于同一个读事务,同时多个并发任务存在时间间隔。因此这份数据并不是`完整的`、`一致的`数据快照信息。 + +针对多线程的一致性快照需求,在技术上目前无法实现,只能从工程角度解决,工程化的方式存在取舍,我们提供几个解决思路给用户,用户可以自行选择: + +1. 使用单线程同步,即不再进行数据切片。缺点是速度比较慢,但是能够很好保证一致性。 + +2. 关闭其他数据写入方,保证当前数据为静态数据,例如,锁表、关闭备库同步等等。缺点是可能影响在线业务。 + +### 5.3 数据库编码问题 + + +OracleReader底层使用JDBC进行数据抽取,JDBC天然适配各类编码,并在底层进行了编码转换。因此OracleReader不需用户指定编码,可以自动获取编码并转码。 + +对于Oracle底层写入编码和其设定的编码不一致的混乱情况,OracleReader对此无法识别,对此也无法提供解决方案,对于这类情况,`导出有可能为乱码`。 + +### 5.4 增量数据同步 + +OracleReader使用JDBC SELECT语句完成数据抽取工作,因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种: + +* 数据库在线应用写入数据库时,填充modify字段为更改时间戳,包括新增、更新、删除(逻辑删)。对于这类应用,OracleReader只需要WHERE条件跟上一同步阶段时间戳即可。 +* 对于新增流水型数据,OracleReader可以WHERE条件后跟上一阶段最大自增ID即可。 + +对于业务上无字段区分新增、修改数据情况,OracleReader也无法进行增量数据同步,只能同步全量数据。 + +### 5.5 Sql安全性 + +OracleReader提供querySql语句交给用户自己实现SELECT抽取语句,OracleReader本身对querySql不做任何安全性校验。这块交由DataX用户方自己保证。 + +## 6 FAQ + +*** + +**Q: OracleReader同步报错,报错信息为XXX** + + A: 网络或者权限问题,请使用Oracle命令行测试: + sqlplus username/password@//host:port/sid + + +如果上述命令也报错,那可以证实是环境问题,请联系你的DBA。 + + +**Q: OracleReader抽取速度很慢怎么办?** + + A: 影响抽取时间的原因大概有如下几个:(来自专业 DBA 卫绾) + 1. 由于SQL的plan异常,导致的抽取时间长; 在抽取时,尽可能使用全表扫描代替索引扫描; + 2. 合理sql的并发度,减少抽取时间;根据表的大小, + <50G可以不用并发, + <100G添加如下hint: parallel(a,2), + >100G添加如下hint : parallel(a,4); + 3. 
抽取sql要简单,尽量不用replace等函数,这个非常消耗cpu,会严重影响抽取速度; \ No newline at end of file diff --git a/oraclereader/oraclereader.iml b/oraclereader/oraclereader.iml new file mode 100644 index 000000000..c352ce1cc --- /dev/null +++ b/oraclereader/oraclereader.iml @@ -0,0 +1,54 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/oraclereader/pom.xml b/oraclereader/pom.xml new file mode 100755 index 000000000..739dff042 --- /dev/null +++ b/oraclereader/pom.xml @@ -0,0 +1,86 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + oraclereader + oraclereader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + com.oracle + ojdbc6 + 11.2.0.3 + system + ${basedir}/src/main/lib/ojdbc6-11.2.0.3.jar + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/oraclereader/src/main/assembly/package.xml b/oraclereader/src/main/assembly/package.xml new file mode 100755 index 000000000..a954a30d5 --- /dev/null +++ b/oraclereader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/oraclereader + + + target/ + + oraclereader-0.0.1-SNAPSHOT.jar + + plugin/reader/oraclereader + + + + + + false + plugin/reader/oraclereader/libs + runtime + + + diff --git a/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/Constant.java b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/Constant.java new file mode 100755 index 000000000..8006b1a6c --- /dev/null +++ b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/Constant.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.reader.oraclereader; + +public class Constant { + + public static final int DEFAULT_FETCH_SIZE = 1024; + +} diff --git a/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReader.java b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReader.java new file mode 100755 index 000000000..403b30e9b --- /dev/null +++ b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReader.java @@ -0,0 +1,126 @@ +package com.alibaba.datax.plugin.reader.oraclereader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.reader.util.HintUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +public class OracleReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.Oracle; + + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory + 
.getLogger(OracleReader.Job.class); + + private Configuration originalConfig = null; + private CommonRdbmsReader.Job commonRdbmsReaderJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + dealFetchSize(this.originalConfig); + + this.commonRdbmsReaderJob = new CommonRdbmsReader.Job( + DATABASE_TYPE); + this.commonRdbmsReaderJob.init(this.originalConfig); + + // 注意:要在 this.commonRdbmsReaderJob.init(this.originalConfig); 之后执行,这样可以直接快速判断是否是querySql 模式 + dealHint(this.originalConfig); + } + + @Override + public void preCheck(){ + init(); + this.commonRdbmsReaderJob.preCheck(this.originalConfig,DATABASE_TYPE); + } + + @Override + public List split(int adviceNumber) { + return this.commonRdbmsReaderJob.split(this.originalConfig, + adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderJob.destroy(this.originalConfig); + } + + private void dealFetchSize(Configuration originalConfig) { + int fetchSize = originalConfig.getInt( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + Constant.DEFAULT_FETCH_SIZE); + if (fetchSize < 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.REQUIRED_VALUE, + String.format("您配置的 fetchSize 有误,fetchSize:[%d] 值不能小于 1.", + fetchSize)); + } + originalConfig.set( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + fetchSize); + } + + private void dealHint(Configuration originalConfig) { + String hint = originalConfig.getString(Key.HINT); + if (StringUtils.isNotBlank(hint)) { + boolean isTableMode = originalConfig.getBool(com.alibaba.datax.plugin.rdbms.reader.Constant.IS_TABLE_MODE).booleanValue(); + if(!isTableMode){ + throw DataXException.asDataXException(OracleReaderErrorCode.HINT_ERROR, "当且仅当非 querySql 模式读取 oracle 时才能配置 HINT."); + } + HintUtil.initHintConf(DATABASE_TYPE, originalConfig); + } + } + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderTask; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderTask = new CommonRdbmsReader.Task( + DATABASE_TYPE ,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderTask.init(this.readerSliceConfig); + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig + .getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE); + + this.commonRdbmsReaderTask.startRead(this.readerSliceConfig, + recordSender, super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderTask.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderTask.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReaderErrorCode.java b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReaderErrorCode.java new file mode 100755 index 000000000..05ee8604a --- /dev/null +++ b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReaderErrorCode.java @@ -0,0 +1,33 @@ +package com.alibaba.datax.plugin.reader.oraclereader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum OracleReaderErrorCode implements ErrorCode { + HINT_ERROR("Oraclereader-00", "您的 Hint 配置出错."), + + ; + + private final String code; + 
private final String description; + + private OracleReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/oraclereader/src/main/lib/ojdbc6-11.2.0.3.jar b/oraclereader/src/main/lib/ojdbc6-11.2.0.3.jar new file mode 100644 index 000000000..01da074d5 Binary files /dev/null and b/oraclereader/src/main/lib/ojdbc6-11.2.0.3.jar differ diff --git a/oraclereader/src/main/resources/plugin.json b/oraclereader/src/main/resources/plugin.json new file mode 100755 index 000000000..f1ed98aec --- /dev/null +++ b/oraclereader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "oraclereader", + "class": "com.alibaba.datax.plugin.reader.oraclereader.OracleReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/oraclereader/src/main/resources/plugin_job_template.json b/oraclereader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..beae2552e --- /dev/null +++ b/oraclereader/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "oraclereader", + "parameter": { + "username": "", + "password": "", + "column": [], + "connection": [ + { + "table": [], + "jdbcUrl": [] + } + ] + } +} \ No newline at end of file diff --git a/oraclewriter/doc/oraclewriter.md b/oraclewriter/doc/oraclewriter.md new file mode 100644 index 000000000..91ab2e7cc --- /dev/null +++ b/oraclewriter/doc/oraclewriter.md @@ -0,0 +1,416 @@ +# DataX OracleWriter + + +--- + + +## 1 快速介绍 + +OracleWriter 插件实现了写入数据到 Oracle 主库的目的表的功能。在底层实现上, OracleWriter 通过 JDBC 连接远程 Oracle 数据库,并执行相应的 insert into ... sql 语句将数据写入 Oracle,内部会分批次提交入库。 + +OracleWriter 面向ETL开发工程师,他们使用 OracleWriter 从数仓导入数据到 Oracle。同时 OracleWriter 亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +OracleWriter 通过 DataX 框架获取 Reader 生成的协议数据,根据你配置生成相应的SQL语句 + + +* `insert into...`(当主键/唯一性索引冲突时会写不进去冲突的行) + +
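+为了更直观,下面用一张假设的两列表 test(id, name) 示意这类 insert 语句的形态(按 DataX 通用 RDBMS 写入逻辑,一般为带占位符的预编译语句并分批提交;真实的表名、列名与列数以 connection/column 配置为准):
+
+```
+INSERT INTO test (id, name) VALUES (?, ?)
+```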
+ + 注意: + 1. 目的表所在数据库必须是主库才能写入数据;整个任务至少需具备 insert into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + 2.OracleWriter和MysqlWriter不同,不支持配置writeMode参数。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 Oracle 导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "Oraclewriter", + "parameter": { + "username": "root", + "password": "root", + "column": [ + "id", + "name" + ], + "preSql": [ + "delete from test" + ], + "connection": [ + { + "jdbcUrl": "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]", + "table": [ + "test" + ] + } + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息 ,jdbcUrl必须包含在connection配置单元中。 + + 注意:1、在一个数据库上只能配置一个值。这与 OracleReader 支持多个备库探测不同,因为此处不支持同一个数据库存在多个主库的情况(双主导入数据情况) + 2、jdbcUrl按照Oracle官方规范,并可以填写连接附加参数信息。具体请参看 Oracle官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。支持写入一个或者多个表。当配置为多张表时,必须确保所有表结构保持一致。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"] + + **column配置项必须指定,不能留空!** + + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、此处 column 不能配置任何常量值 + + * 必选:是
 + + * 默认值:无
+ +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。如果 Sql 中有你需要操作到的表名称,请使用 `@table` 表示,这样在实际执行 Sql 语句时,会对变量按照实际表名称进行替换。比如你的任务是要写入到目的端的100个同构分表(表名称为:datax_00,datax01, ... datax_98,datax_99),并且你希望导入数据前,先对表中数据进行删除操作,那么你可以这样配置:`"preSql":["delete from @table"]`,效果是:在执行到每个表写入数据前,会先执行对应的 delete from 对应表名称
+ + * 必选:否
+ + * 默认值:无
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与Oracle的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:1024
+ +* **session** + + * 描述:设置oracle连接时的session信息,格式示例如下:
+ + ``` + "session":[ + "alter session set nls_date_format = 'dd.mm.yyyy hh24:mi:ss';" + "alter session set NLS_LANG = 'AMERICAN';" + ] + + ``` + + * 必选:否
+ + * 默认值:无
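+
+下面是一个把 preSql(含 @table 占位符)、batchSize、session 组合使用的 writer 参数片段,仅作示意:表名 test、列名以及 session 取值均为假设值,请按实际情况调整。
+
+```json
+"writer": {
+    "name": "oraclewriter",
+    "parameter": {
+        "username": "root",
+        "password": "root",
+        "column": ["id", "name"],
+        "preSql": ["delete from @table"],
+        "batchSize": 1024,
+        "session": [
+            "alter session set nls_date_format = 'dd.mm.yyyy hh24:mi:ss';"
+        ],
+        "connection": [
+            {
+                "jdbcUrl": "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]",
+                "table": ["test"]
+            }
+        ]
+    }
+}
+```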
+ +### 3.3 类型转换 + +类似 OracleReader ,目前 OracleWriter 支持大部分 Oracle 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出 OracleWriter 针对 Oracle 类型转换列表: + + +| DataX 内部类型| Oracle 数据类型 | +| -------- | ----- | +| Long |NUMBER,RAWID,INTEGER,INT,SMALLINT| +| Double |NUMERIC,DECIMAL,FLOAT,DOUBLE PRECISION,REAL| +| String |LONG,CHAR,NCHAR,VARCHAR,VARCHAR2,NVARCHAR2,CLOB,NCLOB,CHARACTER,CHARACTER VARYING,CHAR VARYING,NATIONAL CHARACTER,NATIONAL CHAR,NATIONAL CHARACTER VARYING,NATIONAL CHAR VARYING,NCHAR VARYING | +| Date |TIMESTAMP,DATE | +| Boolean |bit, bool | +| Bytes |BLOB,BFILE,RAW,LONG RAW | + + + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: +``` +--DROP TABLE PERF_ORACLE_WRITER; +CREATE TABLE PERF_ORACLE_WRITER ( +COL1 VARCHAR2(255 BYTE) NULL , +COL2 NUMBER(32) NULL , +COL3 NUMBER(32) NULL , +COL4 DATE NULL , +COL5 FLOAT NULL , +COL6 VARCHAR2(255 BYTE) NULL , +COL7 VARCHAR2(255 BYTE) NULL , +COL8 VARCHAR2(255 BYTE) NULL , +COL9 VARCHAR2(255 BYTE) NULL , +COL10 VARCHAR2(255 BYTE) NULL +) +LOGGING +NOCOMPRESS +NOCACHE; +``` +单行记录类似于: +``` +col1:485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&* +co12:1 +co13:1696248667889 +co14:2013-01-06 00:00:00 +co15:3.141592653578 +co16:100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 +co17:100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209 +co18:100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209 +co19:100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209 +co110:12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 +``` +#### 4.1.2 机器参数 + +* 执行 DataX 的机器参数为: + 1. cpu: 24 Core Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz + 2. mem: 94GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* Oracle 数据库机器参数为: + 1. cpu: 4 Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz + 2. 
mem: 7GB + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + +#### 4.1.4 性能测试作业配置 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 4 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "sliceRecordCount": 1000000000, + "column": [ + { + "value": "485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&*" + }, + { + "value": 1, + "type": "long" + }, + { + "value": "1696248667889", + "type": "long" + }, + { + "type": "date", + "value": "2013-07-06 00:00:00", + "dateFormat": "yyyy-mm-dd hh:mm:ss" + }, + { + "value": "3.141592653578", + "type": "double" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209" + }, + { + "value": "100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209" + }, + { + "value": "12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + } + ] + } + }, + "writer": { + "name": "oraclewriter", + "parameter": { + "username": "username", + "password": "password", + "truncate": "true", + "batchSize": "512", + "column": [ + "col1", + "col2", + "col3", + "col4", + "col5", + "col6", + "col7", + "col8", + "col9", + "col10" + ], + "connection": [ + { + "table": [ + "PERF_ORACLE_WRITER" + ], + "jdbcUrl": "jdbc:oracle:thin:@ip:port:dataplat" + } + ] + } + } + } + ] + } +} + +``` + +### 4.2 测试报告 + +#### 4.2.1 测试报告 + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DB网卡进入流量(MB/s)|DB运行负载| +|--------|--------| --------|--------|--------|--------|--------|--------| +|1|128|15564|6.51|7.5|0.02|7.4|1.08| +|1|512|29491|10.90|12.6|0.05|12.4|1.55| +|1|1024|31529|11.87|13.5|0.22|13.3|1.58| +|1|2048|33469|12.57|14.3|0.17|14.3|1.53| +|1|4096|31363|12.48|13.4|0.10|10.0|1.72| +|4|10|9440|4.05|5.6|0.01|5.0|3.75| +|4|128|42832|16.48|18.3|0.07|18.5|2.89| +|4|512|46643|20.02|22.7|0.35|21.1|3.31| +|4|1024|39116|16.79|18.7|0.10|18.1|3.05| +|4|2048|39526|16.96|18.5|0.32|17.1|2.86| +|4|4096|37683|16.17|17.2|0.23|15.5|2.26| +|8|128|38336|16.45|17.5|0.13|16.2|3.87| +|8|512|31078|13.34|14.9|0.11|13.4|2.09| +|8|1024|37888|16.26|18.5|0.20|18.5|3.14| +|8|2048|38502|16.52|18.5|0.18|18.5|2.96| +|8|4096|38092|16.35|18.3|0.10|17.8|3.19| +|16|128|35366|15.18|16.9|0.13|15.6|3.49| +|16|512|35584|15.27|16.8|0.23|17.4|3.05| +|16|1024|38297|16.44|17.5|0.20|17.0|3.42| +|16|2048|28467|12.22|13.8|0.10|12.4|3.38| +|16|4096|27852|11.95|12.3|0.11|12.3|3.86| +|32|1024|34406|14.77|15.4|0.09|15.4|3.55| + + +1. `batchSize 和 通道个数,对性能影响较大` +2. 
`通常不建议写入数据库时,通道个数 >32` + + + +## 5 约束限制 + + + + +## FAQ + +*** + +**Q: OracleWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** + +**Q: 上面第二种方法可以避免对线上数据造成影响,那我具体怎样操作?** + +A: 可以配置临时表导入 diff --git a/oraclewriter/oraclewriter.iml b/oraclewriter/oraclewriter.iml new file mode 100644 index 000000000..c352ce1cc --- /dev/null +++ b/oraclewriter/oraclewriter.iml @@ -0,0 +1,54 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/oraclewriter/pom.xml b/oraclewriter/pom.xml new file mode 100755 index 000000000..1a85ac11e --- /dev/null +++ b/oraclewriter/pom.xml @@ -0,0 +1,82 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + oraclewriter + oraclewriter + jar + writer data into oracle database + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + com.oracle + ojdbc6 + 11.2.0.3 + system + ${basedir}/src/main/lib/ojdbc6-11.2.0.3.jar + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/oraclewriter/src/main/assembly/package.xml b/oraclewriter/src/main/assembly/package.xml new file mode 100755 index 000000000..9dab0c8e1 --- /dev/null +++ b/oraclewriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/oraclewriter + + + target/ + + oraclewriter-0.0.1-SNAPSHOT.jar + + plugin/writer/oraclewriter + + + + + + false + plugin/writer/oraclewriter/libs + runtime + + + diff --git a/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriter.java b/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriter.java new file mode 100755 index 000000000..73a9ad6a3 --- /dev/null +++ b/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriter.java @@ -0,0 +1,104 @@ +package com.alibaba.datax.plugin.writer.oraclewriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + +public class OracleWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.Oracle; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private CommonRdbmsWriter.Job commonRdbmsWriterJob; + + public void preCheck() { + this.init(); + this.commonRdbmsWriterJob.writerPreCheck(this.originalConfig, DATABASE_TYPE); + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + // warn:not like 
mysql, oracle only support insert mode, don't use + String writeMode = this.originalConfig.getString(Key.WRITE_MODE); + if (null != writeMode) { + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "写入模式(writeMode)配置错误. 因为Oracle不支持配置项 writeMode: %s, Oracle只能使用insert sql 插入数据. 请检查您的配置并作出修改", + writeMode)); + } + + this.commonRdbmsWriterJob = new CommonRdbmsWriter.Job( + DATABASE_TYPE); + this.commonRdbmsWriterJob.init(this.originalConfig); + } + + @Override + public void prepare() { + //oracle实跑先不做权限检查 + //this.commonRdbmsWriterJob.privilegeValid(this.originalConfig, DATABASE_TYPE); + this.commonRdbmsWriterJob.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterJob.split(this.originalConfig, + mandatoryNumber); + } + + @Override + public void post() { + this.commonRdbmsWriterJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterTask; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterTask = new CommonRdbmsWriter.Task(DATABASE_TYPE); + this.commonRdbmsWriterTask.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterTask.prepare(this.writerSliceConfig); + } + + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterTask.startWrite(recordReceiver, + this.writerSliceConfig, super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterTask.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterTask.destroy(this.writerSliceConfig); + } + + } + +} diff --git a/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriterErrorCode.java b/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriterErrorCode.java new file mode 100755 index 000000000..06f0cfa26 --- /dev/null +++ b/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriterErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.writer.oraclewriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum OracleWriterErrorCode implements ErrorCode { + ; + + private final String code; + private final String describe; + + private OracleWriterErrorCode(String code, String describe) { + this.code = code; + this.describe = describe; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.describe; + } + + @Override + public String toString() { + return String.format("Code:[%s], Describe:[%s]. ", this.code, + this.describe); + } +} diff --git a/oraclewriter/src/main/lib/ojdbc6-11.2.0.3.jar b/oraclewriter/src/main/lib/ojdbc6-11.2.0.3.jar new file mode 100644 index 000000000..01da074d5 Binary files /dev/null and b/oraclewriter/src/main/lib/ojdbc6-11.2.0.3.jar differ diff --git a/oraclewriter/src/main/resources/plugin.json b/oraclewriter/src/main/resources/plugin.json new file mode 100755 index 000000000..54df0a890 --- /dev/null +++ b/oraclewriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "oraclewriter", + "class": "com.alibaba.datax.plugin.writer.oraclewriter.OracleWriter", + "description": "useScene: prod. 
mechanism: Jdbc connection using the database, execute insert sql. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/oraclewriter/src/main/resources/plugin_job_template.json b/oraclewriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..0ef68e9ed --- /dev/null +++ b/oraclewriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "oraclewriter", + "parameter": { + "username": "", + "password": "", + "column": [], + "preSql": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ] + } +} \ No newline at end of file diff --git a/ossreader/doc/ossreader.md b/ossreader/doc/ossreader.md new file mode 100644 index 000000000..3020455a9 --- /dev/null +++ b/ossreader/doc/ossreader.md @@ -0,0 +1,244 @@ +# DataX OSSReader 说明 + + +------------ + +## 1 快速介绍 + +OSSReader提供了读取OSS数据存储的能力。在底层实现上,OSSReader使用OSS官方Java SDK获取OSS数据,并转换为DataX传输协议传递给Writer。 + +* OSS 产品介绍, 参看[[阿里云OSS Portal](http://www.aliyun.com/product/oss)] +* OSS Java SDK, 参看[[阿里云OSS Java SDK](http://oss.aliyuncs.com/aliyun_portal_storage/help/oss/OSS_Java_SDK_Dev_Guide_20141113.pdf)] + +## 2 功能与限制 + +OSSReader实现了从OSS读取数据并转为DataX协议的功能,OSS本身是无结构化数据存储,对于DataX而言,OSSReader实现上类比TxtFileReader,有诸多相似之处。目前OSSReader支持功能如下: + +1. 支持且仅支持读取TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 支持多种类型数据读取(使用String表示),支持列裁剪,支持列常量 + +4. 支持递归读取、支持文件名过滤。 + +5. 支持文本压缩,现有压缩格式为gzip、bzip2。注意,一个压缩包不允许多文件打包压缩。 + +6. 多个object可以支持并发读取。 + +我们暂时不能做到: + +1. 单个Object(File)支持多线程并发读取,这里涉及到单个Object内部切分算法。二期考虑支持。 + +2. 单个Object在压缩情况下,从技术上无法支持多线程并发读取。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "job": { + "setting": {}, + "content": [ + { + "reader": { + "name": "ossreader", + "parameter": { + "endpoint": "http://oss.aliyuncs.com", + "accessId": "", + "accessKey": "", + "bucket": "myBucket", + "object": [ + "bazhen/*" + ], + "column": [ + { + "type": "long", + "index": 0 + }, + { + "type": "string", + "value": "alibaba" + }, + { + "type": "date", + "index": 1, + "format": "yyyy-MM-dd" + } + ], + "encoding": "UTF-8", + "fieldDelimiter": "\t", + "compress": "gzip" + } + }, + "writer": {} + } + ] + } +} +``` + +### 3.2 参数说明 + +* **endpoint** + + * 描述:OSS Server的EndPoint地址,例如http://oss.aliyuncs.com。 + + * 必选:是
+ + * 默认值:无
+ +* **accessId** + + * 描述:OSS的accessId
+ + * 必选:是
+ + * 默认值:无
+ +* **accessKey** + + * 描述:OSS的accessKey
+ + * 必选:是
+ + * 默认值:无
+ +* **bucket** + + * 描述:OSS的bucket
+ + * 必选:是
+ + * 默认值:无
+ +* **object** + + * 描述:OSS的object信息,注意这里可以支持填写多个Object。
+ + 当指定单个OSS Object,OSSReader暂时只能使用单线程进行数据抽取。二期考虑在非压缩文件情况下针对单个Object可以进行多线程并发读取。 + + 当指定多个OSS Object,OSSReader支持使用多线程进行数据抽取。线程并发数通过通道数指定。 + + 当指定通配符,OSSReader尝试遍历出多个Object信息。例如: 指定/*代表读取bucket下游所有的Object,指定/bazhen/\*代表读取bazhen目录下游所有的Object。 + + **特别需要注意的是,DataX会将一个作业下同步的所有Object视作同一张数据表。用户必须自己保证所有的Object能够适配同一套schema信息。** + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:读取字段列表,type指定源数据的类型,index指定当前列来自于文本第几列(以0开始),value指定当前类型为常量,不从源头文件读取数据,而是根据value值自动生成对应的列。
+ + 默认情况下,用户可以全部按照String类型读取数据,配置如下: + + ```json + "column": ["*"] + ``` + + 用户可以指定Column字段信息,配置如下: + + ```json + { + "type": "long", + "index": 0 //从OSS文本第一列获取int字段 + }, + { + "type": "string", + "value": "alibaba" //从OSSReader内部生成alibaba的字符串字段作为当前字段 + } + ``` + + 对于用户指定Column信息,type必须填写,index/value必须选择其一。 + + * 必选:是
+ + * 默认值:全部按照string类型读取
+ +* **fieldDelimiter** + + * 描述:读取的字段分隔符
+ + * 必选:是
+ + * 默认值:,
+ +* **compress** + + * 描述:文本压缩类型,默认不填写意味着没有压缩。支持压缩类型为 gzip、bzip2。
+ + * 必选:否
+ + * 默认值:不压缩
+ +* **encoding** + + * 描述:读取文件的编码配置,目前只支持utf-8/gbk。
+ + * 必选:否
+ + * 默认值:utf-8
+ +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如如果用户配置: nullFormat="\N",那么如果源头数据是"\N",DataX视作null字段。 + + * 必选:否
+ + * 默认值:\N
+ +* **skipHeader** + + * 描述:类CSV格式文件可能存在表头为标题情况,需要跳过。默认不跳过。
+ + * 必选:否
+ + * 默认值:false
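+
+下面的 reader 参数片段把 3.1 样例中没有出现的 nullFormat、skipHeader 也一并配置出来,仅作示意:bucket、object 路径均为假设值。
+
+```json
+"reader": {
+    "name": "ossreader",
+    "parameter": {
+        "endpoint": "http://oss.aliyuncs.com",
+        "accessId": "",
+        "accessKey": "",
+        "bucket": "myBucket",
+        "object": ["bazhen/*"],
+        "column": ["*"],
+        "fieldDelimiter": ",",
+        "encoding": "UTF-8",
+        "nullFormat": "\\N",
+        "skipHeader": true
+    }
+}
+```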
+ + +### 3.3 类型转换 + + +OSS本身不提供数据类型,该类型是DataX OSSReader定义: + +| DataX 内部类型| OSS 数据类型 | +| -------- | ----- | +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* OSS Long是指OSS文本中使用整形的字符串表示形式,例如"19901219"。 +* OSS Double是指OSS文本中使用Double的字符串表示形式,例如"3.1415"。 +* OSS Boolean是指OSS文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* OSS Date是指OSS文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + +## 4 性能报告 + +|并发数|DataX 流量|Datax 记录数| +|--------|--------| --------| +|1| 971.40KB/s |10047rec/s | +|2| 1.81MB/s | 19181rec/s | +|4| 3.46MB/s| 36695rec/s | +|8| 6.57MB/s | 69289 records/s | +|16|7.92MB/s| 83920 records/s| +|32|7.87MB/s| 83350 records/s| + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + diff --git a/ossreader/ossreader.iml b/ossreader/ossreader.iml new file mode 100644 index 000000000..ae8ef56f3 --- /dev/null +++ b/ossreader/ossreader.iml @@ -0,0 +1,35 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/ossreader/pom.xml b/ossreader/pom.xml new file mode 100755 index 000000000..e2318b4fe --- /dev/null +++ b/ossreader/pom.xml @@ -0,0 +1,82 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + ossreader + ossreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + com.aliyun.oss + aliyun-sdk-oss + 2.0.2 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/ossreader/src/main/assembly/package.xml b/ossreader/src/main/assembly/package.xml new file mode 100755 index 000000000..e6f7257dc --- /dev/null +++ b/ossreader/src/main/assembly/package.xml @@ -0,0 +1,34 @@ + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/ossreader + + + target/ + + ossreader-0.0.1-SNAPSHOT.jar + + plugin/reader/ossreader + + + + + + false + plugin/reader/ossreader/libs + runtime + + + diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Constants.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Constants.java new file mode 100755 index 000000000..ced380867 --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Constants.java @@ -0,0 +1,10 @@ +package com.alibaba.datax.plugin.reader.ossreader; + +/** + * Created by mengxin.liumx on 2014/12/7. + */ +public class Constants { + + public static final String OBJECT = "object"; + public static final int SOCKETTIMEOUT = 5000000; +} diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Key.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Key.java new file mode 100755 index 000000000..72c0e1bd5 --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Key.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.plugin.reader.ossreader; + +/** + * Created by mengxin.liumx on 2014/12/7. 
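+ *
+ * OSSReader 专属配置项的 key 常量(endpoint/accessId/accessKey/bucket/object/encoding),
+ * 与 ossreader.md 3.2 参数说明中的同名配置一一对应;column、fieldDelimiter 等
+ * 通用配置项的 key 则复用 unstructuredstorage 公共模块中的定义。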
+ */ +public class Key { + public static final String ENDPOINT = "endpoint"; + + public static final String ACCESSID = "accessId"; + + public static final String ACCESSKEY = "accessKey"; + + public static final String ENCODING = "encoding"; + + public static final String BUCKET = "bucket"; + + public static final String OBJECT = "object"; + +} diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReader.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReader.java new file mode 100755 index 000000000..336d77e4f --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReader.java @@ -0,0 +1,317 @@ +package com.alibaba.datax.plugin.reader.ossreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.ossreader.util.OssUtil; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import com.aliyun.oss.ClientException; +import com.aliyun.oss.OSSClient; +import com.aliyun.oss.OSSException; +import com.aliyun.oss.model.ListObjectsRequest; +import com.aliyun.oss.model.OSSObject; +import com.aliyun.oss.model.OSSObjectSummary; +import com.aliyun.oss.model.ObjectListing; +import com.google.common.collect.Sets; + +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.InputStream; +import java.nio.charset.UnsupportedCharsetException; +import java.util.ArrayList; +import java.util.List; +import java.util.Set; +import java.util.regex.Pattern; + +/** + * Created by mengxin.liumx on 2014/12/7. 
+ */ +public class OssReader extends Reader { + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory + .getLogger(OssReader.Job.class); + + private Configuration readerOriginConfig = null; + + @Override + public void init() { + LOG.debug("init() begin..."); + this.readerOriginConfig = this.getPluginJobConf(); + this.validate(); + LOG.debug("init() ok and end..."); + } + + private void validate() { + String endpoint = this.readerOriginConfig.getString(Key.ENDPOINT); + if (StringUtils.isBlank(endpoint)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 endpoint"); + } + + String accessId = this.readerOriginConfig.getString(Key.ACCESSID); + if (StringUtils.isBlank(accessId)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 accessId"); + } + + String accessKey = this.readerOriginConfig.getString(Key.ACCESSKEY); + if (StringUtils.isBlank(accessKey)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 accessKey"); + } + + String bucket = this.readerOriginConfig.getString(Key.BUCKET); + if (StringUtils.isBlank(bucket)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 endpoint"); + } + + String object = this.readerOriginConfig.getString(Key.OBJECT); + if (StringUtils.isBlank(object)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 object"); + } + + String fieldDelimiter = this.readerOriginConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.FIELD_DELIMITER); + // warn: need length 1 + if (null == fieldDelimiter || fieldDelimiter.length() == 0) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 fieldDelimiter"); + } + + String encoding = this.readerOriginConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + com.alibaba.datax.plugin.unstructuredstorage.reader.Constant.DEFAULT_ENCODING); + try { + Charsets.toCharset(encoding); + } catch (UnsupportedCharsetException uce) { + throw DataXException.asDataXException( + OssReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持的编码格式 : [%s]", encoding), uce); + } catch (Exception e) { + throw DataXException.asDataXException( + OssReaderErrorCode.ILLEGAL_VALUE, + String.format("运行配置异常 : %s", e.getMessage()), e); + } + + // 检测是column 是否为 ["*"] 若是则填为空 + List column = this.readerOriginConfig + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + if (null != column + && 1 == column.size() + && ("\"*\"".equals(column.get(0).toString()) || "'*'" + .equals(column.get(0).toString()))) { + readerOriginConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN, + new ArrayList()); + } else { + // column: 1. 
index type 2.value type 3.when type is Data, may + // have + // format + List columns = this.readerOriginConfig + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + + if (null == columns || columns.size() == 0) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 columns"); + } + + if (null != columns && columns.size() != 0) { + for (Configuration eachColumnConf : columns) { + eachColumnConf + .getNecessaryValue( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.TYPE, + OssReaderErrorCode.REQUIRED_VALUE); + Integer columnIndex = eachColumnConf + .getInt(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.INDEX); + String columnValue = eachColumnConf + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.VALUE); + + if (null == columnIndex && null == columnValue) { + throw DataXException.asDataXException( + OssReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnValue) { + throw DataXException.asDataXException( + OssReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + + } + } + } + + // only support compress: gzip,bzip2 + String compress = this.readerOriginConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS); + if (StringUtils.isBlank(compress)) { + this.readerOriginConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, + null); + } else { + Set supportedCompress = Sets + .newHashSet("gzip", "bzip2"); + compress = compress.toLowerCase().trim(); + if (!supportedCompress.contains(compress)) { + throw DataXException + .asDataXException( + OssReaderErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 gzip, bzip2 文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", + compress)); + } + this.readerOriginConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, + compress); + } + } + + @Override + public void prepare() { + LOG.debug("prepare()"); + } + + @Override + public void post() { + LOG.debug("post()"); + } + + @Override + public void destroy() { + LOG.debug("destroy()"); + } + + @Override + public List split(int adviceNumber) { + LOG.debug("split() begin..."); + List readerSplitConfigs = new ArrayList(); + + // 将每个单独的 object 作为一个 slice + List objects = parseOriginObjects(readerOriginConfig + .getList(Constants.OBJECT, String.class)); + if (0 == objects.size()) { + throw DataXException.asDataXException( + OssReaderErrorCode.EMPTY_BUCKET_EXCEPTION, + String.format( + "未能找到待读取的Object,请确认您的配置项bucket: %s object: %s", + this.readerOriginConfig.get(Key.BUCKET), + this.readerOriginConfig.get(Key.OBJECT))); + } + + for (String object : objects) { + Configuration splitedConfig = this.readerOriginConfig.clone(); + splitedConfig.set(Constants.OBJECT, object); + readerSplitConfigs.add(splitedConfig); + } + LOG.debug("split() ok and end..."); + return readerSplitConfigs; + } + + private List parseOriginObjects(List originObjects) { + List parsedObjects = new ArrayList(); + + for (String object : originObjects) { + int firstMetaChar = (object.indexOf('*') > object.indexOf('?')) ? 
object + .indexOf('*') : object.indexOf('?'); + + if (firstMetaChar != -1) { + int lastDirSeparator = object.lastIndexOf( + IOUtils.DIR_SEPARATOR, firstMetaChar); + String parentDir = object + .substring(0, lastDirSeparator + 1); + List remoteObjects = getRemoteObjects(parentDir); + Pattern pattern = Pattern.compile(object.replace("*", ".*") + .replace("?", ".?")); + + for (String remoteObject : remoteObjects) { + if (pattern.matcher(remoteObject).matches()) { + parsedObjects.add(remoteObject); + } + } + } else { + parsedObjects.add(object); + } + } + return parsedObjects; + } + + private List getRemoteObjects(String parentDir) + throws OSSException, ClientException { + + LOG.debug(String.format("父文件夹 : %s", parentDir)); + List remoteObjects = new ArrayList(); + OSSClient client = OssUtil.initOssClient(readerOriginConfig); + try { + ListObjectsRequest listObjectsRequest = new ListObjectsRequest( + readerOriginConfig.getString(Key.BUCKET)); + listObjectsRequest.setPrefix(parentDir); + ObjectListing objectList; + do { + objectList = client.listObjects(listObjectsRequest); + for (OSSObjectSummary objectSummary : objectList + .getObjectSummaries()) { + LOG.debug(String.format("找到文件 : %s", + objectSummary.getKey())); + remoteObjects.add(objectSummary.getKey()); + } + listObjectsRequest.setMarker(objectList.getNextMarker()); + LOG.debug(listObjectsRequest.getMarker()); + LOG.debug(String.valueOf(objectList.isTruncated())); + + } while (objectList.isTruncated()); + } catch (IllegalArgumentException e) { + throw DataXException.asDataXException( + OssReaderErrorCode.OSS_EXCEPTION, e.getMessage()); + } + + return remoteObjects; + } + } + + public static class Task extends Reader.Task { + private static Logger LOG = LoggerFactory.getLogger(Reader.Task.class); + + private Configuration readerSliceConfig; + + @Override + public void startRead(RecordSender recordSender) { + LOG.debug("read start"); + String object = readerSliceConfig.getString(Key.OBJECT); + OSSClient client = OssUtil.initOssClient(readerSliceConfig); + + OSSObject ossObject = client.getObject( + readerSliceConfig.getString(Key.BUCKET), object); + InputStream objectStream = ossObject.getObjectContent(); + UnstructuredStorageReaderUtil.readFromStream(objectStream, object, + this.readerSliceConfig, recordSender, + this.getTaskPluginCollector()); + recordSender.flush(); + } + + @Override + public void init() { + this.readerSliceConfig = this.getPluginJobConf(); + } + + @Override + public void destroy() { + + } + } +} diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReaderErrorCode.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReaderErrorCode.java new file mode 100755 index 000000000..aa33c7582 --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReaderErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.plugin.reader.ossreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by mengxin.liumx on 2014/12/7. 
+ */ +public enum OssReaderErrorCode implements ErrorCode { + // TODO: 修改错误码类型 + RUNTIME_EXCEPTION("OssReader-00", "运行时异常"), + OSS_EXCEPTION("OssFileReader-01", "OSS配置异常"), + CONFIG_INVALID_EXCEPTION("OssFileReader-02", "参数配置错误"), + NOT_SUPPORT_TYPE("OssReader-03", "不支持的类型"), + CAST_VALUE_TYPE_ERROR("OssFileReader-04", "无法完成指定类型的转换"), + SECURITY_EXCEPTION("OssReader-05", "缺少权限"), + ILLEGAL_VALUE("OssReader-06", "值错误"), + REQUIRED_VALUE("OssReader-07", "必选项"), + NO_INDEX_VALUE("OssReader-08","没有 Index" ), + MIXED_INDEX_VALUE("OssReader-09","index 和 value 混合" ), + EMPTY_BUCKET_EXCEPTION("OssReader-10", "您尝试读取的Bucket为空"); + + private final String code; + private final String description; + + private OssReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} \ No newline at end of file diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/util/OssUtil.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/util/OssUtil.java new file mode 100755 index 000000000..e56970bb3 --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/util/OssUtil.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.reader.ossreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.ossreader.Constants; +import com.alibaba.datax.plugin.reader.ossreader.Key; +import com.alibaba.datax.plugin.reader.ossreader.OssReaderErrorCode; +import com.aliyun.oss.ClientConfiguration; +import com.aliyun.oss.OSSClient; + +/** + * Created by mengxin.liumx on 2014/12/8. 
+ */ +public class OssUtil { + public static OSSClient initOssClient(Configuration conf){ + String endpoint = conf.getString(Key.ENDPOINT); + String accessId = conf.getString(Key.ACCESSID); + String accessKey = conf.getString(Key.ACCESSKEY); + ClientConfiguration ossConf = new ClientConfiguration(); + ossConf.setSocketTimeout(Constants.SOCKETTIMEOUT); + OSSClient client = null; + try{ + client = new OSSClient(endpoint, accessId, accessKey, ossConf); + + } catch (IllegalArgumentException e){ + throw DataXException.asDataXException( + OssReaderErrorCode.OSS_EXCEPTION,e.getMessage()); + } + + return client; + } +} diff --git a/ossreader/src/main/resources/basic0.json b/ossreader/src/main/resources/basic0.json new file mode 100755 index 000000000..4a9565cfd --- /dev/null +++ b/ossreader/src/main/resources/basic0.json @@ -0,0 +1,55 @@ +{ + "job": { + "setting": { + "speed": { + "byte":10485760 + }, + "errorLimit": { + "record": 0, + "percentage": 0.02 + } + }, + + "content": [ + { + "reader": { + "name": "ossreader", + "parameter": { + "endpoint": "http://oss-cn-hangzhou-zmf.aliyuncs.com", + "accessId": "", + "accessKey": "", + "bucket": "061115", + "object": [ + "osstest/osster2.gz" + ], + "column": [ + { + "type": "double", + "index": 0 + }, + { + "type": "string", + "value": "alibaba" + }, + { + "type": "date", + "index": 1, + "format": "yyyy-MM-dd" + } + ], + "encoding": "UTF-8", + "fieldDelimiter": ",", + "compress": "gzip" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": true, + "encoding": "UTF-8" + } + } + } + ] + } +} diff --git a/ossreader/src/main/resources/plugin.json b/ossreader/src/main/resources/plugin.json new file mode 100755 index 000000000..bf1cf5be0 --- /dev/null +++ b/ossreader/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "ossreader", + "class": "com.alibaba.datax.plugin.reader.ossreader.OssReader", + "description": "", + "developer": "alibaba" +} + diff --git a/ossreader/src/main/resources/plugin_job_template.json b/ossreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..41b5e2195 --- /dev/null +++ b/ossreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "ossreader", + "parameter": { + "endpoint": "", + "accessId": "", + "accessKey": "", + "bucket": "", + "object": [], + "column": [], + "encoding": "", + "fieldDelimiter": "", + "compress": "" + } +} \ No newline at end of file diff --git a/osswriter/doc/osswriter.md b/osswriter/doc/osswriter.md new file mode 100644 index 000000000..cf7180e1d --- /dev/null +++ b/osswriter/doc/osswriter.md @@ -0,0 +1,214 @@ +# DataX OSSWriter 说明 + + +------------ + +## 1 快速介绍 + +OSSWriter提供了向OSS写入类CSV格式的一个或者多个表文件。 + +**写入OSS内容存放的是一张逻辑意义上的二维表,例如CSV格式的文本信息。** + + +* OSS 产品介绍, 参看[[阿里云OSS Portal](http://www.aliyun.com/product/oss)] +* OSS Java SDK, 参看[[阿里云OSS Java SDK](http://oss.aliyuncs.com/aliyun_portal_storage/help/oss/OSS_Java_SDK_Dev_Guide_20141113.pdf)] + + +## 2 功能与限制 + +OSSWriter实现了从DataX协议转为OSS中的TXT文件功能,OSS本身是无结构化数据存储,OSSWriter需要在如下几个方面增加: + +1. 支持且仅支持写入 TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 暂时不支持文本压缩。 + +6. 支持多线程写入,每个线程写入不同子文件。 + +7. 文件支持滚动,当文件大于某个size值或者行数值,文件需要切换。 [暂不支持] + +我们不能做到: + +1. 
单个文件不能支持并发写入。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "job": { + "setting": {}, + "content": [ + { + "reader": { + + }, + "writer": { + "parameter": { + "endpoint": "http://oss.aliyuncs.com", + "accessId": "", + "accessKey": "", + "bucket": "myBucket", + "object": "/cdo/datax", + "encoding": "UTF-8", + "fieldDelimiter": ",", + "writeMode": "truncate|append|nonConflict" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **endpoint** + + * 描述:OSS Server的EndPoint地址,例如http://oss.aliyuncs.com。 + + * 必选:是
+ + * 默认值:无
+ +* **accessId** + + * 描述:OSS的accessId
+ + * 必选:是
+ + * 默认值:无
+ +* **accessKey** + + * 描述:OSS的accessKey
+ + * 必选:是
+ + * 默认值:无
+ +* **bucket** + + * 描述:OSS的bucket
+ + * 必选:是
+ + * 默认值:无
+ +* **object** + + * 描述:OSSWriter写入的文件名,OSS使用文件名模拟目录的实现。
+ + 使用"object": "datax",写入object以datax开头,后缀添加随机字符串。 + 使用"object": "/cdo/datax",写入的object以/cdo/datax开头,后缀随机添加字符串,/作为OSS模拟目录的分隔符。 + + * 必选:是
+ + * 默认值:无
+ +* **writeMode** + + 	* 描述:OSSWriter写入前的数据清理处理方式(完整的parameter配置示意见本节末尾):
+ + * truncate,写入前清理object名称前缀匹配的所有object。例如: "object": "abc",将清理所有abc开头的object。 + * append,写入前不做任何处理,DataX OSSWriter直接使用object名称写入,并使用随机UUID的后缀名来保证文件名不冲突。例如用户指定的object名为datax,实际写入为datax_xxxxxx_xxxx_xxxx + * nonConflict,如果指定路径出现前缀匹配的object,直接报错。例如: "object": "abc",如果存在abc123的object,将直接报错。 + + * 必选:是
+ + * 默认值:无
+ +* **fieldDelimiter** + + 	* 描述:写入时使用的字段分隔符
+ + * 必选:否
+ + * 默认值:,
+ + +* **encoding** + + * 描述:写出文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+ + +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如如果用户配置: nullFormat="\N",那么如果源头数据是"\N",DataX视作null字段。 + + * 必选:否
+ + * 默认值:\N
+ +* **dateFormat** + + 	* 描述:日期类型的数据序列化到object中时的格式,例如 "dateFormat": "yyyy-MM-dd"。
+ + + * 必选:否
+ + * 默认值:无
+ +* **fileFormat** + + 	* 描述:文件写出的格式,包括csv (http://zh.wikipedia.org/wiki/%E9%80%97%E5%8F%B7%E5%88%86%E9%9A%94%E5%80%BC) 和text两种。csv是严格的csv格式,如果待写数据包含列分隔符,则会按照csv的转义语法转义,转义符号为双引号(");text格式是用列分隔符简单分割待写数据,待写数据包含列分隔符时不做转义。
+ + * 必选:否
+ + * 默认值:text
+ +* **header** + + 	* 描述:OSS写出时的表头,示例['id', 'name', 'age']。
+ + * 必选:否
+ + * 默认值:无
+ +* **maxFileSize** + + 	* 描述:OSS写出时单个Object文件的最大大小,单位为MB,默认为10000*10MB=100000MB,类似log4j日志打印时根据日志文件大小轮转。OSS分块上传时,每个分块大小为10MB,每个OSS InitiateMultipartUploadRequest支持的分块最大数量为10000。轮转发生时,object名字规则是:在原有object前缀加UUID随机数的基础上,拼接_1,_2,_3等后缀。
+ + * 必选:否
+ + * 默认值:100000MB
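结合上述各配置项,下面给出一个仅作示意的osswriter配置片段(即job中content下writer节点的内容),其中bucket、object等取值均为假设,实际请按自身环境填写;示例中writeMode使用truncate,fileFormat使用csv,并通过header与maxFileSize控制表头和文件滚动:

```json
{
    "name": "osswriter",
    "parameter": {
        "endpoint": "http://oss.aliyuncs.com",
        "accessId": "",
        "accessKey": "",
        "bucket": "example-bucket",
        "object": "cdo/datax",
        "writeMode": "truncate",
        "fieldDelimiter": ",",
        "encoding": "UTF-8",
        "nullFormat": "\\N",
        "dateFormat": "yyyy-MM-dd",
        "fileFormat": "csv",
        "header": ["id", "name", "age"],
        "maxFileSize": 10000
    }
}
```

按该示意配置,写入前会先清理bucket中以cdo/datax为前缀的object;每个写出的object以header指定的表头作为第一行;单个object超过10000MB后,按照上文maxFileSize所述的_1、_2等后缀规则滚动生成新的object。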
+ +### 3.3 类型转换 + +## 4 性能报告 + +OSS本身不提供数据类型,该类型是DataX OSSWriter定义: + +| DataX 内部类型| OSS 数据类型 | +| -------- | ----- | +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* OSS Long是指OSS文本中使用整形的字符串表示形式,例如"19901219"。 +* OSS Double是指OSS文本中使用Double的字符串表示形式,例如"3.1415"。 +* OSS Boolean是指OSS文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* OSS Date是指OSS文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + diff --git a/osswriter/osswriter.iml b/osswriter/osswriter.iml new file mode 100644 index 000000000..ae8ef56f3 --- /dev/null +++ b/osswriter/osswriter.iml @@ -0,0 +1,35 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/osswriter/pom.xml b/osswriter/pom.xml new file mode 100644 index 000000000..2eae0c48e --- /dev/null +++ b/osswriter/pom.xml @@ -0,0 +1,80 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + osswriter + osswriter + jar + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + com.aliyun.oss + aliyun-sdk-oss + 2.0.2 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/osswriter/src/main/assembly/package.xml b/osswriter/src/main/assembly/package.xml new file mode 100644 index 000000000..aa40643de --- /dev/null +++ b/osswriter/src/main/assembly/package.xml @@ -0,0 +1,34 @@ + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/osswriter + + + target/ + + osswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/osswriter + + + + + + false + plugin/writer/osswriter/libs + runtime + + + diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Constant.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Constant.java new file mode 100644 index 000000000..5bf2eb46e --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Constant.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.plugin.writer.osswriter; + +/** + * Created by haiwei.luo on 15-02-09. + */ +public class Constant { + public static final String OBJECT = "object"; + public static final int SOCKETTIMEOUT = 5000000; +} diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Key.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Key.java new file mode 100644 index 000000000..29ed7ea03 --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Key.java @@ -0,0 +1,17 @@ +package com.alibaba.datax.plugin.writer.osswriter; + +/** + * Created by haiwei.luo on 15-02-09. 
+ */ +public class Key { + public static final String ENDPOINT = "endpoint"; + + public static final String ACCESSID = "accessId"; + + public static final String ACCESSKEY = "accessKey"; + + public static final String BUCKET = "bucket"; + + public static final String OBJECT = "object"; + +} diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriter.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriter.java new file mode 100644 index 000000000..8da95b9c6 --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriter.java @@ -0,0 +1,436 @@ +package com.alibaba.datax.plugin.writer.osswriter; + +import java.io.ByteArrayInputStream; +import java.io.IOException; +import java.io.InputStream; +import java.io.UnsupportedEncodingException; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.UUID; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.MutablePair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.writer.UnstructuredStorageWriterUtil; +import com.alibaba.datax.plugin.writer.osswriter.util.OssUtil; +import com.aliyun.oss.ClientException; +import com.aliyun.oss.OSSClient; +import com.aliyun.oss.OSSException; +import com.aliyun.oss.model.CompleteMultipartUploadRequest; +import com.aliyun.oss.model.CompleteMultipartUploadResult; +import com.aliyun.oss.model.InitiateMultipartUploadRequest; +import com.aliyun.oss.model.InitiateMultipartUploadResult; +import com.aliyun.oss.model.ListObjectsRequest; +import com.aliyun.oss.model.OSSObjectSummary; +import com.aliyun.oss.model.ObjectListing; +import com.aliyun.oss.model.PartETag; +import com.aliyun.oss.model.UploadPartRequest; +import com.aliyun.oss.model.UploadPartResult; + +/** + * Created by haiwei.luo on 15-02-09. + */ +public class OssWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration writerSliceConfig = null; + private OSSClient ossClient = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.validateParameter(); + this.ossClient = OssUtil.initOssClient(this.writerSliceConfig); + } + + private void validateParameter() { + this.writerSliceConfig.getNecessaryValue(Key.ENDPOINT, + OssWriterErrorCode.REQUIRED_VALUE); + this.writerSliceConfig.getNecessaryValue(Key.ACCESSID, + OssWriterErrorCode.REQUIRED_VALUE); + this.writerSliceConfig.getNecessaryValue(Key.ACCESSKEY, + OssWriterErrorCode.REQUIRED_VALUE); + this.writerSliceConfig.getNecessaryValue(Key.BUCKET, + OssWriterErrorCode.REQUIRED_VALUE); + this.writerSliceConfig.getNecessaryValue(Key.OBJECT, + OssWriterErrorCode.REQUIRED_VALUE); + // warn: do not support compress!! 
+ String compress = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.COMPRESS); + if (StringUtils.isNotBlank(compress)) { + String errorMessage = String.format( + "OSS写暂时不支持压缩, 该压缩配置项[%s]不起效用", compress); + LOG.error(errorMessage); + throw DataXException.asDataXException( + OssWriterErrorCode.ILLEGAL_VALUE, errorMessage); + + } + UnstructuredStorageWriterUtil + .validateParameter(this.writerSliceConfig); + + } + + @Override + public void prepare() { + LOG.info("begin do prepare..."); + String bucket = this.writerSliceConfig.getString(Key.BUCKET); + String object = this.writerSliceConfig.getString(Key.OBJECT); + String writeMode = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.WRITE_MODE); + // warn: bucket is not exists, create it + try { + // warn: do not create bucket for user + if (!this.ossClient.doesBucketExist(bucket)) { + // this.ossClient.createBucket(bucket); + String errorMessage = String.format( + "您配置的bucket [%s] 不存在, 请您确认您的配置项.", bucket); + LOG.error(errorMessage); + throw DataXException.asDataXException( + OssWriterErrorCode.ILLEGAL_VALUE, errorMessage); + } + LOG.info(String.format("access control details [%s].", + this.ossClient.getBucketAcl(bucket).toString())); + + // truncate option handler + if ("truncate".equals(writeMode)) { + LOG.info(String + .format("由于您配置了writeMode truncate, 开始清理 [%s] 下面以 [%s] 开头的Object", + bucket, object)); + // warn: 默认情况下,如果Bucket中的Object数量大于100,则只会返回100个Object + while (true) { + ObjectListing listing = null; + LOG.info("list objects with listObject(bucket, object)"); + listing = this.ossClient.listObjects(bucket, object); + List objectSummarys = listing + .getObjectSummaries(); + for (OSSObjectSummary objectSummary : objectSummarys) { + LOG.info(String.format("delete oss object [%s].", + objectSummary.getKey())); + this.ossClient.deleteObject(bucket, + objectSummary.getKey()); + } + if (objectSummarys.isEmpty()) { + break; + } + } + } else if ("append".equals(writeMode)) { + LOG.info(String + .format("由于您配置了writeMode append, 写入前不做清理工作, 数据写入Bucket [%s] 下, 写入相应Object的前缀为 [%s]", + bucket, object)); + } else if ("nonConflict".equals(writeMode)) { + LOG.info(String + .format("由于您配置了writeMode nonConflict, 开始检查Bucket [%s] 下面以 [%s] 命名开头的Object", + bucket, object)); + ObjectListing listing = this.ossClient.listObjects(bucket, + object); + if (0 < listing.getObjectSummaries().size()) { + StringBuilder objectKeys = new StringBuilder(); + objectKeys.append("[ "); + for (OSSObjectSummary ossObjectSummary : listing + .getObjectSummaries()) { + objectKeys.append(ossObjectSummary.getKey() + " ,"); + } + objectKeys.append(" ]"); + LOG.info(String.format( + "object with prefix [%s] details: %s", object, + objectKeys.toString())); + throw DataXException + .asDataXException( + OssWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的Bucket: [%s] 下面存在其Object有前缀 [%s].", + bucket, object)); + } + } + } catch (OSSException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.OSS_COMM_ERROR, e.getMessage()); + } catch (ClientException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.OSS_COMM_ERROR, e.getMessage()); + } + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + @Override + public List split(int mandatoryNumber) { + LOG.info("begin do split..."); + List writerSplitConfigs = new ArrayList(); + String object = this.writerSliceConfig.getString(Key.OBJECT); + String bucket = 
this.writerSliceConfig.getString(Key.BUCKET); + + Set allObjects = new HashSet(); + try { + List ossObjectlisting = this.ossClient + .listObjects(bucket).getObjectSummaries(); + for (OSSObjectSummary objectSummary : ossObjectlisting) { + allObjects.add(objectSummary.getKey()); + } + } catch (OSSException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.OSS_COMM_ERROR, e.getMessage()); + } catch (ClientException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.OSS_COMM_ERROR, e.getMessage()); + } + + String objectSuffix; + for (int i = 0; i < mandatoryNumber; i++) { + // handle same object name + Configuration splitedTaskConfig = this.writerSliceConfig + .clone(); + + String fullObjectName = null; + objectSuffix = UUID.randomUUID().toString().replace('-', '_'); + fullObjectName = String.format("%s__%s", object, objectSuffix); + while (allObjects.contains(fullObjectName)) { + objectSuffix = UUID.randomUUID().toString() + .replace('-', '_'); + fullObjectName = String.format("%s__%s", object, + objectSuffix); + } + allObjects.add(fullObjectName); + + splitedTaskConfig.set(Key.OBJECT, fullObjectName); + + LOG.info(String.format("splited write object name:[%s]", + fullObjectName)); + + writerSplitConfigs.add(splitedTaskConfig); + } + LOG.info("end do split."); + return writerSplitConfigs; + } + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + + private OSSClient ossClient; + private Configuration writerSliceConfig; + private String bucket; + private String object; + private String nullFormat; + private String encoding; + private char fieldDelimiter; + private String dateFormat; + private String fileFormat; + private List header; + private Long maxFileSize;//MB + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.ossClient = OssUtil.initOssClient(this.writerSliceConfig); + this.bucket = this.writerSliceConfig.getString(Key.BUCKET); + this.object = this.writerSliceConfig.getString(Key.OBJECT); + this.nullFormat = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.NULL_FORMAT); + this.dateFormat = this.writerSliceConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.DATE_FORMAT, + null); + this.encoding = this.writerSliceConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.ENCODING, + com.alibaba.datax.plugin.unstructuredstorage.writer.Constant.DEFAULT_ENCODING); + this.fieldDelimiter = this.writerSliceConfig + .getChar( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FIELD_DELIMITER, + com.alibaba.datax.plugin.unstructuredstorage.writer.Constant.DEFAULT_FIELD_DELIMITER); + this.fileFormat = this.writerSliceConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_FORMAT, + com.alibaba.datax.plugin.unstructuredstorage.writer.Constant.FILE_FORMAT_TEXT); + this.header = this.writerSliceConfig + .getList( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.HEADER, + null, String.class); + this.maxFileSize = this.writerSliceConfig + .getLong(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.MAX_FILE_SIZE, + com.alibaba.datax.plugin.unstructuredstorage.writer.Constant.MAX_FILE_SIZE); + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + // 设置每块字符串长度 + final long partSize = 1024 * 1024 * 10L; + long numberCacul = (this.maxFileSize * 1024 * 1024L) / partSize; + final long maxPartNumber = 
numberCacul >= 1 ? numberCacul : 1; + int objectSufix = 0; + StringBuilder sb = new StringBuilder(); + Record record = null; + + LOG.info(String.format( + "begin do write, each object maxFileSize: [%s]MB...", + maxPartNumber * 10)); + String currentObject = this.object; + InitiateMultipartUploadRequest currentInitiateMultipartUploadRequest = null; + InitiateMultipartUploadResult currentInitiateMultipartUploadResult = null; + List currentPartETags = null; + // to do: 可以根据currentPartNumber做分块级别的重试,InitiateMultipartUploadRequest多次一个currentPartNumber会覆盖原有 + int currentPartNumber = 1; + try { + // warn + boolean needInitMultipartTransform = true; + while ((record = lineReceiver.getFromReader()) != null) { + // init:begin new multipart upload + if (needInitMultipartTransform) { + if (objectSufix == 0) { + currentObject = this.object; + } else { + currentObject = String.format("%s_%s", this.object, + objectSufix); + } + objectSufix++; + currentInitiateMultipartUploadRequest = new InitiateMultipartUploadRequest( + this.bucket, currentObject); + currentInitiateMultipartUploadResult = this.ossClient + .initiateMultipartUpload(currentInitiateMultipartUploadRequest); + currentPartETags = new ArrayList(); + LOG.info(String + .format("write to bucket: [%s] object: [%s] with oss uploadId: [%s]", + this.bucket, currentObject, + currentInitiateMultipartUploadResult + .getUploadId())); + + // each object's header + if (null != this.header && !this.header.isEmpty()) { + sb.append(UnstructuredStorageWriterUtil + .doTransportOneRecord(this.header, + this.fieldDelimiter, + this.fileFormat)); + } + // warn + needInitMultipartTransform = false; + currentPartNumber = 1; + } + + // write: upload data to current object + MutablePair transportResult = UnstructuredStorageWriterUtil + .transportOneRecord(record, nullFormat, dateFormat, + fieldDelimiter, this.fileFormat, + this.getTaskPluginCollector()); + if (!transportResult.getRight()) { + sb.append(transportResult.getLeft()); + } + + if (sb.length() >= partSize) { + this.uploadOnePart(sb, currentPartNumber, + currentInitiateMultipartUploadResult, + currentPartETags, currentObject); + currentPartNumber++; + sb.setLength(0); + } + + // save: end current multipart upload + if (currentPartNumber > maxPartNumber) { + LOG.info(String + .format("current object [%s] size > %s, complete current multipart upload and begin new one", + currentObject, currentPartNumber * partSize)); + CompleteMultipartUploadRequest currentCompleteMultipartUploadRequest = new CompleteMultipartUploadRequest( + this.bucket, currentObject, + currentInitiateMultipartUploadResult + .getUploadId(), currentPartETags); + CompleteMultipartUploadResult currentCompleteMultipartUploadResult = this.ossClient + .completeMultipartUpload(currentCompleteMultipartUploadRequest); + LOG.info(String.format( + "final object [%s] etag is:[%s]", + currentObject, + currentCompleteMultipartUploadResult.getETag())); + // warn + needInitMultipartTransform = true; + } + } + + // warn: may be some data stall in sb + if (0 < sb.length()) { + this.uploadOnePart(sb, currentPartNumber, + currentInitiateMultipartUploadResult, + currentPartETags, currentObject); + } + CompleteMultipartUploadRequest completeMultipartUploadRequest = new CompleteMultipartUploadRequest( + this.bucket, currentObject, + currentInitiateMultipartUploadResult.getUploadId(), + currentPartETags); + CompleteMultipartUploadResult completeMultipartUploadResult = this.ossClient + .completeMultipartUpload(completeMultipartUploadRequest); + 
LOG.info(String.format("final object etag is:[%s]", + completeMultipartUploadResult.getETag())); + + } catch (UnsupportedEncodingException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的编码格式:[%s]", encoding)); + } catch (OSSException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.Write_OBJECT_ERROR, e.getMessage()); + } catch (ClientException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.Write_OBJECT_ERROR, e.getMessage()); + } catch (IOException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.Write_OBJECT_ERROR, e.getMessage()); + } + LOG.info("end do write"); + } + + private void uploadOnePart(StringBuilder sb, int partNumber, + InitiateMultipartUploadResult initiateMultipartUploadResult, + List partETags, String currentObject) + throws IOException, UnsupportedEncodingException { + byte[] byteArray = sb.toString().getBytes(this.encoding); + InputStream inputStream = new ByteArrayInputStream(byteArray); + // 创建UploadPartRequest,上传分块 + UploadPartRequest uploadPartRequest = new UploadPartRequest(); + uploadPartRequest.setBucketName(this.bucket); + uploadPartRequest.setKey(currentObject); + uploadPartRequest.setUploadId(initiateMultipartUploadResult + .getUploadId()); + uploadPartRequest.setInputStream(inputStream); + uploadPartRequest.setPartSize(byteArray.length); + uploadPartRequest.setPartNumber(partNumber); + UploadPartResult uploadPartResult = this.ossClient + .uploadPart(uploadPartRequest); + partETags.add(uploadPartResult.getPartETag()); + LOG.info(String.format( + "upload part [%s] size [%s] Byte has been completed.", + partNumber, byteArray.length)); + inputStream.close(); + } + + @Override + public void prepare() { + + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + } +} diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriterErrorCode.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriterErrorCode.java new file mode 100644 index 000000000..c25846062 --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriterErrorCode.java @@ -0,0 +1,41 @@ +package com.alibaba.datax.plugin.writer.osswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-17. 
+ */ +public enum OssWriterErrorCode implements ErrorCode { + + CONFIG_INVALID_EXCEPTION("OssWriter-00", "您的参数配置错误."), + REQUIRED_VALUE("OssWriter-01", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("OssWriter-02", "您填写的参数值不合法."), + Write_OBJECT_ERROR("OssWriter-03", "您配置的目标Object在写入时异常."), + OSS_COMM_ERROR("OssWriter-05", "执行相应的OSS操作异常."), + ; + + private final String code; + private final String description; + + private OssWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } + +} diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/util/OssUtil.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/util/OssUtil.java new file mode 100644 index 000000000..e6f5c8a11 --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/util/OssUtil.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.plugin.writer.osswriter.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.osswriter.Constant; +import com.alibaba.datax.plugin.writer.osswriter.Key; +import com.alibaba.datax.plugin.writer.osswriter.OssWriterErrorCode; +import com.aliyun.oss.ClientConfiguration; +import com.aliyun.oss.OSSClient; + +public class OssUtil { + public static OSSClient initOssClient(Configuration conf) { + String endpoint = conf.getString(Key.ENDPOINT); + String accessId = conf.getString(Key.ACCESSID); + String accessKey = conf.getString(Key.ACCESSKEY); + ClientConfiguration ossConf = new ClientConfiguration(); + ossConf.setSocketTimeout(Constant.SOCKETTIMEOUT); + OSSClient client = null; + try { + client = new OSSClient(endpoint, accessId, accessKey, ossConf); + + } catch (IllegalArgumentException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.ILLEGAL_VALUE, e.getMessage()); + } + + return client; + } +} diff --git a/osswriter/src/main/resources/plugin.json b/osswriter/src/main/resources/plugin.json new file mode 100644 index 000000000..d7d99960b --- /dev/null +++ b/osswriter/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "osswriter", + "class": "com.alibaba.datax.plugin.writer.osswriter.OssWriter", + "description": "", + "developer": "alibaba" +} + diff --git a/osswriter/src/main/resources/plugin_job_template.json b/osswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..0692b1916 --- /dev/null +++ b/osswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "osswriter", + "parameter": { + "endpoint": "", + "accessId": "", + "accessKey": "", + "bucket": "", + "object": "", + "encoding": "", + "fieldDelimiter": "", + "writeMode": "" + } +} \ No newline at end of file diff --git a/otsreader/doc/otsreader.md b/otsreader/doc/otsreader.md new file mode 100644 index 000000000..1297dbd69 --- /dev/null +++ b/otsreader/doc/otsreader.md @@ -0,0 +1,340 @@ + +# OTSReader 插件文档 + + +___ + + +## 1 快速介绍 + +OTSReader插件实现了从OTS读取数据,并可以通过用户指定抽取数据范围可方便的实现数据增量抽取的需求。目前支持三种抽取方式: + +* 全表抽取 +* 范围抽取 +* 指定分片抽取 + +OTS是构建在阿里云飞天分布式系统之上的 NoSQL数据库服务,提供海量结构化数据的存储和实时访问。OTS 以实例和表的形式组织数据,通过数据分片和负载均衡技术,实现规模上的无缝扩展。 + +## 2 实现原理 + +简而言之,OTSReader通过OTS官方Java 
SDK连接到OTS服务端,获取并按照DataX官方协议标准转为DataX字段信息传递给下游Writer端。 + +OTSReader会根据OTS的表范围,按照Datax并发的数目N,将范围等分为N份Task。每个Task都会有一个OTSReader线程来执行。 + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从OTS全表同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + }, + "content": [ + { + "reader": { + "name": "otsreader", + "parameter": { + /* ----------- 必填 --------------*/ + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + + // 导出数据表的表名 + "table":"", + + // 需要导出的列名,支持重复列和常量列,区分大小写 + // 常量列:类型支持STRING,INT,DOUBLE,BOOL和BINARY + // 备注:BINARY需要通过Base64转换为对应的字符串传入插件 + "column":[ + {"name":"col1"}, // 普通列 + {"name":"col2"}, // 普通列 + {"name":"col3"}, // 普通列 + {"type":"STRING", "value" : "bazhen"}, // 常量列(字符串) + {"type":"INT", "value" : ""}, // 常量列(整形) + {"type":"DOUBLE", "value" : ""}, // 常量列(浮点) + {"type":"BOOL", "value" : ""}, // 常量列(布尔) + {"type":"BINARY", "value" : "Base64(bin)"} // 常量列(二进制),使用Base64编码完成 + ], + "range":{ + // 导出数据的起始范围 + // 支持INF_MIN, INF_MAX, STRING, INT + "begin":[ + {"type":"INF_MIN"}, + ], + // 导出数据的结束范围 + // 支持INF_MIN, INF_MAX, STRING, INT + "end":[ + {"type":"INF_MAX"}, + ] + } + } + }, + "writer": {} + } + ] + } +} +``` + +* 配置一个定义抽取范围的OTSReader: + +``` +{ + "job": { + "setting": { + "speed": { + "byte":10485760 + }, + "errorLimit":0.0 + }, + "content": [ + { + "reader": { + "name": "otsreader", + "parameter": { + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + + // 导出数据表的表名 + "table":"", + + // 需要导出的列名,支持重复类和常量列,区分大小写 + // 常量列:类型支持STRING,INT,DOUBLE,BOOL和BINARY + // 备注:BINARY需要通过Base64转换为对应的字符串传入插件 + "column":[ + {"name":"col1"}, // 普通列 + {"name":"col2"}, // 普通列 + {"name":"col3"}, // 普通列 + {"type":"STRING","value" : ""}, // 常量列(字符串) + {"type":"INT","value" : ""}, // 常量列(整形) + {"type":"DOUBLE","value" : ""}, // 常量列(浮点) + {"type":"BOOL","value" : ""}, // 常量列(布尔) + {"type":"BINARY","value" : "Base64(bin)"} // 常量列(二进制) + ], + "range":{ + // 导出数据的起始范围 + // 支持INF_MIN, INF_MAX, STRING, INT + "begin":[ + {"type":"INF_MIN"}, + {"type":"INF_MAX"}, + {"type":"STRING", "value":"hello"}, + {"type":"INT", "value":"2999"}, + ], + // 导出数据的结束范围 + // 支持INF_MIN, INF_MAX, STRING, INT + "end":[ + {"type":"INF_MAX"}, + {"type":"INF_MIN"}, + {"type":"STRING", "value":"hello"}, + {"type":"INT", "value":"2999"}, + ] + } + } + }, + "writer": {} + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **endpoint** + + * 描述:OTS Server的EndPoint地址,例如http://bazhen.cn−hangzhou.ots.aliyuncs.com。 + + * 必选:是
+ + * 默认值:无
+ +* **accessId** + + * 描述:OTS的accessId
+ + * 必选:是
+ + * 默认值:无
+ +* **accessKey** + + * 描述:OTS的accessKey
+ + * 必选:是
+ + * 默认值:无
+ +* **instanceName** + + * 描述:OTS的实例名称,实例是用户使用和管理 OTS 服务的实体,用户在开通 OTS 服务之后,需要通过管理控制台来创建实例,然后在实例内进行表的创建和管理。实例是 OTS 资源管理的基础单元,OTS 对应用程序的访问控制和资源计量都在实例级别完成。
+ + * 必选:是
+ + * 默认值:无
+ + +* **table** + + 	* 描述:所选取的需要抽取的表名称,这里有且只能填写一张表,OTS场景下不存在多表同步的需求。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + 	* 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。由于OTS本身是NoSQL系统,在OTSReader抽取数据过程中,必须指定相应的字段名称。 + + 	支持普通的列读取,例如: {"name":"col1"} + + 	支持部分列读取,如用户不配置该列,则OTSReader不予读取。 + + 	支持常量列读取,例如: {"type":"STRING", "value" : "DataX"}。使用type描述常量类型,目前支持STRING、INT、DOUBLE、BOOL、BINARY(用户使用Base64编码填写)、INF_MIN(OTS的系统限定最小值,使用该值用户不能填写value属性,否则报错)、INF_MAX(OTS的系统限定最大值,使用该值用户不能填写value属性,否则报错)。 + + 	不支持函数或者自定义表达式,由于OTS本身不提供类似SQL的函数或者表达式功能,OTSReader也不能提供函数或表达式列功能。 + + 	* 必选:是
+ + * 默认值:无
+ +* **begin/end** + + * 描述:该配置项必须配对使用,用于支持OTS表范围抽取。begin/end中描述的是OTS **PrimaryKey**的区间分布状态,而且必须保证区间覆盖到所有的PrimaryKey,**需要指定该表下所有的PrimaryKey范围,不能遗漏任意一个PrimaryKey**,对于无限大小的区间,可以使用{"type":"INF_MIN"},{"type":"INF_MAX"}指代。例如对一张主键为 [DeviceID, SellerID]的OTS进行抽取任务,begin/end可以配置为: + + ```json + "range": { + "begin": { + {"type":"INF_MIN"}, //指定deviceID最小值 + {"type":"INT", "value":"0"} //指定deviceID最小值 + }, + "end": { + {"type":"INF_MAX"}, //指定deviceID抽取最大值 + {"type":"INT", "value":"9999"} //指定deviceID抽取最大值 + } + } + ``` + + + 如果要对上述表抽取全表,可以使用如下配置: + + ``` + "range": { + "begin": [ + {"type":"INF_MIN"}, //指定deviceID最小值 + {"type":"INF_MIN"} //指定SellerID最小值 + ], + "end": [ + {"type":"INF_MAX"}, //指定deviceID抽取最大值 + {"type":"INF_MAX"} //指定SellerID抽取最大值 + ] + } + ``` + + * 必选:是
+ + * 默认值:空
+ +* **split** + + * 描述:该配置项属于高级配置项,是用户自己定义切分配置信息,普通情况下不建议用户使用。适用场景通常在OTS数据存储发生热点,使用OTSReader自动切分的策略不能生效情况下,使用用户自定义的切分规则。split指定是的在Begin、End区间内的切分点,且只能是partitionKey的切分点信息,即在split仅配置partitionKey,而不需要指定全部的PrimaryKey。 + + 例如对一张主键为 [DeviceID, SellerID]的OTS进行抽取任务,可以配置为: + + ```json + "range": { + "begin": { + {"type":"INF_MIN"}, //指定deviceID最小值 + {"type":"INF_MIN"} //指定deviceID最小值 + }, + "end": { + {"type":"INF_MAX"}, //指定deviceID抽取最大值 + {"type":"INF_MAX"} //指定deviceID抽取最大值 + }, + // 用户指定的切分点,如果指定了切分点,Job将按照begin、end和split进行Task的切分, + // 切分的列只能是Partition Key(ParimaryKey的第一列) + // 支持INF_MIN, INF_MAX, STRING, INT + "split":[ + {"type":"STRING", "value":"1"}, + {"type":"STRING", "value":"2"}, + {"type":"STRING", "value":"3"}, + {"type":"STRING", "value":"4"}, + {"type":"STRING", "value":"5"} + ] + } + ``` + + * 必选:否
+ + * 默认值:无
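作为对begin/end配置的补充,下面给出一个按时间戳做增量抽取的range示意。这里假设表的PrimaryKey为 [ts, deviceId],其中ts为整型时间戳,begin取上次同步记录下来的最大ts(数值仅为示意):

```json
"range": {
    "begin": [
        {"type":"INT", "value":"1418745600"},  // ts:从上次同步记录的最大时间戳开始(示意值)
        {"type":"INF_MIN"}                     // deviceId:取系统最小值
    ],
    "end": [
        {"type":"INF_MAX"},                    // ts:抽取到系统最大值
        {"type":"INF_MAX"}                     // deviceId:取系统最大值
    ]
}
```

每次同步结束后记录本次抽取到的最大ts,下一次以该值作为begin中第一列的取值,即可按照第5.2节描述的方式实现增量抽取。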
+ + +### 3.3 类型转换 + +目前OTSReader支持所有OTS类型,下面列出OTSReader针对OTS类型转换列表: + + +| DataX 内部类型| OTS 数据类型 | +| -------- | ----- | +| Long |Integer | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Bytes |Binary | + + +* 注意,OTS本身不支持日期型类型。应用层一般使用Long报错时间的Unix TimeStamp。 + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 + +15列String(10 Byte), 2两列Integer(8 Byte),总计168Byte/r。 + +#### 4.1.2 机器参数 + +OTS端:3台前端机,5台后端机 + +DataX运行端: 24核CPU, 98GB内存 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + +### 4.2 测试报告 + +#### 4.2.1 测试报告 + +|并发数|DataX CPU|OTS 流量|DATAX流量 | 前端QPS| 前端延时| +|--------|--------| --------|--------|--------|------| +|2| 36% |6.3M/s |12739 rec/s | 4.7 | 308ms | +|11| 155% | 32M/s |60732 rec/s | 23.9 | 412ms | +|50| 377% | 73M/s |145139 rec/s | 54 | 874ms | +|100| 448% | 82M/s | 156262 rec/s |60 | 1570ms | + + + +## 5 约束限制 + +### 5.1 一致性约束 + +OTS是类BigTable的存储系统,OTS本身能够保证单行写事务性,无法提供跨行级别的事务。对于OTSReader而言也无法提供全表的一致性视图。例如对于OTSReader在0点启动的数据同步任务,在整个表数据同步过程中,OTSReader同样会抽取到后续更新的数据,无法提供准确的0点时刻该表一致性视图。 + +### 5.2 增量数据同步 + +OTS本质上KV存储,目前只能针对PK进行范围查询,暂不支持按照字段范围抽取数据。因此只能对于增量查询,如果PK能够表示范围信息,例如自增ID,或者时间戳。 + +自增ID,OTSReader可以通过记录上次最大的ID信息,通过指定Range范围进行增量抽取。这样使用的前提是OTS中的PrimaryKey必须包含主键自增列(自增主键需要使用OTS应用方生成。) + +时间戳, OTSReader可以通过PK过滤时间戳,通过制定Range范围进行增量抽取。这样使用的前提是OTS中的PrimaryKey必须包含主键时间列(时间主键需要使用OTS应用方生成。) + +## 6 FAQ + diff --git a/otsreader/otsreader.iml b/otsreader/otsreader.iml new file mode 100644 index 000000000..7f8503ed7 --- /dev/null +++ b/otsreader/otsreader.iml @@ -0,0 +1,38 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/otsreader/pom.xml b/otsreader/pom.xml new file mode 100644 index 000000000..25ea75e87 --- /dev/null +++ b/otsreader/pom.xml @@ -0,0 +1,91 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + otsreader + otsreader + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.aliyun.openservices + ots-public + 2.1 + + + com.google.code.gson + gson + 2.2.4 + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + org.apache.maven.plugins + maven-surefire-plugin + 2.5 + + all + -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=. + + **/unittest/*.java + **/functiontest/*.java + + + + + + + diff --git a/otsreader/src/main/assembly/package.xml b/otsreader/src/main/assembly/package.xml new file mode 100644 index 000000000..7ee305d14 --- /dev/null +++ b/otsreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/otsreader + + + target/ + + otsreader-0.0.1-SNAPSHOT.jar + + plugin/reader/otsreader + + + + + + false + plugin/reader/otsreader/libs + runtime + + + diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/Key.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/Key.java new file mode 100644 index 000000000..da6d4a5f7 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/Key.java @@ -0,0 +1,50 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.plugin.reader.otsreader; + +public final class Key { + /* ots account configuration */ + public final static String OTS_ENDPOINT = "endpoint"; + + public final static String OTS_ACCESSID = "accessId"; + + public final static String OTS_ACCESSKEY = "accessKey"; + + public final static String OTS_INSTANCE_NAME = "instanceName"; + + public final static String TABLE_NAME = "table"; + + public final static String COLUMN = "column"; + + //====================================================== + // 注意:如果range-begin大于range-end,那么系统将逆序导出所有数据 + //====================================================== + // Range的组织格式 + // "range":{ + // "begin":[], + // "end":[], + // "split":[] + // } + public final static String RANGE = "range"; + + public final static String RANGE_BEGIN = "begin"; + + public final static String RANGE_END = "end"; + + public final static String RANGE_SPLIT = "split"; + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReader.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReader.java new file mode 100644 index 000000000..8880c07ed --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReader.java @@ -0,0 +1,124 @@ +package com.alibaba.datax.plugin.reader.otsreader; + +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsreader.utils.Common; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSException; + +public class OtsReader extends Reader { + + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + private OtsReaderMasterProxy proxy = new OtsReaderMasterProxy(); + @Override + public void init() { + LOG.info("init() begin ..."); + try { + this.proxy.init(getPluginJobConf()); + } catch (OTSException e) { + LOG.error("OTSException. ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException. ErrorCode:{}, ErrorMsg:{}", + new Object[]{e.getErrorCode(), e.getMessage()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. 
ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.ERROR, Common.getDetailMessage(e), e); + } + LOG.info("init() end ..."); + } + + @Override + public void destroy() { + this.proxy.close(); + } + + @Override + public List split(int adviceNumber) { + LOG.info("split() begin ..."); + + if (adviceNumber <= 0) { + throw DataXException.asDataXException(OtsReaderError.ERROR, "Datax input adviceNumber <= 0."); + } + + List confs = null; + + try { + confs = this.proxy.split(adviceNumber); + } catch (OTSException e) { + LOG.error("OTSException. ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException. ErrorCode:{}, ErrorMsg:{}", + new Object[]{e.getErrorCode(), e.getMessage()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.ERROR, Common.getDetailMessage(e), e); + } + + LOG.info("split() end ..."); + return confs; + } + } + + public static class Task extends Reader.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + private OtsReaderSlaveProxy proxy = new OtsReaderSlaveProxy(); + + @Override + public void init() { + } + + @Override + public void destroy() { + } + + @Override + public void startRead(RecordSender recordSender) { + LOG.info("startRead() begin ..."); + try { + this.proxy.read(recordSender,getPluginJobConf()); + } catch (OTSException e) { + LOG.error("OTSException. ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException. ErrorCode:{}, ErrorMsg:{}", + new Object[]{e.getErrorCode(), e.getMessage()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. 
ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.ERROR, Common.getDetailMessage(e), e); + } + LOG.info("startRead() end ..."); + } + + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderError.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderError.java new file mode 100644 index 000000000..05a13c1a7 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderError.java @@ -0,0 +1,42 @@ +package com.alibaba.datax.plugin.reader.otsreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public class OtsReaderError implements ErrorCode { + + private String code; + + private String description; + + // TODO + // 这一块需要DATAX来统一定义分类, OTS基于这些分类在细化 + // 所以暂定两个基础的Error Code,其他错误统一使用OTS的错误码和错误消息 + + public final static OtsReaderError ERROR = new OtsReaderError( + "OtsReaderError", + "该错误表示插件的内部错误,表示系统没有处理到的异常"); + public final static OtsReaderError INVALID_PARAM = new OtsReaderError( + "OtsReaderInvalidParameter", + "该错误表示参数错误,表示用户输入了错误的参数格式等"); + + public OtsReaderError (String code) { + this.code = code; + this.description = code; + } + + public OtsReaderError (String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderMasterProxy.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderMasterProxy.java new file mode 100644 index 000000000..2b758f068 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderMasterProxy.java @@ -0,0 +1,221 @@ +package com.alibaba.datax.plugin.reader.otsreader; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsreader.callable.GetFirstRowPrimaryKeyCallable; +import com.alibaba.datax.plugin.reader.otsreader.callable.GetTableMetaCallable; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConf; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConst; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import com.alibaba.datax.plugin.reader.otsreader.utils.ParamChecker; +import com.alibaba.datax.plugin.reader.otsreader.utils.Common; +import com.alibaba.datax.plugin.reader.otsreader.utils.GsonParser; +import com.alibaba.datax.plugin.reader.otsreader.utils.ReaderModelParser; +import com.alibaba.datax.plugin.reader.otsreader.utils.RangeSplit; +import com.alibaba.datax.plugin.reader.otsreader.utils.RetryHelper; +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RangeRowQueryCriteria; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +public class OtsReaderMasterProxy { + + private OTSConf conf = new OTSConf(); + + private OTSRange range = null; + + private OTSClient ots = null; + + private TableMeta meta = null; + + private Direction direction = null; + + private static final Logger LOG = LoggerFactory.getLogger(OtsReaderMasterProxy.class); + + /** + * 1.检查参数是否为 + * 
null,endpoint,accessid,accesskey,instance-name,table,column,range-begin,range-end,range-split + * 2.检查参数是否为空字符串 + * endpoint,accessid,accesskey,instance-name,table + * 3.检查是否为空数组 + * column + * 4.检查Range的类型个个数是否和PrimaryKey匹配 + * column,range-begin,range-end + * 5.检查Range Split 顺序和类型是否Range一致,类型是否于PartitionKey一致 + * column-split + * @param param + * @throws Exception + */ + public void init(Configuration param) throws Exception { + // 默认参数 + // 每次重试的时间都是上一次的一倍,当sleep时间大于30秒时,Sleep重试时间不在增长。18次能覆盖OTS的Failover时间5分钟 + conf.setRetry(param.getInt(OTSConst.RETRY, 18)); + conf.setSleepInMilliSecond(param.getInt(OTSConst.SLEEP_IN_MILLI_SECOND, 100)); + + // 必选参数 + conf.setEndpoint(ParamChecker.checkStringAndGet(param, Key.OTS_ENDPOINT)); + conf.setAccessId(ParamChecker.checkStringAndGet(param, Key.OTS_ACCESSID)); + conf.setAccesskey(ParamChecker.checkStringAndGet(param, Key.OTS_ACCESSKEY)); + conf.setInstanceName(ParamChecker.checkStringAndGet(param, Key.OTS_INSTANCE_NAME)); + conf.setTableName(ParamChecker.checkStringAndGet(param, Key.TABLE_NAME)); + + ots = new OTSClient( + this.conf.getEndpoint(), + this.conf.getAccessId(), + this.conf.getAccesskey(), + this.conf.getInstanceName()); + + meta = getTableMeta(ots, conf.getTableName()); + LOG.info("Table Meta : {}", GsonParser.metaToJson(meta)); + + conf.setColumns(ReaderModelParser.parseOTSColumnList(ParamChecker.checkListAndGet(param, Key.COLUMN, true))); + + Map rangeMap = ParamChecker.checkMapAndGet(param, Key.RANGE, true); + conf.setRangeBegin(ReaderModelParser.parsePrimaryKey(ParamChecker.checkListAndGet(rangeMap, Key.RANGE_BEGIN, false))); + conf.setRangeEnd(ReaderModelParser.parsePrimaryKey(ParamChecker.checkListAndGet(rangeMap, Key.RANGE_END, false))); + + range = ParamChecker.checkRangeAndGet(meta, this.conf.getRangeBegin(), this.conf.getRangeEnd()); + + direction = ParamChecker.checkDirectionAndEnd(meta, range.getBegin(), range.getEnd()); + LOG.info("Direction : {}", direction); + + List points = ReaderModelParser.parsePrimaryKey(ParamChecker.checkListAndGet(rangeMap, Key.RANGE_SPLIT)); + ParamChecker.checkInputSplitPoints(meta, range, direction, points); + conf.setRangeSplit(points); + } + + public List split(int num) throws Exception { + LOG.info("Expect split num : " + num); + + List configurations = new ArrayList(); + + List ranges = null; + + if (this.conf.getRangeSplit() != null) { // 用户显示指定了拆分范围 + LOG.info("Begin userDefinedRangeSplit"); + ranges = userDefinedRangeSplit(meta, range, this.conf.getRangeSplit()); + LOG.info("End userDefinedRangeSplit"); + } else { // 采用默认的切分算法 + LOG.info("Begin defaultRangeSplit"); + ranges = defaultRangeSplit(ots, meta, range, num); + LOG.info("End defaultRangeSplit"); + } + + // 解决大量的Split Point序列化消耗内存的问题 + // 因为slave中不会使用这个配置,所以置为空 + this.conf.setRangeSplit(null); + + for (OTSRange item : ranges) { + Configuration configuration = Configuration.newDefault(); + configuration.set(OTSConst.OTS_CONF, GsonParser.confToJson(this.conf)); + configuration.set(OTSConst.OTS_RANGE, GsonParser.rangeToJson(item)); + configuration.set(OTSConst.OTS_DIRECTION, GsonParser.directionToJson(direction)); + configurations.add(configuration); + } + + LOG.info("Configuration list count : " + configurations.size()); + + return configurations; + } + + public OTSConf getConf() { + return conf; + } + + public void close() { + ots.shutdown(); + } + + // private function + + private TableMeta getTableMeta(OTSClient ots, String tableName) throws Exception { + return RetryHelper.executeWithRetry( + new GetTableMetaCallable(ots, 
tableName), + conf.getRetry(), + conf.getSleepInMilliSecond() + ); + } + + private RowPrimaryKey getPKOfFirstRow( + OTSRange range , Direction direction) throws Exception { + + RangeRowQueryCriteria cur = new RangeRowQueryCriteria(this.conf.getTableName()); + cur.setInclusiveStartPrimaryKey(range.getBegin()); + cur.setExclusiveEndPrimaryKey(range.getEnd()); + cur.setLimit(1); + cur.setColumnsToGet(Common.getPrimaryKeyNameList(meta)); + cur.setDirection(direction); + + return RetryHelper.executeWithRetry( + new GetFirstRowPrimaryKeyCallable(ots, meta, cur), + conf.getRetry(), + conf.getSleepInMilliSecond() + ); + } + + private List defaultRangeSplit(OTSClient ots, TableMeta meta, OTSRange range, int num) throws Exception { + if (num == 1) { + List ranges = new ArrayList(); + ranges.add(range); + return ranges; + } + + OTSRange reverseRange = new OTSRange(); + reverseRange.setBegin(range.getEnd()); + reverseRange.setEnd(range.getBegin()); + + Direction reverseDirection = (direction == Direction.FORWARD ? Direction.BACKWARD : Direction.FORWARD); + + RowPrimaryKey realBegin = getPKOfFirstRow(range, direction); + RowPrimaryKey realEnd = getPKOfFirstRow(reverseRange, reverseDirection); + + // 因为如果其中一行为空,表示这个范围内至多有一行数据 + // 所以不再细分,直接使用用户定义的范围 + if (realBegin == null || realEnd == null) { + List ranges = new ArrayList(); + ranges.add(range); + return ranges; + } + + // 如果出现realBegin,realEnd的方向和direction不一致的情况,直接返回range + int cmp = Common.compareRangeBeginAndEnd(meta, realBegin, realEnd); + Direction realDirection = cmp > 0 ? Direction.BACKWARD : Direction.FORWARD; + if (realDirection != direction) { + LOG.warn("Expect '" + direction + "', but direction of realBegin and readlEnd is '" + realDirection + "'"); + List ranges = new ArrayList(); + ranges.add(range); + return ranges; + } + + List ranges = RangeSplit.rangeSplitByCount(meta, realBegin, realEnd, num); + + if (ranges.isEmpty()) { // 当PartitionKey相等时,工具内部不会切分Range + ranges.add(range); + } else { + // replace first and last + OTSRange first = ranges.get(0); + OTSRange last = ranges.get(ranges.size() - 1); + + first.setBegin(range.getBegin()); + last.setEnd(range.getEnd()); + } + + return ranges; + } + + private List userDefinedRangeSplit(TableMeta meta, OTSRange range, List points) { + List ranges = RangeSplit.rangeSplitByPoint(meta, range.getBegin(), range.getEnd(), points); + if (ranges.isEmpty()) { // 当PartitionKey相等时,工具内部不会切分Range + ranges.add(range); + } + return ranges; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderSlaveProxy.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderSlaveProxy.java new file mode 100644 index 000000000..e64b4e7e2 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderSlaveProxy.java @@ -0,0 +1,135 @@ +package com.alibaba.datax.plugin.reader.otsreader; + +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsreader.callable.GetRangeCallable; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConf; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConst; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import com.alibaba.datax.plugin.reader.otsreader.utils.Common; 
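// The read loop further down in this slave proxy overlaps network I/O with record delivery:
// each iteration waits on the GetRange request issued previously, immediately issues the
// request for the next page, and only then pushes the current page's rows to the
// RecordSender. A minimal sketch of the same prefetch pattern, using a plain
// ExecutorService instead of the OTS async client (Page, fetch and deliver are
// placeholder names, not part of this plugin):
//
//   ExecutorService pool = Executors.newSingleThreadExecutor();
//   Future<Page> pending = pool.submit(() -> fetch(startKey));    // issue the first request
//   while (pending != null) {
//       Page page = pending.get();                                // wait for the page in flight
//       pending = (page.nextKey == null)
//               ? null
//               : pool.submit(() -> fetch(page.nextKey));         // prefetch the next page
//       deliver(page.rows);                                       // hand the current page downstream
//   }
//   pool.shutdown();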
+import com.alibaba.datax.plugin.reader.otsreader.utils.GsonParser; +import com.alibaba.datax.plugin.reader.otsreader.utils.DefaultNoRetry; +import com.alibaba.datax.plugin.reader.otsreader.utils.RetryHelper; +import com.aliyun.openservices.ots.OTSClientAsync; +import com.aliyun.openservices.ots.OTSServiceConfiguration; +import com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.GetRangeRequest; +import com.aliyun.openservices.ots.model.GetRangeResult; +import com.aliyun.openservices.ots.model.OTSFuture; +import com.aliyun.openservices.ots.model.RangeRowQueryCriteria; +import com.aliyun.openservices.ots.model.Row; +import com.aliyun.openservices.ots.model.RowPrimaryKey; + +public class OtsReaderSlaveProxy { + + class RequestItem { + private RangeRowQueryCriteria criteria; + private OTSFuture future; + + RequestItem(RangeRowQueryCriteria criteria, OTSFuture future) { + this.criteria = criteria; + this.future = future; + } + + public RangeRowQueryCriteria getCriteria() { + return criteria; + } + + public OTSFuture getFuture() { + return future; + } + } + + private static final Logger LOG = LoggerFactory.getLogger(OtsReaderSlaveProxy.class); + + private void rowsToSender(List rows, RecordSender sender, List columns) { + for (Row row : rows) { + Record line = sender.createRecord(); + line = Common.parseRowToLine(row, columns, line); + + LOG.debug("Reader send record : {}", line.toString()); + + sender.sendToWriter(line); + } + } + + private RangeRowQueryCriteria generateRangeRowQueryCriteria(String tableName, RowPrimaryKey begin, RowPrimaryKey end, Direction direction, List columns) { + RangeRowQueryCriteria criteria = new RangeRowQueryCriteria(tableName); + criteria.setInclusiveStartPrimaryKey(begin); + criteria.setDirection(direction); + criteria.setColumnsToGet(columns); + criteria.setLimit(-1); + criteria.setExclusiveEndPrimaryKey(end); + return criteria; + } + + private RequestItem generateRequestItem( + OTSClientAsync ots, + OTSConf conf, + RowPrimaryKey begin, + RowPrimaryKey end, + Direction direction, + List columns) throws Exception { + RangeRowQueryCriteria criteria = generateRangeRowQueryCriteria(conf.getTableName(), begin, end, direction, columns); + + GetRangeRequest request = new GetRangeRequest(); + request.setRangeRowQueryCriteria(criteria); + OTSFuture future = ots.getRange(request); + + return new RequestItem(criteria, future); + } + + public void read(RecordSender sender, Configuration configuration) throws Exception { + LOG.info("read begin."); + + OTSConf conf = GsonParser.jsonToConf(configuration.getString(OTSConst.OTS_CONF)); + OTSRange range = GsonParser.jsonToRange(configuration.getString(OTSConst.OTS_RANGE)); + Direction direction = GsonParser.jsonToDirection(configuration.getString(OTSConst.OTS_DIRECTION)); + + OTSServiceConfiguration configure = new OTSServiceConfiguration(); + configure.setRetryStrategy(new DefaultNoRetry()); + + OTSClientAsync ots = new OTSClientAsync( + conf.getEndpoint(), + conf.getAccessId(), + conf.getAccesskey(), + conf.getInstanceName(), + null, + configure, + null); + + RowPrimaryKey token = range.getBegin(); + List columns = Common.getNormalColumnNameList(conf.getColumns()); + + RequestItem request = null; + + do { + LOG.debug("Next token : {}", GsonParser.rowPrimaryKeyToJson(token)); + if (request == null) { + request = generateRequestItem(ots, conf, token, range.getEnd(), direction, columns); + } else { + RequestItem req = request; + + GetRangeResult result = RetryHelper.executeWithRetry( + new 
GetRangeCallable(ots, req.getCriteria(), req.getFuture()), + conf.getRetry(), + conf.getSleepInMilliSecond() + ); + if ((token = result.getNextStartPrimaryKey()) != null) { + request = generateRequestItem(ots, conf, token, range.getEnd(), direction, columns); + } + + rowsToSender(result.getRows(), sender, conf.getColumns()); + } + } while (token != null); + ots.shutdown(); + LOG.info("read end."); + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/OTSColumnAdaptor.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/OTSColumnAdaptor.java new file mode 100644 index 000000000..25f9b682c --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/OTSColumnAdaptor.java @@ -0,0 +1,117 @@ +package com.alibaba.datax.plugin.reader.otsreader.adaptor; + +import java.lang.reflect.Type; + +import org.apache.commons.codec.binary.Base64; + +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.aliyun.openservices.ots.model.ColumnType; +import com.google.gson.JsonDeserializationContext; +import com.google.gson.JsonDeserializer; +import com.google.gson.JsonElement; +import com.google.gson.JsonObject; +import com.google.gson.JsonParseException; +import com.google.gson.JsonPrimitive; +import com.google.gson.JsonSerializationContext; +import com.google.gson.JsonSerializer; + +public class OTSColumnAdaptor implements JsonDeserializer, JsonSerializer{ + private final static String NAME = "name"; + private final static String COLUMN_TYPE = "column_type"; + private final static String VALUE_TYPE = "value_type"; + private final static String VALUE = "value"; + + private void serializeConstColumn(JsonObject json, OTSColumn obj) { + switch (obj.getValueType()) { + case STRING : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.STRING.toString())); + json.add(VALUE, new JsonPrimitive(obj.getValue().asString())); + break; + case INTEGER : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.INTEGER.toString())); + json.add(VALUE, new JsonPrimitive(obj.getValue().asLong())); + break; + case DOUBLE : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.DOUBLE.toString())); + json.add(VALUE, new JsonPrimitive(obj.getValue().asDouble())); + break; + case BOOLEAN : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.BOOLEAN.toString())); + json.add(VALUE, new JsonPrimitive(obj.getValue().asBoolean())); + break; + case BINARY : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.BINARY.toString())); + json.add(VALUE, new JsonPrimitive(Base64.encodeBase64String(obj.getValue().asBytes()))); + break; + default: + throw new IllegalArgumentException("Unsupport serialize the type : " + obj.getValueType() + ""); + } + } + + private OTSColumn deserializeConstColumn(JsonObject obj) { + String strType = obj.getAsJsonPrimitive(VALUE_TYPE).getAsString(); + ColumnType type = ColumnType.valueOf(strType); + + JsonPrimitive jsonValue = obj.getAsJsonPrimitive(VALUE); + + switch (type) { + case STRING : + return OTSColumn.fromConstStringColumn(jsonValue.getAsString()); + case INTEGER : + return OTSColumn.fromConstIntegerColumn(jsonValue.getAsLong()); + case DOUBLE : + return OTSColumn.fromConstDoubleColumn(jsonValue.getAsDouble()); + case BOOLEAN : + return OTSColumn.fromConstBoolColumn(jsonValue.getAsBoolean()); + case BINARY : + return OTSColumn.fromConstBytesColumn(Base64.decodeBase64(jsonValue.getAsString())); + default: + throw new IllegalArgumentException("Unsupport deserialize the type : " + type + ""); + } 
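// For reference, the JSON handled by this adaptor has the following shapes (as produced
// and consumed by the serialize/deserialize methods of this class):
//   normal column       : {"column_type":"NORMAL","name":"uid"}
//   const string column : {"column_type":"CONST","value_type":"STRING","value":"hello"}
//   const binary column : {"column_type":"CONST","value_type":"BINARY","value":"<Base64 of the bytes>"}
// A minimal round-trip sketch (assumes a Gson instance with this adaptor registered, as
// GsonParser does elsewhere in this plugin; "hello" is an illustrative value):
//   Gson g = new GsonBuilder().registerTypeAdapter(OTSColumn.class, new OTSColumnAdaptor()).create();
//   String json    = g.toJson(OTSColumn.fromConstStringColumn("hello"), OTSColumn.class);
//   OTSColumn back = g.fromJson(json, OTSColumn.class);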
+ } + + private void serializeNormalColumn(JsonObject json, OTSColumn obj) { + json.add(NAME, new JsonPrimitive(obj.getName())); + } + + private OTSColumn deserializeNormarlColumn(JsonObject obj) { + return OTSColumn.fromNormalColumn(obj.getAsJsonPrimitive(NAME).getAsString()); + } + + @Override + public JsonElement serialize(OTSColumn obj, Type t, + JsonSerializationContext c) { + JsonObject json = new JsonObject(); + + switch (obj.getColumnType()) { + case CONST: + json.add(COLUMN_TYPE, new JsonPrimitive(OTSColumn.OTSColumnType.CONST.toString())); + serializeConstColumn(json, obj); + break; + case NORMAL: + json.add(COLUMN_TYPE, new JsonPrimitive(OTSColumn.OTSColumnType.NORMAL.toString())); + serializeNormalColumn(json, obj); + break; + default: + throw new IllegalArgumentException("Unsupport serialize the type : " + obj.getColumnType() + ""); + } + return json; + } + + @Override + public OTSColumn deserialize(JsonElement ele, Type t, + JsonDeserializationContext c) throws JsonParseException { + JsonObject obj = ele.getAsJsonObject(); + String strColumnType = obj.getAsJsonPrimitive(COLUMN_TYPE).getAsString(); + OTSColumn.OTSColumnType columnType = OTSColumn.OTSColumnType.valueOf(strColumnType); + + switch(columnType) { + case CONST: + return deserializeConstColumn(obj); + case NORMAL: + return deserializeNormarlColumn(obj); + default: + throw new IllegalArgumentException("Unsupport deserialize the type : " + columnType + ""); + } + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/PrimaryKeyValueAdaptor.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/PrimaryKeyValueAdaptor.java new file mode 100644 index 000000000..1a49ea476 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/PrimaryKeyValueAdaptor.java @@ -0,0 +1,91 @@ +package com.alibaba.datax.plugin.reader.otsreader.adaptor; + +import java.lang.reflect.Type; + +import com.aliyun.openservices.ots.model.ColumnType; +import com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.google.gson.JsonDeserializationContext; +import com.google.gson.JsonDeserializer; +import com.google.gson.JsonElement; +import com.google.gson.JsonObject; +import com.google.gson.JsonParseException; +import com.google.gson.JsonPrimitive; +import com.google.gson.JsonSerializationContext; +import com.google.gson.JsonSerializer; + +/** + * {"type":"INF_MIN", "value":""} + * {"type":"INF_MAX", "value":""} + * {"type":"STRING", "value":"hello"} + * {"type":"INTEGER", "value":"1222"} + */ +public class PrimaryKeyValueAdaptor implements JsonDeserializer, JsonSerializer{ + private final static String TYPE = "type"; + private final static String VALUE = "value"; + private final static String INF_MIN = "INF_MIN"; + private final static String INF_MAX = "INF_MAX"; + + @Override + public JsonElement serialize(PrimaryKeyValue obj, Type t, + JsonSerializationContext c) { + JsonObject json = new JsonObject(); + + if (obj == PrimaryKeyValue.INF_MIN) { + json.add(TYPE, new JsonPrimitive(INF_MIN)); + json.add(VALUE, new JsonPrimitive("")); + return json; + } + + if (obj == PrimaryKeyValue.INF_MAX) { + json.add(TYPE, new JsonPrimitive(INF_MAX)); + json.add(VALUE, new JsonPrimitive("")); + return json; + } + + switch (obj.getType()) { + case STRING : + json.add(TYPE, new JsonPrimitive(ColumnType.STRING.toString())); + json.add(VALUE, new JsonPrimitive(obj.asString())); + break; + case INTEGER : + 
json.add(TYPE, new JsonPrimitive(ColumnType.INTEGER.toString())); + json.add(VALUE, new JsonPrimitive(obj.asLong())); + break; + default: + throw new IllegalArgumentException("Unsupport serialize the type : " + obj.getType() + ""); + } + return json; + } + + @Override + public PrimaryKeyValue deserialize(JsonElement ele, Type t, + JsonDeserializationContext c) throws JsonParseException { + + JsonObject obj = ele.getAsJsonObject(); + String strType = obj.getAsJsonPrimitive(TYPE).getAsString(); + JsonPrimitive jsonValue = obj.getAsJsonPrimitive(VALUE); + + if (strType.equals(INF_MIN)) { + return PrimaryKeyValue.INF_MIN; + } + + if (strType.equals(INF_MAX)) { + return PrimaryKeyValue.INF_MAX; + } + + PrimaryKeyValue value = null; + PrimaryKeyType type = PrimaryKeyType.valueOf(strType); + switch(type) { + case STRING : + value = PrimaryKeyValue.fromString(jsonValue.getAsString()); + break; + case INTEGER : + value = PrimaryKeyValue.fromLong(jsonValue.getAsLong()); + break; + default: + throw new IllegalArgumentException("Unsupport deserialize the type : " + type + ""); + } + return value; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetFirstRowPrimaryKeyCallable.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetFirstRowPrimaryKeyCallable.java new file mode 100644 index 000000000..f004c0ff6 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetFirstRowPrimaryKeyCallable.java @@ -0,0 +1,55 @@ +package com.alibaba.datax.plugin.reader.otsreader.callable; + +import java.util.List; +import java.util.Map; +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.ColumnType; +import com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.GetRangeRequest; +import com.aliyun.openservices.ots.model.GetRangeResult; +import com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RangeRowQueryCriteria; +import com.aliyun.openservices.ots.model.Row; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +public class GetFirstRowPrimaryKeyCallable implements Callable{ + + private OTSClient ots = null; + private TableMeta meta = null; + private RangeRowQueryCriteria criteria = null; + + public GetFirstRowPrimaryKeyCallable(OTSClient ots, TableMeta meta, RangeRowQueryCriteria criteria) { + this.ots = ots; + this.meta = meta; + this.criteria = criteria; + } + + @Override + public RowPrimaryKey call() throws Exception { + RowPrimaryKey ret = new RowPrimaryKey(); + GetRangeRequest request = new GetRangeRequest(); + request.setRangeRowQueryCriteria(criteria); + GetRangeResult result = ots.getRange(request); + List rows = result.getRows(); + if(rows.isEmpty()) { + return null;// no data + } + Row row = rows.get(0); + + Map pk = meta.getPrimaryKey(); + for (String key:pk.keySet()) { + ColumnValue v = row.getColumns().get(key); + if (v.getType() == ColumnType.INTEGER) { + ret.addPrimaryKeyColumn(key, PrimaryKeyValue.fromLong(v.asLong())); + } else { + ret.addPrimaryKeyColumn(key, PrimaryKeyValue.fromString(v.asString())); + } + } + return ret; + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetRangeCallable.java 
b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetRangeCallable.java new file mode 100644 index 000000000..2cd1398a6 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetRangeCallable.java @@ -0,0 +1,35 @@ +package com.alibaba.datax.plugin.reader.otsreader.callable; + +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTSClientAsync; +import com.aliyun.openservices.ots.model.GetRangeRequest; +import com.aliyun.openservices.ots.model.GetRangeResult; +import com.aliyun.openservices.ots.model.OTSFuture; +import com.aliyun.openservices.ots.model.RangeRowQueryCriteria; + +public class GetRangeCallable implements Callable { + + private OTSClientAsync ots; + private RangeRowQueryCriteria criteria; + private OTSFuture future; + + public GetRangeCallable(OTSClientAsync ots, RangeRowQueryCriteria criteria, OTSFuture future) { + this.ots = ots; + this.criteria = criteria; + this.future = future; + } + + @Override + public GetRangeResult call() throws Exception { + try { + return future.get(); + } catch (Exception e) { + GetRangeRequest request = new GetRangeRequest(); + request.setRangeRowQueryCriteria(criteria); + future = ots.getRange(request); + throw e; + } + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetTableMetaCallable.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetTableMetaCallable.java new file mode 100644 index 000000000..2884e12b1 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetTableMetaCallable.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.plugin.reader.otsreader.callable; + +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.DescribeTableRequest; +import com.aliyun.openservices.ots.model.DescribeTableResult; +import com.aliyun.openservices.ots.model.TableMeta; + +public class GetTableMetaCallable implements Callable{ + + private OTSClient ots = null; + private String tableName = null; + + public GetTableMetaCallable(OTSClient ots, String tableName) { + this.ots = ots; + this.tableName = tableName; + } + + @Override + public TableMeta call() throws Exception { + DescribeTableRequest describeTableRequest = new DescribeTableRequest(); + describeTableRequest.setTableName(tableName); + DescribeTableResult result = ots.describeTable(describeTableRequest); + TableMeta tableMeta = result.getTableMeta(); + return tableMeta; + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSColumn.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSColumn.java new file mode 100644 index 000000000..129ccd2fd --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSColumn.java @@ -0,0 +1,76 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +import com.alibaba.datax.common.element.BoolColumn; +import com.alibaba.datax.common.element.BytesColumn; +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.StringColumn; +import com.aliyun.openservices.ots.model.ColumnType; + +public class OTSColumn { + private String name; + private Column value; + private OTSColumnType columnType; + private ColumnType valueType; + + public static enum 
OTSColumnType { + NORMAL, // 普通列 + CONST // 常量列 + } + + private OTSColumn(String name) { + this.name = name; + this.columnType = OTSColumnType.NORMAL; + } + + private OTSColumn(Column value, ColumnType type) { + this.value = value; + this.columnType = OTSColumnType.CONST; + this.valueType = type; + } + + public static OTSColumn fromNormalColumn(String name) { + if (name.isEmpty()) { + throw new IllegalArgumentException("The column name is empty."); + } + + return new OTSColumn(name); + } + + public static OTSColumn fromConstStringColumn(String value) { + return new OTSColumn(new StringColumn(value), ColumnType.STRING); + } + + public static OTSColumn fromConstIntegerColumn(long value) { + return new OTSColumn(new LongColumn(value), ColumnType.INTEGER); + } + + public static OTSColumn fromConstDoubleColumn(double value) { + return new OTSColumn(new DoubleColumn(value), ColumnType.DOUBLE); + } + + public static OTSColumn fromConstBoolColumn(boolean value) { + return new OTSColumn(new BoolColumn(value), ColumnType.BOOLEAN); + } + + public static OTSColumn fromConstBytesColumn(byte[] value) { + return new OTSColumn(new BytesColumn(value), ColumnType.BINARY); + } + + public Column getValue() { + return value; + } + + public OTSColumnType getColumnType() { + return columnType; + } + + public ColumnType getValueType() { + return valueType; + } + + public String getName() { + return name; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConf.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConf.java new file mode 100644 index 000000000..8b109a39e --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConf.java @@ -0,0 +1,90 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +import java.util.List; + +import com.aliyun.openservices.ots.model.PrimaryKeyValue; + +public class OTSConf { + private String endpoint= null; + private String accessId = null; + private String accesskey = null; + private String instanceName = null; + private String tableName = null; + + private List rangeBegin = null; + private List rangeEnd = null; + private List rangeSplit = null; + + private List columns = null; + + private int retry; + private int sleepInMilliSecond; + + public String getEndpoint() { + return endpoint; + } + public void setEndpoint(String endpoint) { + this.endpoint = endpoint; + } + public String getAccessId() { + return accessId; + } + public void setAccessId(String accessId) { + this.accessId = accessId; + } + public String getAccesskey() { + return accesskey; + } + public void setAccesskey(String accesskey) { + this.accesskey = accesskey; + } + public String getInstanceName() { + return instanceName; + } + public void setInstanceName(String instanceName) { + this.instanceName = instanceName; + } + public String getTableName() { + return tableName; + } + public void setTableName(String tableName) { + this.tableName = tableName; + } + + public List getColumns() { + return columns; + } + public void setColumns(List columns) { + this.columns = columns; + } + public int getRetry() { + return retry; + } + public void setRetry(int retry) { + this.retry = retry; + } + public int getSleepInMilliSecond() { + return sleepInMilliSecond; + } + public void setSleepInMilliSecond(int sleepInMilliSecond) { + this.sleepInMilliSecond = sleepInMilliSecond; + } + public List getRangeBegin() { + return rangeBegin; + } + public void setRangeBegin(List rangeBegin) { + this.rangeBegin = 
rangeBegin; + } + public List getRangeEnd() { + return rangeEnd; + } + public void setRangeEnd(List rangeEnd) { + this.rangeEnd = rangeEnd; + } + public List getRangeSplit() { + return rangeSplit; + } + public void setRangeSplit(List rangeSplit) { + this.rangeSplit = rangeSplit; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConst.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConst.java new file mode 100644 index 000000000..30177193b --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConst.java @@ -0,0 +1,25 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +public class OTSConst { + // Reader support type + public final static String TYPE_STRING = "STRING"; + public final static String TYPE_INTEGER = "INT"; + public final static String TYPE_DOUBLE = "DOUBLE"; + public final static String TYPE_BOOLEAN = "BOOL"; + public final static String TYPE_BINARY = "BINARY"; + public final static String TYPE_INF_MIN = "INF_MIN"; + public final static String TYPE_INF_MAX = "INF_MAX"; + + // Column + public final static String NAME = "name"; + public final static String TYPE = "type"; + public final static String VALUE = "value"; + + public final static String OTS_CONF = "OTS_CONF"; + public final static String OTS_RANGE = "OTS_RANGE"; + public final static String OTS_DIRECTION = "OTS_DIRECTION"; + + // options + public final static String RETRY = "maxRetryTime"; + public final static String SLEEP_IN_MILLI_SECOND = "retrySleepInMillionSecond"; +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSPrimaryKeyColumn.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSPrimaryKeyColumn.java new file mode 100644 index 000000000..eaec50ce5 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSPrimaryKeyColumn.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +import com.aliyun.openservices.ots.model.PrimaryKeyType; + +public class OTSPrimaryKeyColumn { + private String name; + private PrimaryKeyType type; + + public String getName() { + return name; + } + public void setName(String name) { + this.name = name; + } + public PrimaryKeyType getType() { + return type; + } + public void setType(PrimaryKeyType type) { + this.type = type; + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSRange.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSRange.java new file mode 100644 index 000000000..8ebfcf7ea --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSRange.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +import com.aliyun.openservices.ots.model.RowPrimaryKey; + +public class OTSRange { + + private RowPrimaryKey begin = null; + private RowPrimaryKey end = null; + + public OTSRange() {} + + public OTSRange(RowPrimaryKey begin, RowPrimaryKey end) { + this.begin = begin; + this.end = end; + } + + public RowPrimaryKey getBegin() { + return begin; + } + public void setBegin(RowPrimaryKey begin) { + this.begin = begin; + } + public RowPrimaryKey getEnd() { + return end; + } + public void setEnd(RowPrimaryKey end) { + this.end = end; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/Common.java 
b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/Common.java new file mode 100644 index 000000000..7bb3f52ea --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/Common.java @@ -0,0 +1,161 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +import com.alibaba.datax.common.element.BoolColumn; +import com.alibaba.datax.common.element.BytesColumn; +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSPrimaryKeyColumn; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSException; +import com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.Row; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +public class Common { + + public static int primaryKeyValueCmp(PrimaryKeyValue v1, PrimaryKeyValue v2) { + if (v1.getType() != null && v2.getType() != null) { + if (v1.getType() != v2.getType()) { + throw new IllegalArgumentException( + "Not same column type, column1:" + v1.getType() + ", column2:" + v2.getType()); + } + switch (v1.getType()) { + case INTEGER: + Long l1 = Long.valueOf(v1.asLong()); + Long l2 = Long.valueOf(v2.asLong()); + return l1.compareTo(l2); + case STRING: + return v1.asString().compareTo(v2.asString()); + default: + throw new IllegalArgumentException("Unsuporrt compare the type: " + v1.getType() + "."); + } + } else { + if (v1 == v2) { + return 0; + } else { + if (v1 == PrimaryKeyValue.INF_MIN) { + return -1; + } else if (v1 == PrimaryKeyValue.INF_MAX) { + return 1; + } + + if (v2 == PrimaryKeyValue.INF_MAX) { + return -1; + } else if (v2 == PrimaryKeyValue.INF_MIN) { + return 1; + } + } + } + return 0; + } + + public static OTSPrimaryKeyColumn getPartitionKey(TableMeta meta) { + List keys = new ArrayList(); + keys.addAll(meta.getPrimaryKey().keySet()); + + String key = keys.get(0); + + OTSPrimaryKeyColumn col = new OTSPrimaryKeyColumn(); + col.setName(key); + col.setType(meta.getPrimaryKey().get(key)); + return col; + } + + public static List getPrimaryKeyNameList(TableMeta meta) { + List names = new ArrayList(); + names.addAll(meta.getPrimaryKey().keySet()); + return names; + } + + public static int compareRangeBeginAndEnd(TableMeta meta, RowPrimaryKey begin, RowPrimaryKey end) { + if (begin.getPrimaryKey().size() != end.getPrimaryKey().size()) { + throw new IllegalArgumentException("Input size of begin not equal size of end, begin size : " + begin.getPrimaryKey().size() + + ", end size : " + end.getPrimaryKey().size() + "."); + } + for (String key : meta.getPrimaryKey().keySet()) { + PrimaryKeyValue v1 = begin.getPrimaryKey().get(key); + PrimaryKeyValue v2 = end.getPrimaryKey().get(key); + int cmp = primaryKeyValueCmp(v1, v2); + if (cmp != 0) { + return cmp; + } + } + return 0; + } + + public static List getNormalColumnNameList(List columns) { + List normalColumns = new ArrayList(); + for (OTSColumn col : columns) { + if (col.getColumnType() == OTSColumn.OTSColumnType.NORMAL) { + normalColumns.add(col.getName()); + } + } + return normalColumns; + } + + 
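// parseRowToLine (below) maps one OTS Row onto one DataX Record, preserving the order of
// the configured columns: CONST columns are emitted with their configured value, NORMAL
// columns are looked up in the row by name, and a NORMAL column missing from the row is
// emitted as a null StringColumn so later fields keep their positions. For example, with
// columns = [normal "uid", const STRING "site_a", normal "age"] and a row that only
// contains uid=7, the record becomes [7, "site_a", null] ("uid", "age", "site_a" are
// illustrative names). Typical call site, as used by the slave proxy in this plugin:
//   Record line = recordSender.createRecord();
//   line = Common.parseRowToLine(row, conf.getColumns(), line);
//   recordSender.sendToWriter(line);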
public static Record parseRowToLine(Row row, List columns, Record line) { + Map values = row.getColumns(); + for (OTSColumn col : columns) { + if (col.getColumnType() == OTSColumn.OTSColumnType.CONST) { + line.addColumn(col.getValue()); + } else { + ColumnValue v = values.get(col.getName()); + if (v == null) { + line.addColumn(new StringColumn(null)); + } else { + switch(v.getType()) { + case STRING: line.addColumn(new StringColumn(v.asString())); break; + case INTEGER: line.addColumn(new LongColumn(v.asLong())); break; + case DOUBLE: line.addColumn(new DoubleColumn(v.asDouble())); break; + case BOOLEAN: line.addColumn(new BoolColumn(v.asBoolean())); break; + case BINARY: line.addColumn(new BytesColumn(v.asBinary())); break; + default: + throw new IllegalArgumentException("Unsuporrt tranform the type: " + col.getValue().getType() + "."); + } + } + } + } + return line; + } + + public static String getDetailMessage(Exception exception) { + if (exception instanceof OTSException) { + OTSException e = (OTSException) exception; + return "OTSException[ErrorCode:" + e.getErrorCode() + ", ErrorMessage:" + e.getMessage() + ", RequestId:" + e.getRequestId() + "]"; + } else if (exception instanceof ClientException) { + ClientException e = (ClientException) exception; + return "ClientException[ErrorCode:" + e.getErrorCode() + ", ErrorMessage:" + e.getMessage() + "]"; + } else if (exception instanceof IllegalArgumentException) { + IllegalArgumentException e = (IllegalArgumentException) exception; + return "IllegalArgumentException[ErrorMessage:" + e.getMessage() + "]"; + } else { + return "Exception[ErrorMessage:" + exception.getMessage() + "]"; + } + } + + public static long getDelaySendMillinSeconds(int hadRetryTimes, int initSleepInMilliSecond) { + + if (hadRetryTimes <= 0) { + return 0; + } + + int sleepTime = initSleepInMilliSecond; + for (int i = 1; i < hadRetryTimes; i++) { + sleepTime += sleepTime; + if (sleepTime > 30000) { + sleepTime = 30000; + break; + } + } + return sleepTime; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/DefaultNoRetry.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/DefaultNoRetry.java new file mode 100644 index 000000000..359c19d02 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/DefaultNoRetry.java @@ -0,0 +1,17 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import com.aliyun.openservices.ots.internal.OTSRetryStrategy; + +public class DefaultNoRetry implements OTSRetryStrategy { + + @Override + public boolean shouldRetry(String action, Exception ex, int retries) { + return false; + } + + @Override + public long getPauseDelay(String s, Exception e, int i) { + return 0; + } + +} \ No newline at end of file diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/GsonParser.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/GsonParser.java new file mode 100644 index 000000000..a82f33500 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/GsonParser.java @@ -0,0 +1,63 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import com.alibaba.datax.plugin.reader.otsreader.adaptor.OTSColumnAdaptor; +import com.alibaba.datax.plugin.reader.otsreader.adaptor.PrimaryKeyValueAdaptor; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConf; +import 
com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; +import com.google.gson.Gson; +import com.google.gson.GsonBuilder; + +public class GsonParser { + + private static Gson gsonBuilder() { + return new GsonBuilder() + .registerTypeAdapter(OTSColumn.class, new OTSColumnAdaptor()) + .registerTypeAdapter(PrimaryKeyValue.class, new PrimaryKeyValueAdaptor()) + .create(); + } + + public static String rangeToJson (OTSRange range) { + Gson g = gsonBuilder(); + return g.toJson(range); + } + + public static OTSRange jsonToRange (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, OTSRange.class); + } + + public static String confToJson (OTSConf conf) { + Gson g = gsonBuilder(); + return g.toJson(conf); + } + + public static OTSConf jsonToConf (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, OTSConf.class); + } + + public static String directionToJson (Direction direction) { + Gson g = gsonBuilder(); + return g.toJson(direction); + } + + public static Direction jsonToDirection (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, Direction.class); + } + + public static String metaToJson (TableMeta meta) { + Gson g = gsonBuilder(); + return g.toJson(meta); + } + + public static String rowPrimaryKeyToJson (RowPrimaryKey row) { + Gson g = gsonBuilder(); + return g.toJson(row); + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ParamChecker.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ParamChecker.java new file mode 100644 index 000000000..fbcdc9722 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ParamChecker.java @@ -0,0 +1,245 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSPrimaryKeyColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +public class ParamChecker { + + private static void throwNotExistException(String key) { + throw new IllegalArgumentException("The param '" + key + "' is not exist."); + } + + private static void throwStringLengthZeroException(String key) { + throw new IllegalArgumentException("The param length of '" + key + "' is zero."); + } + + private static void throwEmptyException(String key) { + throw new IllegalArgumentException("The param '" + key + "' is empty."); + } + + private static void throwNotListException(String key) { + throw new IllegalArgumentException("The param '" + key + "' is not a json array."); + } + + private static void throwNotMapException(String key) { + throw new IllegalArgumentException("The param '" + key + "' is not a json map."); + } + + public static String checkStringAndGet(Configuration param, String key) { + String value = param.getString(key); + if (null == value) { + throwNotExistException(key); + } else if (value.length() == 0) { + 
throwStringLengthZeroException(key); + } + return value; + } + + public static List checkListAndGet(Configuration param, String key, boolean isCheckEmpty) { + List value = null; + try { + value = param.getList(key); + } catch (ClassCastException e) { + throwNotListException(key); + } + if (null == value) { + throwNotExistException(key); + } else if (isCheckEmpty && value.isEmpty()) { + throwEmptyException(key); + } + return value; + } + + public static List checkListAndGet(Map range, String key) { + Object obj = range.get(key); + if (null == obj) { + return null; + } + return checkListAndGet(range, key, false); + } + + public static List checkListAndGet(Map range, String key, boolean isCheckEmpty) { + Object obj = range.get(key); + if (null == obj) { + throwNotExistException(key); + } + if (obj instanceof List) { + @SuppressWarnings("unchecked") + List value = (List)obj; + if (isCheckEmpty && value.isEmpty()) { + throwEmptyException(key); + } + return value; + } else { + throw new IllegalArgumentException("Can not parse list of '" + key + "' from map."); + } + } + + public static List checkListAndGet(Map range, String key, List defaultList) { + Object obj = range.get(key); + if (null == obj) { + return defaultList; + } + if (obj instanceof List) { + @SuppressWarnings("unchecked") + List value = (List)obj; + return value; + } else { + throw new IllegalArgumentException("Can not parse list of '" + key + "' from map."); + } + } + + public static Map checkMapAndGet(Configuration param, String key, boolean isCheckEmpty) { + Map value = null; + try { + value = param.getMap(key); + } catch (ClassCastException e) { + throwNotMapException(key); + } + if (null == value) { + throwNotExistException(key); + } else if (isCheckEmpty && value.isEmpty()) { + throwEmptyException(key); + } + return value; + } + + public static RowPrimaryKey checkInputPrimaryKeyAndGet(TableMeta meta, List range) { + if (meta.getPrimaryKey().size() != range.size()) { + throw new IllegalArgumentException(String.format( + "Input size of values not equal size of primary key. input size:%d, primary key size:%d .", + range.size(), meta.getPrimaryKey().size())); + } + RowPrimaryKey pk = new RowPrimaryKey(); + int i = 0; + for (Entry e: meta.getPrimaryKey().entrySet()) { + PrimaryKeyValue value = range.get(i); + if (e.getValue() != value.getType() && value != PrimaryKeyValue.INF_MIN && value != PrimaryKeyValue.INF_MAX) { + throw new IllegalArgumentException( + "Input range type not match primary key. 
Input type:" + value.getType() + ", Primary Key Type:"+ e.getValue() +", Index:" + i + ); + } else { + pk.addPrimaryKeyColumn(e.getKey(), value); + } + i++; + } + return pk; + } + + public static OTSRange checkRangeAndGet(TableMeta meta, List begin, List end) { + OTSRange range = new OTSRange(); + if (begin.size() == 0 && end.size() == 0) { + RowPrimaryKey beginRow = new RowPrimaryKey(); + RowPrimaryKey endRow = new RowPrimaryKey(); + for (String name : meta.getPrimaryKey().keySet()) { + beginRow.addPrimaryKeyColumn(name, PrimaryKeyValue.INF_MIN); + endRow.addPrimaryKeyColumn(name, PrimaryKeyValue.INF_MAX); + } + range.setBegin(beginRow); + range.setEnd(endRow); + } else { + RowPrimaryKey beginRow = checkInputPrimaryKeyAndGet(meta, begin); + RowPrimaryKey endRow = checkInputPrimaryKeyAndGet(meta, end); + range.setBegin(beginRow); + range.setEnd(endRow); + } + return range; + } + + public static Direction checkDirectionAndEnd(TableMeta meta, RowPrimaryKey begin, RowPrimaryKey end) { + Direction direction = null; + int cmp = Common.compareRangeBeginAndEnd(meta, begin, end) ; + + if (cmp > 0) { + direction = Direction.BACKWARD; + } else if (cmp < 0) { + direction = Direction.FORWARD; + } else { + throw new IllegalArgumentException("Value of 'range-begin' equal value of 'range-end'."); + } + return direction; + } + + /** + * 检查类型是否一致,是否重复,方向是否一致 + * @param direction + * @param before + * @param after + */ + private static void checkDirection(Direction direction, PrimaryKeyValue before, PrimaryKeyValue after) { + int cmp = Common.primaryKeyValueCmp(before, after); + if (cmp > 0) { // 反向 + if (direction == Direction.FORWARD) { + throw new IllegalArgumentException("Input direction of 'range-split' is FORWARD, but direction of 'range' is BACKWARD."); + } + } else if (cmp < 0) { // 正向 + if (direction == Direction.BACKWARD) { + throw new IllegalArgumentException("Input direction of 'range-split' is BACKWARD, but direction of 'range' is FORWARD."); + } + } else { // 重复列 + throw new IllegalArgumentException("Multi same column in 'range-split'."); + } + } + + /** + * 检查 points中的所有点是否是在Begin和end之间 + * @param begin + * @param end + * @param points + */ + private static void checkPointsRange(Direction direction, PrimaryKeyValue begin, PrimaryKeyValue end, List points) { + if (direction == Direction.FORWARD) { + if (!(Common.primaryKeyValueCmp(begin, points.get(0)) < 0 && Common.primaryKeyValueCmp(end, points.get(points.size() - 1)) > 0)) { + throw new IllegalArgumentException("The item of 'range-split' is not within scope of 'range-begin' and 'range-end'."); + } + } else { + if (!(Common.primaryKeyValueCmp(begin, points.get(0)) > 0 && Common.primaryKeyValueCmp(end, points.get(points.size() - 1)) < 0)) { + throw new IllegalArgumentException("The item of 'range-split' is not within scope of 'range-begin' and 'range-end'."); + } + } + } + + /** + * 1.检测用户的输入类型是否和PartitionKey一致 + * 2.顺序是否和Range一致 + * 3.是否有重复列 + * 4.检查points的范围是否在range内 + * @param meta + * @param points + */ + public static void checkInputSplitPoints(TableMeta meta, OTSRange range, Direction direction, List points) { + if (null == points || points.isEmpty()) { + return; + } + + OTSPrimaryKeyColumn part = Common.getPartitionKey(meta); + + // 处理第一个 + PrimaryKeyValue item = points.get(0); + if ( item.getType() != part.getType()) { + throw new IllegalArgumentException("Input type of 'range-split' not match partition key. 
" + + "Item of 'range-split' type:" + item.getType()+ ", Partition type:" + part.getType()); + } + + for (int i = 0 ; i < points.size() - 1; i++) { + PrimaryKeyValue before = points.get(i); + PrimaryKeyValue after = points.get(i + 1); + checkDirection(direction, before, after); + } + + PrimaryKeyValue begin = range.getBegin().getPrimaryKey().get(part.getName()); + PrimaryKeyValue end = range.getEnd().getPrimaryKey().get(part.getName()); + + checkPointsRange(direction, begin, end, points); + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RangeSplit.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RangeSplit.java new file mode 100644 index 000000000..74caac3f7 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RangeSplit.java @@ -0,0 +1,379 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.math.BigInteger; +import java.util.ArrayList; +import java.util.Collections; +import java.util.Comparator; +import java.util.List; + +import com.alibaba.datax.plugin.reader.otsreader.model.OTSPrimaryKeyColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +/** + * 主要提供对范围的解析 + */ +public class RangeSplit { + + private static String bigIntegerToString(BigInteger baseValue, + BigInteger bValue, BigInteger multi, int lenOfString) { + BigInteger tmp = bValue; + StringBuilder sb = new StringBuilder(); + for (int tmpLength = 0; tmpLength < lenOfString; tmpLength++) { + sb.insert(0, + (char) (baseValue.add(tmp.remainder(multi)).intValue())); + tmp = tmp.divide(multi); + } + return sb.toString(); + } + + /** + * 切分String的Unicode Unit + * + * 注意:该方法只支持begin小于end + * + * @param beginStr + * @param endStr + * @param count + * @return + */ + private static List splitCodePoint(int begin, int end, int count) { + if (begin >= end) { + throw new IllegalArgumentException("Only support begin < end."); + } + + List results = new ArrayList(); + BigInteger beginBig = BigInteger.valueOf(begin); + BigInteger endBig = BigInteger.valueOf(end); + BigInteger countBig = BigInteger.valueOf(count); + BigInteger multi = endBig.subtract(beginBig).add(BigInteger.ONE); + BigInteger range = endBig.subtract(beginBig); + BigInteger interval = BigInteger.ZERO; + int length = 1; + + BigInteger tmpBegin = BigInteger.ZERO; + BigInteger tmpEnd = endBig.subtract(beginBig); + + // 扩大之后的数值 + BigInteger realBegin = tmpBegin; + BigInteger realEnd = tmpEnd; + + while (range.compareTo(countBig) < 0) { // 不够切分 + realEnd = realEnd.multiply(multi).add(tmpEnd); + range = realEnd.subtract(realBegin); + length++; + } + + interval = range.divide(countBig); + + BigInteger cur = realBegin; + + for (int i = 0; i < (count - 1); i++) { + results.add(bigIntegerToString(beginBig, cur, multi, length)); + cur = cur.add(interval); + } + results.add(bigIntegerToString(beginBig, realEnd, multi, length)); + return results; + } + + /** + * 注意: 当begin和end相等时,函数将返回空的List + * + * @param begin + * @param end + * @param count + * @return + */ + public static List splitStringRange(String begin, String end, int count) { + + if (count <= 1) { + throw new IllegalArgumentException("Input count <= 1 ."); + } + + List results = new ArrayList(); + + int beginValue = 0; + if (!begin.isEmpty()) { + 
beginValue = begin.codePointAt(0); + } + int endValue = 0; + if (!end.isEmpty()) { + endValue = end.codePointAt(0); + } + + int cmp = beginValue - endValue; + + if (cmp == 0) { + return results; + } + + results.add(begin); + + Comparator comparator = new Comparator(){ + public int compare(String arg0, String arg1) { + return arg0.compareTo(arg1); + } + }; + + List tmp = null; + + if (cmp > 0) { // 如果是逆序,则 reverse Comparator + comparator = Collections.reverseOrder(comparator); + tmp = splitCodePoint(endValue, beginValue, count); + } else { // 正序 + tmp = splitCodePoint(beginValue, endValue, count); + } + + Collections.sort(tmp, comparator); + + for (String value : tmp) { + if (comparator.compare(value, begin) > 0 && comparator.compare(value, end) < 0) { + results.add(value); + } + } + + results.add(end); + + return results; + } + + /** + * begin 一定要小于 end + * @param begin + * @param end + * @param count + * @return + */ + private static List splitIntegerRange(BigInteger bigBegin, BigInteger bigEnd, BigInteger bigCount) { + List is = new ArrayList(); + + BigInteger interval = (bigEnd.subtract(bigBegin)).divide(bigCount); + BigInteger cur = bigBegin; + BigInteger i = BigInteger.ZERO; + while (cur.compareTo(bigEnd) < 0 && i.compareTo(bigCount) < 0) { + is.add(cur.longValue()); + cur = cur.add(interval); + i = i.add(BigInteger.ONE); + } + is.add(bigEnd.longValue()); + return is; + } + + /** + * 切分数值类型 注意: 当begin和end相等时,函数将返回空的List + * + * @param begin + * @param end + * @param count + * @return + */ + public static List splitIntegerRange(long begin, long end, int count) { + + if (count <= 1) { + throw new IllegalArgumentException("Input count <= 1 ."); + } + List is = new ArrayList(); + + BigInteger bigBegin = BigInteger.valueOf(begin); + BigInteger bigEnd = BigInteger.valueOf(end); + BigInteger bigCount = BigInteger.valueOf(count); + + BigInteger abs = (bigEnd.subtract(bigBegin)).abs(); + + if (abs.compareTo(BigInteger.ZERO) == 0) { // partition key 相等的情况 + return is; + } + + if (bigCount.compareTo(abs) > 0) { + bigCount = abs; + } + + if (bigEnd.subtract(bigBegin).compareTo(BigInteger.ZERO) > 0) { // 正向 + return splitIntegerRange(bigBegin, bigEnd, bigCount); + } else { // 逆向 + List tmp = splitIntegerRange(bigEnd, bigBegin, bigCount); + + Comparator comparator = new Comparator(){ + public int compare(Long arg0, Long arg1) { + return arg0.compareTo(arg1); + } + }; + + Collections.sort(tmp,Collections.reverseOrder(comparator)); + return tmp; + } + } + + public static List splitRangeByPrimaryKeyType( + PrimaryKeyType type, PrimaryKeyValue begin, PrimaryKeyValue end, + int count) { + List result = new ArrayList(); + if (type == PrimaryKeyType.STRING) { + List points = splitStringRange(begin.asString(), + end.asString(), count); + for (String s : points) { + result.add(PrimaryKeyValue.fromString(s)); + } + } else { + List points = splitIntegerRange(begin.asLong(), end.asLong(), + count); + for (Long l : points) { + result.add(PrimaryKeyValue.fromLong(l)); + } + } + return result; + } + + public static List rangeSplitByCount(TableMeta meta, + RowPrimaryKey begin, RowPrimaryKey end, int count) { + List results = new ArrayList(); + + OTSPrimaryKeyColumn partitionKey = Common.getPartitionKey(meta); + + PrimaryKeyValue beginPartitionKey = begin.getPrimaryKey().get( + partitionKey.getName()); + PrimaryKeyValue endPartitionKey = end.getPrimaryKey().get( + partitionKey.getName()); + + // 第一,先对PartitionKey列进行拆分 + + List ranges = RangeSplit.splitRangeByPrimaryKeyType( + partitionKey.getType(), 
beginPartitionKey, endPartitionKey, + count); + + if (ranges.isEmpty()) { + return results; + } + + int size = ranges.size(); + for (int i = 0; i < size - 1; i++) { + RowPrimaryKey bPk = new RowPrimaryKey(); + RowPrimaryKey ePk = new RowPrimaryKey(); + + bPk.addPrimaryKeyColumn(partitionKey.getName(), ranges.get(i)); + ePk.addPrimaryKeyColumn(partitionKey.getName(), ranges.get(i + 1)); + + results.add(new OTSRange(bPk, ePk)); + } + + // 第二,填充非PartitionKey的ParimaryKey列 + // 注意:在填充过程中,需要使用用户给定的Begin和End来替换切分出来的第一个Range + // 的Begin和最后一个Range的End + + List keys = new ArrayList(meta.getPrimaryKey().size()); + keys.addAll(meta.getPrimaryKey().keySet()); + + for (int i = 0; i < results.size(); i++) { + for (int j = 1; j < keys.size(); j++) { + OTSRange c = results.get(i); + RowPrimaryKey beginPK = c.getBegin(); + RowPrimaryKey endPK = c.getEnd(); + String key = keys.get(j); + if (i == 0) { // 第一行 + beginPK.addPrimaryKeyColumn(key, + begin.getPrimaryKey().get(key)); + endPK.addPrimaryKeyColumn(key, PrimaryKeyValue.INF_MIN); + } else if (i == results.size() - 1) {// 最后一行 + beginPK.addPrimaryKeyColumn(key, PrimaryKeyValue.INF_MIN); + endPK.addPrimaryKeyColumn(key, end.getPrimaryKey().get(key)); + } else { + beginPK.addPrimaryKeyColumn(key, PrimaryKeyValue.INF_MIN); + endPK.addPrimaryKeyColumn(key, PrimaryKeyValue.INF_MIN); + } + } + } + return results; + } + + private static List getCompletePK(int num, + PrimaryKeyValue value) { + List values = new ArrayList(); + for (int j = 0; j < num; j++) { + if (j == 0) { + values.add(value); + } else { + // 这里在填充PK时,系统需要选择特定的值填充于此 + // 系统默认填充INF_MIN + values.add(PrimaryKeyValue.INF_MIN); + } + } + return values; + } + + /** + * 根据输入的范围begin和end,从target中取得对应的point + * @param begin + * @param end + * @param target + * @return + */ + public static List getSplitPoint(PrimaryKeyValue begin, PrimaryKeyValue end, List target) { + List result = new ArrayList(); + + int cmp = Common.primaryKeyValueCmp(begin, end); + + if (cmp == 0) { + return result; + } + + result.add(begin); + + Comparator comparator = new Comparator(){ + public int compare(PrimaryKeyValue arg0, PrimaryKeyValue arg1) { + return Common.primaryKeyValueCmp(arg0, arg1); + } + }; + + if (cmp > 0) { // 如果是逆序,则 reverse Comparator + comparator = Collections.reverseOrder(comparator); + } + + Collections.sort(target, comparator); + + for (PrimaryKeyValue value:target) { + if (comparator.compare(value, begin) > 0 && comparator.compare(value, end) < 0) { + result.add(value); + } + } + result.add(end); + + return result; + } + + public static List rangeSplitByPoint(TableMeta meta, RowPrimaryKey beginPK, RowPrimaryKey endPK, + List splits) { + + List results = new ArrayList(); + + int pkCount = meta.getPrimaryKey().size(); + + String partName = Common.getPartitionKey(meta).getName(); + PrimaryKeyValue begin = beginPK.getPrimaryKey().get(partName); + PrimaryKeyValue end = endPK.getPrimaryKey().get(partName); + + List newSplits = getSplitPoint(begin, end, splits); + + if (newSplits.isEmpty()) { + return results; + } + + for (int i = 0; i < newSplits.size() - 1; i++) { + OTSRange item = new OTSRange( + ParamChecker.checkInputPrimaryKeyAndGet(meta, + getCompletePK(pkCount, newSplits.get(i))), + ParamChecker.checkInputPrimaryKeyAndGet(meta, + getCompletePK(pkCount, newSplits.get(i + 1)))); + results.add(item); + } + // replace first and last + OTSRange first = results.get(0); + OTSRange last = results.get(results.size() - 1); + + first.setBegin(beginPK); + last.setEnd(endPK); + return results; + } +} diff --git 
a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ReaderModelParser.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ReaderModelParser.java new file mode 100644 index 000000000..8e1dfd415 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ReaderModelParser.java @@ -0,0 +1,175 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +import org.apache.commons.codec.binary.Base64; + +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConst; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; + +/** + * 主要对OTS PrimaryKey,OTSColumn的解析 + */ +public class ReaderModelParser { + + private static long getLongValue(String value) { + try { + return Long.parseLong(value); + } catch (NumberFormatException e) { + throw new IllegalArgumentException("Can not parse the value '"+ value +"' to Int."); + } + } + + private static double getDoubleValue(String value) { + try { + return Double.parseDouble(value); + } catch (NumberFormatException e) { + throw new IllegalArgumentException("Can not parse the value '"+ value +"' to Double."); + } + } + + private static boolean getBoolValue(String value) { + if (!(value.equalsIgnoreCase("true") || value.equalsIgnoreCase("false"))) { + throw new IllegalArgumentException("Can not parse the value '"+ value +"' to Bool."); + } + return Boolean.parseBoolean(value); + } + + public static OTSColumn parseConstColumn(String type, String value) { + if (type.equalsIgnoreCase(OTSConst.TYPE_STRING)) { + return OTSColumn.fromConstStringColumn(value); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INTEGER)) { + return OTSColumn.fromConstIntegerColumn(getLongValue(value)); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_DOUBLE)) { + return OTSColumn.fromConstDoubleColumn(getDoubleValue(value)); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_BOOLEAN)) { + return OTSColumn.fromConstBoolColumn(getBoolValue(value)); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_BINARY)) { + return OTSColumn.fromConstBytesColumn(Base64.decodeBase64(value)); + } else { + throw new IllegalArgumentException("Invalid 'column', Can not parse map to 'OTSColumn', input type:" + type + ", value:" + value + "."); + } + } + + public static OTSColumn parseOTSColumn(Map item) { + if (item.containsKey(OTSConst.NAME) && item.size() == 1) { + Object name = item.get(OTSConst.NAME); + if (name instanceof String) { + String nameStr = (String) name; + return OTSColumn.fromNormalColumn(nameStr); + } else { + throw new IllegalArgumentException("Invalid 'column', Can not parse map to 'OTSColumn', the value is not a string."); + } + } else if (item.containsKey(OTSConst.TYPE) && item.containsKey(OTSConst.VALUE) && item.size() == 2) { + Object type = item.get(OTSConst.TYPE); + Object value = item.get(OTSConst.VALUE); + if (type instanceof String && value instanceof String) { + String typeStr = (String) type; + String valueStr = (String) value; + return parseConstColumn(typeStr, valueStr); + } else { + throw new IllegalArgumentException("Invalid 'column', Can not parse map to 'OTSColumn', the value is not a string."); + } + } else { + throw new IllegalArgumentException( + "Invalid 'column', Can not parse map to 'OTSColumn', valid format: '{\"name\":\"\"}' or '{\"type\":\"\", \"value\":\"\"}'."); + } + } + + private static void checkIsAllConstColumn(List columns) { + 
for (OTSColumn c : columns) { + if (c.getColumnType() == OTSColumn.OTSColumnType.NORMAL) { + return ; + } + } + throw new IllegalArgumentException("Invalid 'column', 'column' should include at least one or more Normal Column."); + } + + public static List parseOTSColumnList(List input) { + if (input.isEmpty()) { + throw new IllegalArgumentException("Input count of 'column' is zero."); + } + + List columns = new ArrayList(input.size()); + + for (Object item:input) { + if (item instanceof Map){ + @SuppressWarnings("unchecked") + Map column = (Map) item; + columns.add(parseOTSColumn(column)); + } else { + throw new IllegalArgumentException("Invalid 'column', Can not parse Object to 'OTSColumn', item of list is not a map."); + } + } + checkIsAllConstColumn(columns); + return columns; + } + + public static PrimaryKeyValue parsePrimaryKeyValue(String type, String value) { + if (type.equalsIgnoreCase(OTSConst.TYPE_STRING)) { + return PrimaryKeyValue.fromString(value); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INTEGER)) { + return PrimaryKeyValue.fromLong(getLongValue(value)); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INF_MIN)) { + throw new IllegalArgumentException("Format error, the " + OTSConst.TYPE_INF_MIN + " only support {\"type\":\"" + OTSConst.TYPE_INF_MIN + "\"}."); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INF_MAX)) { + throw new IllegalArgumentException("Format error, the " + OTSConst.TYPE_INF_MAX + " only support {\"type\":\"" + OTSConst.TYPE_INF_MAX + "\"}."); + } else { + throw new IllegalArgumentException("Not supprot parsing type: "+ type +" for PrimaryKeyValue."); + } + } + + public static PrimaryKeyValue parsePrimaryKeyValue(String type) { + if (type.equalsIgnoreCase(OTSConst.TYPE_INF_MIN)) { + return PrimaryKeyValue.INF_MIN; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INF_MAX)) { + return PrimaryKeyValue.INF_MAX; + } else { + throw new IllegalArgumentException("Not supprot parsing type: "+ type +" for PrimaryKeyValue."); + } + } + + public static PrimaryKeyValue parsePrimaryKeyValue(Map item) { + if (item.containsKey(OTSConst.TYPE) && item.containsKey(OTSConst.VALUE) && item.size() == 2) { + Object type = item.get(OTSConst.TYPE); + Object value = item.get(OTSConst.VALUE); + if (type instanceof String && value instanceof String) { + String typeStr = (String) type; + String valueStr = (String) value; + return parsePrimaryKeyValue(typeStr, valueStr); + } else { + throw new IllegalArgumentException("The 'type' and 'value‘ only support string."); + } + } else if (item.containsKey(OTSConst.TYPE) && item.size() == 1) { + Object type = item.get(OTSConst.TYPE); + if (type instanceof String) { + String typeStr = (String) type; + return parsePrimaryKeyValue(typeStr); + } else { + throw new IllegalArgumentException("The 'type' only support string."); + } + } else { + throw new IllegalArgumentException("The map must consist of 'type' and 'value'."); + } + } + + public static List parsePrimaryKey(List input) { + if (null == input) { + return null; + } + List columns = new ArrayList(input.size()); + for (Object item:input) { + if (item instanceof Map) { + @SuppressWarnings("unchecked") + Map column = (Map) item; + columns.add(parsePrimaryKeyValue(column)); + } else { + throw new IllegalArgumentException("Can not parse Object to 'PrimaryKeyValue', item of list is not a map."); + } + } + return columns; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RetryHelper.java 
b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RetryHelper.java new file mode 100644 index 000000000..8ed412670 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RetryHelper.java @@ -0,0 +1,83 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.util.HashSet; +import java.util.Set; +import java.util.concurrent.Callable; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSErrorCode; +import com.aliyun.openservices.ots.OTSException; + +public class RetryHelper { + + private static final Logger LOG = LoggerFactory.getLogger(RetryHelper.class); + private static final Set noRetryErrorCode = prepareNoRetryErrorCode(); + + public static V executeWithRetry(Callable callable, int maxRetryTimes, int sleepInMilliSecond) throws Exception { + int retryTimes = 0; + while (true){ + Thread.sleep(Common.getDelaySendMillinSeconds(retryTimes, sleepInMilliSecond)); + try { + return callable.call(); + } catch (Exception e) { + LOG.warn("Call callable fail, {}", e.getMessage()); + if (!canRetry(e)){ + LOG.error("Can not retry for Exception.", e); + throw e; + } else if (retryTimes >= maxRetryTimes) { + LOG.error("Retry times more than limition. maxRetryTimes : {}", maxRetryTimes); + throw e; + } + retryTimes++; + LOG.warn("Retry time : {}", retryTimes); + } + } + } + + private static Set prepareNoRetryErrorCode() { + Set pool = new HashSet(); + pool.add(OTSErrorCode.AUTHORIZATION_FAILURE); + pool.add(OTSErrorCode.INVALID_PARAMETER); + pool.add(OTSErrorCode.REQUEST_TOO_LARGE); + pool.add(OTSErrorCode.OBJECT_NOT_EXIST); + pool.add(OTSErrorCode.OBJECT_ALREADY_EXIST); + pool.add(OTSErrorCode.INVALID_PK); + pool.add(OTSErrorCode.OUT_OF_COLUMN_COUNT_LIMIT); + pool.add(OTSErrorCode.OUT_OF_ROW_SIZE_LIMIT); + pool.add(OTSErrorCode.CONDITION_CHECK_FAIL); + return pool; + } + + public static boolean canRetry(String otsErrorCode) { + if (noRetryErrorCode.contains(otsErrorCode)) { + return false; + } else { + return true; + } + } + + public static boolean canRetry(Exception exception) { + OTSException e = null; + if (exception instanceof OTSException) { + e = (OTSException) exception; + LOG.warn( + "OTSException:ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()} + ); + return canRetry(e.getErrorCode()); + + } else if (exception instanceof ClientException) { + ClientException ce = (ClientException) exception; + LOG.warn( + "ClientException:{}, ErrorMsg:{}", + new Object[]{ce.getErrorCode(), ce.getMessage()} + ); + return true; + } else { + return false; + } + } +} diff --git a/otsreader/src/main/resources/plugin.json b/otsreader/src/main/resources/plugin.json new file mode 100644 index 000000000..bfd956273 --- /dev/null +++ b/otsreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "otsreader", + "class": "com.alibaba.datax.plugin.reader.otsreader.OtsReader", + "description": "", + "developer": "alibaba" +} \ No newline at end of file diff --git a/otsreader/src/main/resources/plugin_job_template.json b/otsreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..7d4d0dbc6 --- /dev/null +++ b/otsreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "otsreader", + "parameter": { + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + "column":[], + "range":{ + "begin":[], + "end":[] + } + } 
+} \ No newline at end of file diff --git a/otswriter/doc/otswriter.md b/otswriter/doc/otswriter.md new file mode 100644 index 000000000..167487cb8 --- /dev/null +++ b/otswriter/doc/otswriter.md @@ -0,0 +1,234 @@ + +# OTSWriter 插件文档 + + +___ + + +## 1 快速介绍 + +OTSWriter插件实现了向OTS写入数据,目前支持两种写入方式: + +* PutRow,对应于OTS API PutRow,插入数据到指定的行,如果该行不存在,则新增一行;若该行存在,则覆盖原有行。 + +* UpdateRow,对应于OTS API UpdateRow,更新指定行的数据,如果该行不存在,则新增一行;若该行存在,则根据请求的内容在这一行中新增、修改或者删除指定列的值。 + +OTS是构建在阿里云飞天分布式系统之上的 NoSQL数据库服务,提供海量结构化数据的存储和实时访问。OTS 以实例和表的形式组织数据,通过数据分片和负载均衡技术,实现规模上的无缝扩展。 + +## 2 实现原理 + +简而言之,OTSWriter通过OTS官方Java SDK连接到OTS服务端,并通过SDK写入OTS服务端。OTSWriter本身对于写入过程做了很多优化,包括写入超时重试、异常写入重试、批量提交等Feature。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个写入OTS作业: + +``` +{ + "job": { + "setting": { + }, + "content": [ + { + "reader": {}, + "writer": { + "name": "otswriter", + "parameter": { + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + // 导出数据表的表名 + "table":"", + + // Writer支持不同类型之间进行相互转换 + // 如下类型转换不支持: + // ================================ + // int -> binary + // double -> bool, binary + // bool -> binary + // bytes -> int, double, bool + // ================================ + + // 需要导入的PK列名,区分大小写 + // 类型支持:STRING,INT + // 1. 支持类型转换,注意类型转换时的精度丢失 + // 2. 顺序不要求和表的Meta一致 + "primaryKey" : [ + {"name":"pk1", "type":"string"}, + {"name":"pk2", "type":"int"} + ], + + // 需要导入的列名,区分大小写 + // 类型支持STRING,INT,DOUBLE,BOOL和BINARY + "column" : [ + {"name":"col2", "type":"INT"}, + {"name":"col3", "type":"STRING"}, + {"name":"col4", "type":"STRING"}, + {"name":"col5", "type":"BINARY"}, + {"name":"col6", "type":"DOUBLE"} + ], + + // 写入OTS的方式 + // PutRow : 等同于OTS API中PutRow操作,检查条件是ignore + // UpdateRow : 等同于OTS API中UpdateRow操作,检查条件是ignore + "writeMode" : "PutRow" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **endpoint** + + * 描述:OTS Server的EndPoint(服务地址),例如http://bazhen.cn−hangzhou.ots.aliyuncs.com。 + + * 必选:是
+ + * 默认值:无
+ +* **accessId** + + * 描述:OTS的accessId
+ + * 必选:是
+ + * 默认值:无
+ +* **accessKey** + + * 描述:OTS的accessKey
+ + * 必选:是
+ + * 默认值:无
+ +* **instanceName** + + * 描述:OTS的实例名称,实例是用户使用和管理 OTS 服务的实体,用户在开通 OTS 服务之后,需要通过管理控制台来创建实例,然后在实例内进行表的创建和管理。实例是 OTS 资源管理的基础单元,OTS 对应用程序的访问控制和资源计量都在实例级别完成。
+ + * 必选:是
+ + * 默认值:无
+ + +* **table** + + * 描述:所选取的需要写入的表名称,这里有且只能填写一张表。在OTS中不存在多表同步的需求。
+ + * 必选:是
+ + * 默认值:无
+ +* **primaryKey** + + * 描述:OTS的主键信息,使用JSON数组描述字段信息。OTS本身是NoSQL系统,在OTSWriter导入数据的过程中,必须指定相应的字段名称。 + + OTS的PrimaryKey只支持STRING、INT两种类型,因此OTSWriter也只允许填写上述两种类型。 + + DataX本身支持类型转换,因此当源头数据类型不是String/Int时,OTSWriter会进行数据类型转换。 + + 配置实例: + + ```json + "primaryKey" : [ + {"name":"pk1", "type":"string"}, + {"name":"pk2", "type":"int"} + ], + ``` + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON数组描述字段信息。使用格式为 + + ```json + {"name":"col2", "type":"INT"}, + ``` + + 其中的name指定写入的OTS列名,type指定写入的类型。OTS类型支持STRING、INT、DOUBLE、BOOL和BINARY几种类型。 + + 写入过程不支持常量、函数或者自定义表达式。 + + * 必选:是
+ + * 默认值:无
+ +* **writeMode** + + * 描述:写入模式,目前支持两种模式(本节末尾给出上述参数的完整组合示意): + + * PutRow,对应于OTS API PutRow,插入数据到指定的行,如果该行不存在,则新增一行;若该行存在,则覆盖原有行。 + + * UpdateRow,对应于OTS API UpdateRow,更新指定行的数据,如果该行不存在,则新增一行;若该行存在,则根据请求的内容在这一行中新增、修改或者删除指定列的值。 + + * 必选:是
+ + * 默认值:无
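+
+ 下面是把上述参数组合在一起的一个最小配置示意(仅为示意:其中endpoint、accessId、accessKey、instanceName以及表名、列名均为占位符,并非真实取值,实际使用时请按照自己的OTS实例和表结构填写):
+
+ ```json
+ {
+     "name": "otswriter",
+     "parameter": {
+         "endpoint": "http://your-instance.cn-hangzhou.ots.aliyuncs.com",
+         "accessId": "your-access-id",
+         "accessKey": "your-access-key",
+         "instanceName": "your-instance",
+         "table": "your_table",
+         "primaryKey": [
+             {"name": "pk1", "type": "string"},
+             {"name": "pk2", "type": "int"}
+         ],
+         "column": [
+             {"name": "col1", "type": "STRING"},
+             {"name": "col2", "type": "INT"}
+         ],
+         "writeMode": "UpdateRow"
+     }
+ }
+ ```
+
+ 该示意只包含本节列出的必选参数;从本次提交的插件代码看,还存在maxRetryTime、batchWriteCount等可选配置项,均有默认值,可按需追加。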
+ + +### 3.3 类型转换 + +目前OTSWriter支持所有OTS类型,下面列出OTSWriter针对OTS类型转换列表: + + +| DataX 内部类型| OTS 数据类型 | +| -------- | ----- | +| Long |Integer | +| Double |Double| +| String |String| +| Boolean |Boolean| +| Bytes |Binary | + +* 注意,OTS本身不支持日期型类型。应用层一般使用Long报错时间的Unix TimeStamp。 + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 + +2列PK(10 + 8),15列String(10 Byte), 2两列Integer(8 Byte),算上Column Name每行大概327Byte,每次BatchWriteRow写入100行数据,所以当个请求的数据大小是32KB。 + +#### 4.1.2 机器参数 + +OTS端:3台前端机,5台后端机 + +DataX运行端: 24核CPU, 98GB内存 + +### 4.2 测试报告 + +#### 4.2.1 测试报告 + +|并发数|DataX CPU|DATAX流量 |OTS 流量 | BatchWrite前端QPS| BatchWriteRow前端延时| +|--------|--------| --------|--------|--------|------| +|40| 1027% |Speed 22.13MB/s, 112640 records/s|65.8M/s |42|153ms | +|50| 1218% |Speed 24.11MB/s, 122700 records/s|73.5M/s |47|174ms| +|60| 1355% |Speed 25.31MB/s, 128854 records/s|78.1M/s |50|190ms| +|70| 1578% |Speed 26.35MB/s, 134121 records/s|80.8M/s |52|210ms| +|80| 1771% |Speed 26.55MB/s, 135161 records/s|82.7M/s |53|230ms| + + + + +## 5 约束限制 + +### 5.1 写入幂等性 + +OTS写入本身是支持幂等性的,也就是使用OTS SDK同一条数据写入OTS系统,一次和多次请求的结果可以理解为一致的。因此对于OTSWriter多次尝试写入同一条数据与写入一条数据结果是等同的。 + +### 5.2 单任务FailOver + +由于OTS写入本身是幂等性的,因此可以支持单任务FailOver。即一旦写入Fail,DataX会重新启动相关子任务进行重试。 + +## 6 FAQ diff --git a/otswriter/otswriter.iml b/otswriter/otswriter.iml new file mode 100644 index 000000000..7f8503ed7 --- /dev/null +++ b/otswriter/otswriter.iml @@ -0,0 +1,38 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/otswriter/pom.xml b/otswriter/pom.xml new file mode 100644 index 000000000..ed9e64486 --- /dev/null +++ b/otswriter/pom.xml @@ -0,0 +1,89 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + otswriter + otswriter + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.aliyun.openservices + ots-public + 2.1 + + + + com.google.code.gson + gson + 2.2.4 + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + org.apache.maven.plugins + maven-surefire-plugin + 2.5 + + + **/unittest/*.java + **/functiontest/*.java + + + + + + diff --git a/otswriter/src/main/assembly/package.xml b/otswriter/src/main/assembly/package.xml new file mode 100644 index 000000000..5ae7a0151 --- /dev/null +++ b/otswriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/otswriter + + + target/ + + otswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/otswriter + + + + + + false + plugin/writer/otswriter/libs + runtime + + + diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/Key.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/Key.java new file mode 100644 index 000000000..0724b9cf6 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/Key.java @@ -0,0 +1,36 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.plugin.writer.otswriter; + +public final class Key { + + public final static String OTS_ENDPOINT = "endpoint"; + + public final static String OTS_ACCESSID = "accessId"; + + public final static String OTS_ACCESSKEY = "accessKey"; + + public final static String OTS_INSTANCE_NAME = "instanceName"; + + public final static String TABLE_NAME = "table"; + + public final static String PRIMARY_KEY = "primaryKey"; + + public final static String COLUMN = "column"; + + public final static String WRITE_MODE = "writeMode"; +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriter.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriter.java new file mode 100644 index 000000000..4d2ed17b3 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriter.java @@ -0,0 +1,92 @@ +package com.alibaba.datax.plugin.writer.otswriter; + +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.otswriter.utils.Common; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSException; + +public class OtsWriter { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + private OtsWriterMasterProxy proxy = new OtsWriterMasterProxy(); + + @Override + public void init() { + LOG.info("init() begin ..."); + try { + this.proxy.init(getPluginJobConf()); + } catch (OTSException e) { + LOG.error("OTSException: {}", e.getMessage(), e); + throw DataXException.asDataXException(new OtsWriterError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException: {}", e.getMessage(), e); + throw DataXException.asDataXException(new OtsWriterError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.ERROR, Common.getDetailMessage(e), e); + } + LOG.info("init() end ..."); + } + + @Override + public void destroy() { + this.proxy.close(); + } + + @Override + public List split(int mandatoryNumber) { + try { + return this.proxy.split(mandatoryNumber); + } catch (Exception e) { + LOG.error("Exception. 
ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.ERROR, Common.getDetailMessage(e), e); + } + } + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + private OtsWriterSlaveProxy proxy = new OtsWriterSlaveProxy(); + + @Override + public void init() {} + + @Override + public void destroy() { + this.proxy.close(); + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + LOG.info("startWrite() begin ..."); + try { + this.proxy.init(this.getPluginJobConf()); + this.proxy.write(lineReceiver, this.getTaskPluginCollector()); + } catch (OTSException e) { + LOG.error("OTSException: {}", e.getMessage(), e); + throw DataXException.asDataXException(new OtsWriterError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException: {}", e.getMessage(), e); + throw DataXException.asDataXException(new OtsWriterError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.ERROR, Common.getDetailMessage(e), e); + } + LOG.info("startWrite() end ..."); + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterError.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterError.java new file mode 100644 index 000000000..67d1ee2b7 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterError.java @@ -0,0 +1,46 @@ +package com.alibaba.datax.plugin.writer.otswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public class OtsWriterError implements ErrorCode { + + private String code; + + private String description; + + // TODO + // 这一块需要DATAX来统一定义分类, OTS基于这些分类在细化 + // 所以暂定两个基础的Error Code,其他错误统一使用OTS的错误码和错误消息 + + public final static OtsWriterError ERROR = new OtsWriterError( + "OtsWriterError", + "该错误表示插件的内部错误,表示系统没有处理到的异常"); + public final static OtsWriterError INVALID_PARAM = new OtsWriterError( + "OtsWriterInvalidParameter", + "该错误表示参数错误,表示用户输入了错误的参数格式等"); + + public OtsWriterError (String code) { + this.code = code; + this.description = code; + } + + public OtsWriterError (String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return this.code; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterMasterProxy.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterMasterProxy.java new file mode 100644 index 000000000..bf61aa681 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterMasterProxy.java @@ -0,0 +1,106 @@ +package com.alibaba.datax.plugin.writer.otswriter; + +import java.util.ArrayList; +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.otswriter.callable.GetTableMetaCallable; +import 
com.alibaba.datax.plugin.writer.otswriter.model.OTSConf; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConf.RestrictConf; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConst; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSOpType; +import com.alibaba.datax.plugin.writer.otswriter.utils.GsonParser; +import com.alibaba.datax.plugin.writer.otswriter.utils.ParamChecker; +import com.alibaba.datax.plugin.writer.otswriter.utils.RetryHelper; +import com.alibaba.datax.plugin.writer.otswriter.utils.WriterModelParser; +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.TableMeta; + +public class OtsWriterMasterProxy { + + private OTSConf conf = new OTSConf(); + + private OTSClient ots = null; + + private TableMeta meta = null; + + private static final Logger LOG = LoggerFactory.getLogger(OtsWriterMasterProxy.class); + + /** + * @param param + * @throws Exception + */ + public void init(Configuration param) throws Exception { + + // 默认参数 + conf.setRetry(param.getInt(OTSConst.RETRY, 18)); + conf.setSleepInMilliSecond(param.getInt(OTSConst.SLEEP_IN_MILLI_SECOND, 100)); + conf.setBatchWriteCount(param.getInt(OTSConst.BATCH_WRITE_COUNT, 100)); + conf.setConcurrencyWrite(param.getInt(OTSConst.CONCURRENCY_WRITE, 5)); + conf.setIoThreadCount(param.getInt(OTSConst.IO_THREAD_COUNT, 1)); + conf.setSocketTimeout(param.getInt(OTSConst.SOCKET_TIMEOUT, 60000)); + conf.setConnectTimeout(param.getInt(OTSConst.CONNECT_TIMEOUT, 60000)); + + RestrictConf restrictConf = conf.new RestrictConf(); + restrictConf.setRequestTotalSizeLimition(param.getInt(OTSConst.REQUEST_TOTAL_SIZE_LIMITION, 1024*1024)); + conf.setRestrictConf(restrictConf); + + // 必选参数 + conf.setEndpoint(ParamChecker.checkStringAndGet(param, Key.OTS_ENDPOINT)); + conf.setAccessId(ParamChecker.checkStringAndGet(param, Key.OTS_ACCESSID)); + conf.setAccessKey(ParamChecker.checkStringAndGet(param, Key.OTS_ACCESSKEY)); + conf.setInstanceName(ParamChecker.checkStringAndGet(param, Key.OTS_INSTANCE_NAME)); + conf.setTableName(ParamChecker.checkStringAndGet(param, Key.TABLE_NAME)); + + conf.setOperation(WriterModelParser.parseOTSOpType(ParamChecker.checkStringAndGet(param, Key.WRITE_MODE))); + + ots = new OTSClient( + this.conf.getEndpoint(), + this.conf.getAccessId(), + this.conf.getAccessKey(), + this.conf.getInstanceName()); + + meta = getTableMeta(ots, conf.getTableName()); + LOG.info("Table Meta : {}", GsonParser.metaToJson(meta)); + + conf.setPrimaryKeyColumn(WriterModelParser.parseOTSPKColumnList(ParamChecker.checkListAndGet(param, Key.PRIMARY_KEY, true))); + ParamChecker.checkPrimaryKey(meta, conf.getPrimaryKeyColumn()); + + conf.setAttributeColumn(WriterModelParser.parseOTSAttrColumnList(ParamChecker.checkListAndGet(param, Key.COLUMN, conf.getOperation() == OTSOpType.PUT_ROW ? 
false : true))); + ParamChecker.checkAttribute(conf.getAttributeColumn()); + } + + public List split(int mandatoryNumber){ + LOG.info("Begin split and MandatoryNumber : {}", mandatoryNumber); + List configurations = new ArrayList(); + for (int i = 0; i < mandatoryNumber; i++) { + Configuration configuration = Configuration.newDefault(); + configuration.set(OTSConst.OTS_CONF, GsonParser.confToJson(this.conf)); + configurations.add(configuration); + } + LOG.info("End split."); + assert(mandatoryNumber == configurations.size()); + return configurations; + } + + public void close() { + ots.shutdown(); + } + + public OTSConf getOTSConf() { + return conf; + } + + // private function + + private TableMeta getTableMeta(OTSClient ots, String tableName) throws Exception { + return RetryHelper.executeWithRetry( + new GetTableMetaCallable(ots, tableName), + conf.getRetry(), + conf.getSleepInMilliSecond() + ); + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterSlaveProxy.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterSlaveProxy.java new file mode 100644 index 000000000..3e9854d8e --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterSlaveProxy.java @@ -0,0 +1,86 @@ +package com.alibaba.datax.plugin.writer.otswriter; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConf; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConst; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSErrorMessage; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSLine; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSSendBuffer; +import com.alibaba.datax.plugin.writer.otswriter.utils.DefaultNoRetry; +import com.alibaba.datax.plugin.writer.otswriter.utils.GsonParser; +import com.aliyun.openservices.ots.ClientConfiguration; +import com.aliyun.openservices.ots.OTS; +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.OTSServiceConfiguration; + + +public class OtsWriterSlaveProxy { + + private static final Logger LOG = LoggerFactory.getLogger(OtsWriterSlaveProxy.class); + + private OTSConf conf = null; + private OTSSendBuffer buffer = null; + private OTS ots = null; + + public void init(Configuration configuration) { + conf = GsonParser.jsonToConf(configuration.getString(OTSConst.OTS_CONF)); + + ClientConfiguration clientConfigure = new ClientConfiguration(); + clientConfigure.setIoThreadCount(conf.getIoThreadCount()); + clientConfigure.setMaxConnections(conf.getConcurrencyWrite()); + clientConfigure.setSocketTimeoutInMillisecond(conf.getSocketTimeout()); + clientConfigure.setConnectionTimeoutInMillisecond(conf.getConnectTimeout()); + + OTSServiceConfiguration otsConfigure = new OTSServiceConfiguration(); + otsConfigure.setRetryStrategy(new DefaultNoRetry()); + + ots = new OTSClient( + conf.getEndpoint(), + conf.getAccessId(), + conf.getAccessKey(), + conf.getInstanceName(), + clientConfigure, + otsConfigure); + } + + public void close() { + ots.shutdown(); + } + + public void write(RecordReceiver recordReceiver, TaskPluginCollector collector) throws Exception { + LOG.info("write begin."); + int expectColumnCount = conf.getPrimaryKeyColumn().size() + 
conf.getAttributeColumn().size(); + Record record = null; + buffer = new OTSSendBuffer(ots, collector, conf); + while ((record = recordReceiver.getFromReader()) != null) { + + LOG.debug("Record Raw: {}", record.toString()); + + int columnCount = record.getColumnNumber(); + if (columnCount != expectColumnCount) { + // 如果Column的个数和预期的个数不一致时,认为是系统故障或者用户配置Column错误,异常退出 + throw new IllegalArgumentException(String.format(OTSErrorMessage.RECORD_AND_COLUMN_SIZE_ERROR, columnCount, expectColumnCount)); + } + + OTSLine line = null; + + // 类型转换 + try { + line = new OTSLine(conf.getTableName(), conf.getOperation(), record, conf.getPrimaryKeyColumn(), conf.getAttributeColumn()); + } catch (IllegalArgumentException e) { + LOG.warn("Convert fail。", e); + collector.collectDirtyRecord(record, e.getMessage()); + continue; + } + buffer.write(line); + } + buffer.close(); + LOG.info("write end."); + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/BatchWriteRowCallable.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/BatchWriteRowCallable.java new file mode 100644 index 000000000..3ece0905b --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/BatchWriteRowCallable.java @@ -0,0 +1,25 @@ +package com.alibaba.datax.plugin.writer.otswriter.callable; + +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTS; +import com.aliyun.openservices.ots.model.BatchWriteRowRequest; +import com.aliyun.openservices.ots.model.BatchWriteRowResult; + +public class BatchWriteRowCallable implements Callable{ + + private OTS ots = null; + private BatchWriteRowRequest batchWriteRowRequest = null; + + public BatchWriteRowCallable(OTS ots, BatchWriteRowRequest batchWriteRowRequest) { + this.ots = ots; + this.batchWriteRowRequest = batchWriteRowRequest; + + } + + @Override + public BatchWriteRowResult call() throws Exception { + return ots.batchWriteRow(batchWriteRowRequest); + } + +} \ No newline at end of file diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/GetTableMetaCallable.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/GetTableMetaCallable.java new file mode 100644 index 000000000..d4128e14c --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/GetTableMetaCallable.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.plugin.writer.otswriter.callable; + +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.DescribeTableRequest; +import com.aliyun.openservices.ots.model.DescribeTableResult; +import com.aliyun.openservices.ots.model.TableMeta; + +public class GetTableMetaCallable implements Callable{ + + private OTSClient ots = null; + private String tableName = null; + + public GetTableMetaCallable(OTSClient ots, String tableName) { + this.ots = ots; + this.tableName = tableName; + } + + @Override + public TableMeta call() throws Exception { + DescribeTableRequest describeTableRequest = new DescribeTableRequest(); + describeTableRequest.setTableName(tableName); + DescribeTableResult result = ots.describeTable(describeTableRequest); + TableMeta tableMeta = result.getTableMeta(); + return tableMeta; + } + +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/PutRowChangeCallable.java 
b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/PutRowChangeCallable.java new file mode 100644 index 000000000..98be2944e --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/PutRowChangeCallable.java @@ -0,0 +1,24 @@ +package com.alibaba.datax.plugin.writer.otswriter.callable; + +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTS; +import com.aliyun.openservices.ots.model.PutRowRequest; +import com.aliyun.openservices.ots.model.PutRowResult; + +public class PutRowChangeCallable implements Callable{ + + private OTS ots = null; + private PutRowRequest putRowRequest = null; + + public PutRowChangeCallable(OTS ots, PutRowRequest putRowRequest) { + this.ots = ots; + this.putRowRequest = putRowRequest; + } + + @Override + public PutRowResult call() throws Exception { + return ots.putRow(putRowRequest); + } + +} \ No newline at end of file diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/UpdateRowChangeCallable.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/UpdateRowChangeCallable.java new file mode 100644 index 000000000..26246d906 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/UpdateRowChangeCallable.java @@ -0,0 +1,24 @@ +package com.alibaba.datax.plugin.writer.otswriter.callable; + +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTS; +import com.aliyun.openservices.ots.model.UpdateRowRequest; +import com.aliyun.openservices.ots.model.UpdateRowResult; + +public class UpdateRowChangeCallable implements Callable{ + + private OTS ots = null; + private UpdateRowRequest updateRowRequest = null; + + public UpdateRowChangeCallable(OTS ots, UpdateRowRequest updateRowRequest ) { + this.ots = ots; + this.updateRowRequest = updateRowRequest; + } + + @Override + public UpdateRowResult call() throws Exception { + return ots.updateRow(updateRowRequest); + } + +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/LogExceptionManager.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/LogExceptionManager.java new file mode 100644 index 000000000..93175ddb1 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/LogExceptionManager.java @@ -0,0 +1,58 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.aliyun.openservices.ots.OTSErrorCode; +import com.aliyun.openservices.ots.OTSException; + +/** + * 添加这个类的主要目的是为了解决当用户遇到CU不够时,打印大量的日志 + * @author redchen + * + */ +public class LogExceptionManager { + + private long count = 0; + private long updateTimestamp = 0; + + private static final Logger LOG = LoggerFactory.getLogger(LogExceptionManager.class); + + private synchronized void countAndReset() { + count++; + long cur = System.currentTimeMillis(); + long interval = cur - updateTimestamp; + if (interval >= 10000) { + LOG.warn("Call callable fail, OTSNotEnoughCapacityUnit, total times:"+ count +", time range:"+ (interval/1000) +"s, times per second:" + ((float)count / (interval/1000))); + count = 0; + updateTimestamp = cur; + } + } + + public synchronized void addException(Exception exception) { + if (exception instanceof OTSException) { + OTSException e = (OTSException)exception; + if (e.getErrorCode().equals(OTSErrorCode.NOT_ENOUGH_CAPACITY_UNIT)) { + countAndReset(); + } else { + 
LOG.warn( + "Call callable fail, OTSException:ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()} + ); + } + } else { + LOG.warn("Call callable fail, {}", exception.getMessage()); + } + } + + public synchronized void addException(com.aliyun.openservices.ots.model.Error error, String requestId) { + if (error.getCode().equals(OTSErrorCode.NOT_ENOUGH_CAPACITY_UNIT)) { + countAndReset(); + } else { + LOG.warn( + "OTSException:ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{error.getCode(), error.getMessage(), requestId} + ); + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSAttrColumn.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSAttrColumn.java new file mode 100644 index 000000000..d37960e00 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSAttrColumn.java @@ -0,0 +1,21 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import com.aliyun.openservices.ots.model.ColumnType; + +public class OTSAttrColumn { + private String name; + private ColumnType type; + + public OTSAttrColumn(String name, ColumnType type) { + this.name = name; + this.type = type; + } + + public String getName() { + return name; + } + + public ColumnType getType() { + return type; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSBatchWriteRowTaskManager.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSBatchWriteRowTaskManager.java new file mode 100644 index 000000000..5882ed1e5 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSBatchWriteRowTaskManager.java @@ -0,0 +1,46 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.aliyun.openservices.ots.OTS; + +/** + * 控制Task的并发数目 + * + */ +public class OTSBatchWriteRowTaskManager { + + private OTS ots = null; + private TaskPluginCollector collector = null; + private OTSBlockingExecutor executorService = null; + private OTSConf conf = null; + + private static final Logger LOG = LoggerFactory.getLogger(OTSBatchWriteRowTaskManager.class); + + public OTSBatchWriteRowTaskManager( + OTS ots, + TaskPluginCollector collector, + OTSConf conf) { + this.ots = ots; + this.collector = collector; + this.conf = conf; + + executorService = new OTSBlockingExecutor(conf.getConcurrencyWrite()); + } + + public void execute(List lines) throws Exception { + LOG.debug("Begin execute."); + executorService.execute(new OTSBatchWriterRowTask(collector, ots, conf, lines)); + LOG.debug("End execute."); + } + + public void close() throws Exception { + LOG.debug("Begin close."); + executorService.shutdown(); + LOG.debug("End close."); + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSBatchWriterRowTask.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSBatchWriterRowTask.java new file mode 100644 index 000000000..47ada2766 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSBatchWriterRowTask.java @@ -0,0 +1,226 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import java.util.ArrayList; +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import 
com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.plugin.writer.otswriter.callable.BatchWriteRowCallable; +import com.alibaba.datax.plugin.writer.otswriter.callable.PutRowChangeCallable; +import com.alibaba.datax.plugin.writer.otswriter.callable.UpdateRowChangeCallable; +import com.alibaba.datax.plugin.writer.otswriter.utils.Common; +import com.alibaba.datax.plugin.writer.otswriter.utils.RetryHelper; +import com.aliyun.openservices.ots.OTS; +import com.aliyun.openservices.ots.OTSErrorCode; +import com.aliyun.openservices.ots.OTSException; +import com.aliyun.openservices.ots.model.BatchWriteRowResult; +import com.aliyun.openservices.ots.model.BatchWriteRowResult.RowStatus; +import com.aliyun.openservices.ots.model.BatchWriteRowRequest; +import com.aliyun.openservices.ots.model.Error; +import com.aliyun.openservices.ots.model.PutRowRequest; +import com.aliyun.openservices.ots.model.PutRowResult; +import com.aliyun.openservices.ots.model.RowPutChange; +import com.aliyun.openservices.ots.model.RowUpdateChange; +import com.aliyun.openservices.ots.model.UpdateRowRequest; +import com.aliyun.openservices.ots.model.UpdateRowResult; + +public class OTSBatchWriterRowTask implements Runnable { + private TaskPluginCollector collector = null; + private OTS ots = null; + private OTSConf conf = null; + private List otsLines = new ArrayList(); + + private boolean isDone = false; + private int retryTimes = 0; + + private static final Logger LOG = LoggerFactory.getLogger(OTSBatchWriterRowTask.class); + + public OTSBatchWriterRowTask( + TaskPluginCollector collector, + OTS ots, + OTSConf conf, + List lines + ) { + this.collector = collector; + this.ots = ots; + this.conf = conf; + + this.otsLines.addAll(lines); + } + + @Override + public void run() { + LOG.debug("Begin run"); + try { + sendAll(otsLines); + } catch (InterruptedException e) { + LOG.error(e.getMessage()); + } + LOG.debug("End run"); + } + + public boolean isDone() { + return this.isDone; + } + + private boolean isExceptionForSendOneByOne(Exception e) { + if (e instanceof OTSException) { + OTSException ee = (OTSException)e; + if (ee.getErrorCode().equals(OTSErrorCode.INVALID_PARAMETER)|| + ee.getErrorCode().equals(OTSErrorCode.REQUEST_TOO_LARGE) + ) { + return true; + } + } + return false; + } + + private BatchWriteRowRequest createRequset(List lines) { + BatchWriteRowRequest newRequst = new BatchWriteRowRequest(); + switch (conf.getOperation()) { + case PUT_ROW: + for (OTSLine l : lines) { + newRequst.addRowPutChange((RowPutChange) l.getAttr()); + } + break; + case UPDATE_ROW: + for (OTSLine l : lines) { + newRequst.addRowUpdateChange((RowUpdateChange) l.getAttr()); + } + break; + default: + throw new RuntimeException(String.format(OTSErrorMessage.OPERATION_PARSE_ERROR, conf.getOperation())); + } + return newRequst; + } + + private void sendLine(OTSLine line) throws InterruptedException { + try { + switch (conf.getOperation()) { + case PUT_ROW: + PutRowRequest putRowRequest = new PutRowRequest(); + putRowRequest.setRowChange((RowPutChange) line.getAttr()); + PutRowResult putResult = RetryHelper.executeWithRetry( + new PutRowChangeCallable(ots, putRowRequest), + conf.getRetry(), + conf.getSleepInMilliSecond()); + LOG.debug("Requst ID : {}", putResult.getRequestID()); + break; + case UPDATE_ROW: + UpdateRowRequest updateRowRequest = new UpdateRowRequest(); + updateRowRequest.setRowChange((RowUpdateChange) line.getAttr()); + UpdateRowResult updateResult = RetryHelper.executeWithRetry( + new UpdateRowChangeCallable(ots, 
updateRowRequest), + conf.getRetry(), + conf.getSleepInMilliSecond()); + LOG.debug("Requst ID : {}", updateResult.getRequestID()); + break; + } + } catch (Exception e) { + LOG.error("Can not send line to OTS. {}", e.getMessage()); + collector.collectDirtyRecord(line.getRecord(), e.getMessage()); + } + } + + private void sendAllOneByOne(List lines) throws InterruptedException { + for (OTSLine l : lines) { + sendLine(l); + } + } + + private void sendAll(List lines) throws InterruptedException { + Thread.sleep(Common.getDelaySendMillinSeconds(retryTimes, conf.getSleepInMilliSecond())); + try { + BatchWriteRowRequest batchWriteRowRequest = createRequset(lines); + BatchWriteRowResult result = RetryHelper.executeWithRetry( + new BatchWriteRowCallable(ots, batchWriteRowRequest), + conf.getRetry(), + conf.getSleepInMilliSecond()); + + LOG.debug("Requst ID : {}", result.getRequestID()); + List errors = getLineAndError(result, lines); + if (!errors.isEmpty()){ + if(retryTimes < conf.getRetry()) { + retryTimes++; + LOG.debug("Retry times : {}", retryTimes); + List newLines = new ArrayList(); + for (LineAndError re : errors) { + if (RetryHelper.canRetry(re.getError().getCode())) { + RetryHelper.logManager.addException(re.getError(), result.getRequestID()); + newLines.add(re.getLine()); + } else { + LOG.error("Can not retry, record row to collector. {}", re.getError().getMessage()); + collector.collectDirtyRecord(re.getLine().getRecord(), re.getError().getMessage()); + } + } + if (!newLines.isEmpty()) { + sendAll(newLines); + } + } else { + LOG.error("Retry times more than limition. RetryTime : {}", retryTimes); + Common.collectDirtyRecord(collector, errors); + } + } + } catch (Exception e) { + LOG.debug("Send data fail.", e); + if (isExceptionForSendOneByOne(e)) { + if (lines.size() == 1) { + LOG.error("Can not retry for Exception : {}", e.getMessage()); + Common.collectDirtyRecord(collector, lines, e.getMessage()); + } else { + // 进入单行发送的分支 + sendAllOneByOne(lines); + } + } else { + LOG.error("Can not send lines to OTS. 
{}", e.getMessage()); + Common.collectDirtyRecord(collector, lines, e.getMessage()); + } + } + } + + private List getLineAndError(BatchWriteRowResult result, List lines) { + List errors = new ArrayList(); + + switch(conf.getOperation()) { + case PUT_ROW: + List putStatus = result.getPutRowStatus(conf.getTableName()); + for (int i = 0; i < putStatus.size(); i++) { + if (!putStatus.get(i).isSucceed()) { + errors.add(new LineAndError(lines.get(i), putStatus.get(i).getError())); + } + } + break; + case UPDATE_ROW: + List updateStatus = result.getUpdateRowStatus(conf.getTableName()); + for (int i = 0; i < updateStatus.size(); i++) { + if (!updateStatus.get(i).isSucceed()) { + errors.add(new LineAndError(lines.get(i), updateStatus.get(i).getError())); + } + } + break; + default: + throw new RuntimeException(String.format(OTSErrorMessage.OPERATION_PARSE_ERROR, conf.getOperation())); + } + return errors; + } + + public class LineAndError { + private OTSLine line; + private Error error; + + public LineAndError(OTSLine record, Error error) { + this.line = record; + this.error = error; + } + + public OTSLine getLine() { + return line; + } + + public Error getError() { + return error; + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSBlockingExecutor.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSBlockingExecutor.java new file mode 100644 index 000000000..fab0f11fc --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSBlockingExecutor.java @@ -0,0 +1,52 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import java.util.concurrent.ExecutorService; +import java.util.concurrent.LinkedBlockingQueue; +import java.util.concurrent.RejectedExecutionException; +import java.util.concurrent.Semaphore; +import java.util.concurrent.ThreadPoolExecutor; +import java.util.concurrent.TimeUnit; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class OTSBlockingExecutor { + private final ExecutorService exec; + private final Semaphore semaphore; + + private static final Logger LOG = LoggerFactory.getLogger(OTSBlockingExecutor.class); + + public OTSBlockingExecutor(int concurrency) { + this.exec = new ThreadPoolExecutor( + concurrency, concurrency, + 0L, TimeUnit.SECONDS, + new LinkedBlockingQueue()); + this.semaphore = new Semaphore(concurrency); + } + + public void execute(final Runnable task) + throws InterruptedException { + LOG.debug("Begin execute"); + try { + semaphore.acquire(); + exec.execute(new Runnable() { + public void run() { + try { + task.run(); + } finally { + semaphore.release(); + } + } + }); + } catch (RejectedExecutionException e) { + semaphore.release(); + throw new RuntimeException(OTSErrorMessage.INSERT_TASK_ERROR); + } + LOG.debug("End execute"); + } + + public void shutdown() throws InterruptedException { + this.exec.shutdown(); + while (!this.exec.awaitTermination(1, TimeUnit.SECONDS)){} + } +} \ No newline at end of file diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConf.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConf.java new file mode 100644 index 000000000..ec86d6de2 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConf.java @@ -0,0 +1,136 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import java.util.List; + +public class OTSConf { + private String endpoint= null; + private String accessId 
= null; + private String accessKey = null; + private String instanceName = null; + private String tableName = null; + + private List primaryKeyColumn = null; + private List attributeColumn = null; + + private int retry = -1; + private int sleepInMilliSecond = -1; + private int batchWriteCount = -1; + private int concurrencyWrite = -1; + private int ioThreadCount = -1; + private int socketTimeout = -1; + private int connectTimeout = -1; + + private OTSOpType operation = null; + + private RestrictConf restrictConf = null; + + //限制项 + public class RestrictConf { + private int requestTotalSizeLimition = -1; + + public int getRequestTotalSizeLimition() { + return requestTotalSizeLimition; + } + public void setRequestTotalSizeLimition(int requestTotalSizeLimition) { + this.requestTotalSizeLimition = requestTotalSizeLimition; + } + } + + public RestrictConf getRestrictConf() { + return restrictConf; + } + public void setRestrictConf(RestrictConf restrictConf) { + this.restrictConf = restrictConf; + } + public OTSOpType getOperation() { + return operation; + } + public void setOperation(OTSOpType operation) { + this.operation = operation; + } + public List getPrimaryKeyColumn() { + return primaryKeyColumn; + } + public void setPrimaryKeyColumn(List primaryKeyColumn) { + this.primaryKeyColumn = primaryKeyColumn; + } + + public int getConcurrencyWrite() { + return concurrencyWrite; + } + public void setConcurrencyWrite(int concurrencyWrite) { + this.concurrencyWrite = concurrencyWrite; + } + public int getBatchWriteCount() { + return batchWriteCount; + } + public void setBatchWriteCount(int batchWriteCount) { + this.batchWriteCount = batchWriteCount; + } + public String getEndpoint() { + return endpoint; + } + public void setEndpoint(String endpoint) { + this.endpoint = endpoint; + } + public String getAccessId() { + return accessId; + } + public void setAccessId(String accessId) { + this.accessId = accessId; + } + public String getAccessKey() { + return accessKey; + } + public void setAccessKey(String accessKey) { + this.accessKey = accessKey; + } + public String getInstanceName() { + return instanceName; + } + public void setInstanceName(String instanceName) { + this.instanceName = instanceName; + } + public String getTableName() { + return tableName; + } + public void setTableName(String tableName) { + this.tableName = tableName; + } + public List getAttributeColumn() { + return attributeColumn; + } + public void setAttributeColumn(List attributeColumn) { + this.attributeColumn = attributeColumn; + } + public int getRetry() { + return retry; + } + public void setRetry(int retry) { + this.retry = retry; + } + public int getSleepInMilliSecond() { + return sleepInMilliSecond; + } + public void setSleepInMilliSecond(int sleepInMilliSecond) { + this.sleepInMilliSecond = sleepInMilliSecond; + } + public int getIoThreadCount() { + return ioThreadCount; + } + public void setIoThreadCount(int ioThreadCount) { + this.ioThreadCount = ioThreadCount; + } + public int getSocketTimeout() { + return socketTimeout; + } + public void setSocketTimeout(int socketTimeout) { + this.socketTimeout = socketTimeout; + } + public int getConnectTimeout() { + return connectTimeout; + } + public void setConnectTimeout(int connectTimeout) { + this.connectTimeout = connectTimeout; + } +} \ No newline at end of file diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConst.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConst.java new file mode 100644 index 
000000000..864d8ac2c --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConst.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +public class OTSConst { + // Reader support type + public final static String TYPE_STRING = "STRING"; + public final static String TYPE_INTEGER = "INT"; + public final static String TYPE_DOUBLE = "DOUBLE"; + public final static String TYPE_BOOLEAN = "BOOL"; + public final static String TYPE_BINARY = "BINARY"; + + // Column + public final static String NAME = "name"; + public final static String TYPE = "type"; + + public final static String OTS_CONF = "OTS_CONF"; + + public final static String OTS_OP_TYPE_PUT = "PutRow"; + public final static String OTS_OP_TYPE_UPDATE = "UpdateRow"; + + // options + public final static String RETRY = "maxRetryTime"; + public final static String SLEEP_IN_MILLI_SECOND = "retrySleepInMillionSecond"; + public final static String BATCH_WRITE_COUNT = "batchWriteCount"; + public final static String CONCURRENCY_WRITE = "concurrencyWrite"; + public final static String IO_THREAD_COUNT = "ioThreadCount"; + public final static String MAX_CONNECT_COUNT = "maxConnectCount"; + public final static String SOCKET_TIMEOUT = "socketTimeoutInMillionSecond"; + public final static String CONNECT_TIMEOUT = "connectTimeoutInMillionSecond"; + + // 限制项 + public final static String REQUEST_TOTAL_SIZE_LIMITION = "requestTotalSizeLimition"; +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSErrorMessage.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSErrorMessage.java new file mode 100644 index 000000000..2406c0567 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSErrorMessage.java @@ -0,0 +1,66 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +public class OTSErrorMessage { + + public static final String OPERATION_PARSE_ERROR = "The 'writeMode' only support 'PutRow' and 'UpdateRow' not '%s'."; + + public static final String UNSUPPORT_PARSE = "Unsupport parse '%s' to '%s'."; + + public static final String RECORD_AND_COLUMN_SIZE_ERROR = "Size of record not equal size of config column. 
record size : %d, config column size : %d."; + + public static final String PK_TYPE_ERROR = "Primary key type only support 'string' and 'int', not support '%s'."; + + public static final String ATTR_TYPE_ERROR = "Column type only support 'string','int','double','bool' and 'binary', not support '%s'."; + + public static final String PK_COLUMN_MISSING_ERROR = "Missing the column '%s' in 'primaryKey'."; + + public static final String INPUT_PK_COUNT_NOT_EQUAL_META_ERROR = "The count of 'primaryKey' not equal meta, input count : %d, primary key count : %d in meta."; + + public static final String INPUT_PK_TYPE_NOT_MATCH_META_ERROR = "The type of 'primaryKey' not match meta, column name : %s, input type: %s, primary key type : %s in meta."; + + public static final String ATTR_REPEAT_COLUMN_ERROR = "Repeat column '%s' in 'column'."; + + public static final String MISSING_PARAMTER_ERROR = "The param '%s' is not exist."; + + public static final String PARAMTER_STRING_IS_EMPTY_ERROR = "The param length of '%s' is zero."; + + public static final String PARAMETER_LIST_IS_EMPTY_ERROR = "The param '%s' is a empty json array."; + + public static final String PARAMETER_IS_NOT_ARRAY_ERROR = "The param '%s' is not a json array."; + + public static final String PARAMETER_IS_NOT_MAP_ERROR = "The param '%s' is not a json map."; + + public static final String PARSE_TO_LIST_ERROR = "Can not parse '%s' to list."; + + public static final String PK_MAP_NAME_TYPE_ERROR = "The 'name' and 'type only support string in json map of 'primaryKey'."; + + public static final String ATTR_MAP_NAME_TYPE_ERROR = "The 'name' and 'type only support string in json map of 'column'."; + + public static final String PK_MAP_INCLUDE_NAME_TYPE_ERROR = "The only support 'name' and 'type' fileds in json map of 'primaryKey'."; + + public static final String ATTR_MAP_INCLUDE_NAME_TYPE_ERROR = "The only support 'name' and 'type' fileds in json map of 'column'."; + + public static final String PK_ITEM_IS_NOT_MAP_ERROR = "The item is not map in 'primaryKey'."; + + public static final String ATTR_ITEM_IS_NOT_MAP_ERROR = "The item is not map in 'column'."; + + public static final String PK_COLUMN_NAME_IS_EMPTY_ERROR = "The name of item can not be a empty string in 'primaryKey'."; + + public static final String ATTR_COLUMN_NAME_IS_EMPTY_ERROR = "The name of item can not be a empty string in 'column'."; + + public static final String MULTI_ATTR_COLUMN_ERROR = "Multi item in 'column', column name : %s ."; + + public static final String COLUMN_CONVERSION_ERROR = "Column coversion error, src type : %s, src value: %s, expect type: %s ."; + + public static final String PK_COLUMN_VALUE_IS_NULL_ERROR = "The column of record is NULL, primary key name : %s ."; + + public static final String PK_STRONG_LENGTH_ERROR = "The length of pk string value is more than configuration, conf: %d, input: %d ."; + + public static final String ATTR_STRING_LENGTH_ERROR = "The length of attr string value is more than configuration, conf: %d, input: %d ."; + + public static final String BINARY_LENGTH_ERROR = "The length of binary value is more than configuration, conf: %d, input: %d ."; + + public static final String LINE_LENGTH_ERROR = "The length of row is more than length of request configuration, conf: %d, row: %d ."; + + public static final String INSERT_TASK_ERROR = "Can not execute the task, becase the ExecutorService is shutdown."; +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSLine.java 
b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSLine.java new file mode 100644 index 000000000..3829773bf --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSLine.java @@ -0,0 +1,42 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import java.util.List; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.plugin.writer.otswriter.utils.Common; +import com.alibaba.datax.plugin.writer.otswriter.utils.SizeCalculateHelper; +import com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.RowChange; +import com.aliyun.openservices.ots.model.RowPrimaryKey; + +public class OTSLine { + private int dataSize = 0; + private Record record = null; + private RowPrimaryKey pk = null; + private RowChange attr = null; + + public OTSLine(String tableName, OTSOpType type, Record record, List pkColumns, List attrColumns) { + List> values = Common.getAttrFromRecord(pkColumns.size(), attrColumns, record); + + this.record = record; + this.pk = Common.getPKFromRecord(pkColumns, record); + this.attr = Common.columnValuesToRowChange(tableName, type, pk, values); + this.dataSize = SizeCalculateHelper.getRowPrimaryKeySize(this.pk) + SizeCalculateHelper.getAttributeColumnSize(values, type); + } + + public Record getRecord() { + return record; + } + + public RowPrimaryKey getPk() { + return pk; + } + + public int getDataSize() { + return dataSize; + } + + public RowChange getAttr() { + return attr; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSOpType.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSOpType.java new file mode 100644 index 000000000..9b927d95f --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSOpType.java @@ -0,0 +1,6 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +public enum OTSOpType { + PUT_ROW, + UPDATE_ROW +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSPKColumn.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSPKColumn.java new file mode 100644 index 000000000..c873cb963 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSPKColumn.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import com.aliyun.openservices.ots.model.PrimaryKeyType; + +public class OTSPKColumn { + private String name; + private PrimaryKeyType type; + + public OTSPKColumn(String name, PrimaryKeyType type) { + this.name = name; + this.type = type; + } + + public PrimaryKeyType getType() { + return type; + } + + public String getName() { + return name; + } + +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSRowPrimaryKey.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSRowPrimaryKey.java new file mode 100644 index 000000000..d89d50177 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSRowPrimaryKey.java @@ -0,0 +1,61 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import java.util.Map; +import java.util.Map.Entry; + +import com.aliyun.openservices.ots.model.PrimaryKeyValue; + +public class OTSRowPrimaryKey { + + private Map columns; + + public OTSRowPrimaryKey(Map columns) { + if (null == columns) { + throw new IllegalArgumentException("Input columns can not be null."); + } 
+ this.columns = columns; + } + + public Map getColumns() { + return columns; + } + + @Override + public int hashCode() { + int result = 31; + for (Entry entry : columns.entrySet()) { + result = result ^ entry.getKey().hashCode() ^ entry.getValue().hashCode(); + } + return result; + } + + @Override + public boolean equals(Object obj) { + if (this == obj) { + return true; + } + if (obj == null) { + return false; + } + if (!(obj instanceof OTSRowPrimaryKey)) { + return false; + } + OTSRowPrimaryKey other = (OTSRowPrimaryKey) obj; + + if (columns.size() != other.columns.size()) { + return false; + } + + for (Entry entry : columns.entrySet()) { + PrimaryKeyValue otherValue = other.columns.get(entry.getKey()); + + if (otherValue == null) { + return false; + } + if (!otherValue.equals(entry.getValue())) { + return false; + } + } + return true; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSSendBuffer.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSSendBuffer.java new file mode 100644 index 000000000..b6e14639c --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSSendBuffer.java @@ -0,0 +1,58 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Map; + +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.aliyun.openservices.ots.OTS; + +/** + * 请求满足特定条件就发送数据 + */ +public class OTSSendBuffer { + private int totalSize = 0; + + private OTSConf conf = null; + private OTSBatchWriteRowTaskManager manager = null; + + private Map lines = new HashMap(); + + public OTSSendBuffer( + OTS ots, + TaskPluginCollector collector, + OTSConf conf) { + this.conf = conf; + this.manager = new OTSBatchWriteRowTaskManager(ots, collector, conf); + } + + public void write(OTSLine line) throws Exception { + // 检查是否满足发送条件 + if (lines.size() >= conf.getBatchWriteCount() || + ((totalSize + line.getDataSize()) > conf.getRestrictConf().getRequestTotalSizeLimition() && totalSize > 0)) { + + manager.execute(new ArrayList(lines.values())); + lines.clear(); + totalSize = 0; + } + OTSRowPrimaryKey oPk = new OTSRowPrimaryKey(line.getPk().getPrimaryKey()); + OTSLine old = lines.get(oPk); + if (old != null) { + totalSize -= old.getDataSize(); // 移除相同PK的行的数据大小 + } + lines.put(oPk, line); + totalSize += line.getDataSize(); + } + + public void flush() throws Exception { + // 发送最后剩余的数据 + if (!lines.isEmpty()) { + manager.execute(new ArrayList(lines.values())); + } + } + + public void close() throws Exception { + flush(); + manager.close(); + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/Pair.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/Pair.java new file mode 100644 index 000000000..dccb15c08 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/Pair.java @@ -0,0 +1,24 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +public class Pair { + private KEY key; + private VALUE value; + + public Pair(KEY key, VALUE value) { + this.key = key; + this.value = value; + } + + public KEY getKey() { + return key; + } + public void setKey(KEY key) { + this.key = key; + } + public VALUE getValue() { + return value; + } + public void setValue(VALUE value) { + this.value = value; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ColumnConversion.java 
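`OTSSendBuffer` above keys its pending rows on `OTSRowPrimaryKey`, so the value-based `hashCode`/`equals` shown there is what lets two records targeting the same primary key collapse into a single buffered row before flushing. A minimal sketch of that behaviour (the demo class and the `uid` column name are hypothetical; it assumes the Aliyun OTS SDK's `PrimaryKeyValue` is on the classpath):

```java
import java.util.HashMap;
import java.util.Map;

import com.alibaba.datax.plugin.writer.otswriter.model.OTSRowPrimaryKey;
import com.aliyun.openservices.ots.model.PrimaryKeyValue;

// Hypothetical demo: equal column maps yield equal keys, so the second put()
// replaces the first entry instead of adding a new one.
public class OTSRowPrimaryKeyDemo {
    public static void main(String[] args) {
        Map<String, PrimaryKeyValue> first = new HashMap<String, PrimaryKeyValue>();
        first.put("uid", PrimaryKeyValue.fromLong(1L));

        Map<String, PrimaryKeyValue> second = new HashMap<String, PrimaryKeyValue>();
        second.put("uid", PrimaryKeyValue.fromLong(1L));

        Map<OTSRowPrimaryKey, String> buffered = new HashMap<OTSRowPrimaryKey, String>();
        buffered.put(new OTSRowPrimaryKey(first), "older version of the row");
        buffered.put(new OTSRowPrimaryKey(second), "newer version of the row");

        System.out.println(buffered.size()); // 1 -- same primary key, one buffered row
    }
}
```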
b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ColumnConversion.java new file mode 100644 index 000000000..51162b845 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ColumnConversion.java @@ -0,0 +1,61 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSAttrColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSErrorMessage; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSPKColumn; +import com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; + + +/** + * 备注:datax提供的转换机制有如下限制,如下规则是不能转换的 + * 1. bool -> binary + * 2. binary -> long, double, bool + * 3. double -> bool, binary + * 4. long -> binary + */ +public class ColumnConversion { + public static PrimaryKeyValue columnToPrimaryKeyValue(Column c, OTSPKColumn col) { + try { + switch (col.getType()) { + case STRING: + return PrimaryKeyValue.fromString(c.asString()); + case INTEGER: + return PrimaryKeyValue.fromLong(c.asLong()); + default: + throw new IllegalArgumentException(String.format(OTSErrorMessage.UNSUPPORT_PARSE, col.getType(), "PrimaryKeyValue")); + } + } catch (DataXException e) { + throw new IllegalArgumentException(String.format( + OTSErrorMessage.COLUMN_CONVERSION_ERROR, + c.getType(), c.asString(), col.getType().toString() + )); + } + } + + public static ColumnValue columnToColumnValue(Column c, OTSAttrColumn col) { + try { + switch (col.getType()) { + case STRING: + return ColumnValue.fromString(c.asString()); + case INTEGER: + return ColumnValue.fromLong(c.asLong()); + case BOOLEAN: + return ColumnValue.fromBoolean(c.asBoolean()); + case DOUBLE: + return ColumnValue.fromDouble(c.asDouble()); + case BINARY: + return ColumnValue.fromBinary(c.asBytes()); + default: + throw new IllegalArgumentException(String.format(OTSErrorMessage.UNSUPPORT_PARSE, col.getType(), "ColumnValue")); + } + } catch (DataXException e) { + throw new IllegalArgumentException(String.format( + OTSErrorMessage.COLUMN_CONVERSION_ERROR, + c.getType(), c.asString(), col.getType().toString() + )); + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/Common.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/Common.java new file mode 100644 index 000000000..0854c25e5 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/Common.java @@ -0,0 +1,134 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import java.util.ArrayList; +import java.util.List; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSAttrColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSErrorMessage; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSLine; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSOpType; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSPKColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.Pair; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSBatchWriterRowTask.LineAndError; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSException; +import 
com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RowChange; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.RowPutChange; +import com.aliyun.openservices.ots.model.RowUpdateChange; + +public class Common { + + public static String getDetailMessage(Exception exception) { + if (exception instanceof OTSException) { + OTSException e = (OTSException) exception; + return "OTSException[ErrorCode:" + e.getErrorCode() + ", ErrorMessage:" + e.getMessage() + ", RequestId:" + e.getRequestId() + "]"; + } else if (exception instanceof ClientException) { + ClientException e = (ClientException) exception; + return "ClientException[ErrorCode:" + e.getErrorCode() + ", ErrorMessage:" + e.getMessage() + "]"; + } else if (exception instanceof IllegalArgumentException) { + IllegalArgumentException e = (IllegalArgumentException) exception; + return "IllegalArgumentException[ErrorMessage:" + e.getMessage() + "]"; + } else { + return "Exception[ErrorMessage:" + exception.getMessage() + "]"; + } + } + + public static RowPrimaryKey getPKFromRecord(List pkColumns, Record r) { + RowPrimaryKey primaryKey = new RowPrimaryKey(); + int pkCount = pkColumns.size(); + for (int i = 0; i < pkCount; i++) { + Column col = r.getColumn(i); + OTSPKColumn expect = pkColumns.get(i); + + if (col.getRawData() == null) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PK_COLUMN_VALUE_IS_NULL_ERROR, expect.getName())); + } + + PrimaryKeyValue pk = ColumnConversion.columnToPrimaryKeyValue(col, expect); + primaryKey.addPrimaryKeyColumn(expect.getName(), pk); + } + return primaryKey; + } + + public static List> getAttrFromRecord(int pkCount, List attrColumns, Record r) { + List> attr = new ArrayList>(r.getColumnNumber()); + for (int i = 0; i < attrColumns.size(); i++) { + Column col = r.getColumn(i + pkCount); + OTSAttrColumn expect = attrColumns.get(i); + + if (col.getRawData() == null) { + attr.add(new Pair(expect.getName(), null)); + continue; + } + + ColumnValue cv = ColumnConversion.columnToColumnValue(col, expect); + attr.add(new Pair(expect.getName(), cv)); + } + return attr; + } + + public static RowChange columnValuesToRowChange(String tableName, OTSOpType type, RowPrimaryKey pk, List> values) { + switch (type) { + case PUT_ROW: + RowPutChange rowPutChange = new RowPutChange(tableName); + rowPutChange.setPrimaryKey(pk); + + for (Pair en : values) { + if (en.getValue() != null) { + rowPutChange.addAttributeColumn(en.getKey(), en.getValue()); + } + } + + return rowPutChange; + case UPDATE_ROW: + RowUpdateChange rowUpdateChange = new RowUpdateChange(tableName); + rowUpdateChange.setPrimaryKey(pk); + + for (Pair en : values) { + if (en.getValue() != null) { + rowUpdateChange.addAttributeColumn(en.getKey(), en.getValue()); + } else { + rowUpdateChange.deleteAttributeColumn(en.getKey()); + } + } + return rowUpdateChange; + default: + throw new IllegalArgumentException(String.format(OTSErrorMessage.UNSUPPORT_PARSE, type, "RowChange")); + } + } + + public static long getDelaySendMillinSeconds(int hadRetryTimes, int initSleepInMilliSecond) { + + if (hadRetryTimes <= 0) { + return 0; + } + + int sleepTime = initSleepInMilliSecond; + for (int i = 1; i < hadRetryTimes; i++) { + sleepTime += sleepTime; + if (sleepTime > 30000) { + sleepTime = 30000; + break; + } + } + return sleepTime; + } + + public static void collectDirtyRecord(TaskPluginCollector collector, List errors) { + 
for (LineAndError re : errors) { + collector.collectDirtyRecord(re.getLine().getRecord(), re.getError().getMessage()); + } + } + + public static void collectDirtyRecord(TaskPluginCollector collector, List lines, String errorMsg) { + for (OTSLine l : lines) { + collector.collectDirtyRecord(l.getRecord(), errorMsg); + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/DefaultNoRetry.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/DefaultNoRetry.java new file mode 100644 index 000000000..ec2fb5f60 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/DefaultNoRetry.java @@ -0,0 +1,17 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import com.aliyun.openservices.ots.internal.OTSRetryStrategy; + +public class DefaultNoRetry implements OTSRetryStrategy { + + @Override + public boolean shouldRetry(String action, Exception ex, int retries) { + return false; + } + + @Override + public long getPauseDelay(String s, Exception e, int i) { + return 0; + } + +} \ No newline at end of file diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/GsonParser.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/GsonParser.java new file mode 100644 index 000000000..0cae91f2b --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/GsonParser.java @@ -0,0 +1,46 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConf; +import com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; +import com.google.gson.Gson; +import com.google.gson.GsonBuilder; + +public class GsonParser { + + private static Gson gsonBuilder() { + return new GsonBuilder() + .create(); + } + + public static String confToJson (OTSConf conf) { + Gson g = gsonBuilder(); + return g.toJson(conf); + } + + public static OTSConf jsonToConf (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, OTSConf.class); + } + + public static String directionToJson (Direction direction) { + Gson g = gsonBuilder(); + return g.toJson(direction); + } + + public static Direction jsonToDirection (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, Direction.class); + } + + public static String metaToJson (TableMeta meta) { + Gson g = gsonBuilder(); + return g.toJson(meta); + } + + public static String rowPrimaryKeyToJson (RowPrimaryKey row) { + Gson g = gsonBuilder(); + return g.toJson(row); + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ParamChecker.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ParamChecker.java new file mode 100644 index 000000000..f9e17af5f --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ParamChecker.java @@ -0,0 +1,153 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSAttrColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSErrorMessage; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSPKColumn; +import 
com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.TableMeta; + +public class ParamChecker { + + private static void throwNotExistException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.MISSING_PARAMTER_ERROR, key)); + } + + private static void throwStringLengthZeroException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARAMTER_STRING_IS_EMPTY_ERROR, key)); + } + + private static void throwEmptyListException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARAMETER_LIST_IS_EMPTY_ERROR, key)); + } + + private static void throwNotListException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARAMETER_IS_NOT_ARRAY_ERROR, key)); + } + + private static void throwNotMapException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARAMETER_IS_NOT_MAP_ERROR, key)); + } + + public static String checkStringAndGet(Configuration param, String key) { + String value = param.getString(key); + if (null == value) { + throwNotExistException(key); + } else if (value.length() == 0) { + throwStringLengthZeroException(key); + } + return value; + } + + public static List checkListAndGet(Configuration param, String key, boolean isCheckEmpty) { + List value = null; + try { + value = param.getList(key); + } catch (ClassCastException e) { + throwNotListException(key); + } + if (null == value) { + throwNotExistException(key); + } else if (isCheckEmpty && value.isEmpty()) { + throwEmptyListException(key); + } + return value; + } + + public static List checkListAndGet(Map range, String key) { + Object obj = range.get(key); + if (null == obj) { + return null; + } + return checkListAndGet(range, key, false); + } + + public static List checkListAndGet(Map range, String key, boolean isCheckEmpty) { + Object obj = range.get(key); + if (null == obj) { + throwNotExistException(key); + } + if (obj instanceof List) { + @SuppressWarnings("unchecked") + List value = (List)obj; + if (isCheckEmpty && value.isEmpty()) { + throwEmptyListException(key); + } + return value; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARSE_TO_LIST_ERROR, key)); + } + } + + public static List checkListAndGet(Map range, String key, List defaultList) { + Object obj = range.get(key); + if (null == obj) { + return defaultList; + } + if (obj instanceof List) { + @SuppressWarnings("unchecked") + List value = (List)obj; + return value; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARSE_TO_LIST_ERROR, key)); + } + } + + public static Map checkMapAndGet(Configuration param, String key, boolean isCheckEmpty) { + Map value = null; + try { + value = param.getMap(key); + } catch (ClassCastException e) { + throwNotMapException(key); + } + if (null == value) { + throwNotExistException(key); + } else if (isCheckEmpty && value.isEmpty()) { + throwEmptyListException(key); + } + return value; + } + + public static void checkPrimaryKey(TableMeta meta, List pk) { + Map types = meta.getPrimaryKey(); + // 个数是否相等 + if (types.size() != pk.size()) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.INPUT_PK_COUNT_NOT_EQUAL_META_ERROR, pk.size(), types.size())); + } + + // 名字类型是否相等 + Map inputTypes = new HashMap(); + for (OTSPKColumn col : pk) { + inputTypes.put(col.getName(), col.getType()); + } + + for (Entry e : types.entrySet()) { + if (!inputTypes.containsKey(e.getKey())) { + throw new 
IllegalArgumentException(String.format(OTSErrorMessage.PK_COLUMN_MISSING_ERROR, e.getKey())); + } + PrimaryKeyType type = inputTypes.get(e.getKey()); + if (type != e.getValue()) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.INPUT_PK_TYPE_NOT_MATCH_META_ERROR, e.getKey(), type, e.getValue())); + } + } + } + + public static void checkAttribute(List attr) { + // 检查重复列 + Set names = new HashSet(); + for (OTSAttrColumn col : attr) { + if (names.contains(col.getName())) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.ATTR_REPEAT_COLUMN_ERROR, col.getName())); + } else { + names.add(col.getName()); + } + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/RetryHelper.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/RetryHelper.java new file mode 100644 index 000000000..24e92da7e --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/RetryHelper.java @@ -0,0 +1,76 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import java.util.HashSet; +import java.util.Set; +import java.util.concurrent.Callable; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.plugin.writer.otswriter.model.LogExceptionManager; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSErrorCode; +import com.aliyun.openservices.ots.OTSException; + +public class RetryHelper { + + private static final Logger LOG = LoggerFactory.getLogger(RetryHelper.class); + private static final Set noRetryErrorCode = prepareNoRetryErrorCode(); + + public static LogExceptionManager logManager = new LogExceptionManager(); + + public static V executeWithRetry(Callable callable, int maxRetryTimes, int sleepInMilliSecond) throws Exception { + int retryTimes = 0; + while (true){ + Thread.sleep(Common.getDelaySendMillinSeconds(retryTimes, sleepInMilliSecond)); + try { + return callable.call(); + } catch (Exception e) { + logManager.addException(e); + if (!canRetry(e)){ + LOG.error("Can not retry for Exception.", e); + throw e; + } else if (retryTimes >= maxRetryTimes) { + LOG.error("Retry times more than limition. 
maxRetryTimes : {}", maxRetryTimes); + throw e; + } + retryTimes++; + LOG.warn("Retry time : {}", retryTimes); + } + } + } + + private static Set prepareNoRetryErrorCode() { + Set pool = new HashSet(); + pool.add(OTSErrorCode.AUTHORIZATION_FAILURE); + pool.add(OTSErrorCode.INVALID_PARAMETER); + pool.add(OTSErrorCode.REQUEST_TOO_LARGE); + pool.add(OTSErrorCode.OBJECT_NOT_EXIST); + pool.add(OTSErrorCode.OBJECT_ALREADY_EXIST); + pool.add(OTSErrorCode.INVALID_PK); + pool.add(OTSErrorCode.OUT_OF_COLUMN_COUNT_LIMIT); + pool.add(OTSErrorCode.OUT_OF_ROW_SIZE_LIMIT); + pool.add(OTSErrorCode.CONDITION_CHECK_FAIL); + return pool; + } + + public static boolean canRetry(String otsErrorCode) { + if (noRetryErrorCode.contains(otsErrorCode)) { + return false; + } else { + return true; + } + } + + public static boolean canRetry(Exception exception) { + OTSException e = null; + if (exception instanceof OTSException) { + e = (OTSException) exception; + return canRetry(e.getErrorCode()); + } else if (exception instanceof ClientException) { + return true; + } else { + return false; + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/SizeCalculateHelper.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/SizeCalculateHelper.java new file mode 100644 index 000000000..419cf6570 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/SizeCalculateHelper.java @@ -0,0 +1,67 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import java.util.List; +import java.util.Map.Entry; + +import com.alibaba.datax.plugin.writer.otswriter.model.OTSErrorMessage; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSOpType; +import com.alibaba.datax.plugin.writer.otswriter.model.Pair; +import com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RowPrimaryKey; + +public class SizeCalculateHelper { + + public static int getPrimaryKeyValueSize(PrimaryKeyValue value) { + + switch (value.getType()) { + case INTEGER: + return 8; + case STRING: + return value.asString().length(); + default: + throw new IllegalArgumentException(String.format(OTSErrorMessage.UNSUPPORT_PARSE, value.getType(), "PrimaryKeyValue")); + } + } + + public static int getColumnValueSize(ColumnValue value) { + switch (value.getType()) { + case BINARY: + return value.asBinary().length; + case BOOLEAN: + return 1; + case DOUBLE: + return 8; + case INTEGER: + return 8; + case STRING: + return value.asString().length(); + default: + throw new IllegalArgumentException(String.format(OTSErrorMessage.UNSUPPORT_PARSE, value.getType(), "PrimaryKeyValue")); + } + } + + public static int getRowPrimaryKeySize(RowPrimaryKey pk) { + int size = 0; + for (Entry en : pk.getPrimaryKey().entrySet()) { + size += en.getKey().length(); + size += getPrimaryKeyValueSize(en.getValue()); + } + return size; + } + + public static int getAttributeColumnSize(List> attr, OTSOpType op) { + int size = 0; + for (Pair en : attr) { + if (en.getValue() == null) { + if (op == OTSOpType.UPDATE_ROW) { + size += en.getKey().length(); + } + } else { + size += en.getKey().length(); + size += getColumnValueSize(en.getValue()); + } + } + return size; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/WriterModelParser.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/WriterModelParser.java new file mode 100644 index 
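`RetryHelper.executeWithRetry` above sleeps before each attempt using the doubling delay from `Common.getDelaySendMillinSeconds` (no delay before the first call, then `sleepInMilliSecond`, 2x, 4x, ... capped at 30000 ms), and gives up immediately on the non-retryable OTS error codes collected in `prepareNoRetryErrorCode`. A minimal usage sketch (the demo class and the `Callable` body are hypothetical):

```java
import java.util.concurrent.Callable;

import com.alibaba.datax.plugin.writer.otswriter.utils.RetryHelper;

// Hypothetical demo: retry a flaky operation up to 3 times, starting from a 100 ms back-off.
public class RetryHelperDemo {
    public static void main(String[] args) throws Exception {
        String result = RetryHelper.executeWithRetry(new Callable<String>() {
            @Override
            public String call() throws Exception {
                return "ok"; // an OTS write would normally go here
            }
        }, 3, 100); // maxRetryTimes = 3, initial sleep = 100 ms (then 200, 400, ...)
        System.out.println(result);
    }
}
```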
000000000..498dfdf72 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/WriterModelParser.java @@ -0,0 +1,137 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; + +import com.alibaba.datax.plugin.writer.otswriter.model.OTSAttrColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSPKColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConst; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSErrorMessage; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSOpType; +import com.aliyun.openservices.ots.model.ColumnType; +import com.aliyun.openservices.ots.model.PrimaryKeyType; + +/** + * 解析配置中参数 + * @author redchen + * + */ +public class WriterModelParser { + + public static PrimaryKeyType parsePrimaryKeyType(String type) { + if (type.equalsIgnoreCase(OTSConst.TYPE_STRING)) { + return PrimaryKeyType.STRING; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INTEGER)) { + return PrimaryKeyType.INTEGER; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PK_TYPE_ERROR, type)); + } + } + + public static OTSPKColumn parseOTSPKColumn(Map column) { + if (column.containsKey(OTSConst.NAME) && column.containsKey(OTSConst.TYPE) && column.size() == 2) { + Object type = column.get(OTSConst.TYPE); + Object name = column.get(OTSConst.NAME); + if (type instanceof String && name instanceof String) { + String typeStr = (String) type; + String nameStr = (String) name; + if (nameStr.isEmpty()) { + throw new IllegalArgumentException(OTSErrorMessage.PK_COLUMN_NAME_IS_EMPTY_ERROR); + } + return new OTSPKColumn(nameStr, parsePrimaryKeyType(typeStr)); + } else { + throw new IllegalArgumentException(OTSErrorMessage.PK_MAP_NAME_TYPE_ERROR); + } + } else { + throw new IllegalArgumentException(OTSErrorMessage.PK_MAP_INCLUDE_NAME_TYPE_ERROR); + } + } + + public static List parseOTSPKColumnList(List values) { + List pks = new ArrayList(); + for (Object obj : values) { + if (obj instanceof Map) { + @SuppressWarnings("unchecked") + Map column = (Map) obj; + pks.add(parseOTSPKColumn(column)); + } else { + throw new IllegalArgumentException(OTSErrorMessage.PK_ITEM_IS_NOT_MAP_ERROR); + } + } + return pks; + } + + public static ColumnType parseColumnType(String type) { + if (type.equalsIgnoreCase(OTSConst.TYPE_STRING)) { + return ColumnType.STRING; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INTEGER)) { + return ColumnType.INTEGER; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_BOOLEAN)) { + return ColumnType.BOOLEAN; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_DOUBLE)) { + return ColumnType.DOUBLE; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_BINARY)) { + return ColumnType.BINARY; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.ATTR_TYPE_ERROR, type)); + } + } + + public static OTSAttrColumn parseOTSAttrColumn(Map column) { + if (column.containsKey(OTSConst.NAME) && column.containsKey(OTSConst.TYPE) && column.size() == 2) { + Object type = column.get(OTSConst.TYPE); + Object name = column.get(OTSConst.NAME); + if (type instanceof String && name instanceof String) { + String typeStr = (String) type; + String nameStr = (String) name; + if (nameStr.isEmpty()) { + throw new IllegalArgumentException(OTSErrorMessage.ATTR_COLUMN_NAME_IS_EMPTY_ERROR); + } + return new OTSAttrColumn(nameStr, parseColumnType(typeStr)); + } else { + throw new 
IllegalArgumentException(OTSErrorMessage.ATTR_MAP_NAME_TYPE_ERROR); + } + } else { + throw new IllegalArgumentException(OTSErrorMessage.ATTR_MAP_INCLUDE_NAME_TYPE_ERROR); + } + } + + private static void checkMultiAttrColumn(List attrs) { + Set pool = new HashSet(); + for (OTSAttrColumn col : attrs) { + if (pool.contains(col.getName())) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.MULTI_ATTR_COLUMN_ERROR, col.getName())); + } else { + pool.add(col.getName()); + } + } + } + + public static List parseOTSAttrColumnList(List values) { + List attrs = new ArrayList(); + for (Object obj : values) { + if (obj instanceof Map) { + @SuppressWarnings("unchecked") + Map column = (Map) obj; + attrs.add(parseOTSAttrColumn(column)); + } else { + throw new IllegalArgumentException(OTSErrorMessage.ATTR_ITEM_IS_NOT_MAP_ERROR); + } + } + checkMultiAttrColumn(attrs); + return attrs; + } + + public static OTSOpType parseOTSOpType(String value) { + if (value.equalsIgnoreCase(OTSConst.OTS_OP_TYPE_PUT)) { + return OTSOpType.PUT_ROW; + } else if (value.equalsIgnoreCase(OTSConst.OTS_OP_TYPE_UPDATE)) { + return OTSOpType.UPDATE_ROW; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.OPERATION_PARSE_ERROR, value)); + } + } +} diff --git a/otswriter/src/main/resources/plugin.json b/otswriter/src/main/resources/plugin.json new file mode 100644 index 000000000..315e96cc3 --- /dev/null +++ b/otswriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "otswriter", + "class": "com.alibaba.datax.plugin.writer.otswriter.OtsWriter", + "description": "", + "developer": "alibaba" +} \ No newline at end of file diff --git a/otswriter/src/main/resources/plugin_job_template.json b/otswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..572a9b254 --- /dev/null +++ b/otswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "otswriter", + "parameter": { + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + "table":"", + "primaryKey" : [], + "column" : [], + "writeMode" : "" + } +} \ No newline at end of file diff --git a/package.xml b/package.xml new file mode 100755 index 000000000..22caeb26c --- /dev/null +++ b/package.xml @@ -0,0 +1,298 @@ + + + + tar.gz + dir + + false + + + core/target/datax/ + + **/*.* + + datax + + + + + mysqlreader/target/datax/ + + **/*.* + + datax + + + oceanbasereader/target/datax/ + + **/*.* + + datax + + + drdsreader/target/datax/ + + **/*.* + + datax + + + oraclereader/target/datax/ + + **/*.* + + datax + + + sqlserverreader/target/datax/ + + **/*.* + + datax + + + db2reader/target/datax/ + + **/*.* + + datax + + + postgresqlreader/target/datax/ + + **/*.* + + datax + + + + odpsreader/target/datax/ + + **/*.* + + datax + + + otsreader/target/datax/ + + **/*.* + + datax + + + otsreader-internal/target/datax/ + + **/*.* + + datax + + + txtfilereader/target/datax/ + + **/*.* + + datax + + + hbasereader/target/datax/ + + **/*.* + + datax + + + ossreader/target/datax/ + + **/*.* + + datax + + + mongodbreader/target/datax/ + + **/*.* + + datax + + + streamreader/target/datax/ + + **/*.* + + datax + + + ftpreader/target/datax/ + + **/*.* + + datax + + + hdfsreader/target/datax/ + + **/*.* + + datax + + + + + mysqlwriter/target/datax/ + + **/*.* + + datax + + + oceanbasewriter/target/datax/ + + **/*.* + + datax + + + drdswriter/target/datax/ + + **/*.* + + datax + + + odpswriter/target/datax/ + + **/*.* + + datax + + + txtfilewriter/target/datax/ + + **/*.* + + datax + + 
+ osswriter/target/datax/ + + **/*.* + + datax + + + adswriter/target/datax/ + + **/*.* + + datax + + + streamwriter/target/datax/ + + **/*.* + + datax + + + otswriter/target/datax/ + + **/*.* + + datax + + + otswriter-internal/target/datax/ + + **/*.* + + datax + + + mongodbwriter/target/datax/ + + **/*.* + + datax + + + oraclewriter/target/datax/ + + **/*.* + + datax + + + sqlserverwriter/target/datax/ + + **/*.* + + datax + + + postgresqlwriter/target/datax/ + + **/*.* + + datax + + + hbasebulkwriter/target/datax/ + + **/*.* + + datax + + + hbasebulkwriter2/target/datax/ + + **/*.* + + datax + + + tddlwriter/target/datax/ + + **/*.* + + datax + + + mysqlrulewriter/target/datax/ + + **/*.* + + datax + + + ocswriter/target/datax/ + + **/*.* + + datax + + + tairwriter/target/datax/ + + **/*.* + + datax + + + hdfswriter/target/datax/ + + **/*.* + + datax + + + metaqwriter/target/datax/ + + **/*.* + + datax + + + diff --git a/plugin-rdbms-util/plugin-rdbms-util.iml b/plugin-rdbms-util/plugin-rdbms-util.iml new file mode 100644 index 000000000..d5b9e84e2 --- /dev/null +++ b/plugin-rdbms-util/plugin-rdbms-util.iml @@ -0,0 +1,46 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/plugin-rdbms-util/pom.xml b/plugin-rdbms-util/pom.xml new file mode 100755 index 000000000..81e4d7fd1 --- /dev/null +++ b/plugin-rdbms-util/pom.xml @@ -0,0 +1,66 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + plugin-rdbms-util + plugin-rdbms-util + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + commons-collections + commons-collections + 3.0 + + + mysql + mysql-connector-java + test + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba + druid + 1.0.13 + + + junit + junit + test + + + org.mockito + mockito-all + 1.9.5 + test + + + com.google.guava + guava + r05 + + + \ No newline at end of file diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/CommonRdbmsReader.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/CommonRdbmsReader.java new file mode 100755 index 000000000..efb7d4065 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/CommonRdbmsReader.java @@ -0,0 +1,206 @@ +package com.alibaba.datax.plugin.rdbms.reader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.util.OriginalConfPretreatmentUtil; +import com.alibaba.datax.plugin.rdbms.reader.util.PreCheckTask; +import com.alibaba.datax.plugin.rdbms.reader.util.ReaderSplitUtil; +import com.alibaba.datax.plugin.rdbms.reader.util.SingleTableSplitUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.RdbmsException; +import com.google.common.collect.Lists; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.sql.ResultSet; +import java.sql.ResultSetMetaData; +import java.util.ArrayList; +import java.util.Collection; +import java.util.List; +import java.util.concurrent.ExecutionException; 
+import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.Future; + +public class CommonRdbmsReader { + + public static class Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + public Job(DataBaseType dataBaseType) { + OriginalConfPretreatmentUtil.DATABASE_TYPE = dataBaseType; + SingleTableSplitUtil.DATABASE_TYPE = dataBaseType; + } + + public void init(Configuration originalConfig) { + + OriginalConfPretreatmentUtil.doPretreatment(originalConfig); + + LOG.debug("After job init(), job config now is:[\n{}\n]", + originalConfig.toJSON()); + } + + public void preCheck(Configuration originalConfig,DataBaseType dataBaseType) { + /*检查每个表是否有读权限,以及querySql跟splik Key是否正确*/ + Configuration queryConf = ReaderSplitUtil.doPreCheckSplit(originalConfig); + String splitPK = queryConf.getString(Key.SPLIT_PK); + List connList = queryConf.getList(Constant.CONN_MARK, Object.class); + String username = queryConf.getString(Key.USERNAME); + String password = queryConf.getString(Key.PASSWORD); + ExecutorService exec; + if (connList.size() < 10){ + exec = Executors.newFixedThreadPool(connList.size()); + }else{ + exec = Executors.newFixedThreadPool(10); + } + Collection taskList = new ArrayList(); + for (int i = 0, len = connList.size(); i < len; i++){ + Configuration connConf = Configuration.from(connList.get(i).toString()); + PreCheckTask t = new PreCheckTask(username,password,connConf,dataBaseType,splitPK); + taskList.add(t); + } + List> results = Lists.newArrayList(); + try { + results = exec.invokeAll(taskList); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + + for (Future result : results){ + try { + result.get(); + } catch (ExecutionException e) { + DataXException de = (DataXException) e.getCause(); + throw de; + }catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + } + exec.shutdownNow(); + } + + + public List split(Configuration originalConfig, + int adviceNumber) { + return ReaderSplitUtil.doSplit(originalConfig, adviceNumber); + } + + public void post(Configuration originalConfig) { + // do nothing + } + + public void destroy(Configuration originalConfig) { + // do nothing + } + + } + + public static class Task { + private static final Logger LOG = LoggerFactory + .getLogger(Task.class); + + private DataBaseType dataBaseType; + private int taskGroupId = -1; + private int taskId=-1; + + private String username; + private String password; + private String jdbcUrl; + private String mandatoryEncoding; + + // 作为日志显示信息时,需要附带的通用信息。比如信息所对应的数据库连接等信息,针对哪个表做的操作 + private String basicMsg; + + public Task(DataBaseType dataBaseType) { + this(dataBaseType, -1, -1); + } + + public Task(DataBaseType dataBaseType,int taskGropuId, int taskId) { + this.dataBaseType = dataBaseType; + this.taskGroupId = taskGropuId; + this.taskId = taskId; + } + + public void init(Configuration readerSliceConfig) { + + /* for database connection */ + + this.username = readerSliceConfig.getString(Key.USERNAME); + this.password = readerSliceConfig.getString(Key.PASSWORD); + this.jdbcUrl = readerSliceConfig.getString(Key.JDBC_URL); + this.mandatoryEncoding = readerSliceConfig.getString(Key.MANDATORY_ENCODING, ""); + + basicMsg = String.format("jdbcUrl:[%s]", this.jdbcUrl); + + } + + public void startRead(Configuration readerSliceConfig, + RecordSender recordSender, + TaskPluginCollector taskPluginCollector, int fetchSize) { + String querySql = readerSliceConfig.getString(Key.QUERY_SQL); + String 
table = readerSliceConfig.getString(Key.TABLE); + + PerfTrace.getInstance().addTaskDetails(taskId, table + "," + basicMsg); + + LOG.info("Begin to read record by Sql: [{}\n] {}.", + querySql, basicMsg); + PerfRecord queryPerfRecord = new PerfRecord(taskGroupId,taskId, PerfRecord.PHASE.SQL_QUERY); + queryPerfRecord.start(); + + Connection conn = DBUtil.getConnection(this.dataBaseType, jdbcUrl, + username, password); + + // session config .etc related + DBUtil.dealWithSessionConfig(conn, readerSliceConfig, + this.dataBaseType, basicMsg); + + int columnNumber = 0; + ResultSet rs = null; + try { + rs = DBUtil.query(conn, querySql, fetchSize); + queryPerfRecord.end(); + + ResultSetMetaData metaData = rs.getMetaData(); + columnNumber = metaData.getColumnCount(); + + //这个统计干净的result_Next时间 + PerfRecord allResultPerfRecord = new PerfRecord(taskGroupId, taskId, PerfRecord.PHASE.RESULT_NEXT_ALL); + allResultPerfRecord.start(); + + long rsNextUsedTime = 0; + long lastTime = System.nanoTime(); + while (rs.next()) { + rsNextUsedTime += (System.nanoTime() - lastTime); + ResultSetReadProxy.transportOneRecord(recordSender, rs, + metaData, columnNumber, mandatoryEncoding, taskPluginCollector); + lastTime = System.nanoTime(); + } + + allResultPerfRecord.end(rsNextUsedTime); + //目前大盘是依赖这个打印,而之前这个Finish read record是包含了sql查询和result next的全部时间 + LOG.info("Finished read record by Sql: [{}\n] {}.", + querySql, basicMsg); + + }catch (Exception e) { + throw RdbmsException.asQueryException(this.dataBaseType, e, querySql, table, username); + } finally { + DBUtil.closeDBResources(null, conn); + } + } + + public void post(Configuration originalConfig) { + // do nothing + } + + public void destroy(Configuration originalConfig) { + // do nothing + } + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Constant.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Constant.java new file mode 100755 index 000000000..1d938ab55 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Constant.java @@ -0,0 +1,24 @@ +package com.alibaba.datax.plugin.rdbms.reader; + +public final class Constant { + public static final String PK_TYPE = "pkType"; + + public static final Object PK_TYPE_STRING = "pkTypeString"; + + public static final Object PK_TYPE_LONG = "pkTypeLong"; + + public static String CONN_MARK = "connection"; + + public static String TABLE_NUMBER_MARK = "tableNumber"; + + public static String IS_TABLE_MODE = "isTableMode"; + + public final static String FETCH_SIZE = "fetchSize"; + + public static String QUERY_SQL_TEMPLATE_WITHOUT_WHERE = "select %s from %s "; + + public static String QUERY_SQL_TEMPLATE = "select %s from %s where (%s)"; + + public static String TABLE_NAME_PLACEHOLDER = "@table"; + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Key.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Key.java new file mode 100755 index 000000000..2e8cb1891 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Key.java @@ -0,0 +1,44 @@ +package com.alibaba.datax.plugin.rdbms.reader; + +/** + * 编码,时区等配置,暂未定. 
+ */ +public final class Key { + public final static String JDBC_URL = "jdbcUrl"; + + public final static String USERNAME = "username"; + + public final static String PASSWORD = "password"; + + public final static String TABLE = "table"; + + public final static String MANDATORY_ENCODING = "mandatoryEncoding"; + + // 是数组配置 + public final static String COLUMN = "column"; + + public final static String WHERE = "where"; + + public final static String HINT = "hint"; + + public final static String SPLIT_PK = "splitPk"; + + public final static String QUERY_SQL = "querySql"; + + public final static String SPLIT_PK_SQL = "splitPkSql"; + + + public final static String PRE_SQL = "preSql"; + + public final static String POST_SQL = "postSql"; + + public final static String CHECK_SLAVE = "checkSlave"; + + public final static String SESSION = "session"; + + public final static String DBNAME = "dbName"; + + public final static String DRYRUN = "dryRun"; + + +} \ No newline at end of file diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/ResultSetReadProxy.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/ResultSetReadProxy.java new file mode 100755 index 000000000..9fe765c6c --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/ResultSetReadProxy.java @@ -0,0 +1,139 @@ +package com.alibaba.datax.plugin.rdbms.reader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; + +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.ResultSet; +import java.sql.ResultSetMetaData; +import java.sql.Types; + +public class ResultSetReadProxy { + private static final Logger LOG = LoggerFactory + .getLogger(ResultSetReadProxy.class); + + private static final boolean IS_DEBUG = LOG.isDebugEnabled(); + private static final byte[] EMPTY_CHAR_ARRAY = new byte[0]; + + //TODO + public static void transportOneRecord(RecordSender recordSender, ResultSet rs, + ResultSetMetaData metaData, int columnNumber, String mandatoryEncoding, + TaskPluginCollector taskPluginCollector) { + Record record = recordSender.createRecord(); + + try { + for (int i = 1; i <= columnNumber; i++) { + switch (metaData.getColumnType(i)) { + + case Types.CHAR: + case Types.NCHAR: + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + String rawData; + if(StringUtils.isBlank(mandatoryEncoding)){ + rawData = rs.getString(i); + }else{ + rawData = new String((rs.getBytes(i) == null ? 
EMPTY_CHAR_ARRAY : + rs.getBytes(i)), mandatoryEncoding); + } + record.addColumn(new StringColumn(rawData)); + break; + + case Types.CLOB: + case Types.NCLOB: + record.addColumn(new StringColumn(rs.getString(i))); + break; + + case Types.SMALLINT: + case Types.TINYINT: + case Types.INTEGER: + case Types.BIGINT: + record.addColumn(new LongColumn(rs.getString(i))); + break; + + case Types.NUMERIC: + case Types.DECIMAL: + record.addColumn(new DoubleColumn(rs.getString(i))); + break; + + case Types.FLOAT: + case Types.REAL: + case Types.DOUBLE: + record.addColumn(new DoubleColumn(rs.getString(i))); + break; + + case Types.TIME: + record.addColumn(new DateColumn(rs.getTime(i))); + break; + + // for mysql bug, see http://bugs.mysql.com/bug.php?id=35115 + case Types.DATE: + if (metaData.getColumnTypeName(i).equalsIgnoreCase("year")) { + record.addColumn(new LongColumn(rs.getInt(i))); + } else { + record.addColumn(new DateColumn(rs.getDate(i))); + } + break; + + case Types.TIMESTAMP: + record.addColumn(new DateColumn(rs.getTimestamp(i))); + break; + + case Types.BINARY: + case Types.VARBINARY: + case Types.BLOB: + case Types.LONGVARBINARY: + record.addColumn(new BytesColumn(rs.getBytes(i))); + break; + + // warn: bit(1) -> Types.BIT 可使用BoolColumn + // warn: bit(>1) -> Types.VARBINARY 可使用BytesColumn + case Types.BOOLEAN: + case Types.BIT: + record.addColumn(new BoolColumn(rs.getBoolean(i))); + break; + + case Types.NULL: + String stringData = null; + if(rs.getObject(i) != null) { + stringData = rs.getObject(i).toString(); + } + record.addColumn(new StringColumn(stringData)); + break; + + // TODO 添加BASIC_MESSAGE + default: + throw DataXException + .asDataXException( + DBUtilErrorCode.UNSUPPORTED_TYPE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库读取这种字段类型. 字段名:[%s], 字段名称:[%s], 字段Java类型:[%s]. 请尝试使用数据库函数将其转换datax支持的类型 或者不同步该字段 .", + metaData.getColumnName(i), + metaData.getColumnType(i), + metaData.getColumnClassName(i))); + } + } + } catch (Exception e) { + if (IS_DEBUG) { + LOG.debug("read data " + record.toString() + + " occur exception:", e); + } + + //TODO 这里识别为脏数据靠谱吗? + taskPluginCollector.collectDirtyRecord(record, e); + if (e instanceof DataXException) { + throw (DataXException) e; + } + } + + recordSender.sendToWriter(record); + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/HintUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/HintUtil.java new file mode 100644 index 000000000..4e6827cfc --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/HintUtil.java @@ -0,0 +1,67 @@ +package com.alibaba.datax.plugin.rdbms.reader.util; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * Created by liuyi on 15/9/18. 
+ */ +public class HintUtil { + private static final Logger LOG = LoggerFactory.getLogger(ReaderSplitUtil.class); + + private static DataBaseType dataBaseType; + private static String username; + private static String password; + private static Pattern tablePattern; + private static String hintExpression; + + public static void initHintConf(DataBaseType type, Configuration configuration){ + dataBaseType = type; + username = configuration.getString(Key.USERNAME); + password = configuration.getString(Key.PASSWORD); + String hint = configuration.getString(Key.HINT); + if(StringUtils.isNotBlank(hint)){ + String[] tablePatternAndHint = hint.split("#"); + if(tablePatternAndHint.length==1){ + tablePattern = Pattern.compile(".*"); + hintExpression = tablePatternAndHint[0]; + }else{ + tablePattern = Pattern.compile(tablePatternAndHint[0]); + hintExpression = tablePatternAndHint[1]; + } + } + } + + public static String buildQueryColumn(String jdbcUrl, String table, String column){ + try{ + if(tablePattern != null && DataBaseType.Oracle.equals(dataBaseType)) { + Matcher m = tablePattern.matcher(table); + if(m.find()){ + String[] tableStr = table.split("\\."); + String tableWithoutSchema = tableStr[tableStr.length-1]; + String finalHint = hintExpression.replaceAll(Constant.TABLE_NAME_PLACEHOLDER, tableWithoutSchema); + //主库不并发读取 + if(finalHint.indexOf("parallel") > 0 && DBUtil.isOracleMaster(jdbcUrl, username, password)){ + LOG.info("master:{} will not use hint:{}", jdbcUrl, finalHint); + }else{ + LOG.info("table:{} use hint:{}.", table, finalHint); + return finalHint + column; + } + } + } + } catch (Exception e){ + LOG.warn("match hint exception, will not use hint", e); + } + return column; + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/OriginalConfPretreatmentUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/OriginalConfPretreatmentUtil.java new file mode 100755 index 000000000..532710d93 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/OriginalConfPretreatmentUtil.java @@ -0,0 +1,271 @@ +package com.alibaba.datax.plugin.rdbms.reader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.ListUtil; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.TableExpandUtil; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +public final class OriginalConfPretreatmentUtil { + private static final Logger LOG = LoggerFactory + .getLogger(OriginalConfPretreatmentUtil.class); + + private static boolean IS_DEBUG = LOG.isDebugEnabled(); + + public static DataBaseType DATABASE_TYPE; + + public static void doPretreatment(Configuration originalConfig) { + // 检查 username/password 配置(必填) + originalConfig.getNecessaryValue(Key.USERNAME, + DBUtilErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.PASSWORD, + DBUtilErrorCode.REQUIRED_VALUE); + dealWhere(originalConfig); + + simplifyConf(originalConfig); + } + + public static void dealWhere(Configuration originalConfig) { + String where = 
originalConfig.getString(Key.WHERE, null); + if(StringUtils.isNotBlank(where)) { + String whereImprove = where.trim(); + if(whereImprove.endsWith(";") || whereImprove.endsWith(";")) { + whereImprove = whereImprove.substring(0,whereImprove.length()-1); + } + originalConfig.set(Key.WHERE, whereImprove); + } + } + + /** + * 对配置进行初步处理: + *
1. 处理同一个数据库配置了多个jdbcUrl的情况 + * 2. 识别并标记是采用querySql 模式还是 table 模式 + * 3. 对 table 模式,确定分表个数,并处理 column 转 *事项
+ */ + private static void simplifyConf(Configuration originalConfig) { + boolean isTableMode = recognizeTableOrQuerySqlMode(originalConfig); + originalConfig.set(Constant.IS_TABLE_MODE, isTableMode); + + dealJdbcAndTable(originalConfig); + + dealColumnConf(originalConfig); + } + + private static void dealJdbcAndTable(Configuration originalConfig) { + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + boolean checkSlave = originalConfig.getBool(Key.CHECK_SLAVE, false); + boolean isTableMode = originalConfig.getBool(Constant.IS_TABLE_MODE); + boolean isPreCheck = originalConfig.getBool(Key.DRYRUN,false); + + List conns = originalConfig.getList(Constant.CONN_MARK, + Object.class); + List preSql = originalConfig.getList(Key.PRE_SQL, String.class); + + int tableNum = 0; + + for (int i = 0, len = conns.size(); i < len; i++) { + Configuration connConf = Configuration + .from(conns.get(i).toString()); + + connConf.getNecessaryValue(Key.JDBC_URL, + DBUtilErrorCode.REQUIRED_VALUE); + + List jdbcUrls = connConf + .getList(Key.JDBC_URL, String.class); + + String jdbcUrl; + if (isPreCheck) { + jdbcUrl = DBUtil.chooseJdbcUrlWithoutRetry(DATABASE_TYPE, jdbcUrls, + username, password, preSql, checkSlave); + } else { + jdbcUrl = DBUtil.chooseJdbcUrl(DATABASE_TYPE, jdbcUrls, + username, password, preSql, checkSlave); + } + + jdbcUrl = DATABASE_TYPE.appendJDBCSuffixForReader(jdbcUrl); + + // 回写到connection[i].jdbcUrl + originalConfig.set(String.format("%s[%d].%s", Constant.CONN_MARK, + i, Key.JDBC_URL), jdbcUrl); + + LOG.info("Available jdbcUrl:{}.",jdbcUrl); + + if (isTableMode) { + // table 方式 + // 对每一个connection 上配置的table 项进行解析(已对表名称进行了 ` 处理的) + List tables = connConf.getList(Key.TABLE, String.class); + + List expandedTables = TableExpandUtil.expandTableConf( + DATABASE_TYPE, tables); + + if (null == expandedTables || expandedTables.isEmpty()) { + throw DataXException.asDataXException( + DBUtilErrorCode.ILLEGAL_VALUE, String.format("您所配置的读取数据库表:%s 不正确. 因为DataX根据您的配置找不到这张表. 请检查您的配置并作出修改." + + "请先了解 DataX 配置.", StringUtils.join(tables, ","))); + } + + tableNum += expandedTables.size(); + + originalConfig.set(String.format("%s[%d].%s", + Constant.CONN_MARK, i, Key.TABLE), expandedTables); + } else { + // 说明是配置的 querySql 方式,不做处理. + } + } + + originalConfig.set(Constant.TABLE_NUMBER_MARK, tableNum); + } + + private static void dealColumnConf(Configuration originalConfig) { + boolean isTableMode = originalConfig.getBool(Constant.IS_TABLE_MODE); + + List userConfiguredColumns = originalConfig.getList(Key.COLUMN, + String.class); + + if (isTableMode) { + if (null == userConfiguredColumns + || userConfiguredColumns.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.REQUIRED_VALUE, "您未配置读取数据库表的列信息. " + + "正确的配置方式是给 column 配置上您需要读取的列名称,用英文逗号分隔. 例如: \"column\": [\"id\", \"name\"],请参考上述配置并作出修改."); + } else { + String splitPk = originalConfig.getString(Key.SPLIT_PK, null); + + if (1 == userConfiguredColumns.size() + && "*".equals(userConfiguredColumns.get(0))) { + LOG.warn("您的配置文件中的列配置存在一定的风险. 
因为您未配置读取数据库表的列,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改."); + // 回填其值,需要以 String 的方式转交后续处理 + originalConfig.set(Key.COLUMN, "*"); + } else { + String jdbcUrl = originalConfig.getString(String.format( + "%s[0].%s", Constant.CONN_MARK, Key.JDBC_URL)); + + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + + String tableName = originalConfig.getString(String.format( + "%s[0].%s[0]", Constant.CONN_MARK, Key.TABLE)); + + List allColumns = DBUtil.getTableColumns( + DATABASE_TYPE, jdbcUrl, username, password, + tableName); + LOG.info("table:[{}] has columns:[{}].", + tableName, StringUtils.join(allColumns, ",")); + // warn:注意mysql表名区分大小写 + allColumns = ListUtil.valueToLowerCase(allColumns); + List quotedColumns = new ArrayList(); + + for (String column : userConfiguredColumns) { + if ("*".equals(column)) { + throw DataXException.asDataXException( + DBUtilErrorCode.ILLEGAL_VALUE, + "您的配置文件中的列配置信息有误. 因为根据您的配置,数据库表的列中存在多个*. 请检查您的配置并作出修改. "); + } + + if (null == column) { + quotedColumns.add(null); + } else { + if (allColumns.contains(column.toLowerCase())) { + quotedColumns.add(column); + } else { + // 可能是由于用户填写为函数,或者自己对字段进行了`处理或者常量 + quotedColumns.add(column); + } + } + } + + originalConfig.set(Key.COLUMN, + StringUtils.join(quotedColumns, ",")); + if (StringUtils.isNotBlank(splitPk)) { + if (!allColumns.contains(splitPk.toLowerCase())) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + String.format("您的配置文件中的列配置信息有误. 因为根据您的配置,您读取的数据库表:%s 中没有主键名为:%s. 请检查您的配置并作出修改.", tableName, splitPk)); + } + } + + } + } + } else { + // querySql模式,不希望配制 column,那样是混淆不清晰的 + if (null != userConfiguredColumns + && userConfiguredColumns.size() > 0) { + LOG.warn("您的配置有误. 由于您读取数据库表采用了querySql的方式, 所以您不需要再配置 column. 如果您不想看到这条提醒,请移除您源头表中配置中的 column."); + originalConfig.remove(Key.COLUMN); + } + + // querySql模式,不希望配制 where,那样是混淆不清晰的 + String where = originalConfig.getString(Key.WHERE, null); + if (StringUtils.isNotBlank(where)) { + LOG.warn("您的配置有误. 由于您读取数据库表采用了querySql的方式, 所以您不需要再配置 where. 如果您不想看到这条提醒,请移除您源头表中配置中的 where."); + originalConfig.remove(Key.WHERE); + } + + // querySql模式,不希望配制 splitPk,那样是混淆不清晰的 + String splitPk = originalConfig.getString(Key.SPLIT_PK, null); + if (StringUtils.isNotBlank(splitPk)) { + LOG.warn("您的配置有误. 由于您读取数据库表采用了querySql的方式, 所以您不需要再配置 splitPk. 如果您不想看到这条提醒,请移除您源头表中配置中的 splitPk."); + originalConfig.remove(Key.SPLIT_PK); + } + } + + } + + private static boolean recognizeTableOrQuerySqlMode( + Configuration originalConfig) { + List conns = originalConfig.getList(Constant.CONN_MARK, + Object.class); + + List tableModeFlags = new ArrayList(); + List querySqlModeFlags = new ArrayList(); + + String table = null; + String querySql = null; + + boolean isTableMode = false; + boolean isQuerySqlMode = false; + for (int i = 0, len = conns.size(); i < len; i++) { + Configuration connConf = Configuration + .from(conns.get(i).toString()); + table = connConf.getString(Key.TABLE, null); + querySql = connConf.getString(Key.QUERY_SQL, null); + + isTableMode = StringUtils.isNotBlank(table); + tableModeFlags.add(isTableMode); + + isQuerySqlMode = StringUtils.isNotBlank(querySql); + querySqlModeFlags.add(isQuerySqlMode); + + if (false == isTableMode && false == isQuerySqlMode) { + // table 和 querySql 二者均未配制 + throw DataXException.asDataXException( + DBUtilErrorCode.TABLE_QUERYSQL_MISSING, "您的配置有误. 因为table和querySql应该配置并且只能配置一个. 
请检查您的配置并作出修改."); + } else if (true == isTableMode && true == isQuerySqlMode) { + // table 和 querySql 二者均配置 + throw DataXException.asDataXException(DBUtilErrorCode.TABLE_QUERYSQL_MIXED, + "您的配置凌乱了. 因为datax不能同时既配置table又配置querySql.请检查您的配置并作出修改."); + } + } + + // 混合配制 table 和 querySql + if (!ListUtil.checkIfValueSame(tableModeFlags) + || !ListUtil.checkIfValueSame(tableModeFlags)) { + throw DataXException.asDataXException(DBUtilErrorCode.TABLE_QUERYSQL_MIXED, + "您配置凌乱了. 不能同时既配置table又配置querySql. 请检查您的配置并作出修改."); + } + + return tableModeFlags.get(0); + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/PreCheckTask.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/PreCheckTask.java new file mode 100644 index 000000000..36e96732f --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/PreCheckTask.java @@ -0,0 +1,100 @@ +package com.alibaba.datax.plugin.rdbms.reader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.RdbmsException; +import com.alibaba.druid.sql.parser.ParserException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.sql.ResultSet; +import java.util.List; +import java.util.concurrent.Callable; + +/** + * Created by judy.lt on 2015/6/4. + */ +public class PreCheckTask implements Callable{ + private static final Logger LOG = LoggerFactory.getLogger(PreCheckTask.class); + private String userName; + private String password; + private String splitPkId; + private Configuration connection; + private DataBaseType dataBaseType; + + public PreCheckTask(String userName, + String password, + Configuration connection, + DataBaseType dataBaseType, + String splitPkId){ + this.connection = connection; + this.userName=userName; + this.password=password; + this.dataBaseType = dataBaseType; + this.splitPkId = splitPkId; + } + + @Override + public Boolean call() throws DataXException { + String jdbcUrl = this.connection.getString(Key.JDBC_URL); + List querySqls = this.connection.getList(Key.QUERY_SQL, Object.class); + List splitPkSqls = this.connection.getList(Key.SPLIT_PK_SQL, Object.class); + List tables = this.connection.getList(Key.TABLE,Object.class); + Connection conn = DBUtil.getConnectionWithoutRetry(this.dataBaseType, jdbcUrl, + this.userName, password); + int fetchSize = 1; + if(DataBaseType.MySql.equals(dataBaseType) || DataBaseType.DRDS.equals(dataBaseType)) { + fetchSize = Integer.MIN_VALUE; + } + try{ + for (int i=0;i doSplit( + Configuration originalSliceConfig, int adviceNumber) { + boolean isTableMode = originalSliceConfig.getBool(Constant.IS_TABLE_MODE).booleanValue(); + int eachTableShouldSplittedNumber = -1; + if (isTableMode) { + eachTableShouldSplittedNumber = calculateEachTableShouldSplittedNumber( + adviceNumber, originalSliceConfig.getInt(Constant.TABLE_NUMBER_MARK)); + } + + String column = originalSliceConfig.getString(Key.COLUMN); + String where = originalSliceConfig.getString(Key.WHERE, null); + + List conns = originalSliceConfig.getList(Constant.CONN_MARK, Object.class); + + List splittedConfigs = new ArrayList(); + + for (int i = 0, len = conns.size(); i < len; i++) { + Configuration sliceConfig = originalSliceConfig.clone(); + + 
Configuration connConf = Configuration.from(conns.get(i).toString()); + String jdbcUrl = connConf.getString(Key.JDBC_URL); + sliceConfig.set(Key.JDBC_URL, jdbcUrl); + + // 抽取 jdbcUrl 中的 ip/port 进行资源使用的打标,以提供给 core 做有意义的 shuffle 操作 + sliceConfig.set(CommonConstant.LOAD_BALANCE_RESOURCE_MARK, DataBaseType.parseIpFromJdbcUrl(jdbcUrl)); + + sliceConfig.remove(Constant.CONN_MARK); + + Configuration tempSlice; + + // 说明是配置的 table 方式 + if (isTableMode) { + // 已在之前进行了扩展和`处理,可以直接使用 + List tables = connConf.getList(Key.TABLE, String.class); + + Validate.isTrue(null != tables && !tables.isEmpty(), "您读取数据库表配置错误."); + + String splitPk = originalSliceConfig.getString(Key.SPLIT_PK, null); + + //最终切分份数不一定等于 eachTableShouldSplittedNumber + boolean needSplitTable = eachTableShouldSplittedNumber > 1 + && StringUtils.isNotBlank(splitPk); + if (needSplitTable) { + if (tables.size() == 1) { + //如果是单表的,主键切分num=num*2+1 + eachTableShouldSplittedNumber = eachTableShouldSplittedNumber * 2 + 1; + } + // 尝试对每个表,切分为eachTableShouldSplittedNumber 份 + for (String table : tables) { + tempSlice = sliceConfig.clone(); + tempSlice.set(Key.TABLE, table); + + List splittedSlices = SingleTableSplitUtil + .splitSingleTable(tempSlice, eachTableShouldSplittedNumber); + + splittedConfigs.addAll(splittedSlices); + } + } else { + for (String table : tables) { + tempSlice = sliceConfig.clone(); + tempSlice.set(Key.TABLE, table); + String queryColumn = HintUtil.buildQueryColumn(jdbcUrl, table, column); + tempSlice.set(Key.QUERY_SQL, SingleTableSplitUtil.buildQuerySql(queryColumn, table, where)); + splittedConfigs.add(tempSlice); + } + } + } else { + // 说明是配置的 querySql 方式 + List sqls = connConf.getList(Key.QUERY_SQL, String.class); + + // TODO 是否check 配置为多条语句?? + for (String querySql : sqls) { + tempSlice = sliceConfig.clone(); + tempSlice.set(Key.QUERY_SQL, querySql); + splittedConfigs.add(tempSlice); + } + } + + } + + return splittedConfigs; + } + + public static Configuration doPreCheckSplit(Configuration originalSliceConfig) { + Configuration queryConfig = originalSliceConfig.clone(); + boolean isTableMode = originalSliceConfig.getBool(Constant.IS_TABLE_MODE).booleanValue(); + + String splitPK = originalSliceConfig.getString(Key.SPLIT_PK); + String column = originalSliceConfig.getString(Key.COLUMN); + String where = originalSliceConfig.getString(Key.WHERE, null); + + List conns = queryConfig.getList(Constant.CONN_MARK, Object.class); + + for (int i = 0, len = conns.size(); i < len; i++){ + Configuration connConf = Configuration.from(conns.get(i).toString()); + List querys = new ArrayList(); + List splitPkQuerys = new ArrayList(); + String connPath = String.format("connection[%d]",i); + // 说明是配置的 table 方式 + if (isTableMode) { + // 已在之前进行了扩展和`处理,可以直接使用 + List tables = connConf.getList(Key.TABLE, String.class); + Validate.isTrue(null != tables && !tables.isEmpty(), "您读取数据库表配置错误."); + for (String table : tables) { + querys.add(SingleTableSplitUtil.buildQuerySql(column,table,where)); + if (splitPK != null && !splitPK.isEmpty()){ + splitPkQuerys.add(SingleTableSplitUtil.genPKSql(splitPK.trim(),table,where)); + } + } + if (!splitPkQuerys.isEmpty()){ + connConf.set(Key.SPLIT_PK_SQL,splitPkQuerys); + } + connConf.set(Key.QUERY_SQL,querys); + queryConfig.set(connPath,connConf); + } else { + // 说明是配置的 querySql 方式 + List sqls = connConf.getList(Key.QUERY_SQL, + String.class); + for (String querySql : sqls) { + querys.add(querySql); + } + connConf.set(Key.QUERY_SQL,querys); + queryConfig.set(connPath,connConf); + } + } + return queryConfig; + } + 
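+    /*
+     * 切分份数计算示意(非源码注释,数值仅为帮助理解 doSplit 与下方 calculateEachTableShouldSplittedNumber 的配合):
+     * 例如 adviceNumber=10、tableNumber=3 时,每张表建议切分 ceil(10/3)=4 份;
+     * 若只配置了单表(tableNumber=1)且指定了 splitPk,则先得到 ceil(10/1)=10,
+     * doSplit 中会再放大为 10*2+1=21 份交给 SingleTableSplitUtil.splitSingleTable,
+     * 最终实际份数不一定等于该建议值,取决于 splitPk 的取值分布。
+     */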
+ private static int calculateEachTableShouldSplittedNumber(int adviceNumber, + int tableNumber) { + double tempNum = 1.0 * adviceNumber / tableNumber; + + return (int) Math.ceil(tempNum); + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/SingleTableSplitUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/SingleTableSplitUtil.java new file mode 100755 index 000000000..20798cc51 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/SingleTableSplitUtil.java @@ -0,0 +1,277 @@ +package com.alibaba.datax.plugin.rdbms.reader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.util.*; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.ImmutablePair; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.math.BigInteger; +import java.sql.Connection; +import java.sql.ResultSet; +import java.sql.ResultSetMetaData; +import java.sql.Types; +import java.util.ArrayList; +import java.util.List; + +public class SingleTableSplitUtil { + private static final Logger LOG = LoggerFactory + .getLogger(SingleTableSplitUtil.class); + + public static DataBaseType DATABASE_TYPE; + + private SingleTableSplitUtil() { + } + + public static List splitSingleTable( + Configuration configuration, int adviceNum) { + List pluginParams = new ArrayList(); + + Pair minMaxPK = getPkRange(configuration); + + if (null == minMaxPK) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "根据切分主键切分表失败. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + + String splitPkName = configuration.getString(Key.SPLIT_PK); + String column = configuration.getString(Key.COLUMN); + String table = configuration.getString(Key.TABLE); + String where = configuration.getString(Key.WHERE, null); + + boolean hasWhere = StringUtils.isNotBlank(where); + + configuration.set(Key.QUERY_SQL, buildQuerySql(column, table, where)); + if (null == minMaxPK.getLeft() || null == minMaxPK.getRight()) { + // 切分后获取到的start/end 有 Null 的情况 + pluginParams.add(configuration); + return pluginParams; + } + + boolean isStringType = Constant.PK_TYPE_STRING.equals(configuration + .getString(Constant.PK_TYPE)); + boolean isLongType = Constant.PK_TYPE_LONG.equals(configuration + .getString(Constant.PK_TYPE)); + + List rangeList; + if (isStringType) { + rangeList = RdbmsRangeSplitWrap.splitAndWrap( + String.valueOf(minMaxPK.getLeft()), + String.valueOf(minMaxPK.getRight()), adviceNum, + splitPkName, "'", DATABASE_TYPE); + } else if (isLongType) { + rangeList = RdbmsRangeSplitWrap.splitAndWrap( + new BigInteger(minMaxPK.getLeft().toString()), + new BigInteger(minMaxPK.getRight().toString()), + adviceNum, splitPkName); + } else { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "您配置的切分主键(splitPk) 类型 DataX 不支持. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + + String tempQuerySql; + List allQuerySql = new ArrayList(); + + if (null != rangeList) { + for (String range : rangeList) { + Configuration tempConfig = configuration.clone(); + + tempQuerySql = buildQuerySql(column, table, where) + + (hasWhere ? 
" and " : " where ") + range; + + allQuerySql.add(tempQuerySql); + tempConfig.set(Key.QUERY_SQL, tempQuerySql); + pluginParams.add(tempConfig); + } + } else { + pluginParams.add(configuration); + } + + // deal pk is null + Configuration tempConfig = configuration.clone(); + tempQuerySql = buildQuerySql(column, table, where) + + (hasWhere ? " and " : " where ") + + String.format(" %s IS NULL", splitPkName); + + allQuerySql.add(tempQuerySql); + + LOG.info("After split(), allQuerySql=[\n{}\n].", + StringUtils.join(allQuerySql, "\n")); + + tempConfig.set(Key.QUERY_SQL, tempQuerySql); + pluginParams.add(tempConfig); + + return pluginParams; + } + + public static String buildQuerySql(String column, String table, + String where) { + String querySql; + + if (StringUtils.isBlank(where)) { + querySql = String.format(Constant.QUERY_SQL_TEMPLATE_WITHOUT_WHERE, + column, table); + } else { + querySql = String.format(Constant.QUERY_SQL_TEMPLATE, column, + table, where); + } + + return querySql; + } + + @SuppressWarnings("resource") + private static Pair getPkRange(Configuration configuration) { + String pkRangeSQL = genPKRangeSQL(configuration); + + int fetchSize = configuration.getInt(Constant.FETCH_SIZE); + String jdbcURL = configuration.getString(Key.JDBC_URL); + String username = configuration.getString(Key.USERNAME); + String password = configuration.getString(Key.PASSWORD); + String table = configuration.getString(Key.TABLE); + + Connection conn = DBUtil.getConnection(DATABASE_TYPE, jdbcURL, username, password); + Pair minMaxPK = checkSplitPk(conn, pkRangeSQL, fetchSize, table, username, configuration); + DBUtil.closeDBResources(null, null, conn); + return minMaxPK; + } + + public static void precheckSplitPk(Connection conn, String pkRangeSQL, int fetchSize, + String table, String username) { + Pair minMaxPK = checkSplitPk(conn, pkRangeSQL, fetchSize, table, username, null); + if (null == minMaxPK) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "根据切分主键切分表失败. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } + + /** + * 检测splitPk的配置是否正确。 + * configuration为null, 是precheck的逻辑,不需要回写PK_TYPE到configuration中 + * + */ + private static Pair checkSplitPk(Connection conn, String pkRangeSQL, int fetchSize, String table, + String username, Configuration configuration) { + ResultSet rs = null; + Pair minMaxPK = null; + try { + try { + rs = DBUtil.query(conn, pkRangeSQL, fetchSize); + }catch (Exception e) { + throw RdbmsException.asQueryException(DATABASE_TYPE, e, pkRangeSQL,table,username); + } + ResultSetMetaData rsMetaData = rs.getMetaData(); + if (isPKTypeValid(rsMetaData)) { + if (isStringType(rsMetaData.getColumnType(1))) { + if(configuration != null) { + configuration + .set(Constant.PK_TYPE, Constant.PK_TYPE_STRING); + } + while (DBUtil.asyncResultSetNext(rs)) { + minMaxPK = new ImmutablePair( + rs.getString(1), rs.getString(2)); + } + } else if (isLongType(rsMetaData.getColumnType(1))) { + if(configuration != null) { + configuration.set(Constant.PK_TYPE, Constant.PK_TYPE_LONG); + } + + while (DBUtil.asyncResultSetNext(rs)) { + minMaxPK = new ImmutablePair( + rs.getString(1), rs.getString(2)); + + // check: string shouldn't contain '.', for oracle + String minMax = rs.getString(1) + rs.getString(2); + if (StringUtils.contains(minMax, '.')) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "您配置的DataX切分主键(splitPk)有误. 因为您配置的切分主键(splitPk) 类型 DataX 不支持. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 
请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } + } else { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "您配置的DataX切分主键(splitPk)有误. 因为您配置的切分主键(splitPk) 类型 DataX 不支持. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } else { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "您配置的DataX切分主键(splitPk)有误. 因为您配置的切分主键(splitPk) 类型 DataX 不支持. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } catch(DataXException e) { + throw e; + } catch (Exception e) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, "DataX尝试切分表发生错误. 请检查您的配置并作出修改.", e); + } finally { + DBUtil.closeDBResources(rs, null, null); + } + + return minMaxPK; + } + + private static boolean isPKTypeValid(ResultSetMetaData rsMetaData) { + boolean ret = false; + try { + int minType = rsMetaData.getColumnType(1); + int maxType = rsMetaData.getColumnType(2); + + boolean isNumberType = isLongType(minType); + + boolean isStringType = isStringType(minType); + + if (minType == maxType && (isNumberType || isStringType)) { + ret = true; + } + } catch (Exception e) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "DataX获取切分主键(splitPk)字段类型失败. 该错误通常是系统底层异常导致. 请联系旺旺:askdatax或者DBA处理."); + } + return ret; + } + + // warn: Types.NUMERIC is used for oracle! because oracle use NUMBER to + // store INT, SMALLINT, INTEGER etc, and only oracle need to concern + // Types.NUMERIC + private static boolean isLongType(int type) { + boolean isValidLongType = type == Types.BIGINT || type == Types.INTEGER + || type == Types.SMALLINT || type == Types.TINYINT; + + switch (SingleTableSplitUtil.DATABASE_TYPE) { + case Oracle: + isValidLongType |= type == Types.NUMERIC; + break; + default: + break; + } + return isValidLongType; + } + + private static boolean isStringType(int type) { + return type == Types.CHAR || type == Types.NCHAR + || type == Types.VARCHAR || type == Types.LONGVARCHAR + || type == Types.NVARCHAR; + } + + private static String genPKRangeSQL(Configuration configuration) { + + String splitPK = configuration.getString(Key.SPLIT_PK).trim(); + String table = configuration.getString(Key.TABLE).trim(); + String where = configuration.getString(Key.WHERE, null); + return genPKSql(splitPK,table,where); + } + + public static String genPKSql(String splitPK, String table, String where){ + + String minMaxTemplate = "SELECT MIN(%s),MAX(%s) FROM %s"; + String pkRangeSQL = String.format(minMaxTemplate, splitPK, splitPK, + table); + if (StringUtils.isNotBlank(where)) { + pkRangeSQL = String.format("%s WHERE (%s AND %s IS NOT NULL)", + pkRangeSQL, where, splitPK); + } + return pkRangeSQL; + } + + +} \ No newline at end of file diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/ConnectionFactory.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/ConnectionFactory.java new file mode 100644 index 000000000..3aef46b35 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/ConnectionFactory.java @@ -0,0 +1,16 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import java.sql.Connection; + +/** + * Date: 15/3/16 下午2:17 + */ +public interface ConnectionFactory { + + public Connection getConnecttion(); + + public Connection getConnecttionWithoutRetry(); + + public String getConnectionInfo(); + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/Constant.java 
b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/Constant.java new file mode 100755 index 000000000..54840aa60 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/Constant.java @@ -0,0 +1,27 @@ +package com.alibaba.datax.plugin.rdbms.util; + +public final class Constant { + static final int TIMEOUT_SECONDS = 3; + static final int MAX_TRY_TIMES = 4; + static final int SOCKET_TIMEOUT_INSECOND = 172800; + + public static final String MYSQL_DATABASE = "Unknown database"; + public static final String MYSQL_CONNEXP = "Communications link failure"; + public static final String MYSQL_ACCDENIED = "Access denied"; + public static final String MYSQL_TABLE_NAME_ERR1 = "Table"; + public static final String MYSQL_TABLE_NAME_ERR2 = "doesn't exist"; + public static final String MYSQL_SELECT_PRI = "SELECT command denied to user"; + public static final String MYSQL_COLUMN1 = "Unknown column"; + public static final String MYSQL_COLUMN2 = "field list"; + public static final String MYSQL_WHERE = "where clause"; + + public static final String ORACLE_DATABASE = "ORA-12505"; + public static final String ORACLE_CONNEXP = "The Network Adapter could not establish the connection"; + public static final String ORACLE_ACCDENIED = "ORA-01017"; + public static final String ORACLE_TABLE_NAME = "table or view does not exist"; + public static final String ORACLE_SELECT_PRI = "insufficient privileges"; + public static final String ORACLE_SQL = "invalid identifier"; + + + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtil.java new file mode 100755 index 000000000..70e16ffc5 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtil.java @@ -0,0 +1,764 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.druid.sql.parser.SQLParserUtils; +import com.alibaba.druid.sql.parser.SQLStatementParser; +import com.google.common.util.concurrent.ThreadFactoryBuilder; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.ImmutableTriple; +import org.apache.commons.lang3.tuple.Triple; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.*; +import java.util.*; +import java.util.concurrent.*; + +public final class DBUtil { + private static final Logger LOG = LoggerFactory.getLogger(DBUtil.class); + + private static final ThreadLocal rsExecutors = new ThreadLocal() { + @Override + protected ExecutorService initialValue() { + return Executors.newFixedThreadPool(1, new ThreadFactoryBuilder() + .setNameFormat("rsExecutors-%d") + .setDaemon(true) + .build()); + } + }; + + private DBUtil() { + } + + public static String chooseJdbcUrl(final DataBaseType dataBaseType, + final List jdbcUrls, final String username, + final String password, final List preSql, + final boolean checkSlave) { + + if (null == jdbcUrls || jdbcUrls.isEmpty()) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format("您的jdbcUrl的配置信息有错, 因为jdbcUrl[%s]不能为空. 
请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ","))); + } + + try { + return RetryUtil.executeWithRetry(new Callable() { + + @Override + public String call() throws Exception { + boolean connOK = false; + for (String url : jdbcUrls) { + if (StringUtils.isNotBlank(url)) { + url = url.trim(); + if (null != preSql && !preSql.isEmpty()) { + connOK = testConnWithoutRetry(dataBaseType, + url, username, password, preSql); + } else { + connOK = testConnWithoutRetry(dataBaseType, + url, username, password, checkSlave); + } + if (connOK) { + return url; + } + } + } + throw new Exception("DataX无法连接对应的数据库,可能原因是:1) 配置的ip/port/database/jdbc错误,无法连接。2) 配置的username/password错误,鉴权失败。请和DBA确认该数据库的连接信息是否正确。"); +// throw new Exception(DBUtilErrorCode.JDBC_NULL.toString()); + } + }, 3, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONN_DB_ERROR, + String.format("数据库连接失败. 因为根据您配置的连接信息,无法从:%s 中找到可连接的jdbcUrl. 请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ",")), e); + } + } + + public static String chooseJdbcUrlWithoutRetry(final DataBaseType dataBaseType, + final List jdbcUrls, final String username, + final String password, final List preSql, + final boolean checkSlave) throws DataXException { + + if (null == jdbcUrls || jdbcUrls.isEmpty()) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format("您的jdbcUrl的配置信息有错, 因为jdbcUrl[%s]不能为空. 请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ","))); + } + + boolean connOK = false; + for (String url : jdbcUrls) { + if (StringUtils.isNotBlank(url)) { + url = url.trim(); + if (null != preSql && !preSql.isEmpty()) { + connOK = testConnWithoutRetry(dataBaseType, + url, username, password, preSql); + } else { + try { + connOK = testConnWithoutRetry(dataBaseType, + url, username, password, checkSlave); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONN_DB_ERROR, + String.format("数据库连接失败. 因为根据您配置的连接信息,无法从:%s 中找到可连接的jdbcUrl. 请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ",")), e); + } + } + if (connOK) { + return url; + } + } + } + throw DataXException.asDataXException( + DBUtilErrorCode.CONN_DB_ERROR, + String.format("数据库连接失败. 因为根据您配置的连接信息,无法从:%s 中找到可连接的jdbcUrl. 
请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ","))); + } + + /** + * 检查slave的库中的数据是否已到凌晨00:00 + * 如果slave同步的数据还未到00:00返回false + * 否则范围true + * + * @author ZiChi + * @version 1.0 2014-12-01 + */ + private static boolean isSlaveBehind(Connection conn) { + try { + ResultSet rs = query(conn, "SHOW VARIABLES LIKE 'read_only'"); + if (DBUtil.asyncResultSetNext(rs)) { + String readOnly = rs.getString("Value"); + if ("ON".equalsIgnoreCase(readOnly)) { //备库 + ResultSet rs1 = query(conn, "SHOW SLAVE STATUS"); + if (DBUtil.asyncResultSetNext(rs1)) { + String ioRunning = rs1.getString("Slave_IO_Running"); + String sqlRunning = rs1.getString("Slave_SQL_Running"); + long secondsBehindMaster = rs1.getLong("Seconds_Behind_Master"); + if ("Yes".equalsIgnoreCase(ioRunning) && "Yes".equalsIgnoreCase(sqlRunning)) { + ResultSet rs2 = query(conn, "SELECT TIMESTAMPDIFF(SECOND, CURDATE(), NOW())"); + DBUtil.asyncResultSetNext(rs2); + long secondsOfDay = rs2.getLong(1); + return secondsBehindMaster > secondsOfDay; + } else { + return true; + } + } else { + LOG.warn("SHOW SLAVE STATUS has no result"); + } + } + } else { + LOG.warn("SHOW VARIABLES like 'read_only' has no result"); + } + } catch (Exception e) { + LOG.warn("checkSlave failed, errorMessage:[{}].", e.getMessage()); + } + return false; + } + + /** + * 检查表是否具有insert 权限 + * insert on *.* 或者 insert on database.* 时验证通过 + * 当insert on database.tableName时,确保tableList中的所有table有insert 权限,验证通过 + * 其它验证都不通过 + * + * @author ZiChi + * @version 1.0 2015-01-28 + */ + public static boolean hasInsertPrivilege(DataBaseType dataBaseType, String jdbcURL, String userName, String password, List tableList) { + /*准备参数*/ + + String[] urls = jdbcURL.split("/"); + String dbName; + if (urls != null && urls.length != 0) { + dbName = urls[3]; + }else{ + return false; + } + + String dbPattern = "`" + dbName + "`.*"; + Collection tableNames = new HashSet(tableList.size()); + tableNames.addAll(tableList); + + Connection connection = connect(dataBaseType, jdbcURL, userName, password); + try { + ResultSet rs = query(connection, "SHOW GRANTS FOR " + userName); + while (DBUtil.asyncResultSetNext(rs)) { + String grantRecord = rs.getString("Grants for " + userName + "@%"); + String[] params = grantRecord.split("\\`"); + if (params != null && params.length >= 3) { + String tableName = params[3]; + if (params[0].contains("INSERT") && !tableName.equals("*") && tableNames.contains(tableName)) + tableNames.remove(tableName); + } else { + if (grantRecord.contains("INSERT") ||grantRecord.contains("ALL PRIVILEGES")) { + if (grantRecord.contains("*.*")) + return true; + else if (grantRecord.contains(dbPattern)) { + return true; + } + } + } + } + } catch (Exception e) { + LOG.warn("Check the database has the Insert Privilege failed, errorMessage:[{}]", e.getMessage()); + } + if (tableNames.isEmpty()) + return true; + return false; + } + + + public static boolean checkInsertPrivilege(DataBaseType dataBaseType, String jdbcURL, String userName, String password, List tableList) { + Connection connection = connect(dataBaseType, jdbcURL, userName, password); + String insertTemplate = "insert into %s(select * from %s where 1 = 2)"; + + boolean hasInsertPrivilege = true; + Statement insertStmt = null; + for(String tableName : tableList) { + String checkInsertPrivilegeSql = String.format(insertTemplate, tableName, tableName); + try { + insertStmt = connection.createStatement(); + executeSqlWithoutResultSet(insertStmt, checkInsertPrivilegeSql); + } catch (Exception e) { + 
if(DataBaseType.Oracle.equals(dataBaseType)) { + if(e.getMessage() != null && e.getMessage().contains("insufficient privileges")) { + hasInsertPrivilege = false; + LOG.warn("User [" + userName +"] has no 'insert' privilege on table[" + tableName + "], errorMessage:[{}]", e.getMessage()); + } + } else { + hasInsertPrivilege = false; + LOG.warn("User [" + userName + "] has no 'insert' privilege on table[" + tableName + "], errorMessage:[{}]", e.getMessage()); + } + } + } + try { + connection.close(); + } catch (SQLException e) { + LOG.warn("connection close failed, " + e.getMessage()); + } + return hasInsertPrivilege; + } + + public static boolean checkDeletePrivilege(DataBaseType dataBaseType,String jdbcURL, String userName, String password, List tableList) { + Connection connection = connect(dataBaseType, jdbcURL, userName, password); + String deleteTemplate = "delete from %s WHERE 1 = 2"; + + boolean hasInsertPrivilege = true; + Statement deleteStmt = null; + for(String tableName : tableList) { + String checkDeletePrivilegeSQL = String.format(deleteTemplate, tableName); + try { + deleteStmt = connection.createStatement(); + executeSqlWithoutResultSet(deleteStmt, checkDeletePrivilegeSQL); + } catch (Exception e) { + hasInsertPrivilege = false; + LOG.warn("User [" + userName +"] has no 'delete' privilege on table[" + tableName + "], errorMessage:[{}]", e.getMessage()); + } + } + try { + connection.close(); + } catch (SQLException e) { + LOG.warn("connection close failed, " + e.getMessage()); + } + return hasInsertPrivilege; + } + + public static boolean needCheckDeletePrivilege(Configuration originalConfig) { + List allSqls =new ArrayList(); + List preSQLs = originalConfig.getList(Key.PRE_SQL, String.class); + List postSQLs = originalConfig.getList(Key.POST_SQL, String.class); + if (preSQLs != null && !preSQLs.isEmpty()){ + allSqls.addAll(preSQLs); + } + if (postSQLs != null && !postSQLs.isEmpty()){ + allSqls.addAll(postSQLs); + } + for(String sql : allSqls) { + if(StringUtils.isNotBlank(sql)) { + if (sql.trim().toUpperCase().startsWith("DELETE")) { + return true; + } + } + } + return false; + } + + /** + * Get direct JDBC connection + *

+ * if connecting failed, try to connect for MAX_TRY_TIMES times + *

+ * NOTE: In DataX, we don't need connection pool in fact + */ + public static Connection getConnection(final DataBaseType dataBaseType, + final String jdbcUrl, final String username, final String password) { + + return getConnection(dataBaseType, jdbcUrl, username, password, String.valueOf(Constant.SOCKET_TIMEOUT_INSECOND * 1000)); + } + + /** + * + * @param dataBaseType + * @param jdbcUrl + * @param username + * @param password + * @param socketTimeout 设置socketTimeout,单位ms,String类型 + * @return + */ + public static Connection getConnection(final DataBaseType dataBaseType, + final String jdbcUrl, final String username, final String password, final String socketTimeout) { + + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public Connection call() throws Exception { + return DBUtil.connect(dataBaseType, jdbcUrl, username, + password, socketTimeout); + } + }, 9, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONN_DB_ERROR, + String.format("数据库连接失败. 因为根据您配置的连接信息:%s获取数据库连接失败. 请检查您的配置并作出修改.", jdbcUrl), e); + } + } + + /** + * Get direct JDBC connection + *

+ * if connecting failed, try to connect for MAX_TRY_TIMES times + *

+ * NOTE: In DataX, we don't need connection pool in fact + */ + public static Connection getConnectionWithoutRetry(final DataBaseType dataBaseType, + final String jdbcUrl, final String username, final String password) { + return getConnectionWithoutRetry(dataBaseType, jdbcUrl, username, + password, String.valueOf(Constant.SOCKET_TIMEOUT_INSECOND * 1000)); + } + + public static Connection getConnectionWithoutRetry(final DataBaseType dataBaseType, + final String jdbcUrl, final String username, final String password, String socketTimeout) { + return DBUtil.connect(dataBaseType, jdbcUrl, username, + password, socketTimeout); + } + + private static synchronized Connection connect(DataBaseType dataBaseType, + String url, String user, String pass) { + return connect(dataBaseType, url, user, pass, String.valueOf(Constant.SOCKET_TIMEOUT_INSECOND * 1000)); + } + + private static synchronized Connection connect(DataBaseType dataBaseType, + String url, String user, String pass, String socketTimeout) { + Properties prop = new Properties(); + prop.put("user", user); + prop.put("password", pass); + + if (dataBaseType == DataBaseType.Oracle) { + //oracle.net.READ_TIMEOUT for jdbc versions < 10.1.0.5 oracle.jdbc.ReadTimeout for jdbc versions >=10.1.0.5 + // unit ms + prop.put("oracle.jdbc.ReadTimeout", socketTimeout); + } + + return connect(dataBaseType, url, prop); + } + + private static synchronized Connection connect(DataBaseType dataBaseType, + String url, Properties prop) { + try { + Class.forName(dataBaseType.getDriverClassName()); + DriverManager.setLoginTimeout(Constant.TIMEOUT_SECONDS); + return DriverManager.getConnection(url, prop); + } catch (Exception e) { + throw RdbmsException.asConnException(dataBaseType, e, prop.getProperty("user"), null); + } + } + + /** + * a wrapped method to execute select-like sql statement . + * + * @param conn Database connection . + * @param sql sql statement to be executed + * @return a {@link ResultSet} + * @throws SQLException if occurs SQLException. + */ + public static ResultSet query(Connection conn, String sql, int fetchSize) + throws SQLException { + // 默认3600 s 的query Timeout + return query(conn, sql, fetchSize, Constant.SOCKET_TIMEOUT_INSECOND); + } + + /** + * a wrapped method to execute select-like sql statement . + * + * @param conn Database connection . + * @param sql sql statement to be executed + * @param fetchSize + * @param queryTimeout unit:second + * @return + * @throws SQLException + */ + public static ResultSet query(Connection conn, String sql, int fetchSize, int queryTimeout) + throws SQLException { + // make sure autocommit is off + conn.setAutoCommit(false); + Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, + ResultSet.CONCUR_READ_ONLY); + stmt.setFetchSize(fetchSize); + stmt.setQueryTimeout(queryTimeout); + return query(stmt, sql); + } + + /** + * a wrapped method to execute select-like sql statement . + * + * @param stmt {@link Statement} + * @param sql sql statement to be executed + * @return a {@link ResultSet} + * @throws SQLException if occurs SQLException. 
+ */ + public static ResultSet query(Statement stmt, String sql) + throws SQLException { + return stmt.executeQuery(sql); + } + + public static void executeSqlWithoutResultSet(Statement stmt, String sql) + throws SQLException { + stmt.execute(sql); + } + + /** + * Close {@link ResultSet}, {@link Statement} referenced by this + * {@link ResultSet} + * + * @param rs {@link ResultSet} to be closed + * @throws IllegalArgumentException + */ + public static void closeResultSet(ResultSet rs) { + try { + if (null != rs) { + Statement stmt = rs.getStatement(); + if (null != stmt) { + stmt.close(); + stmt = null; + } + rs.close(); + } + rs = null; + } catch (SQLException e) { + throw new IllegalStateException(e); + } + } + + public static void closeDBResources(ResultSet rs, Statement stmt, + Connection conn) { + if (null != rs) { + try { + rs.close(); + } catch (SQLException unused) { + } + } + + if (null != stmt) { + try { + stmt.close(); + } catch (SQLException unused) { + } + } + + if (null != conn) { + try { + conn.close(); + } catch (SQLException unused) { + } + } + } + + public static void closeDBResources(Statement stmt, Connection conn) { + closeDBResources(null, stmt, conn); + } + + public static List getTableColumns(DataBaseType dataBaseType, + String jdbcUrl, String user, String pass, String tableName) { + Connection conn = getConnection(dataBaseType, jdbcUrl, user, pass); + return getTableColumnsByConn(dataBaseType, conn, tableName, "jdbcUrl:"+jdbcUrl); + } + + public static List getTableColumnsByConn(DataBaseType dataBaseType, Connection conn, String tableName, String basicMsg) { + List columns = new ArrayList(); + Statement statement = null; + ResultSet rs = null; + String queryColumnSql = null; + try { + statement = conn.createStatement(); + queryColumnSql = String.format("select * from %s where 1=2", + tableName); + rs = statement.executeQuery(queryColumnSql); + ResultSetMetaData rsMetaData = rs.getMetaData(); + for (int i = 0, len = rsMetaData.getColumnCount(); i < len; i++) { + columns.add(rsMetaData.getColumnName(i + 1)); + } + + } catch (SQLException e) { + throw RdbmsException.asQueryException(dataBaseType,e,queryColumnSql,tableName,null); + } finally { + DBUtil.closeDBResources(rs, statement, conn); + } + + return columns; + } + + /** + * @return Left:ColumnName Middle:ColumnType Right:ColumnTypeName + */ + public static Triple, List, List> getColumnMetaData( + DataBaseType dataBaseType, String jdbcUrl, String user, + String pass, String tableName, String column) { + Connection conn = null; + try { + conn = getConnection(dataBaseType, jdbcUrl, user, pass); + return getColumnMetaData(conn, tableName, column); + } finally { + DBUtil.closeDBResources(null, null, conn); + } + } + + /** + * @return Left:ColumnName Middle:ColumnType Right:ColumnTypeName + */ + public static Triple, List, List> getColumnMetaData( + Connection conn, String tableName, String column) { + Statement statement = null; + ResultSet rs = null; + + Triple, List, List> columnMetaData = new ImmutableTriple, List, List>( + new ArrayList(), new ArrayList(), + new ArrayList()); + try { + statement = conn.createStatement(); + String queryColumnSql = "select " + column + " from " + tableName + + " where 1=2"; + + rs = statement.executeQuery(queryColumnSql); + ResultSetMetaData rsMetaData = rs.getMetaData(); + for (int i = 0, len = rsMetaData.getColumnCount(); i < len; i++) { + + columnMetaData.getLeft().add(rsMetaData.getColumnName(i + 1)); + columnMetaData.getMiddle().add(rsMetaData.getColumnType(i + 1)); + 
columnMetaData.getRight().add( + rsMetaData.getColumnTypeName(i + 1)); + } + return columnMetaData; + + } catch (SQLException e) { + throw DataXException + .asDataXException(DBUtilErrorCode.GET_COLUMN_INFO_FAILED, + String.format("获取表:%s 的字段的元信息时失败. 请联系 DBA 核查该库、表信息.", tableName), e); + } finally { + DBUtil.closeDBResources(rs, statement, null); + } + } + + public static boolean testConnWithoutRetry(DataBaseType dataBaseType, + String url, String user, String pass, boolean checkSlave){ + Connection connection = null; + + try { + connection = connect(dataBaseType, url, user, pass); + if (connection != null) { + if (dataBaseType.equals(dataBaseType.MySql) && checkSlave) { + //dataBaseType.MySql + boolean connOk = !isSlaveBehind(connection); + return connOk; + } else { + return true; + } + } + } catch (Exception e) { + LOG.warn("test connection of [{}] failed, for {}.", url, + e.getMessage()); + } finally { + DBUtil.closeDBResources(null, connection); + } + return false; + } + + public static boolean testConnWithoutRetry(DataBaseType dataBaseType, + String url, String user, String pass, List preSql) { + Connection connection = null; + try { + connection = connect(dataBaseType, url, user, pass); + if (null != connection) { + for (String pre : preSql) { + if (doPreCheck(connection, pre) == false) { + LOG.warn("doPreCheck failed."); + return false; + } + } + return true; + } + } catch (Exception e) { + LOG.warn("test connection of [{}] failed, for {}.", url, + e.getMessage()); + } finally { + DBUtil.closeDBResources(null, connection); + } + + return false; + } + + public static boolean isOracleMaster(final String url, final String user, final String pass) { + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public Boolean call() throws Exception { + Connection conn = null; + try { + conn = connect(DataBaseType.Oracle, url, user, pass); + ResultSet rs = query(conn, "select DATABASE_ROLE from V$DATABASE"); + if (DBUtil.asyncResultSetNext(rs, 5)) { + String role = rs.getString("DATABASE_ROLE"); + return "PRIMARY".equalsIgnoreCase(role); + } + throw DataXException.asDataXException(DBUtilErrorCode.RS_ASYNC_ERROR, + String.format("select DATABASE_ROLE from V$DATABASE failed,请检查您的jdbcUrl:%s.", url)); + } finally { + DBUtil.closeDBResources(null, conn); + } + } + }, 3, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(DBUtilErrorCode.CONN_DB_ERROR, + String.format("select DATABASE_ROLE from V$DATABASE failed, url: %s", url), e); + } + } + + public static ResultSet query(Connection conn, String sql) + throws SQLException { + Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, + ResultSet.CONCUR_READ_ONLY); + //默认3600 seconds + stmt.setQueryTimeout(Constant.SOCKET_TIMEOUT_INSECOND); + return query(stmt, sql); + } + + private static boolean doPreCheck(Connection conn, String pre) { + ResultSet rs = null; + try { + rs = query(conn, pre); + + int checkResult = -1; + if (DBUtil.asyncResultSetNext(rs)) { + checkResult = rs.getInt(1); + if (DBUtil.asyncResultSetNext(rs)) { + LOG.warn( + "pre check failed. It should return one result:0, pre:[{}].", + pre); + return false; + } + + } + + if (0 == checkResult) { + return true; + } + + LOG.warn( + "pre check failed. It should return one result:0, pre:[{}].", + pre); + } catch (Exception e) { + LOG.warn("pre check failed. 
pre:[{}], errorMessage:[{}].", pre, + e.getMessage()); + } finally { + DBUtil.closeResultSet(rs); + } + return false; + } + + // warn:until now, only oracle need to handle session config. + public static void dealWithSessionConfig(Connection conn, + Configuration config, DataBaseType databaseType, String message) { + List sessionConfig = null; + switch (databaseType) { + case Oracle: + sessionConfig = config.getList(Key.SESSION, + new ArrayList(), String.class); + DBUtil.doDealWithSessionConfig(conn, sessionConfig, message); + break; + case DRDS: + // 用于关闭 drds 的分布式事务开关 + sessionConfig = new ArrayList(); + sessionConfig.add("set transaction policy 4"); + DBUtil.doDealWithSessionConfig(conn, sessionConfig, message); + break; + case MySql: + sessionConfig = config.getList(Key.SESSION, + new ArrayList(), String.class); + DBUtil.doDealWithSessionConfig(conn, sessionConfig, message); + break; + default: + break; + } + } + + private static void doDealWithSessionConfig(Connection conn, + List sessions, String message) { + if (null == sessions || sessions.isEmpty()) { + return; + } + + Statement stmt; + try { + stmt = conn.createStatement(); + } catch (SQLException e) { + throw DataXException + .asDataXException(DBUtilErrorCode.SET_SESSION_ERROR, String + .format("session配置有误. 因为根据您的配置执行 session 设置失败. 上下文信息是:[%s]. 请检查您的配置并作出修改.", message), + e); + } + + for (String sessionSql : sessions) { + LOG.info("execute sql:[{}]", sessionSql); + try { + DBUtil.executeSqlWithoutResultSet(stmt, sessionSql); + } catch (SQLException e) { + throw DataXException.asDataXException( + DBUtilErrorCode.SET_SESSION_ERROR, String.format( + "session配置有误. 因为根据您的配置执行 session 设置失败. 上下文信息是:[%s]. 请检查您的配置并作出修改.", message), e); + } + } + DBUtil.closeDBResources(stmt, null); + } + + public static void sqlValid(String sql, DataBaseType dataBaseType){ + SQLStatementParser statementParser = SQLParserUtils.createSQLStatementParser(sql,dataBaseType.getTypeName()); + statementParser.parseStatementList(); + } + + /** + * 异步获取resultSet的next(),注意,千万不能应用在数据的读取中。只能用在meta的获取 + * @param resultSet + * @return + */ + public static boolean asyncResultSetNext(final ResultSet resultSet) { + return asyncResultSetNext(resultSet, 3600); + } + + public static boolean asyncResultSetNext(final ResultSet resultSet, int timeout) { + Future future = rsExecutors.get().submit(new Callable() { + @Override + public Boolean call() throws Exception { + return resultSet.next(); + } + }); + try { + return future.get(timeout, TimeUnit.SECONDS); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.RS_ASYNC_ERROR, "异步获取ResultSet失败", e); + } + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtilErrorCode.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtilErrorCode.java new file mode 100755 index 000000000..08b7386ca --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtilErrorCode.java @@ -0,0 +1,95 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.spi.ErrorCode; + +//TODO +public enum DBUtilErrorCode implements ErrorCode { + //连接错误 + MYSQL_CONN_USERPWD_ERROR("MYSQLErrCode-01","数据库用户名或者密码错误,请检查填写的账号密码或者联系DBA确认账号和密码是否正确"), + MYSQL_CONN_IPPORT_ERROR("MYSQLErrCode-02","数据库服务的IP地址或者Port错误,请检查填写的IP地址和Port或者联系DBA确认IP地址和Port是否正确。如果是同步中心用户请联系DBA确认idb上录入的IP和PORT信息和数据库的当前实际信息是一致的"), + MYSQL_CONN_DB_ERROR("MYSQLErrCode-03","数据库名称错误,请检查数据库实例名称或者联系DBA确认该实例是否存在并且在正常服务"), + + 
ORACLE_CONN_USERPWD_ERROR("ORACLEErrCode-01","数据库用户名或者密码错误,请检查填写的账号密码或者联系DBA确认账号和密码是否正确"), + ORACLE_CONN_IPPORT_ERROR("ORACLEErrCode-02","数据库服务的IP地址或者Port错误,请检查填写的IP地址和Port或者联系DBA确认IP地址和Port是否正确。如果是同步中心用户请联系DBA确认idb上录入的IP和PORT信息和数据库的当前实际信息是一致的"), + ORACLE_CONN_DB_ERROR("ORACLEErrCode-03","数据库名称错误,请检查数据库实例名称或者联系DBA确认该实例是否存在并且在正常服务"), + + //execute query错误 + MYSQL_QUERY_TABLE_NAME_ERROR("MYSQLErrCode-04","表不存在,请检查表名或者联系DBA确认该表是否存在"), + MYSQL_QUERY_SQL_ERROR("MYSQLErrCode-05","SQL语句执行出错,请检查Where条件是否存在拼写或语法错误"), + MYSQL_QUERY_COLUMN_ERROR("MYSQLErrCode-06","Column信息错误,请检查该列是否存在,如果是常量或者变量,请使用英文单引号’包起来"), + MYSQL_QUERY_SELECT_PRI_ERROR("MYSQLErrCode-07","读表数据出错,因为账号没有读表的权限,请联系DBA确认该账号的权限并授权"), + + ORACLE_QUERY_TABLE_NAME_ERROR("ORACLEErrCode-04","表不存在,请检查表名或者联系DBA确认该表是否存在"), + ORACLE_QUERY_SQL_ERROR("ORACLEErrCode-05","SQL语句执行出错,原因可能是你填写的列不存在或者where条件不符合要求,1,请检查该列是否存在,如果是常量或者变量,请使用英文单引号’包起来; 2,请检查Where条件是否存在拼写或语法错误"), + ORACLE_QUERY_SELECT_PRI_ERROR("ORACLEErrCode-06","读表数据出错,因为账号没有读表的权限,请联系DBA确认该账号的权限并授权"), + ORACLE_QUERY_SQL_PARSER_ERROR("ORACLEErrCode-07","SQL语法出错,请检查Where条件是否存在拼写或语法错误"), + + //PreSql,Post Sql错误 + MYSQL_PRE_SQL_ERROR("MYSQLErrCode-08","PreSQL语法错误,请检查"), + MYSQL_POST_SQL_ERROR("MYSQLErrCode-09","PostSql语法错误,请检查"), + MYSQL_QUERY_SQL_PARSER_ERROR("MYSQLErrCode-10","SQL语法出错,请检查Where条件是否存在拼写或语法错误"), + + ORACLE_PRE_SQL_ERROR("ORACLEErrCode-08", "PreSQL语法错误,请检查"), + ORACLE_POST_SQL_ERROR("ORACLEErrCode-09", "PostSql语法错误,请检查"), + + //SplitPK 错误 + MYSQL_SPLIT_PK_ERROR("MYSQLErrCode-11","SplitPK错误,请检查"), + ORACLE_SPLIT_PK_ERROR("ORACLEErrCode-10","SplitPK错误,请检查"), + + //Insert,Delete 权限错误 + MYSQL_INSERT_ERROR("MYSQLErrCode-12","数据库没有写权限,请联系DBA"), + MYSQL_DELETE_ERROR("MYSQLErrCode-13","数据库没有Delete权限,请联系DBA"), + ORACLE_INSERT_ERROR("ORACLEErrCode-11","数据库没有写权限,请联系DBA"), + ORACLE_DELETE_ERROR("ORACLEErrCode-12","数据库没有Delete权限,请联系DBA"), + + JDBC_NULL("DBUtilErrorCode-20","JDBC URL为空,请检查配置"), + CONF_ERROR("DBUtilErrorCode-00", "您的配置错误."), + CONN_DB_ERROR("DBUtilErrorCode-10", "连接数据库失败. 请检查您的 账号、密码、数据库名称、IP、Port或者向 DBA 寻求帮助(注意网络环境)."), + GET_COLUMN_INFO_FAILED("DBUtilErrorCode-01", "获取表字段相关信息失败."), + UNSUPPORTED_TYPE("DBUtilErrorCode-12", "不支持的数据库类型. 请注意查看 DataX 已经支持的数据库类型以及数据库版本."), + COLUMN_SPLIT_ERROR("DBUtilErrorCode-13", "根据主键进行切分失败."), + SET_SESSION_ERROR("DBUtilErrorCode-14", "设置 session 失败."), + RS_ASYNC_ERROR("DBUtilErrorCode-15", "异步获取ResultSet next失败."), + + REQUIRED_VALUE("DBUtilErrorCode-03", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("DBUtilErrorCode-02", "您填写的参数值不合法."), + ILLEGAL_SPLIT_PK("DBUtilErrorCode-04", "您填写的主键列不合法, DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型."), + SPLIT_FAILED_ILLEGAL_SQL("DBUtilErrorCode-15", "DataX尝试切分表时, 执行数据库 Sql 失败. 请检查您的配置 table/splitPk/where 并作出修改."), + SQL_EXECUTE_FAIL("DBUtilErrorCode-06", "执行数据库 Sql 失败, 请检查您的配置的 column/table/where/querySql或者向 DBA 寻求帮助."), + + // only for reader + READ_RECORD_FAIL("DBUtilErrorCode-07", "读取数据库数据失败. 请检查您的配置的 column/table/where/querySql或者向 DBA 寻求帮助."), + TABLE_QUERYSQL_MIXED("DBUtilErrorCode-08", "您配置凌乱了. 不能同时既配置table又配置querySql"), + TABLE_QUERYSQL_MISSING("DBUtilErrorCode-09", "您配置错误. 
table和querySql 应该并且只能配置一个."), + + // only for writer + WRITE_DATA_ERROR("DBUtilErrorCode-05", "往您配置的写入表中写入数据时失败."), + NO_INSERT_PRIVILEGE("DBUtilErrorCode-11", "数据库没有写权限,请联系DBA"), + NO_DELETE_PRIVILEGE("DBUtilErrorCode-16", "数据库没有DELETE权限,请联系DBA"), + ; + + private final String code; + + private final String description; + + private DBUtilErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DataBaseType.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DataBaseType.java new file mode 100755 index 000000000..d7f11edf1 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DataBaseType.java @@ -0,0 +1,195 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * refer:http://blog.csdn.net/ring0hx/article/details/6152528 + *

+ */ +public enum DataBaseType { + MySql("mysql", "com.mysql.jdbc.Driver"), + Tddl("mysql", "com.mysql.jdbc.Driver"), + DRDS("drds", "com.mysql.jdbc.Driver"), + Oracle("oracle", "oracle.jdbc.OracleDriver"), + SQLServer("sqlserver", "com.microsoft.sqlserver.jdbc.SQLServerDriver"), + PostgreSQL("postgresql", "org.postgresql.Driver"), + Sybase("sybase", "com.sybase.jdbc2.jdbc.SybDriver (com.sybase.jdbc3.jdbc.SybDriver)"), + DB2("db2", "com.ibm.db2.jcc.DB2Driver"), + ADS("ads","com.mysql.jdbc.Driver"); + + + + private String typeName; + private String driverClassName; + + DataBaseType(String typeName, String driverClassName) { + this.typeName = typeName; + this.driverClassName = driverClassName; + } + + public String getDriverClassName() { + return this.driverClassName; + } + + public String appendJDBCSuffixForReader(String jdbc) { + String result = jdbc; + String suffix = null; + switch (this) { + case MySql: + case DRDS: + suffix = "yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true"; + if (jdbc.contains("?")) { + result = jdbc + "&" + suffix; + } else { + result = jdbc + "?" + suffix; + } + break; + case Oracle: + break; + case SQLServer: + break; + case DB2: + break; + case PostgreSQL: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type."); + } + + return result; + } + + public String appendJDBCSuffixForWriter(String jdbc) { + String result = jdbc; + String suffix = null; + switch (this) { + case MySql: + suffix = "yearIsDateType=false&zeroDateTimeBehavior=convertToNull&rewriteBatchedStatements=true&tinyInt1isBit=false"; + if (jdbc.contains("?")) { + result = jdbc + "&" + suffix; + } else { + result = jdbc + "?" + suffix; + } + break; + case DRDS: + suffix = "yearIsDateType=false&zeroDateTimeBehavior=convertToNull"; + if (jdbc.contains("?")) { + result = jdbc + "&" + suffix; + } else { + result = jdbc + "?" 
+ suffix; + } + break; + case Oracle: + break; + case SQLServer: + break; + case DB2: + break; + case PostgreSQL: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type."); + } + + return result; + } + + public String formatPk(String splitPk) { + String result = splitPk; + + switch (this) { + case MySql: + case Oracle: + if (splitPk.length() >= 2 && splitPk.startsWith("`") && splitPk.endsWith("`")) { + result = splitPk.substring(1, splitPk.length() - 1).toLowerCase(); + } + break; + case SQLServer: + if (splitPk.length() >= 2 && splitPk.startsWith("[") && splitPk.endsWith("]")) { + result = splitPk.substring(1, splitPk.length() - 1).toLowerCase(); + } + break; + case DB2: + case PostgreSQL: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type."); + } + + return result; + } + + + public String quoteColumnName(String columnName) { + String result = columnName; + + switch (this) { + case MySql: + result = "`" + columnName.replace("`", "``") + "`"; + break; + case Oracle: + break; + case SQLServer: + result = "[" + columnName + "]"; + break; + case DB2: + case PostgreSQL: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type"); + } + + return result; + } + + public String quoteTableName(String tableName) { + String result = tableName; + + switch (this) { + case MySql: + result = "`" + tableName.replace("`", "``") + "`"; + break; + case Oracle: + break; + case SQLServer: + break; + case DB2: + break; + case PostgreSQL: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type"); + } + + return result; + } + + private static Pattern mysqlPattern = Pattern.compile("jdbc:mysql://(.+):\\d+/.+"); + private static Pattern oraclePattern = Pattern.compile("jdbc:oracle:thin:@(.+):\\d+:.+"); + + /** + * 注意:目前只实现了从 mysql/oracle 中识别出ip 信息.未识别到则返回 null. 
+ */ + public static String parseIpFromJdbcUrl(String jdbcUrl) { + Matcher mysql = mysqlPattern.matcher(jdbcUrl); + if (mysql.matches()) { + return mysql.group(1); + } + Matcher oracle = oraclePattern.matcher(jdbcUrl); + if (oracle.matches()) { + return oracle.group(1); + } + return null; + } + public String getTypeName() { + return typeName; + } + + public void setTypeName(String typeName) { + this.typeName = typeName; + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/JdbcConnectionFactory.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/JdbcConnectionFactory.java new file mode 100644 index 000000000..2fe3108ec --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/JdbcConnectionFactory.java @@ -0,0 +1,39 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import java.sql.Connection; + +/** + * Date: 15/3/16 下午3:12 + */ +public class JdbcConnectionFactory implements ConnectionFactory { + + private DataBaseType dataBaseType; + + private String jdbcUrl; + + private String userName; + + private String password; + + public JdbcConnectionFactory(DataBaseType dataBaseType, String jdbcUrl, String userName, String password) { + this.dataBaseType = dataBaseType; + this.jdbcUrl = jdbcUrl; + this.userName = userName; + this.password = password; + } + + @Override + public Connection getConnecttion() { + return DBUtil.getConnection(dataBaseType, jdbcUrl, userName, password); + } + + @Override + public Connection getConnecttionWithoutRetry() { + return DBUtil.getConnectionWithoutRetry(dataBaseType, jdbcUrl, userName, password); + } + + @Override + public String getConnectionInfo() { + return "jdbcUrl:" + jdbcUrl; + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsException.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsException.java new file mode 100644 index 000000000..4b6601adb --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsException.java @@ -0,0 +1,190 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by judy.lt on 2015/6/5. 
+ */ +public class RdbmsException extends DataXException{ + public RdbmsException(ErrorCode errorCode, String message){ + super(errorCode,message); + } + + public static DataXException asConnException(DataBaseType dataBaseType,Exception e,String userName,String dbName){ + if (dataBaseType.equals(DataBaseType.MySql)){ + DBUtilErrorCode dbUtilErrorCode = mySqlConnectionErrorAna(e.getMessage()); + if (dbUtilErrorCode == DBUtilErrorCode.MYSQL_CONN_DB_ERROR && dbName !=null ){ + return DataXException.asDataXException(dbUtilErrorCode,"该数据库名称为:"+dbName+" 具体错误信息为:"+e); + } + if (dbUtilErrorCode == DBUtilErrorCode.MYSQL_CONN_USERPWD_ERROR ){ + return DataXException.asDataXException(dbUtilErrorCode,"该数据库用户名为:"+userName+" 具体错误信息为:"+e); + } + return DataXException.asDataXException(dbUtilErrorCode," 具体错误信息为:"+e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + DBUtilErrorCode dbUtilErrorCode = oracleConnectionErrorAna(e.getMessage()); + if (dbUtilErrorCode == DBUtilErrorCode.ORACLE_CONN_DB_ERROR && dbName != null){ + return DataXException.asDataXException(dbUtilErrorCode,"该数据库名称为:"+dbName+" 具体错误信息为:"+e); + } + if (dbUtilErrorCode == DBUtilErrorCode.ORACLE_CONN_USERPWD_ERROR ){ + return DataXException.asDataXException(dbUtilErrorCode,"该数据库用户名为:"+userName+" 具体错误信息为:"+e); + } + return DataXException.asDataXException(dbUtilErrorCode," 具体错误信息为:"+e); + } + return DataXException.asDataXException(DBUtilErrorCode.CONN_DB_ERROR," 具体错误信息为:"+e); + } + + public static DBUtilErrorCode mySqlConnectionErrorAna(String e){ + if (e.contains(Constant.MYSQL_DATABASE)){ + return DBUtilErrorCode.MYSQL_CONN_DB_ERROR; + } + + if (e.contains(Constant.MYSQL_CONNEXP)){ + return DBUtilErrorCode.MYSQL_CONN_IPPORT_ERROR; + } + + if (e.contains(Constant.MYSQL_ACCDENIED)){ + return DBUtilErrorCode.MYSQL_CONN_USERPWD_ERROR; + } + + return DBUtilErrorCode.CONN_DB_ERROR; + } + + public static DBUtilErrorCode oracleConnectionErrorAna(String e){ + if (e.contains(Constant.ORACLE_DATABASE)){ + return DBUtilErrorCode.ORACLE_CONN_DB_ERROR; + } + + if (e.contains(Constant.ORACLE_CONNEXP)){ + return DBUtilErrorCode.ORACLE_CONN_IPPORT_ERROR; + } + + if (e.contains(Constant.ORACLE_ACCDENIED)){ + return DBUtilErrorCode.ORACLE_CONN_USERPWD_ERROR; + } + + return DBUtilErrorCode.CONN_DB_ERROR; + } + + public static DataXException asQueryException(DataBaseType dataBaseType, Exception e,String querySql,String table,String userName){ + if (dataBaseType.equals(DataBaseType.MySql)){ + DBUtilErrorCode dbUtilErrorCode = mySqlQueryErrorAna(e.getMessage()); + if (dbUtilErrorCode == DBUtilErrorCode.MYSQL_QUERY_TABLE_NAME_ERROR && table != null){ + return DataXException.asDataXException(dbUtilErrorCode,"表名为:"+table+" 执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + if (dbUtilErrorCode == DBUtilErrorCode.MYSQL_QUERY_SELECT_PRI_ERROR && userName != null){ + return DataXException.asDataXException(dbUtilErrorCode,"用户名为:"+userName+" 具体错误信息为:"+e); + } + + return DataXException.asDataXException(dbUtilErrorCode,"执行的SQL为: "+querySql+" 具体错误信息为:"+e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + DBUtilErrorCode dbUtilErrorCode = oracleQueryErrorAna(e.getMessage()); + if (dbUtilErrorCode == DBUtilErrorCode.ORACLE_QUERY_TABLE_NAME_ERROR && table != null){ + return DataXException.asDataXException(dbUtilErrorCode,"表名为:"+table+" 执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + if (dbUtilErrorCode == DBUtilErrorCode.ORACLE_QUERY_SELECT_PRI_ERROR){ + return DataXException.asDataXException(dbUtilErrorCode,"用户名为:"+userName+" 具体错误信息为:"+e); + } + + return 
DataXException.asDataXException(dbUtilErrorCode,"执行的SQL为: "+querySql+" 具体错误信息为:"+e); + + } + + return DataXException.asDataXException(DBUtilErrorCode.SQL_EXECUTE_FAIL, "执行的SQL为: "+querySql+" 具体错误信息为:"+e); + } + + public static DBUtilErrorCode mySqlQueryErrorAna(String e){ + if (e.contains(Constant.MYSQL_TABLE_NAME_ERR1) && e.contains(Constant.MYSQL_TABLE_NAME_ERR2)){ + return DBUtilErrorCode.MYSQL_QUERY_TABLE_NAME_ERROR; + }else if (e.contains(Constant.MYSQL_SELECT_PRI)){ + return DBUtilErrorCode.MYSQL_QUERY_SELECT_PRI_ERROR; + }else if (e.contains(Constant.MYSQL_COLUMN1) && e.contains(Constant.MYSQL_COLUMN2)){ + return DBUtilErrorCode.MYSQL_QUERY_COLUMN_ERROR; + }else if (e.contains(Constant.MYSQL_WHERE)){ + return DBUtilErrorCode.MYSQL_QUERY_SQL_ERROR; + } + return DBUtilErrorCode.READ_RECORD_FAIL; + } + + public static DBUtilErrorCode oracleQueryErrorAna(String e){ + if (e.contains(Constant.ORACLE_TABLE_NAME)){ + return DBUtilErrorCode.ORACLE_QUERY_TABLE_NAME_ERROR; + }else if (e.contains(Constant.ORACLE_SQL)){ + return DBUtilErrorCode.ORACLE_QUERY_SQL_ERROR; + }else if (e.contains(Constant.ORACLE_SELECT_PRI)){ + return DBUtilErrorCode.ORACLE_QUERY_SELECT_PRI_ERROR; + } + return DBUtilErrorCode.READ_RECORD_FAIL; + } + + public static DataXException asSqlParserException(DataBaseType dataBaseType, Exception e,String querySql){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_QUERY_SQL_PARSER_ERROR, "执行的SQL为:"+querySql+" 具体错误信息为:" + e); + } + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_QUERY_SQL_PARSER_ERROR,"执行的SQL为:"+querySql+" 具体错误信息为:" +e); + } + throw DataXException.asDataXException(DBUtilErrorCode.READ_RECORD_FAIL,"执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + + public static DataXException asPreSQLParserException(DataBaseType dataBaseType, Exception e,String querySql){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_PRE_SQL_ERROR, "执行的SQL为:"+querySql+" 具体错误信息为:" + e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_PRE_SQL_ERROR,"执行的SQL为:"+querySql+" 具体错误信息为:" +e); + } + throw DataXException.asDataXException(DBUtilErrorCode.READ_RECORD_FAIL,"执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + + public static DataXException asPostSQLParserException(DataBaseType dataBaseType, Exception e,String querySql){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_POST_SQL_ERROR, "执行的SQL为:"+querySql+" 具体错误信息为:" + e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_POST_SQL_ERROR,"执行的SQL为:"+querySql+" 具体错误信息为:" +e); + } + throw DataXException.asDataXException(DBUtilErrorCode.READ_RECORD_FAIL,"执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + + public static DataXException asInsertPriException(DataBaseType dataBaseType, String userName,String jdbcUrl){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_INSERT_ERROR, "用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_INSERT_ERROR,"用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + throw DataXException.asDataXException(DBUtilErrorCode.NO_INSERT_PRIVILEGE,"用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + + public static DataXException 
asDeletePriException(DataBaseType dataBaseType, String userName,String jdbcUrl){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_DELETE_ERROR, "用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_DELETE_ERROR,"用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + throw DataXException.asDataXException(DBUtilErrorCode.NO_DELETE_PRIVILEGE,"用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + + public static DataXException asSplitPKException(DataBaseType dataBaseType, Exception e,String splitSql,String splitPkID){ + if (dataBaseType.equals(DataBaseType.MySql)){ + + return DataXException.asDataXException(DBUtilErrorCode.MYSQL_SPLIT_PK_ERROR,"配置的SplitPK为: "+splitPkID+", 执行的SQL为: "+splitSql+" 具体错误信息为:"+e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + return DataXException.asDataXException(DBUtilErrorCode.ORACLE_SPLIT_PK_ERROR,"配置的SplitPK为: "+splitPkID+", 执行的SQL为: "+splitSql+" 具体错误信息为:"+e); + } + + return DataXException.asDataXException(DBUtilErrorCode.READ_RECORD_FAIL,splitSql+e); + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsRangeSplitWrap.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsRangeSplitWrap.java new file mode 100755 index 000000000..9d9c0aaf1 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsRangeSplitWrap.java @@ -0,0 +1,86 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.util.RangeSplitUtil; +import org.apache.commons.lang3.StringUtils; + +import java.math.BigInteger; +import java.util.ArrayList; +import java.util.List; + +public final class RdbmsRangeSplitWrap { + + public static List splitAndWrap(String left, String right, int expectSliceNumber, + String columnName, String quote, DataBaseType dataBaseType) { + String[] tempResult = RangeSplitUtil.doAsciiStringSplit(left, right, expectSliceNumber); + return RdbmsRangeSplitWrap.wrapRange(tempResult, columnName, quote, dataBaseType); + } + + // warn: do not use this method long->BigInteger + public static List splitAndWrap(long left, long right, int expectSliceNumber, String columnName) { + long[] tempResult = RangeSplitUtil.doLongSplit(left, right, expectSliceNumber); + return RdbmsRangeSplitWrap.wrapRange(tempResult, columnName); + } + + public static List splitAndWrap(BigInteger left, BigInteger right, int expectSliceNumber, String columnName) { + BigInteger[] tempResult = RangeSplitUtil.doBigIntegerSplit(left, right, expectSliceNumber); + return RdbmsRangeSplitWrap.wrapRange(tempResult, columnName); + } + + public static List wrapRange(long[] rangeResult, String columnName) { + String[] rangeStr = new String[rangeResult.length]; + for (int i = 0, len = rangeResult.length; i < len; i++) { + rangeStr[i] = String.valueOf(rangeResult[i]); + } + return wrapRange(rangeStr, columnName, "", null); + } + + public static List wrapRange(BigInteger[] rangeResult, String columnName) { + String[] rangeStr = new String[rangeResult.length]; + for (int i = 0, len = rangeResult.length; i < len; i++) { + rangeStr[i] = rangeResult[i].toString(); + } + return wrapRange(rangeStr, columnName, "", null); + } + + public static List wrapRange(String[] rangeResult, String columnName, + String quote, DataBaseType dataBaseType) { + if (null == rangeResult || rangeResult.length < 2) { + throw new IllegalArgumentException(String.format( + 
"Parameter rangeResult can not be null and its length can not <2. detail:rangeResult=[%s].", + StringUtils.join(rangeResult, ","))); + } + + List result = new ArrayList(); + + //TODO change to stringbuilder.append(..) + if (2 == rangeResult.length) { + result.add(String.format(" %s%s%s <= %s AND %s <= %s%s%s ", quote, quoteConstantValue(rangeResult[0], dataBaseType), + quote, columnName, columnName, quote, quoteConstantValue(rangeResult[1], dataBaseType), quote)); + return result; + } else { + for (int i = 0, len = rangeResult.length - 2; i < len; i++) { + result.add(String.format(" %s%s%s <= %s AND %s < %s%s%s ", quote, quoteConstantValue(rangeResult[i], dataBaseType), + quote, columnName, columnName, quote, quoteConstantValue(rangeResult[i + 1], dataBaseType), quote)); + } + + result.add(String.format(" %s%s%s <= %s AND %s <= %s%s%s ", quote, quoteConstantValue(rangeResult[rangeResult.length - 2], dataBaseType), + quote, columnName, columnName, quote, quoteConstantValue(rangeResult[rangeResult.length - 1], dataBaseType), quote)); + return result; + } + } + + private static String quoteConstantValue(String aString, DataBaseType dataBaseType) { + if (null == dataBaseType) { + return aString; + } + + if (dataBaseType.equals(DataBaseType.MySql)) { + return aString.replace("'", "''").replace("\\", "\\\\"); + } else if (dataBaseType.equals(DataBaseType.Oracle) || dataBaseType.equals(DataBaseType.SQLServer)) { + return aString.replace("'", "''"); + } else { + //TODO other type supported + return aString; + } + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/TableExpandUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/TableExpandUtil.java new file mode 100755 index 000000000..8d28ed4f0 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/TableExpandUtil.java @@ -0,0 +1,83 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import java.util.ArrayList; +import java.util.List; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public final class TableExpandUtil { + + // schema.table[0-2]more + // 1 2 3 4 5 + public static Pattern pattern = Pattern + .compile("(\\w+\\.)?(\\w+)\\[(\\d+)-(\\d+)\\](.*)"); + + private TableExpandUtil() { + } + + /** + * Split the table string(Usually contains names of some tables) to a List + * that is formated. example: table[0-32] will be splitted into `table0`, + * `table1`, `table2`, ... ,`table32` in {@link List} + * + * @param tables + * a string contains table name(one or many). + * @return a split result of table name. + *

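RdbmsRangeSplitWrap, defined just above, turns the raw split points produced by RangeSplitUtil into ready-made WHERE fragments: every inner slice uses a half-open <= / < pair and the final slice closes both ends with <=, so the slices cover the splitPk range without overlap. A small sketch of how a reader task might consume it (the exact split points depend on RangeSplitUtil):

    import com.alibaba.datax.plugin.rdbms.util.RdbmsRangeSplitWrap;

    import java.math.BigInteger;
    import java.util.List;

    public class RangeSplitDemo {
        public static void main(String[] args) {
            // split the splitPk range [0, 100] into 2 slices; each element is a WHERE fragment,
            // roughly " 0 <= id AND id < 50 " followed by " 50 <= id AND id <= 100 "
            List<String> ranges = RdbmsRangeSplitWrap.splitAndWrap(
                    BigInteger.ZERO, BigInteger.valueOf(100), 2, "id");
            for (String range : ranges) {
                System.out.println("SELECT * FROM t WHERE" + range);
            }
        }
    }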
+ * TODO 删除参数 DataBaseType + */ + public static List splitTables(DataBaseType dataBaseType, + String tables) { + List splittedTables = new ArrayList(); + + String[] tableArrays = tables.split(","); + + String tableName = null; + for (String tableArray : tableArrays) { + Matcher matcher = pattern.matcher(tableArray.trim()); + if (!matcher.matches()) { + tableName = tableArray.trim(); + splittedTables.add(tableName); + } else { + String start = matcher.group(3).trim(); + String end = matcher.group(4).trim(); + String tmp = ""; + if (Integer.valueOf(start) > Integer.valueOf(end)) { + tmp = start; + start = end; + end = tmp; + } + int len = start.length(); + String schema = null; + for (int k = Integer.valueOf(start); k <= Integer.valueOf(end); k++) { + schema = (null == matcher.group(1)) ? "" : matcher.group(1) + .trim(); + if (start.startsWith("0")) { + tableName = schema + matcher.group(2).trim() + + String.format("%0" + len + "d", k) + + matcher.group(5).trim(); + splittedTables.add(tableName); + } else { + tableName = schema + matcher.group(2).trim() + + String.format("%d", k) + + matcher.group(5).trim(); + splittedTables.add(tableName); + } + } + } + } + return splittedTables; + } + + public static List expandTableConf(DataBaseType dataBaseType, + List tables) { + List parsedTables = new ArrayList(); + for (String table : tables) { + List splittedTables = splitTables(dataBaseType, table); + parsedTables.addAll(splittedTables); + } + + return parsedTables; + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/CommonRdbmsWriter.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/CommonRdbmsWriter.java new file mode 100755 index 000000000..df298bf6b --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/CommonRdbmsWriter.java @@ -0,0 +1,547 @@ +package com.alibaba.datax.plugin.rdbms.writer; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.RdbmsException; +import com.alibaba.datax.plugin.rdbms.writer.util.OriginalConfPretreatmentUtil; +import com.alibaba.datax.plugin.rdbms.writer.util.WriterUtil; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.Triple; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.sql.PreparedStatement; +import java.sql.SQLException; +import java.sql.Types; +import java.util.ArrayList; +import java.util.List; + +public class CommonRdbmsWriter { + + public static class Job { + private DataBaseType dataBaseType; + + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + public Job(DataBaseType dataBaseType) { + this.dataBaseType = dataBaseType; + OriginalConfPretreatmentUtil.DATABASE_TYPE = this.dataBaseType; + } + + public void init(Configuration originalConfig) { + OriginalConfPretreatmentUtil.doPretreatment(originalConfig); + + LOG.debug("After job init(), originalConfig now is:[\n{}\n]", + originalConfig.toJSON()); + } + + /*目前只支持MySQL Writer跟Oracle 
Writer;检查PreSQL跟PostSQL语法以及insert,delete权限*/ + public void writerPreCheck(Configuration originalConfig,DataBaseType dataBaseType){ + /*检查PreSql跟PostSql语句*/ + prePostSqlValid(originalConfig,dataBaseType); + /*检查insert 跟delete权限*/ + privilegeValid(originalConfig,dataBaseType); + } + + public void prePostSqlValid(Configuration originalConfig,DataBaseType dataBaseType){ + /*检查PreSql跟PostSql语句*/ + WriterUtil.preCheckPrePareSQL(originalConfig, dataBaseType); + WriterUtil.preCheckPostSQL(originalConfig, dataBaseType); + } + + public void privilegeValid(Configuration originalConfig,DataBaseType dataBaseType){ + /*检查insert 跟delete权限*/ + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + List connections = originalConfig.getList(Constant.CONN_MARK, + Object.class); + + for (int i = 0, len = connections.size(); i < len; i++) { + Configuration connConf = Configuration.from(connections.get(i).toString()); + String jdbcUrl = connConf.getString(Key.JDBC_URL); + List expandedTables = connConf.getList(Key.TABLE, String.class); + boolean hasInsertPri = DBUtil.checkInsertPrivilege(dataBaseType,jdbcUrl,username,password,expandedTables); + + if(!hasInsertPri) { + throw RdbmsException.asInsertPriException(dataBaseType, originalConfig.getString(Key.USERNAME), jdbcUrl); + } + + if(DBUtil.needCheckDeletePrivilege(originalConfig)) { + boolean hasDeletePri = DBUtil.checkDeletePrivilege(dataBaseType,jdbcUrl, username, password, expandedTables); + if(!hasDeletePri) { + throw RdbmsException.asDeletePriException(dataBaseType, originalConfig.getString(Key.USERNAME), jdbcUrl); + } + } + } + } + + // 一般来说,是需要推迟到 task 中进行pre 的执行(单表情况例外) + public void prepare(Configuration originalConfig) { + int tableNumber = originalConfig.getInt(Constant.TABLE_NUMBER_MARK); + if (tableNumber == 1) { + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + + List conns = originalConfig.getList(Constant.CONN_MARK, + Object.class); + Configuration connConf = Configuration.from(conns.get(0) + .toString()); + + // 这里的 jdbcUrl 已经 append 了合适后缀参数 + String jdbcUrl = connConf.getString(Key.JDBC_URL); + originalConfig.set(Key.JDBC_URL, jdbcUrl); + + String table = connConf.getList(Key.TABLE, String.class).get(0); + originalConfig.set(Key.TABLE, table); + + List preSqls = originalConfig.getList(Key.PRE_SQL, + String.class); + List renderedPreSqls = WriterUtil.renderPreOrPostSqls( + preSqls, table); + + originalConfig.remove(Constant.CONN_MARK); + if (null != renderedPreSqls && !renderedPreSqls.isEmpty()) { + // 说明有 preSql 配置,则此处删除掉 + originalConfig.remove(Key.PRE_SQL); + + Connection conn = DBUtil.getConnection(dataBaseType, + jdbcUrl, username, password); + LOG.info("Begin to execute preSqls:[{}]. 
context info:{}.", + StringUtils.join(renderedPreSqls, ";"), jdbcUrl); + + WriterUtil.executeSqls(conn, renderedPreSqls, jdbcUrl,dataBaseType); + DBUtil.closeDBResources(null, null, conn); + } + } + + LOG.debug("After job prepare(), originalConfig now is:[\n{}\n]", + originalConfig.toJSON()); + } + + public List split(Configuration originalConfig, + int mandatoryNumber) { + return WriterUtil.doSplit(originalConfig, mandatoryNumber); + } + + // 一般来说,是需要推迟到 task 中进行post 的执行(单表情况例外) + public void post(Configuration originalConfig) { + int tableNumber = originalConfig.getInt(Constant.TABLE_NUMBER_MARK); + if (tableNumber == 1) { + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + + // 已经由 prepare 进行了appendJDBCSuffix处理 + String jdbcUrl = originalConfig.getString(Key.JDBC_URL); + + String table = originalConfig.getString(Key.TABLE); + + List postSqls = originalConfig.getList(Key.POST_SQL, + String.class); + List renderedPostSqls = WriterUtil.renderPreOrPostSqls( + postSqls, table); + + if (null != renderedPostSqls && !renderedPostSqls.isEmpty()) { + // 说明有 postSql 配置,则此处删除掉 + originalConfig.remove(Key.POST_SQL); + + Connection conn = DBUtil.getConnection(this.dataBaseType, + jdbcUrl, username, password); + + LOG.info( + "Begin to execute postSqls:[{}]. context info:{}.", + StringUtils.join(renderedPostSqls, ";"), jdbcUrl); + WriterUtil.executeSqls(conn, renderedPostSqls, jdbcUrl,dataBaseType); + DBUtil.closeDBResources(null, null, conn); + } + } + } + + public void destroy(Configuration originalConfig) { + } + + } + + public static class Task { + protected static final Logger LOG = LoggerFactory + .getLogger(Task.class); + + protected DataBaseType dataBaseType; + private static final String VALUE_HOLDER = "?"; + + protected String username; + protected String password; + protected String jdbcUrl; + protected String table; + protected List columns; + protected List preSqls; + protected List postSqls; + protected int batchSize; + protected int batchByteSize; + protected int columnNumber = 0; + protected TaskPluginCollector taskPluginCollector; + + // 作为日志显示信息时,需要附带的通用信息。比如信息所对应的数据库连接等信息,针对哪个表做的操作 + protected static String BASIC_MESSAGE; + + protected static String INSERT_OR_REPLACE_TEMPLATE; + + protected String writeRecordSql; + protected String writeMode; + protected boolean emptyAsNull; + protected Triple, List, List> resultSetMetaData; + + public Task(DataBaseType dataBaseType) { + this.dataBaseType = dataBaseType; + } + + public void init(Configuration writerSliceConfig) { + this.username = writerSliceConfig.getString(Key.USERNAME); + this.password = writerSliceConfig.getString(Key.PASSWORD); + this.jdbcUrl = writerSliceConfig.getString(Key.JDBC_URL); + this.table = writerSliceConfig.getString(Key.TABLE); + + this.columns = writerSliceConfig.getList(Key.COLUMN, String.class); + this.columnNumber = this.columns.size(); + + this.preSqls = writerSliceConfig.getList(Key.PRE_SQL, String.class); + this.postSqls = writerSliceConfig.getList(Key.POST_SQL, String.class); + this.batchSize = writerSliceConfig.getInt(Key.BATCH_SIZE, Constant.DEFAULT_BATCH_SIZE); + this.batchByteSize = writerSliceConfig.getInt(Key.BATCH_BYTE_SIZE, Constant.DEFAULT_BATCH_BYTE_SIZE); + + writeMode = writerSliceConfig.getString(Key.WRITE_MODE, "INSERT"); + emptyAsNull = writerSliceConfig.getBool(Key.EMPTY_AS_NULL, true); + INSERT_OR_REPLACE_TEMPLATE = writerSliceConfig.getString(Constant.INSERT_OR_REPLACE_TEMPLATE_MARK); + this.writeRecordSql = 
String.format(INSERT_OR_REPLACE_TEMPLATE, this.table); + + BASIC_MESSAGE = String.format("jdbcUrl:[%s], table:[%s]", + this.jdbcUrl, this.table); + } + + public void prepare(Configuration writerSliceConfig) { + Connection connection = DBUtil.getConnection(this.dataBaseType, + this.jdbcUrl, username, password); + + DBUtil.dealWithSessionConfig(connection, writerSliceConfig, + this.dataBaseType, BASIC_MESSAGE); + + int tableNumber = writerSliceConfig.getInt( + Constant.TABLE_NUMBER_MARK); + if (tableNumber != 1) { + LOG.info("Begin to execute preSqls:[{}]. context info:{}.", + StringUtils.join(this.preSqls, ";"), BASIC_MESSAGE); + WriterUtil.executeSqls(connection, this.preSqls, BASIC_MESSAGE,dataBaseType); + } + + DBUtil.closeDBResources(null, null, connection); + } + + public void startWriteWithConnection(RecordReceiver recordReceiver, TaskPluginCollector taskPluginCollector, Connection connection) { + this.taskPluginCollector = taskPluginCollector; + + // 用于写入数据的时候的类型根据目的表字段类型转换 + this.resultSetMetaData = DBUtil.getColumnMetaData(connection, + this.table, StringUtils.join(this.columns, ",")); + // 写数据库的SQL语句 + calcWriteRecordSql(); + + List writeBuffer = new ArrayList(this.batchSize); + int bufferBytes = 0; + try { + Record record; + while ((record = recordReceiver.getFromReader()) != null) { + if (record.getColumnNumber() != this.columnNumber) { + // 源头读取字段列数与目的表字段写入列数不相等,直接报错 + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "列配置信息有错误. 因为您配置的任务中,源头读取字段数:%s 与 目的表要写入的字段数:%s 不相等. 请检查您的配置并作出修改.", + record.getColumnNumber(), + this.columnNumber)); + } + + writeBuffer.add(record); + bufferBytes += record.getMemorySize(); + + if (writeBuffer.size() >= batchSize || bufferBytes >= batchByteSize) { + doBatchInsert(connection, writeBuffer); + writeBuffer.clear(); + bufferBytes = 0; + } + } + if (!writeBuffer.isEmpty()) { + doBatchInsert(connection, writeBuffer); + writeBuffer.clear(); + bufferBytes = 0; + } + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + writeBuffer.clear(); + bufferBytes = 0; + DBUtil.closeDBResources(null, null, connection); + } + } + + // TODO 改用连接池,确保每次获取的连接都是可用的(注意:连接可能需要每次都初始化其 session) + public void startWrite(RecordReceiver recordReceiver, + Configuration writerSliceConfig, + TaskPluginCollector taskPluginCollector) { + Connection connection = DBUtil.getConnection(this.dataBaseType, + this.jdbcUrl, username, password); + DBUtil.dealWithSessionConfig(connection, writerSliceConfig, + this.dataBaseType, BASIC_MESSAGE); + startWriteWithConnection(recordReceiver, taskPluginCollector, connection); + } + + + public void post(Configuration writerSliceConfig) { + int tableNumber = writerSliceConfig.getInt( + Constant.TABLE_NUMBER_MARK); + + boolean hasPostSql = (this.postSqls != null && this.postSqls.size() > 0); + if (tableNumber == 1 || !hasPostSql) { + return; + } + + Connection connection = DBUtil.getConnection(this.dataBaseType, + this.jdbcUrl, username, password); + + LOG.info("Begin to execute postSqls:[{}]. 
context info:{}.", + StringUtils.join(this.postSqls, ";"), BASIC_MESSAGE); + WriterUtil.executeSqls(connection, this.postSqls, BASIC_MESSAGE,dataBaseType); + DBUtil.closeDBResources(null, null, connection); + } + + public void destroy(Configuration writerSliceConfig) { + } + + protected void doBatchInsert(Connection connection, List buffer) + throws SQLException { + PreparedStatement preparedStatement = null; + try { + connection.setAutoCommit(false); + preparedStatement = connection + .prepareStatement(this.writeRecordSql); + + for (Record record : buffer) { + preparedStatement = fillPreparedStatement( + preparedStatement, record); + preparedStatement.addBatch(); + } + preparedStatement.executeBatch(); + connection.commit(); + } catch (SQLException e) { + LOG.warn("回滚此次写入, 采用每次写入一行方式提交. 因为:" + e.getMessage()); + connection.rollback(); + doOneInsert(connection, buffer); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + DBUtil.closeDBResources(preparedStatement, null); + } + } + + protected void doOneInsert(Connection connection, List buffer) { + PreparedStatement preparedStatement = null; + try { + connection.setAutoCommit(true); + preparedStatement = connection + .prepareStatement(this.writeRecordSql); + + for (Record record : buffer) { + try { + preparedStatement = fillPreparedStatement( + preparedStatement, record); + preparedStatement.execute(); + } catch (SQLException e) { + LOG.debug(e.toString()); + + this.taskPluginCollector.collectDirtyRecord(record, e); + } finally { + // 最后不要忘了关闭 preparedStatement + preparedStatement.clearParameters(); + } + } + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + DBUtil.closeDBResources(preparedStatement, null); + } + } + + // 直接使用了两个类变量:columnNumber,resultSetMetaData + protected PreparedStatement fillPreparedStatement(PreparedStatement preparedStatement, Record record) + throws SQLException { + for (int i = 0; i < this.columnNumber; i++) { + int columnSqltype = this.resultSetMetaData.getMiddle().get(i); + preparedStatement = fillPreparedStatementColumnType(preparedStatement, i, columnSqltype, record.getColumn(i)); + } + + return preparedStatement; + } + + protected PreparedStatement fillPreparedStatementColumnType(PreparedStatement preparedStatement, int columnIndex, int columnSqltype, Column column) throws SQLException { + java.util.Date utilDate; + switch (columnSqltype) { + case Types.CHAR: + case Types.NCHAR: + case Types.CLOB: + case Types.NCLOB: + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + preparedStatement.setString(columnIndex + 1, column + .asString()); + break; + + case Types.SMALLINT: + case Types.INTEGER: + case Types.BIGINT: + case Types.NUMERIC: + case Types.DECIMAL: + case Types.FLOAT: + case Types.REAL: + case Types.DOUBLE: + String strValue = column.asString(); + if(emptyAsNull && "".equals(strValue)){ + preparedStatement.setString(columnIndex + 1, null); + }else{ + preparedStatement.setString(columnIndex + 1, strValue); + } + break; + + //tinyint is a little special in some database like mysql {boolean->tinyint(1)} + case Types.TINYINT: + Long longValue = column.asLong(); + if (null == longValue) { + preparedStatement.setString(columnIndex + 1, null); + } else { + preparedStatement.setString(columnIndex + 1, longValue.toString()); + } + break; + + // for mysql bug, see http://bugs.mysql.com/bug.php?id=35115 + case Types.DATE: + if 
(this.resultSetMetaData.getRight().get(columnIndex) + .equalsIgnoreCase("year")) { + if (column.asBigInteger() == null) { + preparedStatement.setString(columnIndex + 1, null); + } else { + preparedStatement.setInt(columnIndex + 1, column.asBigInteger().intValue()); + } + } else { + java.sql.Date sqlDate = null; + try { + utilDate = column.asDate(); + } catch (DataXException e) { + throw new SQLException(String.format( + "Date 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlDate = new java.sql.Date(utilDate.getTime()); + } + preparedStatement.setDate(columnIndex + 1, sqlDate); + } + break; + + case Types.TIME: + java.sql.Time sqlTime = null; + try { + utilDate = column.asDate(); + } catch (DataXException e) { + throw new SQLException(String.format( + "TIME 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlTime = new java.sql.Time(utilDate.getTime()); + } + preparedStatement.setTime(columnIndex + 1, sqlTime); + break; + + case Types.TIMESTAMP: + java.sql.Timestamp sqlTimestamp = null; + try { + utilDate = column.asDate(); + } catch (DataXException e) { + throw new SQLException(String.format( + "TIMESTAMP 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlTimestamp = new java.sql.Timestamp( + utilDate.getTime()); + } + preparedStatement.setTimestamp(columnIndex + 1, sqlTimestamp); + break; + + case Types.BINARY: + case Types.VARBINARY: + case Types.BLOB: + case Types.LONGVARBINARY: + preparedStatement.setBytes(columnIndex + 1, column + .asBytes()); + break; + + case Types.BOOLEAN: + preparedStatement.setString(columnIndex + 1, column.asString()); + break; + + // warn: bit(1) -> Types.BIT 可使用setBoolean + // warn: bit(>1) -> Types.VARBINARY 可使用setBytes + case Types.BIT: + if (this.dataBaseType == DataBaseType.MySql) { + preparedStatement.setBoolean(columnIndex + 1, column.asBoolean()); + } else { + preparedStatement.setString(columnIndex + 1, column.asString()); + } + break; + default: + throw DataXException + .asDataXException( + DBUtilErrorCode.UNSUPPORTED_TYPE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 字段名:[%s], 字段类型:[%d], 字段Java类型:[%s]. 请修改表中该字段的类型或者不同步该字段.", + this.resultSetMetaData.getLeft() + .get(columnIndex), + this.resultSetMetaData.getMiddle() + .get(columnIndex), + this.resultSetMetaData.getRight() + .get(columnIndex))); + } + return preparedStatement; + } + + private void calcWriteRecordSql() { + if (!VALUE_HOLDER.equals(calcValueHolder(""))) { + List valueHolders = new ArrayList(columnNumber); + for (int i = 0; i < columns.size(); i++) { + String type = resultSetMetaData.getRight().get(i); + valueHolders.add(calcValueHolder(type)); + } + INSERT_OR_REPLACE_TEMPLATE = WriterUtil.getWriteTemplate(columns, valueHolders, writeMode); + writeRecordSql = String.format(INSERT_OR_REPLACE_TEMPLATE, this.table); + } + } + + protected String calcValueHolder(String columnType) { + return VALUE_HOLDER; + } + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Constant.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Constant.java new file mode 100755 index 000000000..fcb3eb4a7 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Constant.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.plugin.rdbms.writer; + +/** + * 用于插件解析用户配置时,需要进行标识(MARK)的常量的声明. 
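The write path above follows a buffer-then-batch pattern: records are buffered until batchSize or batchByteSize is reached, pushed as one transactional JDBC batch, and if the batch fails the transaction is rolled back and the buffer is replayed row by row, so that only the genuinely dirty records are discarded (handed to taskPluginCollector). A simplified, generic sketch of that pattern in plain JDBC, not the actual Task code:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    class BatchThenRowByRowSketch {
        void write(Connection conn, String insertSql, List<String[]> rows) throws SQLException {
            try {
                conn.setAutoCommit(false);
                try (PreparedStatement ps = conn.prepareStatement(insertSql)) {
                    for (String[] row : rows) {
                        for (int i = 0; i < row.length; i++) {
                            ps.setString(i + 1, row[i]);
                        }
                        ps.addBatch();
                    }
                    ps.executeBatch();       // one round trip for the whole buffer
                }
                conn.commit();
            } catch (SQLException batchFailed) {
                conn.rollback();             // give up on the batch ...
                conn.setAutoCommit(true);
                try (PreparedStatement ps = conn.prepareStatement(insertSql)) {
                    for (String[] row : rows) {   // ... and retry one record at a time
                        try {
                            for (int i = 0; i < row.length; i++) {
                                ps.setString(i + 1, row[i]);
                            }
                            ps.execute();
                        } catch (SQLException dirtyRecord) {
                            // the real Task reports this record via taskPluginCollector
                        } finally {
                            ps.clearParameters();
                        }
                    }
                }
            }
        }
    }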
+ */ +public final class Constant { + public static final int DEFAULT_BATCH_SIZE = 2048; + + public static final int DEFAULT_BATCH_BYTE_SIZE = 32 * 1024 * 1024; + + public static String TABLE_NAME_PLACEHOLDER = "@table"; + + public static String CONN_MARK = "connection"; + + public static String TABLE_NUMBER_MARK = "tableNumber"; + + public static String INSERT_OR_REPLACE_TEMPLATE_MARK = "insertOrReplaceTemplate"; + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Key.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Key.java new file mode 100755 index 000000000..25a2ab52f --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Key.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.plugin.rdbms.writer; + +public final class Key { + public final static String JDBC_URL = "jdbcUrl"; + + public final static String USERNAME = "username"; + + public final static String PASSWORD = "password"; + + public final static String TABLE = "table"; + + public final static String COLUMN = "column"; + + //可选值为:insert,replace,默认为 insert (mysql 支持,oracle 没用 replace 机制,只能 insert,oracle 可以不暴露这个参数) + public final static String WRITE_MODE = "writeMode"; + + public final static String PRE_SQL = "preSql"; + + public final static String POST_SQL = "postSql"; + + public final static String TDDL_APP_NAME = "appName"; + + //默认值:256 + public final static String BATCH_SIZE = "batchSize"; + + //默认值:32m + public final static String BATCH_BYTE_SIZE = "batchByteSize"; + + public final static String EMPTY_AS_NULL = "emptyAsNull"; + + public final static String DB_NAME_PATTERN = "dbNamePattern"; + + public final static String DB_RULE = "dbRule"; + + public final static String TABLE_NAME_PATTERN = "tableNamePattern"; + + public final static String TABLE_RULE = "tableRule"; + + public final static String DRYRUN = "dryRun"; +} \ No newline at end of file diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/MysqlWriterErrorCode.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/MysqlWriterErrorCode.java new file mode 100755 index 000000000..523292ad0 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/MysqlWriterErrorCode.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.rdbms.writer; + +import com.alibaba.datax.common.spi.ErrorCode; + +//TODO 后续考虑与 util 包种的 DBUTilErrorCode 做合并.(区分读和写的错误码) +public enum MysqlWriterErrorCode implements ErrorCode { + ; + + private final String code; + private final String describe; + + private MysqlWriterErrorCode(String code, String describe) { + this.code = code; + this.describe = describe; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.describe; + } + + @Override + public String toString() { + return String.format("Code:[%s], Describe:[%s]. 
", this.code, + this.describe); + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/OriginalConfPretreatmentUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/OriginalConfPretreatmentUtil.java new file mode 100755 index 000000000..292d92e58 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/OriginalConfPretreatmentUtil.java @@ -0,0 +1,160 @@ +package com.alibaba.datax.plugin.rdbms.writer.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.ListUtil; +import com.alibaba.datax.plugin.rdbms.util.*; +import com.alibaba.datax.plugin.rdbms.writer.Constant; +import com.alibaba.datax.plugin.rdbms.writer.Key; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +public final class OriginalConfPretreatmentUtil { + private static final Logger LOG = LoggerFactory + .getLogger(OriginalConfPretreatmentUtil.class); + + public static DataBaseType DATABASE_TYPE; + + public static void doPretreatment(Configuration originalConfig) { + // 检查 username/password 配置(必填) + originalConfig.getNecessaryValue(Key.USERNAME, DBUtilErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.PASSWORD, DBUtilErrorCode.REQUIRED_VALUE); + + doCheckBatchSize(originalConfig); + + simplifyConf(originalConfig); + + dealColumnConf(originalConfig); + dealWriteMode(originalConfig); + } + + public static void doCheckBatchSize(Configuration originalConfig) { + // 检查batchSize 配置(选填,如果未填写,则设置为默认值) + int batchSize = originalConfig.getInt(Key.BATCH_SIZE, Constant.DEFAULT_BATCH_SIZE); + if (batchSize < 1) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, String.format( + "您的batchSize配置有误. 您所配置的写入数据库表的 batchSize:%s 不能小于1. 推荐配置范围为:[100-1000], 该值越大, 内存溢出可能性越大. 请检查您的配置并作出修改.", + batchSize)); + } + + originalConfig.set(Key.BATCH_SIZE, batchSize); + } + + public static void simplifyConf(Configuration originalConfig) { + List connections = originalConfig.getList(Constant.CONN_MARK, + Object.class); + + int tableNum = 0; + + for (int i = 0, len = connections.size(); i < len; i++) { + Configuration connConf = Configuration.from(connections.get(i).toString()); + + String jdbcUrl = connConf.getString(Key.JDBC_URL); + if (StringUtils.isBlank(jdbcUrl)) { + throw DataXException.asDataXException(DBUtilErrorCode.REQUIRED_VALUE, "您未配置的写入数据库表的 jdbcUrl."); + } + + jdbcUrl = DATABASE_TYPE.appendJDBCSuffixForReader(jdbcUrl); + originalConfig.set(String.format("%s[%d].%s", Constant.CONN_MARK, i, Key.JDBC_URL), + jdbcUrl); + + List tables = connConf.getList(Key.TABLE, String.class); + + if (null == tables || tables.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.REQUIRED_VALUE, + "您未配置写入数据库表的表名称. 根据配置DataX找不到您配置的表. 请检查您的配置并作出修改."); + } + + // 对每一个connection 上配置的table 项进行解析 + List expandedTables = TableExpandUtil + .expandTableConf(DATABASE_TYPE, tables); + + if (null == expandedTables || expandedTables.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, + "您配置的写入数据库表名称错误. 
DataX找不到您配置的表,请检查您的配置并作出修改."); + } + + tableNum += expandedTables.size(); + + originalConfig.set(String.format("%s[%d].%s", Constant.CONN_MARK, + i, Key.TABLE), expandedTables); + } + + originalConfig.set(Constant.TABLE_NUMBER_MARK, tableNum); + } + + public static void dealColumnConf(Configuration originalConfig, ConnectionFactory connectionFactory, String oneTable) { + List userConfiguredColumns = originalConfig.getList(Key.COLUMN, String.class); + if (null == userConfiguredColumns || userConfiguredColumns.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的配置文件中的列配置信息有误. 因为您未配置写入数据库表的列名称,DataX获取不到列信息. 请检查您的配置并作出修改."); + } else { + boolean isPreCheck = originalConfig.getBool(Key.DRYRUN, false); + List allColumns; + if (isPreCheck){ + allColumns = DBUtil.getTableColumnsByConn(DATABASE_TYPE,connectionFactory.getConnecttionWithoutRetry(), oneTable, connectionFactory.getConnectionInfo()); + }else{ + allColumns = DBUtil.getTableColumnsByConn(DATABASE_TYPE,connectionFactory.getConnecttion(), oneTable, connectionFactory.getConnectionInfo()); + } + + LOG.info("table:[{}] all columns:[\n{}\n].", oneTable, + StringUtils.join(allColumns, ",")); + + if (1 == userConfiguredColumns.size() && "*".equals(userConfiguredColumns.get(0))) { + LOG.warn("您的配置文件中的列配置信息存在风险. 因为您配置的写入数据库表的列为*,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改."); + + // 回填其值,需要以 String 的方式转交后续处理 + originalConfig.set(Key.COLUMN, allColumns); + } else if (userConfiguredColumns.size() > allColumns.size()) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + String.format("您的配置文件中的列配置信息有误. 因为您所配置的写入数据库表的字段个数:%s 大于目的表的总字段总个数:%s. 请检查您的配置并作出修改.", + userConfiguredColumns.size(), allColumns.size())); + } else { + // 确保用户配置的 column 不重复 + ListUtil.makeSureNoValueDuplicate(userConfiguredColumns, false); + + // 检查列是否都为数据库表中正确的列(通过执行一次 select column from table 进行判断) + DBUtil.getColumnMetaData(connectionFactory.getConnecttion(), oneTable,StringUtils.join(userConfiguredColumns, ",")); + } + } + } + + public static void dealColumnConf(Configuration originalConfig) { + String jdbcUrl = originalConfig.getString(String.format("%s[0].%s", + Constant.CONN_MARK, Key.JDBC_URL)); + + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + String oneTable = originalConfig.getString(String.format( + "%s[0].%s[0]", Constant.CONN_MARK, Key.TABLE)); + + JdbcConnectionFactory jdbcConnectionFactory = new JdbcConnectionFactory(DATABASE_TYPE, jdbcUrl, username, password); + dealColumnConf(originalConfig, jdbcConnectionFactory, oneTable); + } + + public static void dealWriteMode(Configuration originalConfig) { + List columns = originalConfig.getList(Key.COLUMN, String.class); + + String jdbcUrl = originalConfig.getString(String.format("%s[0].%s", + Constant.CONN_MARK, Key.JDBC_URL, String.class)); + + // 默认为:insert 方式 + String writeMode = originalConfig.getString(Key.WRITE_MODE, "INSERT"); + + List valueHolders = new ArrayList(columns.size()); + for(int i=0; i doSplit(Configuration simplifiedConf, + int adviceNumber) { + + List splitResultConfigs = new ArrayList(); + + int tableNumber = simplifiedConf.getInt(Constant.TABLE_NUMBER_MARK); + + //处理单表的情况 + if (tableNumber == 1) { + //由于在之前的 master prepare 中已经把 table,jdbcUrl 提取出来,所以这里处理十分简单 + for (int j = 0; j < adviceNumber; j++) { + splitResultConfigs.add(simplifiedConf.clone()); + } + + return splitResultConfigs; + } + + if (tableNumber != adviceNumber) { + throw 
DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, + String.format("您的配置文件中的列配置信息有误. 您要写入的目的端的表个数是:%s , 但是根据系统建议需要切分的份数是:%s. 请检查您的配置并作出修改.", + tableNumber, adviceNumber)); + } + + String jdbcUrl; + List preSqls = simplifiedConf.getList(Key.PRE_SQL, String.class); + List postSqls = simplifiedConf.getList(Key.POST_SQL, String.class); + + List conns = simplifiedConf.getList(Constant.CONN_MARK, + Object.class); + + for (Object conn : conns) { + Configuration sliceConfig = simplifiedConf.clone(); + + Configuration connConf = Configuration.from(conn.toString()); + jdbcUrl = connConf.getString(Key.JDBC_URL); + sliceConfig.set(Key.JDBC_URL, jdbcUrl); + + sliceConfig.remove(Constant.CONN_MARK); + + List tables = connConf.getList(Key.TABLE, String.class); + + for (String table : tables) { + Configuration tempSlice = sliceConfig.clone(); + tempSlice.set(Key.TABLE, table); + tempSlice.set(Key.PRE_SQL, renderPreOrPostSqls(preSqls, table)); + tempSlice.set(Key.POST_SQL, renderPreOrPostSqls(postSqls, table)); + + splitResultConfigs.add(tempSlice); + } + + } + + return splitResultConfigs; + } + + public static List renderPreOrPostSqls(List preOrPostSqls, String tableName) { + if (null == preOrPostSqls) { + return Collections.emptyList(); + } + + List renderedSqls = new ArrayList(); + for (String sql : preOrPostSqls) { + //preSql为空时,不加入执行队列 + if (StringUtils.isNotBlank(sql)) { + renderedSqls.add(sql.replace(Constant.TABLE_NAME_PLACEHOLDER, tableName)); + } + } + + return renderedSqls; + } + + public static void executeSqls(Connection conn, List sqls, String basicMessage,DataBaseType dataBaseType) { + Statement stmt = null; + String currentSql = null; + try { + stmt = conn.createStatement(); + for (String sql : sqls) { + currentSql = sql; + DBUtil.executeSqlWithoutResultSet(stmt, sql); + } + } catch (Exception e) { + throw RdbmsException.asQueryException(dataBaseType,e,currentSql,null,null); + } finally { + DBUtil.closeDBResources(null, stmt, null); + } + } + + public static String getWriteTemplate(List columnHolders, List valueHolders, String writeMode){ + boolean isWriteModeLegal = writeMode.trim().toLowerCase().startsWith("insert") + || writeMode.trim().toLowerCase().startsWith("replace"); + + if (!isWriteModeLegal) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + String.format("您所配置的 writeMode:%s 错误. 因为DataX 目前仅支持replace 或 insert 方式. 
请检查您的配置并作出修改.", writeMode)); + } + + String writeDataSqlTemplate = new StringBuilder().append(writeMode) + .append(" INTO %s (").append(StringUtils.join(columnHolders, ",")) + .append(") VALUES(").append(StringUtils.join(valueHolders, ",")) + .append(")").toString(); + + return writeDataSqlTemplate; + } + + public static void preCheckPrePareSQL(Configuration originalConfig, DataBaseType type) { + List conns = originalConfig.getList(Constant.CONN_MARK, Object.class); + Configuration connConf = Configuration.from(conns.get(0).toString()); + String table = connConf.getList(Key.TABLE, String.class).get(0); + + List preSqls = originalConfig.getList(Key.PRE_SQL, + String.class); + List renderedPreSqls = WriterUtil.renderPreOrPostSqls( + preSqls, table); + + if (null != renderedPreSqls && !renderedPreSqls.isEmpty()) { + LOG.info("Begin to preCheck preSqls:[{}].", + StringUtils.join(renderedPreSqls, ";")); + for(String sql : renderedPreSqls) { + try{ + DBUtil.sqlValid(sql, type); + }catch(ParserException e) { + throw RdbmsException.asPreSQLParserException(type,e,sql); + } + } + } + } + + public static void preCheckPostSQL(Configuration originalConfig, DataBaseType type) { + List conns = originalConfig.getList(Constant.CONN_MARK, Object.class); + Configuration connConf = Configuration.from(conns.get(0).toString()); + String table = connConf.getList(Key.TABLE, String.class).get(0); + + List postSqls = originalConfig.getList(Key.POST_SQL, + String.class); + List renderedPostSqls = WriterUtil.renderPreOrPostSqls( + postSqls, table); + if (null != renderedPostSqls && !renderedPostSqls.isEmpty()) { + + LOG.info("Begin to preCheck postSqls:[{}].", + StringUtils.join(renderedPostSqls, ";")); + for(String sql : renderedPostSqls) { + try{ + DBUtil.sqlValid(sql, type); + }catch(ParserException e){ + throw RdbmsException.asPostSQLParserException(type,e,sql); + } + + } + } + } + + +} diff --git a/plugin-unstructured-storage-util/plugin-unstructured-storage-util.iml b/plugin-unstructured-storage-util/plugin-unstructured-storage-util.iml new file mode 100644 index 000000000..a3a0c5f21 --- /dev/null +++ b/plugin-unstructured-storage-util/plugin-unstructured-storage-util.iml @@ -0,0 +1,29 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/plugin-unstructured-storage-util/pom.xml b/plugin-unstructured-storage-util/pom.xml new file mode 100755 index 000000000..3954a8ea9 --- /dev/null +++ b/plugin-unstructured-storage-util/pom.xml @@ -0,0 +1,59 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + plugin-unstructured-storage-util + plugin-unstructured-storage-util + plugin-unstructured-storage-util通用的文件类型的读取写入方法,供TxtFileReader/Writer, OSSReader/Writer 使用。 + jar + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + net.sourceforge.javacsv + javacsv + 2.0 + + + org.apache.commons + commons-compress + 1.9 + + + org.anarres.lzo + lzo-core + 1.0.1 + + + junit + junit + test + + + \ No newline at end of file diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Constant.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Constant.java new file mode 100755 index 000000000..a34378b3d --- /dev/null +++ 
b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Constant.java @@ -0,0 +1,11 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +public class Constant { + public static final String DEFAULT_ENCODING = "UTF-8"; + + public static final char DEFAULT_FIELD_DELIMITER = ','; + + public static final boolean DEFAULT_SKIP_HEADER = false; + + public static final String DEFAULT_NULL_FORMAT = "\\N"; +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Key.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Key.java new file mode 100755 index 000000000..96f150636 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Key.java @@ -0,0 +1,28 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +/** + * Created by haiwei.luo on 14-12-5. + */ +public class Key { + public static final String COLUMN = "column"; + + public static final String ENCODING = "encoding"; + + public static final String FIELD_DELIMITER = "fieldDelimiter"; + + public static final String SKIP_HEADER = "skipHeader"; + + public static final String TYPE = "type"; + + public static final String FORMAT = "format"; + + public static final String INDEX = "index"; + + public static final String VALUE = "value"; + + public static final String COMPRESS = "compress"; + + public static final String NULL_FORMAT = "nullFormat"; + + +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderErrorCode.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderErrorCode.java new file mode 100755 index 000000000..911bf3457 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-20. 
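Constant and Key above define the knobs an unstructured-storage reader slice understands (column, encoding, fieldDelimiter, skipHeader, compress, nullFormat) together with their fallbacks. A hedged sketch of how a slice configuration resolves against those defaults, using a hypothetical JSON fragment:

    import com.alibaba.datax.common.util.Configuration;
    import com.alibaba.datax.plugin.unstructuredstorage.reader.Constant;
    import com.alibaba.datax.plugin.unstructuredstorage.reader.Key;

    public class ReaderConfDemo {
        public static void main(String[] args) {
            // hypothetical slice config; unset keys fall back to the defaults in Constant
            Configuration conf = Configuration.from("{\"encoding\":\"GBK\",\"skipHeader\":true}");

            String encoding = conf.getString(Key.ENCODING, Constant.DEFAULT_ENCODING);             // GBK
            Character delim = conf.getChar(Key.FIELD_DELIMITER, Constant.DEFAULT_FIELD_DELIMITER); // ','
            Boolean skip    = conf.getBool(Key.SKIP_HEADER, Constant.DEFAULT_SKIP_HEADER);         // true
            String nullFmt  = conf.getString(Key.NULL_FORMAT, Constant.DEFAULT_NULL_FORMAT);       // \N

            System.out.println(encoding + " " + delim + " " + skip + " " + nullFmt);
        }
    }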
+ */ +public enum UnstructuredStorageReaderErrorCode implements ErrorCode { + CONFIG_INVALID_EXCEPTION("UnstructuredStorageReader-00", "您的参数配置错误."), + NOT_SUPPORT_TYPE("UnstructuredStorageReader-01","您配置的列类型暂不支持."), + REQUIRED_VALUE("UnstructuredStorageReader-02", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("UnstructuredStorageReader-03", "您填写的参数值不合法."), + MIXED_INDEX_VALUE("UnstructuredStorageReader-04", "您的列信息配置同时包含了index,value."), + NO_INDEX_VALUE("UnstructuredStorageReader-05","您明确的配置列信息,但未填写相应的index,value."), + FILE_NOT_EXISTS("UnstructuredStorageReader-06", "您配置的源路径不存在."), + OPEN_FILE_WITH_CHARSET_ERROR("UnstructuredStorageReader-07", "您配置的编码和实际存储编码不符合."), + OPEN_FILE_ERROR("UnstructuredStorageReader-08", "您配置的源在打开时异常,建议您检查源源是否有隐藏实体,管道文件等特殊文件."), + READ_FILE_IO_ERROR("UnstructuredStorageReader-09", "您配置的文件在读取时出现IO异常."), + SECURITY_NOT_ENOUGH("UnstructuredStorageReader-10", "您缺少权限执行相应的文件读取操作."), + RUNTIME_EXCEPTION("UnstructuredStorageReader-11", "出现运行时异常, 请联系我们"); + + private final String code; + private final String description; + + private UnstructuredStorageReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderUtil.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderUtil.java new file mode 100755 index 000000000..6a8666069 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderUtil.java @@ -0,0 +1,620 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +import java.io.BufferedReader; +import java.io.FileNotFoundException; +import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.io.StringReader; +import java.io.UnsupportedEncodingException; +import java.nio.charset.UnsupportedCharsetException; +import java.text.SimpleDateFormat; +import java.util.Date; +import java.util.List; +import java.util.Scanner; + +/*import org.anarres.lzo.LzoDecompressor1z_safe; +import org.anarres.lzo.LzoInputStream; +import org.anarres.lzo.LzopInputStream; +import org.apache.commons.compress.archivers.ArchiveException; +import org.apache.commons.compress.archivers.ar.ArArchiveInputStream; +import org.apache.commons.compress.archivers.arj.ArjArchiveInputStream; +import org.apache.commons.compress.archivers.cpio.CpioArchiveInputStream; +import org.apache.commons.compress.archivers.dump.DumpArchiveInputStream; +import org.apache.commons.compress.archivers.jar.JarArchiveInputStream; +import org.apache.commons.compress.archivers.tar.TarArchiveInputStream; +import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;*/ +import org.apache.commons.compress.compressors.CompressorInputStream; +import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream; +import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream; +/*//import org.apache.commons.compress.compressors.lzma.LZMACompressorInputStream; +import org.apache.commons.compress.compressors.pack200.Pack200CompressorInputStream; 
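UnstructuredStorageReaderUtil, defined just below, offers two line splitters: the char-delimiter overload goes through CsvReader and therefore honours quoting, so a delimiter inside a quoted field does not split it, while the String overload is a plain StringUtils.split. A small sketch of the difference:

    import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil;

    import java.util.Arrays;

    public class SplitDemo {
        public static void main(String[] args) {
            // CSV-aware: the comma inside the quoted field does not split it
            String[] csvAware = UnstructuredStorageReaderUtil.splitOneLine("a,\"b,c\",d", ',');
            System.out.println(Arrays.toString(csvAware));    // [a, b,c, d]

            // plain split: every comma splits
            String[] plain = UnstructuredStorageReaderUtil.splitOneLine("a,\"b,c\",d", ",");
            System.out.println(Arrays.toString(plain));       // [a, "b, c", d]
        }
    }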
+//import org.apache.commons.compress.compressors.snappy.SnappyCompressorInputStream; +import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;*/ +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.element.BoolColumn; +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.DateColumn; +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.csvreader.CsvReader; + +public class UnstructuredStorageReaderUtil { + private static final Logger LOG = LoggerFactory + .getLogger(UnstructuredStorageReaderUtil.class); + + private UnstructuredStorageReaderUtil() { + + } + + /** + * @param inputLine + * 输入待分隔字符串 + * @param delimiter + * 字符串分割符 + * @return 分隔符分隔后的字符串数组,出现异常时返回为null 支持转义,即数据中可包含分隔符 + * */ + public static String[] splitOneLine(String inputLine, char delimiter) { + String[] splitedResult = null; + if (null != inputLine) { + try { + CsvReader csvReader = new CsvReader(new StringReader(inputLine)); + csvReader.setDelimiter(delimiter); + if (csvReader.readRecord()) { + splitedResult = csvReader.getValues(); + } + } catch (IOException e) { + // nothing to do + } + } + return splitedResult; + } + + /** + * 不支持转义 + * + * @return 分隔符分隔后的字符串数, + * */ + public static String[] splitOneLine(String inputLine, String delimiter) { + String[] splitedResult = StringUtils.split(inputLine, delimiter); + return splitedResult; + } + + public static void readFromStream(InputStream inputStream, String context, + Configuration readerSliceConfig, RecordSender recordSender, + TaskPluginCollector taskPluginCollector) { + String compress = readerSliceConfig.getString(Key.COMPRESS, null); + if (StringUtils.isBlank(compress)) { + compress = null; + } + String encoding = readerSliceConfig.getString(Key.ENCODING, + Constant.DEFAULT_ENCODING); + // handle blank encoding + if (StringUtils.isBlank(encoding)) { + encoding = Constant.DEFAULT_ENCODING; + LOG.warn(String.format("您配置的encoding为[%s], 使用默认值[%s]", encoding, + Constant.DEFAULT_ENCODING)); + } + + List column = readerSliceConfig + .getListConfiguration(Key.COLUMN); + // handle ["*"] -> [], null + if (null != column && 1 == column.size() + && "\"*\"".equals(column.get(0).toString())) { + readerSliceConfig.set(Key.COLUMN, null); + column = null; + } + + BufferedReader reader = null; + // compress logic + try { + if (null == compress) { + reader = new BufferedReader(new InputStreamReader(inputStream, + encoding)); + } else { + // TODO compress + /*if ("lzo".equalsIgnoreCase(compress)) { + LzoInputStream lzoInputStream = new LzoInputStream( + inputStream, new LzoDecompressor1z_safe()); + reader = new BufferedReader(new InputStreamReader( + lzoInputStream, encoding)); + } else if ("lzop".equalsIgnoreCase(compress)) { + LzoInputStream lzopInputStream = new LzopInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + lzopInputStream, encoding)); + } else */if ("gzip".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = 
new GzipCompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding)); + } else if ("bzip2".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new BZip2CompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding)); + } /*else if ("lzma".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new LZMACompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding)); + } *//*else if ("pack200".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new Pack200CompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding)); + } *//*else if ("snappy".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new SnappyCompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding)); + } *//*else if ("xz".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new XZCompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding)); + } else if ("ar".equalsIgnoreCase(compress)) { + ArArchiveInputStream arArchiveInputStream = new ArArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + arArchiveInputStream, encoding)); + } else if ("arj".equalsIgnoreCase(compress)) { + ArjArchiveInputStream arjArchiveInputStream = new ArjArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + arjArchiveInputStream, encoding)); + } else if ("cpio".equalsIgnoreCase(compress)) { + CpioArchiveInputStream cpioArchiveInputStream = new CpioArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + cpioArchiveInputStream, encoding)); + } else if ("dump".equalsIgnoreCase(compress)) { + DumpArchiveInputStream dumpArchiveInputStream = new DumpArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + dumpArchiveInputStream, encoding)); + } else if ("jar".equalsIgnoreCase(compress)) { + JarArchiveInputStream jarArchiveInputStream = new JarArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + jarArchiveInputStream, encoding)); + } else if ("tar".equalsIgnoreCase(compress)) { + TarArchiveInputStream tarArchiveInputStream = new TarArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + tarArchiveInputStream, encoding)); + } else if ("zip".equalsIgnoreCase(compress)) { + ZipArchiveInputStream zipArchiveInputStream = new ZipArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + zipArchiveInputStream, encoding)); + }*/ else { + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 gzip, bzip2 文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", + compress)); + } + } + UnstructuredStorageReaderUtil.doReadFromStream(reader, context, + readerSliceConfig, recordSender, taskPluginCollector); + } catch (UnsupportedEncodingException uee) { + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.OPEN_FILE_WITH_CHARSET_ERROR, + String.format("不支持的编码格式 : [%]", encoding), uee); + } catch (NullPointerException e) { + throw 
DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.RUNTIME_EXCEPTION, + "运行时错误, 请联系我们", e); + }/* catch (ArchiveException e) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.READ_FILE_IO_ERROR, + String.format("压缩文件流读取错误 : [%]", context), e); + } */catch (IOException e) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.READ_FILE_IO_ERROR, + String.format("流读取错误 : [%]", context), e); + } finally { + IOUtils.closeQuietly(reader); + } + + } + + public static void doReadFromStream(BufferedReader reader, String context, + Configuration readerSliceConfig, RecordSender recordSender, + TaskPluginCollector taskPluginCollector) { + List column = readerSliceConfig + .getListConfiguration(Key.COLUMN); + String encoding = readerSliceConfig.getString(Key.ENCODING, + Constant.DEFAULT_ENCODING); + Character fieldDelimiter = null; + String delimiterInStr = readerSliceConfig + .getString(Key.FIELD_DELIMITER); + if (null != delimiterInStr && 1 != delimiterInStr.length()) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", delimiterInStr)); + } + if (null == delimiterInStr) { + LOG.warn(String.format("您没有配置列分隔符, 使用默认值[%s]", + Constant.DEFAULT_FIELD_DELIMITER)); + } + + // warn: default value ',', fieldDelimiter could be \n(lineDelimiter) + // for no fieldDelimiter + fieldDelimiter = readerSliceConfig.getChar(Key.FIELD_DELIMITER, + Constant.DEFAULT_FIELD_DELIMITER); + Boolean skipHeader = readerSliceConfig.getBool(Key.SKIP_HEADER, + Constant.DEFAULT_SKIP_HEADER); + // warn: no default value '\N' + String nullFormat = readerSliceConfig.getString(Key.NULL_FORMAT); + // every line logic + try { + String fetchLine = null; + // TODO lineDelimiter + if (skipHeader) { + fetchLine = reader.readLine(); + LOG.info("Header line has been skiped."); + } + while ((fetchLine = reader.readLine()) != null) { + String[] splitedStrs = null; + if (null == fieldDelimiter) { + splitedStrs = new String[] { fetchLine }; + } else { + splitedStrs = UnstructuredStorageReaderUtil.splitOneLine( + fetchLine, fieldDelimiter); + } + UnstructuredStorageReaderUtil.transportOneRecord(recordSender, + column, splitedStrs, nullFormat, taskPluginCollector); + } + } catch (UnsupportedEncodingException uee) { + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.OPEN_FILE_WITH_CHARSET_ERROR, + String.format("不支持的编码格式 : [%]", encoding), uee); + } catch (FileNotFoundException fnfe) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.FILE_NOT_EXISTS, + String.format("无法找到文件 : [%s]", context), fnfe); + } catch (IOException ioe) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.READ_FILE_IO_ERROR, + String.format("读取文件错误 : [%s]", context), ioe); + } catch (Exception e) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.RUNTIME_EXCEPTION, + String.format("运行时异常 : %s", e.getMessage()), e); + } finally { + IOUtils.closeQuietly(reader); + } + } + + private static Record transportOneRecord(RecordSender recordSender, + List columnConfigs, String[] sourceLine, + String nullFormat, TaskPluginCollector taskPluginCollector) { + Record record = recordSender.createRecord(); + Column columnGenerated = null; + + // 创建都为String类型column的record + if (null == columnConfigs || columnConfigs.size() == 0) { + for (String columnValue : sourceLine) { + // not equalsIgnoreCase, it's all ok if nullFormat 
is null + if (columnValue.equals(nullFormat)) { + columnGenerated = new StringColumn(null); + } else { + columnGenerated = new StringColumn(columnValue); + } + record.addColumn(columnGenerated); + } + recordSender.sendToWriter(record); + } else { + try { + for (Configuration columnConfig : columnConfigs) { + String columnType = columnConfig + .getNecessaryValue( + Key.TYPE, + UnstructuredStorageReaderErrorCode.CONFIG_INVALID_EXCEPTION); + Integer columnIndex = columnConfig.getInt(Key.INDEX); + String columnConst = columnConfig.getString(Key.VALUE); + + String columnValue = null; + + if (null == columnIndex && null == columnConst) { + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnConst) { + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + + if (null != columnIndex) { + if (columnIndex >= sourceLine.length) { + String message = String + .format("您尝试读取的列越界,源文件该行有 [%s] 列,您尝试读取第 [%s] 列, 数据详情[%s]", + sourceLine.length, columnIndex + 1, + sourceLine); + LOG.warn(message); + throw new IndexOutOfBoundsException(message); + } + + columnValue = sourceLine[columnIndex]; + } else { + columnValue = columnConst; + } + Type type = Type.valueOf(columnType.toUpperCase()); + // it's all ok if nullFormat is null + if (columnValue.equals(nullFormat)) { + columnValue = null; + } + switch (type) { + case STRING: + columnGenerated = new StringColumn(columnValue); + break; + case LONG: + try { + columnGenerated = new LongColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "LONG")); + } + break; + case DOUBLE: + try { + columnGenerated = new DoubleColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "DOUBLE")); + } + break; + case BOOLEAN: + try { + columnGenerated = new BoolColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "BOOLEAN")); + } + + break; + case DATE: + try { + if (columnValue == null) { + Date date = null; + columnGenerated = new DateColumn(date); + } else { + String formatString = columnConfig + .getString(Key.FORMAT); + //if (null != formatString) { + if (StringUtils.isNotBlank(formatString)) { + // 用户自己配置的格式转换 + SimpleDateFormat format = new SimpleDateFormat( + formatString); + columnGenerated = new DateColumn( + format.parse(columnValue)); + } else { + // 框架尝试转换 + columnGenerated = new DateColumn( + new StringColumn(columnValue) + .asDate()); + } + } + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "DATE")); + } + break; + default: + String errorMessage = String.format( + "您配置的列类型暂不支持 : [%s]", columnType); + LOG.error(errorMessage); + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.NOT_SUPPORT_TYPE, + errorMessage); + } + + record.addColumn(columnGenerated); + + } + recordSender.sendToWriter(record); + } catch (IllegalArgumentException iae) { + taskPluginCollector + .collectDirtyRecord(record, iae.getMessage()); + } catch (IndexOutOfBoundsException ioe) { + taskPluginCollector + .collectDirtyRecord(record, ioe.getMessage()); + } catch (Exception e) { + if (e instanceof DataXException) { + throw 
(DataXException) e; + } + // 每一种转换失败都是脏数据处理,包括数字格式 & 日期格式 + taskPluginCollector.collectDirtyRecord(record, e.getMessage()); + } + } + + return record; + } + + private enum Type { + STRING, LONG, BOOLEAN, DOUBLE, DATE, ; + } + + /** + * check parameter:encoding, compress, filedDelimiter + * */ + public static void validateParameter(Configuration readerConfiguration) { + + // encoding check + String encoding = readerConfiguration.getUnnecessaryValue( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + com.alibaba.datax.plugin.unstructuredstorage.reader.Constant.DEFAULT_ENCODING,null); + try { + encoding = encoding.trim(); + readerConfiguration.set(Key.ENCODING, encoding); + Charsets.toCharset(encoding); + } catch (UnsupportedCharsetException uce) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的编码格式 : [%s]", encoding), uce); + } catch (Exception e) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.CONFIG_INVALID_EXCEPTION, + String.format("编码配置异常, 请联系我们: %s", e.getMessage()), e); + } + + //only support compress types + String compress =readerConfiguration + .getUnnecessaryValue(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS,null,null); + if(compress != null){ + compress = compress.toLowerCase().trim(); + boolean compressTag = "gzip".equals(compress) || "bzip2".equals(compress); + if (!compressTag) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅支持 gzip, bzip2 文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", compress)); + } + } + readerConfiguration.set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, compress); + + //fieldDelimiter check + String delimiterInStr = readerConfiguration.getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.FIELD_DELIMITER,null); + if(null == delimiterInStr){ + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.REQUIRED_VALUE, + String.format("您提供配置文件有误,[%s]是必填参数.", + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.FIELD_DELIMITER)); + }else if(1 != delimiterInStr.length()){ + // warn: if have, length must be one + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", delimiterInStr)); + } + + // column: 1. 
index type 2.value type 3.when type is Date, may have + // format + List columns = readerConfiguration + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + if (null == columns || columns.size() == 0) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.REQUIRED_VALUE, "您需要指定 columns"); + } + // handle ["*"] + if (null != columns && 1 == columns.size()) { + String columnsInStr = columns.get(0).toString(); + if ("\"*\"".equals(columnsInStr) || "'*'".equals(columnsInStr)) { + readerConfiguration.set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN, null); + columns = null; + } + } + + if (null != columns && columns.size() != 0) { + for (Configuration eachColumnConf : columns) { + eachColumnConf.getNecessaryValue(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.TYPE, + UnstructuredStorageReaderErrorCode.REQUIRED_VALUE); + Integer columnIndex = eachColumnConf + .getInt(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.INDEX); + String columnValue = eachColumnConf + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.VALUE); + + if (null == columnIndex && null == columnValue) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnValue) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + if (null != columnIndex && columnIndex < 0) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("index需要大于等于0, 您配置的index为[%s]", columnIndex)); + } + } + } + + } + + /** + * + * @Title: getRegexPathParent + * @Description: 获取正则表达式目录的父目录 + * @param @param regexPath + * @param @return + * @return String + * @throws + */ + public static String getRegexPathParent(String regexPath){ + int endMark; + for (endMark = 0; endMark < regexPath.length(); endMark++) { + if ('*' != regexPath.charAt(endMark) && '?' != regexPath.charAt(endMark)) { + continue; + } else { + break; + } + } + int lastDirSeparator = regexPath.substring(0, endMark).lastIndexOf(IOUtils.DIR_SEPARATOR); + String parentPath = regexPath.substring(0,lastDirSeparator + 1); + + return parentPath; + } + /** + * + * @Title: getRegexPathParentPath + * @Description: 获取含有通配符路径的父目录,目前只支持在最后一级目录使用通配符*或者?. 
+ * (API jcraft.jsch.ChannelSftp.ls(String path)函数限制) http://epaul.github.io/jsch-documentation/javadoc/ + * @param @param regexPath + * @param @return + * @return String + * @throws + */ + public static String getRegexPathParentPath(String regexPath){ + int lastDirSeparator = regexPath.lastIndexOf(IOUtils.DIR_SEPARATOR); + String parentPath = ""; + parentPath = regexPath.substring(0,lastDirSeparator + 1); + if(parentPath.contains("*") || parentPath.contains("?")){ + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("配置项目path中:[%s]不合法,目前只支持在最后一级目录使用通配符*或者?", regexPath)); + } + return parentPath; + } + + + + public static void main(String args[]) { + while (true) { + @SuppressWarnings("resource") + Scanner sc = new Scanner(System.in); + String inputString = sc.nextLine(); + String delemiter = sc.nextLine(); + if (delemiter.length() == 0) { + break; + } + if (!inputString.equals("exit")) { + String[] result = UnstructuredStorageReaderUtil.splitOneLine( + inputString, delemiter.charAt(0)); + for (String str : result) { + System.out.print(str + " "); + } + System.out.println(); + } else { + break; + } + } + } +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Constant.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Constant.java new file mode 100755 index 000000000..cb0aa1701 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Constant.java @@ -0,0 +1,17 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +public class Constant { + + public static final String DEFAULT_ENCODING = "UTF-8"; + + public static final char DEFAULT_FIELD_DELIMITER = ','; + + public static final String DEFAULT_NULL_FORMAT = "\\N"; + + public static final String FILE_FORMAT_CSV = "csv"; + + public static final String FILE_FORMAT_TEXT = "text"; + + //每个分块10MB,最大10000个分块 + public static final Long MAX_FILE_SIZE = 1024 * 1024 * 10 * 10000L; +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Key.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Key.java new file mode 100755 index 000000000..bcfd0ffcd --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Key.java @@ -0,0 +1,35 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +public class Key { + // must have + public static final String FILE_NAME = "fileName"; + + // must have + public static final String WRITE_MODE = "writeMode"; + + // not must , not default , + public static final String FIELD_DELIMITER = "fieldDelimiter"; + + // not must, default UTF-8 + public static final String ENCODING = "encoding"; + + // not must, default no compress + public static final String COMPRESS = "compress"; + + // not must, not default \N + public static final String NULL_FORMAT = "nullFormat"; + + // not must, date format old style, do not use this + public static final String FORMAT = "format"; + // for writers ' data format + public static final String DATE_FORMAT = "dateFormat"; + + // csv or plain text + public static final String FILE_FORMAT = "fileFormat"; + + // writer headers + public static final String HEADER = "header"; + + // writer maxFileSize + public static final String MAX_FILE_SIZE = "maxFileSize"; +} diff --git 
a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterErrorCode.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterErrorCode.java new file mode 100755 index 000000000..0f780ebdd --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterErrorCode.java @@ -0,0 +1,36 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +import com.alibaba.datax.common.spi.ErrorCode; + + +public enum UnstructuredStorageWriterErrorCode implements ErrorCode { + ILLEGAL_VALUE("UnstructuredStorageWriter-00", "您填写的参数值不合法."), + Write_FILE_WITH_CHARSET_ERROR("UnstructuredStorageWriter-01", "您配置的编码未能正常写入."), + Write_FILE_IO_ERROR("UnstructuredStorageWriter-02", "您配置的文件在写入时出现IO异常."), + RUNTIME_EXCEPTION("UnstructuredStorageWriter-03", "出现运行时异常, 请联系我们"), + REQUIRED_VALUE("UnstructuredStorageWriter-04", "您缺失了必须填写的参数值."),; + + private final String code; + private final String description; + + private UnstructuredStorageWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterUtil.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterUtil.java new file mode 100755 index 000000000..69afdd619 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterUtil.java @@ -0,0 +1,381 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +import java.io.BufferedWriter; +import java.io.IOException; +import java.io.OutputStream; +import java.io.OutputStreamWriter; +import java.io.StringWriter; +import java.io.UnsupportedEncodingException; +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.List; +import java.util.Set; + +import org.anarres.lzo.LzoCompressor1x_1; +import org.anarres.lzo.LzoOutputStream; +import org.anarres.lzo.LzopOutputStream; +import org.apache.commons.compress.archivers.ar.ArArchiveOutputStream; +import org.apache.commons.compress.archivers.cpio.CpioArchiveOutputStream; +import org.apache.commons.compress.archivers.jar.JarArchiveOutputStream; +import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream; +import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream; +import org.apache.commons.compress.compressors.CompressorOutputStream; +import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream; +import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream; +import org.apache.commons.compress.compressors.pack200.Pack200CompressorOutputStream; +import org.apache.commons.compress.compressors.xz.XZCompressorOutputStream; +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.MutablePair; +import org.slf4j.Logger; +import 
org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.DateColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.csvreader.CsvWriter; +import com.google.common.collect.Sets; + +public class UnstructuredStorageWriterUtil { + private UnstructuredStorageWriterUtil() { + + } + + private static final Logger LOG = LoggerFactory + .getLogger(UnstructuredStorageWriterUtil.class); + + /** + * check parameter: writeMode, encoding, compress, filedDelimiter + * */ + public static void validateParameter(Configuration writerConfiguration) { + // writeMode check + String writeMode = writerConfiguration.getNecessaryValue( + Key.WRITE_MODE, + UnstructuredStorageWriterErrorCode.REQUIRED_VALUE); + writeMode = writeMode.trim(); + Set supportedWriteModes = Sets.newHashSet("truncate", "append", + "nonConflict"); + if (!supportedWriteModes.contains(writeMode)) { + throw DataXException + .asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 truncate, append, nonConflict 三种模式, 不支持您配置的 writeMode 模式 : [%s]", + writeMode)); + } + writerConfiguration.set(Key.WRITE_MODE, writeMode); + + // encoding check + String encoding = writerConfiguration.getString(Key.ENCODING); + if (StringUtils.isBlank(encoding)) { + // like " ", null + LOG.warn(String.format("您的encoding配置为空, 将使用默认值[%s]", + Constant.DEFAULT_ENCODING)); + writerConfiguration.set(Key.ENCODING, Constant.DEFAULT_ENCODING); + } else { + try { + encoding = encoding.trim(); + writerConfiguration.set(Key.ENCODING, encoding); + Charsets.toCharset(encoding); + } catch (Exception e) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的编码格式:[%s]", encoding), e); + } + } + + // only support compress types + String compress = writerConfiguration.getString(Key.COMPRESS); + if (StringUtils.isBlank(compress)) { + writerConfiguration.set(Key.COMPRESS, null); + } else { + Set supportedCompress = Sets.newHashSet("gzip", "bzip2"); + if (!supportedCompress.contains(compress.toLowerCase().trim())) { + String message = String.format( + "仅支持 [%s] 文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", + StringUtils.join(supportedCompress, ",")); + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format(message, compress)); + } + } + + // fieldDelimiter check + String delimiterInStr = writerConfiguration + .getString(Key.FIELD_DELIMITER); + // warn: if have, length must be one + if (null != delimiterInStr && 1 != delimiterInStr.length()) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", delimiterInStr)); + } + if (null == delimiterInStr) { + LOG.warn(String.format("您没有配置列分隔符, 使用默认值[%s]", + Constant.DEFAULT_FIELD_DELIMITER)); + writerConfiguration.set(Key.FIELD_DELIMITER, + Constant.DEFAULT_FIELD_DELIMITER); + } + + // fileFormat check + String fileFormat = writerConfiguration.getString(Key.FILE_FORMAT, + Constant.FILE_FORMAT_TEXT); + if (!Constant.FILE_FORMAT_CSV.equals(fileFormat) + && !Constant.FILE_FORMAT_TEXT.equals(fileFormat)) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format("您配置的fileFormat [%s]错误, 
支持csv, plainText两种.", + fileFormat)); + } + } + + public static void writeToStream(RecordReceiver lineReceiver, + OutputStream outputStream, Configuration config, String context, + TaskPluginCollector taskPluginCollector) { + String encoding = config.getString(Key.ENCODING, + Constant.DEFAULT_ENCODING); + // handle blank encoding + if (StringUtils.isBlank(encoding)) { + LOG.warn(String.format("您配置的encoding为[%s], 使用默认值[%s]", encoding, + Constant.DEFAULT_ENCODING)); + encoding = Constant.DEFAULT_ENCODING; + } + String compress = config.getString(Key.COMPRESS); + + BufferedWriter writer = null; + // compress logic + try { + if (null == compress) { + writer = new BufferedWriter(new OutputStreamWriter( + outputStream, encoding)); + } else { + // TODO compress + if ("lzo".equalsIgnoreCase(compress)) { + + LzoOutputStream lzoOutputStream = new LzoOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + lzoOutputStream, encoding)); + } else if ("lzop".equalsIgnoreCase(compress)) { + LzoOutputStream lzopOutputStream = new LzopOutputStream( + outputStream, new LzoCompressor1x_1()); + writer = new BufferedWriter(new OutputStreamWriter( + lzopOutputStream, encoding)); + } else if ("gzip".equalsIgnoreCase(compress)) { + CompressorOutputStream compressorOutputStream = new GzipCompressorOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + compressorOutputStream, encoding)); + } else if ("bzip2".equalsIgnoreCase(compress)) { + CompressorOutputStream compressorOutputStream = new BZip2CompressorOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + compressorOutputStream, encoding)); + } else if ("pack200".equalsIgnoreCase(compress)) { + CompressorOutputStream compressorOutputStream = new Pack200CompressorOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + compressorOutputStream, encoding)); + } else if ("xz".equalsIgnoreCase(compress)) { + CompressorOutputStream compressorOutputStream = new XZCompressorOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + compressorOutputStream, encoding)); + } else if ("ar".equalsIgnoreCase(compress)) { + ArArchiveOutputStream arArchiveOutputStream = new ArArchiveOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + arArchiveOutputStream, encoding)); + } else if ("cpio".equalsIgnoreCase(compress)) { + CpioArchiveOutputStream cpioArchiveOutputStream = new CpioArchiveOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + cpioArchiveOutputStream, encoding)); + } else if ("jar".equalsIgnoreCase(compress)) { + JarArchiveOutputStream jarArchiveOutputStream = new JarArchiveOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + jarArchiveOutputStream, encoding)); + } else if ("tar".equalsIgnoreCase(compress)) { + TarArchiveOutputStream tarArchiveOutputStream = new TarArchiveOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + tarArchiveOutputStream, encoding)); + } else if ("zip".equalsIgnoreCase(compress)) { + ZipArchiveOutputStream zipArchiveOutputStream = new ZipArchiveOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + zipArchiveOutputStream, encoding)); + } else { + throw DataXException + .asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 lzo, lzop, gzip, bzip2, pack200, xz, ar, cpio, jar, tar, zip 文件压缩格式 , 
不支持您配置的文件压缩格式: [%s]", + compress)); + } + } + UnstructuredStorageWriterUtil.doWriteToStream(lineReceiver, writer, + context, config, taskPluginCollector); + } catch (UnsupportedEncodingException uee) { + throw DataXException + .asDataXException( + UnstructuredStorageWriterErrorCode.Write_FILE_WITH_CHARSET_ERROR, + String.format("不支持的编码格式 : [%]", encoding), uee); + } catch (NullPointerException e) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.RUNTIME_EXCEPTION, + "运行时错误, 请联系我们", e); + } catch (IOException e) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.Write_FILE_IO_ERROR, + String.format("流写入错误 : [%]", context), e); + } finally { + IOUtils.closeQuietly(writer); + } + } + + private static void doWriteToStream(RecordReceiver lineReceiver, + BufferedWriter writer, String contex, Configuration config, + TaskPluginCollector taskPluginCollector) throws IOException { + + String nullFormat = config.getString(Key.NULL_FORMAT); + + // 兼容format & dataFormat + String dateFormat = config.getString(Key.DATE_FORMAT); + + // warn: default false + String fileFormat = config.getString(Key.FILE_FORMAT, + Constant.FILE_FORMAT_TEXT); + + String delimiterInStr = config.getString(Key.FIELD_DELIMITER); + if (null != delimiterInStr && 1 != delimiterInStr.length()) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%]", delimiterInStr)); + } + if (null == delimiterInStr) { + LOG.warn(String.format("您没有配置列分隔符, 使用默认值[%s]", + Constant.DEFAULT_FIELD_DELIMITER)); + } + + // warn: fieldDelimiter could not be '' for no fieldDelimiter + char fieldDelimiter = config.getChar(Key.FIELD_DELIMITER, + Constant.DEFAULT_FIELD_DELIMITER); + + List headers = config.getList(Key.HEADER, String.class); + if (null != headers && !headers.isEmpty()) { + writer.write(UnstructuredStorageWriterUtil.doTransportOneRecord( + headers, fieldDelimiter, fileFormat)); + } + + Record record = null; + while ((record = lineReceiver.getFromReader()) != null) { + MutablePair transportResult = UnstructuredStorageWriterUtil + .transportOneRecord(record, nullFormat, dateFormat, + fieldDelimiter, fileFormat, taskPluginCollector); + if (!transportResult.getRight()) { + writer.write(transportResult.getLeft()); + } + } + } + + /** + * @return MutablePair left: formated data line; right: is + * dirty data or not, true means meeting dirty data + * */ + public static MutablePair transportOneRecord( + Record record, String nullFormat, String dateFormat, + char fieldDelimiter, String fileFormat, + TaskPluginCollector taskPluginCollector) { + // warn: default is null + if (null == nullFormat) { + nullFormat = "null"; + } + MutablePair transportResult = new MutablePair(); + transportResult.setRight(false); + List splitedRows = new ArrayList(); + int recordLength = record.getColumnNumber(); + if (0 != recordLength) { + Column column; + for (int i = 0; i < recordLength; i++) { + column = record.getColumn(i); + if (null != column.getRawData()) { + boolean isDateColumn = column instanceof DateColumn; + if (!isDateColumn) { + splitedRows.add(column.asString()); + } else { + // if (null != dateFormat) { + if (StringUtils.isNotBlank(dateFormat)) { + try { + SimpleDateFormat dateParse = new SimpleDateFormat( + dateFormat); + splitedRows.add(dateParse.format(column + .asDate())); + } catch (Exception e) { + // warn: 此处认为似乎脏数据 + String message = String.format( + "使用您配置的格式 [%s] 转换 [%s] 错误.", + dateFormat, column.asString()); + 
taskPluginCollector.collectDirtyRecord(record, + message); + transportResult.setRight(true); + break; + } + } else { + splitedRows.add(column.asString()); + } + } + } else { + // warn: it's all ok if nullFormat is null + splitedRows.add(nullFormat); + } + } + } + + transportResult.setLeft(UnstructuredStorageWriterUtil + .doTransportOneRecord(splitedRows, fieldDelimiter, fileFormat)); + return transportResult; + } + + public static String doTransportOneRecord(List splitedRows, + char fieldDelimiter, String fileFormat) { + if (splitedRows.isEmpty()) { + LOG.info("Found one record line which is empty."); + } + // warn: false means plain text(old way), true means strict csv format + if (Constant.FILE_FORMAT_TEXT.equals(fileFormat)) { + return StringUtils.join(splitedRows, fieldDelimiter) + + IOUtils.LINE_SEPARATOR; + } else { + StringWriter sw = new StringWriter(); + CsvWriter csvWriter = new CsvWriter(sw, fieldDelimiter); + csvWriter.setTextQualifier('"'); + csvWriter.setUseTextQualifier(true); + // warn: in linux is \n , in windows is \r\n + csvWriter.setRecordDelimiter(IOUtils.LINE_SEPARATOR.charAt(0)); + UnstructuredStorageWriterUtil.csvWriteSlience(csvWriter, + splitedRows); + return sw.toString(); + // sw.close(); //no need do this + } + } + + private static void csvWriteSlience(CsvWriter csvWriter, + List splitedRows) { + try { + csvWriter + .writeRecord((String[]) splitedRows.toArray(new String[0])); + } catch (IOException e) { + // shall not happen + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.RUNTIME_EXCEPTION, + String.format("转换CSV格式失败[%s]", + StringUtils.join(splitedRows, " "))); + } + } +} diff --git a/pom.xml b/pom.xml new file mode 100755 index 000000000..fc961eee9 --- /dev/null +++ b/pom.xml @@ -0,0 +1,200 @@ + + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + org.hamcrest + hamcrest-core + 1.3 + + + + datax-all + pom + + + 0.0.1-SNAPSHOT + 3.3.2 + 1.10 + 1.2 + 1.1.43 + 16.0.1 + 3.7.2.1-SNAPSHOT + + + 1.7.10 + 1.0.13 + 2.4 + 5.1.34 + 4.11 + 5.1.17 + + + UTF-8 + UTF-8 + UTF-8 + UTF-8 + + + + common + core + + + mysqlreader + drdsreader + sqlserverreader + postgresqlreader + oraclereader + odpsreader + otsreader + + txtfilereader + streamreader + + hbasereader + ossreader + ftpreader + + + mysqlwriter + drdswriter + odpswriter + txtfilewriter + streamwriter + otswriter + + oraclewriter + sqlserverwriter + postgresqlwriter + osswriter + + + plugin-rdbms-util + plugin-unstructured-storage-util + mongodbreader + mongodbwriter + + adswriter + ocswriter + + + hdfsreader + hdfswriter + + + + + + + org.apache.commons + commons-lang3 + ${commons-lang3-version} + + + com.alibaba + fastjson + ${fastjson-version} + + + + commons-io + commons-io + ${commons-io-version} + + + mysql + mysql-connector-java + ${mysql-connector-java-version} + + + org.slf4j + slf4j-api + ${slf4j-api-version} + + + ch.qos.logback + logback-classic + ${logback-classic-version} + + + + com.taobao.tddl + tddl-client + ${tddl.version} + + + + com.taobao.tddl + tddl-client + ${tddl.version} + + + com.google.guava + guava + + + com.taobao.diamond + diamond-client + + + + + + com.taobao.diamond + diamond-client + ${diamond.version} + + + + junit + junit + ${junit-version} + + + + org.mockito + mockito-all + 1.9.5 + test + + + + + + + + maven-assembly-plugin + + datax + + package.xml + + + + + make-assembly + package + + + + + org.apache.maven.plugins + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + \ No newline at end of file diff --git 
a/postgresqlreader/doc/postgresqlreader.md b/postgresqlreader/doc/postgresqlreader.md new file mode 100644 index 000000000..7431f73c0 --- /dev/null +++ b/postgresqlreader/doc/postgresqlreader.md @@ -0,0 +1,297 @@ + +# PostgresqlReader 插件文档 + + +___ + + +## 1 快速介绍 + +PostgresqlReader插件实现了从PostgreSQL读取数据。在底层实现上,PostgresqlReader通过JDBC连接远程PostgreSQL数据库,并执行相应的sql语句将数据从PostgreSQL库中SELECT出来。 + +## 2 实现原理 + +简而言之,PostgresqlReader通过JDBC连接器连接到远程的PostgreSQL数据库,并根据用户配置的信息生成查询SELECT SQL语句并发送到远程PostgreSQL数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,PostgresqlReader将其拼接为SQL语句发送到PostgreSQL数据库;对于用户配置querySql信息,PostgresqlReader直接将其发送到PostgreSQL数据库。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从PostgreSQL数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + //设置传输速度,单位为byte/s,DataX运行会尽可能达到该速度但是不超过它. + "byte": 1048576 + }, + //出错限制 + "errorLimit": { + //出错的record条数上限,当大于该值即报错。 + "record": 0, + //出错的record百分比上限 1.0表示100%,0.02表示2% + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "postgresqlreader", + "parameter": { + // 数据库连接用户名 + "username": "xx", + // 数据库连接密码 + "password": "xx", + "column": [ + "id","name" + ], + //切分主键 + "splitPk": "id", + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:postgresql://host:port/database" + ] + } + ] + } + }, + "writer": { + //writer类型 + "name": "streamwriter", + //是否打印内容 + "parameter": { + "print":true, + } + } + } + ] + } +} + +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + "speed": 1048576 + }, + "content": [ + { + "reader": { + "name": "postgresqlreader", + "parameter": { + "username": "xx", + "password": "xx", + "where": "", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:postgresql://host:port/database", "jdbc:postgresql://host:port/database" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述,并支持一个库填写多个连接地址。之所以使用JSON数组描述连接信息,是因为阿里集团内部支持多个IP探测,如果配置了多个,PostgresqlReader可以依次探测ip的可连接性,直到选择一个合法的IP。如果全部连接失败,PostgresqlReader报错。 注意,jdbcUrl必须包含在connection配置单元中。对于阿里集团外部使用情况,JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照PostgreSQL官方规范,并可以填写连接附件控制信息。具体请参看[PostgreSQL官方文档](http://jdbc.postgresql.org/documentation/93/connect.html)。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取的需要同步的表。使用JSON的数组描述,因此支持多张表同时抽取。当配置为多张表时,用户自己需保证多张表是同一schema结构,PostgresqlReader不予检查表是否同一逻辑表。注意,table必须包含在connection配置单元中。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用*代表默认使用所有列配置,例如['*']。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照PostgreSQL语法格式: + ["id", "'hello'::varchar", "true", "2.5::real", "power(2,3)"] + id为普通列名,'hello'::varchar为字符串常量,true为布尔值,2.5为浮点数, power(2,3)为函数。 + + **column必须由用户显式指定同步的列集合,不允许为空!** + + * 必选:是
+ + * 默认值:无
+ +* **splitPk** + + * 描述:PostgresqlReader进行数据抽取时,如果指定splitPk,表示用户希望使用splitPk代表的字段进行数据分片,DataX因此会启动并发任务进行数据同步,这样可以大大提高数据同步的效能。 + + 推荐用户使用表主键作为splitPk,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + + 目前splitPk仅支持整型、字符串型数据切分,`不支持浮点、日期等其他类型`。如果用户指定其他非支持类型,PostgresqlReader将报错! + + splitPk设置为空,底层将视作用户不允许对单表进行切分,因此使用单通道进行抽取。 + + * 必选:否
+ + * 默认值:空
+ +* **where** + + * 描述:筛选条件,PostgresqlReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。例如在做测试时,可以将where条件指定为limit 10;在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate。
+ + where条件可以有效地进行业务增量同步(结合splitPk、fetchSize的配置片段见本节末尾的示例)。where条件不配置或者为空,视作全表同步数据。 + + * 必选:否
+ + * 默认值:无
+ +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置项来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table、column这些配置项,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id
+ + `当用户配置querySql时,PostgresqlReader直接忽略table、column、where条件的配置`。 + + * 必选:否
+ + * 默认值:无
+ +* **fetchSize** + + * 描述:该配置项定义了插件和数据库服务器端每次批量数据获取条数,该值决定了DataX和服务器端的网络交互次数,能够较大地提升数据抽取性能。
+ + `注意,该值过大(>2048)可能造成DataX进程OOM。` + + * 必选:否
+ + * 默认值:1024
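下面给出一个把 splitPk、where、fetchSize 组合在一起的 reader 参数片段示意(仅为示意片段:其中的表名、列名、where 条件均为假设的示例值,需替换为实际库表,并嵌入 3.1 配置样例那样的完整 job 配置中使用):

```
{
    "reader": {
        "name": "postgresqlreader",
        "parameter": {
            "username": "xx",
            "password": "xx",
            "column": ["id", "name", "gmt_create"],
            "splitPk": "id",
            "where": "gmt_create > '2017-01-01'",
            "fetchSize": 1024,
            "connection": [
                {
                    "table": ["table"],
                    "jdbcUrl": ["jdbc:postgresql://host:port/database"]
                }
            ]
        }
    }
}
```

其中 where 负责行级过滤,splitPk 负责并发切分,两者可以同时配置;若改用 querySql,则如上文所述,table、column、where 的配置都会被忽略。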
+ + +### 3.3 类型转换 + +目前PostgresqlReader支持大部分PostgreSQL类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出PostgresqlReader针对PostgreSQL类型转换列表: + + +| DataX 内部类型| PostgreSQL 数据类型 | +| -------- | ----- | +| Long |bigint, bigserial, integer, smallint, serial | +| Double |double precision, money, numeric, real | +| String |varchar, char, text, bit, inet| +| Date |date, time, timestamp | +| Boolean |bool| +| Bytes |bytea| + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持; money,inet,bit需用户使用a_inet::varchar类似的语法转换`。 + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + +create table pref_test( + id serial, + a_bigint bigint, + a_bit bit(10), + a_boolean boolean, + a_char character(5), + a_date date, + a_double double precision, + a_integer integer, + a_money money, + a_num numeric(10,2), + a_real real, + a_smallint smallint, + a_text text, + a_time time, + a_timestamp timestamp +) + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 16核 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz + 2. mem: MemTotal: 24676836kB MemFree: 6365080kB + 3. net: 百兆双网卡 + +* PostgreSQL数据库机器参数为: + D12 24逻辑核 192G内存 12*480G SSD 阵列 + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + + +| 通道数 | 是否按照主键切分 | DataX速度(Rec/s) | DataX流量(MB/s) | DataX机器运行负载 | +|--------|--------| --------|--------|--------| +|1| 否 | 10211 | 0.63 | 0.2 | +|1| 是 | 10211 | 0.63 | 0.2 | +|4| 否 | 10211 | 0.63 | 0.2 | +|4| 是 | 40000 | 2.48 | 0.5 | +|8| 否 | 10211 | 0.63 | 0.2 | +|8| 是 | 78048 | 4.84 | 0.8 | + + +说明: + +1. 这里的单表,主键类型为 serial,数据分布均匀。 +2. 对单表如果没有按照主键切分,那么配置通道个数不会提升速度,效果与1个通道一样。 \ No newline at end of file diff --git a/postgresqlreader/pom.xml b/postgresqlreader/pom.xml new file mode 100755 index 000000000..a9915dd85 --- /dev/null +++ b/postgresqlreader/pom.xml @@ -0,0 +1,86 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + postgresqlreader + postgresqlreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + + org.slf4j + slf4j-api + + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + org.postgresql + postgresql + 9.3-1102-jdbc4 + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/postgresqlreader/postgresqlreader.iml b/postgresqlreader/postgresqlreader.iml new file mode 100644 index 000000000..c37210356 --- /dev/null +++ b/postgresqlreader/postgresqlreader.iml @@ -0,0 +1,46 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/postgresqlreader/src/main/assembly/package.xml b/postgresqlreader/src/main/assembly/package.xml new file mode 100755 index 000000000..5860c0576 --- /dev/null +++ b/postgresqlreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/postgresqlreader + + + target/ + + postgresqlreader-0.0.1-SNAPSHOT.jar + + plugin/reader/postgresqlreader + + + + + + false + plugin/reader/postgresqlreader/libs + runtime + + + diff --git a/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/Constant.java b/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/Constant.java new file mode 100755 index 000000000..9b9b46789 --- /dev/null +++ 
b/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/Constant.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.reader.postgresqlreader; + +public class Constant { + + public static final int DEFAULT_FETCH_SIZE = 1000; + +} diff --git a/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/PostgresqlReader.java b/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/PostgresqlReader.java new file mode 100755 index 000000000..59d2825fe --- /dev/null +++ b/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/PostgresqlReader.java @@ -0,0 +1,86 @@ +package com.alibaba.datax.plugin.reader.postgresqlreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; + +import java.util.List; + +public class PostgresqlReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.PostgreSQL; + + public static class Job extends Reader.Job { + + private Configuration originalConfig; + private CommonRdbmsReader.Job commonRdbmsReaderMaster; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + int fetchSize = this.originalConfig.getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + Constant.DEFAULT_FETCH_SIZE); + if (fetchSize < 1) { + throw DataXException.asDataXException(DBUtilErrorCode.REQUIRED_VALUE, + String.format("您配置的fetchSize有误,根据DataX的设计,fetchSize : [%d] 设置值不能小于 1.", fetchSize)); + } + this.originalConfig.set(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, fetchSize); + + this.commonRdbmsReaderMaster = new CommonRdbmsReader.Job(DATABASE_TYPE); + this.commonRdbmsReaderMaster.init(this.originalConfig); + } + + @Override + public List split(int adviceNumber) { + return this.commonRdbmsReaderMaster.split(this.originalConfig, adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderMaster.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderMaster.destroy(this.originalConfig); + } + + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderSlave; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderSlave = new CommonRdbmsReader.Task(DATABASE_TYPE,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderSlave.init(this.readerSliceConfig); + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig.getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE); + + this.commonRdbmsReaderSlave.startRead(this.readerSliceConfig, recordSender, + super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderSlave.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderSlave.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/postgresqlreader/src/main/resources/plugin.json b/postgresqlreader/src/main/resources/plugin.json new file mode 100755 index 000000000..152f8b7b0 --- /dev/null +++ 
b/postgresqlreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "postgresqlreader", + "class": "com.alibaba.datax.plugin.reader.postgresqlreader.PostgresqlReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/postgresqlreader/src/main/resources/plugin_job_template.json b/postgresqlreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..21970520f --- /dev/null +++ b/postgresqlreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "postgresqlreader", + "parameter": { + "username": "", + "password": "", + "connection": [ + { + "table": [], + "jdbcUrl": [] + } + ] + } +} \ No newline at end of file diff --git a/postgresqlwriter/doc/postgresqlwriter.md b/postgresqlwriter/doc/postgresqlwriter.md new file mode 100644 index 000000000..662da2e4f --- /dev/null +++ b/postgresqlwriter/doc/postgresqlwriter.md @@ -0,0 +1,267 @@ +# DataX PostgresqlWriter + + +--- + + +## 1 快速介绍 + +PostgresqlWriter插件实现了写入数据到 PostgreSQL主库目的表的功能。在底层实现上,PostgresqlWriter通过JDBC连接远程 PostgreSQL 数据库,并执行相应的 insert into ... sql 语句将数据写入 PostgreSQL,内部会分批次提交入库。 + +PostgresqlWriter面向ETL开发工程师,他们使用PostgresqlWriter从数仓导入数据到PostgreSQL。同时 PostgresqlWriter亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +PostgresqlWriter通过 DataX 框架获取 Reader 生成的协议数据,根据你配置生成相应的SQL插入语句 + + +* `insert into...`(当主键/唯一性索引冲突时会写不进去冲突的行) + +
+ + 注意: + 1. 目的表所在数据库必须是主库才能写入数据;整个任务至少需具备 insert into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + 2. PostgresqlWriter和MysqlWriter不同,不支持配置writeMode参数。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 PostgresqlWriter导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "postgresqlwriter", + "parameter": { + "username": "xx", + "password": "xx", + "column": [ + "id", + "name" + ], + "preSql": [ + "delete from test" + ], + "connection": [ + { + "jdbcUrl": "jdbc:postgresql://127.0.0.1:3002/datax", + "table": [ + "test" + ] + } + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息 ,jdbcUrl必须包含在connection配置单元中。 + + 注意:1、在一个数据库上只能配置一个值。 + 2、jdbcUrl按照PostgreSQL官方规范,并可以填写连接附加参数信息。具体请参看PostgreSQL官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。支持写入一个或者多个表。当配置为多张表时,必须确保所有表结构保持一致。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"] + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、此处 column 不能配置任何常量值 + + * 必选:是
+ + * 默认值:否
+ +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。如果 Sql 中有你需要操作到的表名称,请使用 `@table` 表示,这样在实际执行 Sql 语句时,会对变量按照实际表名称进行替换。比如你的任务是要写入到目的端的100个同构分表(表名称为:datax_00,datax_01, ... datax_98,datax_99),并且你希望导入数据前,先对表中数据进行删除操作,那么你可以这样配置:`"preSql":["delete from @table"]`,效果是:在向每个表写入数据前,会先执行对应表名称的 delete from 语句
+ + * 必选:否
+ + * 默认值:无
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与PostgreSQL的网络交互次数,并提升整体吞吐量;但该值设置过大可能会造成DataX运行进程OOM(组合配置片段见本节末尾的示例)。
+ + * 必选:否
+ + * 默认值:1024
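下面给出一个组合使用 preSql(含 @table 占位符)与 batchSize 的 writer 参数片段示意(仅为示意片段:库表、账号等均为假设值,需嵌入 3.1 配置样例那样的完整 job 配置中使用):

```json
{
    "writer": {
        "name": "postgresqlwriter",
        "parameter": {
            "username": "xx",
            "password": "xx",
            "column": ["id", "name"],
            "preSql": ["delete from @table"],
            "batchSize": 1024,
            "connection": [
                {
                    "jdbcUrl": "jdbc:postgresql://127.0.0.1:3002/datax",
                    "table": ["test"]
                }
            ]
        }
    }
}
```

按上文说明,preSql 中的 @table 会在执行时被替换为 connection.table 中配置的实际表名;batchSize 这里写的是默认值 1024,仅作示意,调大可减少网络交互次数,但过大可能导致 DataX 进程 OOM。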
+ +### 3.3 类型转换 + +目前 PostgresqlWriter支持大部分 PostgreSQL类型,但也存在部分没有支持的情况,请注意检查你的类型。 + +下面列出 PostgresqlWriter针对 PostgreSQL类型转换列表: + +| DataX 内部类型| PostgreSQL 数据类型 | +| -------- | ----- | +| Long |bigint, bigserial, integer, smallint, serial | +| Double |double precision, money, numeric, real | +| String |varchar, char, text, bit| +| Date |date, time, timestamp | +| Boolean |bool| +| Bytes |bytea| + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + + create table pref_test( + id serial, + a_bigint bigint, + a_bit bit(10), + a_boolean boolean, + a_char character(5), + a_date date, + a_double double precision, + a_integer integer, + a_money money, + a_num numeric(10,2), + a_real real, + a_smallint smallint, + a_text text, + a_time time, + a_timestamp timestamp +) + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 16核 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz + 2. mem: MemTotal: 24676836kB MemFree: 6365080kB + 3. net: 百兆双网卡 + +* PostgreSQL数据库机器参数为: + D12 24逻辑核 192G内存 12*480G SSD 阵列 + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + +| 通道数| 批量提交batchSize | DataX速度(Rec/s)| DataX流量(M/s) | DataX机器运行负载 +|--------|--------| --------|--------|--------|--------| +|1| 128 | 9259 | 0.55 | 0.3 +|1| 512 | 10869 | 0.653 | 0.3 +|1| 2048 | 9803 | 0.589 | 0.8 +|4| 128 | 30303 | 1.82 | 1 +|4| 512 | 36363 | 2.18 | 1 +|4| 2048 | 36363 | 2.18 | 1 +|8| 128 | 57142 | 3.43 | 2 +|8| 512 | 66666 | 4.01 | 1.5 +|8| 2048 | 66666 | 4.01 | 1.1 +|16| 128 | 88888 | 5.34 | 1.8 +|16| 2048 | 94117 | 5.65 | 2.5 +|32| 512 | 76190 | 4.58 | 3 + +#### 4.2.2 性能测试小结 +1. `channel数对性能影响很大` +2. `通常不建议写入数据库时,通道个数 > 32` + + +## FAQ + +*** + +**Q: PostgresqlWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。 +第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** diff --git a/postgresqlwriter/pom.xml b/postgresqlwriter/pom.xml new file mode 100755 index 000000000..6f46e2cbb --- /dev/null +++ b/postgresqlwriter/pom.xml @@ -0,0 +1,82 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + postgresqlwriter + postgresqlwriter + jar + writer data into postgresql database + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + + org.slf4j + slf4j-api + + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + org.postgresql + postgresql + 9.3-1102-jdbc4 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/postgresqlwriter/postgresqlwriter.iml b/postgresqlwriter/postgresqlwriter.iml new file mode 100644 index 000000000..c37210356 --- /dev/null +++ b/postgresqlwriter/postgresqlwriter.iml @@ -0,0 +1,46 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/postgresqlwriter/src/main/assembly/package.xml b/postgresqlwriter/src/main/assembly/package.xml new file mode 100755 index 000000000..20bfe6226 --- /dev/null +++ b/postgresqlwriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/postgresqlwriter + + + target/ + + postgresqlwriter-0.0.1-SNAPSHOT.jar + + 
plugin/writer/postgresqlwriter + + + + + + false + plugin/writer/postgresqlwriter/libs + runtime + + + diff --git a/postgresqlwriter/src/main/java/com/alibaba/datax/plugin/writer/postgresqlwriter/PostgresqlWriter.java b/postgresqlwriter/src/main/java/com/alibaba/datax/plugin/writer/postgresqlwriter/PostgresqlWriter.java new file mode 100755 index 000000000..22dc0c1e6 --- /dev/null +++ b/postgresqlwriter/src/main/java/com/alibaba/datax/plugin/writer/postgresqlwriter/PostgresqlWriter.java @@ -0,0 +1,100 @@ +package com.alibaba.datax.plugin.writer.postgresqlwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + +public class PostgresqlWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.PostgreSQL; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private CommonRdbmsWriter.Job commonRdbmsWriterMaster; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + // warn:not like mysql, PostgreSQL only support insert mode, don't use + String writeMode = this.originalConfig.getString(Key.WRITE_MODE); + if (null != writeMode) { + throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, + String.format("写入模式(writeMode)配置有误. 因为PostgreSQL不支持配置参数项 writeMode: %s, PostgreSQL仅使用insert sql 插入数据. 请检查您的配置并作出修改.", writeMode)); + } + + this.commonRdbmsWriterMaster = new CommonRdbmsWriter.Job(DATABASE_TYPE); + this.commonRdbmsWriterMaster.init(this.originalConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterMaster.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterMaster.split(this.originalConfig, mandatoryNumber); + } + + @Override + public void post() { + this.commonRdbmsWriterMaster.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterMaster.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterSlave; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterSlave = new CommonRdbmsWriter.Task(DATABASE_TYPE){ + @Override + public String calcValueHolder(String columnType){ + if("serial".equalsIgnoreCase(columnType)){ + return "?::int"; + }else if("bit".equalsIgnoreCase(columnType)){ + return "?::bit varying"; + } + return "?::" + columnType; + } + }; + this.commonRdbmsWriterSlave.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterSlave.prepare(this.writerSliceConfig); + } + + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterSlave.startWrite(recordReceiver, this.writerSliceConfig, super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterSlave.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterSlave.destroy(this.writerSliceConfig); + } + + } + +} diff --git 
a/postgresqlwriter/src/main/resources/plugin.json b/postgresqlwriter/src/main/resources/plugin.json new file mode 100755 index 000000000..b61b28886 --- /dev/null +++ b/postgresqlwriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "postgresqlwriter", + "class": "com.alibaba.datax.plugin.writer.postgresqlwriter.PostgresqlWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute insert sql. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/postgresqlwriter/src/main/resources/plugin_job_template.json b/postgresqlwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..c1e781f16 --- /dev/null +++ b/postgresqlwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,17 @@ +{ + "name": "postgresqlwriter", + "parameter": { + "username": "", + "password": "", + "column": [], + "preSql": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ], + "preSql": [], + "postSql": [] + } +} \ No newline at end of file diff --git a/sqlserverreader/doc/sqlserverreader.md b/sqlserverreader/doc/sqlserverreader.md new file mode 100644 index 000000000..f2e9f4eb4 --- /dev/null +++ b/sqlserverreader/doc/sqlserverreader.md @@ -0,0 +1,279 @@ + +# SqlServerReader 插件文档 + +___ + + +## 1 快速介绍 + +SqlServerReader插件实现了从SqlServer读取数据。在底层实现上,SqlServerReader通过JDBC连接远程SqlServer数据库,并执行相应的sql语句将数据从SqlServer库中SELECT出来。 + +## 2 实现原理 + +简而言之,SqlServerReader通过JDBC连接器连接到远程的SqlServer数据库,并根据用户配置的信息生成查询SELECT SQL语句并发送到远程SqlServer数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,SqlServerReader将其拼接为SQL语句发送到SqlServer数据库;对于用户配置querySql信息,SqlServer直接将其发送到SqlServer数据库。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从SqlServer数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + "byte": 1048576 + } + }, + "content": [ + { + "reader": { + "name": "sqlserverreader", + "parameter": { + // 数据库连接用户名 + "username": "root", + // 数据库连接密码 + "password": "root", + "column": [ + "id" + ], + "splitPk": "db_id", + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:sqlserver://localhost:3433;DatabaseName=dbname" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": true, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + "speed": 1048576 + }, + "content": [ + { + "reader": { + "name": "sqlserverreader", + "parameter": { + "username": "root", + "password": "root", + "where": "", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:sqlserver://bad_ip:3433;DatabaseName=dbname", + "jdbc:sqlserver://127.0.0.1:bad_port;DatabaseName=dbname", + "jdbc:sqlserver://127.0.0.1:3306;DatabaseName=dbname" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "visible": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述,并支持一个库填写多个连接地址。之所以使用JSON数组描述连接信息,是因为阿里集团内部支持多个IP探测,如果配置了多个,SqlServerReader可以依次探测ip的可连接性,直到选择一个合法的IP。如果全部连接失败,SqlServerReader报错。 注意,jdbcUrl必须包含在connection配置单元中。对于阿里集团外部使用情况,JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照SqlServer官方规范,并可以填写连接附件控制信息。具体请参看[SqlServer官方文档](http://technet.microsoft.com/zh-cn/library/ms378749(v=SQL.110).aspx)。 + + * 必选:是
+ + * 默认值:无
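Below is a minimal `connection` fragment illustrating the multi-address probing described for `jdbcUrl`; the host addresses, port and database name are placeholders for illustration only, not values taken from this repository:

```json
"connection": [
    {
        "table": ["table_name"],
        "jdbcUrl": [
            "jdbc:sqlserver://192.168.1.10:1433;DatabaseName=dbname",
            "jdbc:sqlserver://192.168.1.11:1433;DatabaseName=dbname"
        ]
    }
]
```

SqlServerReader probes the configured addresses in order and uses the first one that can be connected; if none are reachable the job reports an error.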
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取的需要同步的表。使用JSON的数组描述,因此支持多张表同时抽取。当配置为多张表时,用户自己需保证多张表是同一schema结构,SqlServerReader不予检查表是否同一逻辑表。注意,table必须包含在connection配置单元中。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用*代表默认使用所有列配置,例如["*"]。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照JSON格式: + ["id", "[table]", "1", "'bazhen.csy'", "null", "COUNT(*)", "2.3" , "true"] + id为普通列名,[table]为包含保留字的列名,1为整型数字常量,'bazhen.csy'为字符串常量,null为空指针,COUNT(*)为表达式,2.3为浮点数,true为布尔值。 + + column必须由用户显式指定同步的列集合,不允许为空! + + * 必选:是
+ + * 默认值:无
+ +* **splitPk** + + * 描述:SqlServerReader进行数据抽取时,如果指定splitPk,表示用户希望使用splitPk代表的字段进行数据分片,DataX因此会启动并发任务进行数据同步,这样可以大大提高数据同步的效能。 + + 推荐用户使用表主键作为splitPk,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + + 目前splitPk仅支持整型、字符串型数据切分,`不支持浮点、日期等其他类型`。如果用户指定其他非支持类型,SqlServerReader将报错! + + splitPk设置为空,底层将视作用户不允许对单表进行切分,因此使用单通道进行抽取。(配置示例见下。) + + * 必选:否
+ + * 默认值:无
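A sketch of a job that combines `splitPk` with a multi-channel setting; the user name, table, column names and connection string are hypothetical and only show where the options go:

```json
{
    "job": {
        "setting": {
            "speed": { "channel": 4 }
        },
        "content": [
            {
                "reader": {
                    "name": "sqlserverreader",
                    "parameter": {
                        "username": "xxx",
                        "password": "xxx",
                        "column": ["id", "name", "gmt_create"],
                        "splitPk": "id",
                        "connection": [
                            {
                                "table": ["order_info"],
                                "jdbcUrl": ["jdbc:sqlserver://127.0.0.1:1433;DatabaseName=dbname"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": { "print": false }
                }
            }
        ]
    }
}
```

With `splitPk` pointing at an integer primary key, the table is split into several read tasks and `channel` caps how many run concurrently; leaving `splitPk` empty falls back to a single channel.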
+ +* **where** + + * 描述:筛选条件,SqlServerReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。例如在做测试时,可以将where条件指定为只命中少量数据的条件(注意SqlServer不支持limit语法,可使用诸如 id < 10 的条件);在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate 。
+ + where条件可以有效地进行业务增量同步。如果该值为空,代表同步全表所有的信息。(配置示例见下。) + + * 必选:否
+ + * 默认值:无
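A hedged sketch of the incremental-sync pattern mentioned above. It assumes the table has a `gmt_create` column and that an external scheduler substitutes `$bizdate` before the job is submitted; both are assumptions for illustration:

```json
"parameter": {
    "username": "xxx",
    "password": "xxx",
    "column": ["id", "status", "gmt_create"],
    "where": "gmt_create > '$bizdate'",
    "connection": [
        {
            "table": ["order_info"],
            "jdbcUrl": ["jdbc:sqlserver://127.0.0.1:1433;DatabaseName=dbname"]
        }
    ]
}
```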
+ +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置项来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table、column这些配置项,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id
+ + `当用户配置querySql时,SqlServerReader直接忽略table、column、where条件的配置`。 + + * 必选:否
+ + * 默认值:无
+ +* **fetchSize** + + * 描述:该配置项定义了插件和数据库服务器端每次批量数据获取条数,该值决定了DataX和服务器端的网络交互次数,能够较大地提升数据抽取性能。(配置示例见下。)
+ + `注意,该值过大(>2048)可能造成DataX进程OOM。` + + * 必选:否
+ + * 默认值:1024
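A minimal fragment showing where `fetchSize` sits in the reader parameters; 2048 is only an illustrative value and, as noted above, larger values trade memory for fewer network round trips:

```json
"parameter": {
    "username": "xxx",
    "password": "xxx",
    "column": ["*"],
    "fetchSize": 2048,
    "connection": [
        {
            "table": ["table_name"],
            "jdbcUrl": ["jdbc:sqlserver://127.0.0.1:1433;DatabaseName=dbname"]
        }
    ]
}
```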
+ + +### 3.3 类型转换 + +目前SqlServerReader支持大部分SqlServer类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出SqlServerReader针对SqlServer类型转换列表: + + +| DataX 内部类型| SqlServer 数据类型 | +| -------- | ----- | +| Long |bigint, int, smallint, tinyint| +| Double |float, decimal, real, numeric| +|String |char,nchar,ntext,nvarchar,text,varchar,nvarchar(MAX),varchar(MAX)| +| Date |date, datetime, time | +| Boolean |bit| +| Bytes |binary,varbinary,varbinary(MAX),timestamp| + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 +* `timestamp类型作为二进制类型`。 + +## 4 性能报告 + +暂无 + +## 5 约束限制 + +### 5.1 主备同步数据恢复问题 + +主备同步问题指SqlServer使用主从灾备,备库从主库不间断通过binlog恢复数据。由于主备数据同步存在一定的时间差,特别在于某些特定情况,例如网络延迟等问题,导致备库同步恢复的数据与主库有较大差别,导致从备库同步的数据不是一份当前时间的完整镜像。 + +针对这个问题,我们提供了preSql功能,该功能待补充。 + +### 5.2 一致性约束 + +SqlServer在数据存储划分中属于RDBMS系统,对外可以提供强一致性数据查询接口。例如当一次同步任务启动运行过程中,当该库存在其他数据写入方写入数据时,SqlServerReader完全不会获取到写入更新数据,这是由于数据库本身的快照特性决定的。关于数据库快照特性,请参看[MVCC Wikipedia](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) + +上述是在SqlServerReader单线程模型下数据同步一致性的特性,由于SqlServerReader可以根据用户配置信息使用了并发数据抽取,因此不能严格保证数据一致性:当SqlServerReader根据splitPk进行数据切分后,会先后启动多个并发任务完成数据同步。由于多个并发任务相互之间不属于同一个读事务,同时多个并发任务存在时间间隔。因此这份数据并不是`完整的`、`一致的`数据快照信息。 + +针对多线程的一致性快照需求,在技术上目前无法实现,只能从工程角度解决,工程化的方式存在取舍,我们提供几个解决思路给用户,用户可以自行选择: + +1. 使用单线程同步,即不再进行数据切片。缺点是速度比较慢,但是能够很好保证一致性。 + +2. 关闭其他数据写入方,保证当前数据为静态数据,例如,锁表、关闭备库同步等等。缺点是可能影响在线业务。 + +### 5.3 数据库编码问题 + +SqlServerReader底层使用JDBC进行数据抽取,JDBC天然适配各类编码,并在底层进行了编码转换。因此SqlServerReader不需用户指定编码,可以自动识别编码并转码。 + +### 5.4 增量数据同步 + +SqlServerReader使用JDBC SELECT语句完成数据抽取工作,因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种: + +* 数据库在线应用写入数据库时,填充modify字段为更改时间戳,包括新增、更新、删除(逻辑删)。对于这类应用,SqlServerReader只需要WHERE条件跟上一同步阶段时间戳即可。 +* 对于新增流水型数据,SqlServerReader可以WHERE条件后跟上一阶段最大自增ID即可。 + +对于业务上无字段区分新增、修改数据情况,SqlServerReader也无法进行增量数据同步,只能同步全量数据。 + +### 5.5 Sql安全性 + +SqlServerReader提供querySql语句交给用户自己实现SELECT抽取语句,SqlServerReader本身对querySql不做任何安全性校验。这块交由DataX用户方自己保证。 + +## 6 FAQ + + diff --git a/sqlserverreader/pom.xml b/sqlserverreader/pom.xml new file mode 100755 index 000000000..d31b54445 --- /dev/null +++ b/sqlserverreader/pom.xml @@ -0,0 +1,78 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + sqlserverreader + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.microsoft.sqlserver + sqljdbc4 + 4.0 + system + ${basedir}/src/main/lib/sqljdbc4-4.0.jar + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + + \ No newline at end of file diff --git a/sqlserverreader/sqlserverreader.iml b/sqlserverreader/sqlserverreader.iml new file mode 100644 index 000000000..07accd9fc --- /dev/null +++ b/sqlserverreader/sqlserverreader.iml @@ -0,0 +1,54 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/sqlserverreader/src/main/assembly/package.xml b/sqlserverreader/src/main/assembly/package.xml new file mode 100755 index 000000000..6180fbc0a --- /dev/null +++ b/sqlserverreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/sqlserverreader + + + target/ + + sqlserverreader-0.0.1-SNAPSHOT.jar + + plugin/reader/sqlserverreader + + + + + + false + 
plugin/reader/sqlserverreader/libs + runtime + + + diff --git a/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Constant.java b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Constant.java new file mode 100755 index 000000000..1b6a14d28 --- /dev/null +++ b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Constant.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.reader.sqlserverreader; + +public class Constant { + + public static final int DEFAULT_FETCH_SIZE = 1024; + +} diff --git a/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Key.java b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Key.java new file mode 100755 index 000000000..c1a083107 --- /dev/null +++ b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Key.java @@ -0,0 +1,6 @@ +package com.alibaba.datax.plugin.reader.sqlserverreader; + +public class Key { + + public static final String FETCH_SIZE = "fetchSize"; +} diff --git a/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReader.java b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReader.java new file mode 100755 index 000000000..fbb7bfa7f --- /dev/null +++ b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReader.java @@ -0,0 +1,95 @@ +package com.alibaba.datax.plugin.reader.sqlserverreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; + +import java.util.List; + +public class SqlServerReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.SQLServer; + + public static class Job extends Reader.Job { + + private Configuration originalConfig = null; + private CommonRdbmsReader.Job commonRdbmsReaderJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + int fetchSize = this.originalConfig.getInt( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + Constant.DEFAULT_FETCH_SIZE); + if (fetchSize < 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.REQUIRED_VALUE, + String.format("您配置的fetchSize有误,根据DataX的设计,fetchSize : [%d] 设置值不能小于 1.", + fetchSize)); + } + this.originalConfig.set( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + fetchSize); + + this.commonRdbmsReaderJob = new CommonRdbmsReader.Job( + DATABASE_TYPE); + this.commonRdbmsReaderJob.init(this.originalConfig); + } + + @Override + public List split(int adviceNumber) { + return this.commonRdbmsReaderJob.split(this.originalConfig, + adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderTask; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderTask = new CommonRdbmsReader.Task( + DATABASE_TYPE 
,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderTask.init(this.readerSliceConfig); + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig + .getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE); + + this.commonRdbmsReaderTask.startRead(this.readerSliceConfig, + recordSender, super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderTask.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderTask.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReaderErrorCode.java b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReaderErrorCode.java new file mode 100755 index 000000000..6f24a1799 --- /dev/null +++ b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReaderErrorCode.java @@ -0,0 +1,26 @@ +package com.alibaba.datax.plugin.reader.sqlserverreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum SqlServerReaderErrorCode implements ErrorCode { + ; + + private String code; + private String description; + + private SqlServerReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + +} diff --git a/sqlserverreader/src/main/lib/sqljdbc4-4.0.jar b/sqlserverreader/src/main/lib/sqljdbc4-4.0.jar new file mode 100644 index 000000000..d6b7f6daf Binary files /dev/null and b/sqlserverreader/src/main/lib/sqljdbc4-4.0.jar differ diff --git a/sqlserverreader/src/main/resources/plugin.json b/sqlserverreader/src/main/resources/plugin.json new file mode 100755 index 000000000..5b9d49709 --- /dev/null +++ b/sqlserverreader/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "sqlserverreader", + "class": "com.alibaba.datax.plugin.reader.sqlserverreader.SqlServerReader", + "description": "useScene: test. mechanism: use datax framework to transport data from SQL Server. warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} + diff --git a/sqlserverreader/src/main/resources/plugin_job_template.json b/sqlserverreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..002ebca57 --- /dev/null +++ b/sqlserverreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "sqlserverreader", + "parameter": { + "username": "", + "password": "", + "connection": [ + { + "table": [], + "jdbcUrl": [] + } + ] + } +} \ No newline at end of file diff --git a/sqlserverwriter/doc/sqlserverwriter.md b/sqlserverwriter/doc/sqlserverwriter.md new file mode 100644 index 000000000..255834c65 --- /dev/null +++ b/sqlserverwriter/doc/sqlserverwriter.md @@ -0,0 +1,248 @@ +# DataX SqlServerWriter + + +--- + + +## 1 快速介绍 + +SqlServerWriter 插件实现了写入数据到 SqlServer 库的目的表的功能。在底层实现上, SqlServerWriter 通过 JDBC 连接远程 SqlServer 数据库,并执行相应的 insert into ... sql 语句将数据写入 SqlServer,内部会分批次提交入库。 + +SqlServerWriter 面向ETL开发工程师,他们使用 SqlServerWriter 从数仓导入数据到 SqlServer。同时 SqlServerWriter 亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +SqlServerWriter 通过 DataX 框架获取 Reader 生成的协议数据,根据你配置生成相应的SQL语句 + + +* `insert into...`(当主键/唯一性索引冲突时会写不进去冲突的行) + +
+ + 注意: + 1. 目的表所在数据库必须是主库才能写入数据;整个任务至少需具备 insert into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + 2.SqlServerWriter和MysqlWriter不同,不支持配置writeMode参数。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 SqlServer 导入的数据。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 5 + } + }, + "content": [ + { + "reader": {}, + "writer": { + "name": "sqlserverwriter", + "parameter": { + "username": "root", + "password": "root", + "column": [ + "db_id", + "db_type", + "db_ip", + "db_port", + "db_role", + "db_name", + "db_username", + "db_password", + "db_modify_time", + "db_modify_user", + "db_description", + "db_tddl_info" + ], + "connection": [ + { + "table": [ + "db_info_for_writer" + ], + "jdbcUrl": "jdbc:sqlserver://[HOST_NAME]:PORT;DatabaseName=[DATABASE_NAME]" + } + ], + "preSql": [ + "delete from @table where db_id = -1;" + ], + "postSql": [ + "update @table set db_modify_time = now() where db_id = 1;" + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息 ,jdbcUrl必须包含在connection配置单元中。 + + 注意:1、在一个数据库上只能配置一个值。这与 SqlServerReader 支持多个备库探测不同,因为此处不支持同一个数据库存在多个主库的情况(双主导入数据情况) + 2、jdbcUrl按照SqlServer官方规范,并可以填写连接附加参数信息。具体请参看 SqlServer官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。支持写入一个或者多个表。当配置为多张表时,必须确保所有表结构保持一致。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"] + + **column配置项必须指定,不能留空!** + + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、此处 column 不能配置任何常量值 + + * 必选:是
+ + * 默认值:无
+ +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。如果 Sql 中有你需要操作到的表名称,请使用 `@table` 表示,这样在实际执行 Sql 语句时,会对变量按照实际表名称进行替换。
+ + * 必选:否
+ + * 默认值:无
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
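A sketch of the `@table` substitution described for preSql/postSql. The table name, flag column and `$bizdate` variable are hypothetical; the point is only that `@table` is expanded to the configured table (`order_info` here) before the statements run:

```json
"parameter": {
    "username": "xxx",
    "password": "xxx",
    "column": ["id", "status", "gmt_create"],
    "preSql": ["delete from @table where gmt_create = '$bizdate'"],
    "postSql": ["update @table set sync_flag = 1 where gmt_create = '$bizdate'"],
    "connection": [
        {
            "table": ["order_info"],
            "jdbcUrl": "jdbc:sqlserver://127.0.0.1:1433;DatabaseName=dbname"
        }
    ]
}
```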
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与SqlServer的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:1024
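A fragment showing `batchSize` in context, reusing the table and columns from the sample job in 3.1; 2048 is an illustrative value, and very large settings increase the risk of the OOM situation mentioned above:

```json
"parameter": {
    "username": "xxx",
    "password": "xxx",
    "column": ["db_id", "db_name"],
    "batchSize": 2048,
    "connection": [
        {
            "table": ["db_info_for_writer"],
            "jdbcUrl": "jdbc:sqlserver://[HOST_NAME]:PORT;DatabaseName=[DATABASE_NAME]"
        }
    ]
}
```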
+ + + +### 3.3 类型转换 + +类似 SqlServerReader ,目前 SqlServerWriter 支持大部分 SqlServer 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出 SqlServerWriter 针对 SqlServer 类型转换列表: + + +| DataX 内部类型| SqlServer 数据类型 | +| -------- | ----- | +| Long || +| Double || +| String || +| Date || +| Boolean || +| Bytes || + + + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: +``` + +``` +单行记录类似于: +``` +``` +#### 4.1.2 机器参数 + +* 执行 DataX 的机器参数为: + 1. cpu: 24 Core Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz + 2. mem: 94GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* SqlServer 数据库机器参数为: + 1. cpu: 4 Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz + 2. mem: 7GB + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + +#### 4.1.4 性能测试作业配置 + +``` + +``` + +### 4.2 测试报告 + +#### 4.2.1 测试报告 + + +## 5 约束限制 + + + + +## FAQ + +*** + +**Q: SqlServerWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** + +**Q: 上面第二种方法可以避免对线上数据造成影响,那我具体怎样操作?** + +A: 可以配置临时表导入 diff --git a/sqlserverwriter/pom.xml b/sqlserverwriter/pom.xml new file mode 100644 index 000000000..109567b82 --- /dev/null +++ b/sqlserverwriter/pom.xml @@ -0,0 +1,79 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + sqlserverwriter + sqlserverwriter + jar + writer data into sqlserver database + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.microsoft.sqlserver + sqljdbc4 + 4.0 + system + ${basedir}/src/main/lib/sqljdbc4-4.0.jar + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/sqlserverwriter/sqlserverwriter.iml b/sqlserverwriter/sqlserverwriter.iml new file mode 100644 index 000000000..07accd9fc --- /dev/null +++ b/sqlserverwriter/sqlserverwriter.iml @@ -0,0 +1,54 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/sqlserverwriter/src/main/assembly/package.xml b/sqlserverwriter/src/main/assembly/package.xml new file mode 100755 index 000000000..f8f262987 --- /dev/null +++ b/sqlserverwriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/sqlserverwriter + + + target/ + + sqlserverwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/sqlserverwriter + + + + + + false + plugin/writer/sqlserverwriter/libs + runtime + + + diff --git a/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriter.java b/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriter.java new file mode 100644 index 000000000..6c8197191 --- /dev/null +++ b/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriter.java @@ -0,0 +1,97 @@ +package com.alibaba.datax.plugin.writer.sqlserverwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import 
com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + +public class SqlServerWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.SQLServer; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private CommonRdbmsWriter.Job commonRdbmsWriterJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + // warn:not like mysql, sqlserver only support insert mode + String writeMode = this.originalConfig.getString(Key.WRITE_MODE); + if (null != writeMode) { + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "写入模式(writeMode)配置错误. 因为sqlserver不支持配置项 writeMode: %s, sqlserver只能使用insert sql 插入数据. 请检查您的配置并作出修改", + writeMode)); + } + + this.commonRdbmsWriterJob = new CommonRdbmsWriter.Job(DATABASE_TYPE); + this.commonRdbmsWriterJob.init(this.originalConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterJob.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterJob.split(this.originalConfig, + mandatoryNumber); + } + + @Override + public void post() { + this.commonRdbmsWriterJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterTask; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterTask = new CommonRdbmsWriter.Task( + DATABASE_TYPE); + this.commonRdbmsWriterTask.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterTask.prepare(this.writerSliceConfig); + } + + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterTask.startWrite(recordReceiver, + this.writerSliceConfig, super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterTask.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterTask.destroy(this.writerSliceConfig); + } + + } + +} diff --git a/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriterErrorCode.java b/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriterErrorCode.java new file mode 100644 index 000000000..26f526a0f --- /dev/null +++ b/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriterErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.writer.sqlserverwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum SqlServerWriterErrorCode implements ErrorCode { + ; + + private final String code; + private final String describe; + + private SqlServerWriterErrorCode(String code, String describe) { + this.code = code; + this.describe = describe; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.describe; + } + + @Override + public String toString() { + return String.format("Code:[%s], 
Describe:[%s]. ", this.code, + this.describe); + } +} diff --git a/sqlserverwriter/src/main/lib/sqljdbc4-4.0.jar b/sqlserverwriter/src/main/lib/sqljdbc4-4.0.jar new file mode 100644 index 000000000..d6b7f6daf Binary files /dev/null and b/sqlserverwriter/src/main/lib/sqljdbc4-4.0.jar differ diff --git a/sqlserverwriter/src/main/resources/plugin.json b/sqlserverwriter/src/main/resources/plugin.json new file mode 100755 index 000000000..a92d0c69b --- /dev/null +++ b/sqlserverwriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "sqlserverwriter", + "class": "com.alibaba.datax.plugin.writer.sqlserverwriter.SqlServerWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute insert sql. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/sqlserverwriter/src/main/resources/plugin_job_template.json b/sqlserverwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..b22c7dff4 --- /dev/null +++ b/sqlserverwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,17 @@ +{ + "name": "sqlserverwriter", + "parameter": { + "username": "", + "password": "", + "column": [], + "preSql": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ], + "preSql": [], + "postSql": [] + } +} \ No newline at end of file diff --git a/streamreader/pom.xml b/streamreader/pom.xml new file mode 100755 index 000000000..f7b12d501 --- /dev/null +++ b/streamreader/pom.xml @@ -0,0 +1,74 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + streamreader + streamreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/streamreader/src/main/assembly/package.xml b/streamreader/src/main/assembly/package.xml new file mode 100755 index 000000000..5db1e0b75 --- /dev/null +++ b/streamreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/streamreader + + + target/ + + streamreader-0.0.1-SNAPSHOT.jar + + plugin/reader/streamreader + + + + + + false + plugin/reader/streamreader/libs + runtime + + + diff --git a/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Constant.java b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Constant.java new file mode 100755 index 000000000..42c580b1b --- /dev/null +++ b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Constant.java @@ -0,0 +1,12 @@ +package com.alibaba.datax.plugin.reader.streamreader; + +public class Constant { + + public static final String TYPE = "type"; + + public static final String VALUE = "value"; + + public static final String DATE_FORMAT_MARK = "dateFormat"; + + public static final String DEFAULT_DATE_FORMAT = "yyyy-MM-dd HH:mm:ss"; +} diff --git a/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Key.java b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Key.java new file mode 100755 index 000000000..6542f4b70 --- /dev/null +++ 
b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Key.java @@ -0,0 +1,12 @@ +package com.alibaba.datax.plugin.reader.streamreader; + +public class Key { + + /** + * should look like:[{"value":"123","type":"int"},{"value":"hello","type":"string"}] + */ + public static final String COLUMN = "column"; + + public static final String SLICE_RECORD_COUNT = "sliceRecordCount"; + +} diff --git a/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReader.java b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReader.java new file mode 100755 index 000000000..a7b84b68f --- /dev/null +++ b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReader.java @@ -0,0 +1,215 @@ +package com.alibaba.datax.plugin.reader.streamreader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.fastjson.JSONObject; +import org.apache.commons.lang3.StringUtils; + +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.List; + +public class StreamReader extends Reader { + + public static class Job extends Reader.Job { + + private Configuration originalConfig; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + dealColumn(this.originalConfig); + + Long sliceRecordCount = this.originalConfig + .getLong(Key.SLICE_RECORD_COUNT); + if (null == sliceRecordCount) { + throw DataXException.asDataXException(StreamReaderErrorCode.REQUIRED_VALUE, + "没有设置参数[sliceRecordCount]."); + } else if (sliceRecordCount < 1) { + throw DataXException.asDataXException(StreamReaderErrorCode.ILLEGAL_VALUE, + "参数[sliceRecordCount]不能小于1."); + } + + } + + private void dealColumn(Configuration originalConfig) { + List columns = originalConfig.getList(Key.COLUMN, + JSONObject.class); + if (null == columns || columns.isEmpty()) { + throw DataXException.asDataXException(StreamReaderErrorCode.REQUIRED_VALUE, + "没有设置参数[column]."); + } + + List dealedColumns = new ArrayList(); + for (JSONObject eachColumn : columns) { + Configuration eachColumnConfig = Configuration.from(eachColumn); + eachColumnConfig.getNecessaryValue(Constant.VALUE, + StreamReaderErrorCode.REQUIRED_VALUE); + String typeName = eachColumnConfig.getString(Constant.TYPE); + if (StringUtils.isBlank(typeName)) { + // empty typeName will be set to default type: string + eachColumnConfig.set(Constant.TYPE, Type.STRING); + } else { + if (Type.DATE.name().equalsIgnoreCase(typeName)) { + boolean notAssignDateFormat = StringUtils + .isBlank(eachColumnConfig + .getString(Constant.DATE_FORMAT_MARK)); + if (notAssignDateFormat) { + eachColumnConfig.set(Constant.DATE_FORMAT_MARK, + Constant.DEFAULT_DATE_FORMAT); + } + } + if (!Type.isTypeIllegal(typeName)) { + throw DataXException.asDataXException( + StreamReaderErrorCode.NOT_SUPPORT_TYPE, + String.format("不支持类型[%s]", typeName)); + } + } + dealedColumns.add(eachColumnConfig.toJSON()); + } + + originalConfig.set(Key.COLUMN, dealedColumns); + } + + @Override + public void prepare() { + } + + @Override + public List split(int adviceNumber) { + List configurations = new ArrayList(); + + for (int i = 0; i < adviceNumber; i++) { + configurations.add(this.originalConfig.clone()); + } + return configurations; + } + + @Override + public void post() { + } + + @Override + public 
void destroy() { + } + + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + + private List columns; + + private long sliceRecordCount; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.columns = this.readerSliceConfig.getList(Key.COLUMN, + String.class); + + this.sliceRecordCount = this.readerSliceConfig + .getLong(Key.SLICE_RECORD_COUNT); + } + + @Override + public void prepare() { + } + + @Override + public void startRead(RecordSender recordSender) { + Record oneRecord = buildOneRecord(recordSender, this.columns); + + while (this.sliceRecordCount > 0) { + recordSender.sendToWriter(oneRecord); + this.sliceRecordCount--; + } + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + + private Record buildOneRecord(RecordSender recordSender, + List columns) { + if (null == recordSender) { + throw new IllegalArgumentException( + "参数[recordSender]不能为空."); + } + + if (null == columns || columns.isEmpty()) { + throw new IllegalArgumentException( + "参数[column]不能为空."); + } + + Record record = recordSender.createRecord(); + try { + for (String eachColumn : columns) { + Configuration eachColumnConfig = Configuration + .from(eachColumn); + String columnValue = eachColumnConfig + .getString(Constant.VALUE); + Type columnType = Type.valueOf(eachColumnConfig.getString( + Constant.TYPE).toUpperCase()); + switch (columnType) { + case STRING: + record.addColumn(new StringColumn(columnValue)); + break; + case LONG: + record.addColumn(new LongColumn(columnValue)); + break; + case DOUBLE: + record.addColumn(new DoubleColumn(columnValue)); + break; + case DATE: + SimpleDateFormat format = new SimpleDateFormat( + eachColumnConfig + .getString(Constant.DATE_FORMAT_MARK)); + record.addColumn(new DateColumn(format + .parse(columnValue))); + break; + case BOOL: + record.addColumn(new BoolColumn("true" + .equalsIgnoreCase(columnValue) ? 
true : false)); + break; + case BYTES: + record.addColumn(new BytesColumn(columnValue.getBytes())); + break; + default: + // in fact,never to be here + throw new Exception(String.format("不支持类型[%s]", + columnType.name())); + } + } + } catch (Exception e) { + throw DataXException.asDataXException(StreamReaderErrorCode.ILLEGAL_VALUE, + "构造一个record失败.", e); + } + + return record; + } + } + + private enum Type { + STRING, LONG, BOOL, DOUBLE, DATE, BYTES, ; + + private static boolean isTypeIllegal(String typeString) { + try { + Type.valueOf(typeString.toUpperCase()); + } catch (Exception e) { + return false; + } + + return true; + } + } + +} diff --git a/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReaderErrorCode.java b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReaderErrorCode.java new file mode 100755 index 000000000..ae3f2b880 --- /dev/null +++ b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReaderErrorCode.java @@ -0,0 +1,34 @@ +package com.alibaba.datax.plugin.reader.streamreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum StreamReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("StreamReader-00", "缺失必要的值"), + ILLEGAL_VALUE("StreamReader-01", "值非法"), + NOT_SUPPORT_TYPE("StreamReader-02", "不支持的column类型"),; + + + private final String code; + private final String description; + + private StreamReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/streamreader/src/main/resources/plugin.json b/streamreader/src/main/resources/plugin.json new file mode 100755 index 000000000..4c0b3edf9 --- /dev/null +++ b/streamreader/src/main/resources/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "streamreader", + "class": "com.alibaba.datax.plugin.reader.streamreader.StreamReader", + "description": { + "useScene": "only for developer test.", + "mechanism": "use datax framework to transport data from stream.", + "warn": "Never use it in your real job." 
+ }, + "developer": "alibaba" +} \ No newline at end of file diff --git a/streamreader/src/main/resources/plugin_job_template.json b/streamreader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..4dced6362 --- /dev/null +++ b/streamreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,7 @@ +{ + "name": "streamreader", + "parameter": { + "sliceRecordCount": "", + "column": [] + } +} \ No newline at end of file diff --git a/streamreader/streamreader.iml b/streamreader/streamreader.iml new file mode 100644 index 000000000..0be046885 --- /dev/null +++ b/streamreader/streamreader.iml @@ -0,0 +1,24 @@ + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/streamwriter/pom.xml b/streamwriter/pom.xml new file mode 100755 index 000000000..58b294712 --- /dev/null +++ b/streamwriter/pom.xml @@ -0,0 +1,68 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + streamwriter + streamwriter + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/streamwriter/src/main/assembly/package.xml b/streamwriter/src/main/assembly/package.xml new file mode 100755 index 000000000..6564e05ba --- /dev/null +++ b/streamwriter/src/main/assembly/package.xml @@ -0,0 +1,34 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/streamwriter + + + target/ + + streamwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/streamwriter + + + + + + false + plugin/writer/streamwriter/libs + runtime + + + diff --git a/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/Key.java b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/Key.java new file mode 100755 index 000000000..b716ea21c --- /dev/null +++ b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/Key.java @@ -0,0 +1,16 @@ +package com.alibaba.datax.plugin.writer.streamwriter; + +public class Key { + public static final String FIELD_DELIMITER = "fieldDelimiter"; + + public static final String PRINT = "print"; + + public static final String PATH = "path"; + + public static final String FILE_NAME = "fileName"; + + public static final String RECORD_NUM_BEFORE_SLEEP = "recordNumBeforeSleep"; + + public static final String SLEEP_TIME = "sleepTime"; + +} diff --git a/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriter.java b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriter.java new file mode 100755 index 000000000..888c6ad77 --- /dev/null +++ b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriter.java @@ -0,0 +1,255 @@ + +package com.alibaba.datax.plugin.writer.streamwriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.io.FileUtils; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import 
org.slf4j.LoggerFactory; + +import java.io.*; +import java.util.ArrayList; +import java.util.List; + +public class StreamWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + private Configuration originalConfig; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + String path = this.originalConfig.getString(Key.PATH, null); + String fileName = this.originalConfig.getString(Key.FILE_NAME, null); + + if(StringUtils.isNoneBlank(path) && StringUtils.isNoneBlank(fileName)) { + validateParameter(path, fileName); + } + } + + private void validateParameter(String path, String fileName) { + try { + // warn: 这里用户需要配一个目录 + File dir = new File(path); + if (dir.isFile()) { + throw DataXException + .asDataXException( + StreamWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的path: [%s] 不是一个合法的目录, 请您注意文件重名, 不合法目录名等情况.", + path)); + } + if (!dir.exists()) { + boolean createdOk = dir.mkdirs(); + if (!createdOk) { + throw DataXException + .asDataXException( + StreamWriterErrorCode.CONFIG_INVALID_EXCEPTION, + String.format("您指定的文件路径 : [%s] 创建失败.", + path)); + } + } + + String fileFullPath = buildFilePath(path, fileName); + File newFile = new File(fileFullPath); + if(newFile.exists()) { + try { + FileUtils.forceDelete(newFile); + } catch (IOException e) { + throw DataXException.asDataXException( + StreamWriterErrorCode.RUNTIME_EXCEPTION, + String.format("删除文件失败 : [%s] ", fileFullPath), e); + } + } + } catch (SecurityException se) { + throw DataXException.asDataXException( + StreamWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限创建文件路径 : [%s] ", path), se); + } + } + + @Override + public void prepare() { + } + + @Override + public List split(int mandatoryNumber) { + List writerSplitConfigs = new ArrayList(); + for (int i = 0; i < mandatoryNumber; i++) { + writerSplitConfigs.add(this.originalConfig); + } + + return writerSplitConfigs; + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory + .getLogger(Task.class); + + private static final String NEWLINE_FLAG = System.getProperty("line.separator", "\n"); + + private Configuration writerSliceConfig; + + private String fieldDelimiter; + private boolean print; + + private String path; + private String fileName; + + private long recordNumBeforSleep; + private long sleepTime; + + + + @Override + public void init() { + this.writerSliceConfig = getPluginJobConf(); + + this.fieldDelimiter = this.writerSliceConfig.getString( + Key.FIELD_DELIMITER, "\t"); + this.print = this.writerSliceConfig.getBool(Key.PRINT, true); + + this.path = this.writerSliceConfig.getString(Key.PATH, null); + this.fileName = this.writerSliceConfig.getString(Key.FILE_NAME, null); + this.recordNumBeforSleep = this.writerSliceConfig.getLong(Key.RECORD_NUM_BEFORE_SLEEP, 0); + this.sleepTime = this.writerSliceConfig.getLong(Key.SLEEP_TIME, 0); + if(recordNumBeforSleep < 0) { + throw DataXException.asDataXException(StreamWriterErrorCode.CONFIG_INVALID_EXCEPTION, "recordNumber 不能为负值"); + } + if(sleepTime <0) { + throw DataXException.asDataXException(StreamWriterErrorCode.CONFIG_INVALID_EXCEPTION, "sleep 不能为负值"); + } + + } + + @Override + public void prepare() { + } + + @Override + public void startWrite(RecordReceiver recordReceiver) { + + + if(StringUtils.isNoneBlank(path) && StringUtils.isNoneBlank(fileName)) { + 
writeToFile(recordReceiver,path, fileName, recordNumBeforSleep, sleepTime); + } else { + try { + BufferedWriter writer = new BufferedWriter( + new OutputStreamWriter(System.out, "UTF-8")); + + Record record; + while ((record = recordReceiver.getFromReader()) != null) { + if (this.print) { + writer.write(recordToString(record)); + } else { + /* do nothing */ + } + } + writer.flush(); + + } catch (Exception e) { + throw DataXException.asDataXException(StreamWriterErrorCode.RUNTIME_EXCEPTION, e); + } + } + } + + private void writeToFile(RecordReceiver recordReceiver, String path, String fileName, + long recordNumBeforSleep, long sleepTime) { + + LOG.info("begin do write..."); + String fileFullPath = buildFilePath(path, fileName); + LOG.info(String.format("write to file : [%s]", fileFullPath)); + BufferedWriter writer = null; + try { + File newFile = new File(fileFullPath); + newFile.createNewFile(); + + writer = new BufferedWriter( + new OutputStreamWriter(new FileOutputStream(newFile, true), "UTF-8")); + + Record record; + int count =0; + while ((record = recordReceiver.getFromReader()) != null) { + if(recordNumBeforSleep > 0 && sleepTime >0 &&count == recordNumBeforSleep) { + LOG.info("StreamWriter start to sleep ... recordNumBeforSleep={},sleepTime={}",recordNumBeforSleep,sleepTime); + try { + Thread.sleep(sleepTime * 1000l); + } catch (InterruptedException e) { + } + } + writer.write(recordToString(record)); + count++; + } + writer.flush(); + } catch (Exception e) { + throw DataXException.asDataXException(StreamWriterErrorCode.RUNTIME_EXCEPTION, e); + } finally { + IOUtils.closeQuietly(writer); + } + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + + private String recordToString(Record record) { + int recordLength = record.getColumnNumber(); + if (0 == recordLength) { + return NEWLINE_FLAG; + } + + Column column; + StringBuilder sb = new StringBuilder(); + for (int i = 0; i < recordLength; i++) { + column = record.getColumn(i); + sb.append(column.asString()).append(fieldDelimiter); + } + sb.setLength(sb.length() - 1); + sb.append(NEWLINE_FLAG); + + return sb.toString(); + } + } + + private static String buildFilePath(String path, String fileName) { + boolean isEndWithSeparator = false; + switch (IOUtils.DIR_SEPARATOR) { + case IOUtils.DIR_SEPARATOR_UNIX: + isEndWithSeparator = path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR)); + break; + case IOUtils.DIR_SEPARATOR_WINDOWS: + isEndWithSeparator = path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR_WINDOWS)); + break; + default: + break; + } + if (!isEndWithSeparator) { + path = path + IOUtils.DIR_SEPARATOR; + } + return String.format("%s%s", path, fileName); + } +} diff --git a/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriterErrorCode.java b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriterErrorCode.java new file mode 100755 index 000000000..1762482a2 --- /dev/null +++ b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriterErrorCode.java @@ -0,0 +1,36 @@ +package com.alibaba.datax.plugin.writer.streamwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum StreamWriterErrorCode implements ErrorCode { + RUNTIME_EXCEPTION("StreamWriter-00", "运行时异常"), + ILLEGAL_VALUE("StreamWriter-01", "您填写的参数值不合法."), + CONFIG_INVALID_EXCEPTION("StreamWriter-02", "您的参数配置错误."), + SECURITY_NOT_ENOUGH("TxtFileWriter-03", "您缺少权限执行相应的文件写入操作."); + + + + private final String code; + private final 
String description; + + private StreamWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/streamwriter/src/main/resources/plugin.json b/streamwriter/src/main/resources/plugin.json new file mode 100755 index 000000000..6eed86e3c --- /dev/null +++ b/streamwriter/src/main/resources/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "streamwriter", + "class": "com.alibaba.datax.plugin.writer.streamwriter.StreamWriter", + "description": { + "useScene": "only for developer test.", + "mechanism": "use datax framework to transport data to stream.", + "warn": "Never use it in your real job." + }, + "developer": "alibaba" +} \ No newline at end of file diff --git a/streamwriter/src/main/resources/plugin_job_template.json b/streamwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..66e1f5e39 --- /dev/null +++ b/streamwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,7 @@ +{ + "name": "streamwriter", + "parameter": { + "encoding": "", + "print": true + } +} \ No newline at end of file diff --git a/streamwriter/streamwriter.iml b/streamwriter/streamwriter.iml new file mode 100644 index 000000000..d48a6da86 --- /dev/null +++ b/streamwriter/streamwriter.iml @@ -0,0 +1,24 @@ + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/txtfilereader/doc/txtfielreader.md b/txtfilereader/doc/txtfielreader.md new file mode 100644 index 000000000..da99945e7 --- /dev/null +++ b/txtfilereader/doc/txtfielreader.md @@ -0,0 +1,223 @@ +# DataX TxtFileReader 说明 + + +------------ + +## 1 快速介绍 + +TxtFileReader提供了读取本地文件系统数据存储的能力。在底层实现上,TxtFileReader获取本地文件数据,并转换为DataX传输协议传递给Writer。 + +**本地文件内容存放的是一张逻辑意义上的二维表,例如CSV格式的文本信息。** + + +## 2 功能与限制 + +TxtFileReader实现了从本地文件读取数据并转为DataX协议的功能,本地文件本身是无结构化数据存储,对于DataX而言,TxtFileReader实现上类比OSSReader,有诸多相似之处。目前TxtFileReader支持功能如下: + +1. 支持且仅支持读取TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 支持多种类型数据读取(使用String表示),支持列裁剪,支持列常量 + +4. 支持递归读取、支持文件名过滤。 + +5. 支持文本压缩,现有压缩格式为zip、lzo、lzop、tgz、bzip2。 + +6. 多个File可以支持并发读取。 + +我们暂时不能做到: + +1. 单个File支持多线程并发读取,这里涉及到单个File内部切分算法。二期考虑支持。 + +2. 单个File在压缩情况下,从技术上无法支持多线程并发读取。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "setting": {}, + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "txtfilereader", + "parameter": { + "path": ["/home/haiwei.luo/case00/data"], + "encoding": "UTF-8", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "index": 2, + "type": "double" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "date", + "format": "yyyy.MM.dd" + } + ], + "fieldDelimiter": "," + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/home/haiwei.luo/case00/result", + "fileName": "luohw", + "writeMode": "truncate", + "format": "yyyy-MM-dd" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **path** + + * 描述:本地文件系统的路径信息,注意这里可以支持填写多个路径。
+ + 当指定单个本地文件,TxtFileReader暂时只能使用单线程进行数据抽取。二期考虑在非压缩文件情况下针对单个File可以进行多线程并发读取。 + + 当指定多个本地文件,TxtFileReader支持使用多线程进行数据抽取。线程并发数通过通道数指定。 + + 当指定通配符,TxtFileReader尝试遍历出多个文件信息。例如: 指定/*代表读取/目录下所有的文件,指定/bazhen/\*代表读取bazhen目录下游所有的文件。**TxtFileReader目前只支持\*作为文件通配符。** + + **特别需要注意的是,DataX会将一个作业下同步的所有Text File视作同一张数据表。用户必须自己保证所有的File能够适配同一套schema信息。读取文件用户必须保证为类CSV格式,并且提供给DataX权限可读。** + + **特别需要注意的是,如果Path指定的路径下没有符合匹配的文件抽取,DataX将报错。** + + * 必选:是
+ + * 默认值:无
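An illustrative `path` setting that mixes an explicit directory with a `*` wildcard; the directories are placeholders:

```json
"path": [
    "/home/admin/case00/data",
    "/home/admin/case01/*"
]
```

Every file matched by any entry must share the same schema, and if no file matches, the job reports an error as noted above.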
+ +* **column** + + * 描述:读取字段列表,type指定源数据的类型,index指定当前列来自于文本第几列(以0开始),value指定当前类型为常量,不从源头文件读取数据,而是根据value值自动生成对应的列。
+ + 默认情况下,用户可以全部按照String类型读取数据,配置如下: + + ```json + "column": ["*"] + ``` + + 用户可以指定Column字段信息,配置如下: + + ```json + { + "type": "long", + "index": 0 //从本地文件文本第一列获取int字段 + }, + { + "type": "string", + "value": "alibaba" //从TxtFileReader内部生成alibaba的字符串字段作为当前字段 + } + ``` + + 对于用户指定Column信息,type必须填写,index/value必须选择其一。 + + * 必选:是
+ + * 默认值:全部按照string类型读取
+ +* **fieldDelimiter** + + * 描述:读取的字段分隔符
+ + * 必选:是
+ + * 默认值:,
+ +* **compress** + + * 描述:文本压缩类型,默认不填写意味着没有压缩。支持压缩类型为gzip、bzip2。
+ + * 必选:否
+ + * 默认值:没有压缩
+ +* **encoding** + + * 描述:读取文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+ +* **skipHeader** + + * 描述:类CSV格式文件可能存在表头为标题情况,需要跳过。默认不跳过。
+ + * 必选:否
+ + * 默认值:false
+ +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如如果用户配置: nullFormat:"\N",那么如果源头数据是"\N",DataX视作null字段。 + + * 必选:否
+ + * 默认值:\N
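A sketch that puts the optional settings above together for a gzip-compressed, GBK-encoded CSV export that carries a header row; all paths and values are illustrative assumptions:

```json
"parameter": {
    "path": ["/home/admin/export/*.gz"],
    "compress": "gzip",
    "encoding": "GBK",
    "fieldDelimiter": ",",
    "skipHeader": true,
    "nullFormat": "\\N",
    "column": ["*"]
}
```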
+ + +### 3.3 类型转换 + +本地文件本身不提供数据类型,该类型是DataX TxtFileReader定义: + +| DataX 内部类型| 本地文件 数据类型 | +| -------- | ----- | +| +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* 本地文件 Long是指本地文件文本中使用整形的字符串表示形式,例如"19901219"。 +* 本地文件 Double是指本地文件文本中使用Double的字符串表示形式,例如"3.1415"。 +* 本地文件 Boolean是指本地文件文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* 本地文件 Date是指本地文件文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + + +## 4 性能报告 + + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + + diff --git a/txtfilereader/pom.xml b/txtfilereader/pom.xml new file mode 100755 index 000000000..4d9cf6bf6 --- /dev/null +++ b/txtfilereader/pom.xml @@ -0,0 +1,78 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + txtfilereader + txtfilereader + TxtFileReader提供了本地读取TEXT功能,并可以根据用户配置的类型进行类型转换,建议开发、测试环境使用。 + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/txtfilereader/src/main/assembly/package.xml b/txtfilereader/src/main/assembly/package.xml new file mode 100755 index 000000000..b0fb12906 --- /dev/null +++ b/txtfilereader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/txtfilereader + + + target/ + + txtfilereader-0.0.1-SNAPSHOT.jar + + plugin/reader/txtfilereader + + + + + + false + plugin/reader/txtfilereader/libs + runtime + + + diff --git a/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Constant.java b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Constant.java new file mode 100755 index 000000000..7b7a46fa2 --- /dev/null +++ b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Constant.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.plugin.reader.txtfilereader; + +/** + * Created by haiwei.luo on 14-9-20. + */ +public class Constant { + public static final String SOURCE_FILES = "sourceFiles"; + +} diff --git a/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Key.java b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Key.java new file mode 100755 index 000000000..4c6e84a98 --- /dev/null +++ b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Key.java @@ -0,0 +1,8 @@ +package com.alibaba.datax.plugin.reader.txtfilereader; + +/** + * Created by haiwei.luo on 14-9-20. 
+ */ +public class Key { + public static final String PATH = "path"; +} diff --git a/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReader.java b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReader.java new file mode 100755 index 000000000..f05c37291 --- /dev/null +++ b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReader.java @@ -0,0 +1,420 @@ +package com.alibaba.datax.plugin.reader.txtfilereader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderErrorCode; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import com.google.common.collect.Sets; + +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.BooleanUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.InputStream; +import java.nio.charset.UnsupportedCharsetException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.regex.Pattern; + +/** + * Created by haiwei.luo on 14-9-20. + */ +public class TxtFileReader extends Reader { + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration originConfig = null; + + private List path = null; + + private List sourceFiles; + + private Map pattern; + + private Map isRegexPath; + + @Override + public void init() { + this.originConfig = this.getPluginJobConf(); + this.pattern = new HashMap(); + this.isRegexPath = new HashMap(); + this.validateParameter(); + } + + private void validateParameter() { + // Compatible with the old version, path is a string before + String pathInString = this.originConfig.getNecessaryValue(Key.PATH, + TxtFileReaderErrorCode.REQUIRED_VALUE); + if (StringUtils.isBlank(pathInString)) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.REQUIRED_VALUE, + "您需要指定待读取的源目录或文件"); + } + if (!pathInString.startsWith("[") && !pathInString.endsWith("]")) { + path = new ArrayList(); + path.add(pathInString); + } else { + path = this.originConfig.getList(Key.PATH, String.class); + if (null == path || path.size() == 0) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.REQUIRED_VALUE, + "您需要指定待读取的源目录或文件"); + } + } + + String encoding = this.originConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + com.alibaba.datax.plugin.unstructuredstorage.reader.Constant.DEFAULT_ENCODING); + if (StringUtils.isBlank(encoding)) { + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + null); + } else { + try { + encoding = encoding.trim(); + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + encoding); + Charsets.toCharset(encoding); + } catch (UnsupportedCharsetException uce) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的编码格式 : [%s]", encoding), 
uce); + } catch (Exception e) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.CONFIG_INVALID_EXCEPTION, + String.format("编码配置异常, 请联系我们: %s", e.getMessage()), + e); + } + } + + // column: 1. index type 2.value type 3.when type is Date, may have + // format + List columns = this.originConfig + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + // handle ["*"] + if (null != columns && 1 == columns.size()) { + String columnsInStr = columns.get(0).toString(); + if ("\"*\"".equals(columnsInStr) || "'*'".equals(columnsInStr)) { + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN, + null); + columns = null; + } + } + + if (null != columns && columns.size() != 0) { + for (Configuration eachColumnConf : columns) { + eachColumnConf + .getNecessaryValue( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.TYPE, + TxtFileReaderErrorCode.REQUIRED_VALUE); + Integer columnIndex = eachColumnConf + .getInt(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.INDEX); + String columnValue = eachColumnConf + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.VALUE); + + if (null == columnIndex && null == columnValue) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnValue) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + if (null != columnIndex && columnIndex < 0) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.ILLEGAL_VALUE, String + .format("index需要大于等于0, 您配置的index为[%s]", + columnIndex)); + } + } + } + + // only support compress types + String compress = this.originConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS); + if (StringUtils.isBlank(compress)) { + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, + null); + } else { + Set supportedCompress = Sets + .newHashSet("gzip", "bzip2"); + compress = compress.toLowerCase().trim(); + if (!supportedCompress.contains(compress)) { + throw DataXException + .asDataXException( + TxtFileReaderErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 gzip, bzip2 文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", + compress)); + } + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, + compress); + } + + String delimiterInStr = this.originConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.FIELD_DELIMITER); + // warn: if have, length must be one + if (null != delimiterInStr && 1 != delimiterInStr.length()) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", + delimiterInStr)); + } + + } + + @Override + public void prepare() { + LOG.debug("prepare() begin..."); + // warn:make sure this regex string + // warn:no need trim + for (String eachPath : this.path) { + String regexString = eachPath.replace("*", ".*").replace("?", + ".?"); + Pattern patt = Pattern.compile(regexString); + this.pattern.put(eachPath, patt); + this.sourceFiles = this.buildSourceTargets(); + } + + LOG.info(String.format("您即将读取的文件数为: [%s]", this.sourceFiles.size())); + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + + // warn: 如果源目录为空会报错,拖空目录意图=>空文件显示指定此意图 + @Override + public List split(int adviceNumber) { + 
LOG.debug("split() begin..."); + List readerSplitConfigs = new ArrayList(); + + // warn:每个slice拖且仅拖一个文件, + // int splitNumber = adviceNumber; + int splitNumber = this.sourceFiles.size(); + if (0 == splitNumber) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.EMPTY_DIR_EXCEPTION, String + .format("未能找到待读取的文件,请确认您的配置项path: %s", + this.originConfig.getString(Key.PATH))); + } + + List> splitedSourceFiles = this.splitSourceFiles( + this.sourceFiles, splitNumber); + for (List files : splitedSourceFiles) { + Configuration splitedConfig = this.originConfig.clone(); + splitedConfig.set(Constant.SOURCE_FILES, files); + readerSplitConfigs.add(splitedConfig); + } + LOG.debug("split() ok and end..."); + return readerSplitConfigs; + } + + // validate the path, path must be a absolute path + private List buildSourceTargets() { + // for eath path + Set toBeReadFiles = new HashSet(); + for (String eachPath : this.path) { + int endMark; + for (endMark = 0; endMark < eachPath.length(); endMark++) { + if ('*' != eachPath.charAt(endMark) + && '?' != eachPath.charAt(endMark)) { + continue; + } else { + this.isRegexPath.put(eachPath, true); + break; + } + } + + String parentDirectory; + if (BooleanUtils.isTrue(this.isRegexPath.get(eachPath))) { + int lastDirSeparator = eachPath.substring(0, endMark) + .lastIndexOf(IOUtils.DIR_SEPARATOR); + parentDirectory = eachPath.substring(0, + lastDirSeparator + 1); + } else { + this.isRegexPath.put(eachPath, false); + parentDirectory = eachPath; + } + this.buildSourceTargetsEathPath(eachPath, parentDirectory, + toBeReadFiles); + } + return Arrays.asList(toBeReadFiles.toArray(new String[0])); + } + + private void buildSourceTargetsEathPath(String regexPath, + String parentDirectory, Set toBeReadFiles) { + // 检测目录是否存在,错误情况更明确 + try { + File dir = new File(parentDirectory); + boolean isExists = dir.exists(); + if (!isExists) { + String message = String.format("您设定的目录不存在 : [%s]", + parentDirectory); + LOG.error(message); + throw DataXException.asDataXException( + TxtFileReaderErrorCode.FILE_NOT_EXISTS, message); + } + } catch (SecurityException se) { + String message = String.format("您没有权限查看目录 : [%s]", + parentDirectory); + LOG.error(message); + throw DataXException.asDataXException( + TxtFileReaderErrorCode.SECURITY_NOT_ENOUGH, message); + } + + directoryRover(regexPath, parentDirectory, toBeReadFiles); + } + + private void directoryRover(String regexPath, String parentDirectory, + Set toBeReadFiles) { + File directory = new File(parentDirectory); + // is a normal file + if (!directory.isDirectory()) { + if (this.isTargetFile(regexPath, directory.getAbsolutePath())) { + toBeReadFiles.add(parentDirectory); + LOG.info(String.format( + "add file [%s] as a candidate to be read.", + parentDirectory)); + + } + } else { + // 是目录 + try { + // warn:对于没有权限的目录,listFiles 返回null,而不是抛出SecurityException + File[] files = directory.listFiles(); + if (null != files) { + for (File subFileNames : files) { + directoryRover(regexPath, + subFileNames.getAbsolutePath(), + toBeReadFiles); + } + } else { + // warn: 对于没有权限的文件,是直接throw DataXException + String message = String.format("您没有权限查看目录 : [%s]", + directory); + LOG.error(message); + throw DataXException.asDataXException( + TxtFileReaderErrorCode.SECURITY_NOT_ENOUGH, + message); + } + + } catch (SecurityException e) { + String message = String.format("您没有权限查看目录 : [%s]", + directory); + LOG.error(message); + throw DataXException.asDataXException( + TxtFileReaderErrorCode.SECURITY_NOT_ENOUGH, + message, e); + } + } + } + + // 正则过滤 
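+ // 补充说明:若该 path 配置含通配符(isRegexPath 为 true),则用 prepare 阶段预编译的 Pattern 对文件绝对路径做 matches 匹配;普通路径(非正则)直接视为命中。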
+ private boolean isTargetFile(String regexPath, String absoluteFilePath) { + if (this.isRegexPath.get(regexPath)) { + return this.pattern.get(regexPath).matcher(absoluteFilePath) + .matches(); + } else { + return true; + } + + } + + private List> splitSourceFiles(final List sourceList, + int adviceNumber) { + List> splitedList = new ArrayList>(); + int averageLength = sourceList.size() / adviceNumber; + averageLength = averageLength == 0 ? 1 : averageLength; + + for (int begin = 0, end = 0; begin < sourceList.size(); begin = end) { + end = begin + averageLength; + if (end > sourceList.size()) { + end = sourceList.size(); + } + splitedList.add(sourceList.subList(begin, end)); + } + return splitedList; + } + + } + + public static class Task extends Reader.Task { + private static Logger LOG = LoggerFactory.getLogger(Task.class); + + private Configuration readerSliceConfig; + private List sourceFiles; + + @Override + public void init() { + this.readerSliceConfig = this.getPluginJobConf(); + this.sourceFiles = this.readerSliceConfig.getList( + Constant.SOURCE_FILES, String.class); + } + + @Override + public void prepare() { + + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + @Override + public void startRead(RecordSender recordSender) { + LOG.debug("start read source files..."); + for (String fileName : this.sourceFiles) { + LOG.info(String.format("reading file : [%s]", fileName)); + InputStream inputStream; + try { + inputStream = new FileInputStream(fileName); + UnstructuredStorageReaderUtil.readFromStream(inputStream, + fileName, this.readerSliceConfig, recordSender, + this.getTaskPluginCollector()); + recordSender.flush(); + } catch (FileNotFoundException e) { + // warn: sock 文件无法read,能影响所有文件的传输,需要用户自己保证 + String message = String + .format("找不到待读取的文件 : [%s]", fileName); + LOG.error(message); + throw DataXException.asDataXException( + TxtFileReaderErrorCode.OPEN_FILE_ERROR, message); + } + } + LOG.debug("end read source files..."); + } + + } +} diff --git a/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReaderErrorCode.java b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReaderErrorCode.java new file mode 100755 index 000000000..800009622 --- /dev/null +++ b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReaderErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.plugin.reader.txtfilereader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-20. 
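+ * TxtFileReader 在参数校验与文件读取过程中可能抛出的各类错误码及其描述。(补充性说明注释)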
+ */ +public enum TxtFileReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("TxtFileReader-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("TxtFileReader-01", "您填写的参数值不合法."), + MIXED_INDEX_VALUE("TxtFileReader-02", "您的列信息配置同时包含了index,value."), + NO_INDEX_VALUE("TxtFileReader-03","您明确的配置列信息,但未填写相应的index,value."), + FILE_NOT_EXISTS("TxtFileReader-04", "您配置的目录文件路径不存在."), + OPEN_FILE_WITH_CHARSET_ERROR("TxtFileReader-05", "您配置的文件编码和实际文件编码不符合."), + OPEN_FILE_ERROR("TxtFileReader-06", "您配置的文件在打开时异常,建议您检查源目录是否有隐藏文件,管道文件等特殊文件."), + READ_FILE_IO_ERROR("TxtFileReader-07", "您配置的文件在读取时出现IO异常."), + SECURITY_NOT_ENOUGH("TxtFileReader-08", "您缺少权限执行相应的文件操作."), + CONFIG_INVALID_EXCEPTION("TxtFileReader-09", "您的参数配置错误."), + RUNTIME_EXCEPTION("TxtFileReader-10", "出现运行时异常, 请联系我们"), + EMPTY_DIR_EXCEPTION("TxtFileReader-11", "您尝试读取的文件目录为空."),; + + private final String code; + private final String description; + + private TxtFileReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/txtfilereader/src/main/resources/plugin.json b/txtfilereader/src/main/resources/plugin.json new file mode 100755 index 000000000..89e20cfc1 --- /dev/null +++ b/txtfilereader/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "txtfilereader", + "class": "com.alibaba.datax.plugin.reader.txtfilereader.TxtFileReader", + "description": "useScene: test. mechanism: use datax framework to transport data from txt file. warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} + diff --git a/txtfilereader/src/main/resources/plugin_job_template.json b/txtfilereader/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..744d6f54b --- /dev/null +++ b/txtfilereader/src/main/resources/plugin_job_template.json @@ -0,0 +1,9 @@ +{ + "name": "txtfilereader", + "parameter": { + "path": [], + "encoding": "", + "column": [], + "fieldDelimiter": "" + } +} \ No newline at end of file diff --git a/txtfilereader/txtfilereader.iml b/txtfilereader/txtfilereader.iml new file mode 100644 index 000000000..feb2a397a --- /dev/null +++ b/txtfilereader/txtfilereader.iml @@ -0,0 +1,30 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/txtfilewriter/doc/txtfilewriter.md b/txtfilewriter/doc/txtfilewriter.md new file mode 100644 index 000000000..e8daab739 --- /dev/null +++ b/txtfilewriter/doc/txtfilewriter.md @@ -0,0 +1,216 @@ +# DataX TxtFileWriter 说明 + + +------------ + +## 1 快速介绍 + +TxtFileWriter提供了向本地文件写入类CSV格式的一个或者多个表文件。TxtFileWriter服务的用户主要在于DataX开发、测试同学。 + +**写入本地文件内容存放的是一张逻辑意义上的二维表,例如CSV格式的文本信息。** + + +## 2 功能与限制 + +TxtFileWriter实现了从DataX协议转为本地TXT文件功能,本地文件本身是无结构化数据存储,TxtFileWriter如下几个方面约定: + +1. 支持且仅支持写入 TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 支持文本压缩,现有压缩格式为gzip、bzip2。 + +6. 支持多线程写入,每个线程写入不同子文件。 + +7. 文件支持滚动,当文件大于某个size值或者行数值,文件需要切换。 [暂不支持] + +我们不能做到: + +1. 
单个文件不能支持并发写入。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "setting": {}, + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "txtfilereader", + "parameter": { + "path": ["/home/haiwei.luo/case00/data"], + "encoding": "UTF-8", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "index": 2, + "type": "double" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "date", + "format": "yyyy.MM.dd" + } + ], + "fieldDelimiter": "," + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/home/haiwei.luo/case00/result", + "fileName": "luohw", + "writeMode": "truncate", + "dateFormat": "yyyy-MM-dd" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **path** + + * 描述:本地文件系统的路径信息,TxtFileWriter会写入Path目录下属多个文件。
+ + * 必选:是
+ + * 默认值:无
+ +* **fileName** + + * 描述:TxtFileWriter写入的文件名,该文件名会添加随机后缀,作为每个线程实际写入的文件名。
+ + * 必选:是
+ + * 默认值:无
+ +* **writeMode** + + * 描述:TxtFileWriter写入前数据清理处理模式:
+ + * truncate,写入前清理目录下以fileName为前缀的所有文件。 + * append,写入前不做任何处理,DataX TxtFileWriter直接使用fileName写入,并保证文件名不冲突。 + * nonConflict,如果目录下已有以fileName为前缀的文件,直接报错。 + + * 必选:是
+ + * 默认值:无
+ +* **fieldDelimiter** + + * 描述:写入数据时使用的字段分隔符。
+ + * 必选:否
+ + * 默认值:,
+ +* **compress** + + * 描述:文本压缩类型,默认不填写意味着没有压缩。支持压缩类型为zip、lzo、lzop、tgz、bzip2。
+ + * 必选:否
+ + * 默认值:无压缩
+ +* **encoding** + + * 描述:写出文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+ + +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如:用户配置 nullFormat="\N" 后,若源头数据是 "\N",DataX 将其视作 null 字段。 + + * 必选:否
+ + * 默认值:\N
+ +* **dateFormat** + + * 描述:日期类型的数据序列化到文件中时的格式,例如 "dateFormat": "yyyy-MM-dd"。
+ + * 必选:否
+ + * 默认值:无
+ +* **fileFormat** + + * 描述:文件写出的格式,包括 [csv](http://zh.wikipedia.org/wiki/%E9%80%97%E5%8F%B7%E5%88%86%E9%9A%94%E5%80%BC) 和 text 两种。csv 是严格的 csv 格式:如果待写数据中包含列分隔符,则会按照 csv 的转义语法转义,转义符号为双引号(");text 格式则直接用列分隔符拼接待写数据,待写数据中包含列分隔符时不做转义。各可选参数的组合用法可参考本节(3.2)末尾的示意配置。
+ + * 必选:否
+ + * 默认值:text
+ +* **header** + + * 描述:txt写出时的表头,示例['id', 'name', 'age']。
+ + * 必选:否
+ + * 默认值:无
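+
+下面给出一个仅作示意的可选参数组合(仅展示 writer 部分,需嵌入到 3.1 所示的 job 结构中使用):其中 path、fileName 为假设的示例值,compress 取值沿用上文列出的 bzip2,其余取值请以各参数说明为准。
+
+```json
+{
+    "name": "txtfilewriter",
+    "parameter": {
+        "path": "/tmp/datax/result",
+        "fileName": "example_out",
+        "writeMode": "truncate",
+        "fieldDelimiter": ",",
+        "compress": "bzip2",
+        "encoding": "UTF-8",
+        "nullFormat": "\\N",
+        "dateFormat": "yyyy-MM-dd",
+        "fileFormat": "csv",
+        "header": ["id", "name", "age"]
+    }
+}
+```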
+ +### 3.3 类型转换 + + +本地文件本身不提供数据类型,该类型是DataX TxtFileWriter定义: + +| DataX 内部类型| 本地文件 数据类型 | +| -------- | ----- | +| +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* 本地文件 Long是指本地文件文本中使用整形的字符串表示形式,例如"19901219"。 +* 本地文件 Double是指本地文件文本中使用Double的字符串表示形式,例如"3.1415"。 +* 本地文件 Boolean是指本地文件文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* 本地文件 Date是指本地文件文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + + +## 4 性能报告 + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + + diff --git a/txtfilewriter/pom.xml b/txtfilewriter/pom.xml new file mode 100755 index 000000000..cb47329b1 --- /dev/null +++ b/txtfilewriter/pom.xml @@ -0,0 +1,78 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + txtfilewriter + txtfilewriter + TxtFileWriter提供了本地写入TEXT功能,建议开发、测试环境使用。 + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/txtfilewriter/src/main/assembly/package.xml b/txtfilewriter/src/main/assembly/package.xml new file mode 100755 index 000000000..63354d469 --- /dev/null +++ b/txtfilewriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/txtfilewriter + + + target/ + + txtfilewriter-0.0.1-SNAPSHOT.jar + + plugin/writer/txtfilewriter + + + + + + false + plugin/writer/txtfilewriter/libs + runtime + + + diff --git a/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/Key.java b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/Key.java new file mode 100755 index 000000000..f57f9f961 --- /dev/null +++ b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/Key.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.plugin.writer.txtfilewriter; + +/** + * Created by haiwei.luo on 14-9-17. 
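+ * 配置项键名常量:PATH 对应 writer 配置中的 "path",即写入文件的目标目录。(补充性说明注释)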
+ */ +public class Key { + // must have + public static final String PATH = "path"; +} diff --git a/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriter.java b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriter.java new file mode 100755 index 000000000..eefd3096e --- /dev/null +++ b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriter.java @@ -0,0 +1,342 @@ +package com.alibaba.datax.plugin.writer.txtfilewriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.writer.UnstructuredStorageWriterUtil; + +import org.apache.commons.io.FileUtils; +import org.apache.commons.io.IOUtils; +import org.apache.commons.io.filefilter.PrefixFileFilter; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.io.FileOutputStream; +import java.io.FilenameFilter; +import java.io.IOException; +import java.io.OutputStream; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.UUID; + +/** + * Created by haiwei.luo on 14-9-17. + */ +public class TxtFileWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration writerSliceConfig = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.validateParameter(); + String dateFormatOld = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FORMAT); + String dateFormatNew = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.DATE_FORMAT); + if (null == dateFormatNew) { + this.writerSliceConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.DATE_FORMAT, + dateFormatOld); + } + if (null != dateFormatOld) { + LOG.warn("您使用format配置日期格式化, 这是不推荐的行为, 请优先使用dateFormat配置项, 两项同时存在则使用dateFormat."); + } + UnstructuredStorageWriterUtil + .validateParameter(this.writerSliceConfig); + } + + private void validateParameter() { + this.writerSliceConfig + .getNecessaryValue( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME, + TxtFileWriterErrorCode.REQUIRED_VALUE); + + String path = this.writerSliceConfig.getNecessaryValue(Key.PATH, + TxtFileWriterErrorCode.REQUIRED_VALUE); + + try { + // warn: 这里用户需要配一个目录 + File dir = new File(path); + if (dir.isFile()) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的path: [%s] 不是一个合法的目录, 请您注意文件重名, 不合法目录名等情况.", + path)); + } + if (!dir.exists()) { + boolean createdOk = dir.mkdirs(); + if (!createdOk) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.CONFIG_INVALID_EXCEPTION, + String.format("您指定的文件路径 : [%s] 创建失败.", + path)); + } + } + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限创建文件路径 : [%s] ", path), se); + } + } + + @Override + public void prepare() { + String path = this.writerSliceConfig.getString(Key.PATH); + String fileName = this.writerSliceConfig + 
.getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME); + String writeMode = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.WRITE_MODE); + // truncate option handler + if ("truncate".equals(writeMode)) { + LOG.info(String.format( + "由于您配置了writeMode truncate, 开始清理 [%s] 下面以 [%s] 开头的内容", + path, fileName)); + File dir = new File(path); + // warn:需要判断文件是否存在,不存在时,不能删除 + try { + if (dir.exists()) { + // warn:不要使用FileUtils.deleteQuietly(dir); + FilenameFilter filter = new PrefixFileFilter(fileName); + File[] filesWithFileNamePrefix = dir.listFiles(filter); + for (File eachFile : filesWithFileNamePrefix) { + LOG.info(String.format("delete file [%s].", + eachFile.getName())); + FileUtils.forceDelete(eachFile); + } + // FileUtils.cleanDirectory(dir); + } + } catch (NullPointerException npe) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.Write_FILE_ERROR, + String.format("您配置的目录清空时出现空指针异常 : [%s]", + path), npe); + } catch (IllegalArgumentException iae) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您配置的目录参数异常 : [%s]", path)); + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限查看目录 : [%s]", path)); + } catch (IOException e) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.Write_FILE_ERROR, + String.format("无法清空目录 : [%s]", path), e); + } + } else if ("append".equals(writeMode)) { + LOG.info(String + .format("由于您配置了writeMode append, 写入前不做清理工作, [%s] 目录下写入相应文件名前缀 [%s] 的文件", + path, fileName)); + } else if ("nonConflict".equals(writeMode)) { + LOG.info(String.format( + "由于您配置了writeMode nonConflict, 开始检查 [%s] 下面的内容", path)); + // warn: check two times about exists, mkdirs + File dir = new File(path); + try { + if (dir.exists()) { + if (dir.isFile()) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的path: [%s] 不是一个合法的目录, 请您注意文件重名, 不合法目录名等情况.", + path)); + } + // fileName is not null + FilenameFilter filter = new PrefixFileFilter(fileName); + File[] filesWithFileNamePrefix = dir.listFiles(filter); + if (filesWithFileNamePrefix.length > 0) { + List allFiles = new ArrayList(); + for (File eachFile : filesWithFileNamePrefix) { + allFiles.add(eachFile.getName()); + } + LOG.error(String.format("冲突文件列表为: [%s]", + StringUtils.join(allFiles, ","))); + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的path: [%s] 目录不为空, 下面存在其他文件或文件夹.", + path)); + } + } else { + boolean createdOk = dir.mkdirs(); + if (!createdOk) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.CONFIG_INVALID_EXCEPTION, + String.format( + "您指定的文件路径 : [%s] 创建失败.", + path)); + } + } + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限查看目录 : [%s]", path)); + } + } else { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 truncate, append, nonConflict 三种模式, 不支持您配置的 writeMode 模式 : [%s]", + writeMode)); + } + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + @Override + public List split(int mandatoryNumber) { + LOG.info("begin do split..."); + List writerSplitConfigs = new ArrayList(); + String filePrefix = this.writerSliceConfig + 
.getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME); + + Set allFiles = new HashSet(); + String path = null; + try { + path = this.writerSliceConfig.getString(Key.PATH); + File dir = new File(path); + allFiles.addAll(Arrays.asList(dir.list())); + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限查看目录 : [%s]", path)); + } + + String fileSuffix; + for (int i = 0; i < mandatoryNumber; i++) { + // handle same file name + + Configuration splitedTaskConfig = this.writerSliceConfig + .clone(); + + String fullFileName = null; + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + fullFileName = String.format("%s__%s", filePrefix, fileSuffix); + while (allFiles.contains(fullFileName)) { + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + fullFileName = String.format("%s__%s", filePrefix, + fileSuffix); + } + allFiles.add(fullFileName); + + splitedTaskConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME, + fullFileName); + + LOG.info(String.format("splited write file name:[%s]", + fullFileName)); + + writerSplitConfigs.add(splitedTaskConfig); + } + LOG.info("end do split."); + return writerSplitConfigs; + } + + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + + private Configuration writerSliceConfig; + + private String path; + + private String fileName; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.path = this.writerSliceConfig.getString(Key.PATH); + this.fileName = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME); + } + + @Override + public void prepare() { + + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + LOG.info("begin do write..."); + String fileFullPath = this.buildFilePath(); + LOG.info(String.format("write to file : [%s]", fileFullPath)); + + OutputStream outputStream = null; + try { + File newFile = new File(fileFullPath); + newFile.createNewFile(); + outputStream = new FileOutputStream(newFile); + UnstructuredStorageWriterUtil.writeToStream(lineReceiver, + outputStream, this.writerSliceConfig, this.fileName, + this.getTaskPluginCollector()); + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限创建文件 : [%s]", this.fileName)); + } catch (IOException ioe) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.Write_FILE_IO_ERROR, + String.format("无法创建待写文件 : [%s]", this.fileName), ioe); + } finally { + IOUtils.closeQuietly(outputStream); + } + LOG.info("end do write"); + } + + private String buildFilePath() { + boolean isEndWithSeparator = false; + switch (IOUtils.DIR_SEPARATOR) { + case IOUtils.DIR_SEPARATOR_UNIX: + isEndWithSeparator = this.path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR)); + break; + case IOUtils.DIR_SEPARATOR_WINDOWS: + isEndWithSeparator = this.path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR_WINDOWS)); + break; + default: + break; + } + if (!isEndWithSeparator) { + this.path = this.path + IOUtils.DIR_SEPARATOR; + } + return String.format("%s%s", this.path, this.fileName); + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + } +} diff --git a/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriterErrorCode.java 
b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriterErrorCode.java new file mode 100755 index 000000000..55579575b --- /dev/null +++ b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriterErrorCode.java @@ -0,0 +1,41 @@ +package com.alibaba.datax.plugin.writer.txtfilewriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-17. + */ +public enum TxtFileWriterErrorCode implements ErrorCode { + + CONFIG_INVALID_EXCEPTION("TxtFileWriter-00", "您的参数配置错误."), + REQUIRED_VALUE("TxtFileWriter-01", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("TxtFileWriter-02", "您填写的参数值不合法."), + Write_FILE_ERROR("TxtFileWriter-03", "您配置的目标文件在写入时异常."), + Write_FILE_IO_ERROR("TxtFileWriter-04", "您配置的文件在写入时出现IO异常."), + SECURITY_NOT_ENOUGH("TxtFileWriter-05", "您缺少权限执行相应的文件写入操作."); + + private final String code; + private final String description; + + private TxtFileWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } + +} diff --git a/txtfilewriter/src/main/resources/plugin.json b/txtfilewriter/src/main/resources/plugin.json new file mode 100755 index 000000000..8844814c9 --- /dev/null +++ b/txtfilewriter/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "txtfilewriter", + "class": "com.alibaba.datax.plugin.writer.txtfilewriter.TxtFileWriter", + "description": "useScene: test. mechanism: use datax framework to transport data to txt file. warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} + diff --git a/txtfilewriter/src/main/resources/plugin_job_template.json b/txtfilewriter/src/main/resources/plugin_job_template.json new file mode 100644 index 000000000..62d075bbd --- /dev/null +++ b/txtfilewriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,10 @@ +{ + "name": "txtfilewriter", + "parameter": { + "path": "", + "fileName": "", + "writeMode": "", + "fieldDelimiter":"", + "dateFormat": "" + } +} \ No newline at end of file diff --git a/txtfilewriter/txtfilewriter.iml b/txtfilewriter/txtfilewriter.iml new file mode 100644 index 000000000..feb2a397a --- /dev/null +++ b/txtfilewriter/txtfilewriter.iml @@ -0,0 +1,30 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/userGuid.md b/userGuid.md new file mode 100644 index 000000000..c6d36ef6f --- /dev/null +++ b/userGuid.md @@ -0,0 +1,254 @@ +# DataX + +DataX 是阿里巴巴集团内被广泛使用的离线数据同步工具/平台,实现包括 MySQL、SQL Server、Oracle、PostgreSQL、HDFS、Hive、HBase、OTS、ODPS 等各种异构数据源之间高效的数据同步功能。 + +# Features + +DataX本身作为数据同步框架,将不同数据源的同步抽象为从源头数据源读取数据的Reader插件,以及向目标端写入数据的Writer插件,理论上DataX框架可以支持任意数据源类型的数据同步工作。同时DataX插件体系作为一套生态系统, 每接入一套新数据源该新加入的数据源即可实现和现有的数据源互通。 + +# System Requirements + +- Linux +- [JDK(1.6以上,推荐1.6) ](http://www.oracle.com/technetwork/cn/java/javase/downloads/index.html) +- [Python(推荐Python2.6.X) ](https://www.python.org/downloads/) +- [Apache Maven 3.x](https://maven.apache.org/download.cgi) (Compile DataX) + +# Quick Start + +* 工具部署 + + * 方法一、直接下载DataX工具包:[DataX](https://github.com/alibaba/DataX) + + 下载后解压至本地某个目录,进入bin目录,即可运行同步作业: + + ``` shell + $ cd {YOUR_DATAX_HOME}/bin + $ python datax.py {YOUR_JOB.json} + ``` + + * 
方法二、下载DataX源码,自己编译:[DataX源码](https://github.com/alibaba/DataX) + + (1)、下载DataX源码: + + ``` shell + $ git clone git@github.com:alibaba/DataX.git + ``` + + (2)、通过maven打包: + + ``` shell + $ cd {DataX_source_code_home} + $ mvn -U clean package assembly:assembly -Dmaven.test.skip=true + ``` + + 打包成功,日志显示如下: + + ``` + [INFO] BUILD SUCCESS + [INFO] ----------------------------------------------------------------- + [INFO] Total time: 08:12 min + [INFO] Finished at: 2015-12-13T16:26:48+08:00 + [INFO] Final Memory: 133M/960M + [INFO] ----------------------------------------------------------------- + ``` + + 打包成功后的DataX包位于 {DataX_source_code_home}/target/datax/datax/ ,结构如下: + + ``` shell + $ cd {DataX_source_code_home} + $ ls ./target/datax/datax/ + bin conf job lib log log_perf plugin script tmp + ``` + + +* 配置示例:从stream读取数据并打印到控制台 + + * 第一步、创建创业的配置文件(json格式) + + 可以通过命令查看配置模板: python datax.py -r {YOUR_READER} -w {YOUR_WRITER} + + ``` shell + $ cd {YOUR_DATAX_HOME}/bin + $ python datax.py -r streamreader -w streamwriter + DataX (UNKNOWN_DATAX_VERSION), From Alibaba ! + Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved. + Please refer to the streamreader document: + https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md + + Please refer to the streamwriter document: + https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md + + Please save the following configuration as a json file and use + python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json + to run the job. + + { + "job": { + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [], + "sliceRecordCount": "" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "encoding": "", + "print": true + } + } + } + ], + "setting": { + "speed": { + "channel": "" + } + } + } + } + ``` + + 根据模板配置json如下: + + ``` json + #stream2stream.json + { + "job": { + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "sliceRecordCount": 10, + "column": [ + { + "type": "long", + "value": "10" + }, + { + "type": "string", + "value": "hello,你好,世界-DataX" + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "encoding": "UTF-8", + "print": true + } + } + } + ], + "setting": { + "speed": { + "channel": 5 + } + } + } + } + ``` + + * 第二步:启动DataX + + ``` shell + $ cd {YOUR_DATAX_DIR_BIN} + $ python datax.py ./stream2stream.json + ``` + + 同步结束,显示日志如下: + + ``` shell + ... 
+ 2015-12-17 11:20:25.263 [job-0] INFO JobContainer - + 任务启动时刻 : 2015-12-17 11:20:15 + 任务结束时刻 : 2015-12-17 11:20:25 + 任务总计耗时 : 10s + 任务平均流量 : 205B/s + 记录写入速度 : 5rec/s + 读出记录总数 : 50 + 读写失败总数 : 0 + ``` + +# Support Data Channels + +目前DataX支持的数据源有: + +### Reader + +> **Reader实现了从数据存储系统批量抽取数据,并转换为DataX标准数据交换协议,DataX任意Reader能与DataX任意Writer实现无缝对接,达到任意异构数据互通之目的。** + +**RDBMS 关系型数据库** + +* [MysqlReader](https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md): 使用JDBC批量抽取Mysql数据集。 +* [OracleReader](https://github.com/alibaba/DataX/blob/master/oraclereader/doc/oraclereader.md): 使用JDBC批量抽取Oracle数据集。 +* [SqlServerReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-sqlserverreader): 使用JDBC批量抽取SqlServer数据集 +* [PostgresqlReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-pgreader): 使用JDBC批量抽取PostgreSQL数据集 +* [DrdsReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-drdsreader): 针对公有云上DRDS的批量数据抽取工具。 + +**数仓数据存储** + +* [ODPSReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-odpsreader): 使用ODPS Tunnel SDK批量抽取ODPS数据。 + +**NoSQL数据存储** + +* [OTSReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-otsreader): 针对公有云上OTS的批量数据抽取工具。 +* [HBaseReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-hbasereader): 针对 HBase 0.94版本的在线数据抽取工具 + +**无结构化数据存储** + +* [TxtFileReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-txtfilereader): 读取(递归/过滤)本地文件。 +* [FtpReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-ftpreader): 读取(递归/过滤)远程ftp文件。 +* [HdfsReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-hdfsreader): 针对Hdfs文件系统中textfile和orcfile文件批量数据抽取工具。 +* [OssReader](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-ossreader): 针对公有云OSS产品的批量数据抽取工具。 +* StreamReader + +### Writer + +---- + +> **Writer实现了从DataX标准数据交换协议,翻译为具体的数据存储类型并写入目的数据存储。DataX任意Writer能与DataX任意Reader实现无缝对接,达到任意异构数据互通之目的。** + +---- + +**RDBMS 关系型数据库** + +* [MysqlWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-mysqlwriter): 使用JDBC(Insert,Replace方式)写入Mysql数据库 +* [OracleWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-oraclewriter): 使用JDBC(Insert方式)写入Oracle数据库 +* [PostgresqlWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-pgwriter): 使用JDBC(Insert方式)写入PostgreSQL数据库 +* [SqlServerWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-sqlserverwriter): 使用JDBC(Insert方式)写入sqlserver数据库 +* [DrdsWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-drdswriter): 使用JDBC(Replace方式)写入Drds数据库 + +**数仓数据存储** + +* [ODPSWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-odpswriter): 使用ODPS Tunnel SDK向ODPS写入数据。 +* [ADSWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-adswriter): 使用ODPS中转将数据导入ADS。 + +**NoSQL数据存储** + +* [OTSWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-otswriter): 使用OTS SDK向OTS Public模型的表中导入数据。 +* [OCSWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-ocswriter) +* [MongoDBReader](mongo_db_reader):MongoDBReader +* [MongoDBWriter](mongo_db_writer):MongoDBWriter + +**无结构化数据存储** + +* [TxtFileWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-txtfilewriter): 提供写入本地文件功能。 +* [OssWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-osswriter): 使用OSS SDK写入OSS数据。 +* [HdfsWriter](http://gitlab.alibaba-inc.com/datax/datax/wikis/datax-plugin-hdfswriter): 提供向Hdfs文件系统中写入textfile文件和orcfile文件功能。 +* 
StreamWriter + + + +# Contact us + +Google Groups: [DataX-user](https://github.com/alibaba/DataX) + +QQ群:?? + + +