解决方案

开源数据同步工具DataX

seo靠我 2023-09-25 12:45:30

1. DataX

1.1. 产品特性

DataX 是阿里巴巴集团内被广泛使用的离线数据同步工具/平台,实现包括 MySQL、Oracle、SqlServer、Postgre、HDFS、Hive、ADS、HSEO靠我Base、TableStore(OTS)、MaxCompute(ODPS)、DRDS 等各种异构数据源之间高效的数据同步功能。

官方提供的datax框架图:

最终把不同数据源和目标源组成的网状结构,变成了SEO靠我星型结构:

1.2. 支持场景

DataX目前已经有了比较全面的插件体系,主流的RDBMS数据库、NOSQL、大数据计算系统都已经接入,目前支持数据如下图,详情请点击:DataX数据源参考指南

类型数据源RSEO靠我eader(读)Writer(写)文档RDBMS 关系型数据库MySQL√√读 、写Oracle√√读 、写SQLServer√√读 、写PostgreSQL√√读 、写DRDS√√读 、写通用RDBSEO靠我MS(支持所有关系型数据库)√√读 、写阿里云数仓数据存储ODPS√√读 、写ADS√写OSS√√读 、写OCS√√读 、写NoSQL数据存储OTS√√读 、写Hbase0.94√√读 、写HbaseSEO靠我1.1√√读 、写Phoenix4.x√√读 、写Phoenix5.x√√读 、写MongoDB√√读 、写Hive√√读 、写Cassandra√√读 、写无结构化数据存储TxtFile√√读 、写SEO靠我FTP√√读 、写HDFS√√读 、写Elasticsearch√写时间序列数据库OpenTSDB√读TSDB√√读 、写

1.3. 迁移场景解决方案

1.3.1. 迁移测试场景

当前测试均选用阿里云RDSSEO靠我 Mysql5.6作为测试源端以及目标端资源,DataX数据源可支持范围很大,具体支持场景可以参考本文档3段落部分

1.3.1.1. 待迁移源端数据库 数据库类型数据源数据库版本资源大小部署方式关系型数据SEO靠我库阿里云RDSMysql 5.62C-4G-50G单主库 1.3.1.2. 待同步目标端数据库 数据库类型数据源数据库版本资源大小部署方式关系型数据库阿里云RDSMysSEO靠我ql 5.62C-4G-50G单主库 1.3.1.3. 迁移程序 环境要求:linux(windows也可以)、JDK1.8级以上、 python 2.x本次测试环境:SEO靠我CentOS7.5、JDK1.8、Python2.7.5

1.3.2. 安装部署

部署两种方式

使用官方编译好的 工具 包(datax.tar.gz),解压即用下载Datax源码,使用Maven进行编译。编SEO靠我译时间会有点长。

本次测试使用官方编译好的工具包(datax.tar.gz)下载并解压后使用

1.3.2.1. 安装JDK环境 yum install -y java-1.8.0 # 使用SEO靠我默认的CentOS7.5 Yum源即可 1.3.2.2. 安装apache-maven环境 yum install -y maven # 使用默认的CentOS7.5SEO靠我 Yum源即可 1.3.2.3. 安装Python环境

CentOS7.5 操作系统默认自带python 2.7.5,无需进行安装,可以直接进行使用

1.3.2.4. 下载DataX安装SEO靠我

DataX安装包链接:http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

curl -O http://datax-opSEO靠我ensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz 1.3.2.5. 解压缩datax程序包 cd /root/ #SEO靠我 登陆到datax.tar.gz的下载目录 tar -zxvf datax.tar.gz -C /root/

1.3.3. 迁移使用

1.3.3.1. 生成迁移数据源样例模版

按照迁移需求使SEO靠我用DataX指令生成迁移数据源配置样例模版,然后可以根据数据源样例模版进行修改保存使用

执行以下指令生成数据源json文件,此自动生成的配置样例只是模版,还需要将输出json文件内容按照提示保存后进行修SEO靠我改(其他异构数据源模版,按照需求自主生成)python /root/datax/bin/datax.py -r mysqlreader -w mysqlwriter DataX (SEO靠我DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights ResSEO靠我erved.Please refer to the mysqlreader document:https://github.com/alibaba/DataX/blob/master/mysqlreaSEO靠我der/doc/mysqlreader.mdPlease refer to the mysqlwriter document:https://github.com/alibaba/DataX/blobSEO靠我/master/mysqlwriter/doc/mysqlwriter.mdPlease save the following configuration as a json file and useSEO靠我python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json to run the job.# 复制以下json内容并保存成xxxx.jSEO靠我son文件{"job": {"content": [{"reader": {"name": "mysqlreader","parameter": {"column": [],"connection":SEO靠我 [{"jdbcUrl": [],"table": []}],"password": "","username": "","where": ""}},"writer": {"name": "mysqlSEO靠我writer","parameter": {"column": [],"connection": [{"jdbcUrl": "","table": []}],"password": "","preSqSEO靠我l": [],"session": [],"username": "","writeMode": ""}}}],"setting": {"speed": {"channel": ""}}} SEO靠我 } 1.3.3.2. 修改数据源样例模版

此处按照我们的测试样例,源为mysql,目标也为mysql,mysql2mysql.json文件样例内容如下:

{"job": {"conSEO靠我tent": [{"reader": {"name": "mysqlreader","parameter": {"column": ["*"],"connection": [{"jdbcUrl": [SEO靠我"jdbc:mysql://39.103.21.61:3306/coredb"],"table": ["seepcore_table"]}],"password": "Abc999@1","usernSEO靠我ame": "xuxingzhuang",}},"writer": {"name": "mysqlwriter","parameter": {"column": ["*"],"connection":SEO靠我 [{"jdbcUrl": "jdbc:mysql://106.15.31.131:3306/coredb","table": ["seepcore_table"]}],"password": "AbSEO靠我c999@1","preSql": [],"session": [],"username": "xuxingzhuang","writeMode": "insert"}}}],"setting": {SEO靠我"speed": {"channel": "2"}}} }

注意⚠️:以上内容,要确保启动datax实例可以有权限访问远程数据库,源和目标库相关配置信息已经创建完成,具体参数详细配置信息SEO靠我,参考以下内容

配置样例mysqlreader参数详细说明,查看链接:DataX MysqlReader

配置样例mysqlwriter参数详细说明,查看链接:DataX MysqlWriter1.3.3SEO靠我.3. 启动DataX程序 python /root/datax/bin/datax.py /root/datax/job/mysql2mysql.json 程序执行SEO靠我输出LOG DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, AlibabaSEO靠我 Group. All Rights Reserved.2020-07-16 06:25:32.889 [main] INFO VMInfo - VMInfo# operatingSystem claSEO靠我ss => sun.management.OperatingSystemImpl 2020-07-16 06:25:32.902 [main] INFO Engine - the maSEO靠我chine info =>osInfo: Oracle Corporation 1.8 25.252-b09jvmInfo: Linux amd64 3.10.0-862.3.2.el7.x86_64SEO靠我cpu num: 4totalPhysicalMemory: -0.00GfreePhysicalMemory: -0.00GmaxFileDescriptorCount: -1currentOpenSEO靠我FileDescriptorCount: -1GC Names [PS MarkSweep, PS Scavenge]MEMORY_NAME | allocation_size | init_sizeSEO靠我PS Eden Space | 256.00MB | 256.00MBCode Cache | 240.00MB | 2.44MBCompressed Class Space | 1,024.00MBSEO靠我 | 0.00MBPS Survivor Space | 42.50MB | 42.50MBPS Old Gen | 683.00MB | 683.00MBMetaspace | -0.00MB | SEO靠我0.00MB2020-07-16 06:25:32.927 [main] INFO Engine - {"content":[{"reader":{"name":"mysqlreadeSEO靠我r","parameter":{"column":["*"],"connection":[{"jdbcUrl":["jdbc:mysql://39.103.21.61:3306/coredb"],"tSEO靠我able":["seepcore_table"]}],"password":"********","username":"xuxingzhuang"}},"writer":{"name":"mysqlSEO靠我writer","parameter":{"column":["*"],"connection":[{"jdbcUrl":"jdbc:mysql://106.15.31.131:3306/coredbSEO靠我","table":["seepcore_table"]}],"password":"********","preSql":[],"session":[],"username":"xuxingzhuaSEO靠我ng","writeMode":"insert"}}}],"setting":{"speed":{"channel":"2"}} }2020-07-16 06:25:32.956 [mSEO靠我ain] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null 2020-SEO靠我07-16 06:25:32.959 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0 SEO靠我 2020-07-16 06:25:32.959 [main] INFO JobContainer - DataX jobContainer starts job. 2020-0SEO靠我7-16 06:25:32.962 [main] INFO JobContainer - Set jobId = 0 2020-07-16 06:25:33.419 [job-0] ISEO靠我NFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://39.103.21.61:3306/coredb?yearIsDatSEO靠我eType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true. SEO靠我 2020-07-16 06:25:33.421 [job-0] WARN OriginalConfPretreatmentUtil - 您的配置文件中的列配置存在一定的风险. 因为您未配置SEO靠我读取数据库表的列,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改. 2020-07-16 06:25:33.929 [job-0] INFOSEO靠我 OriginalConfPretreatmentUtil - table:[seepcore_table] all columns:[ id,date_time,line_3,linSEO靠我e_4,line_10,line_11,line_12,line_13,line_14,line_15,line_16,line_17,line_18,line_19,line_20 SEO靠我]. 2020-07-16 06:25:33.930 [job-0] WARN OriginalConfPretreatmentUtil - 您的配置文件中的列配置信息存在风险. 因为SEO靠我您配置的写入数据库表的列为*,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改. 2020-07-16 06:25:33.933 [job-0SEO靠我] INFO OriginalConfPretreatmentUtil - Write data [ insert INTO %s (id,date_time,line_3,line_SEO靠我4,line_10,line_11,line_12,line_13,line_14,line_15,line_16,line_17,line_18,line_19,line_20) VALUES(?,SEO靠我?,?,?,?,?,?,?,?,?,?,?,?,?,?) ], which jdbcUrl like:[jdbc:mysql://106.15.31.131:3306/coredb?ySEO靠我earIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=SEO靠我true] 2020-07-16 06:25:33.935 [job-0] INFO JobContainer - jobContainer starts to do prepare SEO靠我... 2020-07-16 06:25:33.936 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do prSEO靠我epare work . 2020-07-16 06:25:33.936 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwritSEO靠我er] do prepare work . 2020-07-16 06:25:33.938 [job-0] INFO JobContainer - jobContainer startSEO靠我s to do split ... 2020-07-16 06:25:33.938 [job-0] INFO JobContainer - Job set Channel-NumberSEO靠我 to 2 channels. 2020-07-16 06:25:33.946 [job-0] INFO JobContainer - DataX Reader.Job [mysqlrSEO靠我eader] splits to [1] tasks. 2020-07-16 06:25:33.947 [job-0] INFO JobContainer - DataX WriterSEO靠我.Job [mysqlwriter] splits to [1] tasks. 2020-07-16 06:25:33.978 [job-0] INFO JobContainer - SEO靠我jobContainer starts to do schedule ... 2020-07-16 06:25:33.983 [job-0] INFO JobContainer - SSEO靠我cheduler starts [1] taskGroups. 2020-07-16 06:25:33.986 [job-0] INFO JobContainer - Running SEO靠我by standalone Mode. 2020-07-16 06:25:34.002 [taskGroup-0] INFO TaskGroupContainer - taskGrouSEO靠我pId=[0] start [1] channels for [1] tasks. 2020-07-16 06:25:34.010 [taskGroup-0] INFO ChannelSEO靠我 - Channel set byte_speed_limit to -1, No bps activated. 2020-07-16 06:25:34.011 [taskGroup-SEO靠我0] INFO Channel - Channel set record_speed_limit to -1, No tps activated. 2020-07-16 06:25:3SEO靠我4.023 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started SEO靠我 2020-07-16 06:25:34.032 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by SqlSEO靠我: [select * from seepcore_table ] jdbcUrl:[jdbc:mysql://39.103.21.61:3306/coredb?yearIsDateTSEO靠我ype=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true]. SEO靠我 2020-07-16 06:25:44.018 [job-0] INFO StandAloneJobContainerCommunicator - Total 0 records, 0 bySEO靠我tes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All TaskSEO靠我 WaitReaderTime 0.000s | Percentage 0.00% 2020-07-16 06:25:54.023 [job-0] INFO StandAloneJobSEO靠我ContainerCommunicator - Total 109056 records, 23305935 bytes | Speed 2.22MB/s, 10905 records/s | ErrSEO靠我or 0 records, 0 bytes | All Task WaitWriterTime 8.666s | All Task WaitReaderTime 0.801s | PercentageSEO靠我 0.00% 2020-07-16 06:26:04.026 [job-0] INFO StandAloneJobContainerCommunicator - Total 23193SEO靠我6 records, 49667141 bytes | Speed 2.51MB/s, 12288 records/s | Error 0 records, 0 bytes | All Task WaSEO靠我itWriterTime 17.829s | All Task WaitReaderTime 1.541s | Percentage 0.00% 2020-07-16 06:26:14SEO靠我.029 [job-0] INFO StandAloneJobContainerCommunicator - Total 354816 records, 75963461 bytes | Speed SEO靠我2.51MB/s, 12288 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 26.983s | All Task WaSEO靠我itReaderTime 2.221s | Percentage 0.00% 2020-07-16 06:26:24.032 [job-0] INFO StandAloneJobConSEO靠我tainerCommunicator - Total 467456 records, 100068421 bytes | Speed 2.30MB/s, 11264 records/s | ErrorSEO靠我 0 records, 0 bytes | All Task WaitWriterTime 36.103s | All Task WaitReaderTime 2.919s | Percentage SEO靠我0.00% 2020-07-16 06:26:34.034 [job-0] INFO StandAloneJobContainerCommunicator - Total 584192SEO靠我 records, 125049925 bytes | Speed 2.38MB/s, 11673 records/s | Error 0 records, 0 bytes | All Task WaSEO靠我itWriterTime 45.179s | All Task WaitReaderTime 3.551s | Percentage 0.00% ..... 2020-SEO靠我07-16 06:30:34.113 [job-0] INFO VMInfo -[delta cpu info] =>curDeltaCpu | averageCpu | maxDeltaCpu | SEO靠我minDeltaCpu-1.00% | -1.00% | -1.00% | -1.00%[delta memory info] =>NAME | used_size | used_percent | SEO靠我max_used_size | max_percentPS Eden Space | 309.99MB | 95.09% | 309.99MB | 95.09%Code Cache | 6.02MB SEO靠我| 79.63% | 6.02MB | 79.63%Compressed Class Space | 1.81MB | 90.66% | 1.81MB | 90.66%PS Survivor SpacSEO靠我e | 5.53MB | 73.75% | 5.53MB | 73.75%PS Old Gen | 6.35MB | 0.93% | 6.35MB | 0.93%Metaspace | 18.57MBSEO靠我 | 97.72% | 18.57MB | 97.72%[delta gc info] =>NAME | curDeltaGCCount | totalGCCount | maxDeltaGCCounSEO靠我t | minDeltaGCCount | curDeltaGCTime | totalGCTime | maxDeltaGCTime | minDeltaGCTimePS MarkSweep | 0SEO靠我 | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s | 0.000sPS Scavenge | 100 | 100 | 100 | 100 | 0.854s | 0.854SEO靠我s | 0.854s | 0.854s2020-07-16 06:30:44.115 [job-0] INFO StandAloneJobContainerCommunicator - Total 3SEO靠我350592 records, 719410097 bytes | Speed 1.88MB/s, 9171 records/s | Error 0 records, 0 bytes | All TaSEO靠我sk WaitWriterTime 274.631s | All Task WaitReaderTime 19.540s | Percentage 0.00% .....2020-07SEO靠我-16 06:34:21.281 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select *SEO靠我 from seepcore_table ] jdbcUrl:[jdbc:mysql://39.103.21.61:3306/coredb?yearIsDateType=false&zSEO靠我eroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true]. 2020-0SEO靠我7-16 06:34:21.633 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[SEO靠我527612]ms 2020-07-16 06:34:21.634 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] complSEO靠我eted its tasks. 2020-07-16 06:34:24.169 [job-0] INFO StandAloneJobContainerCommunicator - ToSEO靠我tal 5585206 records, 1197238078 bytes | Speed 3.49MB/s, 17183 records/s | Error 0 records, 0 bytes |SEO靠我 All Task WaitWriterTime 484.287s | All Task WaitReaderTime 32.298s | Percentage 100.00% 202SEO靠我0-07-16 06:34:24.169 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks. 2020SEO靠我-07-16 06:34:24.170 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do post work. SEO靠我 2020-07-16 06:34:24.171 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do post work. SEO靠我 2020-07-16 06:34:24.172 [job-0] INFO JobContainer - DataX jobId [0] completed successfully. SEO靠我 2020-07-16 06:34:24.177 [job-0] INFO HookInvoker - No hook invoked, because base dir not existSEO靠我s or is a file: /root/datax/hook 2020-07-16 06:34:24.178 [job-0] INFO JobContainer -[total cSEO靠我pu info] =>averageCpu | maxDeltaCpu | minDeltaCpu-1.00% | -1.00% | -1.00%[total gc info] =>NAME | toSEO靠我talGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTimePS MarSEO靠我kSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000sPS Scavenge | 166 | 100 | 66 | 1.335s | 0.854s | 0.481sSEO靠我2020-07-16 06:34:24.178 [job-0] INFO JobContainer - PerfTrace not enable! 2020-07-16 06:34:2SEO靠我4.178 [job-0] INFO StandAloneJobContainerCommunicator - Total 5585206 records, 1197238078 bytes | SpSEO靠我eed 2.15MB/s, 10538 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 484.287s | All TaSEO靠我sk WaitReaderTime 32.298s | Percentage 100.00% 2020-07-16 06:34:24.181 [job-0] INFO JobContaSEO靠我iner - 任务启动时刻 : 2020-07-16 06:25:32 任务结束时刻 : 2020-07-16 06:34:24 任务总计耗时 : 53SEO靠我1s 任务平均流量 : 2.15MB/s 记录写入速度 : 10538rec/s 读出记录总数 : 5585206 读写失败总数 : 0SEO靠我

以上内容程序执行输出可以看到迁移进度以及迁移资源花费,并且最后还会输出汇总信息,到此为止mysql迁移mysql的已经完成

注意⚠️:当前测试为一张表进行迁移,如果有多张表,请按照需求进行配置多个jsoSEO靠我n文件进行拷贝,datax实例可以cpu可以配置高一些,将json文件的speed.channel调整大一些,并发效果会更好一些

1.4. 迁移测试总结

1.4.1. DataX优势

DataX较适合跨数据SEO靠我库表级的数据一次性迁移。可跨异构数据库,支持多数据源

1.4.2. DataX缺点

无法做增量数据同步,每一次同步都需要清空目标端表格资料。无法支持实时同步。
“SEO靠我”的新闻页面文章、图片、音频、视频等稿件均为自媒体人、第三方机构发布或转载。如稿件涉及版权等问题,请与 我们联系删除或处理,客服邮箱:html5sh@163.com,稿件内容仅为传递更多信息之目的,不代表本网观点,亦不代表本网站赞同 其观点或证实其内容的真实性。

网站备案号:浙ICP备17034767号-2