How to migrate data to Hadoop efficiently - V2EX



I want to migrate several dozen GB of data from Elasticsearch to Hadoop. My approach is Spark + elasticsearch-hadoop.

Demo code:

$ ./bin/pyspark --driver-class-path=/path/to/elasticsearch-hadoop.jar
>>> conf = {"es.resource": "index/type"}  # assumes Elasticsearch is running on localhost with defaults
>>> rdd = sc.newAPIHadoopRDD(
...     "org.elasticsearch.hadoop.mr.EsInputFormat",      # input format class
...     "org.apache.hadoop.io.NullWritable",              # key class
...     "org.elasticsearch.hadoop.mr.LinkedMapWritable",  # value class
...     conf=conf)
>>> rdd.first()  # the result is a MapWritable that is converted to a Python dict
(u'Elasticsearch ID',
 {u'field1': True,
  u'field2': u'Some Text',
  u'field3': 12345})

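This migrates data from Elasticsearch correctly, but there is one problem: the migration is too slow.

I came up with the following solution:

First, I create n migration tasks partitioned by date and submit them all to the Spark cluster at once. Since the cluster only has enough cores to run k tasks at a time, the remaining (n-k) tasks queue up and wait for resources.

Is there a better solution?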
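A minimal sketch of how the date-based split described above might be set up. It assumes the documents carry a date field (called `timestamp` here, a hypothetical name) and uses the `es.query` setting of elasticsearch-hadoop to restrict each task to one slice. The helper names `split_date_range` and `build_confs` are made up for illustration.

```python
import json
from datetime import date, timedelta


def split_date_range(start, end, n):
    """Split the half-open range [start, end) into at most n contiguous slices."""
    total_days = (end - start).days
    if total_days <= 0 or n <= 0:
        return []
    n = min(n, total_days)  # never create slices shorter than one day
    base, extra = divmod(total_days, n)
    slices = []
    cursor = start
    for i in range(n):
        # The first `extra` slices get one additional day each.
        length = base + (1 if i < extra else 0)
        nxt = cursor + timedelta(days=length)
        slices.append((cursor, nxt))
        cursor = nxt
    return slices


def build_confs(resource, field, slices):
    """Build one elasticsearch-hadoop conf dict per date slice."""
    confs = []
    for lo, hi in slices:
        # `field` is an assumed date field name; adjust it to the real mapping.
        query = {"query": {"range": {field: {"gte": lo.isoformat(),
                                             "lt": hi.isoformat()}}}}
        confs.append({"es.resource": resource, "es.query": json.dumps(query)})
    return confs


slices = split_date_range(date(2017, 1, 1), date(2017, 1, 11), 4)
confs = build_confs("index/type", "timestamp", slices)
```

Each dict in `confs` could be passed as `conf=` to `sc.newAPIHadoopRDD` in the same way as the demo above, either from n separate jobs or from a single application that combines the resulting RDDs with `sc.union`. Whether this is actually faster than one unsplit job depends on the cluster and is not something this sketch measures.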

8 replies • 2018-09-04 10:56:49 +08:00

1

gouchaoer

Nov 2, 2017


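With Hadoop, if you insert the data through the Thrift interface and then run MapReduce jobs, it will be very slow. If you first export the data from Elasticsearch to txt files and then load them directly with Spark/Hive, it will be very fast. I don't know why either.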

2

hwsdien

Nov 2, 2017


Wouldn't it be better to just dump it out and cp it straight onto Hadoop?

3

ufo22940268

OP

Nov 2, 2017

@gouchaoer You saved my day!

4

focusheart

Nov 2, 2017


You can dump straight to files; copying them up with the hdfs dfs -cp command is also fast. https://github.com/taskrabbit/elasticsearch-dump
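A rough sketch of the dump-then-copy route, assuming the `elasticdump` CLI from the linked repository and the `hdfs` client are both on the PATH. The URL, index name, and paths are placeholders, and the sketch uses `hdfs dfs -put` for the upload step because the source is a local file. It defaults to a dry run that only prints the commands.

```python
import subprocess


def build_dump_command(es_url, index, local_file):
    """Argument list for elasticdump that writes an index's documents to a local file."""
    return [
        "elasticdump",
        "--input={}/{}".format(es_url.rstrip("/"), index),
        "--output={}".format(local_file),
        "--type=data",
    ]


def build_upload_command(local_file, hdfs_dir):
    """Argument list for copying a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_file, hdfs_dir]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Placeholder values; replace them with the real cluster URL, index, and paths.
commands = [
    build_dump_command("http://localhost:9200", "index", "/tmp/index.json"),
    build_upload_command("/tmp/index.json", "/data/es_dump/"),
]
run_all(commands)  # dry run: only prints the two commands
```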

5

ufo22940268

OP

Nov 2, 2017

Everyone here is so talented and so nice to talk to. I really love it here.

6

mingweili0x

Nov 3, 2017

You can use Hadoop's built-in distcp. Hadoop launches a dedicated MapReduce job to copy your data, provided the data sits somewhere every machine can access (for example, on NFS).

7

yanyanlong

Nov 3, 2017

@gouchaoer It may be the difference between large files and small files; Hadoop is better suited to processing large files.
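If small files are indeed the cause, one option is to merge the exported files into fewer, larger ones before uploading. The following is only a sketch under that assumption: `merge_small_files` is a hypothetical helper, the input is assumed to be line-delimited text such as JSON lines, and `target_bytes` would need tuning (for instance, to roughly the HDFS block size).

```python
import os


def merge_small_files(src_paths, dst_dir, target_bytes):
    """Concatenate src_paths in order into numbered part files in dst_dir.

    A new part file is started once the current one has reached target_bytes.
    Returns the list of part-file paths written.
    """
    os.makedirs(dst_dir, exist_ok=True)
    parts = []
    out = None
    written = 0
    try:
        for path in src_paths:
            if out is None or written >= target_bytes:
                if out is not None:
                    out.close()
                part_path = os.path.join(
                    dst_dir, "part-{:05d}.json".format(len(parts)))
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            # Source files are assumed small enough to read in one go.
            with open(path, "rb") as src:
                data = src.read()
            # Keep records line-delimited even if a file lacks a trailing newline.
            if data and not data.endswith(b"\n"):
                data += b"\n"
            out.write(data)
            written += len(data)
    finally:
        if out is not None:
            out.close()
    return parts
```

Because a source file is never split across two parts, each part can overshoot `target_bytes` by up to the size of one source file.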

8

pythonee

Sep 4, 2018

Are you taking incremental data into account?
