Can Hive tell you how large the data is?

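Here's the situation: I took over someone else's project. The Hive table holds roughly 1.7 billion rows.

I know Hive keeps its data on HDFS, under the /usr/hive/warehouse directory.

But the existing data was partitioned, and HDFS stores data redundantly to begin with, so it ends up looking like this: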

/usr/hive/warehouse/dbname/tablename/hour=01/part00000001copy
/usr/hive/warehouse/dbname/tablename/hour=01/part00000001
/usr/hive/warehouse/dbname/tablename/hour=02/part00000001copy

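Something roughly like the above.

On top of that, directory entries on HDFS don't show a size.

I need to do data assessment and efficiency analysis for the project, so how can I find out how much space these 1.7 billion rows actually take up?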

cxzl25

2016-04-27 00:19:25 +08:00

You can use
hdfs dfs -help du

-du [-s] [-h] <path> ...: Show the amount of space, in bytes, used by the files that
match the specified file pattern. The following flags are optional:
-s Rather than showing the size of each individual file that
matches the pattern, shows the total (summary) size.
-h Formats the sizes of files in a human-readable fashion
rather than a number of bytes.

Note that, even without the -s option, this only shows size summaries
one level deep into a directory.
The output is in the form
size name(full path)
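
To turn that into one number, a rough sketch: run something like `hdfs dfs -du /usr/hive/warehouse/dbname/tablename` (without -h, so sizes stay in plain bytes) and sum the first column. As far as I know that first column is the file size before replication, and newer Hadoop versions print an extra column with the space consumed including replicas, so the parser below takes the first field as the size and the last field as the path. The sample sizes are made up for illustration, and paths containing spaces are not handled.

```python
def parse_du(output):
    """Parse `hdfs dfs -du <path>` output (run without -h) into (size_bytes, path) pairs."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with a byte count.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # First field is the size; the last field is the path. Newer Hadoop
        # versions may print a replicated-size column in between.
        entries.append((int(fields[0]), fields[-1]))
    return entries


def total_bytes(entries):
    """Sum the sizes of all parsed entries."""
    return sum(size for size, _ in entries)


# Hypothetical output for two partitions (sizes invented for the example).
SAMPLE = """\
1073741824  /usr/hive/warehouse/dbname/tablename/hour=01
2147483648  /usr/hive/warehouse/dbname/tablename/hour=02
"""

entries = parse_du(SAMPLE)
print(total_bytes(entries))  # 3221225472
```

If all you want is the grand total, `hdfs dfs -du -s -h <path>` on the table directory gives it directly without any parsing.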

Also, when a Hive job finishes, the log usually shows
stats: [numFiles=1, numRows=62xx9xx, totalSize=83xx236xx, rawDataSize=81x49xxx]
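
If you collect a few of these log lines, a small script can pull the numbers out and give you an average bytes-per-row figure. This is only a sketch: the values in the example line are invented (the ones above are masked), and scaling one partition's average up to 1.7 billion rows assumes the rows are of similar size everywhere.

```python
import re

# Matches the bracketed key=value list in a Hive "stats: [...]" log line.
STATS_RE = re.compile(r"stats:\s*\[([^\]]*)\]")


def parse_stats(line):
    """Return a dict of the numeric stats in the line, or None if there is no stats block."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    stats = {}
    for pair in match.group(1).split(","):
        key, _, value = pair.strip().partition("=")
        # Keep only plain integers; masked or malformed values are skipped.
        if value.isdigit():
            stats[key] = int(value)
    return stats


# Hypothetical log line with made-up numbers.
line = "stats: [numFiles=1, numRows=1000, totalSize=50000, rawDataSize=48000]"
stats = parse_stats(line)

bytes_per_row = stats["totalSize"] / stats["numRows"]
estimated_total = bytes_per_row * 1_700_000_000  # rough size for 1.7 billion rows
print(bytes_per_row, estimated_total)
```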

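There is also the following syntax, which can compute size statistics for partitions that have already been produced by a job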

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];

Reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev
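
With many partitions it may be easier to generate the ANALYZE statements than to type them. A minimal sketch, following the syntax above; the helper name `analyze_statement` is made up, and partition values are wrapped in single quotes without any escaping, so it only suits simple values like hour=01.

```python
def analyze_statement(table, partition=None, noscan=False):
    """Build an ANALYZE TABLE ... COMPUTE STATISTICS statement as a string.

    partition maps column name -> value; a value of None leaves the column
    unspecified, matching the optional [=val] in the syntax.
    """
    sql = f"ANALYZE TABLE {table}"
    if partition:
        spec = ", ".join(
            col if val is None else f"{col}='{val}'"
            for col, val in partition.items()
        )
        sql += f" PARTITION({spec})"
    sql += " COMPUTE STATISTICS"
    if noscan:
        sql += " NOSCAN"
    return sql + ";"


# One statement per hourly partition of the table from the question.
for hour in ("01", "02"):
    print(analyze_statement("dbname.tablename", {"hour": hour}, noscan=True))
```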

firstway

2016-04-27 06:45:05 +08:00

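The reply above is right.
With du you can see the total size of the data files, but it won't tell you the number of records. The files are usually compressed (leaving them uncompressed wastes too much space).
What exactly do you mean by size, and what's the goal? Are you trying to predict processing speed? That depends not only on the data volume but also on how finely the data can be split, and on the processing capacity of the cluster.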

firstway

2016-04-28 00:50:53 +08:00

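Hive ultimately runs as MapReduce jobs, so you can copy a few hours, a few days, and a few weeks of data and run your jobs on each. If your data is fairly evenly distributed, processing time should grow close to linearly with data volume.
If it is close to linear, estimating the overall processing time is straightforward.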
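
As a sketch of that estimate: time a few sample runs, fit a straight line through (data size, runtime), and extrapolate to the full table. The measurements below are invented and happen to be exactly linear; real runs will be noisier, and the fit only means something if the growth really is close to linear.

```python
def fit_line(sizes, seconds):
    """Ordinary least-squares fit of seconds = slope * size + intercept."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, seconds))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, size):
    """Extrapolate the runtime for a given data size."""
    return slope * size + intercept


# Hypothetical sample runs: data size in millions of rows, runtime in seconds.
sample_sizes = [10, 20, 40]
sample_seconds = [130, 230, 430]

slope, intercept = fit_line(sample_sizes, sample_seconds)
# 1.7 billion rows = 1700 million rows.
estimate = predict(slope, intercept, 1700)
print(f"about {estimate:.0f} seconds for the full table")
```

The intercept roughly corresponds to fixed job startup overhead and the slope to per-row cost, which is why it helps to have at least three sample sizes rather than two.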

https://www.v2ex.com/t/274640