Hadoop Basics (52): Sqoop Interview Questions

Date: 2020-10-06 21:17:18


1. Sqoop parameters

/opt/module/sqoop/bin/sqoop import \
  --connect \
  --username \
  --password \
  --target-dir \
  --delete-target-dir \
  --num-mappers \
  --fields-terminated-by \
  --query "$2"' and $CONDITIONS;'
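The $2 in the --query line suggests that the command lives inside a shell function whose second positional parameter is the SELECT statement. A minimal sketch of such a wrapper follows, assuming that reading. The JDBC URL, user, password variable, target directory and table name are hypothetical placeholders, and DRY_RUN is a helper added here so that the command is printed instead of executed by default.

```shell
#!/bin/sh
# Sketch of a wrapper around the import command above.
# All values below are placeholders (assumptions), not from the original post.
SQOOP_BIN=/opt/module/sqoop/bin/sqoop        # path used in the command above
JDBC_URL="jdbc:mysql://db-host:3306/demo_db" # hypothetical host and database
DB_USER="etl_user"                           # hypothetical account
DB_PASSWORD="${DB_PASSWORD:-changeme}"       # hypothetical; read from the environment
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print the command, 0 = run it

# import_data TARGET_DIR QUERY
#   $1: HDFS directory to write to
#   $2: SELECT statement that already ends in a WHERE clause, e.g. "... where 1=1"
import_data() {
  # Rebuild the positional parameters as the full Sqoop argument list.
  # "$1" and "$2" are expanded before `set --` replaces them.
  set -- "$SQOOP_BIN" import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" \
    --password "$DB_PASSWORD" \
    --target-dir "$1" \
    --delete-target-dir \
    --num-mappers 1 \
    --fields-terminated-by '\t' \
    --query "$2"' and $CONDITIONS;'

  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"   # show what would be executed
  else
    "$@"                 # run Sqoop
  fi
}

# Hypothetical table and directory, printed as a dry run.
import_data /origin_data/demo_db/order_info \
  "select id, total_amount from order_info where 1=1"
```

The sketch uses --num-mappers 1 because a free-form --query import with more than one mapper also needs --split-by. The $CONDITIONS token stays inside single quotes so that the shell passes it through literally and Sqoop can substitute each mapper's range predicate.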

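2. Problems encountered in development

1) Null storage consistency between Sqoop import and export

Hive stores Null as "\N" under the hood, whereas MySQL stores Null as a real Null. To keep the data consistent on both sides, use --input-null-string and --input-null-non-string when exporting, and --null-string and --null-non-string when importing.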
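As a sketch of which pair of flags belongs to which direction, the helper below prints the null-handling flags for import or export. null_flags is a hypothetical helper name used only for illustration; the doubled backslash in '\\N' follows the Hive examples in the Sqoop user guide.

```shell
#!/bin/sh
# null_flags DIRECTION: print the NULL-handling flags for "import" or "export".
# Hypothetical helper for illustration; only the flag names come from Sqoop.
null_flags() {
  case "$1" in
    import)
      # RDBMS -> HDFS/Hive: write SQL NULL as \N so Hive reads it as NULL.
      printf '%s\n' '--null-string \\N --null-non-string \\N'
      ;;
    export)
      # HDFS/Hive -> RDBMS: read \N in the files back as SQL NULL.
      printf '%s\n' '--input-null-string \\N --input-null-non-string \\N'
      ;;
    *)
      echo "usage: null_flags import|export" >&2
      return 1
      ;;
  esac
}

# printf is used instead of echo so the backslashes are printed unchanged.
printf '%s\n' "sqoop import ... $(null_flags import)"
printf '%s\n' "sqoop export ... $(null_flags export)"
```

The string variant covers string-typed columns and the non-string variant covers every other type, so both are normally set together.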

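2) Sqoop export data consistency

(1) Scenario 1: Suppose Sqoop exports to MySQL with 4 Map tasks and 2 of them fail. MySQL now holds the data written by the other two Map tasks, and the boss happens to look at the report at exactly that moment. After noticing the failure, the engineer debugs the problem and eventually loads all of the data into MySQL correctly. When the boss looks at the report again, the figures differ from the ones seen earlier, which is not acceptable in a production environment.

Official documentation: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html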

Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.

The --staging-table approach

sqoop export \
  --connect jdbc:mysql://192.168.137.10:3306/user_behavior \
  --username root \
  --password 123456 \
  --table app_cource_study_report \
  --columns watch_video_cnt,complete_video_cnt,dt \
  --fields-terminated-by "\t" \
  --export-dir "/user/hive/warehouse/tmp.db/app_cource_study_analysis_${day}" \
  --staging-table app_cource_study_report_tmp \
  --clear-staging-table \
  --input-null-string '\N'

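(2) Scenario 2: set the number of map tasks to 1 (not recommended; the interviewer is looking for more than this answer).

With multiple Map tasks, the --staging-table approach still solves the data consistency problem.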

3. What kind of job does Sqoop run under the hood?

A MapReduce job with only a Map phase and no Reduce phase.

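4. How long does one Sqoop export run take?

A Sqoop job typically takes 40-50 minutes. It depends on the data volume (for example, on the 11.11 and 6.18 shopping festivals).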

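5. How much data does Sqoop transfer per day?

1 million daily active users; business data.
An ordinary daily consumer goods e-commerce site: 100,000.
Records per person per day: 10.

One record is about 1 KB.
100,000 × 10 records × 1 KB ≈ 1 GB of data.

If the business data is 1 GB per day, then 1 GB is transferred.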

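How much order data is imported per day? How much order detail data? How many logins and registrations? How much logistics data?

How many business tables come over from the Java backend in total? 30.

1 GB / 30 tables ≈ 34 MB per table (2-3x)

User behavior: 100 GB across 10 tables => 10 GB per table (2-3)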
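The back-of-the-envelope figures in this section can be reproduced with plain shell arithmetic. This is only a sketch: the variable names are made up for illustration, and 1 GB is taken as 1024 MB for the per-table average, which is what yields the 34 MB figure.

```shell
#!/bin/sh
# Daily business data: 100,000 people x 10 records x 1 KB per record.
users=100000
records_per_user=10
kb_per_record=1
daily_kb=$(( users * records_per_user * kb_per_record ))
echo "business data per day: ${daily_kb} KB (about 1 GB)"

# Average size per business table: 1 GB (taken as 1024 MB) over 30 tables.
business_mb=1024
business_tables=30
business_avg_mb=$(( business_mb / business_tables ))
echo "average business table: ${business_avg_mb} MB"

# Average size per user-behavior table: 100 GB over 10 tables.
behavior_gb=100
behavior_tables=10
behavior_avg_gb=$(( behavior_gb / behavior_tables ))
echo "average behavior table: ${behavior_avg_gb} GB"
```

Shell arithmetic is integer-only, so 1024 / 30 is truncated to 34.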

How much data does the real-time project process per day? Which business tables does it read ( )? Which user behavior data ( )?

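6. How long does one Sqoop export run take?

The job starts at 0:30 and runs for 40-50 minutes.

On 6.18 and 11.11 it takes more than 1 hour, depending on server performance.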

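7. Data skew when importing data with Sqoop

https://blog.csdn.net/lizhiguo18/article/details/103969906

Parallelism in a Sqoop extraction is controlled mainly by two parameters. --num-mappers starts N map tasks to import data in parallel (the default is 4). --split-by names the column used to split the table into units of work.

Use ROWNUM() to generate a strictly uniformly distributed column, then specify it as the split column.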
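To see why the split column matters: according to the Sqoop user guide, Sqoop reads the minimum and maximum of the --split-by column and gives each mapper an evenly sized slice of that value range, not an even share of the rows. The sketch below mimics that range arithmetic for an integer column. split_ranges is a hypothetical helper, not Sqoop's actual code, and the column name id is an assumption.

```shell
#!/bin/sh
# split_ranges MIN MAX N
# Print N half-open ranges [lo, hi) that evenly cover the values MIN..MAX,
# in the style of the per-mapper WHERE clauses described in the Sqoop guide.
# Illustrative sketch only; integer split columns, column name "id" assumed.
split_ranges() {
  min=$1
  max=$2
  n=$3
  span=$(( max - min ))
  i=0
  while [ "$i" -lt "$n" ]; do
    lo=$(( min + span * i / n ))
    if [ "$i" -eq $(( n - 1 )) ]; then
      hi=$(( max + 1 ))                    # last slice must include MAX itself
    else
      hi=$(( min + span * (i + 1) / n ))
    fi
    echo "id >= $lo AND id < $hi"
    i=$(( i + 1 ))
  done
}

# Four mappers over ids 0..1000.
split_ranges 0 1000 4
```

If most rows have ids clustered inside one slice, that mapper does most of the work and the others finish almost immediately. A gap-free row number spreads the rows uniformly over the value range, so each slice holds roughly the same number of rows.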


Original article: https://www.cnblogs.com/qiu-hua/p/13773915.html
