Accessing data in Hadoop using dplyr and SQL

Posted: 2018-04-09 13:07:08

If your primary objective is to query your data in Hadoop so that you can browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL explicitly to interact with Hadoop, or you can write it implicitly with dplyr. The dplyr package has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase, and Spark.
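To make the implicit translation concrete, here is a minimal sketch that prints the SQL dplyr would generate, without connecting to a cluster. It assumes the dbplyr package is installed and uses its simulated Hive backend. The table name `flights` and the columns `carrier` and `dep_delay` are invented for illustration and do not come from the original article.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table definition: zero-row columns are enough for translation.
# simulate_hive() stands in for a real Hive connection (assumption).
flights <- lazy_frame(
  carrier = character(),
  dep_delay = numeric(),
  con = simulate_hive(),
  .name = "flights"
)

# Ordinary dplyr verbs; nothing is executed against a database here
query <- flights %>%
  filter(dep_delay > 60) %>%
  group_by(carrier) %>%
  summarise(n = n())

# Print the SQL that dplyr would send to the backend
show_query(query)
```

The exact SQL text may differ between dbplyr versions and backends, so treat the printed query as a guide rather than a fixed output.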

There are two methods for accessing data in Hadoop using dplyr and SQL.

ODBC

You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a driver specific to your data source (e.g., Hive, Impala, HBase) installed on your desktop or your server. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package also relies on the dbplyr package to translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:

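library(DBI)    # dbConnect(), dbGetQuery(), dbDisconnect()
library(dplyr)  # tbl()
library(odbc)

# Replace each "<...>" placeholder with the value for your cluster
con <- dbConnect(odbc::odbc(),
                 driver = "<driver>",
                 host = "<host>",
                 dbname = "<dbname>",
                 user = "<user>",
                 password = "<password>",
                 port = 10000)

tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL

dbDisconnect(con)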
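Because tables in Hadoop are often too large to pull into R whole, a common pattern is to limit or aggregate in the database and collect only the result. The sketch below assumes an open DBI connection such as `con` above. The helper name `fetch_head` is hypothetical and is not part of any package.

```r
library(DBI)
library(dplyr)

# Hypothetical helper: fetch only the first `n` rows of a remote table.
# head() on a remote table becomes a row limit in the generated SQL, and
# collect() brings the (small) result into R as a local data frame.
fetch_head <- function(con, table_name, n = 10) {
  tbl(con, table_name) %>%
    head(n) %>%
    collect()
}
```

With a live connection you would call, for example, `fetch_head(con, "mytable", 5)` before `dbDisconnect(con)`.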

Spark

If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connection with Spark and run queries:

library(sparklyr)
library(dplyr)  # tbl()
library(DBI)    # dbGetQuery()

con <- spark_connect(master = "yarn-client")

tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL

spark_disconnect(con)

Source: https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL

Original post: https://www.cnblogs.com/payton/p/8758893.html
