PolyBase Configuration With Cloudera CDH 5
In this article I'll describe the configuration steps required to set up PolyBase in SQL Server 2016 for Cloudera CDH5.
Intro
Before starting the configuration steps, let's first understand why this is being done.
Sqoop
Sqoop is one of the most widely used tools for transferring data from the relational world to big data infrastructures. It relies on JDBC connectors to transfer data between SQL Server and HDFS.
Performance
Transferring data to SQL Server via Sqoop takes quite a long time, so the objective of this PoC is to evaluate PolyBase as an alternative for dumping data in and out of the cluster, and to understand whether it improves on the existing processes.
Setup Process
Obtain the cluster configuration files
In order to configure PolyBase for your Cloudera cluster, one should first download the client configurations for HDFS and YARN from Cloudera Manager (CM).
After downloading the configs you need to update the following files:
- yarn-site.xml
- mapred-site.xml
- hdfs-site.xml
Copy these files to the SQL Server instance where PolyBase will be installed.
The usual path is:
C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf
Note: when PolyBase authenticates to a Kerberos-secured cluster, the hadoop.rpc.protection setting must be set to authentication. This will leave the data communication between Hadoop nodes unencrypted.
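For reference, a minimal sketch of how that property would look in the copied Hadoop configuration files (it commonly lives in core-site.xml; check where your cluster defines it):

<property>
  <name>hadoop.rpc.protection</name>
  <value>authentication</value>
</property>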
Activate the required Feature
On the SQL Server instance where you are going to activate PolyBase, make sure you have the required prerequisites:
- 64-bit SQL Server Evaluation edition
- Microsoft .NET Framework 4.5.
- Oracle Java SE RunTime Environment (JRE) version 7.51 or higher (64-bit)
- Minimum memory: 4GB
- Minimum hard disk space: 2GB
- TCP/IP must be enabled for PolyBase to function correctly.
Follow the Microsoft guide to activate the feature: PolyBase Install Guide.
One can test whether PolyBase is correctly installed by running the following command (it should return 1):
SELECT SERVERPROPERTY ('IsPolybaseInstalled') AS IsPolybaseInstalled;
Configure External data source
Execute the following T-SQL to configure Hadoop connectivity for CDH5:
-- Values map to various external data sources.
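The snippet above is truncated in this post; a minimal sketch of the standard sp_configure call, assuming option 6, which Microsoft's option mapping lists for Cloudera CDH 5.x on Linux:

-- 6 = Cloudera CDH 5.x on Linux (see the Option Mapping below)
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 6;
GO
RECONFIGURE;
GO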
You can change the option in case you use a different Hadoop cluster; check the Option Mapping.
Note: After running RECONFIGURE, you must stop and restart the SQL Server service.
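Once the service is back up, one can confirm that the setting took effect by checking sys.configurations:

-- value_in_use should show the option you configured (6 for CDH 5.x).
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'hadoop connectivity';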
Create the T-SQL objects
Follow the example configuration described in Getting Started with Polybase
-- 1: Create a database scoped credential.
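The full script is cut off here; below is a sketch of the usual sequence, with placeholder names (HadoopCredential, MyHadoopCluster, TextFileFormat, dbo.SampleExternal), hosts, ports, and paths that you should adapt to your own cluster:

-- 1: Create a database scoped credential (requires a database master key).
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>';
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
WITH IDENTITY = '<hadoop_user>', SECRET = '<password>';

-- 2: Create an external data source pointing at the CDH Namenode
--    (8020 and 8032 are the CDH defaults for the Namenode and Resource Manager).
CREATE EXTERNAL DATA SOURCE MyHadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://<namenode_host>:8020',
    RESOURCE_MANAGER_LOCATION = '<resourcemanager_host>:8032',
    CREDENTIAL = HadoopCredential
);

-- 3: Create a file format describing the files on HDFS.
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- 4: Create an external table over an HDFS directory.
CREATE EXTERNAL TABLE dbo.SampleExternal (
    Id INT,
    Payload NVARCHAR(255)
)
WITH (
    LOCATION = '/demo/sample',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = TextFileFormat
);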
Example Queries
- Import external data
-- PolyBase Scenario 2: Import external data into SQL Server.
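The query itself is truncated in the post; a minimal sketch, reusing the hypothetical dbo.SampleExternal table from the previous step:

-- Load the external HDFS data into a regular SQL Server table.
SELECT Id, Payload
INTO dbo.SampleImported
FROM dbo.SampleExternal
WHERE Id IS NOT NULL;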
- Export external data
-- PolyBase Scenario 3: Export data from SQL Server to Hadoop.
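Also truncated; a sketch under the same assumptions. Note that export additionally requires the instance-level 'allow polybase export' option:

-- Enable export from SQL Server to HDFS.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO

-- Create an external table pointing at the target HDFS directory...
CREATE EXTERNAL TABLE dbo.SampleExport (
    Id INT,
    Payload NVARCHAR(255)
)
WITH (
    LOCATION = '/demo/export',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = TextFileFormat
);

-- ...and insert into it; PolyBase writes the rows out as files on HDFS.
INSERT INTO dbo.SampleExport
SELECT Id, Payload FROM dbo.SampleImported;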
Tests
Initial tests are actually quite good, even with the identified issues. PolyBase seems quite limited, but for the objective at hand it might prove to be a very viable solution.
Some more tests would be required.
Issues
- It seems one cannot truncate external tables, so an extra step would be required if you plan to use this as part of an ETL process that should support re-runs (see the drop-and-recreate sketch after this list).
- It seems that hadoop_user_name is being ignored and PolyBase still uses the pwc_user account in the cluster.
- Take care with the compression levels you choose, as they consume quite a lot of CPU on your SQL Server.
- The metadata of the tables is always stored in SQL Server. When you choose Parquet files as the source format, the columns are stored in the Parquet metadata as col-0,col-1,col-3,..., so mapping those files to a Hive table would require a view with the respective column name mapping.
- Not sure if this can be changed, but the files dumped to HDFS were split into 8 parts in the initial tests, which is only a problem for small tables.
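For the truncate limitation above, a sketch of the drop-and-recreate workaround, reusing the hypothetical names from the export example. Note that DROP EXTERNAL TABLE only removes the SQL Server metadata; the files already on HDFS have to be cleaned up separately (e.g. with hdfs dfs -rm -r):

-- TRUNCATE TABLE is not supported on external tables;
-- drop the metadata and recreate it instead.
DROP EXTERNAL TABLE dbo.SampleExport;

CREATE EXTERNAL TABLE dbo.SampleExport (
    Id INT,
    Payload NVARCHAR(255)
)
WITH (
    LOCATION = '/demo/export',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = TextFileFormat
);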
Conclusion
- Work in progress