19 Sep, 2017 - About 3 minutes
Create Avro Tables for Google BigQuery
Intro
This article describes one way to create tables in BigQuery from HDFS Parquet data.
Requirements
This article assumes the following requirements are met:
- You have a Google Cloud Platform account
- You have created a Google Cloud Storage bucket
- You have HDFS configured
- You have Parquet data that you can materialize in Hive
- You have the bq util installed
Definitions
- Hive Parquet Table: parquet_table
- HDFS Parquet Location:
/user/hive/warehouse/test.db/parquet_table
- Hive Avro Table Name: avro_table
- HDFS Avro Location:
/user/hive/warehouse/test.db/avro_table
Process
- Create an Avro table from the Parquet data
- Copy the Avro files to GCS
- Create a BigQuery table from the Avro files in the GCS bucket
Setup
Create the Avro Table
Let’s start by creating the new table based on the existing Parquet data:
SET hive.exec.compress.output=true;
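A sketch of the rest of the statement, using the table and database names from the Definitions section; avro.output.codec is my assumption for selecting Snappy (one of the block codecs BigQuery accepts) and is optional:
-- assumption: compress the Avro data blocks with Snappy
SET avro.output.codec=snappy;
-- materialize the existing Parquet data as an Avro-backed table
CREATE TABLE test.avro_table
STORED AS AVRO
AS SELECT * FROM test.parquet_table;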
You could specify in the SELECT statement only the columns you want; the column list can be obtained with the parquet-tools command, e.g.:
parquet-tools meta <parquet_file>
You would still need to fetch a Parquet file locally to inspect it.
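For example (a sketch; 000000_0 stands in for whatever part file name sits under the table’s HDFS location):
# pull one Parquet part file locally, then inspect its metadata
hdfs dfs -get /user/hive/warehouse/test.db/parquet_table/000000_0 .
parquet-tools meta 000000_0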
Why the Avro Format?
You can load data into BigQuery either by streaming or as a batch process from Google Cloud Storage. A bulk import from HDFS is a natural fit for a batch process, so why Avro?
According to the latest info on the Google BigQuery site, it is possible to:
- Load from Google Cloud Storage, including CSV, JSON (newline-delimited), and Avro files, as well as Google Cloud Datastore backups.
- Load directly from a readable data source.
- Insert individual records using streaming inserts.
Compressed Avro files are not supported, but compressed data blocks are. BigQuery supports the DEFLATE and Snappy codecs.
There is also the following Avro-to-BigQuery type mapping, which could be useful:
Avro data type | BigQuery data type |
---|---|
null | - ignored - |
boolean | BOOLEAN |
int | INTEGER |
long | INTEGER |
float | FLOAT |
double | FLOAT |
bytes | BYTES |
string | STRING |
record | RECORD |
enum | STRING |
array | - repeated fields - |
map | RECORD |
union | RECORD |
fixed | BYTES |
Check the full spec on the GCP documentation page.
The other advantage of using Avro is that BigQuery infers the schema, so you don’t have to describe the columns of your table.
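As an illustration, a hypothetical Avro schema like the one below would be inferred as id INTEGER, name STRING and score FLOAT, following the mapping above, with no manual schema definition needed:
{
  "type": "record",
  "name": "example_row",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "score", "type": "double"}
  ]
}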
Copy Avro Files from HDFS to GCS
The best approach for this is to add the GCS connector to your HDFS configuration.
Follow the instructions in the following link or download the jar for Hadoop 2.x here, then:
- Add that jar to a valid location in your cluster’s HADOOP_CLASSPATH
- Generate a service account in the GCP console and get its JSON key (follow these instructions)
- Copy that JSON file to a location in your cluster
- Add the following properties to your cluster’s core-site.xml, as sketched below
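A minimal sketch of those core-site.xml entries, assuming the service account JSON key was copied to /path/to/gcs-key.json and using a placeholder project id (adjust both for your setup):
<!-- register the GCS connector as the handler for gs:// URIs -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<!-- placeholder project id: replace with your own -->
<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project</value>
</property>
<!-- authenticate with the service account key copied to the cluster -->
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/gcs-key.json</value>
</property>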
Extended options are available in the gcs-core-default example.
- Create a new bucket on GCP and make sure you can access it via the hdfs command.
hdfs dfs -ls gs://my-bucket-name/
If that works, you can now run a distcp to sync the Avro files directly to GCS. First create the target directory:
hdfs dfs -mkdir gs://my-bucket-name/my_table
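Then copy the files over; a sketch, assuming the Avro location from the Definitions section above:
hadoop distcp /user/hive/warehouse/test.db/avro_table/* gs://my-bucket-name/my_table/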
Load GCS Avro data into a BigQuery table
Execute:
bq load --source_format=AVRO ds.table "gs://my-bucket-name/my_table/*"
And that’s it: you now have a table with data in BigQuery.
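To sanity-check the load you could, for example, print the inferred schema and peek at a few rows with the bq util:
bq show --schema --format=prettyjson ds.table
bq head -n 5 ds.table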
It is recommended to have this process managed by some kind of orchestrator; there are several solutions for this. The next article will be about one of them: Airflow.
Cheers,
RR
References
- https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_hive.html
- https://github.com/apache/parquet-mr/tree/master/parquet-tools
- https://cloud.google.com/bigquery/loading-data
- https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md
- https://cloud.google.com/bigquery/docs/loading-data-cloud-storage