Skip to main content

Create Avro Tables For Google BigQuery

·626 words·3 mins·
Hive BigQuery Avro HDFS Parquet GCS
Table of Contents

This article describes one way to create tables in BigQuery From HDFS parquet Data.

Intro
#

This article describes one way to create tables in BigQuery From HDFS parquet Data.

Requirements
#

For this article assumes the following requisites are meet:

  • You have a Google Cloud Platform account
  • You have created a Google Cloud Storage bucket
  • You have HDFS configured
  • You have Parquet data that you could materialize in Hive
  • Must have bq util install

Definitions
#

  • Hive Parquet Table: parquet_table
  • HDFS Parquet Location: /user/hive/warehouse/test.db/parquet_table
  • Hive Avro Table Name: avro_table
  • HDFS Avro Location: /user/hive/warehouse/test.db/avro_table

Process
#

  1. Create a avro table from parquet data
  2. Copy avro files to GCS
  3. Create Bigquery Table from avro in gs bucket

Setup
#

Create the Avro Table
#

Let’s start by creating the new table based on the existing parquet data

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

CREATE TABLE avro_table STORED AS AVRO
  AS (SELECT * FROM parquet_table);

One could specify in the SELECT statement the columns we would like that could be obtain using parquet-tools command ex:

parquet-tools meta <parquet_file>

You would still need to get the parquet file to obtain that.

Why Avro File and that format
#

You could update data to BigQuery by streaming or from Google Cloud Storage as a batch process. A bulk import from HDFS seems logical to use a batch process so why avro ? According to the latest info on the google BigQuery Site it’s possible to:

  • Load from Google Cloud Storage, including CSV, JSON (newline-delimited), and Avro files, as well as Google Cloud Datastore backups.
  • Load directly from a readable data source.
  • Insert individual records using streaming inserts.

Compressed Avro files are not supported, but compressed data blocks are. BigQuery supports the DEFLATE and Snappy codecs.

Also there is the following Avro mapping that could be useful

Avro data typeBigQuery data type
null- ignored -
booleanBOOLEAN
intINTEGER
longINTEGER
floatFLOAT
doubleFLOAT
bytesBYTES
stringSTRING
recordRECORD
enumSTRING
array- repeated fields -
mapRECORD
unionRECORD
fixedBYTES

Check the full spec on GCP Page

The other advantage of using avro is that BigQuery infers the schema so you don’t have to describe the columns of you table.

Copy Avro file from HDFS to GCS
#

The best approach for this is to add the GCS connector to your HDFS config

Follow the instructions in the following link or download the jar for Hadoop 2.x here

  1. Add that jar on a valid location for you cluster HADOOP_CLASSPATH
  2. Generate a service account in the GCP console and get JSON key (follow this instructions)
  3. Copy that JSON file to a location in your cluster
  4. Add the following properties to your cluster core-site.xml
  <property>
    <name>fs.gs.project.id</name>
    <value>your-project-name</value>
    <description>
      Required. Google Cloud Project ID with access to configured GCS buckets.
    </description>
  </property>
  <property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/your/JSON-keyfile</value>
  <description>
    The JSON key file of the service account used for GCS
    access when google.cloud.auth.service.account.enable is true.
  </description>
  </property>
    <property>
    <name>fs.gs.working.dir</name>
    <value>/</value>
    <description>
      The directory relative gs: uris resolve in inside of the default bucket.
    </description>
  </property>

Extended options are available in gcs-core-default example

  1. Create a new bucket on GCP and make sure you can access to with via hdfs command.
hdfs dfs -ls gs://my-bucket-name/

If that works, you can now execute a distcp to sync the avro files directly to GCS.

hdfs mkdir gs://my-bucket-name/my_table
hdfs distcp /user/hive/warehouse/test.db/avro_table/* gs://my-bucket-name/my_table/

Load GCS avro data to a BigQuery table
#

Execute

bq load ds.table gs://my-bucket-name/my_table/* --autodetect

And that’s it you now have a table with data in BigQuery.

It is recommended to have this process managed by some type of orchestrator. There are several solutions for this. The next article i’ll be writing about one of those Airflow

Cheers, RR

References
#