5 reasons why BigQuery users should use dbt

Yu Ishikawa
5 min read · Aug 4, 2020


How do you implement and test data pipelines with BigQuery to create intermediate tables and manage metadata and data discovery? I used to use Apache Airflow’s operators with BigQuery. However, that approach means implementing code in Python and managing the dependencies between BigQuery tables manually. Airflow does let us test BigQuery tables with the CheckOperator, but we have to write BigQuery queries even for simple checks such as a not-null or unique column. That can be useful, but it is not productive for me. On top of that, Apache Airflow doesn’t support metadata management or data discovery.

dbt (data build tool), which is available both as a cloud service and as an open-source project, enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views. The open-source dbt is a CLI that runs jobs to create tables and test the created tables. At the time of writing, it supports not only BigQuery but also Postgres, Redshift and Snowflake.

I would like to highlight the useful features of dbt rather than explain them in detail or walk through the tool step by step. If you would like to get started with dbt, “Get started with BigQuery and dbt, the easy way” should guide you.

Reason 1: Manage dependencies between tables and views using only SQL

When we build multi-step data pipelines with BigQuery, it is bothersome to manage the dependencies between jobs. For instance, if we implement a DAG with Apache Airflow, we have to wire up the steps manually. Moreover, if an intermediate table is used by multiple DAGs, it is difficult to manage the triggers.

dbt, by contrast, offers the ref() and source() macros, which enable data analysts to refer to dependent tables. Even when upstream tables such as customer_orders depend on other tables, we don’t have to manage the run order ourselves: dbt automatically analyzes the dependencies and runs the jobs in the right order. As a result, data analysts can build even complicated multi-step pipelines using only BigQuery queries.

with customers as (
    select * from {{ ref('stg_customers') }}
),

customer_orders as (
    select * from {{ ref('customer_orders') }}
),

customer_payments as (
    select * from {{ ref('customer_payments') }}
),

final as (
    select
        customers.customer_id,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value
    from customers
    left join customer_orders using (customer_id)
    left join customer_payments using (customer_id)
)

select * from final

As the image below shows, dbt executes the models in the right dependency order when running dbt run with the jaffle_shop example locally.
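The source() macro works in the same way for existing tables that dbt does not build itself. As a minimal sketch (the source and table names here are hypothetical, not part of the jaffle_shop example), we declare the raw table once in a YAML file and then reference it from a model, so the dependency also appears in the graph:

sources:
  - name: raw_shop
    database: "project-a"
    tables:
      - name: customers

-- models/staging/stg_customers.sql
select
    id as customer_id,
    first_name,
    last_name
from {{ source('raw_shop', 'customers') }}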

Reason 2: Simple data validation method

There are several ways to validate BigQuery tables and views, for example checking that a column is not null or that its values are unique. Apache Airflow provides a check operator for BigQuery: as long as we express tests as BigQuery queries, the operator lets data analysts check data quality. However, we still have to implement the DAG in Python.

Meanwhile, dbt enables data analysts to validate not only intermediate tables and views but also existing ones using only YAML files. For instance, if the column customer_id should be unique and not null, all we have to do is add unique and not_null tests to the table’s schema.yml.

models:
  - name: dim_customers
    description: This table has basic information about a customer, as well as some derived facts based on a customer's orders
    columns:
      - name: customer_id
        description: This is a unique identifier for a customer
        tests:
          - unique
          - not_null

As the image below shows, dbt runs the defined tests against the built tables when running dbt test with the jaffle_shop example locally.
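unique and not_null are not the only built-in tests: dbt also ships with accepted_values and relationships tests, so common integrity checks need no SQL at all. A minimal sketch, assuming a hypothetical orders model:

models:
  - name: orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id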

Reason 3: Data lineage

As I mentioned in “Reason 1”, dbt automatically analyzes the dependencies between tables (including existing tables) and views. dbt’s web UI (generated with dbt docs generate and served locally with dbt docs serve in the open-source CLI) lets us see data lineage derived from those dependencies, so we don’t need to maintain lineage metadata ourselves.

Suppose a table used to build several intermediate tables turns out to have a problem. We probably want to know which downstream tables are affected. As the image below shows, we can analyze the impacted tables on the web UI.

Lineage graph

Reason 4: Metadata management and data discovery

dbt enables us to annotate and tag models (intermediate tables/views) and sources (existing tables/views) using only YAML files. This is helpful for selecting a subset of models to run and for searching data on the web UI. We can also document columns and record information such as foreign keys to related tables in the same YAML files.

sources:
  - name: product
    database: "project-a"
    tags:
      - daily
    tables:
      - name: users
        columns:
          - name: id
            description: "user ID"
            tests:
              - unique
              - not_null
          - name: email
            description: "masked email"
            meta:
              contains_pii: true
Data discovery
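Tags also plug into dbt’s node selection syntax. As a sketch, assuming the daily tag defined above, we can run only the models carrying that tag from the CLI:

dbt run --models tag:daily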

Reason 5: Extensibility

dbt is based on Jinja, a modern and designer-friendly templating language for Python, so we can write custom macros; a minimal macro sketch follows the package list below. In addition, dbt packages enable us to easily reuse external libraries in our dbt projects:

  • dbt-utils: macros that can be (re)used across dbt projects
  • dbt-ml: BigQuery ML models in dbt
  • airflow-dbt: Airflow operators to provide easy integration with dbt.
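As a minimal sketch of a custom macro (the macro, model and column names here are hypothetical), we can wrap a reusable expression in Jinja and call it from any model:

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    round({{ column_name }} / 100, 2)
{% endmacro %}

-- models/payments.sql
select
    payment_id,
    {{ cents_to_dollars('amount_in_cents') }} as amount_in_dollars
from {{ ref('stg_payments') }}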

Conclusion

dbt is a powerful tool for building data pipelines with BigQuery using only BigQuery queries and YAML files. We no longer have to manage the dependencies between tables and views manually, and we can annotate and tag tables and views with dbt. Moreover, the web UI enables us to see and analyze data lineage and to do data discovery, so the metadata lives alongside the jobs that build the data pipelines.

Written by Yu Ishikawa

Data Engineering / Machine Learning / MLOps / Data Governance / Privacy Engineering
