Data Status Time Machine on Persisted dbt Artifacts

Yu Ishikawa
6 min read · Jul 6, 2021

This article is one of a series of blog posts from Ubie, Inc., brought to you by Yu. Ubie automatically generates medical records using an AI-powered patient questionnaire that helps save time and provide better patient care. As you can imagine, data engineering, data management, and data governance are all essential to building such a high-quality AI-powered system.

As you know, the trends in data engineering and data management have been shifting to the next stage in 2021. The objectives of the modern data stack go beyond just storing and transforming data: we are looking for better ways to manage pipelines, data quality, metadata, metrics, and so on. Both open-source and enterprise products for the modern data stack are evolving dramatically these days. I am sure dbt is one of the key products that inspire us and change our data engineering workflows.

dbt (data build tool) enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications. If you are not familiar with dbt, please read the article I wrote before. Transforming data is, of course, one of dbt’s most significant features, but it offers many other useful ones, such as schema tests, source freshness checks, and so on. In this article, I would like to show how to take advantage of dbt artifacts by persisting them on BigQuery so that we can inspect data status in the past.

TL;DR

  • dbt artifacts are JSON files that can be used to power data documentation, data state, and data visualization.
  • A dbt artifact is just a snapshot of a single dbt invocation; artifacts don’t contain the full history of dbt results. So, we have to track the changes ourselves to analyze data downtime, data quality status changes, and so on.
  • We implemented a Cloud Run application that inserts dbt artifact JSON into BigQuery, triggered by notifications from GCS.
  • On top of the persisted dbt artifacts on BigQuery, and with BI tools, we can understand not only which tables and views were wrong at a given time, but also when they went wrong and how long they stayed wrong.

What are dbt artifacts?

dbt generates and saves one or more artifacts with every invocation: catalog.json, manifest.json, run_results.json, and sources.json.

Catalog

catalog.json contains information from your data warehouse about the tables and views produced and defined by the resources in your project.

Manifest

manifest.json contains a full representation of your dbt project’s resources (models, tests, macros, etc), including all node configurations and resource properties. Even if you’re only running some models or tests, all resources will appear in the manifest (unless they are disabled) with most of their properties. (A few node properties, such as compiled_sql, only appear for executed nodes.)

Run Results

run_results.json contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc) that was executed. In aggregate, many run_results.json can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.

Sources

sources.json contains information about sources with freshness checks. Today, dbt Cloud uses this file to power its Source Freshness visualization.

The JSON schemas of the artifacts are well defined, so we can implement custom parsers based on them. This gives the artifacts great potential for integration with other tools. For instance, LinkedIn’s DataHub can ingest dbt metadata from dbt artifacts.

As the official documentation describes, we are able to implement a custom tool like DataHub to (see the parsing sketch after this list):

  • calculate project-level test coverage
  • perform longitudinal analysis of run timing
  • identify historical changes in table structure
  • do much, much more
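
For example, here is a minimal parsing sketch in Python. It assumes the documented artifact layouts (a nodes mapping with resource_type and depends_on in manifest.json, and a results list with unique_id and status in run_results.json); exact field names vary across dbt versions, so treat this as a sketch rather than a robust implementation.

```python
import json

# Load the artifacts that dbt writes under target/ on each invocation.
with open("target/manifest.json") as f:
    manifest = json.load(f)
with open("target/run_results.json") as f:
    run_results = json.load(f)

# Project-level test coverage: the share of models referenced by at least one test.
models = {uid for uid, node in manifest["nodes"].items()
          if node["resource_type"] == "model"}
tested = {dep
          for node in manifest["nodes"].values()
          if node["resource_type"] == "test"
          for dep in node["depends_on"]["nodes"]}
print(f"test coverage: {len(models & tested) / len(models):.1%}")

# Failed or errored tests in this invocation.
for result in run_results["results"]:
    if result.get("status") in ("fail", "error"):
        print(result["unique_id"], result["status"])
```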

What do we want to do with dbt artifacts?

We want to know not only the current data status, but also the historical changes of data status. In addition, we run several different dbt jobs depending on the situation. For instance, we have daily dbt jobs on Apache Airflow, as well as continuous delivery jobs on GitHub Actions that run when dbt models are changed. So, we don’t have one comprehensive run results JSON file that contains information about all models, all tests, and so on. We want to investigate data downtime with detailed information about failed dbt tests. That’s why we want to persist dbt artifacts. Moreover, BigQuery is a great place for them, since we already use BigQuery and our BI tools can easily connect to it.

Why don’t we use the upcoming --store-failures?

As you might know if you follow dbt development, dbt 0.20.0 will support a new feature to store test failures in the warehouse. That would be great! But it doesn’t satisfy our requirements, because the stored data are brief summaries of test results and are distributed across GCP projects, datasets, and tables. That makes it a little hard to analyze the details of failed tests, such as when and why they failed. In particular, we want to inspect data downtime with a detailed history of failures.

How do we persist dbt artifacts on BigQuery?

dbt itself doesn’t have a feature to persist its artifacts, so we have to implement one. If you use dbt with Snowflake, dbt_artifacts might be helpful for persisting dbt artifacts on Snowflake. But, unfortunately, we use dbt with BigQuery. As BigQuery doesn’t have a statement like Snowflake’s COPY, we implemented a function to load dbt artifact JSON files into BigQuery, leveraging the integration of Google Cloud Storage, Cloud Pub/Sub, and Google Cloud Run.

An overview of the architecture for inserting dbt artifacts into BigQuery:
  1. Upload dbt artifacts to Google Cloud Storage. As dbt doesn’t have a built-in way to store its artifacts on GCS, we have to implement this step ourselves, but a simple gsutil cp is enough.
  2. GCS sends Cloud Pub/Sub messages when objects are finalized.
  3. Pub/Sub pushes requests based on the GCS notifications to the Cloud Run application.
  4. The Cloud Run application inserts the dbt artifacts referenced by the Pub/Sub messages into BigQuery (a sketch of such an application follows this list).
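
To make step 4 concrete, here is a minimal sketch of such a Cloud Run application, assuming a Flask service behind a Pub/Sub push subscription for GCS OBJECT_FINALIZE notifications. It is not our actual implementation: the destination table my-project.dbt_artifacts.run_results is hypothetical, and the flattening assumes the run_results.json layout described above.

```python
import base64
import json
import os

from flask import Flask, request
from google.cloud import bigquery, storage

app = Flask(__name__)
bq = bigquery.Client()
gcs = storage.Client()

# Hypothetical destination table for flattened run results.
TABLE = os.environ.get("BQ_TABLE", "my-project.dbt_artifacts.run_results")

@app.route("/", methods=["POST"])
def handle_pubsub_push():
    # A Pub/Sub push request wraps the message in an envelope.
    envelope = request.get_json()
    payload = json.loads(base64.b64decode(envelope["message"]["data"]))

    # GCS OBJECT_FINALIZE notifications carry the bucket and object name.
    bucket, name = payload["bucket"], payload["name"]
    if not name.endswith("run_results.json"):
        return ("ignored", 204)

    # Download the artifact and flatten it into one row per node result.
    artifact = json.loads(gcs.bucket(bucket).blob(name).download_as_text())
    rows = [
        {
            "invocation_id": artifact["metadata"]["invocation_id"],
            "generated_at": artifact["metadata"]["generated_at"],
            "unique_id": result["unique_id"],
            "status": result["status"],
            "execution_time": result.get("execution_time"),
        }
        for result in artifact["results"]
    ]
    errors = bq.insert_rows_json(TABLE, rows)
    return (json.dumps(errors), 500) if errors else ("ok", 204)
```

Since Pub/Sub push subscriptions retry delivery on non-2xx responses, returning an error status lets failed inserts be retried automatically.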

The repository for setting up the Cloud Run application is public on GitHub. I described how to launch the application in your Google Cloud project in the README.

What is Data Status Time Machine like?

We have stored the artifacts of dbt test results on BigQuery with the repository. Now, we are ready to take advantage of them on top of a BI tool. In our case, we put a query on redash, as below. We can see the failed tests during the specified time period from 2021-06-16 to 2021-06-23, and investigate when and which tests and tables were down.
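
As a self-contained illustration, a query in this spirit might look like the following, run here through the BigQuery client (in redash you would paste just the SQL). The table and column names are hypothetical and follow the loader sketch above.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Failed tests within a time window, newest first.
query = """
SELECT generated_at, unique_id, status
FROM `my-project.dbt_artifacts.run_results`
WHERE status IN ('fail', 'error')
  AND generated_at BETWEEN '2021-06-16' AND '2021-06-23'
ORDER BY generated_at DESC
"""
for row in client.query(query).result():
    print(row.generated_at, row.unique_id, row.status)
```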

In addition, the dbt artifacts on BigQuery enable us to quickly investigate the data downtime of a dbt model (a BigQuery table). The bar chart below shows the total number of failed tests on a dbt model from 2021-06-01 to today. The x-axis is the date and time when dbt was invoked, and the y-axis is the total number of failed tests on the model. As you can see, the dbt model was down for a certain period. By implementing further queries and dashboards on redash, we can identify which dbt tests (by unique ID) failed.
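
Such a bar chart could be backed by an aggregation like this hypothetical one, counting failed tests per invocation hour for a single model; the unique_id of a dbt test typically embeds the model name, hence the LIKE filter.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Failed tests per invocation hour for one (hypothetical) model.
query = """
SELECT
  TIMESTAMP_TRUNC(generated_at, HOUR) AS invoked_at,
  COUNTIF(status IN ('fail', 'error')) AS failed_tests
FROM `my-project.dbt_artifacts.run_results`
WHERE unique_id LIKE '%my_model%'
  AND generated_at >= '2021-06-01'
GROUP BY invoked_at
ORDER BY invoked_at
"""
for row in client.query(query).result():
    print(row.invoked_at, row.failed_tests)
```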

Summary

We showed how to take advantage of dbt artifacts on BigQuery by keeping their history, so that we can understand data downtime and more. It is impossible to do the same with a single dbt artifact. As I described, combining the persisted artifacts with BI tools like redash enables us to easily visualize and inspect when and which tables in the warehouse were at fault. As the code is public, you can use it even if you don’t use BigQuery as your primary data warehouse. The idea can definitely be expanded further by integrating with modern data management and data governance tools.

We are hiring!

We are hiring not only in Japan, but also in Singapore. We are looking for talented people to change the world together. If you are interested in Ubie, please contact me. I would be glad to chat with you. Please feel free to leave comments.

