This article is brought to you by Yu as part of the blog post series from Ubie, Inc. Ubie automatically generates medical records using an AI-powered patient questionnaire, which helps save time and provide better patient care. As you can imagine, data engineering, data management and data governance are essential to building a high-quality AI-powered system.
As you know, the trends in data engineering and management have been shifting to the next stage in 2021. The objectives of the modern data stack go beyond just storing and transforming data. We are looking for better ways to manage pipelines, data quality…
I used to create BigQuery tables with Apache Airflow. These days, I am migrating the queries to dbt, but I still use Airflow to schedule dbt jobs. One of the obstacles to the migration is Airflow-specific jinja2 macros, such as ts. So, I implemented a dbt package that provides Airflow-like macros.
The macros are very useful for specifying the data to be processed in a WHERE clause, especially when a table is very large and is partitioned by date and time. Suppose we calculate the total amount of transactions on a certain day and insert the results into a partition of…
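To illustrate why date macros matter for partition filters, here is a minimal Python sketch. It is not the dbt package itself. It only mimics the output format of Airflow's ds, ds_nodash and ts macros and uses ds to restrict a query to one day. The table name, the transaction_date column and the amount column are hypothetical.

```python
from datetime import datetime, timezone


def airflow_like_macros(execution_date: datetime) -> dict:
    """Return values formatted like Airflow's ds, ds_nodash and ts macros."""
    return {
        "ds": execution_date.strftime("%Y-%m-%d"),
        "ds_nodash": execution_date.strftime("%Y%m%d"),
        "ts": execution_date.isoformat(),
    }


def daily_total_query(table: str, partition_column: str, execution_date: datetime) -> str:
    """Build a query that scans only the partition for the execution date.

    The `amount` column is a hypothetical name used for illustration.
    """
    macros = airflow_like_macros(execution_date)
    return (
        "SELECT SUM(amount) AS total_amount\n"
        f"FROM `{table}`\n"
        f"WHERE {partition_column} = DATE('{macros['ds']}')"
    )


example_date = datetime(2021, 1, 15, tzinfo=timezone.utc)
print(daily_total_query("my_project.my_dataset.transactions", "transaction_date", example_date))
```

Filtering on the partition column this way keeps the scan limited to a single day instead of the whole table.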
dbt (data build tool) is a really great tool, as I wrote in “5 reasons why BigQuery users should use dbt” before. In particular, dbt tags are very useful for selecting models depending on the situation by taking advantage of the model selection syntax. In this article, I describe the scopes of dbt tags that I misunderstood before, which can be a pitfall for others too.
Assume we have a dbt source with tags at various levels, as below. We can attach dbt tags to a source, a column and a test respectively.
The dbt CLI provides very useful syntax to…
I want to skip unnecessary CircleCI jobs on GitHub when nothing in the source code has changed. For instance, suppose we modify only the README documentation in a pull request. Do we need to run all the unit tests? I don’t think so. CircleCI does provide conditional steps, but they don’t work with such a dynamic condition. So, in this article, I describe a way to halt CircleCI jobs if the target files have not changed. We will follow three steps.
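The core decision, whether any changed file matches the files a job cares about, can be sketched in Python as follows. This is only an illustration under assumptions, not necessarily the approach the article takes. The base branch origin/main and the glob patterns are hypothetical, and changed_files must be run inside a git checkout.

```python
import subprocess
from fnmatch import fnmatch


def changed_files(base_ref: str = "origin/main") -> list:
    """List files changed between base_ref and HEAD (base_ref is an assumed default)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def should_run(files, patterns) -> bool:
    """Return True if at least one changed file matches one of the target patterns."""
    return any(fnmatch(path, pattern) for path in files for pattern in patterns)


# A README-only change does not match the hypothetical source patterns.
print(should_run(["README.md"], ["src/*", "tests/*"]))
print(should_run(["README.md", "src/app/main.py"], ["src/*", "tests/*"]))
```

When should_run returns False, a CircleCI step could end the job early, for example by calling `circleci-agent step halt`.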
How do you implement and test data pipelines with BigQuery to create intermediate tables and manage metadata and data discovery? I used to use Apache Airflow’s operators with BigQuery. However, I basically need to write code in Python and manage the dependencies between BigQuery tables manually. Apache Airflow also enables us to test BigQuery tables with the CheckOperator. But we need to write BigQuery queries even to test whether a column is not null or unique. It can be useful, but it is not productive for me. In addition, Apache Airflow doesn’t support metadata management and data discovery.
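To give a sense of the hand-written queries mentioned above, here is a small Python sketch that builds the kind of SQL a check needs. It assumes the check passes when the single returned value is truthy, which is how Airflow's check operators evaluate the first row. The table and column names are hypothetical.

```python
def not_null_check_sql(table: str, column: str) -> str:
    """SQL that returns TRUE when the column has no NULL values."""
    return f"SELECT COUNT(*) = 0 FROM `{table}` WHERE {column} IS NULL"


def unique_check_sql(table: str, column: str) -> str:
    """SQL that returns TRUE when all non-NULL values of the column are distinct."""
    return f"SELECT COUNT({column}) = COUNT(DISTINCT {column}) FROM `{table}`"


print(not_null_check_sql("my_dataset.users", "user_id"))
print(unique_check_sql("my_dataset.users", "user_id"))
```

Each new column to test means writing another query like these by hand.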
dbt (data build tool), which…
I described how to serve trained tensorflow models with tensorflow serving in “Serving Pre-Modeled and Custom Tensorflow Estimator with Tensorflow Serving” before. In that article, I explained how to build tensorflow models with estimator and how to serve the models with tensorflow serving and docker. Tensorflow serving has supported a RESTful API since version 1.8, in addition to the gRPC API. So, I would like to describe how to serve RESTful APIs with tensorflow serving.
In this article, I will give you a hands-on walkthrough of the RESTful API feature. The goal is to serve an iris classifier with HTTP/REST…
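As a rough sketch of what a client call might look like, the Python code below builds a request for the predict endpoint of the tensorflow serving REST API. It builds the request but does not send it. The host and port, the model name iris and the feature names are assumptions for illustration. It also assumes the served model has a predict signature that accepts these named inputs.

```python
import json
from urllib import request


def build_predict_request(host: str, model_name: str, instances: list) -> request.Request:
    """Build a POST request for the predict endpoint of the REST API."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, model name and feature names.
req = build_predict_request(
    "localhost:8501",
    "iris",
    [{"sepal_length": 5.1, "sepal_width": 3.3, "petal_length": 1.7, "petal_width": 0.5}],
)
print(req.full_url)
print(req.data.decode("utf-8"))
# To send the request against a running server: request.urlopen(req)
```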
Databricks announced a brand new end-to-end machine learning platform, which is an open source project. Nowadays, end-to-end machine learning platforms are becoming important. We data scientists want to focus on building models as much as possible. But traditionally, it costs a lot to build teams for the end-to-end pipelines.
Recently, several machine learning platforms have appeared. Some are commercial products. Others are open source projects. First, StudioML enables us to manage machine learning experiments. It also enables us to run experiments on AWS and GCP. On the other hand, it doesn’t have any feature to…
How should we manage the documentation about the specifications of trained models? Something like swagger would help machine learning teams share that information. However, it seems that there is no de facto standard for generating specifications of tensorflow saved models yet.
How can we understand the signatures of a given tensorflow model? We need something to help us understand a given saved tensorflow model. In such cases, tensorflow’s SavedModel CLI helps us by showing the signatures of a saved model. In this article, I will explain how to show the input/output signatures of a tensorflow saved model.
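As a small illustration, the Python sketch below assembles a `saved_model_cli show` command line. The export directory is a hypothetical path. Running the resulting command requires a tensorflow installation that provides saved_model_cli, so the sketch only prints it.

```python
import shlex


def saved_model_show_command(export_dir: str, tag_set: str = None, signature_def: str = None) -> list:
    """Build the argument list for `saved_model_cli show`.

    Without a tag set, ask for everything with --all. Otherwise narrow the
    output to the given tag set and, optionally, one signature.
    """
    command = ["saved_model_cli", "show", "--dir", export_dir]
    if tag_set is None:
        command.append("--all")
        return command
    command += ["--tag_set", tag_set]
    if signature_def is not None:
        command += ["--signature_def", signature_def]
    return command


# Hypothetical export directory.
for cmd in (
    saved_model_show_command("/tmp/saved_model"),
    saved_model_show_command("/tmp/saved_model", "serve", "serving_default"),
):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The argument list can be passed to subprocess.run on a machine where tensorflow is installed.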
A couple of years ago, when we wanted to serve machine learning models in real time, we needed to build REST APIs with web frameworks such as flask. Nowadays, thanks to cutting-edge technologies such as tensorflow and kubernetes, it is getting easier to do that. With tensorflow serving, all we have to do is prepare the machine learning models. So we machine learning engineers can focus on building machine learning models.
Actually, there are some articles about serving a “traditional” tensorflow graph with tensorflow serving. However, there are fewer articles about how to serve pre-modeled and custom tensorflow estimators with…
I previously described bigquery-to-datastore, which is used to transfer a whole BigQuery table to Google Datastore. At that time, we didn’t have any installer for it. So, I created a homebrew tap on GitHub to install bigquery-to-datastore.
You need a mac and homebrew, as well as java. If you have not used homebrew, please install it by referring to https://docs.brew.sh/Installation.html. bigquery-to-datastore also depends on maven, so please install it too.
$ brew install yu-iskw/bigquery-to-datastore/bigquery-to-datastore
# show help
Unfortunately, we don’t have any installer for windows and linux now. Please compile it yourself, or download a JAR file.