Data Lineage and Impact Analysis with Alvin and dbt
An end-to-end setup guide
Ease of integration is among the main benefits of working with modern data tools. We know data teams use a bunch of them to get work done, and having a holistic view of what happens in a data platform is critical to delivering value consistently. This is where Alvin shines.
dbt has undoubtedly changed the landscape of data analytics. But every great innovation brings unforeseen challenges. The data model explosion is real: with more people creating data assets, keeping track of everything that has been built can be overwhelming.
That’s why Alvin developed an integration for dbt to help address some of these challenges.
dbt is tied to data warehouses, and your team probably also uses a pipeline orchestration tool to run it periodically, executing dependent commands such as `dbt run` and `dbt test` in a specific order. In this simple scenario, you can already find three components of a data stack: a data warehouse, a data transformation tool, and a pipeline orchestrator.
Alvin, the fourth element, is responsible for automatically capturing column-level lineage about the transformations that happen in the data warehouse. It provides users a seamless view of those transformations from the dbt and data warehouse standpoints.
CI/CD can be considered the fifth component of the stack, and Alvin provides ways of running Impact Analysis directly from GitHub or GitLab when making changes to dbt models. Impact Analysis is a powerful tool enabling data engineers to know what assets and people will be impacted before merging a change.
Connecting all these tools requires a recipe.
That’s what this post is: a guide for data teams working on such a setup. I will use BigQuery as the data warehouse, dbt Core for data transformations, Airflow as the orchestration tool, and GitHub Actions for CI/CD.
Step 1: Connect BigQuery to Alvin
The first step is connecting Alvin to the data warehouse (BigQuery in this case), using Alvin’s technical guide if needed. Ensure you grant the Alvin service account access to all assets “managed” by dbt. If you don’t have an Alvin account, sign up for a 30-day free trial here.
Don’t forget to sync the platform after creating it. Depending on the metadata volume in your BigQuery datasets, this can take some time. Still, we recommend waiting until it finishes before starting step two for a better user experience.
BigQuery metadata will be available in Alvin’s Data Catalog when the sync finishes:
Step 2: Connect dbt to Alvin
Next, set up a dbt Platform in Alvin. It’s as simple as it sounds.
The `Data Warehouse Platform Id` has to be the same one you used for the BigQuery Platform set up in step 1. There is no need to trigger a sync for this platform; syncs happen in real time, as we will see later.
Next, set up an Alvin API Key: docs.alvin.ai/getting-started/setting-up-your-api-key. Take note of the key value just after creating it because, for security reasons, you won’t be able to read its value later.
Install the `dbt-bigquery-alvin` PyPI package in your dbt Core execution environment (repeat the following steps in dev, staging, and prod, for instance, if you want all of them integrated with Alvin). Alvin's integration package adds instrumentation to existing dbt adapters, so it has to be installed after `dbt-core` and `dbt-bigquery`. Here's a compatibility table for these packages:
| dbt-core | dbt-bigquery | dbt-bigquery-alvin |
| -------- | ------------ | ------------------ |
| ~= 1.0.0 | ~= 1.0.0 | ==0.0.2 |
| ~= 1.2.0 | ~= 1.2.0 | ==0.0.2 |
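To make the table actionable, here is a small sketch (a hypothetical helper, not part of the Alvin package; the version map simply mirrors the table above) that resolves which `dbt-bigquery-alvin` pin matches an installed `dbt-core` version:

```python
def compatible_alvin_pin(dbt_core_version):
    """Return the dbt-bigquery-alvin pip pin matching an installed
    dbt-core version, mirroring the compatibility table above."""
    # Supported dbt-core minor series -> Alvin integration package pin.
    pins = {"1.0": "==0.0.2", "1.2": "==0.0.2"}
    minor_series = ".".join(dbt_core_version.split(".")[:2])
    try:
        return pins[minor_series]
    except KeyError:
        raise ValueError(
            f"No tested dbt-bigquery-alvin pin for dbt-core {dbt_core_version}"
        )
```

For example, `compatible_alvin_pin("1.2.1")` resolves to `==0.0.2`, so you would run `pip install dbt-bigquery-alvin==0.0.2` after installing `dbt-core` and `dbt-bigquery`.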
Set up the environment variables required by the Alvin integration package:

```shell
export ALVIN_API_KEY=<Alvin API Key created before>
export ALVIN_DBT_PLATFORM_ID=<Name of the dbt Platform. E.g., dbt_demo>
export ALVIN_URL=https://app.alvin.ai
export ALVIN_VERBOSE_LOG=true
```
Pro tip: `ALVIN_VERBOSE_LOG=true` is optional but recommended for a better developer experience, as it makes troubleshooting much easier.
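Since a missing variable is a common source of silent failures, a quick pre-flight check before invoking dbt can save a debugging round trip. This is a hypothetical helper, not part of the Alvin package:

```python
import os

# The variables the Alvin integration package expects, per the setup above.
# ALVIN_VERBOSE_LOG is optional, so it is not listed here.
REQUIRED_VARS = ["ALVIN_API_KEY", "ALVIN_DBT_PLATFORM_ID", "ALVIN_URL"]


def missing_alvin_vars(environ=os.environ):
    """Return the names of required Alvin variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling `missing_alvin_vars()` right before `dbt run` and failing fast on a non-empty result keeps misconfigured environments from producing runs with no metadata in Alvin.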
Call `dbt run`. If you set `ALVIN_VERBOSE_LOG=true`, logs like the ones below should appear in the console:
```
[YYYY-MM-DD, ... UTC] {...} INFO - Starting Alvin Integration Package installation.
[YYYY-MM-DD, ... UTC] {...} INFO - Matching host packages ['dbt-core'] for dbt
[YYYY-MM-DD, ... UTC] {...} INFO - Host package match: dbt-core 1.2.1
...
[YYYY-MM-DD, ... UTC] {...} INFO - start after_run
[YYYY-MM-DD, ... UTC] {...} INFO - start send dbt metadata
[YYYY-MM-DD, ... UTC] {...} INFO - end send dbt metadata
```
The integration works in real-time, which means the assets touched by dbt should be immediately available in Alvin:
When selecting a Model, you can see its metadata, including schema, run history, code, and lineage (more on this in the next paragraph). Notice the `Info` section, which shows the name of the BigQuery table generated by that model. It works even when the table name differs from the model's, such as when using aliases in dbt.
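For reference, a dbt alias is set via the standard `config` macro. In this hypothetical model (the model, alias, and source names are made up for illustration), the table materialized in BigQuery is named differently from the model file:

```sql
-- models/stg_orders.sql (hypothetical model)
-- Materializes as a table named `orders_staging` rather than `stg_orders`;
-- Alvin still links the model to the physical table.
{{ config(alias="orders_staging") }}

select * from {{ source("shop", "raw_orders") }}
```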
Lineage is captured at the column level, similar to other integrations provided by Alvin:
Step 3: Schedule dbt to run in Airflow
Triggering `dbt run` manually may work for development purposes but not for production. So we can rely on Airflow to rescue us!
I have successfully tested the integration package in Airflow 2.2.5 (running in Cloud Composer 2.0.32) by doing the following:
1. Set up the Alvin environment variables mentioned above and two others:

```shell
DBT_PROFILE_DIR=/home/airflow/gcs/data
DBT_PROJECT_DIR=/home/airflow/gcs/dags/dbt-demo-jaffle-shop
```
Cloud Composer automatically binds `/home/airflow/gcs/data` to the `data` folder of the GCS bucket created for the managed Airflow instance's storage needs. It also binds `/home/airflow/gcs/dags` to the `dags` folder in the same bucket, so we can quickly deploy our DAGs.
2. Upload dbt’s `profiles.yml` and the service account file used by dbt to access BigQuery to the bucket’s `data` folder.
3. Upload the dbt project to the bucket’s `dags` folder (into a `dbt-demo-jaffle-shop` subfolder).
4. Upload the below DAG to the `dags` folder:
```python
import datetime
import os

import airflow
from airflow.utils import dates
from airflow_dbt.operators import dbt_operator

DBT_PROFILE_DIR = os.getenv("DBT_PROFILE_DIR")
DBT_PROJECT_DIR = os.getenv("DBT_PROJECT_DIR")

args = {
    "dir": DBT_PROJECT_DIR,
    "env": {
        "ALVIN_API_KEY": os.getenv("ALVIN_API_KEY"),
        "ALVIN_DBT_PLATFORM_ID": os.getenv("ALVIN_DBT_PLATFORM_ID"),
        "ALVIN_URL": os.getenv("ALVIN_URL"),
    },
    "owner": "airflow",
    "profiles_dir": DBT_PROFILE_DIR,
    "start_date": dates.days_ago(1),
    "target": "prod",
    "verbose": True,
}

with airflow.DAG(dag_id="alvin-demo-dbt-jaffle-shop",
                 default_args=args,
                 schedule_interval="30 0 * * *",
                 catchup=False,
                 dagrun_timeout=datetime.timedelta(minutes=60),
                 tags=["alvin", "demo", "dbt", "jaffle-shop"]) as dag:

    task_dbt_run = dbt_operator.DbtRunOperator(
        task_id="dbt-run",
    )

    task_dbt_test = dbt_operator.DbtTestOperator(
        task_id="dbt-test",
    )

    task_dbt_run >> task_dbt_test
```
And that is enough to have `dbt run` and `dbt test` running daily at 00:30.
Step 4: Run Impact Analysis in CI/CD pipelines
The final thing to cover is running Impact Analysis in CI/CD pipelines. I will use a GitHub Workflow for simplicity, given that Alvin provides a GitHub Action for a seamless developer experience. Still, the same results can be achieved in GitLab with a bit more code.
Pro tip: the CI/CD integration only needs the data warehouse’s metadata, so steps 2 and 3 above are not required to run Impact Analysis in such pipelines. That said, you can (and should!) use as many features as your needs require.
Start by adding an `impact-analysis.yaml` file to the `.github/workflows/` folder with the below content. It will call Alvin's Impact Analysis API for all pull requests on the dbt project's repository.
```yaml
name: alvin-check-sql-impact-analysis

on:
  pull_request:

jobs:
  impact-run:
    runs-on: ubuntu-latest
    env:
      ALVIN_API_TOKEN: ${{ secrets.ALVIN_API_TOKEN }}
      ALVIN_PLATFORM_ID: ${{ secrets.ALVIN_PLATFORM_ID }}
      ALVIN_VERBOSE_LOG: ${{ secrets.ALVIN_VERBOSE_LOG }}
      DBT_PROFILES_DIR: ${{ secrets.DBT_PROFILES_DIR }}
      DBT_ROOT_DIR: ${{ secrets.DBT_ROOT_DIR }}
      DBT_TARGET: ${{ secrets.DBT_TARGET }}
      DIALECT: bigquery
      # The BigQuery connection credentials are provided through the below
      # env vars, with their values fetched from GitHub Secrets.
      # It is highly recommended to use unique connection credentials for
      # this workflow.
      PROFILES_YML_BQ_PROJECT_ID: ${{ secrets.PROFILES_YML_BQ_PROJECT_ID }}
      PROFILES_YML_BQ_DATASET_ID: ${{ secrets.PROFILES_YML_BQ_DATASET_ID }}
      PROFILES_YML_BQ_PRIVATE_KEY_ID: ${{ secrets.PROFILES_YML_BQ_PRIVATE_KEY_ID }}
      PROFILES_YML_BQ_PRIVATE_KEY: ${{ secrets.PROFILES_YML_BQ_PRIVATE_KEY }}
      PROFILES_YML_BQ_CLIENT_EMAIL: ${{ secrets.PROFILES_YML_BQ_CLIENT_EMAIL }}
      PROFILES_YML_BQ_CLIENT_ID: ${{ secrets.PROFILES_YML_BQ_CLIENT_ID }}
      PROFILES_YML_BQ_AUTH_URI: ${{ secrets.PROFILES_YML_BQ_AUTH_URI }}
      PROFILES_YML_BQ_TOKEN_URI: ${{ secrets.PROFILES_YML_BQ_TOKEN_URI }}
      PROFILES_YML_BQ_PROVIDER_CERT_URL: ${{ secrets.PROFILES_YML_BQ_PROVIDER_CERT_URL }}
      PROFILES_YML_BQ_CLIENT_CERT_URL: ${{ secrets.PROFILES_YML_BQ_CLIENT_CERT_URL }}
    steps:
      - name: Git checkout
        uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - id: file_changes
        uses: trilom/file-changes-action@v1.2.4
        with:
          output: ' '
      - id: impact_analysis
        uses: alvindotai/check-sql-impact@beta_v0.1.0
        with:
          args: ${{ steps.file_changes.outputs.files }}
      - name: Comment on the PR
        uses: actions/github-script@v4
        if: ${{ always() }}
        with:
          script: |
            github.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: ${{ steps.impact_analysis.outputs.markdown_text }}
            })
```
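The `trilom/file-changes-action` step, configured with `output: ' '`, hands the Alvin action a space-separated list of changed file paths. For intuition, here is a sketch of how such a list could be narrowed down to dbt model files (illustrative only; the `models/` prefix assumes the dbt project lives at the repository root):

```python
def changed_model_files(files_output):
    """Filter a space-separated list of changed paths (as produced by the
    file-changes step) down to dbt model SQL files."""
    return [
        path
        for path in files_output.split()
        if path.endswith(".sql") and path.startswith("models/")
    ]
```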
A few things worth noting:
- The 16 environment variables whose values start with `secrets.` must be provided as GitHub Secrets. `ALVIN_PLATFORM_ID` refers to the BigQuery Platform created in step 1 (e.g., `bigquey`), not the dbt Platform, which is not required here.
- If your dbt project files do not reside in the repository's root folder, you must set the project folder using `DBT_ROOT_DIR`.
- If the `dbt_project.yml` and `profiles.yml` files reside in the same folder, set `DBT_PROFILES_DIR` to `.`. Otherwise, set it to the path of the profiles folder.
There must be a dummy `profiles.yml` file in the `DBT_PROFILES_DIR` folder, just to enable dbt’s Service Account JSON authentication method. Please refer to the below snippet for an example. The environment variables/GitHub Secrets nested under `keyfile_json` must be copied from a GCP Service Account file.
```yaml
alvin-demo-bigquery-db:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account-json
      project: "{{ env_var('PROFILES_YML_BQ_PROJECT_ID') }}"
      dataset: "{{ env_var('PROFILES_YML_BQ_DATASET_ID') }}"
      threads: 1
      keyfile_json:
        type: service_account
        project_id: "{{ env_var('PROFILES_YML_BQ_PROJECT_ID') }}"
        private_key_id: "{{ env_var('PROFILES_YML_BQ_PRIVATE_KEY_ID') }}"
        private_key: "{{ env_var('PROFILES_YML_BQ_PRIVATE_KEY') }}"
        client_email: "{{ env_var('PROFILES_YML_BQ_CLIENT_EMAIL') }}"
        client_id: "{{ env_var('PROFILES_YML_BQ_CLIENT_ID') }}"
        auth_uri: "{{ env_var('PROFILES_YML_BQ_AUTH_URI') }}"
        token_uri: "{{ env_var('PROFILES_YML_BQ_TOKEN_URI') }}"
        auth_provider_x509_cert_url: "{{ env_var('PROFILES_YML_BQ_PROVIDER_CERT_URL') }}"
        client_x509_cert_url: "{{ env_var('PROFILES_YML_BQ_CLIENT_CERT_URL') }}"
```
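For clarity on what dbt does with this snippet: at runtime, each `env_var()` call is replaced by the corresponding variable's value, yielding a standard service-account dictionary under `keyfile_json`. A minimal sketch of that substitution (a hypothetical helper, not dbt's actual code):

```python
import os

# Maps service-account fields to the env vars used in the profiles.yml above.
KEYFILE_FIELDS = {
    "project_id": "PROFILES_YML_BQ_PROJECT_ID",
    "private_key_id": "PROFILES_YML_BQ_PRIVATE_KEY_ID",
    "private_key": "PROFILES_YML_BQ_PRIVATE_KEY",
    "client_email": "PROFILES_YML_BQ_CLIENT_EMAIL",
    "client_id": "PROFILES_YML_BQ_CLIENT_ID",
    "auth_uri": "PROFILES_YML_BQ_AUTH_URI",
    "token_uri": "PROFILES_YML_BQ_TOKEN_URI",
    "auth_provider_x509_cert_url": "PROFILES_YML_BQ_PROVIDER_CERT_URL",
    "client_x509_cert_url": "PROFILES_YML_BQ_CLIENT_CERT_URL",
}


def build_keyfile_json(environ=os.environ):
    """Mimic dbt's env_var() substitution for the keyfile_json block,
    producing the service-account dict BigQuery authentication expects."""
    return {
        "type": "service_account",
        **{field: environ[var] for field, var in KEYFILE_FIELDS.items()},
    }
```

This is also a handy way to see at a glance which GCP Service Account file fields feed which GitHub Secrets.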
With the configurations in place, developers should see comments similar to the one below (except for the pink arrows ;)) in their PRs. The pink arrows highlight how the Impact Analysis tool can warn developers at a glance. If they want more details before merging, especially about the impacted assets and users, they can click the `Run >` button and see that information in Alvin:
Conclusion
In this guide, I described how to integrate Alvin with BigQuery, dbt, Airflow, and GitHub Actions, with detailed steps so data engineers can make these integrations work with minimal burden.
In minutes, data teams can have an end-to-end understanding of their data platforms and avoid breaking changes just because they don’t know what happens downstream.
Thanks for reading, and you can sign up for a 30-day free trial here if you want to try it out for yourself!