Data Lineage and Impact Analysis with Alvin and dbt

An end-to-end setup guide

Ricardo Mendes
Alvin


Ease of integration is among the main benefits of working with modern data tools. We know data teams use a bunch of them to get work done, and having a holistic view of what happens in a data platform is critical to delivering value consistently. This is where Alvin shines.


dbt has undoubtedly changed the landscape of data analytics. But every great innovation comes with unforeseen challenges. The data model explosion is a reality, and with more people creating data assets, keeping track of everything that has been built can be overwhelming.

That’s why Alvin developed an integration for dbt to help address some of these challenges.

dbt is tied to data warehouses. Your team probably also uses a pipeline orchestration tool to run it periodically, executing dependent commands such as dbt run and dbt test in a specific order. In this simple scenario, you already have three components of a data stack: a data warehouse, a data transformation tool, and a pipeline orchestrator.

Alvin, the fourth element, is responsible for automatically capturing column-level lineage for the transformations that happen in the data warehouse. It gives users a seamless view of those transformations from both the dbt and the data warehouse standpoints.

CI/CD can be considered the fifth component of the stack, and Alvin provides ways of running Impact Analysis directly from GitHub or GitLab when making changes to dbt models. Impact Analysis is a powerful tool that lets data engineers know which assets and people will be impacted before merging a change.

Connecting all these tools requires a recipe.

That’s what this post is: a guide for data teams looking to do exactly that. I will use BigQuery as the data warehouse, dbt Core for data transformations, Airflow as the orchestration tool, and GitHub Actions for CI/CD.

Step 1: Connect BigQuery to Alvin

The first step is connecting Alvin to the data warehouse (BigQuery in this case), using Alvin’s technical guide if needed. Ensure you grant the Alvin service account access to all assets “managed” by dbt. If you don’t have an Alvin account, sign up for a 30-day free trial here.
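If you prefer to manage that access with the gcloud CLI, the grant could look like the sketch below. Treat it as an illustration only: the project ID and service account email are placeholders, and the exact roles Alvin needs are the ones listed in its technical guide; this sketch assumes a project-level metadata-read grant plus permission to run jobs.

# Placeholders: replace the project ID and the Alvin service account email,
# and use the roles listed in Alvin's technical guide.
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:alvin-reader@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.metadataViewer"

gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:alvin-reader@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"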

Don’t forget to sync the platform after creating it. Depending on the metadata volume in your BigQuery datasets, this can take some time. Still, we recommend waiting until it finishes before starting step two for a better user experience.

BigQuery metadata will be available in Alvin’s Data Catalog when the sync finishes:

Image 1. BigQuery metadata in Alvin’s Data Catalog

Step 2: Connect dbt to Alvin

Next, set up a dbt Platform in Alvin. It’s as simple as it sounds.

The Data Warehouse Platform ID has to be the same one you used for the BigQuery Platform set up in step 1. There is no need to trigger a sync for this platform; syncs happen in real time, as we will see later.

Image 2. Setting up a dbt Platform in Alvin

Next, set up an Alvin API Key: docs.alvin.ai/getting-started/setting-up-your-api-key. Take note of the key value right after creating it; for security reasons, you won’t be able to read it later.

Install the dbt-bigquery-alvin PyPI package in your dbt Core execution environment (repeat the following steps in every environment you want integrated with Alvin, e.g., dev, staging, and prod). Alvin’s integration package adds instrumentation to existing dbt adapters, so it has to be installed after dbt-core and dbt-bigquery. Here’s a compatibility table for these packages, followed by example install commands:

| dbt-core | dbt-bigquery | dbt-bigquery-alvin |
| -------- | ------------ | ------------------ |
| ~= 1.0.0 | ~= 1.0.0 | ==0.0.2 |
| ~= 1.2.0 | ~= 1.2.0 | ==0.0.2 |
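As an illustration, in a dbt 1.2 environment the installation could look like the commands below (adjust the version pins to match the table above):

# Install the dbt adapter first, then the Alvin integration package on top of it.
pip install "dbt-core~=1.2.0" "dbt-bigquery~=1.2.0"
pip install "dbt-bigquery-alvin==0.0.2"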

Set up the environment variables required by the Alvin integration package:

export ALVIN_API_KEY=<Alvin API Key created before>
export ALVIN_DBT_PLATFORM_ID=<Name of the dbt Platform. E.g., dbt_demo>
export ALVIN_URL=https://app.alvin.ai
export ALVIN_VERBOSE_LOG=true

Pro tip: ALVIN_VERBOSE_LOG=true is optional but recommended for a better developer experience, as it makes troubleshooting much easier.

Call dbt run. If you set ALVIN_VERBOSE_LOG=true, logs like the ones below should appear in the console:

[YYYY-MM-DD, ... UTC] {...} INFO - Starting Alvin Integration Package installation.
[YYYY-MM-DD, ... UTC] {...} INFO - Matching host packages ['dbt-core'] for dbt
[YYYY-MM-DD, ... UTC] {...} INFO - Host package match: dbt-core 1.2.1
...
[YYYY-MM-DD, ... UTC] {...} INFO - start after_run
[YYYY-MM-DD, ... UTC] {...} INFO - start send dbt metadata
[YYYY-MM-DD, ... UTC] {...} INFO - end send dbt metadata

The integration works in real time, which means the assets touched by dbt should be immediately available in Alvin:

Image 3. dbt assets in Alvin (real-time integration)

When selecting a Model, you can see its metadata, including schema, run history, code, and lineage (more on this in the next paragraph). Notice the Info section where we show the name of the BigQuery table generated by that model. It works even when the table name differs from the model’s, such as when using aliases in dbt.

Image 4. Column-level metadata for a dbt Model

Lineage is captured at the column level, similar to other integrations provided by Alvin:

Image 5. Column-level data lineage for dbt Models

Step 3: Schedule dbt to run in Airflow

Triggering dbt run manually may work for development purposes, but not for production. That’s where Airflow comes to the rescue!

I have successfully tested the integration package in Airflow 2.2.5 (running in Cloud Composer 2.0.32) by doing the following:

1. Set up the Alvin environment variables mentioned above and two others:
DBT_PROFILE_DIR=/home/airflow/gcs/data
DBT_PROJECT_DIR=/home/airflow/gcs/dags/dbt-demo-jaffle-shop

Cloud Composer automatically binds /home/airflow/gcs/data to the data folder belonging to the GCS Bucket created for the managed Airflow instance’s storage needs. It also binds /home/airflow/gcs/dags to the dags folder in the same bucket so we can quickly deploy our DAGs.

2. Uploaded dbt’s profiles.yml, along with the service account file dbt uses to access BigQuery, to the bucket’s data folder.

3. Uploaded the dbt project to the bucket’s dags folder (dbt-demo-jaffle-shop subfolder).

4. Uploaded the below DAG to the dags folder:

import datetime
import os

import airflow
from airflow.utils import dates
from airflow_dbt.operators import dbt_operator

DBT_PROFILE_DIR = os.getenv("DBT_PROFILE_DIR")
DBT_PROJECT_DIR = os.getenv("DBT_PROJECT_DIR")

# The dbt settings are passed to both operators through default_args.
# The "env" entry forwards the Alvin variables to the dbt subprocesses, so the
# integration package can send metadata to Alvin on every run.
args = {
    "dir": DBT_PROJECT_DIR,
    "env": {
        "ALVIN_API_KEY": os.getenv("ALVIN_API_KEY"),
        "ALVIN_DBT_PLATFORM_ID": os.getenv("ALVIN_DBT_PLATFORM_ID"),
        "ALVIN_URL": os.getenv("ALVIN_URL"),
    },
    "owner": "airflow",
    "profiles_dir": DBT_PROFILE_DIR,
    "start_date": dates.days_ago(1),
    "target": "prod",
    "verbose": True,
}

with airflow.DAG(dag_id="alvin-demo-dbt-jaffle-shop",
                 default_args=args,
                 schedule_interval="30 0 * * *",
                 catchup=False,
                 dagrun_timeout=datetime.timedelta(minutes=60),
                 tags=["alvin", "demo", "dbt", "jaffle-shop"]) as dag:

    task_dbt_run = dbt_operator.DbtRunOperator(
        task_id="dbt-run",
    )

    task_dbt_test = dbt_operator.DbtTestOperator(
        task_id="dbt-test",
    )

    # Run the models first, then the tests.
    task_dbt_run >> task_dbt_test

That is enough to have dbt run and dbt test running daily at 00:30.
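For reference, steps 1 through 4 above can also be scripted with gcloud and gsutil. Here is a minimal sketch, assuming a Composer environment named my-composer-env in europe-west1 backed by the bucket my-composer-bucket; those names, the API key value, and the DAG file name are placeholders:

# 1. Set the environment variables on the Composer environment.
gcloud composer environments update my-composer-env \
  --location=europe-west1 \
  --update-env-variables=\
ALVIN_API_KEY=your-api-key,\
ALVIN_DBT_PLATFORM_ID=dbt_demo,\
ALVIN_URL=https://app.alvin.ai,\
ALVIN_VERBOSE_LOG=true,\
DBT_PROFILE_DIR=/home/airflow/gcs/data,\
DBT_PROJECT_DIR=/home/airflow/gcs/dags/dbt-demo-jaffle-shop

# 2. Upload profiles.yml and the BigQuery service account file to the data folder.
gsutil cp profiles.yml service-account.json gs://my-composer-bucket/data/

# 3. Upload the dbt project to the dags folder.
gsutil cp -r dbt-demo-jaffle-shop gs://my-composer-bucket/dags/

# 4. Upload the DAG file to the dags folder.
gsutil cp alvin_demo_dbt_jaffle_shop.py gs://my-composer-bucket/dags/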

Step 4: Run Impact Analysis in CI/CD pipelines

The final thing to cover is running Impact Analysis in CI/CD pipelines. I will use a GitHub Workflow for simplicity, given that Alvin provides a GitHub Action for a seamless developer experience. Still, the same results can be achieved in GitLab with a bit more code.

Pro tip: CI/CD integration only needs the data warehouse’s metadata, so the second and third steps, described above, are not required to run Impact Analysis in such pipelines. You can (and should!) use as many features as required to address your needs, though.

Start by adding an impact-analysis.yaml file to the .github/workflows/ folder, with the below content. It will call Alvin's Impact Analysis API for all pull requests on the dbt project's repository.

name: alvin-check-sql-impact-analysis

on:
  pull_request:

jobs:
  impact-run:
    runs-on: ubuntu-latest
    env:
      ALVIN_API_TOKEN: ${{ secrets.ALVIN_API_TOKEN }}
      ALVIN_PLATFORM_ID: ${{ secrets.ALVIN_PLATFORM_ID }}
      ALVIN_VERBOSE_LOG: ${{ secrets.ALVIN_VERBOSE_LOG }}
      DBT_PROFILES_DIR: ${{ secrets.DBT_PROFILES_DIR }}
      DBT_ROOT_DIR: ${{ secrets.DBT_ROOT_DIR }}
      DBT_TARGET: ${{ secrets.DBT_TARGET }}
      DIALECT: bigquery

      # The BigQuery connection credentials are provided through the below
      # env vars, having their values fetched from GitHub Secrets.
      # It is highly recommended to use unique connection credentials for this
      # workflow.
      PROFILES_YML_BQ_PROJECT_ID: ${{ secrets.PROFILES_YML_BQ_PROJECT_ID }}
      PROFILES_YML_BQ_DATASET_ID: ${{ secrets.PROFILES_YML_BQ_DATASET_ID }}
      PROFILES_YML_BQ_PRIVATE_KEY_ID: ${{ secrets.PROFILES_YML_BQ_PRIVATE_KEY_ID }}
      PROFILES_YML_BQ_PRIVATE_KEY: ${{ secrets.PROFILES_YML_BQ_PRIVATE_KEY }}
      PROFILES_YML_BQ_CLIENT_EMAIL: ${{ secrets.PROFILES_YML_BQ_CLIENT_EMAIL }}
      PROFILES_YML_BQ_CLIENT_ID: ${{ secrets.PROFILES_YML_BQ_CLIENT_ID }}
      PROFILES_YML_BQ_AUTH_URI: ${{ secrets.PROFILES_YML_BQ_AUTH_URI }}
      PROFILES_YML_BQ_TOKEN_URI: ${{ secrets.PROFILES_YML_BQ_TOKEN_URI }}
      PROFILES_YML_BQ_PROVIDER_CERT_URL: ${{ secrets.PROFILES_YML_BQ_PROVIDER_CERT_URL }}
      PROFILES_YML_BQ_CLIENT_CERT_URL: ${{ secrets.PROFILES_YML_BQ_CLIENT_CERT_URL }}
    steps:
      - name: Git checkout
        uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      # Collect the list of files changed in the pull request.
      - id: file_changes
        uses: trilom/file-changes-action@v1.2.4
        with:
          output: ' '
      # Run Alvin's Impact Analysis against the changed files.
      - id: impact_analysis
        uses: alvindotai/check-sql-impact@beta_v0.1.0
        with:
          args: ${{ steps.file_changes.outputs.files }}
      # Post the Impact Analysis summary as a comment on the pull request.
      - name: Comment on the PR
        uses: actions/github-script@v4
        if: ${{ always() }}
        with:
          script: |
            github.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: ${{ steps.impact_analysis.outputs.markdown_text }}
            })

A few things worth noting:

  1. The values of the 16 environment variables read from the secrets context must be provided as GitHub Secrets (see the gh CLI sketch right after this list).
  2. ALVIN_PLATFORM_ID refers to the BigQuery Platform created in step 1 (e.g., bigquery), not the dbt Platform, which is not required here.
  3. If your dbt project files do not reside in the repository's root folder, you must point to the project folder using DBT_ROOT_DIR.
  4. If dbt_project.yml and profiles.yml reside in the same folder, set DBT_PROFILES_DIR to . (a single dot). Otherwise, set it to the path of the profiles folder.
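One way to create the non-BigQuery secrets is with the GitHub CLI, run from a clone of the dbt project's repository; the values below are illustrative and should be adjusted to your setup (the API key is the one created earlier). The BigQuery-related secrets can be filled from the service account key file, as sketched after the profiles.yml snippet below.

gh secret set ALVIN_API_TOKEN --body "your-alvin-api-key"
gh secret set ALVIN_PLATFORM_ID --body "bigquery"
gh secret set ALVIN_VERBOSE_LOG --body "true"
gh secret set DBT_PROFILES_DIR --body "."
# Point DBT_ROOT_DIR to the dbt project folder if it is not the repository root.
gh secret set DBT_ROOT_DIR --body "."
gh secret set DBT_TARGET --body "prod"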

There must be a dummy profiles.yml file in the DBT_PROFILES_DIR folder, just to enable dbt’s Service Account JSON authentication method. Refer to the snippet below for an example. The environment variables/GitHub Secrets nested under keyfile_json must be copied from a GCP service account key file.

alvin-demo-bigquery-db:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account-json
      project: "{{ env_var('PROFILES_YML_BQ_PROJECT_ID') }}"
      dataset: "{{ env_var('PROFILES_YML_BQ_DATASET_ID') }}"
      threads: 1
      keyfile_json:
        type: service_account
        project_id: "{{ env_var('PROFILES_YML_BQ_PROJECT_ID') }}"
        private_key_id: "{{ env_var('PROFILES_YML_BQ_PRIVATE_KEY_ID') }}"
        private_key: "{{ env_var('PROFILES_YML_BQ_PRIVATE_KEY') }}"
        client_email: "{{ env_var('PROFILES_YML_BQ_CLIENT_EMAIL') }}"
        client_id: "{{ env_var('PROFILES_YML_BQ_CLIENT_ID') }}"
        auth_uri: "{{ env_var('PROFILES_YML_BQ_AUTH_URI') }}"
        token_uri: "{{ env_var('PROFILES_YML_BQ_TOKEN_URI') }}"
        auth_provider_x509_cert_url: "{{ env_var('PROFILES_YML_BQ_PROVIDER_CERT_URL') }}"
        client_x509_cert_url: "{{ env_var('PROFILES_YML_BQ_CLIENT_CERT_URL') }}"
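Since the values nested under keyfile_json map one-to-one to fields of the downloaded service account key file, the corresponding GitHub Secrets can be populated straight from it. Here is a minimal sketch using jq and the GitHub CLI; sa-key.json and the dataset value are placeholders:

gh secret set PROFILES_YML_BQ_PROJECT_ID --body "$(jq -r .project_id sa-key.json)"
gh secret set PROFILES_YML_BQ_PRIVATE_KEY_ID --body "$(jq -r .private_key_id sa-key.json)"
gh secret set PROFILES_YML_BQ_PRIVATE_KEY --body "$(jq -r .private_key sa-key.json)"
gh secret set PROFILES_YML_BQ_CLIENT_EMAIL --body "$(jq -r .client_email sa-key.json)"
gh secret set PROFILES_YML_BQ_CLIENT_ID --body "$(jq -r .client_id sa-key.json)"
gh secret set PROFILES_YML_BQ_AUTH_URI --body "$(jq -r .auth_uri sa-key.json)"
gh secret set PROFILES_YML_BQ_TOKEN_URI --body "$(jq -r .token_uri sa-key.json)"
gh secret set PROFILES_YML_BQ_PROVIDER_CERT_URL --body "$(jq -r .auth_provider_x509_cert_url sa-key.json)"
gh secret set PROFILES_YML_BQ_CLIENT_CERT_URL --body "$(jq -r .client_x509_cert_url sa-key.json)"
# The dataset ID is not part of the key file; set it to the target dataset.
gh secret set PROFILES_YML_BQ_DATASET_ID --body "your-dataset"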

With the configuration in place, developers should see comments similar to the one below in their PRs, except for the pink arrows ;).

Image 6. Impact Analysis for dbt Model changes (CI/CD pipeline)

The pink arrows highlight how the Impact Analysis tool can warn developers at a glance. If they want more details before merging the changes, especially about the impacted assets and users, they can click the Run > button and see such info in Alvin:

Image 7. Impact Analysis for dbt Model changes (Alvin)

Conclusion

In this guide, I’ve described how to integrate Alvin with BigQuery, dbt, Airflow, and GitHub Actions, giving data engineers detailed steps to make these integrations work with minimal effort.

In minutes, data teams can have an end-to-end understanding of their data platforms and avoid breaking changes just because they don’t know what happens downstream.

Thanks for reading, and you can sign up for a 30-day free trial here if you want to try it out for yourself!
