Implementing the Write-Audit-Publish (WAP) Pattern with dbt, Airflow, and BigQuery
A Real-World Example of Ensuring Data Integrity
In our business domain, ensuring high-quality and accurate data is critical, as every minute of missing or flawed data in production can lead to significant revenue losses and erode trust. Traditional workflows, however, often fall short of addressing this challenge. These workflows typically involve processing and writing data to the production table first, then running tests to check its quality. The issue with this approach is that bad data may already be available to users and downstream systems by the time errors are detected.
Here’s an example from our operations highlighting why we implemented this solution. Our business dashboards and reports update every 45 minutes, making it crucial for the data to be accurate. This data is used to make important business decisions, especially at the end of the day, week, month, or year. Some users also automate processes based on this data, which can trigger notifications to broader audiences or impact other systems.
For instance, during a past incident, an unanticipated change in currency-related upstream data caused a missing filter in an intermediary table, leading to inflated revenue figures on dashboards. This prevented users from making informed decisions and led to complaints that required urgent resolution.
The 45-minute update cycle gives us room to address data quality issues while still complying with our SLO. Delivering incorrect data, even on time, can cause poor customer experiences, resulting in incidents, complaints, and a surge in support tickets.
It's worth noting that this kind of testing, the data test, differs from the tests conducted during development or deployment. In the CI pipeline, those tests ensure that code changes do not break other parts of the system. Data tests, performed at runtime, present a distinct challenge: they validate the correctness of the production data itself, rather than just the code.
We adopted the Write-Audit-Publish (WAP) approach in our data pipeline to solve this. The pattern gained broader recognition after Netflix's notable presentation in 2017. The following sections outline the implementation details and explain how this method is applied using dbt, BigQuery, and Airflow.
Challenges with Current Data Workflows
In numerous data engineering scenarios, the standard procedure involves first processing the data, writing the result into a table, and then running tests on it:
Data Processing & Writing: For example, executing dbt run.
Data Testing: Such as performing dbt test.
The primary drawback of this sequential approach is that by the time errors are discovered in the testing phase, the flawed data has often already been propagated to downstream systems, making the spread of incorrect data difficult to contain and correct.
In the case of dbt, tests are executed after a model is built. If the data exhibits unexpected characteristics, the test will fail, but by then it is too late: the incorrect data has already entered the model, compromising its integrity and potentially affecting downstream systems. This can lead to critical issues such as data anomalies (e.g., row explosions) and incorrect outputs, resulting in customer dissatisfaction and complaints.
In some scenarios, allowing incorrect data to pass through the pipeline may be acceptable rather than stopping it entirely. In other cases, however, delivering inaccurate or flawed data is unacceptable and can lead to disastrous consequences. For example, customers may tolerate a small delay in data availability but will not tolerate incorrect data in an end-of-day financial report. In such cases, it is essential to perform runtime testing: validating the data after it has been processed or transformed but before it is published to downstream systems.
The WAP Approach
To address these challenges, we implement the Write-Audit-Publish (WAP) pattern within our data pipelines, embodying the "shift-left" approach to data quality. This means moving quality checks earlier in the process, ensuring potential issues are identified and resolved before they can affect end users. This pattern involves auditing data post-processing but pre-publication. If the data passes the audit, it is then published for consumption.
Conversely, if it fails, an alert is triggered, enabling prompt identification and rectification of issues, all within our Service Level Agreement (SLA). This proactive approach not only prevents customers from encountering inaccuracies but also supports the concept of "fail fast," where problems are detected and addressed as soon as possible.
The WAP pattern is a more robust approach to ensuring data quality before it reaches end users. The diagram provided illustrates the conceptual idea of WAP, which is designed to ensure data integrity before it reaches the production environment.
Note that the Publish step is executed only after the test passes. Publishing makes the data readable by users and downstream systems; if the test does not pass, the bad data is never made readable.
WAP Implementation Details in dbt, BigQuery, and Airflow
The implementation details of WAP using dbt, BigQuery, and Airflow are shown in the diagram below.
Transform & Write
The process begins with the Transform & Write step (dbt run), where raw data is transformed to meet specific requirements. After the transformation, the data is written to a new table, with each table uniquely named using Airflow's {{ ts_nodash }} variable macro. This suffix denotes the execution time, resulting in a distinct table for each batch execution. For example:
target_20240402100000 (for execution time 2024-04-02 10:00:00)
target_20240402103000 (for execution time 2024-04-02 10:30:00)
target_20240402110000 (for execution time 2024-04-02 11:00:00)
Each table represents the output of a specific execution batch.
Instead of overwriting existing tables, new intermediary tables are created for every execution. This approach, illustrated by the sketch after the list below, offers several key advantages:
Robust Rollback Mechanism: By retaining previous versions of the table, rollbacks can be executed instantly by switching back to an earlier table, eliminating the need for costly reprocessing.
Simplified Debugging: Creating new tables makes debugging and investigating issues easier, as historical data is preserved and can be compared with the current state.
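Because every batch leaves its own table behind, a rollback or an investigation can start by simply enumerating them. This is a minimal sketch using the google-cloud-bigquery client; the project and dataset names are assumptions.

```python
# Minimal sketch: every batch writes its own target_<execution time> table,
# so previous outputs remain queryable side by side. Listing them is enough
# to pick a rollback candidate or compare two runs while debugging.
# The project and dataset names below are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

batch_tables = sorted(
    table.table_id
    for table in client.list_tables("my_project.my_dataset")
    if table.table_id.startswith("target_")
)
print(batch_tables)
# e.g. ['target_20240402100000', 'target_20240402103000', 'target_20240402110000']
```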
Audit
After the Transform & Write step, the newly created table undergoes an Audit (dbt test) phase. This step applies rigorous quality checks, such as unique key validation (e.g., ensuring primary keys have no duplicates), not null constraints (e.g., verifying critical fields like order_id or currency_code are not empty), and data range checks (e.g., ensuring values in fields like order_amount or timestamp fall within expected ranges). These tests are critical to ensuring data accuracy and completeness.
Successful Audit: The process proceeds to the Publish stage if the table passes all checks.
Failed Audit: If the audit fails, the pipeline halts immediately, and an alert is sent to notify stakeholders of the issue.
For more information on how to define and implement dbt tests, refer to the dbt test documentation.
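The checks themselves live in our dbt project as standard dbt tests. Purely for illustration, the same kinds of checks could be expressed as a single query against one suffixed intermediary table; the sketch below assumes a hypothetical project and dataset and reuses the field names mentioned above.

```python
# Illustration only: the kinds of checks our dbt tests perform, written as a
# direct BigQuery query against one intermediary table. Project and dataset
# names are assumptions; the column names come from the examples above.
from google.cloud import bigquery

client = bigquery.Client()

audit_sql = """
SELECT
  COUNTIF(order_id IS NULL)           AS null_order_ids,
  COUNTIF(currency_code IS NULL)      AS null_currency_codes,
  COUNTIF(order_amount < 0)           AS negative_order_amounts,
  COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_order_ids
FROM `my_project.my_dataset.target_20240402100000`
"""

row = next(iter(client.query(audit_sql).result()))
# Any non-zero counter means the audit failed and publishing must not happen.
if any(value > 0 for value in row.values()):
    raise ValueError(f"Audit failed: {dict(row.items())}")
```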
Publish
The Publish step makes the audited data available to end users or downstream systems.
This stage employs a zero-copy operation, ensuring data availability with O(1) complexity. This eliminates the need for additional data transfer steps, which could introduce new points of failure and delays.
The zero-copy operation is achieved by dynamically altering a view to point to the most recent validated table. The end-user interacts with this view, which always reflects the latest validated data. This ensures that data is made available almost instantaneously while maintaining integrity and consistency.
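In our pipeline the view alteration is performed by a dbt model, but the underlying BigQuery operation is simple enough to show on its own. The sketch below assumes hypothetical project, dataset, and view names.

```python
# Sketch of the zero-copy publish: repoint the user-facing view at the most
# recently validated table. Replacing a view is a metadata-only change, so no
# rows are copied. Project, dataset, and view names are assumptions.
from google.cloud import bigquery

client = bigquery.Client()
validated_table = "my_project.my_dataset.target_20240402100000"

publish_sql = f"""
CREATE OR REPLACE VIEW `my_project.my_dataset.target` AS
SELECT * FROM `{validated_table}`
"""

client.query(publish_sql).result()  # end users keep querying the same view name
```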
Airflow DAG Implementation
We use Airflow to orchestrate the WAP pattern in dbt. The implementation consists of three main Tasks, as outlined below:
build_intermediary_model: This task takes ts_nodash as an input variable and executes the dbt run command to create the intermediary table.
audit_model: This task performs data quality checks on the intermediary table by running the dbt test command. If any test fails, the task terminates with a failure.
publish_model: This task handles the view alteration and publishes the final model. It runs the dbt run command with ts_nodash as an input variable.
The diagram below illustrates the flow of these tasks within the DAG.
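For orientation, a stripped-down version of such a DAG is sketched below. The operator choice, dbt selectors, schedule, and the batch_suffix variable name are assumptions rather than our exact production code; the intermediary model is assumed to use the variable in its table alias, and the publish model is assumed to be a view that selects from the suffixed table.

```python
# Minimal sketch of the three-task WAP DAG. Selectors, the schedule, and the
# dbt variable name are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="wap_pipeline",
    start_date=datetime(2024, 4, 1),
    schedule_interval=timedelta(minutes=45),  # matches the 45-minute refresh cycle
    catchup=False,
) as dag:
    # Write: build the suffixed intermediary table for this execution batch.
    build_intermediary_model = BashOperator(
        task_id="build_intermediary_model",
        bash_command=(
            "dbt run --select intermediary_target "
            "--vars '{\"batch_suffix\": \"{{ ts_nodash }}\"}'"
        ),
    )

    # Audit: run the dbt tests against the intermediary table. A failing test
    # gives dbt a non-zero exit code, which fails this task and stops the DAG
    # before anything is published.
    audit_model = BashOperator(
        task_id="audit_model",
        bash_command=(
            "dbt test --select intermediary_target "
            "--vars '{\"batch_suffix\": \"{{ ts_nodash }}\"}'"
        ),
    )

    # Publish: rebuild the view model so it points at the validated table.
    publish_model = BashOperator(
        task_id="publish_model",
        bash_command=(
            "dbt run --select publish_target "
            "--vars '{\"batch_suffix\": \"{{ ts_nodash }}\"}'"
        ),
    )

    build_intermediary_model >> audit_model >> publish_model
```

Because the audit task sits between the write and publish tasks, a failed test leaves the view pointing at the previous validated table, so end users never see the bad batch.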
Pros and Cons Summary
Our WAP implementation offers several advantages:
Ensuring Data Accuracy: We ensure only correct data is published by leveraging views that point to validated tables.
Simplified Rollback: Rolling back to a previous, error-free state is straightforward. This is achieved by simply adjusting the view to point to the last known good table, minimizing disruption and recovery time.
Facilitated Debugging & Investigation: Retaining historical intermediary tables makes it easier to investigate issues when something goes wrong, providing clear insights into the data at each step.
Non-intrusive Publishing: Because the view alteration has O(1) complexity, data can be made available almost instantly without the need for additional data transfer steps. This minimizes downtime and enhances data availability.
Despite these benefits, there are also some minor challenges:
Potential Data Delay: In cases of unexpected data quality issues, the pipeline stops publishing new data. However, this is an acceptable trade-off for our use case, where data correctness is paramount. Additionally, the delay remains within our SLA, leaving room for mitigation when necessary.
Table Management Overhead: This approach generates numerous tables, one for each execution cycle of the batch process. However, BigQuery's table expiration feature simplifies lifecycle management by automatically deleting older tables as needed, as sketched below.
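As one possible setup, a default table expiration on the dataset that holds the intermediary tables lets BigQuery clean them up automatically; the dataset name and the seven-day retention window below are assumptions.

```python
# Sketch: set a default expiration so newly created intermediary tables are
# deleted automatically once they age out. Dataset name and the seven-day
# window are assumptions.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my_project.my_dataset")

dataset.default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000  # 7 days
client.update_dataset(dataset, ["default_table_expiration_ms"])
```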
Conclusion
The Write-Audit-Publish (WAP) pattern is a robust approach to ensuring data integrity before it reaches end users. This method is particularly valuable in scenarios where delivering incorrect data is unacceptable, even if a small delay (within SLA) is tolerable. This implementation effectively balances reliability, scalability, and maintainability, ensuring accurate and timely data delivery to downstream systems and end users.