Data transformation tools such as dbt or SQLMesh have become staples in many stacks across industries and data departments. These tools provide essential automation for managing multiple data models, making data more accessible and easier to query for other teams within a company. dbt Core was first released in 2016 and has continuously evolved with new features. However, its approach to certain fundamental operations can introduce complexities, which is an area SQLMesh aims to refine and improve upon. In this post, I will share my experience with both tools and discuss the advantages and limitations I found with each.
The basics
Both tools define models using SQL and manage model configurations and their metadata. dbt allows separating the model logic from its configuration, which can be stored in YAML files located alongside the models or elsewhere in the project's directory structure. SQLMesh opts for inline configuration and comments to describe the models, which I find more convenient, given that it works similarly to how you would document code.
dbt favors centralised configuration management, often through its dbt_project.yml file, with the ability to override global settings for individual models as needed.
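As a quick illustration of that centralised approach, a dbt_project.yml can set project-wide defaults and override them per folder (the project and folder names below are hypothetical):

models:
  my_project:                  # hypothetical project name
    +materialized: view        # project-wide default
    marts:
      +materialized: table     # override for models under models/marts/
      +tags: ["reporting"]

Settings closer to the model (a config block in the SQL file or a model-level YAML entry) take precedence over these project-wide defaults.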
State management
An important aspect of managing your data models is state tracking. This capability is crucial for automation, particularly for incremental models, as it reduces manual oversight and enables efficient data processing. While not as critical for models requiring a full refresh, incremental models often need to track the last processed entry or time window, representing the current state of the target model.
With dbt, the documentation recommends an approach like this:
{{
    config(
        materialized='incremental'
    )
}}

select
    *,
    my_slow_function(my_column)
from {{ ref('app_data_events') }}

{% if is_incremental() %}
  -- this filter is only applied on an incremental run
  where event_time >= (select coalesce(max(event_time), '1900-01-01') from {{ this }})
{% endif %}
This implies that every time the model is executed, a query is triggered to retrieve the state from the end model. While fetching state from the end model ensures accuracy, it can lead to increased costs depending on table partitioning and clustering: if the end table isn't partitioned by event_time, querying it could be more expensive due to full table scans or higher compute costs.
An alternative approach involves tracking this state in a dedicated table, ensuring consistent awareness of the last processed entry for easier recovery and reprocessing. SQLMesh adopts such an approach. Let's look at an example from their documentation:
MODEL (
  name db.events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  )
);

SELECT
  event_date::TEXT as event_date,
  event_payload::TEXT as payload
FROM raw_events
WHERE
  event_date BETWEEN @start_ds AND @end_ds;
Notice the absence of a direct subquery. So, why event_date BETWEEN @start_ds AND @end_ds? By tracking the processed data intervals separately, SQLMesh can more easily manage scenarios where you might need to restate or backfill the model for specific time ranges. For instance:
- The source data in the DWH wasn't refreshed on time due to a replication failure upstream.
- The source data was corrupted by a backend error and was subsequently fixed manually.
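In cases like these, the affected intervals can be reprocessed by requesting a restatement plan. As a rough sketch (the model name and dates are illustrative, and the exact flags may vary between SQLMesh versions):

sqlmesh plan --restate-model db.events --start 2024-01-01 --end 2024-01-03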
SQLMesh internally tracks the processed time ranges for each model. When a model with a defined cron schedule runs, SQLMesh automatically adjusts the values of @start_ds and @end_ds (or similar temporal macros like @start_ts / @end_ts, depending on the column type) based on the execution time. An hourly model will only process data from the previous hour. These macros are suffixed with the type of the time filter.
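To make that concrete, here is a minimal sketch of an hourly incremental model using the timestamp-suffixed macros (the model name, source table, and columns are illustrative):

MODEL (
  name db.hourly_events,              -- illustrative name
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_ts
  ),
  cron '@hourly'                      -- each run covers the previous hour's interval
);

SELECT
  event_ts::TIMESTAMP AS event_ts,
  event_payload::TEXT AS payload
FROM raw_events
WHERE
  event_ts BETWEEN @start_ts AND @end_ts;  -- timestamp-typed macros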
Data quality
Ensuring that data is accurate and conforms to the organisation's data contracts is essential to data quality. dbt and SQLMesh provide different mechanisms for this.
Testing the transformations
Both tools offer ways to unit test transformation logic, a common practice in Software Engineering, where the expected results of transformations are verified with mock input data. dbt calls them unit tests, and SQLMesh calls them tests.
These tests are executed on demand (e.g. locally or in CI), and aim to verify that individual transformations work as expected before deployment. A good practice is to include these tests as part of a CI/CD pipeline.
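For reference, here is roughly what a SQLMesh test looks like: a YAML file under tests/ that feeds mock rows to the model's upstream dependencies and asserts on the query's output (the model, table, and column names below are illustrative):

test_orders_per_item:
  model: sqlmesh_example.full_model          # model under test (illustrative)
  inputs:
    sqlmesh_example.incremental_model:       # mocked upstream model
      rows:
        - id: 1
          item_id: 1
        - id: 2
          item_id: 1
  outputs:
    query:                                   # expected result of the model's query
      rows:
        - item_id: 1
          num_orders: 2

dbt unit tests follow a very similar idea, with given/expect blocks declared in the model's YAML.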
Data contracts
Data contracts help guarantee that models produce data conforming to predefined schemas (types and constraints), maintaining consistency for end-user applications.
In dbt, this can be done through a contract, where columns and their types and constraints are specified:
models:
  - name: dim_customers
    config:
      materialized: table
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
          - type: custom
            expression: "tag (my_tag = 'my_value')"
      - name: customer_name
        data_type: string
      - name: non_integer
        data_type: numeric(38,3)
SQLMesh offers a somewhat different approach that combines explicit casting within the model's SQL with audits. While SQLMesh models also define column types (e.g., id::INTEGER), which leads to casting, more complex data quality checks are typically handled by audits. Built-in audits (like not_null) can be referenced directly in the MODEL definition, and custom audits (SQL queries that identify "bad" data) can be defined in the audits/ folder.
MODEL (
  name sushi.orders,
  audits (
    not_null(columns := (id, customer_id, waiter_id))
  )
);

SELECT
  id::INTEGER,
  customer_id::INTEGER,
  waiter_id::INTEGER,
  table::INTEGER,
  products::ARRAY<STRUCT<product STRING, amount INTEGER>>
FROM raw.orders  -- source table omitted in the original snippet; the name here is illustrative
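A custom audit, by contrast, lives in the audits/ folder as a named SQL query that selects the offending rows. A minimal sketch (the audit name and rule are made up for illustration):

AUDIT (
  name assert_non_negative_ids
);

SELECT *
FROM @this_model        -- resolves to the table being audited
WHERE id < 0            -- any returned row is flagged as "bad" data

Once defined, it can be referenced from a model's audits (...) list just like the built-in not_null above.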
A key difference in how constraints (beyond type conformance) are handled lies in the execution:
- dbt contracts ensure the resulting schema in the DWH matches the defined contract. Other constraints will either be enforced by the underlying engine (not all engines enforce all constraints) or through dbt test.
- SQLMesh audits run queries on the processed data after each run, before the data is promoted to the "official" version of the table. If an audit returns any rows (indicating a quality issue), the deployment of the new model version is blocked and the execution of the plan is halted.
For custom data validation:
- dbt offers data tests, which are queries that should return no rows. They are executed with a separate command, dbt test, and by default will not stop the materialisation process.
- SQLMesh offers non-blocking audits, which behave just like normal audits except that the deployment of the model doesn't stop if the audit fails. A sketch of both is shown below.
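For example, a dbt data test is simply a SELECT saved under tests/ that returns the rows violating the rule (the file and model names are illustrative):

-- tests/assert_no_negative_order_ids.sql
select *
from {{ ref('orders') }}
where id < 0

On the SQLMesh side, to my understanding the custom audit shown earlier can be made non-blocking by flagging it as such in its definition (e.g. a blocking false field in the AUDIT block), so failures are reported without halting the plan.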
Destination features
When using transformation tools, data teams often expect to delegate warehouse-specific technical configuration to the tool, such as:
- Table partitioning and clustering strategies.
- Managing grants (permissions) for users or roles.
- Setting up view permissions to abstract underlying data access for end-users.
In this regard, dbt is a very mature tool that already has extensive support for these functionalities through engine-specific configurations.
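For instance, on BigQuery a dbt model can declare partitioning, clustering, and grants directly in its config block (the column, group, and model names below are illustrative):

{{ config(
    materialized = 'table',
    partition_by = {'field': 'event_date', 'data_type': 'date'},
    cluster_by = ['customer_id'],
    grants = {'roles/bigquery.dataViewer': ['group:analysts@example.com']}
) }}

select * from {{ ref('stg_events') }}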
With SQLMesh, while some of these features might not have dedicated configuration parameters the way dbt does, similar outcomes can be achieved. SQLMesh allows executing queries or macros before or after a model is created; coupled with its powerful macro system (which can leverage Python, including libraries specific to the destination engine such as snowflake-connector-python or google-cloud-bigquery), this lets users implement custom solutions for these destination-specific settings.
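As a rough sketch of what that could look like, a statement placed after the model's query runs as a post-statement once the table is built; here it applies a grant, assuming a Snowflake-style GRANT syntax and an illustrative role name (whether to grant on the physical table or on the views of SQLMesh's virtual layer depends on your setup):

MODEL (
  name db.reporting_orders,  -- illustrative name
  kind FULL
);

SELECT
  id::INTEGER,
  amount::DOUBLE
FROM raw.orders;

-- post-statement: runs after the query above on each evaluation
GRANT SELECT ON TABLE @this_model TO ROLE reporting_role;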
All in all, both tools are very useful for data transformation, each with its own strengths and approach. After working with both of them, I have found that SQLMesh opens up more possibilities thanks to its flexible design: even when a feature is not implemented out of the box, its robust macro system and hooks often provide the means to engineer a tailored solution, offering a high degree of extensibility. Your choice between them will likely depend on specific project needs, existing team expertise, and which tool's philosophy aligns best with your data engineering practices.