Data transformation tools such as dbt or SQLMesh have become staples in many stacks across industries and data departments. These tools provide essential automation for managing multiple data models, making data more accessible and easier to query for other teams within a company. dbt Core was first released in 2016 and has continuously evolved with new features. However, its approach to certain fundamental operations can introduce complexities, which is an area SQLMesh aims to refine and improve upon. In this post, I will share my experience with both tools and discuss the advantages and limitations I found with each.
The basics
Both tools define models using SQL and manage model configurations and their metadata. dbt allows separating the model logic from its configuration, which can be stored in YAML files located alongside the models or elsewhere in the project's directory structure. SQLMesh opts for inline configuration and comments to describe the models, which I find more convenient, given that it works similarly to how you would document code.
dbt favors centralised configuration management, often through its dbt_project.yml file, with the ability to override global settings for individual models as needed.
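As a quick illustration of that centralised approach, a dbt_project.yml can set project-wide defaults and override them per folder (the project and folder names below are hypothetical):

models:
  my_project:                  # hypothetical project name
    +materialized: view        # project-wide default
    marts:
      +materialized: table     # override for models under models/marts/
      +tags: ["reporting"]

Settings closer to the model (a config block in the SQL file or a model-level YAML entry) take precedence over these project-wide defaults.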
State management
An important aspect of managing your data models is state tracking. This capability is crucial for automation, particularly for incremental models, as it reduces manual oversight and enables efficient data processing. While not as critical for models requiring a full refresh, incremental models often need to track the last processed entry or time window, representing the current state of the target model.
With dbt, the documentation recommends an approach like this:
{{
    config(
        materialized='incremental'
    )
}}

select
    *,
    my_slow_function(my_column)
from {{ ref('app_data_events') }}

{% if is_incremental() %}
  -- this filter is only applied on an incremental run
  where event_time >= (select coalesce(max(event_time), '1900-01-01') from {{ this }})
{% endif %}
This implies that every time the model is executed, a query is triggered to retrieve the state from the end model. While fetching state from the end model ensures accuracy, it can lead to increased costs depending on table partitioning and clustering: if the end table isn't partitioned by event_time, querying it could be more expensive due to full table scans or higher compute costs.
An alternative approach involves tracking this state in a dedicated table, ensuring consistent awareness of the last processed entry for easier recovery and reprocessing. SQLMesh adopts such an approach. Let's look at an example from their documentation:
MODEL (
  name db.events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  )
);

SELECT
  event_date::TEXT as event_date,
  event_payload::TEXT as payload
FROM raw_events
WHERE
  event_date BETWEEN @start_ds AND @end_ds;
Notice the absence of a direct subquery. So, why event_date BETWEEN @start_ds AND @end_ds? By tracking the processed data intervals separately, SQLMesh can more easily manage scenarios where you might need to restate or backfill the model for specific time ranges. For instance:
- The source data in the DWH wasn't refreshed on time due to a replication failure upstream.
- The source data was corrupted by a backend error and was subsequently fixed manually.
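In cases like these, the affected intervals can be reprocessed by requesting a restatement plan. As a rough sketch (the model name and dates are illustrative, and the exact flags may vary between SQLMesh versions):

sqlmesh plan --restate-model db.events --start 2024-01-01 --end 2024-01-03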
SQLMesh internally tracks the processed time ranges for each model. When a model with a defined cron schedule runs, SQLMesh automatically adjusts the values of @start_ds and @end_ds (or similar temporal macros like @start_ts / @end_ts, depending on the column type) based on the execution time. An hourly model will only process data from the previous hour. These macros are suffixed with the type of the time filter.
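To make that concrete, here is a minimal sketch of an hourly incremental model using the timestamp-suffixed macros (the model name, source table, and columns are illustrative):

MODEL (
  name db.hourly_events,              -- illustrative name
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_ts
  ),
  cron '@hourly'                      -- each run covers the previous hour's interval
);

SELECT
  event_ts::TIMESTAMP AS event_ts,
  event_payload::TEXT AS payload
FROM raw_events
WHERE
  event_ts BETWEEN @start_ts AND @end_ts;  -- timestamp-typed macros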
Data quality
Ensuring that data is accurate and conforms to the organisation's data contracts is essential to data quality. dbt and SQLMesh provide different mechanisms for this.
Testing the transformations
Both tools offer ways to unit test transformation logic, a common practice in Software Engineering, where the expected results of transformations are verified with mock input data. dbt calls them unit tests, and SQLMesh calls them tests.
These tests are executed on demand (e.g. locally or in CI), and aim to verify that individual transformations work as expected before deployment. A good practice is to include these tests as part of a CI/CD pipeline.
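For reference, here is roughly what a SQLMesh test looks like: a YAML file under tests/ that feeds mock rows to the model's upstream dependencies and asserts on the query's output (the model, table, and column names below are illustrative):

test_orders_per_item:
  model: sqlmesh_example.full_model          # model under test (illustrative)
  inputs:
    sqlmesh_example.incremental_model:       # mocked upstream model
      rows:
        - id: 1
          item_id: 1
        - id: 2
          item_id: 1
  outputs:
    query:                                   # expected result of the model's query
      rows:
        - item_id: 1
          num_orders: 2

dbt unit tests follow a very similar idea, with given/expect blocks declared in the model's YAML.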
Data contracts
Data contracts help guarantee that models produce data conforming to predefined schemas (types and constraints), maintaining consistency for end-user applications.
In dbt, this can be done through a contract, where columns and their types and constraints are specified:
models:
  - name: dim_customers
    config:
      materialized: table
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
          - type: custom
            expression: "tag (my_tag = 'my_value')"
      - name: customer_name
        data_type: string
      - name: non_integer
        data_type: numeric(38,3)
SQLMesh offers a somewhat different approach that combines explicit casting within the model's SQL with audits. While SQLMesh models also define column types (e.g., id::INTEGER), which leads to casting, more complex data quality checks are typically handled by audits. Built-in audits (like not_null) can be referenced directly in the MODEL definition, and custom audits (SQL queries that identify "bad" data) can be defined in the audits/ folder.
MODEL (
  name sushi.orders,
  audits (
    not_null(columns := (id, customer_id, waiter_id))
  )
);

SELECT
  id::INTEGER,
  customer_id::INTEGER,
  waiter_id::INTEGER,
  table::INTEGER,
  products::ARRAY<STRUCT<product STRING, amount INTEGER>>
FROM raw.orders  -- source table omitted in the original snippet; the name here is illustrative
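A custom audit, by contrast, lives in the audits/ folder as a named SQL query that selects the offending rows. A minimal sketch (the audit name and rule are made up for illustration):

AUDIT (
  name assert_non_negative_ids
);

SELECT *
FROM @this_model        -- resolves to the table being audited
WHERE id < 0            -- any returned row is flagged as "bad" data

Once defined, it can be referenced from a model's audits (...) list just like the built-in not_null above.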
A key difference in how constraints (beyond type conformance) are handled lies in the execution:
- dbt contracts ensure the resulting schema in the DWH matches the defined contract. Other constraints will either be enforced by the underlying engine (not all engines enforce all constraints) or through dbt test.
- SQLMesh audits run queries on the processed data after each run, before the data is promoted to the "official" version of the table. If an audit returns any rows (indicating a quality issue), the deployment of the new model version is blocked and the execution of the plan is halted.
For custom data validation:
- dbt offers data tests, which are queries that should return no rows. They are executed with a separate command, dbt test, and by default will not stop the materialisation process.
- SQLMesh offers non-blocking audits, which behave just like normal audits except that the deployment of the model doesn't stop if the audit fails. A sketch of both is shown below.
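For example, a dbt data test is simply a SELECT saved under tests/ that returns the rows violating the rule (the file and model names are illustrative):

-- tests/assert_no_negative_order_ids.sql
select *
from {{ ref('orders') }}
where id < 0

On the SQLMesh side, to my understanding the custom audit shown earlier can be made non-blocking by flagging it as such in its definition (e.g. a blocking false field in the AUDIT block), so failures are reported without halting the plan.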
Destination features
When using transformation tools, data teams often expect to delegate warehouse-specific technical configuration to the tool, such as:
- Table partitioning and clustering strategies.
- Managing grants (permissions) for users or roles.
- Setting up view permissions to abstract underlying data access for end-users.
In this regard, dbt is a very mature tool that already has extensive support for these functionalities through engine-specific configurations.
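For instance, on BigQuery a dbt model can declare partitioning, clustering, and grants directly in its config block (the column, group, and model names below are illustrative):

{{ config(
    materialized = 'table',
    partition_by = {'field': 'event_date', 'data_type': 'date'},
    cluster_by = ['customer_id'],
    grants = {'roles/bigquery.dataViewer': ['group:analysts@example.com']}
) }}

select * from {{ ref('stg_events') }}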
With SQLMesh, while some of these features might not have dedicated configuration parameters the way dbt does, similar outcomes can be achieved. SQLMesh allows executing queries or macros before or after a model is created; coupled with its powerful macro system (which can leverage Python, including libraries specific to the destination engine such as snowflake-connector-python or google-cloud-bigquery), this lets users implement custom solutions for these destination-specific settings.
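As a rough sketch of what that could look like, a statement placed after the model's query runs as a post-statement once the table is built; here it applies a grant, assuming a Snowflake-style GRANT syntax and an illustrative role name (whether to grant on the physical table or on the views of SQLMesh's virtual layer depends on your setup):

MODEL (
  name db.reporting_orders,  -- illustrative name
  kind FULL
);

SELECT
  id::INTEGER,
  amount::DOUBLE
FROM raw.orders;

-- post-statement: runs after the query above on each evaluation
GRANT SELECT ON TABLE @this_model TO ROLE reporting_role;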
All in all, both tools are very useful for data transformation, each with its own strengths and approach. After working with both of them, I have found that SQLMesh opens up more possibilities thanks to its flexible design: even when a feature is not implemented out of the box, its robust macro system and hooks often provide the means to engineer a tailored solution, offering a high degree of extensibility. Your choice between them will likely depend on specific project needs, existing team expertise, and which tool's philosophy aligns best with your data engineering practices.