A practical introduction to the dbt Semantic Layer

Oct 28, 2023 · Shiv Gupta

The dbt Semantic layer has gone through several milestones: its conception with the “dbt needs to know about metrics” GitHub issue, followed by the dbt Metrics package and the first Semantic Layer implementation, the acquisition of Transform by dbt Labs, and the revamped dbt Semantic Layer that exists today and which is the scope of this article.

The dbt Semantic Layer shows promise in helping organizations converge to a shared and consistent definition of metrics. The core principle is this: once a business metric is defined in dbt, it is discoverable and queryable to end-users through a suite of data products (business intelligence platforms, notebook-style analyses, ML models, data apps). Centralizing metric definitions inside dbt means metrics are calculated with consistent logic regardless of the data product from which they are queried.

This is made possible through two components of the dbt Semantic Layer: metadata APIs that expose metric definitions, and MetricFlow, the engine which compiles metric queries into SQL based on a semantic model defined on top of dbt models. The compiled SQL is then issued to the data warehouse and results are returned to the end-user. With this pattern, the logic of calculating a metric is abstracted away from the data consumer, making querying it incorrectly extremely difficult, if not impossible.

“My number doesn’t match yours” conversations tend to happen because metrics are calculated inconsistently across data products – often due to code duplication and logic mismatch – and the dbt Semantic Layer addresses that.

Querying metrics is a declarative pattern

In my mind, querying metrics using the dbt Semantic Layer is a declarative pattern analogous to SQL. In SQL, the client specifies the result it wants to obtain, offloading the heavy lifting of how to obtain that result to the underlying SQL engine.

In the case of the dbt Semantic Layer, the client is the data consumer querying a metric while the server responsible for receiving the request, compiling it into SQL to send to the data warehouse, and returning the response is MetricFlow.

The data consumer specifies the metrics it wants to query while the dbt Semantic Layer – which has awareness of the entities, dimensions, and measures that make up the metrics – takes care of compiling the request into consistent SQL that is sent to the data warehouse and returned to the data consumer.

Query metrics, not tables

Typically, in the absence of a semantic layer, data consumers compute metrics from a data warehouse by writing a SQL query or by using a BI tool which is running SQL queries on their behalf. There are several issues with this pattern. First, data consumers are not likely to use consistent SQL queries or tables to arrive at the result. For example, when there is no consensus on using Column X in Table A vs Columns Y and Z in Table B to calculate a metric, the metric calculation ends up being inconsistent across the org.

Second, when computing the metric requires multiple tables, consumers need to be intimately familiar with the design of the data warehouse to know the right tables to join and the right columns to use. The alternative is that someone has already figured out this join and exposed it as a pre-aggregated table. But, pre-aggregated tables are inflexible. If consumers needs the metric at a different granularity or with a different set of dimensions, they are forced to write their own SQL to compute the metric, exposing themselves to a risk of calculating it inconsistently.

Ultimately, what ends up happening in practice is duplication of code across data products to calculate the same metric, resulting in inconsistency and “my number doesn’t match yours” conversations.

The dbt Semantic Layer addresses some of these pain points by streamlining and centralizing the process of defining and consuming metrics. Data consumers interested in metrics no longer need join or query tables directly – they query a metric instead. And because the dbt Semantic Layer performs dynamic joins and aggregations, metrics are computed at the granularity specified by the consumer at query time, bypassing the need for pre-aggregated tables.

Consumers don’t need to be familiar with tables in the data warehouse to query metrics confidently as long as the semantic layer is set up for them – they simply query a metric from a BI tool, SQL IDE, or GraphQL – and offload the heavy lifting of generating the SQL with the right tables, columns, joins, and aggregations to MetricFlow. With full context of the entities, measures, and dimensions making up dbt models, the dbt Semantic Layer reduces the likelihood of inconsistently calculated metrics.

Implementing the dbt Semantic Layer

At the risk of stating the obvious, the semantic layer is best implemented on top of a data platform with well-defined facts and dimensions, which serve as the foundational components of a semantic layer. Broadly, implementing the dbt Semantic Layer involves the following steps.

Define semantic models in YAML. Each semantic model consists of the following: a reference to a dbt model, and the entities (e.g. customers), dimensions (e.g. locations), and measures (e.g. revenue) in it.
Define metrics in YAML. Each metric references a semantic model, in addition to a type which defines the kind of metric (simple, cumulative, derived, or ratio), and additional parameters depending on the type. Optionally, a filter string will filter your dimensions or entities at query time. The label for the metric will populate a name for the metric across all data consumers querying it.
Run a production job in dbt Cloud to deploy your metric definitions and make them available to data consumers.

Connecting to the dbt Semantic Layer

The dbt Semantic Layer integrates natively with a number of BI tools including Tableau (beta) and Mode.

Additionally, any tool with a generic JDBC driver that allows writing SQL can be used to query the dbt Semantic Layer. As an example, when the below query is executed, it is submitted to the MetricFlow Server which compiles it into SQL that is sent to the data warehouse for execution. Syntax aside, this feels like an improvement over having to figure out which tables to use for each metric and writing out the SQL necessary to represent all metrics with consistent dimensions.

select 
    * 
from {{
    semantic_layer.query(
        metrics=[
            'food_order_amount', 
            'order_gross_profit',
            'revenue',
            'customer_acquisition_cost'
        ],
        group_by=[
            Dimension('metric_time').grain('month'),
            'customer__customer_type'
        ]
    )
}}

Lastly, the GraphQL API is available to query metrics and their definitions for use in data applications.

The post-dbt Semantic Layer world

The dbt Semantic Layer may seem like a no-brainer, but from a product perspective, I anticipate adoption to be slow. Given the current economic backdrop of tightening budgets and the fact that the dbt Semantic Layer is only available to dbt Cloud customers, dbt Core customers will be especially challenging to onboard. Besides, there is considerable effort involved in setting up the dbt Semantic Layer and championing its adoption internally.

Nevertheless, mature data teams with a variety of data consumers are likely to benefit from investing resources and effort into centralizing their metrics in the dbt Semantic Layer. In a post-dbt Semantic Layer world, I anticipate the focus of conversations around metrics to gradually shift from questioning their accuracy to evaluating their effectiveness in measuring business outcomes and whether they can inspire corrective action. With adoption of the dbt Semantic Layer, the return of investment in data assets increases as the cost to develop metrics and make them available to consumers – in a way that is DRY, consistent, and discoverable – diminishes.

If you or your team is considering implementing the dbt Semantic Layer – or has chosen to hold off or use alternatives (in Looker, PowerBI, or elsewhere) – I would love to hear your thoughts.