What’s in the box?

Black box testing for non-data engineers with DBT

Black box testing is a software testing method in which the functionality of an application is tested without knowledge of its internal code structure, implementation details, or internal paths. Let's borrow that term and apply the same analogy to our own black boxes: our dbt models.

So, adapting the lexicon from software engineering, we have:

Black Box: the dbt model that plays the role of transformation

Input(s): the different tables that are used in the query. In dbt we call these sources or references

Output: the table formed after the dbt model has transformed the data

DBT and testing

DBT stands for "data build tool". We use it, thanks to its simplicity, to perform transformations in our Snowflake data cloud for analytical purposes. Not to overstate the matter, but we love it.


DBT already offers dbt tests, and running them is a great way to test the data in your tables. By default, the available tests are unique, not_null, accepted_values, and relationships. We can even create custom tests, and there are a variety of packages out there that extend dbt with additional tests, such as Great Expectations and dbt-utils. These kinds of tests examine the values in your tables, and they are a great way to identify any critical data quality issues. In other words, dbt tests look at the output. What we want, however, is to test the black box itself: the transformation.
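For reference, the built-in tests are declared in a model's yml file. A minimal sketch (the model and column names here are made up for illustration):

```yaml
version: 2

models:
  - name: orders                  # hypothetical model
    columns:
      - name: order_id
        tests:
          - unique                # no duplicate ids
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - relationships:        # every customer_id exists in customers
              to: ref('customers')
              field: id
```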

TDD and Data

Working with Large Tables

More often than not, the tables we need to build models upon are huge, and accessing billions of rows and performing transformations on them takes a long time. A 30-minute transformation might be acceptable as part of a production pipeline, but having to wait half an hour to develop and test the correctness of your transformation is, well, less than ideal.

Of course you are eventually going to run it against the real table, but minimizing the number of runs makes everyone happy. It also limits your Snowflake warehouse usage, which saves cost and makes the accountants happy as well.

Edge cases not covered in actual data

Another problem we often face is a dbt model that works, for all intents and purposes, for months, until we discover cases we didn't think of. Unsurprisingly, with billions of rows of data, all the possible scenarios are not at all easy to cover. If only there was a way to test for those cases as well. The solution we use at Orfium is to generate mock data. The data may not be real, but it works well enough to cover our edge cases and future-proof our dbt projects.

Good Tests VS Bad Tests

Writing tests for the sake of writing them is worse than not writing them at all. There, we said it. 

Let's face it: how many times do we introduce tests on a piece of software, get excited, and, thanks to the quick TDD loop, gleam with self-confidence? Before we know it, we're writing tests that have no value at all and chasing a fantastic metric called coverage. Coverage is important, but not as a single metric; it is only a first indication and should not be a goal in itself. Good tests are the ones that provide value. Bad tests, on the other hand, only add to the debt and the maintenance burden. Remember, tests are a means to an end. To what end? Writing robust code.

Tests as a requirements gathering tool

How many times have we found ourselves sitting in a room with a stakeholder who describes a new report they need? We start formulating questions, and after some back and forth we eventually reach the final requirements of the report. Happy with the meeting, we head to our favorite warehouse, only to discover some flaw in the original request that nobody thought of during requirements gathering. Working in an agile environment, that's no big issue: we schedule a follow-up meeting, reach a consensus on the edge cases, and final delivery is achieved. Still, wouldn't it be better if actual cases could be drafted in that first meeting? Business and engineering minds often don't mesh well, so we can use all the help we can get.

Establishing concrete scenarios of what a table could look like and what the result should be helps a lot in the process of gathering requirements.

Consider the following imaginary scenario:


Stakeholder: For our table that contains the daily revenue for all videos, I would like a monthly summary of revenue per video for the advertisement category.

Engineer (gotcha):

select
    video_id,
    year(date_rev) as year,
    month(date_rev) as month,
    sum(revenue) as revenue
from
    fct_videos_rev
where
    category = 'advertisement'
group by
    video_id,
    year(date_rev),
    month(date_rev)



Stakeholder: I would also like to see how many records each summation comprised.

Engineer (gotcha):

select
    video_id,
    year(date_rev) as year,
    month(date_rev) as month,
    sum(revenue) as revenue,
    count(*) as counts
from
    fct_videos_rev
where
    category = 'advertisement'
group by
    video_id,
    year(date_rev),
    month(date_rev)



Stakeholder: That can't be right. Why so many counts?


Engineer: There are many rows with zero revenue, I see. You don't want them to count towards your total count, is that right?



Engineer (gotcha):

select
    video_id,
    year(date_rev) as year,
    month(date_rev) as month,
    sum(revenue) as revenue,
    count(*) as counts
from
    fct_videos_rev
where
    category = 'advertisement'
    and revenue > 0
group by
    video_id,
    year(date_rev),
    month(date_rev)


Of course, this is an exaggerated example. However, imagine if the same dialog went a different way.


Stakeholder: For our table that contains the daily revenue for all videos, I would like a monthly revenue summary per video for the advertisement category.


Engineer: If the table has the form: (example input table)


Is the result you want like the following? (example output table)



Stakeholder: I would also like to see how many records each summation comprised.

Engineer: So you want the result to look like this: (example output table)



Stakeholder: Why does the first row have 2 counts?


Engineer: There are two rows with zero revenue, I see. You don't want them to count towards your total count, is that right?



Engineer (gotcha): the same query as before, now filtering out the zero-revenue rows.


And all that without writing a single line of code. Not that an engineer is afraid of writing SQL queries. But honestly, a lot of time is lost translating business requirements into SQL; the requirements are never that simple, and the first attempt is almost never correct either.

Tests so software engineers can get onboard in SQL

Orfium is a company which, at the time of writing this post, consists of more than 150 engineers. Only 6 of those are data engineers. That might sound strange, given that we are a data-heavy company dealing with billions of rows of data every month. So, a new initiative emerged, called data mesh. It is a program we practice on a daily basis and are super proud of. One consequence of data mesh is that multiple teams handle their own instance of dbt. But this will be discussed in detail in another post. Stay tuned!

For the most part, software engineers are not used to writing complex SQL queries. That's not their fault, given the variety of ORM tools available. However, something software engineers do know how to do very well is write tests.

Practicing test-driven development when writing SQL is a way to bridge that gap and help a lot of engineers get on board.

Let the fun begin

We designed a way to test dbt models (the black box). Our main drivers were:

  • Introduce as few changes as possible, so that new or mature projects can adopt it without breaking existing behavior.
  • Provide a way to define test scenarios and identify which of them failed.

We start by introducing the following macros:

{%- macro ref_t(table_name) -%}
    {%- if var('model_name','') == this.table -%}
        {%- if var('test_mode',false) -%}
            {%- if var('test_id','not_provided') == 'not_provided' -%}
                {%- do exceptions.warn("WARNING: test_mode is true but test_id is not provided, rolling back to normal behavior") -%}
                {{ ref(table_name) }}
            {%- else -%}
                {%- do log("stub ON, replace table: ["+table_name+"] --> ["+this.table+"_MOCK_"+table_name+"_"+var('test_id')+"]", info=True) -%}
                {{ ref(this.table+'_MOCK_'+table_name+'_'+var('test_id')) }}
            {%- endif -%}
        {%- else -%}
            {{ ref(table_name) }}
        {%- endif -%}
    {%- else -%}
        {{ ref(table_name) }}
    {%- endif -%}
{%- endmacro -%}


{%- macro source_t(schema, table_name) -%}
    {%- if var('model_name','') == this.table -%}
        {%- if var('test_mode',false) -%}
            {%- if var('test_id','not_provided') == 'not_provided' -%}
                {%- do exceptions.warn("WARNING: test_mode is true but test_id is not provided, rolling back to normal behavior") -%}
                {{ builtins.source(schema,table_name) }}
            {%- else -%}
                {%- do log("stub ON, replace table: ["+schema+"."+table_name+"] --> ["+this.table+"_MOCK_"+table_name+"_"+var('test_id')+"]", info=True) -%}
                {{ ref(this.table+'_MOCK_'+table_name+'_'+var('test_id')) }}
            {%- endif -%}
        {%- else -%}
            {{ builtins.source(schema,table_name) }}
        {%- endif -%}
    {%- else -%}
        {{ builtins.source(schema,table_name) }}
    {%- endif -%}
{%- endmacro -%}

These macros optionally override the behavior of the built-in ref and source macros:

  • model_name: the model actually being tested
  • test_mode: a flag indicating whether test mode is enabled
  • test_id: the test scenario that is going to be mocked
  • table_name (argument): the source table that is either used as the true source or stubbed with one of our own.
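To make the substitution concrete: when test mode is on, ref_t and source_t swap a real input for a seed whose name is built from the model, the stubbed table, and the test_id. The naming logic, reproduced outside Jinja with the example names used later in this post:

```shell
# Reproduce the mock-table name that ref_t/source_t build in test mode.
model_name=VIDEOS_INFO_SUMMARY                # this.table: the model under test
table_name=VIDEOS_INFO                        # the input table being stubbed
test_id=MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE     # the test scenario
echo "${model_name}_MOCK_${table_name}_${test_id}"
```

With test_mode off, or when model_name does not match the current model, the macros fall back to the plain ref/source behavior.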

Prefer multiple small test cases over few large test cases

Each test case should test something specific. Generating mock data with hundreds of records that test multiple business rules should be avoided: when a test case fails, it should be easy to identify the cause and its impact.

Suppose we would like to create a test_id named MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE for our model VIDEOS_INFO_SUMMARY, which uses a source VIDEOS_INFO.

We create a new folder under seeds named MOCK_VIDEOS_INFO_SUMMARY, then:

  1. We create the input seed seeds/MOCK_VIDEOS_INFO_SUMMARY/VIDEOS_INFO_SUMMARY_MOCK_VIDEOS_INFO_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE.csv, which plays the role of the input.
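Such an input seed could look like this (made-up rows; we assume the VIDEOS_INFO source exposes these four columns):

```csv
VIDEO_ID,DATE_REV,REVENUE,CATEGORY
v1,2022-01-01,10.0,advertisement
v1,2022-01-02,0.0,advertisement
v1,2022-01-03,0.0,advertisement
v2,2022-01-15,5.0,advertisement
v2,2022-01-16,7.0,music
```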





  2. We create the output seed seeds/MOCK_VIDEOS_INFO_SUMMARY/VIDEOS_INFO_SUMMARY_MOCK_RESULTS_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE.csv, which plays the role of the output we would like to get once the model runs.
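The output seed holds the rows we expect the model to produce for this scenario, e.g. (made-up values; the columns mirror the model's output):

```csv
VIDEO_ID,YEAR,MONTH,REVENUE,COUNTS
v1,2022,1,10.0,1
v2,2022,1,5.0,1
```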



  3. We also create a yml file seeds/MOCK_VIDEOS_INFO_SUMMARY/VIDEOS_INFO_SUMMARY.yml as follows:

version: 2

seeds:
  - name: VIDEOS_INFO_SUMMARY_MOCK_VIDEOS_INFO_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE
    config:
      enabled: "{{ var('test_mode', false) }}"
  - name: VIDEOS_INFO_SUMMARY_MOCK_RESULTS_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE
    config:
      enabled: "{{ var('test_mode', false) }}"

Notice that the seeds are enabled only in test_mode. This allows us to skip creating them during default runs.

  4. Now we define the test inside our model's yml definition:

models:
  - name: VIDEOS_INFO_SUMMARY
    description: "Summary of VIDEOS_INFO"
    tests:
        - dbt_utils.equality:
            compare_model: ref('VIDEOS_INFO_SUMMARY_MOCK_RESULTS_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE')
            tags: ['test_VIDEOS_INFO_SUMMARY_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE']
            compare_columns:
              - VIDEO_ID
              - YEAR
              - MONTH
              - REVENUE
              - COUNTS
            enabled: "{{ var('test_mode', false) }}"
  5. Our model:

{{
    config
    (
        materialized = 'table'
    )
}}

select
    VIDEO_ID,
    year(DATE_REV) as YEAR,
    month(DATE_REV) as MONTH,
    sum(REVENUE) as REVENUE,
    count(*) as COUNTS
from
    {{ source_t('MY_SCHEMA','VIDEOS_INFO') }}
where
    CATEGORY = 'advertisement'
    and REVENUE > 0
group by
    VIDEO_ID,
    year(DATE_REV),
    month(DATE_REV)
Notice the use of source_t instead of the default source macro.

Now, in order to run the test, we go through the following steps:

  1. Load up our seeds as:

dbt seed --full-refresh -m MOCK_VIDEOS_INFO_SUMMARY --vars '{"test_mode":true}'

  2. Then execute our model as:

dbt run -m VIDEOS_INFO_SUMMARY --vars '{"test_mode":true,"test_id":"MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE","model_name":"VIDEOS_INFO_SUMMARY"}'

  3. And then execute dbt test to check whether our black box behaved as it should:

dbt test --select tag:test_VIDEOS_INFO_SUMMARY_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE --vars '{"test_mode":true,"test_id":"MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE","model_name":"VIDEOS_INFO_SUMMARY"}'

Note: because the whole process is a bit tedious, with all those big commands, we wrote a bash script which automates all three steps.

The only requirement is a file conf_test/tests_definitions.csv with the following format:



  1. The script reads this file and executes all the tests defined in it, in order.
  2. Executing the tests of only a specific model is supported by passing the -m flag: ./dbt_test.sh -m VIDEOS_INFO_SUMMARY
  3. Executing a specific test case is supported by passing the -t flag: ./dbt_test.sh -t MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE
  4. Lines that start with # are skipped.
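The script itself is not reproduced in this post, but its core loop can be sketched as follows. This is a dry-run sketch that prints the dbt commands instead of executing them, and it assumes a simple model_name,test_id format for the definitions file:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo input; in the real setup this file lives at conf_test/tests_definitions.csv.
cat > tests_definitions.csv <<'EOF'
# lines starting with # are skipped
VIDEOS_INFO_SUMMARY,MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE
EOF

# For every "model,test_id" pair, emit the seed / run / test commands in order.
while IFS=, read -r model test_id; do
  case "$model" in "#"*|"") continue ;; esac   # skip comments and blank lines
  vars="{\"test_mode\":true,\"test_id\":\"$test_id\",\"model_name\":\"$model\"}"
  echo "dbt seed --full-refresh -m MOCK_$model --vars '{\"test_mode\":true}'"
  echo "dbt run -m $model --vars '$vars'"
  echo "dbt test --select tag:test_${model}_${test_id} --vars '$vars'"
done < tests_definitions.csv | tee commands.txt
```

Replacing the echoes with actual command execution (and pointing the loop at conf_test/tests_definitions.csv) gives the automated flow.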

In the whole setup described above, there are some conventions that must be followed, otherwise the script and macros might not work:

  1. The seed folder must be named MOCK_{model_we_test}
  2. The seed which plays the role of the input must be named {model_we_test}_MOCK_{model_we_stub}_{test_id}
  3. The seed which holds the wanted result must be named {model_we_test}_MOCK_RESULTS_{test_id}

All the code is available in the following repo: https://github.com/vasilisgav/dbt_tdd_example

To see it in practice:

  • set up a tdd_example profile
  • make sure you run dbt deps to install dbt_utils
  • make the script executable chmod +x dbt_test.sh
  • and finally execute the script ./dbt_test.sh


Working with this approach delivered what is expected from any TDD approach, and the result was a big win in how we release our dbt models:


  • models have become quite clean, with their business logic clearly depicted
  • business rules, and especially changes to them, can easily be verified
  • business voids are identified faster
  • business requirements are gathered in a cleaner, more efficient way
  • development is quicker; it may be surprising, but since we deal with billions of rows, the fewer runs we perform against the fully loaded tables, the faster development goes
  • regression tests are handled by our GitHub Actions, ensuring our models behave as expected (multiple puns here 😀)
  • QA can happen independently of our development
  • warehouse usage is limited


There are, of course, a couple of drawbacks:

  • source tables with many columns are sometimes cumbersome to mock, although columns that are not selected by the model do not need to be defined in the mock CSVs
  • it's somewhat difficult to get started

So, what are the key takeaways? Testing is important, but good, smart testing can truly free an organization from a lot of daily tedium and allow it, as it has allowed us, to focus on serving the business efficiently and with the least amount of friction.

Vasilis Gavriilidis

Senior Data Engineer @ ORFIUM