Data Engineering

What makes dbt (Data Build Tool) so popular?

What is all the fuss about this revolutionary tool?

Steven Wallace


Introduction

Many years ago I entered the data engineering world using a tool called Cognos Data Manager. Before that it was called DecisionStream, created and developed by a small company called Relational Matters, based in Guildford, UK. Around 2002 it was acquired by Cognos, which was in turn acquired by IBM. The tool no longer exists and has been superseded by IBM DataStage.

Anyway, this was a very old ETL tool that was simple but powerful. You created a connection to a data source (many were available and built into the tool) and wrote a SQL query to pull data from those sources, which could include transformations if necessary. Then there was a transformation area, where you could add derivations or expressions, and finally a load area, where you decided how the data would be delivered to the target table (e.g. truncate and load, or update/insert). You could create many ‘builds’, put them into a ‘job stream’, set their dependencies and run them via the command line.

A screen capture of Cognos Data Manager build
IBM Cognos Data Manager

It was the best tool: the learning curve was low if you knew SQL and, because it was SQL, it was powerful. It had other features too that helped generate dimensions and facts, handle surrogate keys and support user-defined functions. It also generated documentation automatically!

Since then, ETL has fallen out of favour due to the compute it required and has evolved into ELT. I used a fair few other tools, mainly built for integration and interfacing, but they were all GUI tools, often complex, and somehow lacking in simple features. Maybe Matillion comes close. None of them matched the simplicity and power of Cognos Data Manager until…

Enter dbt (Data Build Tool) and the modern data stack

dbt reminds me of that original tool. Connections, SQL, integration types. It’s all there, it’s all simple, it all makes sense.

dbt is essentially a command line tool with a code repository behind it. That code can be made up of SQL, Jinja, YAML and also Python if you need it.

Admittedly there is a bit more of a learning curve, as you will have to learn software engineering best practices and a couple of other very basic languages, but this is a good thing. It keeps code consistent and efficient, makes change easier to manage, and simplifies deployments.

It’s powerful because it’s code, and it’s even more powerful because it extends SQL with Jinja, making SQL dynamic. You can declare dependencies within the SQL code itself, without the need for additional ‘packages’.
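As a minimal sketch of what this looks like (the model and column names here are hypothetical), a dbt model is just a SQL file, and the built-in `ref` function both compiles to the target table name and registers the dependency between models:

```sql
-- models/orders_enriched.sql (hypothetical model name)
-- {{ ref(...) }} compiles to the physical table name in the warehouse
-- and tells dbt this model depends on stg_orders and stg_customers,
-- so dbt can build them in the right order.
select
    o.order_id,
    o.order_date,
    c.customer_name
from {{ ref('stg_orders') }} as o
join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id
```

From these `ref` calls dbt derives the full dependency graph of the project, which is also what powers its lineage view.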

Testing is also done in SQL and can be managed separately, making tests easy to understand and develop.
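For example, dbt’s generic tests are declared in YAML against a model’s columns (the file, model and column names below are hypothetical); each declaration compiles to a SQL query that fails if it returns any rows:

```yaml
# models/schema.yml (hypothetical file)
version: 2

models:
  - name: orders_enriched
    columns:
      - name: order_id
        tests:
          - unique      # compiles to SQL checking for duplicate order_id values
          - not_null    # compiles to SQL checking for null order_id values
```

Running `dbt test` executes every declared test against the warehouse and reports any failures.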

Its configuration is done in YAML files, and documentation is generated automatically.
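Project-level configuration lives in a `dbt_project.yml` file at the root of the repository; a minimal sketch (the project and profile names are placeholders) looks like this:

```yaml
# dbt_project.yml
name: my_project
version: '1.0.0'
profile: my_profile        # connection details live separately in profiles.yml

models:
  my_project:
    +materialized: view    # default materialization for every model in the project
```

Running `dbt docs generate` then builds a browsable documentation site from these YAML files and the model code, including the lineage graph.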

What dbt doesn’t do, however, is move any data. That part is decoupled and handled by a tool like Fivetran or Airbyte, which extracts and loads data from various sources into a data warehouse like Snowflake or Databricks Lakehouse. dbt sits on top and performs only the transformation, using the warehouse’s built-in query engine (EL, then T).

Conclusion

dbt continues to grow rapidly: as of the end of 2023 there were 33,000 teams using it. So why is dbt so good? Because, like its ancestor, it focuses on modelling data and makes that as simple as possible using a language universally known in the industry (SQL). It also adds features for real-life business needs: data lineage and documentation, testing, change management, data lifecycle, data contracts and more. It means anyone in a business can get closer to the data and, if they know SQL, maybe even produce some analysis themselves.
