Recursive queries for levelling up your hierarchies
Steven Wallace
Introduction
At some point in a data or analytics engineer's career, there will be a need to query hierarchical data. In this blog I will explain how to do this in various ways and which way may be best for building a dimension.
What is hierarchical data?
Hierarchical data is data (usually rows) which are connected to each other in a parent and child relationship which if visualised would represent a tree like structure similar to how a directory structure would look. One of the most common examples of hierarchical data would be a calendar table containing days, weeks, months and years.
A lot of the time this sort of data may be stored in transactional systems as a self joining table. This means that the table will have a parent column and a child column. With basic SQL knowledge these join together to give you levels.
For example in the below query, there is a table in the database called spatial which has only 3 columns and contains all of the components of a house.
The Challenges
This query is a very basic example with data that we have knowledge about, but what if:
- There are a significant amount of levels.
- We don’t know how many levels may exist.
- The number of levels can change causing us to rewrite the query.
For this there are two main methods of querying.
Connect By (Prior)
This method acts like a join and processes the hierarchy one level at a time. It allows each level to refer to data in the prior level.
The below query is an example of the connect by query method based on our spatial table.
After the ‘FROM’ clause, we specify which level to start with at the top of the hierarchy, in this case: ‘House’.
Then we include the ‘CONNECT BY’ subclause which works in a similar way to a join, however unlike a join, we can include the ‘PRIOR’ keyword here which specifies that the value should be taken from the higher or parent level.
When using ‘CONNECT BY’ there are some pseudo-columns which can also be used:
- level - shows the level of the hierarchy, based on the connect by config
- sys_connect_by_path() - which allows you to create a path string
- connect_by_root - which displays the root node of the hierarchy for that row
Recursive CTE
This method is a type of CTE that references itself in a loop-like fashion. It enables you to join all the levels in a hierarchy without knowing how many total levels there are.
The below query is an example of the recursive CTE method based on our spatial table. It is made up of two parts, the Anchor Query and the Recursive Query
The CTE starts like any other but includes the keyword ‘RECURSIVE’ after the WITH. It is necessary to also specify the columns.
The first part is the Anchor Query which is where we select the top level of our hierarchy, this query is static and only runs once.
The second part is the Recursive Query which is where we select the rows below the top level, this query is run for each level in the data.
The Main Query then selects the results of both the Anchor Query and the cumulative Recursive query.
There is a little more flexibility with the columns in a Recursive CTE as we can specify the top ‘level’ column and iterate the lower levels with ‘level’ + 1. The same with the path column. This is especially powerful for the use case where you may want to generate data, for example adding days to a date to create a calendar. This is limited in the CONNECT BY subclause as the START WITH clause cannot add columns to those already in the row.
Turning this into a dimension using dbt (Data Build Tool)
In dbt, once you have queried your data using the above techniques, save it to a model (e.g. spatial_entity.sql) and you can then refer to that model to create a dimension. This can be done dynamically using the dbt_utils package if you so wish. The caveat being that you still need to create the model properties YAML file.
Using Jinja within dbt we can use dbt_utils.get_column_values to retrieve the values from the level column in our spatial_entity model we created before. As this is just a list, we then need to modify it slightly by concatenating single quotes and commas to produce a comma separated value list which will be required for the pivot.
Now we can use a PIVOT function to pivot the data from our model into the respective level columns. Where we usually specify the column names, we can refer to our jinja variable to make a truly dynamic pivot.
Of course, you can also do this statically which is suitable for most cases as below.
Summary
So we have learnt about what hierarchy data is and depending on our use case how to query it. There are a few things to still to note: For all examples I am using Snowflake SQL dialect but most databases will be similar. Be wary of infinite loops when using recursive queries, ensure you have limited your data and your joins are correct. If you are having trouble there is more information in the snowflake documentation (opens in a new tab). These methods also assume the hierarchical data is not ragged and is contiguous. If you have any questions or comments, please get in touch.