Recursive PLPGSQL, Lookup Table and Ltree

There are many reasons and needs to implement a tree structure inside of an SQL database. In our particular example today, we're using a basic knowledge base tree as an example of a small- to large-scale tree inside of an SQL database. Additionally, any sort of directory/file system layout would be a good match for these concepts. I am going to discuss three basic concepts on dealing with small, medium, and large tree structures

The first, and most applicable to small trees, is a recursive lookup query. In essence, at any given branch of the tree, this function will walk back up the tree and generate a delimited path representing the depth of the given node in the tree. For small trees, this function does work very well, and the dynamic reaction to updates is unparalleled - as every execution of the function examines the tree in full, updates, including branch alterations, are handled very quickly. A delete function, however, is required to remove entire subsets of trees, or to affix them to the parent branch.

Downsides to this approach are the very flexibility it affords, as every single node of a branch is traversed for every object in that tree. One hundred items in a direct linear tree will have significantly more than 100 iterations to walk the entire length. The first branch would walk up two(self, root), the second would walk up three (self, parent, root), and so on, with the final node in the series walking back up the entire length of the tree, 101 iterations of the function. As you can imagine, a fully recursive lookup in this method is highly inefficient for large trees, or for a large connection load. For example, with a basic tree of 10 top-level nodes, each with 10 levels of depth, we have an approximate runtime of 40ms on our development server.

However, if we take this tree to 100 top-level nodes of 10 depth, the runtime increases to 358ms, roughly. So far, it's scaling close to linearly with the size of the database. At 50 top-level nodes, each of 20 depth, the full runtime comes in at 600ms, even though the number of actual rows is the same as before. Doubling the row count to 100 top-level nodes of 20 depth each brings the runtime up to an approximate 1200ms, and this is only with 2000 rows in our database.

As you can see, as the size of the tree increases, so to will the runtime, approximately linearly with the size of the tree. The queries used here were simply to extract the path as "Foo, Bar, Baz" from the tree. The next logical step is a path lookup table. This requires no external modules, merely a trigger and a new SELECT function to operate. It works on the basic principle of, you know where an object in the tree is at insertion time, so why not write it down for later?

The basic layout of the insertion trigger is:

- Does the new row have a parent ID?
- If yes, query the lookup table for the parent ID
- If there's a parent ID in the lookup table, append our path to that path, add a new row
- else, enter the recursive function and compute the parents' path and our own path, and add both to the lookup table
- If no, use our own name as the path and insert ourself into the lookup table.

At this point, our queries can be significantly faster than a full recursive query - the recursion and tree organization is already done, via the lookup table. Querying any level in the tree gives us all the nodes all the way back up, via a regex and a second select statement. For instance, in a "." delimited path, we can extract any member of the tree "Foo.Bar.Baz.Foobar.Foobaz" via

SELECT id FROM lookup_table 
   WHERE path ~ "^Foo.Bar.Baz$";

for any . level of the tree.

This can also be used to manage large-scale deletes and updates of the table. Downsides to this approach are that your triggers must fire properly, both for inserts and deletes from the database. A second trigger or function would be required to manage deletes on branches in the tree. Further downsides include poor performance on the regex queries, though this could be partially alleviated with appropriate indexes being added to the lookup table. Further issues arise from moving branches around inside the tree, requiring more processor time to analyze all the paths in the lookup table and rearrange them correctly to reflect the new tree layout. For instance, if one moves a node from Foo.Bar.Baz to a direct child of Foo.Bar, then one would have to have a subsequent call of

UPDATE lookup_table SET path = "foo.bar." || 
   old path where path = "foo.bar.baz.old_path";

or something in that general idea. Furthermore, additional time must be spent to write a series of regex functions to analyze locations in the tree, and to do interleaving in the event you may want, for instance, "Foo.Bar.*.Baz". This can prove to be a complex and time-consuming proposition. Delete, however, is much faster than the previous concept, where entire swaths of the tree can be removed by

DELETE from tree_table 
WHERE id IN (
   SELECT id FROM lookup_table where path  = "$Foo.Bar.*"
); 

roughly. The initial lookup table is built using the recursive path query lookup. The pathnames are created as "parent.child.second_child", etc. Path is stored as TEXT field. Since queries to generate only the full path listing would be pointless at this stage, due to their pre-existence, we will focus on extracting only specific branches of the tree from here on in. The first query,

SELECT path FROM tree 
   JOIN lookup_table ON tree.id = lookup_table.tree_id;

has an approximate runtime of 11ms. So, let's find all the nodes that have "baz" in them, which carries a runtime of 13ms, using the query

SELECT * FROM tree 
JOIN lookup_table ON tree.id = lookup_table.tree_id 
WHERE path ~ 'baz';

Basic query, very fast. A similar query using the recursive function runs in 1232ms, no drop from simply selecting all of the rows. However, what if we wanted only the rows from the table where 'baz' was the top-level?

SELECT * FROM tree 
JOIN lookup_table ON tree.id = lookup_table.tree_id 
WHERE path ~ '^baz'; 

is one simple way, and runs in approximately 3ms. or where baz is anywhere in the tree, but NOT the root?

SELECT * FROM tree 
JOIN lookup_table ON tree.id = lookup_table.tree_id 
WHERE path ~ '\w{0,}\.baz\\.\w{0,}';

which runs in 20ms, and only returns nodes which have .baz. However, there's no easy way to tell if 'bar' is a child node of 'foo', for instance. A regex similar to '*.foo\\.\w*\.bar'.

However, having to write out these complex regexes takes significant development time and resources and testing. Ltree builds further on top of the lookup table functionality, moving certain complexities into pre-written C functions and binding comparison tools as infix operators, significantly simplifying the queries required. As the majority of the module is written in C, it is most suitable for very large trees, or systems with a large number of concurrent users. All the regexes provided by the standalone lookup table are still applicable in the ltree mode, and some new syntax is provided specifically to work with tree structures.

Converting from the initial lookup tree path to the ltree path format is taken care of by a very convenient function - text2ltree(TEXT). Rebuilding the index can be done in a quick, simple and easy query, remarkably close to the query that originally built the lookup table. However, ltree still has the same basic disadvantages a the lookup table, as the same trigger must be used to create the lookup table, as well as requiring the same delete and insert code. However, such code CAN be significantly simplified by using the infix operators that are provided by the ltree module. For instance, one can acquire all items in a tree using a simple and understandable query such as

SELECT path FROM lookup_table 
WHERE path <@ 'Foo.Bar'; 

returning all nodes under the Bar branch in the Foo tree. This is similar, but significantly more flexible than the regex function discussed in the Lookup Table section, as you are not required to give the entire path to the right side of the query, in this case. Such a query would resemble

SELECT path FROM lookup_table 
WHERE path ~ '*{1,}.bar.*{0,}';

thus giving anything under any of the .bar branches, no matter what their location in the tree. So bar.bar would match, as would foo.bar.baz, or foo.foobage.bar.baz.foobar. This query has a runtime of 2.2ms, in total. However, this doesn't really show the full potential of ltree - yes, it's fast and the syntax allows for very sophisticated manipulation of the tree.

But how fast is it when the tree is excessively large? The ltree documentation uses a fragment of the DMOZ (dmoz.org) index to show off the versatility of its tree manipulation speeds. Our demonstration tree has been increased to a size of 10,000 top-level nodes with a varying depth of 25-75. This gives us, at minimum, 250,000 nodes. The above query, doing a basic path query, took 600ms, using the ltree module, over a set of approximately 280,000 records. The same query, modified slightly to use ltree-to-text for standard regex mapping, took approximately 974ms. The query

SELECT count(*) FROM lookup_table 
WHERE path <@ 'foo';

also took an approximate 220ms to completely, returning 37k rows. The similar query,

SELECT count(*) FROM non_ltree_lookup 
WHERE path ~ '^foo\\.\w{0,}';

using a plain regex took approximately 500ms. All queries are shown without indexes on the lookup tables.