Postgres as a graph database

624 points
a year ago
by elorant

Comments


szarnyasg

I designed and maintain several graph benchmarks in the Linked Data Benchmark Council, including workloads aimed at databases [1]. We make no restrictions on implementations: they can use any query language like Cypher, SQL, etc.

In our last benchmark aimed at analytical systems [2], we found that SQL queries using WITH RECURSIVE can work for expressing reachability and even weighted shortest path queries. However, formulating an efficient algorithm yields very complex SQL queries [3] and their execution requires a system with a sophisticated optimizer such as Umbra developed at TU Munich [4]. Industry SQL systems are not yet at this level but they may attain that sometime in the future.

Another direction to include graph queries in SQL is the upcoming SQL/PGQ (Property Graph Queries) extension. I'm involved in a project at CWI Amsterdam to incorporate this language into DuckDB [5].

[1] https://ldbcouncil.org/benchmarks/snb/

[2] https://www.vldb.org/pvldb/vol16/p877-szarnyas.pdf

[3] https://github.com/ldbc/ldbc_snb_bi/blob/main/umbra/queries/...

[4] https://umbra-db.com/

[5] https://www.cidrdb.org/cidr2023/slides/p66-wolde-slides.pdf

a year ago

JimmyRuska

This video goes through some of the nuances https://www.youtube.com/watch?v=tNgFMBzYcl8

Take a look at RDFox. Unlike Postgres, where you can only derive a column from another column once (you can't derive a column from a derived column), RDFox can derive unlimited columns or even objects through rules, incrementally. Kind of like unlimited chaining of materialized views.

https://docs.oxfordsemantic.tech/

a year ago

eternalban

Re [5]'s assertion under "blunders" that use cases will diminish post-SQL/PGQ: what do you think of something like Quine?

https://github.com/thatdot/quine

Their claim to fame is progressive incremental computation (each node is an actor responding to events), and I'm not sure how a relational DB could do that and match the latencies. That use case is pretty much pattern matching, forensics, and stuff like that.

https://docs.quine.io/core-concepts/architecture.html

a year ago

szarnyasg

It's also worth mentioning the idea of WITH ITERATIVE. This is a proposed SQL syntax extension that allows formulating iterative computations. For many problems (including graph traversals), this makes queries easier to write and execute.

> We derive new iterative CTE variants from the simple loop-based, operational semantics of SQL:1999’s WITH RECURSIVE.

https://www.cidrdb.org/cidr2023/papers/p14-hirn.pdf

https://www.cidrdb.org/cidr2023/slides/p14-hirn-slides.pdf

https://www.youtube.com/watch?v=ZDtSGAeYKU0

a year ago

szarnyasg

Quine seems to be an interesting project but I'm not too familiar with it. Its main feature, evaluating complex multi-way join queries on incoming streaming data, exists in the relational world as the Materialize database which leverages differential dataflow for computation.

Quine uses Cypher, so expressing path queries can be done with the concise Kleene-star syntax, e.g. (p1:Person)-[:knows*]-(p2:Person).

Materialize is getting support for WITH RECURSIVE and WITH MUTUALLY RECURSIVE (their own SQL extension that fixes some issues of WITH RECURSIVE):

https://github.com/MaterializeInc/materialize/issues/11176

a year ago

simonw

This same technique - WITH RECURSIVE - is also supported by SQLite.

I ported the example in this tutorial to SQLite here - now you can play with it in Datasette Lite: https://lite.datasette.io/?sql=https%3A%2F%2Fgist.githubuser...

Amusingly I ported the sample PostgreSQL code using the new ChatGPT "Browsing" alpha, with the following prompt:

> Read this tutorial and then output equivalent create table and insert statements for SQLite https://www.dylanpaulus.com/posts/postgres-is-a-graph-databa...

a year ago

punnerud

If we have links to other SQLite3 databases inside the SQLite database itself, we could make a static-webpage distributed graph database: https://news.ycombinator.com/item?id=27016630

The static database trick, combined with recursive graph queries.

a year ago

henryfjordan

I implemented pretty much exactly this setup once.

For what it's worth, there are a lot of footguns with this approach. You need to be careful about infinite cycles and things like that. You also just do not get the ergonomics of a query language that is meant for graph data. Once you get it all set up and experience the issues you will no doubt be able to build whatever you want, but there's a learning curve.

In the end we switched to using Neo4j and were able to build new features a lot more quickly than we would've been able to on Postgres.

It's also worth mentioning that there are many ways to store graph data, and using an "adjacency list" as is being done here may or may not be the best way for your use case. Neo4j, I think, uses a different approach where nodes directly store info about the edges they are connected to, so you don't even need to hit an index to follow an edge.

a year ago

cryptonector

> You need to be careful about infinite cycles and things like that.

Not really, just never forget this: always use `UNION`, and never `UNION ALL`, in recursive queries!
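
To see why, here's a minimal sketch (toy table and data assumed): UNION discards any row that duplicates one already produced, so the recursion reaches a fixed point even on a cycle.

  -- a two-node cycle: A -> B -> A
  create table edge (src text, dst text);
  insert into edge values ('A', 'B'), ('B', 'A');

  with recursive reachable (node) as (
    values ('A')
    union  -- discards rows already produced, so the cycle terminates
    select e.dst
    from reachable r
    join edge e on r.node = e.src
  )
  select * from reachable;
  -- returns A and B, then stops; with UNION ALL it would loop forever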

a year ago

twic

There are still ways to go wrong. Try writing a query which explores a graph from a starting node, building up a path from the start to each node as a string. Even using UNION you will get an infinite loop, because every visit to a node in the loop has a different path, so it makes a unique row.

a year ago

[deleted]
a year ago

cryptonector

The aggregation needs to be done after you've computed the transitive closure, not while you're computing it.

a year ago

twic

I think the only way to build the path up is to do it as you're traversing, though.

Here's a small example:

  create table node (
    name varchar primary key
  );

  create table edge (
    from_name varchar not null references node (name),
    to_name varchar not null references node (name)
  );

  insert into node values
    ('Hawaii'),
    ('San Francisco'),
    ('New York'),
    ('Houston');

  insert into edge values
    ('Hawaii', 'San Francisco'),
    ('San Francisco', 'New York'),
    ('San Francisco', 'Houston'),
    ('Houston', 'New York');

  with recursive route (destination_name, path) as (
    values ('Hawaii', 'Hawaii')
    union
    select to_name, route.path || ' -> ' || to_name
    from route
    join edge on route.destination_name = edge.from_name
  )
  select * from route;
That works as long as the graph is acyclic. If you add an edge from New York to Houston, say, then the query goes into an infinite loop.

How would you do this after computing the transitive closure? And how would that avoid an infinite loop?

a year ago

cryptonector

Ah, yes, I flubbed -- this one requires aggregation as you go. It's tricky. Ideally you could have a decent aggregation data structure for the path that lets you select a distinct sequence (path) of middle points. One could write such an aggregation function, I think. Lacking that you have to limit the length of all paths between any pair of points to the longest possible path (which is the number of points minus one), which is annoying, and then you still have to fix up each resulting path to remove any repeating suffix.

  sqlite> .schema
  CREATE TABLE edge (src text, dst text);
  sqlite> with recursive paths (src, dst, len, path) as (
     ...> select src, dst, 0, src || '->' || dst from edge
     ...> union
     ...> select p.src, e.dst, p.len + 1, p.path || '->' || e.src
     ...> from paths p join edge e on p.dst = e.src
     ...> where p.len < (select count(*) from
     ...>               (select src from edge union select dst from edge)))
     ...> select * from paths;
  Hawaii|San Francisco|0|Hawaii->San Francisco
  San Francisco|New York|0|San Francisco->New York
  San Francisco|Houston|0|San Francisco->Houston
  Houston|New York|0|Houston->New York
  Houston|Hawaii|0|Houston->Hawaii
  Hawaii|Houston|1|Hawaii->San Francisco->San Francisco
  Hawaii|New York|1|Hawaii->San Francisco->San Francisco
  San Francisco|Hawaii|1|San Francisco->Houston->Houston
  San Francisco|New York|1|San Francisco->Houston->Houston
  Houston|San Francisco|1|Houston->Hawaii->Hawaii
  Hawaii|Hawaii|2|Hawaii->San Francisco->San Francisco->Houston
  Hawaii|New York|2|Hawaii->San Francisco->San Francisco->Houston
  San Francisco|San Francisco|2|San Francisco->Houston->Houston->Hawaii
  Houston|Houston|2|Houston->Hawaii->Hawaii->San Francisco
  Houston|New York|2|Houston->Hawaii->Hawaii->San Francisco
  Hawaii|San Francisco|3|Hawaii->San Francisco->San Francisco->Houston->Hawaii
  San Francisco|Houston|3|San Francisco->Houston->Houston->Hawaii->San Francisco
  San Francisco|New York|3|San Francisco->Houston->Houston->Hawaii->San Francisco
  Houston|Hawaii|3|Houston->Hawaii->Hawaii->San Francisco->Houston
  Houston|New York|3|Houston->Hawaii->Hawaii->San Francisco->Houston
  Hawaii|Houston|4|Hawaii->San Francisco->San Francisco->Houston->Hawaii->San Francisco
  Hawaii|New York|4|Hawaii->San Francisco->San Francisco->Houston->Hawaii->San Francisco
  San Francisco|Hawaii|4|San Francisco->Houston->Houston->Hawaii->San Francisco->Houston
  San Francisco|New York|4|San Francisco->Houston->Houston->Hawaii->San Francisco->Houston
  Houston|San Francisco|4|Houston->Hawaii->Hawaii->San Francisco->Houston->Hawaii
  sqlite>
a year ago

twic

Aha, nice, I didn't think of just limiting it that way! The need to postprocess this is annoying, but only very mildly; I think it takes a fairly simple window function.

I had a go at changing my query to only produce one route to each node, by excluding rows in the recursive step which go to a node for which we already have a destination, but I couldn't find a way to do it which PostgreSQL would accept, because you can't use the recursive table in an outer join or a subquery.

The PostgreSQL docs have some advice about dealing with cycles, but I haven't worked up the mental energy to absorb it:

https://www.postgresql.org/docs/14/queries-with.html#QUERIES...

a year ago

cryptonector

Well, it's a nice trick but it's O(N^2) in terms of output size because you have N items and paths up to N-1 long, so you have up to N ~N-length paths. If your data is guaranteed to have loops then this could be annoying.

The PG thing works trivially because you can use an array for the path and inspect the array in each iteration. That's harder to do in SQLite3, but... you could: just format a string like '%->' || src || '->%' then use it with `like()` to check if the path you're adding `src` to already has it.
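
Spelled out, that guard might look like this (a sketch over the same edge table; it assumes node names never contain '->', and appending e.dst instead of e.src also avoids the repeated-segment fix-up above):

  with recursive paths (src, dst, path) as (
    select src, dst, src || '->' || dst from edge
    union
    select p.src, e.dst, p.path || '->' || e.dst
    from paths p
    join edge e on p.dst = e.src
    -- skip the step if this node is already somewhere in the path:
    where ('->' || p.path || '->') not like ('%->' || e.dst || '->%')
  )
  select * from paths;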

a year ago

blowski

To me, that all sounds like you need to be careful.

a year ago

cryptonector

I mean, yes? This is easy to learn the hard way, but it's also easy to learn. Once learned you don't forget.

a year ago

misnome

You started the thread by disagreeing with “you need to be careful”, and the OP is pointing out that your thread conclusion is “you need to be careful”.

a year ago

xpe

Maybe, but both shared useful points along the way. Elaboration IOW. All good.

a year ago

magicalhippo

Seems to be Postgres-specific advice?

Others, for example SQLite[1], seem to advocate using UNION ALL over UNION.

[1]: https://sqlite.org/lang_with.html#recursive_query_examples

a year ago

cryptonector

No, it's not Postgres-specific advice. Using UNION ALL in recursive queries is a mistake if you have cycles in your data. This follows from first principles and the standard.

a year ago

magicalhippo

Right, bit early so wasn't thinking in terms of cyclic data, we've only used it for acyclic graphs (typically trees).

a year ago

hedora

There is some theoretical work on datalog you might be interested in. Recursive queries over sets are guaranteed to terminate if they don't contain negations / aggregations, and if they have a finite Herbrand universe, which, very roughly speaking, means that no functions can return new constants or tuples of unbounded length.

E.g.: recursively generating the set of x := x + 1 won't terminate because x goes to infinity, and you get infinite tuples. x := abs(x) is fine, since you get up to two tuples per input tuple (-1 and 1, for example).

Anyway, if you ask the query to build up all paths (of unbounded length), it won't terminate. If you ask for shortest paths (bounded by the number of nodes in the source database), it will.
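
In SQL terms, that distinction might look like this sketch (reusing the node/edge tables from upthread): the recursion only carries (node, hops) pairs, bounded by the node count, and the aggregation to shortest distances happens outside the recursive part.

  with recursive dist (node, hops) as (
    values ('Hawaii', 0)
    union
    select e.to_name, d.hops + 1
    from dist d
    join edge e on d.node = e.from_name
    where d.hops < (select count(*) from node)  -- longest simple path bound
  )
  select node, min(hops) as shortest
  from dist
  group by node;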

a year ago

magicalhippo

Interesting. Most of that makes sense I think, though a bit outside my normal sphere.

I was struggling to see why aggregates were excluded, so to speak, but I guess it's because they would have to be repeatedly evaluated? I.e., compute the sum, add it to the output, and now you might have to recompute the sum? Perhaps I'm being dense here; left my thinking cap at home.

a year ago

mistermann

Piggybacking on your Neo4j reference: can anyone suggest resources that would help someone from the SQL world who is very uninformed on graph databases to quickly get their head around the key ideas, but even more importantly, how to choose a platform among the offerings (licensing being, I think, a key issue; I've heard Neo4j can get expensive)? Assume a large, "social media" scale system, to be safe.

a year ago

henryfjordan

Neo4j has a self-hosted community edition that is free, though you need to upgrade to Enterprise to expand your DB beyond one node. It's worth noting, though, that scaling from a single node to a cluster is going to have some pretty huge performance issues whether you are using SQL or Neo4j to store your graph. The power of graph databases is that it should be very inexpensive to follow an edge in the graph, but when that leads to a network hop you lose a lot of performance.

If you are comfortable fitting your data onto one box, then development speed is probably more important than other factors and I would just try a few databases out and especially pay attention to how good the libraries are in your programming language of choice. Neo4j for instance had high quality libraries in Java but the experience in Python was not as good.

If you have a lot of data or need next-level performance, I would start by doing some research on the various ways to store graph data and then pick a DB that supports the approach you want to take.

a year ago

j45

Vaticle's TypeDB is intriguing.

a year ago

joshspankit

As someone who hated being in the SQL world and couldn’t figure out why until years later when I found out about graph databases, here’s one big shift:

Graph databases store true relationships.

The R in RDBMS and the concept of relationships by primary keys are lies (imo). They lead people away from what true relationships can be. Databases like Neo4j are not about doing any sort of PK or join or merge-before-query. Looking up by connection is the fastest way to find information. If your RDBMS had one table for phone numbers, one for email addresses, one for first names, one for last names, one for street, and so-on it would take huge amounts of time to query to find all the detail for a single person. With a graph database it takes a small amount of time to find the first bit of info (let’s say phone number since it’s easily sorted), and then every other bit of linked data is essentially free.

a year ago

tejtm

Apologies in advance; the R for relational is with respect to the tuple, that is, the unordered items which are serialized within a row in a table, not the keys which link different tables. The lie you find does not originate in the definition.

As someone who has wanted to love graph DBs: keep to narrowly focused top-down queries, where the index-free adjacency lets you avoid touching the bulk of your data, and stay away from bulk queries that may need to visit everything multiple times, which is unfortunately where things get interesting.

a year ago

eurasiantiger

Effectively, moving from RDB to a graph database just shifts relationship resolution from query time to insert/update time. Sometimes this can be an advantage. I’d even wager usually.

a year ago

lolive

It also switches you from a schema-based data model to a schemaless one.

a year ago

[deleted]
a year ago

[deleted]
a year ago

mistermann

This is very good news for my purposes, thank you!

a year ago

bavell

Well said and agreed!

a year ago

WolfOliver

The R in RDBMS stands for the mathematical term "relation", which is defined as a subset of a Cartesian product.

a year ago

throwaway689236

Sounds great, what are the downsides? When is it a definitely bad idea to use it?

a year ago

bavell

I've been using EdgeDB in a greenfield project and loving it. Some rough edges (forgive the pun), but fantastic for prototyping / smaller projects, and it could be a good candidate for large projects. I like it better than when I used Neo4j a few years back for a client project. Much better license, and I really like that it's built on top of Postgres.

a year ago

pottertheotter

I was looking at graph databases/Neo4j a lot early last year and found Neo4j has a lot of good resources.

You can get this book for free through their website. [1]

They have some good documentation [2], including the page "Transition from relational to graph database" [3].

[1] https://neo4j.com/graph-databases-book/

[2] https://neo4j.com/docs/getting-started/current/

[3] https://neo4j.com/docs/getting-started/current/appendix/grap...

a year ago

taubek

For some basics of graph modeling take a look at https://memgraph.com/docs/memgraph/tutorials/graph-modeling.

Some differences between SQL and graph databases are covered at https://memgraph.com/blog/graph-database-vs-relational-datab....

a year ago

xpe

> how to choose a platform among the offerings (licensing being I think a key issue, I've heard Neo4J can get expensive)

Safest bet (knowing nothing else about your needs): Build at least three prototypes, preferably rapidly and in parallel. Only after doing so will you be able to ask the right questions that speak accurately to your specific needs and therefore be able to map actual requirements to the different graph databases.

I’d be curious what other experienced graph database people would say about their first bigger projects. Looking back, my first graph projects involved significant iteration and several restarts. And these were still necessary after having significant background knowledge about the algorithms, offerings, and trade offs.

There are many people in roles/orgs that don’t have the humility or organizational cover to admit that these early projects are largely experimental learning experiences.

> ...assume a large, "social media" scale system, to be safe.

This broad assumption is going to cost you.

Instead, I’d encourage you to reframe this assumption in more specific terms such as read/write patterns; need for consistency, availability; support expectations; your talent pool; and lots more. Estimates can be on a log10 scale: 1, 10, 100, 1000 and so on.

a year ago

viksit

a year ago

[deleted]
a year ago

mdavidn

I do love PostgreSQL, and I often reach for this approach when working with hierarchical data, but a word of warning: Recursive queries will not scale to handle _very_ large edge tables. The recursive query will always consider _all_ edges, even when the outer query seeks only a single node's relationships. The solution is to build and index denormalized n-th order relationship tables.

Others have already pointed out the issues of cycles.

a year ago

ellisv

It's not even that it always considers all edges, but that you can't exit the search early when a condition is met. In other words, you have to first find all the paths and then, outside the CTE, filter to the shortest. We push the node filter into the CTE by wrapping it in a function.
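
That pattern might look roughly like the following sketch (the table, column, and function names are assumed, not our actual code); the function parameter effectively pushes the starting-node filter into the CTE's anchor term:

  create function descendants(start_id bigint)
  returns table (node_id bigint)
  language sql stable
  as $$
    with recursive sub as (
      -- anchor: the immediate children of the requested node
      select next_node from edges where previous_node = start_id
      union
      select e.next_node
      from sub
      join edges e on e.previous_node = sub.next_node
    )
    select next_node from sub
  $$;

  -- the filter now lives inside the recursive query:
  select * from descendants(42);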

> The solution is to build and index denormalized n-th order relationship tables.

This sounds much more performant but also more difficult to maintain.

a year ago

mdavidn

Yes, it is difficult to maintain. That might be a good moment to consider a proper graph database!

a year ago

ellisv

I’ve considered using a column to indicate 2nd, 3rd, etc. degree relations instead of n-th order tables.

a year ago

switchbak

"you can't exit the search early when a condition is met" - I have a DAG traversal written in a recursive CTE, and I can bail early out just fine when my traversal predicates no longer match. Not sure why I'd have to do that outside the (recursive part of the) CTE?

Obviously maintaining a flattened version is going to perform better in queries, but you're trading maintenance (write) time for query time. Or materialization time if you decide to do it lazily.

a year ago

ellisv

If you have an edge list like:

  A->B
  A->D
  B->C
  C->D

Postgres will walk both paths from A to D in the recursive CTE and then you can filter them afterwards to keep only the shortest.

You can't use aggregate functions within the recursive CTE, so you can't GROUP BY your starting node and stop once you find the first path. There isn't a way to compare across multiple paths or iterations of the loop.

a year ago

FleaFlicker99

Ok but that's a little different than saying you can't cut the traversal short.

If I'm traversing ancestors (let's say) until I find one that doesn't satisfy a condition, it'll bail out then. I get that this doesn't serve all use cases, but it isn't a small thing either.

a year ago

Nezteb

I learned from one of the comments on the post about AGE[1], "a PostgreSQL extension that provides graph database functionality."

[1] https://age.apache.org/

a year ago

cryptonector

Cycles are not a problem: just use `UNION`, not `UNION ALL`.

I myself build transitive closure tables that I update incrementally as needed (sometimes in a deferred way), so that many of the recursive queries I do can be very fast, but I only build transitive closures for the smallest portion I can (basically group nesting).

a year ago

liotier

> Recursive queries will not scale to handle _very_ large edge tables

What do you consider large? We have a 12-year-old telco application with a few million objects in Neo4j, doing fault impact calculations... Would PostgreSQL handle that easily nowadays?

a year ago

simonw

My hunch is that a few million objects is pretty tiny these days - you could probably write a script to import all of them into PostgreSQL in a few minutes and try it out yourself to see how it behaves.

a year ago

simonw

I tried it against SQLite. I got ChatGPT to write me a script that would insert 1m edges and 1m nodes:

https://gist.github.com/simonw/c16ce01244760e186a3a0aa3fee04...

Then I ran that query again and it seems to return results in about 80ms:

https://lite.datasette.io/?sql=https://gist.github.com/simon...

a year ago

eurasiantiger

Not a very realistic example; you need to be requesting some actual fields across nodes and doing some filtering on at least strings and dates, maybe geo areas as well.

a year ago

simonw

Go ahead and try it! I've documented all of the tools you need to run some detailed experiments against SQLite entirely in your browser here.

a year ago

ellisv

It depends a lot on how connected the graph is and whether you can fit it in memory.

a year ago

runeks

> The solution is to build and index denormalized n-th order relationship tables.

Can you elaborate on this?

Does an n-th order relationship table contain all the nodes reachable from some node going through n edges?

And you'd have one such table for each integer in the range 2..n?

a year ago

afandian

I'm sorry, could you spell it out? What exactly does "recursive query will always consider _all_ edges" mean? A table scan? I'd be very grateful if you could give some pseudocode or point to a doc page.

a year ago

jeremyjh

I think GP means that it has to completely expand the recursive part for every branch before any where condition on the edge nodes can be applied. Graph databases presumably can optimize this.

I've found recursive queries to be difficult to scale in real-world queries past a few million edge nodes. We've denormalized several tree relationships so that the edge is connected both to its parent and to the root of its tree in several cases.

a year ago

afandian

Thanks! Sounds like you're saying that brute-force recursive algorithms without backtracking or early termination aren't a good match for path finding. That's not a surprise.

I'm using a recursive algorithm on Postgres to find trees in a graph, but only where I'm specifically interested in all of the nodes.

a year ago

cpa

It's a good tool as long as your queries are simple, and your graph isn't too large (which is probably the case 99% of the time). And if that allows you to have one less tool to deal with, go for it!

But I've tried to push it to its limits (using PostGIS, my nodes being land plots connected to each other; the biggest query used joins, laterals, recursive CTEs, window functions… you name it) and ran into many issues: can't branch out early, hard to optimize, no support for complex queries (writing anything beyond BFS and DFS is a pita). Yet, in the end I accepted the struggle (and the long queries) because it was so nice to work with my data directly where it was kept :)

a year ago

amyres

Using the built-in PostGIS topology tools? Or something more custom? I'm curious as I just started digging into and using PostGIS for a land parcel use case. I've wondered about the graph structure of parcels adjacent to others, rather than just a big unordered table of geom objects.

a year ago

cpa

Just using the geography stuff and doing joins on ST_Intersects. I couldn't guarantee that my data formed a partition of the plane, which is necessary for topology iirc.

Fun was had, and headaches too. The biggest speedup I got was computing the graph online (creating node/vertex tables from the geometries) and then doing the recursive CTE on that table.

a year ago

amyres

Cool thanks! I'm amazed at how much can be done in PostGIS (if my SQL queries are good enough...)

a year ago

jghn

Over time I've come to take this a step further. I feel people reach for a graph database way too early. Most use cases that most people have will work fine in a standard relational setup. And that brings with it the benefits of using technologies like Postgres that have stood the test of time. So I treat it as I do any performance consideration: first demonstrate that you actually need to go down the path less traveled, don't just assume you need to do so.

a year ago

rajman187

Completely agree. In fact, even massive graph data can make use of a relational database + caching [1]

[1] https://engineering.fb.com/2013/06/25/core-data/tao-the-powe...

a year ago

yooloo

Of course a relational DB can act like a graph DB. It's just not as efficient due to how things are stored and queried. It would be great to have a graph DB plugin (and I found one: https://github.com/apache/age).

a year ago

piyh

Having first-class PageRank and shortest-path functions in TinkerPop/Gremlin vs. having to roll them yourself with recursive CTEs feels like a graph DB win.

a year ago

eurasiantiger

The real question is: is an RDB better at being a graph DB than a graph DB is at being an RDB?

a year ago

jakewins

Random trivia: Johan wrote the original Neo4j implementation as a graph layer on top of Postgres. Then Java released NIO, he got excited, and he tried replacing the PG engine with one built for graphs from scratch.

a year ago

dapearce

Also see Apache Age, a graph database extension for Postgres: https://github.com/apache/age

a year ago

ChicagoDave

As a C# guy, I’ve figured out how to build in-memory graphs with generics and DSL/fluent queries. I’m working on a blog entry about it because it’s such a powerful skill to learn and leverage. No data store required. I can serialize/deserialize the data structure to/from binary or JSON.

a year ago

taubek

I personally prefer native graph databases such as Memgraph [1]. Graph databases have their advantages and disadvantages. Sometimes you need an RDBMS, sometimes you need a NoSQL solution. So I pick my stack based on the situation and the issue I need to solve.

[1] https://memgraph.com

a year ago

rajman187

“Native” is a misleading or misused term in the context of graph databases because a graph cannot be mapped in a maximally consistent way with the underlying hardware (the von Neumann architecture represents data in a sequential manner and it is much faster to access it this way rather than randomly).

There are generally two families in the graph database world: those which use underlying traditional tables of nodes and many-to-many edges; and index-free adjacency which just means each node in the graph knows the memory address of its connections (other side of the edges).

Distributed graphs necessarily end up using the former because it’s difficult if not impossible for a node to know the memory address of its connection when that crosses a physical boundary. So typically index-free adjacency graphs have a master-slave setup with multiple read replicas but a single one to write to.

So with a “native graph” you don’t rely on potentially expensive join operations to find neighbors of neighbors and can traverse complex paths easily.

a year ago

dtomicevic

This is spot on. Though, in Memgraph’s context, there is a combination of lock-free techniques and raw pointer representation that can work exceptionally well for a large variety of use cases, and even more so compared to others on streaming / event-driven / large volumes of writes. Using raw pointers, there are also optimization opportunities when storing or rebalancing the graph to maximize the number of adjacent nodes in a cache line, thus benefiting from sequential representation too.

a year ago

mashroom

Thanks for pointing me toward Memgraph. I hadn't heard of it before and it appears to be a great alternative to Neo4j.

a year ago

Rapzid

I've found this post pretty inspiring for what can be done with graphs in Postgres: https://www.alibabacloud.com/blog/postgresql-graph-search-pr...

There are a LOT of use cases that will fit the capabilities and performance characteristics.

a year ago

pphysch

Small nitpick, but I wish the author had just made `data` a JSONB field rather than VARCHAR. That really shows the power of Postgres: being able to operate as a "NoSQL graph DB" with little to no hacking.

a year ago

ganderzz

Hey, author here! Thanks for the feedback! I went back and forth on having `data` be a JSONB field or VARCHAR, but ended up with VARCHAR to show that `nodes` don't have to be anything crazy. Really, `nodes` is just a standard table you'd see in any old SQL database; what makes it a "graph" is the `edges`/relationship table.

a year ago

pphysch

Are JSON fields "crazy"? That seems like a myth that should be dispelled. To me, "crazy" is making your `data` field unstructured text.

a year ago

crunchengine

I use EdgeDB for graphs; it works well enough. I just don't know about performance, but for sure it is sufficient for most use cases.

a year ago

e1g

EdgeDB is Postgres so the same constraints apply.

a year ago

pcthrowaway

I imagine that's why the person you're responding to mentioned it. I'm surprised the author of this article didn't.

a year ago

Dopameaner

Andy Pavlo mentioned something similar, where he advocated relational databases to solve most graph database use cases. He also made a bet that if graph databases spawn a legit use case by 2030, he would change his CMU profile picture.

a year ago

phamilton

We materialize the transitive closure of our graph and use that to very quickly traverse the graph.

Our data is a directed graph (an organization tree) so the materialization code (accomplished via triggers) isn't too bad.

The result is that instead of WITH RECURSIVE we just join against our closure table, which is in the form `(organization, ancestor)`. Example type of query: "Is Alice an administrator in any of Bob's organizations or in any ancestor organizations?" If that's true, it would give her the ability to perform X administrative action. Actual queries are more nuanced, but the core Hard Thing :tm: is traversing the graph, and it's really easy with our closure table.
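
For reference, the classic insert-time maintenance for a closure table like that looks roughly like this (a sketch; the org_closure table and the :child/:parent parameters are placeholder names, not the actual trigger code):

  -- link every ancestor of the new parent (and the parent itself)
  -- to the attached org (and every descendant of it)
  insert into org_closure (organization, ancestor)
  select d.organization, a.ancestor
  from (
    select :child as organization
    union all
    select organization from org_closure where ancestor = :child
  ) d
  cross join (
    select :parent as ancestor
    union all
    select ancestor from org_closure where organization = :parent
  ) a;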

a year ago

perfmode

how do you do deletions from the closure table?

a year ago

phamilton

We have a foreign key constraint in place that prevents any org with children from being deleted. So first, all children must either be orphaned or moved to another parent. We have a trigger on updates that will rematerialize an org's closure rows and then do so for all their descendants. That puts all the former children into a correct state. Then, when we delete the organization, we have foreign keys from the closure table to the organizations table and we allow cascading deletes.

Our closure table itself contains "evidence" that enables correctness constraints. So a row can only exist if a) it represents direct parentage (the ancestor is the direct parent) or b) the ancestor is a parent of my ancestor. This is all enforced via foreign keys, so if the foreign keys are set up correctly an errant row isn't possible.

Behind all this is a proptest (property based testing is awesome for stuff like this). We generate arbitrary org forests, perform random mutations (changing parents, deleting orgs, gaining children, etc) and then assert that the closures table is accurate. We have very high confidence that the data is correct because of this test suite.

a year ago

cbeach

This is nice but it misses what's fundamentally special about "real" graph databases - the way they store data.

Native graph databases like Neo4J use index-free adjacency. This means query durations are proportional to the amount of the graph searched, -not- the size of the data stored.

https://neo4j.com/blog/native-vs-non-native-graph-technology...

For large graphs, or deeply recursive queries, you need a native graph database to get reasonable performance.

a year ago

kaba0

Only remotely relevant, but I have recently come across Gremlin, a sort of “SQL for graphs”. Does anyone have some experience with it they can share? I only used it on an in-memory graph database, and I found it quite great so far.

a year ago

inguz

Tinkerpop/Gremlin is not really SQL-like at all. My experience with it has not been fun: it's difficult to reason about, has obscenely bad documentation, poor tooling, idiosyncratic library support, and simple data access patterns are hard.

a year ago

mdaniel

I mirror the sibling comment's experience: it was a tire fire; also, if one comes from an n-tier application mental model, it's further bad news because the actual query's compute runs _client side_ -- there is no such thing as "connect to Gremlin 'database,' submit query text, have it run queries" it's rather "boot up Gremlin in your appserver, run queries so long as your appserver can connect to every one of the data stores it references, credentials included"

1-star, hate, will quit before using it again

a year ago

philzook

Very interesting. Relatedly, I've been refining simple lightweight ways to use a regular SQL database to execute seminaive datalog queries. Path reachability in a graph is the canonical example. One could use stored functions to implement the outer loop also.

There are some other fun related applications like graph rewriting using SQL, etc.

MiniLitelog: Easy Breezy SQLite Datalog - https://www.philipzucker.com/tiny-sqlite-datalog/

a year ago

ralusek

I've done this on many occasions, but I actually only go with a single nodes table.

Schema is like this

    id
    type: string (or enum)
    data: jsonb
    indexed: jsonb (but this one gets a GIN index)
    toId: nullable foreign key to this same table
    fromId: nullable foreign key to this same table
    createdAt: date
    updatedAt: date
So if I had a post with a comment on it, the data would look something like this

   {
     id: '1111',
     type: 'users',
     data: { name: 'John Doe' },
   }

   {
     id: '2222',
     type: 'users',
     data: { name: 'Jane Doe' },
   }

   {
     id: '3333',
     type: 'posts',
     data: { text: 'Nice day today' },
   }

   {
     id: '4444',
     type: 'comments',
     data: { text: 'For you, maybe. Raining here' },
   }
And now for the edges:

    {
      id: '5555',
      type: '(posts < users)',
      toId: '3333',
      fromId: '1111',
    }

    {
      id: '6666',
      type: '(comments < users)',
      toId: '4444',
      fromId: '2222',
    }

    {
      id: '7777',
      type: '(posts < comments)',
      toId: '3333',
      fromId: '4444',
    }
So then, for example, if I have a post, and I want to find the comments on it, I search for toId = <my post id>, type = '(posts < comments)'.
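
In Postgres terms, that single table might look like the following sketch (assumed DDL, not the commenter's actual schema):

    create table entities (
      id          bigserial primary key,
      type        text not null,
      data        jsonb,
      indexed     jsonb,          -- gets the GIN index
      "toId"      bigint references entities (id),
      "fromId"    bigint references entities (id),
      "createdAt" timestamptz not null default now(),
      "updatedAt" timestamptz not null default now()
    );
    create index on entities using gin (indexed);
    create index on entities (type, "toId");

    -- comments on post 3333: find the edge rows, then join back
    -- to the comment nodes they point from
    select c.*
    from entities e
    join entities c on c.id = e."fromId"
    where e.type = '(posts < comments)' and e."toId" = 3333;
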
a year ago

TOMDM

This works well for nodes with at most one parent and child; the post is aimed at supporting many-to-many relationships, though.

a year ago

ralusek

What I've described here supports many to many...

a year ago

fauigerzigerk

Yes, but the downside of your schema is that the distinction between nodes and edges is somewhat muddled.

Presumably, all nodes would have nulls in both fromId and toId, but the schema doesn't enforce that. The schema also allows linking edges to other edges.

Don't you think this is a bit too much flexibility if the intention is to model a graph?

a year ago

jzelinskie

We used "WITH RECURSIVE" basically the moment it shipped in Postgres for the original CoreOS Clair [0] data model, which was a graph of software versions and vulnerabilities. It scaled well compared to a naive graph abstraction implemented outside the database, but when performance wasn't great, it REALLY wasn't great. After running that version in production for a couple of years, we ended up throwing it out and remodeling the domain to be less graph-y to achieve more consistent performance.

I've since worked on SpiceDB [1], which scales much better by taking the traditional design approach for graph databases and only treating Postgres as a triple-store. IME, there's no shortcut: if you need a graph, you probably want to use a database optimized for graph access patterns. Most general-purpose graph databases are full of optimizations for common traversals that are uncommon operations in relational databases.

[0]: https://github.com/quay/clair

[1]: https://github.com/authzed/spicedb

a year ago

truth_seeker

A few suggestions based on my prior experience with the same task in a production system:

1. Both tables (nodes and edges) MUST HAVE one more column, "label", and then you must create partitions based on the distinct values of the "label" column. This greatly helps queries run faster, as the search plan is narrowed when you include it in the WHERE clause and PG knows which partitions to hit.

2. Instead of one big fat primary key column in each table, consider having different unique indexes on each partition (see the sketch after this list).

3. The edges table can afford to have a "data" column of type TEXT or perhaps JSONB. PG saves it in a different on-disk layout and compression works great here. We used it to cache the result for previous/next nodes' data, or to precompute the result for the many extra hops which are required on a frequent basis.

4. Use PG procedures to hide the complexity of reusable DFS/BFS queries.
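
A sketch of what suggestions 1 and 2 might look like (all names assumed):

  create table nodes (
    id    bigserial,
    label text not null,
    data  jsonb
  ) partition by list (label);

  create table nodes_person  partition of nodes for values in ('person');
  create table nodes_company partition of nodes for values in ('company');

  -- suggestion 2: a unique index per partition instead of one fat PK
  create unique index on nodes_person (id);
  create unique index on nodes_company (id);

  -- a label predicate lets the planner prune down to one partition
  select * from nodes where label = 'person' and id = 42;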

a year ago

hot_gril

Postgres is powerful. I've implemented graph DB, mapreduce, huge sparse matrix math (which was cool), and ofc a KV store with it. With actual performance benefits over the alternatives. It's not very ergonomic for those, but it works.

That said, I've had so many jobs where the team decides to fit our whole data model into a fully recursive graph, and it's been a huge mistake every time regardless of whether they use Postgres or a dedicated graph DB. Just because you can do it (you always can!) doesn't mean you should. Start with just a traditional n-hierarchy schema and only look at modeling things as a graph if you're sure it's necessary. Usually it's not. Sometimes you'll have a mostly non-graph schema with a couple of graph tables for the things that really need that.

a year ago

its-summertime

Handling cycles seems to be a missing point.

https://www.postgresql.org/docs/current/queries-with.html#QU... the docs themselves do have information about it.

a year ago

ellisv

Postgres can detect cycles; see 7.8.2.2 in the docs: https://www.postgresql.org/docs/current/queries-with.html#QU...

a year ago

cryptonector

The reason graph DBs exist is mainly to provide easier syntax for recursive queries (which aren't really recursive because they're tail-recursive, which means iterative rather than what most people think of as recursive).

Any questions?

It's not clear how one might make SQL syntax for recursive queries simpler. Clearly we can have syntactic sugar for "dereferencing" relationships, so one could say `SELECT child-->name FROM parent_child WHERE ...;`, and we could have aggregation `SELECT array_agg(child-->name) FROM parent_child WHERE ...;`, and `SELECT child--->name FROM parent_child WHERE ...;` to say "recurse all the way through the child relation", but if you want more than one column then things get tricky.

a year ago

claytongulick

It doesn't handle all cases or all queries, but one method I've used to deal with graph data in relational databases is to materialize paths and cache them on the node (this only really works well with DAGs).

So, the path key could look something like: bob.ted.lisa.bill (obviously, use ids instead of names here)

Then you can index this in various ways to make LIKE queries fast.

Want to know everyone that rolls up to ted?

    select * from blah where path like 'bob.ted%'
Or, how many people report to ted, but aren't under lisa?

    select count(*) from blah where path like 'bob.ted%' and path not like 'bob.ted.lisa%'
And something a bit more complex that would typically be considered the domain of a graph database, like "What is the average number of direct reports each manager in my organization has?"

    select avg(direct_reports) from (
      select b.name, count(*) as direct_reports
      from blah b
      join blah sub
        on sub.path like '%' || b.name || '%'
       and array_position(string_to_array(sub.path, '.'), b.name)
         = array_position(string_to_array(sub.path, '.'), sub.name) - 1
      group by b.name
    ) managers;
So, even with that degree of complexity, you're still avoiding recursion, though you might be getting into O(n^2) territory if you don't have an index that will handle '%' || ... || '%' well. You can expand on this with various techniques depending on the application.

It's certainly no replacement for a graph database, but for basic stuff it's very fast, handles huge volumes, and doesn't rely on any sort of recursion at the database layer (unless you materialize in a trigger or something).

The actual materialization code can be tricky; one technique to deal with that is to use the author's approach or something similar to have a recursive CTE in a view. When a graph change is made, you can scope the change based on the path's hierarchy and update the paths from the recursive view. That way you only hit the performance penalty at write time, not read time.

a year ago

lenkite

OK, but can the table and query design in this article actually scale to millions of rows? With hundreds of concurrent queries? I have generally become a skeptic of doing anything fancy on an RDBMS.

a year ago

sigstoat

Any suggestions for getting nice graph-DB-style functionality out of a Postgres schema where you've got the nodes and edges of a labelled property graph spread across a bunch of different tables? (AGE seems to want to create a separate graph object that lives alongside your other tables, and toy examples like this article just dump everything into special tables.)

In my particular case I don't actually have a lot of data, so I'm not looking for performance; I just want to save developer time.

a year ago

bottlepalm

I can see how this possibly won't scale compared to dedicated graph DBs.

I'm interested in what the fundamental difference is in those graph DBs, because RDBMSes seem so close that I don't see why it isn't just provided as an extra layer on top.

As described in the article, any graph can already be built with an RDBMS, so why, for example, does Neo4j need to be standalone? Why couldn't it be a layer on top of Postgres that leverages existing relationships? Any good articles on that?

a year ago

cbeach

See the following page for an explanation (in particular the "Native Graph Processing" heading)

https://neo4j.com/blog/native-vs-non-native-graph-technology...

In native graph databases like Neo4J query times are proportional to the amount of the graph searched, rather than increasing with the overall size of the data. That's the clincher.

a year ago

emodendroket

I don't know off the top of my head but I'd assume that one could adjust the storage strategy and other performance tweaking to optimize for the expected use case.

a year ago

YouWhy

While the basic technique is sound, SQL recursive queries' no-side-effects character means a huge number of basic graph traversal strategies (e.g.: pruning, shortest path first/Dijkstra) are beyond reasonable engineering.

In addition to that, Postgres is difficult as a computational platform - the execution strategy for recursive queries is not necessarily intuitive, and their performance is hard to assess (for example: how soon will you run out of "stack"?)

a year ago

dfragnito

We accomplish something similar with SchemafreeSQL. Currently it has a MySQL 8 back-end, but we're working on other SQL back-ends. This demo uses PlanetScale as the DB back-end, Fly.io handles the SFSQL endpoint, a Netlify function handles the query, and the cytoscape.js graph network library provides the UI.

https://harmonious-mermaid-c4d794.netlify.app/got (sorry not mobile friendly but functional)

a year ago

samlambert

This is a cool project.

a year ago

dfragnito

Thanks !! We had hoped to use PS for our serverless backend and also offer a dedicated PS option. But we ran into some issues with Vitess. I was just thinking today I wanted to get back into it and see if we can resolve these issues. We had to go with Aurora.

The issues were on deletes. This demo is read only so it works well with a PS backend. My experience with your CS has been amazing despite these issues!!

a year ago

ape4

Postgres has commands to list conventional tables nicely. While I can see this would work, there are no native Postgres commands to list the graph.

a year ago

samwillis

If you are looking to utilise this with Django, there is the brilliant Django CTE project that enables it and other CTE queries with the Django ORM:

https://dimagi.github.io/django-cte/#recursive-common-table-...

a year ago

garyclarke27

In my experience Postgres works really well with a recursive CTE inside a materialized view for graphs. The latest version can detect cycles. I used to add a distance column as well, to show the number of hops between any two nodes. Hoping Postgres adds automatic incremental materialized view maintenance soon; it will be perfect as a graph database then.
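
A sketch of that combination, assuming the article's nodes/edges schema and PostgreSQL 14+ for the CYCLE clause, with the distance column included:

  create materialized view reachability as
  with recursive paths (previous_node, next_node, distance) as (
    select previous_node, next_node, 1 from edges
    union all
    select p.previous_node, e.next_node, p.distance + 1
    from paths p
    join edges e on p.next_node = e.previous_node
  ) cycle next_node set is_cycle using visited  -- PG 14+: stops on cycles
  select previous_node, next_node, min(distance) as distance
  from paths
  group by previous_node, next_node;

  -- until incremental maintenance lands, refresh manually:
  refresh materialized view reachability;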

a year ago

TOMDM

Incremental materialised view maintenance is a killer feature.

If Postgres added it, almost all of my interest in Materialize (the database, not the technique) would vanish immediately.

a year ago

ForHackernews

I can't believe this article doesn't mention the (first-party) ltree extension: https://hoverbear.org/blog/postgresql-hierarchical-structure...
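
For anyone unfamiliar: ltree is a contrib extension for materialized-path trees. A minimal sketch:

  create extension if not exists ltree;

  create table org (name text, path ltree);
  create index on org using gist (path);

  insert into org values
    ('bob',  'bob'),
    ('ted',  'bob.ted'),
    ('lisa', 'bob.ted.lisa');

  -- everyone at or under ted, no recursion needed:
  select * from org where path <@ 'bob.ted';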

a year ago

avereveard

All in all this seems to be "graph databases, only the simplest parts"; yeah, sure, SQL can be coerced into traversing an acyclic directed graph, but that doesn't make for a very efficient or interesting solution in the general case.

Also, calling a serialized tree structure a graph database seems quite a stretch.

a year ago

piyh

This isn't just a PG thing. The wiki article isn't great, but the references have links to SQLite, MSSQL, Oracle, IBM, etc.

https://en.wikipedia.org/wiki/Hierarchical_and_recursive_que...

a year ago

zorksylar

For most non-performance-critical cases, Postgres is a good choice. However, when dealing with more complex queries such as pattern matching (e.g., triangle counting) that require immediate results, a graph database outperforms Postgres.

a year ago

varunjain99

This is a neat data modeling exercise, but not sure why you'd prefer this approach over Neo4j.

It's much simpler to query in native graph database land (though it's not exactly SQL, it's pretty close to be honest). And there are performance / reliability implications of forcing postgres into graph land.

a year ago

fbn79

Are nested sets (https://en.wikipedia.org/wiki/Nested_set_model) not a better solution than the one described in the article?

a year ago

realaleris149

Apparently not:

"However, recursive SQL queries can be expected to perform comparably for 'find immediate descendants' queries, and much faster for other depth search queries, and so are the faster option for databases which provide them, such as PostgreSQL,..."

a year ago

yayr

Can someone explain or point towards an explanation, what the performance impact of such a query is? How does it scale with number of edges or depth of recursion? Is there a special smart way of indexing this to be efficient?

a year ago

winrid

If you have indexes on next_node and previous_node, it'll perform okay.

The size of the table (# of edges) won't actually matter *that* much performance wise with btree indexes. The performance will mostly depend on how deep you want to go with your recursion. The more nodes you check per query, the slower it'll be. Luckily PG will keep the frequently accessed nodes and index pages in the buffer pool, so it'll perform close enough to an in-memory graph database for most use cases that access the same data often - while not requiring your whole dataset to fit into memory all the time.

MongoDB's current graph search mechanism works the same way.

I think actually if you are using incremental ids and you don't delete edges, you could try BRIN indexes for the next/previous node columns. This could still get you pretty good performance with much lower storage space requirements than a "traditional" index, so you could theoretically store a pretty large graph (think billions of edges) with minimal disk space.

a year ago

henryfjordan

If you store a list of edges like (startId, endId) with an index on startId, then every time you want to follow an edge will cost O(log(n)) where n is the number of edges in your graph.

You can get O(1) performance on similar queries using a different storage strategy (by storing edges as pointers to the other nodes in memory) but SQL is not the right tool for that, you would be better off with a true graph DB.

a year ago

osigurdson

>> While Postgres is not generally thought of when working with graph data structures, it is perfectly capable to store and query graph data efficiently.

Some scalability and performance metrics would be helpful here.

a year ago

ghotli

I like this idea, and ultimately what I want is "is this a DAG or not" and, if so, "give me the topological sort".

Anyone know of good docs for this sort of pattern with SQL that addresses those two common tasks?
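
For a sense of the pattern, both tasks can be sketched with plain recursive CTEs against an assumed edge(src, dst) table:

  -- 1) DAG test: a graph is a DAG iff no node can reach itself
  with recursive closure (src, dst) as (
    select src, dst from edge
    union
    select c.src, e.dst from closure c join edge e on c.dst = e.src
  )
  select not exists (select 1 from closure where src = dst) as is_dag;

  -- 2) a topological order (valid only when is_dag): rank each node
  -- by its longest distance from any source node
  with recursive lvl (node, depth) as (
    select src, 0 from edge where src not in (select dst from edge)
    union
    select e.dst, l.depth + 1 from lvl l join edge e on l.node = e.src
  )
  select node, max(depth) as topo_rank
  from lvl group by node order by topo_rank;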

a year ago

elchief

Joe Celko covers graphs in chapters 12-14 of Trees and Hierarchies in SQL for Smarties

as well as chapter 27 of SQL for Smarties, 5th

Isn't ISO working on a graph extension to regular SQL? Can't seem to Google it correctly.

a year ago

nilsbunger

Sure, at its most basic a graph database is just a bunch of many-to-many relationships.

The recursive query is cool, but how is the performance for graph queries compared to a DB optimized for this use case ?

a year ago

ellisv

I haven't compared, but something like Kùzu is probably much more performant for a lot of use cases because it implements Worst-Case Optimal Joins.

But I think Postgres is likely to get these improvements in a few years too.

a year ago

mharig

IIRC, maybe the biggest graph database in the world, Facebook's TAO (The Associations and Objects?), is based on MySQL.

Meta may have had a good reason not to choose a "native" graph DB.

a year ago

amir734jj

Good example. Now do it with an undirected graph. What will the query look like?

So, is this basically running a transitive closure on the database at query time? That would be expensive.
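
For the undirected case, one common trick is to walk each edge in both directions; a sketch with an assumed edge(src, dst) table and start node:

  with recursive reach (node) as (
    values ('A')                      -- assumed start node
    union                             -- UNION also guards against cycles
    select case when e.src = r.node then e.dst else e.src end
    from reach r
    join edge e on r.node in (e.src, e.dst)
  )
  select * from reach;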

a year ago

revskill

PostgreSQL is my love. Thanks for all developers.

Do not blame the tool if you feel it doesn't fit your niche. Blame yourself first, and try to use it the right way.

a year ago

abhishekjha

Can you store hierarchical comments like the following in Postgres?

Is there a definite architectural pattern for storing hierarchical comments?

a year ago

eurasiantiger

Yes. Create a comments table, and have a parent_id column on it.
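
A sketch of that, with an assumed schema, plus the recursive query to pull a whole thread in one round trip:

  create table comments (
    id        bigserial primary key,
    parent_id bigint references comments (id),  -- null for top-level
    body      text not null
  );

  -- the whole thread under comment 42 (assumed id) in one query;
  -- UNION ALL is safe because parent links form a tree, not a cycle
  with recursive thread as (
    select id, parent_id, body, 0 as depth
    from comments where id = 42
    union all
    select c.id, c.parent_id, c.body, t.depth + 1
    from comments c join thread t on c.parent_id = t.id
  )
  select * from thread order by depth;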

a year ago

abhishekjha

Querying it in real time would be very expensive. You would have to keep joining for too long to find all the comments in a thread.

a year ago

eurasiantiger

And that’s why graph databases. Practically, you always need to limit the recursion depth anyway (you can’t show infinite depth in a UI), so you might be able to make do with Postgres.

a year ago

ungerik

Unrelated: is VARCHAR(255) still a thing?

a year ago

stevebmark

Is WITH RECURSIVE nearly as performant as dedicated graph database query languages?

a year ago

user00012-ab

whoa.... postgres is moving quickly and taking jobs away from other databases. Maybe we need to hold up for 6 months on any new uses for postgres until we all have some time to figure this out.

a year ago

vrglvrglvrgl

[dead]

a year ago

_boffin_

[flagged]

a year ago

packetslave

we're posting copyrighted PDFs to HN now?

a year ago