The relational model is just a way to represent reality. It happens to have some very useful properties, such as closure over many useful operations — but it’s a purely logical model of reality. You can implement relational operations using hash joins, MapReduce, or pen and paper.
So, right away, it’s meaningless to talk about the scalability of the relational model. Given a particular question, it might be difficult to break it down into bite-sized pieces and distribute it to N worker nodes. But going with MapReduce doesn’t solve that scalability problem — it just means that you will have a hard time defining a useful map or reduce operation, or you will have to change the question into something easier to answer.
There may exist scalability problems in:
- SQL, which defines requirements outside the scope of the relational model, such as ACID properties and transactional semantics.
- Traditional architectures and implementations of SQL, such as the “table is a file” equivalence, lack of sophisticated types, etc.
- Particular implementations of SQL — e.g. “MySQL can’t do it, so the relational model doesn’t scale”.
Why are these distinctions important? As with many debates, terminology confusion is at the core, and prevents us from dealing with the problems directly. If SQL is defined in a way that causes scalability problems, we need to identify precisely those requirements that cause a problem, so that we can proceed forward without losing all that has been gained. If the traditional architectures are not suitable for some important use-cases, they need to be adapted. If some particular implementations are not suitable, developers need to switch or demand that it be made competitive.
The NoSQL movement (or at least the hype surrounding it) is far too disorganized to make real progress. Usually, incremental progress is best; and sometimes a fresh start is best, after drawing on years of lessons learned. But it’s never a good idea to start over with complete disregard for the past. For instance, an article from Digg starts off great:
The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes.
That’s good because he blames it on the mindset not the model, and then identifies a specific problem. But then the article completely falls flat:
Computing the intersection with a JOIN is much too slow in MySQL, so we have to do it in PHP.
A join is faster in PHP than MySQL? Why bother even discussing SQL versus NoSQL if your particular implementation of SQL — MySQL — can’t even do a hash join, the exact operation that you need? Particularly when almost every other implementation can (including PostgreSQL)? That kind of reasoning won’t lead to solutions.
So, where do we go from here?
- Separate the SQL model from the other requirements (some of which may limit scalability) when discussing improvements.
- Improve the SQL model (my readers know that I’ve criticized SQL’s logical problems many times in the past).
- Improve the implementations of SQL, particularly how tables are physically stored.
- If you’re particularly ambitious, come up with a relational alternative to SQL that takes into account what’s been learned after decades of SQL and can become the next general-purpose DBMS language.
EDIT 2010-03-09: I should have cited Josh Berkus’s talk on Relational vs. Non-Relational (complete list of PGX talks), which was part of the inspiration for this post.
Great post as usual.
One little addendum to #4 is, “also take into account the failures in this area of endeavor.” I’m looking at Tutorial D in particular, but “I believe I did, Bob.” (search engine!) is also on point.
Tutorial D is only intended to be an illustration of concepts, so of course it hasn’t replaced SQL, and wasn’t intended to. Mind you, Pascal was only intended as a teaching language, but then it went into production. Now if you mean by saying Tutorial D failed in that it hasn’t inspired other languages that are used in industry, then yes you are right so far, though I intend to change that with my “Muldis D”.
Good luck with that. You’re gonna need plenty.
Absolutely.
the data is the system too.
SQL build a wall in the application from data, better integrate all.
The old multivalue database paradigm (now called casandra) sure is better.
I disagree, but you didn’t provide enough of an argument to refute.
Hi Jeff,
It’s only my opinion, I not have argument, only code and vage sentences like “avoid joins and keep the data near”
Ideally, you should be able to represent your data in relations even if it is stored in the format you describe.
Unfortunately, existing SQL systems almost universally assume “table = file”. That makes it difficult for you to store the data in the format which suits your needs, and still use SQL.
Regarding your #4, that is totally my intent with my Muldis D language, which I am actively developing now and filling in the last major omissions. And this is a completely serious and pragmatic exercise too. I am learning from all the past successes and mistakes I can get my hands on, in its development.
I have been casually following that for a while. Great work!
Last time I checked, I had trouble finding some “hello, world” type examples. It would be very nice if you had a simple primer for SQL person.
I should be adding examples like that in around 2 months, after I’ve fleshed out a few more parts of my spec, since adding those parts would greatly affect how good any code examples would look. (The part I’m currently working on is adding type mixins/etc so that one can properly do things such as operator overloading, which would be a shameful feature to lack; should be out any day now.)
Great to hear. I’ll watch for the next announcement.
As long as it’s relational I’m satisfied to use it for the things I do for a living these days. It’s a solid theoretical foundation, and too much criticism seem to be levied against misapplications and misunderstandings of it. When used as the of a logical or conceptual view of reality for a database then it stands up on it’s own.
One peeve of mine is how changes to this “view of reality” are done in SQL. I don’t like being limited to just firing off a sequence of begin, insert, update, delete, commit/rollback commands. Only having the option of having a set of commands that are issued independently strikes me as “incomplete” somehow. I’d love a formalization of a more solid theoretical foundation for encapsulating the changes done to a database/”set of relations” as well as transaction management into a singe unit, and in practice end up with perhaps a command I use to feed data into like I can read it from a result set. “MERGE” is somewhat closer here, but it’s “not quite there” relative to what I’m thinking of.
Also:
“…mindset, which places the burden of computation on reads rather than writes.”
I can’t say I understand that one, I wonder if it doesn’t show the shortcomings in the critics mindset when they’re reasoning about what they’re criticizing, rather than in that of the proponents. The “burden of computation” is a physical problem, the relational model is a logical model.
For instance in the case of queries that can be sped up with a materialized view one really can do the computation on write or at some other time before the read and just dump out the data in question when it’s queried. There are ways to hide the materialized view and present a “100% relational(tm)” view of the database that can be queried as normal, so that just like adding an index will cause the database to jump to the requested row instead of doing a full table scan when querying, so will adding a materialized view cause the database to jump to the exact data the user needs even if the query explicitly orders for instance a “sum”, so it doesn’t violate the relational mindset either.
I really loved reading your blog. It was very well authored and easy to understand. Unlike other blogs I have read which are really not that good. I also found your posts very interesting. In fact after reading, I had to go show it to my friend and she enjoyed it as well!