Why DBMSs are so complex | Experimental Thoughts

Developing a database application means converting your data into types understood by the DBMS, generating SQL code, sending it to the DBMS for evaluation, retrieving the result, and converting the result into types understood by your application language. It’s even more strange when you use parameterized SQL: you generate some code with placeholders instead of literals, and then you send the literals separately (by “more strange,” I certainly don’t mean “worse,” — parameterized SQL is a huge improvement).

If this design were suggested for any purpose other than interacting with a DBMS, others would react with immediate (and justifiable) suspicion. The use of “eval” is strongly avoided, and many popular languages don’t even support it. So why is this design so widely accepted for database management systems?

The reason is a combination of the rich semantics of a language like SQL that supports constraints, arbitrary relational expressions, isolated transactions with rollback, etc.; and the centralized nature of a DBMS. It’s not an unsolvable problem, but it’s often approached from the wrong direction.

The wrong approach, which includes all persistence layers and most ORMs, use object-oriented design to manage data (by generating simple store-and-retrieve SQL code) instead of the relational model; and mash together transient application data with permanent facts that should be made available to other application instances. This approach considerably increases complexity unless the data is highly independent (that is, one application instance doesn’t care what another instance is doing). See my previous post for more details.

Additionally, this wrong approach usually leaves developers wanting when they need richer semantics to handle interdependent data, like constraints that aren’t feasible to check from application code. In response, these projects put complexity back into the system, and the end result is that the complexity is just moved around (but increased overall). Developers usually jump from one bad persistence layer or bad ORM to the next in a series, none of which live up to the promises of simpler database application development.

A much better approach is that used by the SQLAlchemy project (Python) or the Sequel project (Ruby). An excerpt from the SQLAlchemy website: “SQLAlchemy doesn’t view databases as just collections of tables; it sees them as relational algebra engines.” I would have said “centralized relational algebra engines,” but either way, the statement shows a fundamentally different philosophy that I think will be successful in the long run.

The idea is to allow the developers to use arbitrary relational expressions in their application language, and cleanly separates transient application data structures from the permanent facts that should be made available to other application instances. The complex relational expressions in the application language are then translated to equivalent SQL — not simple store-and-retrieve, but real SQL expressions — and then passed to the DBMS which can then optimize and execute the expressions, and the results can be returned and understood by the application language.

This kind of development makes me wonder why we can’t generate a parse tree, serialize it, and send it directly to the DBMS. It would be easier to generate the parse trees, and yet still allow all of the optimization opportunities. Parameterized SQL is a partial implementation of this idea, and has been very successful.

Another big issue that needs to be tackled is to match the type systems between the application language and the DBMS. PostgreSQL is lightyears ahead of other DBMSs when it comes to types. There’s a great set of built in types like text, numeric, bytea, timestamptz — most database management systems don’t even get text right! With these good types come good operators like regular expression matching, date/time math, etc. But the real power comes from user-defined types, which are just as powerful as native types, and there is a coherent infrastructure for defining the type, defining operators, and allowing highly sophisticated index strategies for those operators.

The ideal situation would be to have all of the types in your application language (including the operators on those types) available in your DBMS as well. This means that there should be a character string type for Java programmers (because Java strings are strings of characters) and a byte string type for Ruby programmers (because Ruby strings are strings of bytes). NULLs of course may cause some difficulty because I am not aware of any non-SQL language that supports anything close to a SQL NULL.

I’m optimistic about the direction things are going for database application development. Using real relational expressions in the application language appears to be catching on, and as those projects mature, they will be looking more and more at PostgreSQL for its extensible type system.

2 thoughts on “Why DBMSs are so complex”

Pingback: None, nil, Nothing, undef, NA, and SQL NULL « Experimental Thoughts
Live Bands in London on November 24, 2009 at 02:58 said:

The most basic thing about DBMS is that all its concepts should be clear. This makes our life much more easier as we have clear ideas about it.