Experimental Thoughts (http://thoughts.davisjeff.com): Ideas on Databases, Logic, and Language, by Jeff Davis

Taking a step back from ORMs

Do object-relational mappers (ORMs) really improve application development?

When I started developing web applications, I used perl. Not even all of perl, mostly just a bunch of “if” statements and an occasional loop that happened to be valid perl (aside: I remember being surprised that I was allowed to write a loop that would run on a shared server, because “what if it didn’t terminate?!”). I didn’t use databases; I used a mix of files, regexes to parse them, and flock to control concurrency (not because of foresight or good engineering, but because I ran into concurrency-related corruption).

I then made the quantum leap to databases. I didn’t see the benefits instantaneously[1], but it was clearly a major shift in the way I developed applications.

Why was it a quantum leap? Well, many reasons, which are outside of the scope of this particular post, but which I’ll discuss more in the future. For now, I’ll just cite the overwhelming success of SQL over a long period of time; and the pain experienced by anyone who has built and maintained a few NoSQL applications[2].

I don’t think ORMs are a leap forward; they are just an indirection[3] between the application and the database. Although it seems like you could apply the same kind of “success” argument, it’s not the same. First of all, ORM users are a subset of SQL users, and I think there are a lot of SQL users that are perfectly content without an ORM. Second, many ORM users feel the need to “drop down” to the SQL level frequently to get the job done, which means you’re not really in new territory.

And ORMs do have a cost. Any tool that uses a lot of behind-the-scenes magic will cause a certain amount of trouble — just think for a moment about the number of lines of code between the application and the database (there and back), and imagine the subtle semantic problems that might arise.

To be more concrete: one of the really nice things about using a SQL DBMS is that you can easily query the database as though you were the application. So, if you are debugging the application, you can quickly see what’s going wrong by seeing what the application sees right before the bug is hit. But you quickly lose that ability when you muddy the waters with thousands of lines of code between the application error and the database[4]. I believe the importance of this point is vastly under-appreciated; it’s one of the reasons that I think a SQL DBMS is a quantum leap forward, and it applies to novices as well as experts.

A less-tangible cost to ORMs is that developers are tempted to remain ignorant of the SQL DBMS and the tools that it has to offer. All these features in a system like PostgreSQL are there to solve problems in the easiest way possible; they aren’t just “bloat”. Working with multiple data sources is routine in any business environment, but if you don’t know about support for foreign tables in PostgreSQL, you’re likely to waste a lot of time re-implementing similar functionality in the application. Cache invalidation (everything from memcache to statically-rendered HTML) is a common problem — do you know about LISTEN/NOTIFY? If your application involves scheduling, and you’re not using Temporal Keys, there is a good chance you are wasting development time and performance; and likely sacrificing correctness. The list goes on and on.
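
To make the cache-invalidation point concrete, LISTEN/NOTIFY lets the database announce changes to any interested client, which can then invalidate or re-render its cached copies. This is only a minimal sketch: the channel name and payload are invented for illustration (the optional payload requires PostgreSQL 9.0 or later).

-- the cache-maintaining process subscribes to a channel
LISTEN user_changes;

-- whatever modifies the data announces the change, for example from a
-- trigger or from application code right after the UPDATE
NOTIFY user_changes, 'users:42';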

Of course there are reasons why so many people use ORMs, at least for some things. A part of it is that application developers may think that learning SQL is harder than learning an ORM, which I think is misguided. But a more valid reason is that ORMs do help eliminate boilerplate in some common situations.

But are there simpler ways to avoid boilerplate? It seems like we should be able to do so without something as invasive as an ORM. For the sake of brevity, I’ll be using hashes rather than objects, but the principle is the same. The following examples are in ruby using the ‘pg’ gem (thanks Michael Granger for maintaining that gem!).

First, to retrieve records as a hash, it’s already built into the ‘pg’ gem. Just index into the result object, and you get a hash. No boilerplate there.

Second, to do an insert, there is a little boilerplate. You have to build a string (yuck), put in the right table name, make the proper field list (unless you happen to know the column ordinal positions, yuck again), and then put in the values. And if you add or change fields, you probably need to modify it. Oh, and be sure to avoid SQL injection!

Fortunately, once we’ve identified the boilerplate, it’s pretty easy to solve:

require 'pg'

# Insert the hash 'rec' into 'table', quoting identifiers and passing the
# values as query parameters. 'conn' is a PG::Connection object.
def sqlinsert(conn, table, rec)
  table     = conn.quote_ident(table)
  rkeys     = rec.keys.map{|k| conn.quote_ident(k.to_s)}.join(",")
  positions = (1..rec.keys.length).map{|i| "$" + i.to_s}.join(",")
  query     = "INSERT INTO #{table}(#{rkeys}) VALUES(#{positions})"
  conn.exec(query, rec.values)
end

The table and column names are properly quoted, and the values are passed in as parameters. And, if you add new columns to the table, the routine still works; you just end up with defaults for the unspecified columns.

I’m sure others can come up with other examples of boilerplate that would be nice to solve. But the goal is not perfection; we only need to do enough to make simple things simple. And I suspect that only requires a handful of such routines.

So, my proposal is this: take a step back from ORMs, and consider working more closely with SQL and a good database driver. Try to work with the database, and find out what it has to offer; don’t use layers of indirection to avoid knowing about the database. See what you like and don’t like about the process after an honest assessment, and whether ORMs are a real improvement or a distracting complication.

[1]: At the time, MySQL was under a commercial license, so I tried PostgreSQL shortly thereafter. I switched between the two for a while (after MySQL became GPL), and settled on PostgreSQL because it was much easier to use (particularly for date manipulation).

[2]: There may be valid reasons to use NoSQL, but I’m skeptical that “ease of use” is one of them.

[3]: Some people use the term “abstraction” to describe an ORM, but I think that’s misleading.

[4]: The ability to explore the data through an ORM from a REPL might resemble the experience of using SQL. But it’s not nearly as useful, and certainly not as easy: if you determine that the data is wrong in the database, you still need to figure out how it got that way, which again involves thousands of lines between the application code that requests a modification and the resulting database update.

SQL: the successful cousin of Haskell

Haskell is a very interesting language, and shows up on sites like http://programming.reddit.com frequently. It’s somewhat mind-bending, but very powerful and has some great theoretical advantages over other languages. I have been learning it on and off for some time, never really getting comfortable with it but being inspired by it nonetheless.

But discussion on sites like reddit usually falls a little flat when someone asks a question like:

If haskell has all these wonderful advantages, what amazing applications have been written with it?

The responses to that question usually aren’t very convincing, quite honestly.

But what if I told you there was a wildly successful language, in some ways the most successful language ever, and it could be characterized by:

  • lazy evaluation
  • declarative
  • type inference
  • immutable state
  • tightly controlled side effects
  • strict static typing

Surely that would be interesting to a Haskell programmer? Of course, I’m talking about SQL.

Now, it’s all falling into place. All of those theoretical advantages become practical when you’re talking about managing a lot of data over a long period of time, and trying to avoid making any mistakes along the way. Really, that’s what relational database systems are all about.

I speculate that SQL is so successful and pervasive that it stole the limelight from languages like haskell, because the tough problems that haskell would solve are already solved in so many cases. Application developers can hack up a SQL query and run it over 100M records in 7 tables, glance at the result, and turn it over to someone else with near certainty that it’s the right answer! Sure, if you have a poorly-designed schema and have all kinds of special cases, then the query might be wrong too. But if you have a mostly-sane schema and mostly know what you’re doing, you hardly even need to check the results before using the answer.

In other words, if the query compiles, and the result looks anything like what you were expecting (e.g. the right basic structure), then it’s probably correct. Sound familiar? That’s exactly what people say about haskell.

It would be great if haskell folks would get more involved in the database community. It looks like a lot of useful knowledge could be shared. Haskell folks would be in a better position to find out how to apply theory where it has already proven to be successful, and could work backward to find other good applications of that theory.

Competing directly in the web application space against languages like ruby and javascript is going to be an uphill battle even if haskell is better in that space. I’ve worked with some very good ruby developers, and I honestly couldn’t begin to tell them where haskell might be a practical advantage for web application development. Again, I don’t know much about haskell aside from the very basics. But if someone like me who is interested in haskell and made some attempt to understand it and read about it still cannot articulate a practical advantage, clearly there is some kind of a problem (either messaging or technical). And that’s a huge space for application development, so that’s a serious concern.

However, the data management space is also huge — a large fraction of those applications exist primarily to collect data or present data. So, if haskell folks could work with the database community to advance data management, I believe that would inspire a lot of interesting development.

Database for a Zoo: the problem and the solution

Let’s say you’re operating a zoo, and you have this simple constraint:

You can put many animals of the same type into a single cage; or distribute them among many cages; but you cannot mix animals of different types within a single cage.

This rule prevents, for example, assigning a zebra to live in the same cage as a lion. Simple, right?

How do you enforce it? Any ideas yet? Keep reading: I will present a solution that uses a generalization of the standard UNIQUE constraint.

(Don’t dismiss the problem too quickly. As with most simple-sounding problems, it’s a fairly general problem with many applications.)

First of all, let me say that, in one sense, it’s easy to solve: see if there are any animals already assigned to the cage, and if so, make sure they are the same type. That has two problems:

  1. You have to remember to do that each time. It’s extra code to maintain, possibly an extra round-trip, slightly annoying, and won’t work unless all access to the database goes through that code path.
  2. More subtly, the pattern “read, decide what to write, write” is prone to race conditions when another process writes after you read and before you write. Without excessive locking, solving this is hard to get right — and likely to pass tests during development before failing in production.

[ Aside: if you use true serializability in PostgreSQL 9.1, that completely solves problem #2, but problem #1 remains. ]

Those are exactly the kinds of problems that a DBMS is meant to solve. But what to do? Unique indexes don’t seem to solve the problem very directly, and neither do foreign keys. I believe that they can be combined to solve the problem by using two unique indexes, a foreign key, and an extra table, but that sounds painful (perhaps someone else has a simpler way to accomplish this with SQL standard features?). Row locking and triggers might be an alternative, but also not a very clean solution.
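
For the curious, here is roughly what that workaround could look like; this is only a sketch with invented table names, shown to make the comparison with the declarative solution below more vivid:

-- extra table: each cage is assigned exactly one animal type
CREATE TABLE cage_assignment
(
  cage        INTEGER PRIMARY KEY,
  animal_type TEXT NOT NULL,
  UNIQUE      (cage, animal_type)  -- target for the composite foreign key
);

-- every animal must match its cage's assigned type
CREATE TABLE animal
(
  animal_name TEXT UNIQUE,
  animal_type TEXT,
  cage        INTEGER,
  FOREIGN KEY (cage, animal_type) REFERENCES cage_assignment (cage, animal_type)
);

-- the pain: every cage must be registered in cage_assignment before use,
-- and changing or retiring a cage's type requires extra bookkeeping.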

A better solution exists in PostgreSQL 9.1 using Exclusion Constraints (Exclusion Constraints were introduced in 9.0, but this solution requires the slightly-more-powerful version in 9.1). If you have never seen an Exclusion Constraint before, I suggest reading a previous post of mine.

Exclusion Constraints have the following semantics (copied from documentation link above):

The EXCLUDE clause defines an exclusion constraint, which guarantees that if any two rows are compared on the specified column(s) or expression(s) using the specified operator(s), not all of these comparisons will return TRUE. If all of the specified operators test for equality, this is equivalent to a UNIQUE constraint…

First, as a prerequisite, we need to install btree_gist into our database (make sure you have the contrib package itself installed first):

CREATE EXTENSION btree_gist;

Now, we can use an exclude constraint like so:

CREATE TABLE zoo
(
  animal_name TEXT,
  animal_type TEXT,
  cage        INTEGER,
  UNIQUE      (animal_name),
  EXCLUDE USING gist (animal_type WITH <>, cage WITH =)
);

Working from the definition above, what does this exclusion constraint mean? If any two tuples in the relation are ever compared (let’s call these TupleA and TupleB), then the following will never evaluate to TRUE:

TupleA.animal_type <> TupleB.animal_type AND
TupleA.cage        =  TupleB.cage

[ Observe how this would be equivalent to a UNIQUE constraint if both operators were "=". The trick is that we can use a different operator -- in this case, "<>" (not equals). ]

Results: 

=> insert into zoo values('Zap', 'zebra', 1);
INSERT 0 1
=> insert into zoo values('Larry', 'lion', 2);
INSERT 0 1
=> insert into zoo values('Zachary', 'zebra', 1);
INSERT 0 1
=> insert into zoo values('Zeta', 'zebra', 2);
ERROR:  conflicting key value violates exclusion constraint "zoo_animal_type_cage_excl"
DETAIL:  Key (animal_type, cage)=(zebra, 2) conflicts with existing key (animal_type, cage)=(lion, 2).
=> insert into zoo values('Zeta', 'zebra', 3);
INSERT 0 1
=> insert into zoo values('Lenny', 'lion', 2);
INSERT 0 1
=> insert into zoo values('Lance', 'lion', 1);
ERROR:  conflicting key value violates exclusion constraint "zoo_animal_type_cage_excl"
DETAIL:  Key (animal_type, cage)=(lion, 1) conflicts with existing key (animal_type, cage)=(zebra, 1).
=> select * from zoo order by cage;
 animal_name | animal_type | cage
-------------+-------------+------
 Zap         | zebra       |    1
 Zachary     | zebra       |    1
 Larry       | lion        |    2
 Lenny       | lion        |    2
 Zeta        | zebra       |    3
(5 rows)
And that is precisely the constraint that we need to enforce!
  1. The constraint is declarative so you don’t have to deal with different access paths to the database or different versions of the code. Merely the fact that the constraint exists means that PostgreSQL will guarantee it.
  2. The constraint is also immune from race conditions — as are all EXCLUDE constraints — because again, PostgreSQL guarantees it.

Those are nice properties to have, and if used properly, will simplify the overall application complexity and improve robustness.

Exclusion Constraints are generalized SQL UNIQUE

Say you are writing an online reservation system. The first requirement you’ll encounter is that no two reservations may overlap (i.e. no schedule conflicts). But how do you prevent that?

It’s worth thinking about your solution carefully. My claim is that no existing SQL DBMS has a good solution to this problem before PostgreSQL 9.0, which has just been released. This new release includes a feature called Exclusion Constraints (authored by me), which offers a good solution to a class of problems that includes the “schedule conflict” problem.

I previously wrote a two part series (Part 1 and Part 2) on this topic. Chances are that you’ve run into a problem similar to this at one time or another, and these articles will show you the various solutions that people usually employ in the real world, and the serious problems and limitations of those approaches.

The rest of this article will be a brief introduction to Exclusion Constraints to get you started using a much better approach.

First, install PostgreSQL 9.0 (the installation instructions are outside the scope of this article), and launch psql.

Then, install two modules: “temporal” (which provides the PERIOD data type and associated operators) and “btree_gist” (which provides btree functionality via GiST).

Before installing these modules, make sure that PostgreSQL 9.0 is installed and that the 9.0 pg_config is in your PATH environment variable. Also, $SHAREDIR means the directory listed when you run pg_config --sharedir.

To install Temporal PostgreSQL:

  1. download the tarball
  2. unpack the tarball, go into the directory, and type “make install”
  3. In psql, type: \i $SHAREDIR/contrib/period.sql

To install BTree GiST (these directions assume you installed from source, some packages may help here, like Ubuntu’s “postgresql-contrib” package):

  1. Go to the PostgreSQL source “contrib” directory, go to btree_gist, and type “make install”.
  2. In psql, type: \i $SHAREDIR/contrib/btree_gist.sql

Now that you have those modules installed, let’s start off with some basic Exclusion Constraints:

DROP TABLE IF EXISTS a;
CREATE TABLE a(i int);
ALTER TABLE a ADD EXCLUDE (i WITH =);

That is identical to a UNIQUE constraint on a.i, except that it uses the Exclusion Constraints mechanism; it even uses a normal BTree to enforce it. The performance will be slightly worse because of some micro-optimizations for UNIQUE constraint, but only slightly, and the performance characteristics should be the same (it’s just as scalable). Most importantly, it behaves the same under high concurrency as a UNIQUE constraint, so you don’t have to worry about excessive locking. If one person inserts 5, that will prevent other transactions from inserting 5 concurrently, but will not interfere with a transaction inserting 6.

Let’s take apart the syntax a little. The normal BTree is the default, so that’s omitted. The (i WITH =) is the interesting part, of course. It means that one tuple TUP1 conflicts with another tuple TUP2 if TUP1.i = TUP2.i. No two tuples may exist in the table if they conflict. In other words, there are no two tuples TUP1 and TUP2 in the table, such that TUP1.i = TUP2.i. That’s the very definition of UNIQUE, so that shows the equivalence. NULLs are always permitted, just like with UNIQUE constraints.
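
To see the equivalent behavior for yourself, the following is what you would expect with the table above (a sketch: the exact constraint name in the error message is generated by PostgreSQL, so it is omitted here):

INSERT INTO a VALUES (5);     -- ok
INSERT INTO a VALUES (6);     -- ok, 6 does not conflict with 5
INSERT INTO a VALUES (5);     -- ERROR: conflicting key value violates exclusion constraint
INSERT INTO a VALUES (NULL);  -- ok, NULLs never conflict
INSERT INTO a VALUES (NULL);  -- ok, not even with each other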

Now, let’s see if they hold up for multi-column constraints:

DROP TABLE IF EXISTS a;
CREATE TABLE a(i int, j int);
ALTER TABLE a ADD EXCLUDE (i WITH =, j WITH =);

The conditions for a conflicting tuple are ANDed together, just like UNIQUE. So now, in order for two tuples to conflict, TUP1.i = TUP2.i AND TUP1.j = TUP2.j. This is strictly a more permissive constraint, because conflicts require both conditions to be met. Therefore, this is identical to a UNIQUE constraint on (a.i, a.j).

What can we do that UNIQUE can’t? Well, for starters we can use something other than a normal BTree, such as Hash or GiST (for the moment, GIN is not supported, but that’s only because GIN doesn’t support the full index AM API; amgettuple in particular):

DROP TABLE IF EXISTS a;
CREATE TABLE a(i int, j int);
ALTER TABLE a ADD EXCLUDE USING gist (i WITH =, j WITH =);
-- alternatively using hash, which doesn't support
-- multi-column indexes at all
ALTER TABLE a ADD EXCLUDE USING hash (i WITH =);

So now we can do UNIQUE constraints using hash or gist. But that’s not a real benefit, because a normal btree is probably the most efficient way to support that, anyway (Hash may be in the future, but for the moment it doesn’t use WAL, which is a major disadvantage).

The difference really comes from the ability to change the operator to something other than “=”. It can be any operator that is:

  • Commutative
  • Boolean
  • Searchable by the given index access method (e.g. btree, hash, gist).

For BTree and Hash, the only operator that meets those criteria is “=”. But many data types (including PERIOD, CIRCLE, BOX, etc.) support lots of interesting operators that are searchable using GiST. For instance, “overlaps” (&&).

Ok, now we are getting somewhere. It’s impossible to specify the constraint that no two tuples contain values that overlap with each other using a UNIQUE constraint; but it is possible to specify such a constraint with an Exclusion Constraint! Let’s try it out.

DROP TABLE IF EXISTS b;
CREATE TABLE b (p PERIOD);
ALTER TABLE b ADD EXCLUDE USING gist (p WITH &&);
INSERT INTO b VALUES('[2009-01-05, 2009-01-10)');
INSERT INTO b VALUES('[2009-01-07, 2009-01-12)'); -- causes ERROR

Now, try out various combinations (including COMMITs and ABORTs), and try with concurrent sessions also trying to insert values. You’ll notice that potential conflicts cause transactions to wait on each other (like with UNIQUE) but non-conflicting transactions proceed unhindered. A lot better than LOCK TABLE, to say the least.

To be useful in a real situation, let’s make sure that the semantics extend nicely to a more complete problem. In reality, you generally have several exclusive resources in play, such as people, rooms, and time. But out of those, “overlaps” really only makes sense for time (in most situations). So we need to mix these concepts a little.

CREATE TABLE reservation(room TEXT, professor TEXT, during PERIOD);

-- enforce the constraint that the room is not double-booked
ALTER TABLE reservation
    ADD EXCLUDE USING gist
    (room WITH =, during WITH &&);

-- enforce the constraint that the professor is not double-booked
ALTER TABLE reservation
    ADD EXCLUDE USING gist
    (professor WITH =, during WITH &&);

Notice that we actually need to enforce two constraints, which is expected because there are two time-exclusive resources: professors and rooms. Multiple constraints on a table are ORed together, in the sense that an ERROR occurs if any constraint is violated. For the academic readers out there, this means that exclusion constraint conflicts are specified in disjunctive normal form (consistent with UNIQUE constraints).

The semantics of Exclusion Constraints extend in a clean way to support this mix of atomic resources (rooms, people) and resource ranges (time). Try it out, again with a mix of concurrency, commits, aborts, conflicting and non-conflicting reservations.

Exclusion constraints allow solving this class of problems quickly (in a couple lines of SQL) in a way that’s safe, robust, generally useful across many applications in many situations, and with higher performance and better scalability than other solutions.

Additionally, Exclusion Constraints support all of the advanced features you’d expect from a system like PostgreSQL 9.0: deferrability, applying the constraint to only a subset of the table (allows a WHERE clause), or using functions/expressions in place of column references.
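
A rough sketch of how those features combine, using the reservation table above (the expression, the predicate, and the constraint timing here are invented for illustration, not taken from a real schema):

ALTER TABLE reservation
    ADD EXCLUDE USING gist ((lower(room)) WITH =, during WITH &&)
    WHERE (room <> 'virtual')        -- only physical rooms participate
    DEFERRABLE INITIALLY IMMEDIATE;  -- can be deferred within a transaction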

Flexible Schemas and PostgreSQL

First, what is a “flexible schema”? It’s hard to pin down an exact definition, but it’s used to mean a data model that permits changes in application data structures without needing to migrate old data or incur other administrative hassles.

That’s a worthwhile goal. Applications often grow organically, especially in the early, exploratory stages of development. For example, you may decide to track when a user last did something on the website, so that you can adapt news and notices for those users (e.g. “Did you know that we added feature XYZ since you last visited?”). Developers have a need to produce a prototype quickly to work out the edge cases (do we update that timestamp for all actions, or only certain ones?), and probably a need to put it in production so that the users can benefit sooner.

A common worry is that ALTER TABLE will be a major performance problem. That’s sometimes a problem, but in PostgreSQL, you can add a column to a table in constant time (not dependent on the size of the table) in most situations. I don’t think this is a good reason to avoid ALTER TABLE, at least in PostgreSQL (other systems may impose a greater burden).

There are good reasons to avoid ALTER TABLE, however. We’ve only defined one use case for this new “last updated” field, and it’s a fairly loose definition. If we use ALTER TABLE as a first reaction for tracking any new application state, we’d end up with lots of columns with overlapping meanings (all subtly different), and it would be challenging to keep them consistent with each other. More importantly, adding new columns without thinking through the meaning and the data migration strategy will surely cause confusion and bugs. For example, if you see the following table:

    CREATE TABLE users
    (
      name         TEXT,
      email        TEXT,
      ...,
      last_updated TIMESTAMPTZ
    );

you might (reasonably) assume that the following query makes sense:

    SELECT * FROM users
      WHERE last_updated < NOW() - '1 month'::INTERVAL;

Can you spot the problem? Old user records (before the ALTER TABLE) will have NULL for last_updated timestamps, and will not satisfy the WHERE condition even though they intuitively qualify. There are two parts to the problem:

  1. The presence of the last_updated field fools the author of the SQL query into making assumptions about the data, because it seems so simple on the surface.
  2. NULL semantics allow the query to be executed even without complete information, leading to a wrong result.

Let’s try changing the table definition:

    CREATE TABLE users
    (
      name       TEXT,
      email      TEXT,
      ...,
      properties HSTORE
    );

HSTORE is a set of key/value pairs. Some tuples might have the last_updated key in the properties attribute, and others may not. This accomplishes two things:

  1. There’s no need for ALTER TABLE or cluttering of the namespace with a lot of nullable columns.
  2. The name “properties” is vague enough that query writers would (hopefully) be on their guard, understanding that not all records will share the same properties.

You could still write the same (wrong) query against the second table with minor modification. Nothing has fundamentally changed. But we are using a different development strategy that’s easy on application developers during rapid development cycles, yet does not leave a series of pitfalls for users of the data. When a certain property becomes universally recorded and has a concrete meaning, you can plan a real data migration to turn it into a relation attribute instead.
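
For instance, the “same (wrong) query with minor modification” might look something like this; a sketch that assumes the hstore contrib module is installed, and that still silently drops rows that never got the key, which is exactly the pitfall the vague column name should make you stop and think about:

    SELECT * FROM users
      WHERE (properties -> 'last_updated')::TIMESTAMPTZ
              < NOW() - '1 month'::INTERVAL;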

Now, we need some guiding principles about when to use a complex type to represent complex information, and when to use separate columns in the table. To maximize utility and minimize confusion, I believe the best guiding principle is the meaning of the data you’re storing across all tuples. When defining the attributes of a relation, if you find yourself using vague nouns such as “properties,” or resorting to complex qualifications (lots of “if/then” branching in your definition), consider less constrained data types like HSTORE. Otherwise, it’s best to nail down the meaning in terms of appropriate nouns, which will help keep the DBMS smart and queries simple (and correct). See Choosing Data Types and further guidance in reference [1].

I believe there are three reasons why application developers feel that relational schemas are “inflexible”:

  1. A reliance on NULL semantics to make things “magically work,” when in reality, it just makes queries succeed that should fail. See my previous posts: None, nil, Nothing, undef, NA, and SQL NULL and What is the deal with NULLs?.
  2. The SQL database industry has avoided interesting types, like HSTORE, for a long time. See my previous post: Choosing Data Types.
  3. ORMs make a fundamental false equivalence between an object attribute and a table column. There is a relationship between the two, of course; but they are simply not the same thing. This is a direct consequence of “The First Great Blunder”[2].

EDIT: I found a more concise way to express my fundamental point — During the early stages of application development, we only vaguely understand our data. The most important rule of database design is that the database should represent reality, not what we wish reality was like. Therefore, a database should be able to express that vagueness, and later be made more precise when we understand our data better. None of this should be read to imply that constraints are less important or that we need not understand our data. These ideas mostly apply only at very early stages of development, and even then, prudent use of constraints often makes that development much faster.

[1] Date, C.J.; Darwen, Hugh (2007). Databases, Types, and the Relational Model. pp. 377-380 (Appendix B, “A Design Dilemma”).

[2] Date, C.J. (2000). An Introduction To Database Systems, p. 865.

Temporal PostgreSQL Roadmap

Why are temporal extensions in PostgreSQL important? Quite simply, managing time data is one of the most common requirements, and current general-purpose database systems don’t provide us with the basic tools to do it. Every general-purpose DBMS falls short both in terms of usability and performance when trying to manage temporal data.

What is already done?

  • PERIOD data type, which can represent anchored intervals of time; that is, a chunk of time with a definite beginning and a definite end (in contrast to a SQL INTERVAL, which is not anchored to any specific beginning or end time).
    • Critical for usability because it acts as a set of time, so you can easily test for containment and other operations without using awkward constructs like BETWEEN or lots of comparisons (and keeping track of inclusivity/exclusivity of boundary points).
    • Critical for performance because you can index the values for efficient “contains” and “overlaps” queries (among others).
  • Temporal Keys (called Exclusion Constraints, which will be available in the next release of PostgreSQL, 9.0), which can enforce the constraint that no two periods of time (usually for a given resource, like a person) overlap. See the documentation (look for the word “EXCLUDE”), and see my previous articles (part 1 and part 2) on the subject.
    • Critical for usability to avoid procedural, error-prone hacks to enforce the constraint with triggers or by splitting time into big chunks.
    • Critical for performance because it performs comparably to a UNIQUE index, unlike the other procedural hacks which are generally too slow to use for most real systems.

What needs to be done?

  • Range Types — Aside from PERIOD, which is based on TIMESTAMPTZ, it would also be useful to have very similar types based on, for example, DATE. It doesn’t stop there, so the natural conclusion is to generalize PERIOD into “range types” which could be based on almost any subtype.
  • Range Keys, Foreign Range Keys — If Range Types are known to the Postgres engine, that means that we can have syntactic sugar for range keys (like temporal keys, except for any range type), etc., that would internally use Exclusion Constraints.
  • Range Join — If Range Types are known to the Postgres engine, there could be syntactic sugar for a “range join,” that is, a join based on “overlaps” rather than “equals”. More importantly, there could be a new join type, a Range Merge Join, that could perform this join efficiently (without a Range Merge Join, a range join would always be a nested loop join).
  • Simple table logs — The ability to easily create an effective “audit log” or similar trigger-based table log, that can record changes and be efficiently queried for historical state or state changes.

I’ll be speaking on this subject (specifically, the new Exclusion Constraints feature) in the upcoming PostgreSQL Conference EAST 2010 (my talk description) in Philadelphia later this month and PGCon 2010 (my talk description) in Ottawa this May. In the past, these conferences and others have been a great place to get ideas and help me move the temporal features forward.

The existing features have been picking up a little steam lately. The temporal-general mailing list has some traffic now — fairly low, but enough that others contribute to the discussions, which is a great start. I’ve also received some great feedback from a number of people, including the folks at PGX. There’s still a ways to go before we have all the features we want, but progress is being made.

Scalability and the Relational Model

The relational model is just a way to represent reality. It happens to have some very useful properties, such as closure over many useful operations — but it’s a purely logical model of reality. You can implement relational operations using hash joins, MapReduce, or pen and paper.

So, right away, it’s meaningless to talk about the scalability of the relational model. Given a particular question, it might be difficult to break it down into bite-sized pieces and distribute it to N worker nodes. But going with MapReduce doesn’t solve that scalability problem — it just means that you will have a hard time defining a useful map or reduce operation, or you will have to change the question into something easier to answer.

There may exist scalability problems in:

  • SQL, which defines requirements outside the scope of the relational model, such as ACID properties and transactional semantics.
  • Traditional architectures and implementations of SQL, such as the “table is a file” equivalence, lack of sophisticated types, etc.
  • Particular implementations of SQL — e.g. “MySQL can’t do it, so the relational model doesn’t scale”.

Why are these distinctions important? As with many debates, terminology confusion is at the core, and prevents us from dealing with the problems directly. If SQL is defined in a way that causes scalability problems, we need to identify precisely those requirements that cause a problem, so that we can proceed forward without losing all that has been gained. If the traditional architectures are not suitable for some important use-cases, they need to be adapted. If some particular implementations are not suitable, developers need to switch or demand that it be made competitive.

The NoSQL movement (or at least the hype surrounding it) is far too disorganized to make real progress. Usually, incremental progress is best; and sometimes a fresh start is best, after drawing on years of lessons learned. But it’s never a good idea to start over with complete disregard for the past. For instance, an article from Digg starts off great:

The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes.

That’s good because he blames it on the mindset not the model, and then identifies a specific problem. But then the article completely falls flat:

Computing the intersection with a JOIN is much too slow in MySQL, so we have to do it in PHP.

A join is faster in PHP than MySQL? Why bother even discussing SQL versus NoSQL if your particular implementation of SQL — MySQL — can’t even do a hash join, the exact operation that you need? Particularly when almost every other implementation can (including PostgreSQL)? That kind of reasoning won’t lead to solutions.

So, where do we go from here?

  1. Separate the SQL model from the other requirements (some of which may limit scalability) when discussing improvements.
  2. Improve the SQL model (my readers know that I’ve criticized SQL’s logical problems many times in the past).
  3. Improve the implementations of SQL, particularly how tables are physically stored.
  4. If you’re particularly ambitious, come up with a relational alternative to SQL that takes into account what’s been learned after decades of SQL and can become the next general-purpose DBMS language.

EDIT 2010-03-09: I should have cited Josh Berkus’s talk on Relational vs. Non-Relational (complete list of PGX talks), which was part of the inspiration for this post.

Temporal Keys, Part 2

In the last article, I argued that:

  • A schedule conflict is a typical business problem.
  • The later you try to resolve a schedule conflict, the more costly it is to resolve.
  • In particular, there is a big jump in the cost the moment after conflicting data is recorded.
  • Therefore, it’s best for the DBMS itself to enforce the constraint, because only the DBMS can avoid the conflict effectively before the conflict is recorded.

Then, I opened up a discussion to see how people are dealing with these schedule conflicts. In the comments I received at the end of the article, as well as other anecdotes from conferences, user groups, mailing lists, and my own experience, the solutions fall into a few categories:

  • The rate of conflicts is so low that the costs are not important. For instance, you may make 0.1% of your customers unhappy, and need to refund them, but perhaps that’s a cost you’re willing to pay.
  • The application receives so few requests that performance is not an object, and serialization of all requests is a viable option. The serialization is done using big locks and a read-check-write cycle. Even if performance is not an object, these applications sometimes run into maintenance problems or unexpected outages because of the big locks required.
  • You can break the time slices into manageable chunks, e.g. one day chunks aligned at midnight. This kind of solution is highly specific to the business, reduces the flexibility of the business, and often requires a substantial amount of custom, error-prone procedural code.
  • Complex procedural code: usually a mix of application code, functions in the DBMS, row-level locking, static data in tables that only exists for the purposes of row-level locks, etc. This kind of solution is generally very specific to the application and the business, requires lots of very error-prone custom procedural code, is difficult to adequately test, and it’s hard to understand what’s going on in the system at any given time. Hunting down sporadic performance problems would be a nightmare.

Those solutions just aren’t good enough. We use relational database systems because they are smart, declarative, generally useful for many problems, and maintainable (Note: these principles contrast with NoSQL, which is moving in the opposite direction — more on that in another article).

[UPDATE: The following project has been committed for the next release of PostgreSQL; the feature is now called "Exclusion Constraints"; and the new version of PostgreSQL will be called 9.0 (not 8.5). See the documentation under the heading "EXCLUDE".]

The project that I’ve been working on for PostgreSQL 8.5 is called “Operator Exclusion Constraints“. These are a new type of constraint that most closely resembles the UNIQUE constraint, because one tuple can preclude the existence of other tuples. With a UNIQUE constraint on attribute A of a table with attributes (A, B, C), the existence of the tuple (5, 6, 7) precludes the existence of any tuple (5, _, _) in that table at the same time. This is different from a foreign key, which requires the existence of a tuple in another table; and different from a CHECK constraint which rejects tuples independently from any other tuple in any table (and the same goes for NOT NULL).

The same semantics as a UNIQUE constraint can be easily specified as an Operator Exclusion Constraint, with a minor performance penalty at insert time (one additional index search, usually only touching pages that are already in cache). Exclusion constraints are more general than UNIQUE, however. For instance, with a complex type such as CIRCLE, you can specify that no two circles in a table overlap — which is a constraint that is impossible to specify otherwise (without resorting to the poor solutions mentioned above).
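
For example, a sketch of the “no two circles overlap” constraint (the table is hypothetical; && is the built-in “overlaps” operator for the geometric types, and GiST can search it):

CREATE TABLE regions (c CIRCLE);
ALTER TABLE regions ADD EXCLUDE USING gist (c WITH &&);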

This applies to temporal keys very nicely. First, get the PERIOD data type, which gives you a better way to work with periods of time (sets of time, really), rather than points in time. Second, you need to install the btree_gist contrib module. Then, use an exclusion constraint like so:

[UPDATE 2010-03-09: Syntax updated to reflect the version of this project committed for PostgreSQL 9.0. ]

CREATE TABLE room_reservation
(
  name   TEXT,
  room   TEXT,
  during PERIOD,
  EXCLUDE USING gist (room WITH =, during WITH &&)
);

That will prevent two reservations on the same room from overlapping. There are a few pieces to this that require explanation:

  • && is the “overlaps” operator for the PERIOD data type.
  • USING gist tells PostgreSQL what kind of index to create to enforce this constraint. The operators must map to search strategies for this index method, and searching for overlapping periods requires a GiST index.
  • Because we are using GiST, we need GiST support for equality searches for the TEXT data type, which is the reason we need the btree_gist contrib module.
  • Conflicts will only occur if two tuples have equal room numbers, and overlapping periods of time for the reservation.
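
Trying it out might look like the following (a sketch: the reservation values are invented, and it assumes the PERIOD literal format shown elsewhere extends naturally to times of day):

INSERT INTO room_reservation VALUES
    ('Alice', '101A', '[2009-11-09 10:00, 2009-11-09 11:00)');
INSERT INTO room_reservation VALUES
    ('Bob',   '101A', '[2009-11-09 10:30, 2009-11-09 11:30)');
-- ERROR:  conflicting key value violates exclusion constraint
INSERT INTO room_reservation VALUES
    ('Bob',   '101B', '[2009-11-09 10:30, 2009-11-09 11:30)');
-- ok: a different room, so the overlapping time does not matter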

This solution:

  • Performs well under light and heavy contention. Not quite as well as a UNIQUE constraint, but much better than the alternatives, and without the surprises you might get from using big locks. Note that the constraint will be enforced at some point, so ignoring the problem is not a high-performance alternative (interpersonal communication has higher latency than a computer).
  • Is declarative. The implementation shows through a little bit — the user will know that an index is being used, for instance — but it’s a relatively simple declaration. As a consequence, it’s not very error-prone from the schema designer’s standpoint.
  • Is not specific to the business. You don’t have to decide on an appropriate time slice (e.g. one hour, one day, etc.); you don’t have to try to partition locks in creative ways; you don’t have to write procedural code (in the database system or application); and you don’t have to come up with interesting ways to detect a conflict or notify the user.

Temporal keys are just one part of the support required for effective temporal data management inside the DBMS. However, it’s one of the most important pieces that requires support from the core engine, and cannot be implemented as a module.

PostgreSQL WEST and Temporal Databases

I’ve been interested in temporal data and relational databases for quite some time. There are going to be at least two people talking about temporal data at PostgreSQL WEST in Seattle: Scott Bailey and me. See the talk descriptions.

In the past, I’ve worked on a temporal extension to PostgreSQL that implements the PERIOD data type. This is a data type that offers both a definite beginning and a definite end time, which is important for describing things that happen over a period of time, rather than instantaneously. Trying to use separate attributes for “start” and “end” is bad for a number of reasons, and will certainly be addressed in a subsequent blog entry. For now, I’ll just say that I believe the PERIOD data type is fundamentally important for handling all kinds of time data, which I believe is a common problem.

At WEST, I’ll be presenting my progress on temporal keys. Temporal keys are used to prevent overlapping periods of time — a schedule conflict — by using an index and following the same concurrency behavior as UNIQUE with minimal performance cost (one extra index search, to be precise).

Temporal keys cannot be expressed in PostgreSQL 8.4, unless you resort to triggers and a full table lock (ouch!). So, additional backend support is required. This is accomplished in my patch for operator exclusion constraints, which are a more general way of using arbitrary operators and index searches to enforce a constraint. I plan to do what’s required for the patch to be accepted in PostgreSQL 8.5.

Temporal modeling is a common problem. It seems like almost every PostgreSQL conference has had at least one talk on the matter, so we know there is some demand for improvement. If you’re interested, I hope you come to WEST and chat with Scott or me, and let’s see if we can come up with some real solutions.

What is the deal with NULLs?

A recent thread on pgsql-hackers warrants some more extensive discussion. In the past, I’ve criticized NULL semantics, but in this post I’d just like to explain some corner cases that I think you’ll find interesting, and try to straighten out some myths and misconceptions.

First off, I’m strictly discussing SQL NULL here. SQL NULL is peculiar in a number of ways, and the general excuse for this is that there is a need to represent “missing information” — which may be true. But there are lots of ways to represent missing information, as I pointed out in a previous post, and SQL’s approach to missing information is, well, “unique”.

  • “NULL is not a value” — If you hear this one, beware: it’s in direct contradiction to the SQL standard, which uses the phrase “null value” dozens of times. It’s hard to imagine that NULL is not any kind of value at all, because it’s routinely passed to functions and operators, predicates can evaluate to NULL, and SQL uses a kind of three-valued logic (3VL) in some contexts. The phrase “NULL is not a value” also raises the question: “what is it, then?”.
  • NULL means “unknown” (i.e. the third truth value) — This doesn’t hold up either. SUM of no tuples returns NULL, but clearly the SUM of no tuples is not unknown! SQL will happily generate NULLs from aggregates or outer joins without any NULLs at all in the database. Do you not know something you did know before, or do you now know that you don’t know something that you didn’t know you didn’t know before? Also, if NULL means “unknown”, how do you differentiate a boolean field for which you do not know the value, and a boolean field for which you do know the value, and it happens to be “unknown” (perhaps this is why boolean columns are a PostgreSQL extension and not part of the core SQL standard)?
  • “NULL is false-like” — Don’t think of NULL as false-like, or “more false than true”. It’s a tempting rule of thumb, but it’s misleading. For instance, in a WHERE clause, a NULL predicate is treated like FALSE. However, in a constraint (like a CHECK constraint), NULL is treated like TRUE. Perhaps most importantly, when in a 3VL context (like a boolean expression), this misconception leads to problems when you try to invert the logic, e.g., use NOT.
  • “Oh, that makes sense” — When you see individual behaviors of NULL, they look systematic, and your brain quickly sees a pattern and extrapolates what might happen in other situations. Often, that extrapolation is wrong, because NULL semantics are a mix of behaviors. I think the best way to think about NULL is as a Frankenstein monster of several philosophies and systems stitched together by a series of special cases.
  • p OR NOT p — Everyone should know that this is not always true in SQL. But most people tend to reason assuming that this is always true, so you have to be very careful, and work against your intuition very deliberately, in order to form a correct SQL query.
  • SUM() versus + (addition) — SUM is not repeated addition. SUM of 1 and NULL is 1, but 1 + NULL is NULL.
  • Aggregates ignore NULLs — According to the standard, aggregates are supposed to ignore NULLs, because the information is missing. But why is it OK to ignore the missing information in an aggregate, but not, say, with the + operator? Is it really OK to just ignore it?
  • Aggregates return NULL — According to the standard, aggregates are supposed to return NULL when they have no non-NULL input. Just because you don’t have any input tuples, does that really mean that the result is undefined, missing, or unknown? It’s certainly not unknown! What about SUM over zero tuples, wouldn’t the most useful result be zero?
  • SQL breaks its own rules — The aforementioned aggregate rules don’t work very well for COUNT(), the simplest of all aggregates. So, they have two versions of count: COUNT(*) breaks the “aggregates ignore nulls” rule and the “aggregates return null” rule, and COUNT(x) only breaks the latter. But wait! There’s more: ARRAY_AGG() breaks the former but not the latter. But no exception is made for SUM — it still returns NULL when there are no input tuples.
  • NULLs appear even when you have complete information — Because of OUTER JOIN and aggregates, NULLs can appear even when you don’t have any NULLs in your database! As a thought experiment, try to reconcile this fact with the various “definitions” of NULL.
  • WHERE NOT IN (SELECT ...) — This one gets everyone at one point or another. If the subselect produces any NULLs, then NOT IN can only evaluate to FALSE or NULL, meaning you get no tuples. Because it’s in a WHERE clause, it will return no results. You are less likely to have a bunch of NULLs in your data while testing, so chances are everything will work great until you get into production.
  • x >= 10 or x <= 10 — Not a tautology in SQL.
  • x IS NULL AND x IS DISTINCT FROM NULL — You probably don’t know this, but this expression can evaluate to TRUE! That is, if x = ROW(NULL).
  • NOT x IS NULL is not the same as x IS NOT NULL — If x is ROW(1,NULL), then the former will evaluate to TRUE, and the latter will evaluate to FALSE. Enjoy.
  • NOT x IS NULL AND NOT x IS NOT NULL — Want to know if you have a value like ROW(1, NULL)? To distinguish it from NULL and also from values like ROW(1,1) and ROW(NULL,NULL), this expression might help you.
  • NULLs can exist inside some things, but not others — If you concatenate: firstname || mi || lastname, and “mi” happens to be null, the entire result will be null. So strings cannot contain a NULL, but as we see above, a record can.
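
A few of the cases above, as they look in psql (a sketch; each comment shows the result to expect):

SELECT SUM(x) FROM (VALUES (1), (NULL)) AS t(x);  -- 1; the NULL is ignored
SELECT 1 + NULL;                                  -- NULL
SELECT 1 WHERE 1 NOT IN (VALUES (2), (NULL));     -- no rows at all
SELECT ROW(1, NULL) IS NULL,                      -- false
       ROW(1, NULL) IS NOT NULL,                  -- false
       NOT ROW(1, NULL) IS NULL;                  -- true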

I believe the above shows, beyond a reasonable doubt, that NULL semantics are unintuitive, and if viewed according to most of the “standard explanations,” highly inconsistent. This may seem minor; that is, if you’re writing SQL you can overcome these things with training. But it is not minor, because NULL semantics are designed to make you think you understand them, and think that the semantics are intuitive, and think that it’s part of some ingenious consistent system for managing missing information. But none of those things are true.

I have seen lots of discussions about NULL in various forums and mailing lists. Many of the participants are obviously intelligent and experienced, and yet make bold statements that are, quite simply, false. I’m writing this article to make two important points:

  1. There is a good case to be made that NULL semantics are very counterproductive; as opposed to a simple “error early” system that forces you to write queries that explicitly account for missing information (e.g. with COALESCE). “Error early” is a more mainstream approach, similar to null pointers in java or None in python. If you want compile-time checking, you can use a construct like Maybe in haskell. SQL attempts to pass along the problem, hoping the next operator will turn ignorance into knowledge — but it does not appear that anyone thought through this idea, quite frankly.
  2. You should not attempt to apply your intellect to NULL, it will lead you in the wrong direction. If you need to understand it, understand it, but always treat it with skepticism. Test the queries, read the standard, do what you need to do, but do not attempt to extrapolate.