Experimental Thoughts » Logic
Ideas on Databases, Logic, and Language, by Jeff Davis
http://thoughts.davisjeff.com

Database for a Zoo: the problem and the solution
http://thoughts.davisjeff.com/2011/09/21/database-for-a-zoo-the-problem-and-the-solution/ (21 Sep 2011)

Let’s say you’re operating a zoo, and you have this simple constraint:

You can put many animals of the same type into a single cage; or distribute them among many cages; but you cannot mix animals of different types within a single cage.

This rule prevents, for example, assigning a zebra to live in the same cage as a lion. Simple, right?

How do you enforce it? Any ideas yet? Keep reading: I will present a solution that uses a generalization of the standard UNIQUE constraint.

(Don’t dismiss the problem too quickly. As with most simple-sounding problems, it’s a fairly general problem with many applications.)

First of all, let me say that, in one sense, it’s easy to solve: see if there are any animals already assigned to the cage, and if so, make sure they are the same type. That approach has two problems:

  1. You have to remember to do that each time. It’s extra code to maintain, possibly an extra round-trip, slightly annoying, and won’t work unless all access to the database goes through that code path.
  2. More subtly, the pattern “read, decide what to write, write” is prone to race conditions: another process may write after you read and before you write. Without excessive locking, this is hard to get right, and likely to pass tests during development before failing in production (see the sketch below).

[ Aside: if you use true serializability in PostgreSQL 9.1, that completely solves problem #2, but problem #1 remains. ]
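To make problem #2 concrete, here is a sketch of that check-then-insert pattern, written against the zoo table defined later in this post. This is purely illustrative, not a recommendation:

BEGIN;
-- step 1: read -- which animal types already occupy cage 2?
SELECT DISTINCT animal_type FROM zoo WHERE cage = 2;
-- step 2: decide -- the application checks that the result is empty or 'zebra'.
-- Meanwhile, a concurrent transaction can insert a lion into cage 2;
-- under the default READ COMMITTED isolation, neither transaction notices.
-- step 3: write
INSERT INTO zoo VALUES ('Zeta', 'zebra', 2);
COMMIT;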

Those are exactly the kinds of problems that a DBMS is meant to solve. But what to do? Unique indexes don’t seem to solve the problem very directly, and neither do foreign keys. I believe that they can be combined to solve the problem by using two unique indexes, a foreign key, and an extra table, but that sounds painful (perhaps someone else has a simpler way to accomplish this with SQL standard features?). Row locking and triggers might be an alternative, but also not a very clean solution.

A better solution exists in PostgreSQL 9.1 using Exclusion Constraints (Exclusion Constraints were introduced in 9.0, but this solution requires the slightly-more-powerful version in 9.1). If you have never seen an Exclusion Constraint before, I suggest reading a previous post of mine.

Exclusion Constraints have the following semantics (copied from documentation link above):

The EXCLUDE clause defines an exclusion constraint, which guarantees that if any two rows are compared on the specified column(s) or expression(s) using the specified operator(s), not all of these comparisons will return TRUE. If all of the specified operators test for equality, this is equivalent to a UNIQUE constraint…

First, as a prerequisite, we need to install btree_gist into our database (make sure you have the contrib package itself installed first):

CREATE EXTENSION btree_gist;

Now, we can use an exclude constraint like so:

CREATE TABLE zoo
(
  animal_name TEXT,
  animal_type TEXT,
  cage        INTEGER,
  UNIQUE      (animal_name),
  EXCLUDE USING gist (animal_type WITH <>, cage WITH =)
);

Working from the definition above, what does this exclusion constraint mean? If any two tuples in the relation are ever compared (let’s call these TupleA and TupleB), then the following will never evaluate to TRUE:

TupleA.animal_type <> TupleB.animal_type AND
TupleA.cage        =  TupleB.cage

[ Observe how this would be equivalent to a UNIQUE constraint if both operators were "=". The trick is that we can use a different operator -- in this case, "<>" (not equals). ]

Results: 

=> insert into zoo values('Zap', 'zebra', 1);
INSERT 0 1
=> insert into zoo values('Larry', 'lion', 2);
INSERT 0 1
=> insert into zoo values('Zachary', 'zebra', 1);
INSERT 0 1
=> insert into zoo values('Zeta', 'zebra', 2);
ERROR:  conflicting key value violates exclusion constraint "zoo_animal_type_cage_excl"
DETAIL:  Key (animal_type, cage)=(zebra, 2) conflicts with existing key (animal_type, cage)=(lion, 2).
=> insert into zoo values('Zeta', 'zebra', 3);
INSERT 0 1
=> insert into zoo values('Lenny', 'lion', 2);
INSERT 0 1
=> insert into zoo values('Lance', 'lion', 1);
ERROR:  conflicting key value violates exclusion constraint "zoo_animal_type_cage_excl"
DETAIL:  Key (animal_type, cage)=(lion, 1) conflicts with existing key (animal_type, cage)=(zebra, 1).
=> select * from zoo order by cage;
 animal_name | animal_type | cage
-------------+-------------+------
 Zap         | zebra       |    1
 Zachary     | zebra       |    1
 Larry       | lion        |    2
 Lenny       | lion        |    2
 Zeta        | zebra       |    3
(5 rows)
And that is precisely the constraint that we need to enforce! Two properties are worth calling out:
  1. The constraint is declarative, so you don’t have to deal with different access paths to the database or different versions of the code. The mere fact that the constraint exists means that PostgreSQL will guarantee it.
  2. The constraint is also immune from race conditions — as are all EXCLUDE constraints — because, again, PostgreSQL guarantees it.

Those are nice properties to have; used properly, they will reduce overall application complexity and improve robustness.

Why PostgreSQL Already Has Query Hints
http://thoughts.davisjeff.com/2011/02/05/why-postgresql-already-has-query-hints/ (5 Feb 2011)

This is a counterpoint to Josh’s recent post: Why PostgreSQL Doesn’t Have Query Hints. I don’t really disagree, except that I think that there are many different definitions of “hints” floating around, leading to a lot of confusion. I could subtitle this post “More Terminology Confusion” after my previous entry.

So, let’s pick a reasonable definition: “hints are some mechanism to influence the SQL planner to choose a better plan”. Why did I choose that definition? Because it’s the actual use case. If a user encounters a bad plan, or an unstable plan, they need a way to get it to choose a better plan. There’s plenty of room to argue about the right way to do that and the wrong way, but almost every DBMS allows some form of hints. Including PostgreSQL.

Here are a few planner variables you can tweak (out of many):

  • enable_seqscan
  • enable_mergejoin
  • enable_indexscan

Not specific enough for you? Well, you can try plantuner to pick or forbid specific indexes.

Want to enforce join order? Try setting from_collapse_limit.

Want to get even more specific? You can set the selectivity of individual operators.
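As a simple illustration of the first kind of knob, a session-local toggle is often enough to test whether the planner’s choice is actually the problem (the table here is hypothetical):

-- discourage sequential scans for this session only
SET enable_seqscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- put the planner back to its default behavior
RESET enable_seqscan;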

There is a philosophical difference between PostgreSQL’s approach and that of many other systems. In PostgreSQL, you are encouraged to specify costs and selectivities rather than exact plans. There are good reasons for that, such as the sheer number of possible plans for even moderately complex queries (as Josh points out). Additionally, specifying exact plans tends to lead you into exactly the type of trouble you are trying to avoid by specifying hints in the first place: after input cardinalities change, the previous plan may now be a very poor one.

PostgreSQL clearly has a set of mechanisms that could be called “hints”. It turns out that there are actually quite a lot of ways to control the plan in Postgres; but they generally aren’t recommended except as a solution to a specific problem someone posts to the performance list. That is part of the PostgreSQL culture: a bit like getting a prescription from a doctor, so that the doctor can see the whole picture, help you look for alternative solutions, and weigh the side effects of the treatment against the benefits. I’m exaggerating, of course — these tweaks are documented (well, most of them), and anyone can use them; you just won’t hear them shouted from the rooftops as recommendations.

Except in this post, I suppose, which you should use at your own risk.

Temporal PostgreSQL Roadmap
http://thoughts.davisjeff.com/2010/03/09/temporal-postgresql-roadmap/ (9 Mar 2010)

Why are temporal extensions in PostgreSQL important? Quite simply, managing time data is one of the most common requirements, and current general-purpose database systems don’t provide us with the basic tools to do it. Every general-purpose DBMS falls short both in terms of usability and performance when trying to manage temporal data.

What is already done?

  • PERIOD data type, which can represent anchored intervals of time; that is, a chunk of time with a definite beginning and a definite end (in contrast to a SQL INTERVAL, which is not anchored to any specific beginning or end time).
    • Critical for usability because it acts as a set of time, so you can easily test for containment and other operations without using awkward constructs like BETWEEN or lots of comparisons (and keeping track of inclusivity/exclusivity of boundary points).
    • Critical for performance because you can index the values for efficient “contains” and “overlaps” queries (among others).
  • Temporal Keys (implemented as Exclusion Constraints, which will be available in the next release of PostgreSQL, 9.0), which can enforce the constraint that no two periods of time (usually for a given resource, like a person) overlap; a sketch follows this list. See the documentation (look for the word “EXCLUDE”), and see my previous articles (part 1 and part 2) on the subject.
    • Critical for usability to avoid procedural, error-prone hacks to enforce the constraint with triggers or by splitting time into big chunks.
    • Critical for performance because it performs comparably to a UNIQUE index, unlike the other procedural hacks which are generally too slow to use for most real systems.
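To give a flavor of it, a temporal key might look like the following. Note that the PERIOD type and its overlaps operator (&&) come from the external temporal project, not core PostgreSQL, so treat the names as illustrative:

CREATE TABLE reservation
(
  room   TEXT,
  during PERIOD,
  -- forbid two rows with the same room and overlapping periods
  EXCLUDE USING gist (room WITH =, during WITH &&)
);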

What needs to be done?

  • Range Types — Aside from PERIOD, which is based on TIMESTAMPTZ, it would also be useful to have very similar types based on, for example, DATE. It doesn’t stop there, so the natural conclusion is to generalize PERIOD into “range types” which could be based on almost any subtype.
  • Range Keys, Foreign Range Keys — If Range Types are known to the Postgres engine, that means that we can have syntactic sugar for range keys (like temporal keys, except for any range type), etc., that would internally use Exclusion Constraints.
  • Range Join — If Range Types are known to the Postgres engine, there could be syntactic sugar for a “range join,” that is, a join based on “overlaps” rather than “equals”. More importantly, there could be a new join type, a Range Merge Join, that could perform this join efficiently (without a Range Merge Join, a range join would always be a nested loop join); a sketch follows this list.
  • Simple table logs — The ability to easily create an effective “audit log” or similar trigger-based table log, that can record changes and be efficiently queried for historical state or state changes.
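Even without the syntactic sugar, you can write a range join today using the overlaps operator; the hypothetical query below joins bookings to staffed shifts wherever their periods overlap (again assuming the PERIOD type from the temporal project):

-- every (booking, shift) pair whose time ranges overlap
SELECT b.room, s.employee
FROM booking b
JOIN shift s ON b.during && s.during;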

I’ll be speaking on this subject (specifically, the new Exclusion Constraints feature) at the upcoming PostgreSQL Conference EAST 2010 (my talk description) in Philadelphia later this month and at PGCon 2010 (my talk description) in Ottawa this May. In the past, these conferences and others have been a great place to get ideas and help me move the temporal features forward.

The existing features have been picking up a little steam lately. The temporal-general mailing list has some traffic now — fairly low, but enough that others contribute to the discussions, which is a great start. I’ve also received some great feedback from a number of people, including the folks at PGX. There’s still a ways to go before we have all the features we want, but progress is being made.

Scalability and the Relational Model
http://thoughts.davisjeff.com/2010/03/07/scalability-and-the-relational-model/ (7 Mar 2010)

The relational model is just a way to represent reality. It happens to have some very useful properties, such as closure over many useful operations — but it’s a purely logical model of reality. You can implement relational operations using hash joins, MapReduce, or pen and paper.

So, right away, it’s meaningless to talk about the scalability of the relational model. Given a particular question, it might be difficult to break it down into bite-sized pieces and distribute it to N worker nodes. But going with MapReduce doesn’t solve that scalability problem — it just means that you will have a hard time defining a useful map or reduce operation, or you will have to change the question into something easier to answer.

There may exist scalability problems in:

  • SQL, which defines requirements outside the scope of the relational model, such as ACID properties and transactional semantics.
  • Traditional architectures and implementations of SQL, such as the “table is a file” equivalence, lack of sophisticated types, etc.
  • Particular implementations of SQL — e.g. “MySQL can’t do it, so the relational model doesn’t scale”.

Why are these distinctions important? As with many debates, terminology confusion is at the core, and prevents us from dealing with the problems directly. If SQL is defined in a way that causes scalability problems, we need to identify precisely those requirements that cause a problem, so that we can proceed forward without losing all that has been gained. If the traditional architectures are not suitable for some important use-cases, they need to be adapted. If some particular implementations are not suitable, developers need to switch or demand that it be made competitive.

The NoSQL movement (or at least the hype surrounding it) is far too disorganized to make real progress. Usually, incremental progress is best; and sometimes a fresh start is best, after drawing on years of lessons learned. But it’s never a good idea to start over with complete disregard for the past. For instance, an article from Digg starts off great:

The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes.

That’s good because he blames it on the mindset, not the model, and then identifies a specific problem. But then the article completely falls flat:

Computing the intersection with a JOIN is much too slow in MySQL, so we have to do it in PHP.

A join is faster in PHP than MySQL? Why bother even discussing SQL versus NoSQL if your particular implementation of SQL — MySQL — can’t even do a hash join, the exact operation that you need? Particularly when almost every other implementation can (including PostgreSQL)? That kind of reasoning won’t lead to solutions.

So, where do we go from here?

  1. Separate the SQL model from the other requirements (some of which may limit scalability) when discussing improvements.
  2. Improve the SQL model (my readers know that I’ve criticized SQL’s logical problems many times in the past).
  3. Improve the implementations of SQL, particularly how tables are physically stored.
  4. If you’re particularly ambitious, come up with a relational alternative to SQL that takes into account what’s been learned after decades of SQL and can become the next general-purpose DBMS language.

EDIT 2010-03-09: I should have cited Josh Berkus’s talk on Relational vs. Non-Relational (complete list of PGX talks), which was part of the inspiration for this post.

Temporal Keys, Part 1
http://thoughts.davisjeff.com/2009/11/01/temporal-keys-part-1/ (1 Nov 2009)

“Schedule conflict” — it’s one of the simplest and most common constraints for a business or any other organization. One person cannot be in two places at the same time; and in many cases only a single person can use a given resource at any time.

Does your database system know anything about a schedule conflict? Should it?

Constraints are always enforced at some point; the only questions are when, how, and at what cost the violation is corrected.

Consider two professors who try to reserve the same lecture hall at overlapping times (let’s say 1-2pm, and 1:30-2:30pm). Now imagine two possible resolutions:

  1. They are blissfully unaware of the schedule conflict until they have already promoted their lecture, attracted an audience, and arrived at the building. The professor scheduled for 1:30-2:30pm will be disappointed to find it already in use, and the students will be confused. The second lecture may have to be canceled due to the confusion, or lose a substantial number of attendees.
  2. At the time of scheduling, one professor is given an error message to the effect of “this room is already reserved, please choose a different time”.

Observe that neither one really solves the problem: the second resolution still forces one professor to choose a less-desirable time. However, it is a much cheaper resolution. As with many problems, this one is simplest and cheapest to resolve when it is detected early.
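For concreteness, those two bookings really do overlap, and standard SQL can already tell you so for a single pair of intervals (a minimal illustration):

SELECT (TIMESTAMP '2009-11-02 13:00', TIMESTAMP '2009-11-02 14:00')
       OVERLAPS
       (TIMESTAMP '2009-11-02 13:30', TIMESTAMP '2009-11-02 14:30');
-- returns true; the hard part is enforcing non-overlap across all rows, concurrently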

Of course, there are many resolutions that fall between the first and the second. For instance, you might run a program every hour that checks for conflicts and alerts the parties involved. That may work, but there are still problems:

  • You’ve now introduced uncertainty into the system: do I have a reservation or not? If there is a conflict later, will I be kicked out or will they?
  • You have to come up with rules for resolution: does the first one win? How do you define “first” if transactions are running for a while? What if someone makes a reservation “first” but then makes all kinds of changes later; were they really first or did they just game the system?
  • If someone’s reservation gets bumped, you now need a strange new state for a reservation, in which it is disabled, the organizer has (hopefully) been notified, and it needs to change.
  • Notice that everything is very procedural and specific to the business. You have to have a notification mechanism, and rules for how to resolve it.
  • Let’s go back to the definition of “first”: say you have made a reservation, and you get bumped by the conflict detector. In between the time you made the reservation and the time you were notified of a conflict, someone else reserved your second choice. Are you now first in line for that time slot? If so, that has a cascading effect such that it’s almost impossible for the person that took the second-choice slot to know that they are at risk of being bumped.

These thought experiments might seem like edge cases, but reservation systems have two common traits that make these problems very real:

  1. They tend to start allowing reservations of a specific resource at a specific time, published in advance.
  2. There tend to be some resources and time slots that have a much higher value than the others.

These two traits lead to heavy contention.

Now, how would you go about enforcing such a constraint (non-overlapping time periods) in the database? While considering possible solutions, think about:

  • Does the solution work for a wide variety of use cases, or only in special cases?
  • How well would it perform, under low contention and high contention, and under varied workloads?
  • Can it be implemented in a general purpose RDBMS, like PostgreSQL?
  • Is it procedural in nature, or declarative?

I think it’s worthwhile to consider the problem, so I will end this article now, and provide some approaches, as well as my real answer, in the next article. In the meantime, feel free to post ideas as comments.

What is the deal with NULLs?
http://thoughts.davisjeff.com/2009/08/02/what-is-the-deal-with-nulls/ (2 Aug 2009)

A recent thread on pgsql-hackers warrants some more extensive discussion. In the past, I’ve criticized NULL semantics, but in this post I’d just like to explain some corner cases that I think you’ll find interesting, and try to straighten out some myths and misconceptions.

First off, I’m strictly discussing SQL NULL here. SQL NULL is peculiar in a number of ways, and the general excuse for this is that there is a need to represent “missing information” — which may be true. But there are lots of ways to represent missing information, as I pointed out in a previous post, and SQL’s approach to missing information is, well, “unique”.

  • “NULL is not a value” — If you hear this one, beware: it’s in direct contradiction to the SQL standard, which uses the phrase “null value” dozens of times. It’s hard to imagine that NULL is not any kind of value at all; because it’s routinely passed to functions and operators, predicates can evaluate to NULL, and SQL uses a kind of three-valued logic (3VL) in some contexts. The phrase “NULL is not a value” also raises the question: “what is it, then?”.
  • NULL means “unknown” (i.e. the third truth value) — This doesn’t hold up either. SUM of no tuples returns NULL, but clearly the SUM of no tuples is not unknown! SQL will happily generate NULLs from aggregates or outer joins without any NULLs at all in the database. Do you not know something you did know before, or do you now know that you don’t know something that you didn’t know you didn’t know before? Also, if NULL means “unknown”, how do you differentiate a boolean field for which you do not know the value, and a boolean field for which you do know the value, and it happens to be “unknown” (perhaps this is why boolean columns are a PostgreSQL extension and not part of the core SQL standard)?
  • “NULL is false-like” — Don’t think of NULL as false-like, or “more false than true”. It’s a tempting rule of thumb, but it’s misleading. For instance, in a WHERE clause, a NULL predicate is treated like FALSE. However, in a constraint (like a CHECK constraint), NULL is treated like TRUE. Perhaps most importantly, when in a 3VL context (like a boolean expression), this misconception leads to problems when you try to invert the logic, e.g., use NOT.
  • “Oh, that makes sense” — When you see individual behaviors of NULL, they look systematic, and your brain quickly sees a pattern and extrapolates what might happen in other situations. Often, that extrapolation is wrong, because NULL semantics are a mix of behaviors. I think the best way to think about NULL is as a Frankenstein monster of several philosophies and systems stitched together by a series of special cases.
  • p OR NOT p — Everyone should know that this is not always true in SQL. But most people tend to reason assuming that this is always true, so you have to be very careful, and work against your intuition very deliberately, in order to form a correct SQL query.
  • SUM() versus + (addition) — SUM is not repeated addition. SUM of 1 and NULL is 1, but 1 + NULL is NULL.
  • Aggregates ignore NULLs — According to the standard, aggregates are supposed to ignore NULLs, because the information is missing. But why is it OK to ignore the missing information in an aggregate, but not, say, with the + operator? Is it really OK to just ignore it?
  • Aggregates return NULL — According to the standard, aggregates are supposed to return NULL when they have no non-NULL input. Just because you don’t have any input tuples, does that really mean that the result is undefined, missing, or unknown? It’s certainly not unknown! What about SUM over zero tuples, wouldn’t the most useful result be zero?
  • SQL breaks its own rules — The aforementioned aggregate rules don’t work very well for COUNT(), the simplest of all aggregates. So, they have two versions of count: COUNT(*) breaks the “aggregates ignore nulls” rule and the “aggregates return null” rule, and COUNT(x) only breaks the latter. But wait! There’s more: ARRAY_AGG() breaks the former but not the latter. But no exception is made for SUM — it still returns NULL when there are no input tuples.
  • NULLs appear even when you have complete information — Because of OUTER JOIN and aggregates, NULLs can appear even when you don’t have any NULLs in your database! As a thought experiment, try to reconcile this fact with the various “definitions” of NULL.
  • WHERE NOT IN (SELECT ...) — This one gets everyone at one point or another. If the subselect produces any NULLs, then NOT IN can only evaluate to FALSE or NULL, and in a WHERE clause both are treated the same way: you get no tuples. You are less likely to have a bunch of NULLs in your data while testing, so chances are everything will work great until you get into production.
  • x >= 10 or x <= 10 — Not a tautology in SQL.
  • x IS NULL AND x IS DISTINCT FROM NULL — You probably don’t know this, but this expression can evaluate to TRUE! That is, it does when x is ROW(NULL).
  • NOT x IS NULL is not the same as x IS NOT NULL — If x is ROW(1,NULL), then the former will evaluate to TRUE, and the latter will evaluate to FALSE. Enjoy.
  • NOT x IS NULL AND NOT x IS NOT NULL — Want to know if you have a value like ROW(1, NULL)? To distinguish it from NULL and also from values like ROW(1,1) and ROW(NULL,NULL), this expression might help you.
  • NULLs can exist inside some things, but not others — If you concatenate: firstname || mi || lastname, and “mi” happens to be null, the entire result will be null. So strings cannot contain a NULL, but as we see above, a record can.
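All of these are easy to check at a psql prompt. Here are a few of the cases above, with the results I would expect from PostgreSQL:

-- SUM over zero tuples is NULL, not 0
SELECT SUM(x) FROM (SELECT 1 WHERE false) AS t(x);
-- ...but + does not ignore NULL the way SUM does
SELECT 1 + NULL;
-- COUNT(*) breaks the “aggregates return NULL” rule: this returns 0
SELECT COUNT(*) FROM (SELECT 1 WHERE false) AS t(x);
-- a record containing a NULL is neither NULL nor NOT NULL: false, false
SELECT (ROW(1, NULL) IS NULL), (ROW(1, NULL) IS NOT NULL);
-- one NULL in the subselect filters out every row
SELECT * FROM (VALUES (1), (2)) AS t(x) WHERE x NOT IN (SELECT NULL::int);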

I believe the above shows, beyond a reasonable doubt, that NULL semantics are unintuitive, and if viewed according to most of the “standard explanations,” highly inconsistent. This may seem minor; that is, if you’re writing SQL you can overcome these things with training. But it is not minor, because NULL semantics are designed to make you think you understand them, and think that the semantics are intuitive, and think that it’s part of some ingenious consistent system for managing missing information. But none of those things are true.

I have seen lots of discussions about NULL in various forums and mailing lists. Many of the participants are obviously intelligent and experienced, and yet make bold statements that are, quite simply, false. I’m writing this article to make two important points:

  1. There is a good case to be made that NULL semantics are very counterproductive; as opposed to a simple “error early” system that forces you to write queries that explicitly account for missing information (e.g. with COALESCE; see the example after this list). “Error early” is a more mainstream approach, similar to null pointers in Java or None in Python. If you want compile-time checking, you can use a construct like Maybe in Haskell. SQL attempts to pass along the problem, hoping the next operator will turn ignorance into knowledge — but it does not appear that anyone thought through this idea, quite frankly.
  2. You should not attempt to apply your intellect to NULL, it will lead you in the wrong direction. If you need to understand it, understand it, but always treat it with skepticism. Test the queries, read the standard, do what you need to do, but do not attempt to extrapolate.
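In the explicit style, the treatment of missing information is visible in the query itself; for instance (table and column names hypothetical):

-- make the zero-tuples case explicit instead of letting NULL propagate
SELECT COALESCE(SUM(amount), 0) AS total
FROM payments
WHERE customer_id = 42;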