Temporal Keys, Part 1

“Schedule conflict” — it’s one of the simplest and most common constraints for business or any other organization. One person cannot be in two places at the same time; and in many cases a only a single person can use a given resource at any time.

Does your database system know anything about a schedule conflict? Should it?

Constraints are always enforced at some point, it’s just a matter of when, how, and the cost of correcting the situation.

Consider two professors who try to reserve the same lecture hall at overlapping times (let’s say 1-2pm, and 1:30-2:30pm). Now imagine two possible resolutions:

  1. They are blissfully unaware of the schedule conflict until they have already promoted their lecture, attracted an audience, and arrived at the building. The professor scheduled for 1:30-2:30pm will be disappointed to find it already in use, and the students will be confused. The second lecture may have to be canceled due to the confusion, or lose a substantial number of attendees.
  2. At the time of scheduling, one professor is given an error message to the effect of “this room is already reserved, please choose a different time”.

Observe that neither one really solves the problem: the second resolution still forces one professor to choose a less-desirable time. However, it is a much cheaper resolution. As with many problems, early detection is simplest and cheapest to resolve.

Of course, there are many resolutions that fall between the first and the second. For instance, you might run a program every hour that checks for conflicts and alerts the parties involved. That may work, but there are still problems:

  • You’ve now introduced uncertainty into the system: do I have a reservation or not? If there is a conflict later, will I be kicked out or will they?
  • You have to come up with rules for resolution: does the first one win? How do you define “first” if transactions are running for a while? What if someone makes a reservation “first” but then makes all kinds of changes later; were they really first or did they just game the system?
  • If someone’s reservation gets bumped, you have to now have a new strange state for a reservation, in which it is disabled, the organizer has (hopefully) been notified, and it needs to change.
  • Notice that everything is very procedural and specific to the business. You have to have a notification mechanism, and rules for how to resolve it.
  • Let’s go back to the definition of “first”: say you have made a reservation, and you get bumped by the conflict detector. In between the time you made the reservation and the time you were notified of a conflict, someone else reserved your second choice. Are you now first in line for that time slot? If so, that has a cascading effect such that it’s almost impossible for the person that took the second-choice slot to know that they are at risk of being bumped.

These thought experiments might seem like edge cases, but reservation systems have two common traits that make these problems very real:

  1. They tend to start allowing reservations of a specific resource at a specific time, published in advance.
  2. There tend to be some resources and time slots that have a much higher value than the others.

These two traits lead to heavy contention.

Now, how would you go about enforcing such a constraint (non-overlapping time periods) in the database? While considering possible solutions, think about:

  • Does the solution work for a wide variety of use cases, or only in special cases?
  • How well would it perform, under low contention and high contention, and under varied workloads?
  • Can it be implemented in a general purpose RDBMS, like PostgreSQL?
  • Is it procedural in nature, or declarative?

I think it’s worthwhile to consider the problem, so I will end this article now, and provide some approaches, as well as my real answer, in the next article. In the meantime, feel free to post ideas as comments.

13 thoughts on “Temporal Keys, Part 1

  1. I am patiently waiting for temporal keys to find there way into Postgresql. And I already have a project in mind for their use.

    The company I work for provides engineering and construction management services for petro-chemical and other process related industries.

    All of the services that we provide deal with management of change that will take place months or even years in the future. It is important to document how and when the process functions will change ( adding, changing, and removing) and how and when the instrumentation supporting these process functions will change ( added, reallocating, demolition). It is also important the concurrently running projects do not overlap the scope of work.

    Using temporal keys to track how the changes are made will provide a way to verify the management of change.

    • I am patiently waiting for temporal keys to find there way into Postgresql.

      Good to hear. I’ve been working on it, so hopefully we won’t have to be patient for too much longer (see my next post in a couple days).

      Do you have any tools, hacks, or procedures in place now to avoid problems? Does it just involve a lot of phone calls and calendars?

      • Mostly there are manual procedures in place to manage the change. Following these procedures has at least two problems: 1) using people to document the changes of thousands of instruments and process functions is prone to error; 2) the cost to procedurally manage these changes is huge. (i.e. On a project where 10 million dollars of equipment was purchased, 30 million dollars as spent for construction services, and 60 million was spend in engineering services and half of this was used to produce the documentation to manage the changes.)

        And even after all of this effort was expended to produce the required documentation to manage the construction changes, at least 10% of these instruments modifications resulted in scheduling conflicts that resulted in construction delays and additional down time. On review of these short-falls, it become evident that there currently is no mechanism in place to validate the individual time-lines of each instrument, process function, and the relationships between them.

        The other problem that has no tools, hacks, or procedures (yet) is managing the scopes of concurrently running project that have scope over the same areas.

  2. Disclaimer: I’m just an amateur DBA.

    Isn’t this more an UI problem than a data-structure one? As far as I can see, even a simple [starttime, endtime] construction can support this usecase, and the solution would be to add some “AJAX-y” feedback to the interface used to enter the information:

    When the user (professor) enters the intended schedule, it will be immediately checked for availability (before submission) and a visual feedback would be provided. Think of checking the username on signup for different websites. Requests would be still served on a first-come first-served basis, but the need to submit-wait-check if succeeded cycle would be greatly reduced.

    What do you think?

    • It true that you can validate the scheduling conflicts in the client application. However, there is value in having the database enforce these constraints. An over simplified example would be the enforcement of a table’s unique constraint. The client could enforce it, but life is so much easier when the database does it for us.

      Also in the case of temporal data, in this example we are not really modeling professors and lecturehalls. Rather we are modeling professortimelines and lecturehalltimelines. Having the database verify that our time-lines are continuous, having no gaps or overlapping segments has great value.

      Continuous time lines allow to schedule for the future but also allow us to examine the state of things has they existed in the past.

      • The client could enforce it, but life is so much easier when the database does it for us.

        That’s also a good point. Constraints should be declarative, and in the DBMS, to avoid mistakes and to be another line of defense against application bugs.

    • User 1:
      checks for conflict — no conflict
      User 2:
      checks for conflict — no conflict
      User 1:
      inserts [2009-01-01, 2009-01-03)
      User 2:
      inserts [2009-01-02, 2009-01-04)

      Now you have a problem. If you don’t care much about performance, you can serialize access to the table, so that user 2 has to wait for user 1. But if you do care about performance, you don’t want to make user 2 wait for user 1 unless there is a real potential for conflict.

  3. It seems to me that the logical underpinning of schedule conflicts is the idea that a single row can represent multiple items (possibly infinite) in the underlying set.

    Let’s think about integers instead of dates for a moment. Start with a simple table:

    create table single (
    name text unique not null,
    location integer unique not null
    );

    Here we have a relation between one integer and one name. No name can refer to more than one integer. No integer can have more than one name. I am intentionally not picking either one out as a primary key here.

    The next step is to allow names to refer to sets of integers:

    create table multiple (
    name text not null,
    location integer unique not null
    );

    Now each integer can belong to only one name, but a name can refer to multiple integers. Two things to note about this: there is no way to declare a constraint, should we wish to, that all of the integers for a given name be contiguous. Neither is this representation necessarily efficient for a system that’s trying to do that.

    Querying to find the name for a location is very straightforward (select * from multiple where location = x). Getting a set of all of the locations is straightforward as well (select location from multiple where name = x). Finding the bounds of the range is not straightforward, if it is represented in this manner.

    Next, what if we represent the range as a pair of columns?

    create table pair (
    name text not null,
    min_location integer not null,
    max_location integer not null
    );

    Now we have a different set of problems. It is not somewhat more complicated to query for the name given a location (select name from pair where min_location <= x

    (A final note on temporal data… does anybody know of a good system for handling missing date information? I’m not talking about “don’t know the date”, I mean “don’t have full date information”. Like, for a death date, year of death is known, but month and day is not. Or year and month are known, but day is not. Or more strangely, year and day are known but month is not. Having to fall back on multiple columns for this kind of thing and not having any kind of normal date behavior at all has always kind of gotten to me.)

    • You bring up a lot of great issues.

      Next, what if we represent the range as a pair of columns? … Now we have a different set of problems.

      You’re absolutely right, this has serious problems. The exclusivity is one, the searching is another, and constraints are the problem I was concerned with in the article.

      does anybody know of a good system for handling missing date information

      I recommend C.J. Date’s “Temporal Data and the Relational Model”. I think it will clarify a lot of these issues; I know it did for me.

  4. Agh. Less-than-sign-html-thing bit me. Re-reading this, I talked way too much, but here is the missing portion (from x in the above comment to the final paragraph that was included there) for completeness.

    *less than* max_location). There’s also a bit of a semantic issue: I’ve assumed a half-open interval here. One that includes the minimum location but does not include the maximum. This difference is not major for integers… as long as you decided up front which method you’re going to use. It will be a problem in a moment, though. Finding out the range of locations for a name is fairly simple (select min_location, max_location from pair where name = x). And the representation is nicely compact.

    The final down side is: we now need some external mechanism to represent the constraint that ranges may not overlap. The system can’t do it for us without some new machinery. Or at least, not terribly efficiently. We also need a somewhat rarer kind of index to allow us to actually *efficiently* get answers about what name goes with a given location.

    So we’ve won some efficiencies at the expense of formalism. We have, however, also lost some other efficiencies. That’s a pretty reasonable trade-off. But what did we really lose by going away from the formalism?

    Next step: real numbers.

    create table real_pair (
    name text unique not null,
    min_location real not null,
    max_location real not null
    );

    Now we have a more significant problem with the “half-open interval” thing. With real numbers, if we choose “half-open intervals” as our standard, there is absolutely no way to represent the range from 1.0 to 3.0 inclusive. It’s got to be 1.0 to 3.0 leaving out 3.0. That may or may not be a problem depending on your application.

    What if we were to try to use the original solution with reals?

    create table real_simple (
    name text not null,
    location real unique not null
    );

    Whoa. Now here’s a real problem. You can’t represent ranges here at all, because they contain an infinite number of locations. That’s… rather not good.

    Anyway, what this is all leading up to is this thought: the underlying relation involved in these examples is really one in which there are multiple (even infinite) points in the relation, representing a single conceptual entity. We represent this as ranges or as points in a constrained space (imagine that you only schedule on hourly intervals, and integers are hours) purely for the purpose of efficiency.

    But the most natural way to work with these relations is to imagine a system in which a single row can represent multiple (possibly infinite) points in the relation. Then you could have this definition:

    create table real_multiple (
    name text not null,
    location real unique multiple not null
    );

    The idea here is that one row in the database represents a set of rows in the underlying relation. Potentially this could represent a mixture of ranges and points, or ranges with points torn out of the middle, or various kinds of intervals, or… A lot of things. Even more, it gives some way to think about geometry as well:

    create table rect_multiple (
    name text unique not null,
    x real multiple not null,
    y real multiple not null,
    unique (x, y)
    );

    Of course it all gets complicated, but it’s interesting. And it results in additional kinds of constraints. Like “contiguous”, for example.

    Anyway, that’s my take on the underlying problem. I’ll be very interested to hear the more practical ideas that are actually getting implemented. ;>

    • we now need some external mechanism to represent the constraint that ranges may not overlap. The system can’t do it for us without some new machinery. Or at least, not terribly efficiently.

      Be patient, and keep an eye out for the next release of PostgreSQL ;)

      We also need a somewhat rarer kind of index to allow us to actually *efficiently* get answers about what name goes with a given location.

      You can already do that in PostgreSQL. For periods of time, you can use my PERIOD data type (which has GiST indexing support), and do searches on things like “contained by” or “overlaps”.

      http://pgfoundry.org/projects/temporal/

  5. I am waiting on this, as I find this kind of issue never stops coming up, no matter the project. A note on doing the checks in the business logic: I worked on one project where the data load on the servers was so high, they avoided most constraints except maybe for primary keys. And even on primary key columns they sometimes did not even declare those, preferring to just index them as the time taken to validate uniqueness, given the transaction volume, was prohibitive. And we’re talking DBs running on supercomputers! Just thought someone might find this perspective interesting. Personally I am very glad someone is working on this. I prefer the DB to do the work. :)

    • I worked on one project where the data load on the servers was so high, they avoided most constraints except maybe for primary keys.

      As I said in the post, constraints are always enforced, it’s just a matter of when, and at what cost. After all, it’s a business constraint, and would be present even if computers didn’t exist.

      In that particular application, it’s possible that they knew in advance that many of the constraints would not be violated, or at least the conditions under which they would be violated. That’s a lot of work to prove though — it’s so much easier to just declare it to the DBMS.

      I prefer the DB to do the work.

      I prefer that the DBMS does the work as well, because that usually has the lowest total cost of enforcement.