Comments on: Temporal Keys, Part 1

By: Jeff Davis

Jeff Davis — Sun, 07 Mar 2010 19:02:26 +0000

I worked on one project where the data load on the servers was so high, they avoided most constraints except maybe for primary keys.

As I said in the post, constraints are always enforced, it’s just a matter of when, and at what cost. After all, it’s a business constraint, and would be present even if computers didn’t exist.

In that particular application, it’s possible that they knew in advance that many of the constraints would not be violated, or at least the conditions under which they would be violated. That’s a lot of work to prove though — it’s so much easier to just declare it to the DBMS.

I prefer the DB to do the work.

I prefer that the DBMS does the work as well, because that usually has the lowest total cost of enforcement.

By: BillR

BillR — Thu, 21 Jan 2010 02:50:46 +0000

I am waiting on this, as I find this kind of issue never stops coming up, no matter the project. A note on doing the checks in the business logic: I worked on one project where the data load on the servers was so high, they avoided most constraints except maybe for primary keys. And even on primary key columns they sometimes did not even declare those, preferring to just index them as the time taken to validate uniqueness, given the transaction volume, was prohibitive. And we’re talking DBs running on supercomputers! Just thought someone might find this perspective interesting. Personally I am very glad someone is working on this. I prefer the DB to do the work.

By: Jeff Davis

Jeff Davis — Mon, 02 Nov 2009 18:21:10 +0000

we now need some external mechanism to represent the constraint that ranges may not overlap. The system can’t do it for us without some new machinery. Or at least, not terribly efficiently.

Be patient, and keep an eye out for the next release of PostgreSQL

We also need a somewhat rarer kind of index to allow us to actually *efficiently* get answers about what name goes with a given location.

You can already do that in PostgreSQL. For periods of time, you can use my PERIOD data type (which has GiST indexing support), and do searches on things like “contained by” or “overlaps”.

http://pgfoundry.org/projects/temporal/

By: Jeff Davis

Jeff Davis — Mon, 02 Nov 2009 18:16:54 +0000

You bring up a lot of great issues.

Next, what if we represent the range as a pair of columns? … Now we have a different set of problems.

You’re absolutely right, this has serious problems. The exclusivity is one, the searching is another, and constraints are the problem I was concerned with in the article.

does anybody know of a good system for handling missing date information

I recommend C.J. Date’s “Temporal Data and the Relational Model”. I think it will clarify a lot of these issues; I know it did for me.

By: Jeff Davis

Jeff Davis — Mon, 02 Nov 2009 18:05:31 +0000

The client could enforce it, but life is so much easier when the database does it for us.

That’s also a good point. Constraints should be declarative, and in the DBMS, to avoid mistakes and to be another line of defense against application bugs.

By: Jeff Davis

Jeff Davis — Mon, 02 Nov 2009 18:02:22 +0000

User 1:
checks for conflict — no conflict
User 2:
checks for conflict — no conflict
User 1:
inserts [2009-01-01, 2009-01-03)
User 2:
inserts [2009-01-02, 2009-01-04)

Now you have a problem. If you don’t care much about performance, you can serialize access to the table, so that user 2 has to wait for user 1. But if you do care about performance, you don’t want to make user 2 wait for user 1 unless there is a real potential for conflict.

By: J. Prevost

J. Prevost — Mon, 02 Nov 2009 15:59:46 +0000

Agh. Less-than-sign-html-thing bit me. Re-reading this, I talked way too much, but here is the missing portion (from x in the above comment to the final paragraph that was included there) for completeness.

*less than* max_location). There’s also a bit of a semantic issue: I’ve assumed a half-open interval here. One that includes the minimum location but does not include the maximum. This difference is not major for integers… as long as you decided up front which method you’re going to use. It will be a problem in a moment, though. Finding out the range of locations for a name is fairly simple (select min_location, max_location from pair where name = x). And the representation is nicely compact.

The final down side is: we now need some external mechanism to represent the constraint that ranges may not overlap. The system can’t do it for us without some new machinery. Or at least, not terribly efficiently. We also need a somewhat rarer kind of index to allow us to actually *efficiently* get answers about what name goes with a given location.

So we’ve won some efficiencies at the expense of formalism. We have, however, also lost some other efficiencies. That’s a pretty reasonable trade-off. But what did we really lose by going away from the formalism?

Next step: real numbers.

create table real_pair (
name text unique not null,
min_location real not null,
max_location real not null
);

Now we have a more significant problem with the “half-open interval” thing. With real numbers, if we choose “half-open intervals” as our standard, there is absolutely no way to represent the range from 1.0 to 3.0 inclusive. It’s got to be 1.0 to 3.0 leaving out 3.0. That may or may not be a problem depending on your application.

What if we were to try to use the original solution with reals?

create table real_simple (
name text not null,
location real unique not null
);

Whoa. Now here’s a real problem. You can’t represent ranges here at all, because they contain an infinite number of locations. That’s… rather not good.

Anyway, what this is all leading up to is this thought: the underlying relation involved in these examples is really one in which there are multiple (even infinite) points in the relation, representing a single conceptual entity. We represent this as ranges or as points in a constrained space (imagine that you only schedule on hourly intervals, and integers are hours) purely for the purpose of efficiency.

But the most natural way to work with these relations is to imagine a system in which a single row can represent multiple (possibly infinite) points in the relation. Then you could have this definition:

create table real_multiple (
name text not null,
location real unique multiple not null
);

The idea here is that one row in the database represents a set of rows in the underlying relation. Potentially this could represent a mixture of ranges and points, or ranges with points torn out of the middle, or various kinds of intervals, or… A lot of things. Even more, it gives some way to think about geometry as well:

create table rect_multiple (
name text unique not null,
x real multiple not null,
y real multiple not null,
unique (x, y)
);

Of course it all gets complicated, but it’s interesting. And it results in additional kinds of constraints. Like “contiguous”, for example.

Anyway, that’s my take on the underlying problem. I’ll be very interested to hear the more practical ideas that are actually getting implemented. ;>

By: J. Prevost

J. Prevost — Mon, 02 Nov 2009 15:56:06 +0000

It seems to me that the logical underpinning of schedule conflicts is the idea that a single row can represent multiple items (possibly infinite) in the underlying set.

Let’s think about integers instead of dates for a moment. Start with a simple table:

create table single (
name text unique not null,
location integer unique not null
);

Here we have a relation between one integer and one name. No name can refer to more than one integer. No integer can have more than one name. I am intentionally not picking either one out as a primary key here.

The next step is to allow names to refer to sets of integers:

create table multiple (
name text not null,
location integer unique not null
);

Now each integer can belong to only one name, but a name can refer to multiple integers. Two things to note about this: there is no way to declare a constraint, should we wish to, that all of the integers for a given name be contiguous. Neither is this representation necessarily efficient for a system that’s trying to do that.

Querying to find the name for a location is very straightforward (select * from multiple where location = x). Getting a set of all of the locations is straightforward as well (select location from multiple where name = x). Finding the bounds of the range is not straightforward, if it is represented in this manner.

Next, what if we represent the range as a pair of columns?

create table pair (
name text not null,
min_location integer not null,
max_location integer not null
);

Now we have a different set of problems. It is not somewhat more complicated to query for the name given a location (select name from pair where min_location <= x

(A final note on temporal data… does anybody know of a good system for handling missing date information? I’m not talking about “don’t know the date”, I mean “don’t have full date information”. Like, for a death date, year of death is known, but month and day is not. Or year and month are known, but day is not. Or more strangely, year and day are known but month is not. Having to fall back on multiple columns for this kind of thing and not having any kind of normal date behavior at all has always kind of gotten to me.)

By: Richard Broersma Jr.

Richard Broersma Jr. — Mon, 02 Nov 2009 15:55:53 +0000

It true that you can validate the scheduling conflicts in the client application. However, there is value in having the database enforce these constraints. An over simplified example would be the enforcement of a table’s unique constraint. The client could enforce it, but life is so much easier when the database does it for us.

Also in the case of temporal data, in this example we are not really modeling professors and lecturehalls. Rather we are modeling professortimelines and lecturehalltimelines. Having the database verify that our time-lines are continuous, having no gaps or overlapping segments has great value.

Continuous time lines allow to schedule for the future but also allow us to examine the state of things has they existed in the past.

By: Richard Broersma Jr.

Richard Broersma Jr. — Mon, 02 Nov 2009 15:43:21 +0000

Mostly there are manual procedures in place to manage the change. Following these procedures has at least two problems: 1) using people to document the changes of thousands of instruments and process functions is prone to error; 2) the cost to procedurally manage these changes is huge. (i.e. On a project where 10 million dollars of equipment was purchased, 30 million dollars as spent for construction services, and 60 million was spend in engineering services and half of this was used to produce the documentation to manage the changes.)

And even after all of this effort was expended to produce the required documentation to manage the construction changes, at least 10% of these instruments modifications resulted in scheduling conflicts that resulted in construction delays and additional down time. On review of these short-falls, it become evident that there currently is no mechanism in place to validate the individual time-lines of each instrument, process function, and the relationships between them.

The other problem that has no tools, hacks, or procedures (yet) is managing the scopes of concurrently running project that have scope over the same areas.