Why are temporal extensions in PostgreSQL important? Quite simply, managing time data is one of the most common requirements, and current general-purpose database systems don't provide the basic tools to do it: they fall short in both usability and performance when handling temporal data.
What is already done?
The relational model is just a way to represent reality. It happens to have some very useful properties, such as closure over many useful operations — but it’s a purely logical model of reality. You can implement relational operations using hash joins, MapReduce, or pen and paper.
So, right away, it’s meaningless to talk about the scalability of the relational model. Given a particular question, it might be difficult to break it down into bite-sized pieces and distribute it to N worker nodes. But going with MapReduce doesn’t solve that scalability problem — it just means that you will have a hard time defining a useful map or reduce operation, or you will have to change the question into something easier to answer.
My last post about the Linux OOM Killer got a lot of attention, and I don't intend to re-open the whole discussion. However, a couple of themes emerged that I think warrant further discussion. One of those themes is that "error recovery is hard," and that's absolutely true. It's so true that many of the kernel developers seemed to hold a more extreme version: "error recovery is infeasible" (see this post). But that is not true.
PostgreSQL is a great example: it gets error recovery right. I didn't say it's perfect, but graceful degradation is designed into postgres from the start, and if it doesn't degrade gracefully, that's probably a bug. I say this as someone who experienced a not-so-graceful PostgreSQL problem first-hand back in 2006, due to an OOM condition on FreeBSD:
- The bug occurred under fairly strange circumstances: very small allocations consumed memory almost completely, leaving no room (not even a few dozen bytes), during a transaction that included extensive DDL (note that most database systems can't do transactional DDL at all).
- The bug was fixed!
A database system getting error recovery right is no accident, fluke, or statistical outlier. Database systems are meant to get this right so that applications do not have to do it. In other words, write applications against PostgreSQL (or another robust relational database system), and you get error recovery for free.
The application may be performing a complex series of updates or state changes, and trying to ensure graceful error recovery without a database system to help may be nearly impossible. But by simply wrapping those operations in a single transaction, the application only needs to know how to handle a few classes of errors, and will always see consistent data afterward.
Moral of the story: if you’re not building on top of a robust DBMS, and you have any significant amount of state to manage, then you are not getting error recovery right (unless you put in the serious analysis required). Similar problems apply to those using a DBMS like a filesystem, with no constraints (to prevent corruption) and related operations split across transactional boundaries. So: use PostgreSQL, use transactions, and use constraints.
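To make the point concrete, here is a minimal sketch of "wrap related operations in a single transaction and let the database recover for you." It uses Python's sqlite3 module as a stand-in for a real DBMS; the account names, balances, and the transfer scenario are invented for illustration.

```python
import sqlite3

# A hypothetical two-account transfer. The CHECK constraint prevents
# corruption (a negative balance), and the transaction guarantees that
# a failed transfer leaves no half-applied state behind.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
    name TEXT PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0)
)""")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # both UPDATEs run in one transaction
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except sqlite3.IntegrityError:
        # The whole transfer rolled back; the application only has to
        # handle this one class of error and always sees consistent data.
        return False
    return True

transfer(conn, "alice", "bob", 500)  # violates the constraint, rolls back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The application code never has to undo a partial transfer by hand; the constraint and the transaction do the error recovery.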
The Linux OOM Killer heuristic can be summed up as:
- Run out of memory.
- Kill PostgreSQL.
- Look for processes that might be using too much memory, and kill them, hopefully freeing memory.
Notice that step #2 is independent of factors like:
- Is PostgreSQL consuming a significant share of the memory resources?
- Will killing PostgreSQL alleviate any memory pressure at all?
In the last article, I argued that:
- A schedule conflict is a typical business problem.
- The later you try to resolve a schedule conflict, the more costly it is to resolve.
- In particular, there is a big jump in the cost the moment after conflicting data is recorded.
- Therefore, it’s best for the DBMS itself to enforce the constraint, because only the DBMS can avoid the conflict effectively before the conflict is recorded.
Then, I opened up a discussion to see how people are dealing with these schedule conflicts. Based on the comments on that article, as well as anecdotes from conferences, user groups, mailing lists, and my own experience, the solutions fall into a few categories.
"Schedule conflict": it's one of the simplest and most common constraints for a business or any other organization. One person cannot be in two places at the same time, and in many cases only a single person can use a given resource at any time.
Does your database system know anything about a schedule conflict? Should it?
Constraints are always enforced at some point; it's just a matter of when, how, and how costly it is to correct the situation.
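The conflict check itself is simple. Here is a small sketch of the predicate and an application-side guard, assuming half-open [start, end) periods; the `book` function and the schedule structure are invented for illustration (and, as argued above, doing this in the application rather than the DBMS is exactly the fragile part).

```python
from datetime import datetime

def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals [start, end) overlap iff each one
    # starts strictly before the other one ends.
    return a_start < b_end and b_start < a_end

def book(schedule, resource, start, end):
    # Reject the booking at the cheapest possible moment:
    # before the conflicting data is ever recorded.
    for (res, s, e) in schedule:
        if res == resource and overlaps(s, e, start, end):
            return False
    schedule.append((resource, start, end))
    return True

schedule = []
book(schedule, "room-a", datetime(2009, 11, 5, 9, 0), datetime(2009, 11, 5, 10, 0))
```

Note that two bookings that merely touch at an endpoint do not conflict under half-open semantics, which is usually what scheduling applications want.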
I really enjoyed PostgreSQL Conference WEST this year. It’s a shame I didn’t have a little more time to explore Seattle, however; it looks like a nice city.
There were a lot of great talks. Scott Bailey's temporal data talk was excellent, of course, though I'm not an impartial observer: I had expressed my enthusiasm before the conference began. Another talk I found very interesting was Bill Karwin's talk on practical object-oriented models in SQL. The talk was well researched and the points were made very clearly, although I would have liked to see a little more depth. There was some audience discussion, and I'd like to see the points hashed out in a little more detail.
I’m always excited to present the projects that I’m working on, and this conference was no exception. I thoroughly enjoyed discussing Operator Exclusion Constraints, the new feature that I’m writing for PostgreSQL 8.5 that enables “temporal keys”. Slides are here — I’ll write in more detail in another article. I was pleased with the audience participation and questions during the talk, and I hope that everyone enjoyed listening to me as much as I enjoyed speaking.
Slides for my talk on extensibility are available here.
I’ve been interested in temporal data and relational databases for quite some time. There are going to be at least two people talking about temporal data at PostgreSQL WEST in Seattle: Scott Bailey and me. See the talk descriptions.
In the past, I've worked on a temporal extension to PostgreSQL that implements the PERIOD data type. This is a data type that holds both a definite beginning and a definite end time, which is important for describing things that happen over a period of time rather than instantaneously. Trying to use separate "start" and "end" attributes is bad for a number of reasons, which I'll address in a subsequent blog entry. For now, I'll just say that I believe the PERIOD data type is fundamentally important for handling all kinds of time data, which is a common problem.
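To illustrate the idea, here is a minimal sketch of a PERIOD-like value in Python, assuming half-open [begin, end) semantics. This is not the actual extension's implementation, just a model of the two properties that matter here: a period is a single value (so begin and end can't drift apart), and it supports operators like "contains" and "overlaps".

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Period:
    """A single value with a definite begin and a definite end."""
    begin: datetime
    end: datetime

    def __post_init__(self):
        # A period is only valid if it actually covers some time;
        # separate start/end columns can't enforce this by themselves.
        if self.begin >= self.end:
            raise ValueError("period must begin before it ends")

    def contains(self, t: datetime) -> bool:
        # Half-open: the begin instant is included, the end instant is not.
        return self.begin <= t < self.end

    def overlaps(self, other: "Period") -> bool:
        return self.begin < other.end and other.begin < self.end
```

Because the invariant lives in the type, no row can ever hold a "period" whose end precedes its beginning, which is one of the problems with separate start/end attributes.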
At WEST, I'll be presenting my progress on temporal keys. Temporal keys are used to prevent overlapping periods of time (a schedule conflict) by using an index, following the same concurrency behavior as UNIQUE, with minimal performance cost (one extra index search, to be precise).
Temporal keys cannot be expressed in PostgreSQL 8.4 unless you resort to triggers and a full table lock (ouch!), so additional backend support is required. That support is my patch for operator exclusion constraints, which are a more general way of using arbitrary operators and index searches to enforce a constraint. I plan to do what's required for the patch to be accepted in PostgreSQL 8.5.
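For a feel of what the trigger-based workaround looks like, here is a runnable sketch using Python's sqlite3 (the table, column names, and timestamps are invented; SQLite's single-writer model sidesteps the concurrency problem that, in PostgreSQL, forces the full table lock that operator exclusion constraints avoid).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reservation (room TEXT, start_t TEXT, end_t TEXT);

-- Trigger-based stand-in for a temporal key: abort any insert whose
-- half-open [start_t, end_t) period overlaps an existing reservation
-- for the same room.
CREATE TRIGGER no_conflict BEFORE INSERT ON reservation
WHEN EXISTS (SELECT 1 FROM reservation
             WHERE room = NEW.room
               AND start_t < NEW.end_t
               AND NEW.start_t < end_t)
BEGIN
    SELECT RAISE(ABORT, 'schedule conflict');
END;
""")

conn.execute("INSERT INTO reservation VALUES ('101', '2009-11-05 09:00', '2009-11-05 10:00')")
```

An overlapping insert now fails with an integrity error instead of silently recording the conflict; a declarative exclusion constraint expresses the same rule without hand-written trigger logic.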
Temporal modeling is a common problem. It seems like almost every PostgreSQL conference has had at least one talk on the matter, so we know there is some demand for improvement. If you're interested, I hope you come to WEST and chat with Scott or me, and let's see if we can come up with some real solutions.
Database design includes a lot of concepts, and it all starts from understanding the business. If you’ve done a good job, you can usually tell a lot about the business from the schema alone — without data, and without application code or a user interface. This article is about data types, which are the “nouns” of your business, loosely speaking (the analogy is borrowed from C.J. Date’s An Introduction to Database Systems).
A recent thread on pgsql-hackers warrants some more extensive discussion. In the past, I’ve criticized NULL semantics, but in this post I’d just like to explain some corner cases that I think you’ll find interesting, and try to straighten out some myths and misconceptions.
First off, I’m strictly discussing SQL NULL here. SQL NULL is peculiar in a number of ways, and the general excuse for this is that there is a need to represent “missing information” — which may be true. But there are lots of ways to represent missing information, as I pointed out in a previous post, and SQL’s approach to missing information is, well, “unique”.