Applications vs. Platforms

When are you developing an application, and when are you developing a platform? A lot of discussion about programming comes down to this question; and the less-helpful discussions can usually be traced to confusion over this point.

For instance, an article here (by an author I respect) entitled Your Coding Philosophies are Irrelevant makes a fairly typical point: that it’s hard to make a connection between a good end-user experience and particular programming practices or philosophies. It’s similar to an argument that the ends justify the means, though hopefully not fraught with the same moral problems.

On the other hand, advocates of development styles, programming languages, etc., point out how their approach helps manage the complexity of software development. This section of Learn You a Haskell extols the “safety” of a type system (presumably implying that your program will work better).

So who is right? If you are developing an application, then you need to set the philosophies aside, pick up whatever tools are most convenient, and build. But if you are building a platform, taking care with the methods you choose and the interfaces or languages you design are crucial.

Continue reading

Taking a step back from ORMs

Do object-relational mappers (ORMs) really improve application development?

When I started developing web applications, I used perl. Not even all of perl, mostly just a bunch of “if” statements and an occasional loop that happened to be valid perl (aside: I remember being surprised that I was allowed to write a loop that would run on a shared server, because “what if it didn’t terminate?!”). I didn’t use databases; I used a mix of files, regexes to parse them, and flock to control concurrency (not because of foresight or good engineering, but because I ran into concurrency-related corruption).

I then made the quantum leap to databases. I didn’t see the benefits instantaneously[1], but it was clearly a major shift in the way I developed applications.

Continue reading

SQL: the successful cousin of Haskell

Haskell is a very interesting language, and shows up on sites like http://programming.reddit.com frequently. It’s somewhat mind-bending, but very powerful and has some great theoretical advantages over other languages. I have been learning it on and off for some time, never really getting comfortable with it but being inspired by it nonetheless.

But discussion on sites like reddit usually falls a little flat when someone asks a question like:

If haskell has all these wonderful advantages, what amazing applications have been written with it?

The responses to that question usually aren’t very convincing, quite honestly.

But what if I told you there was a wildly successful language, in some ways the most successful language ever, and it could be characterized by:

  • lazy evaluation
  • declarative
  • type inference
  • immutable state
  • tightly controlled side effects
  • strict static typing

Surely that would be interesting to a Haskell programmer? Of course, I’m talking about SQL.

Continue reading

Database for a Zoo: the problem and the solution

Let’s say you’re operating a zoo, and you have this simple constraint:

You can put many animals of the same type into a single cage; or distribute them among many cages; but you cannot mix animals of different types within a single cage.

This rule prevents, for example, assigning a zebra to live in the same cage as a lion. Simple, right?

How do you enforce it? Any ideas yet? Keep reading: I will present a solution that uses a generalization of the standard UNIQUE constraint.

(Don’t dismiss the problem too quickly. As with most simple-sounding problems, it’s a fairly general problem with many applications.)

Continue reading

Building SQL Strings Dynamically, in 2011

I saw a recent post Avoid Smart Logic for Conditional WHERE Clauses which actually recommended, “the best solution is to build the SQL statement dynamically—only with the required filters and bind parameters”. Ordinarily I appreciate that author’s posts, but this time I think that he let confusion run amok, as can be seen in a thread on reddit.

To dispel that confusion: parameterized queries don’t have any plausible downsides; always use them in applications. Saved plans have trade-offs; use them sometimes, and only if you understand the trade-offs.

When query parameters are conflated with saved plans, it’s creates FUD about SQL systems because it mixes the fear around SQL injection with the mysticism around the SQL optimizer. Such confusion about the layers of a SQL system are a big part of the reason that some developers move to the deceptive simplicity of NoSQL systems (I say “deceptive” here because it often just moves an even greater complexity into the application — but that’s another topic).

The confusion started with this query from the original article:

Continue reading

Why PostgreSQL Already Has Query Hints

This is a counterpoint to Josh’s recent post: Why PostgreSQL Doesn’t Have Query Hints. I don’t really disagree, except that I think that there are many different definitions of “hints” floating around, leading to a lot of confusion. I could subtitle this post “More Terminology Confusion” after my previous entry.

So, let’s pick a reasonable definition: “hints are some mechanism to influence the SQL planner to choose a better plan”. Why did I choose that definition? Because it’s the actual use case. If a user encounters a bad plan, or an unstable plan, they need a way to get it to choose a better plan. There’s plenty of room to argue about the right way to do that and the wrong way, but almost every DBMS allows some form of hints. Including PostgreSQL.

Continue reading

Big Company Uses Product XYZ

Joshua Drake’s recent article makes some interesting points, but there’s one thing in particular I find missing among many of these discussions. From the article:

It appeared they felt we should be impressed that Facebook runs on MySQL not PostgreSQL. … The problem I have, is that Facebook data is worthless.

All of the concentration is on the company, and whether their use case matters (of course it does, at least to them and their customers). But phrases like “runs on” and “uses” are used too loosely, in my opinion.

Even with celebrity endorsements — for example, a basketball player endorsing shoes — at least they use shoes in roughly the same manner as you might. The shoes might not help you play basketball in any appreciable way, but at least “use” means the same for both the basketball player and you.

Continue reading

Exclusion Constraints are generalized SQL UNIQUE

Say you are writing an online reservation system. The first requirement you’ll encounter is that no two reservations may overlap (i.e. no schedule conflicts). But how do you prevent that?

It’s worth thinking about your solution carefully. My claim is that no existing SQL DBMS has a good solution to this problem before PostgreSQL 9.0, which has just been released. This new release includes a feature called Exclusion Constraints (authored by me), which offers a good solution to a class of problems that includes the “schedule conflict” problem.

I previously wrote a two part series (Part 1 and Part 2) on this topic. Chances are that you’ve run into a problem similar to this at one time or another, and these articles will show you the various solutions that people usually employ in the real world, and the serious problems and limitations of those approaches.

The rest of this article will be a brief introduction to Exclusion Constraints to get you started using a much better approach.

Continue reading

Joining Aster Data

On Monday, May 10th I will be joining Aster Data. I am very excited to work on some ambitious new projects there. Aster is a heavy user of PostgreSQL, so I will (of course) continue to actively participate in the community.

Friday was my last day at Truviso. I enjoyed working there very much, and it was a rewarding experience. I wish all of my former colleagues well — I’m sure our paths will cross in the future.

Flexible Schemas and PostgreSQL

First, what is a “flexible schema”? It’s hard to pin down an exact definition, but it’s used to mean a data model that permits changes in application data structures without needing to migrate old data or incur other administrative hassles.

That’s a worthwhile goal. Applications often grow organically, especially in the early, exploratory stages of development. For example, you may decide to track when a user last did something on the website, so that you can adapt news and notices for those users (e.g. “Did you know that we added feature XYZ since you last visited?”). Developers have a need to produce a prototype quickly to work out the edge cases (do we update that timestamp for all actions, or only certain ones?), and probably a need to put it in production so that the users can benefit sooner.

A common worry is that ALTER TABLE will be a major performance problem. That’s sometimes a problem, but in PostgreSQL, you can add a column to a table in constant time (not dependent on the size of the table) in most situations. I don’t think this is a good reason to avoid ALTER TABLE, at least in PostgreSQL (other systems may impose a greater burden).

There are good reasons to avoid ALTER TABLE, however. We’ve only defined one use case for this new “last updated” field, and it’s a fairly loose definition. If we use ALTER TABLE as a first reaction for tracking any new application state, we’d end up with lots of columns with overlapping meanings (all subtly different), and it would be challenging to keep them consistent with each other. More importantly, adding new columns without thinking through the meaning and the data migration strategy will surely cause confusion and bugs. For example, if you see the following table:

    CREATE TABLE users
    (
      name         TEXT,
      email        TEXT,
      ...,
      last_updated TIMESTAMPTZ
    );

you might (reasonably) assume that the following query makes sense:

    SELECT * FROM users
      WHERE last_updated < NOW() - '1 month'::INTERVAL;

Can you spot the problem? Old user records (before the ALTER TABLE) will have NULL for last_updated timestamps, and will not satisfy the WHERE condition even though they intuitively qualify. There are two parts to the problem:

  1. The presence of the last_updated field fools the author of the SQL query into making assumptions about the data, because it seems so simple on the surface.
  2. NULL semantics allow the query to be executed even without complete information, leading to a wrong result.

Let’s try changing the table definition:

    CREATE TABLE users
    (
      name       TEXT,
      email      TEXT,
      ...,
      properties HSTORE
    );

HSTORE is a set of key/value pairs. Some tuples might have the last_updated key in the properties attribute, and others may not. This accomplishes two things:

  1. There’s no need for ALTER TABLE or cluttering of the namespace with a lot of nullable columns.
  2. The name “properties” is vague enough that query writers would (hopefully) be on their guard, understanding that not all records will share the same properties.

You could still write the same (wrong) query against the second table with minor modification. Nothing has fundamentally changed. But we are using a different development strategy that’s easy on application developers during rapid development cycles, yet does not leave a series of pitfalls for users of the data. When a certain property becomes universally recorded and has a concrete meaning, you can plan a real data migration to turn it into a relation attribute instead.

Now, we need some guiding principles about when to use a complex type to represent complex information, and when to use separate columns in the table. To maximize utility and minimize confusion, I believe the best guiding principle is the meaning of the data you’re storing across all tuples. When defining the attributes of a relation, if you find yourself using vague nouns such as “properties,” or resorting to complex qualifications (lots of “if/then” branching in your definition), consider less constrained data types like HSTORE. Otherwise, it’s best to nail down the meaning in terms of appropriate nouns, which will help keep the DBMS smart and queries simple (and correct). See Choosing Data Types and further guidance in reference [1].

I believe there are three reasons why application developers feel that relational schemas are “inflexible”:

  1. A reliance on NULL semantics to make things “magically work,” when in reality, it just makes queries succeed that should fail. See my previous posts: None, nil, Nothing, undef, NA, and SQL NULL and What is the deal with NULLs?.
  2. The SQL database industry has avoided interesting types, like HSTORE, for a long time. See my previous post: Choosing Data Types.
  3. ORMs make a fundamental false equivalence between an object attribute and a table column. There is a relationship between the two, of course; but they are simply not the same thing. This is a direct consequence of “The First Great Blunder”[2].

EDIT: I found a more concise way to express my fundamental point — During the early stages of application development, we only vaguely understand our data. The most important rule of database design is that the database should represent reality, not what we wish reality was like. Therefore, a database should be able to express that vagueness, and later be made more precise when we understand our data better. None of this should be read to imply that constraints are less important or that we need not understand our data. These ideas mostly apply only at very early stages of development, and even then, prudent use of constraints often makes that development much faster.

[1] Date, C.J.; Darwen, Hugh (2007). Databases, Types, and the Relational Model. pp. 377-380 (Appendix B, “A Design Dilemma”).

[2] Date, C.J. (2000). An Introduction To Database Systems, p. 865.