Experimental Thoughts » NULL
Ideas on Databases, Logic, and Language, by Jeff Davis
http://thoughts.davisjeff.com

Building SQL Strings Dynamically, in 2011
By Jeff Davis, Sat, 09 Jul 2011
http://thoughts.davisjeff.com/2011/07/09/building-sql-strings-dynamically-in-2011/

I saw a recent post, Avoid Smart Logic for Conditional WHERE Clauses, which actually recommended: “the best solution is to build the SQL statement dynamically—only with the required filters and bind parameters”. Ordinarily I appreciate that author’s posts, but this time I think he let confusion run amok, as can be seen in a thread on reddit.

To dispel that confusion: parameterized queries don’t have any plausible downsides; always use them in applications. Saved plans have trade-offs; use them sometimes, and only if you understand the trade-offs.

When query parameters are conflated with saved plans, it creates FUD about SQL systems, because it mixes the fear around SQL injection with the mysticism around the SQL optimizer. Such confusion about the layers of a SQL system is a big part of the reason that some developers move to the deceptive simplicity of NoSQL systems (I say “deceptive” here because it often just moves an even greater complexity into the application — but that’s another topic).

The confusion started with this query from the original article:

SELECT first_name, last_name, subsidiary_id, employee_id
FROM employees
WHERE ( subsidiary_id    = :sub_id OR :sub_id IS NULL )
  AND ( employee_id      = :emp_id OR :emp_id IS NULL )
  AND ( UPPER(last_name) = :name   OR :name   IS NULL )

[ Aside: In PostgreSQL those parameters should be $1, $2, and $3; but that's not relevant to this discussion. ]

The idea is that one such query can be used for several types of searches. If you want to ignore one of the WHERE conditions, you just pass a NULL for that parameter, which makes one side of the OR always TRUE, so the condition might as well not be there. Each condition can therefore either be present with a real argument (restricting the results of the query) or be ignored by passing a NULL argument, effectively giving you 8 queries (two choices for each of the three conditions) from one SQL string. By eliminating the need to use different SQL strings depending on which conditions you want to apply, you reduce the opportunity for error.

The trouble is that the article says this kind of query is a problem. The reasoning goes something like this:

  1. Using bind parameters forces the plan to be saved and reused for multiple queries.
  2. When a plan is saved for multiple queries, the planner doesn’t have the actual argument values.
  3. Because the planner doesn’t have the actual argument values, the “x IS NULL” conditions aren’t constant at plan time, and therefore the planner isn’t able to simplify the conditions (e.g., if one condition is always TRUE, just remove it).
  4. Therefore it makes a bad plan.

However, #1 is simply untrue, at least in PostgreSQL. PostgreSQL can save the plan, but you don’t have to. See the documentation for PQexecParams. Here’s an example in Ruby using the “pg” gem (EDIT: note that this does not use any magic query-building behind the scenes; it uses a protocol-level feature of the PostgreSQL server to bind the arguments):

require 'rubygems'
require 'pg'

conn = PGconn.connect("dbname=postgres")

conn.exec("CREATE TABLE foo(i int)")
conn.exec("INSERT INTO foo SELECT generate_series(1,10000)")
conn.exec("CREATE INDEX foo_idx ON foo (i)")
conn.exec("ANALYZE foo")

# Query using parameters. The planner sees the real arguments, so it will
# make the same plan as if you had inlined them into the SQL string. In
# this case, 3 is not NULL, so the condition is simplified to just
# "WHERE i = 3", and the planner will choose the index on "i" for a fast search.
res = conn.exec("explain SELECT * FROM foo WHERE i = $1 OR $1 IS NULL", [3])
res.each{ |r| puts r['QUERY PLAN'] }
puts

# Now the argument is NULL, so the condition is always true and is
# removed completely. The planner will surely choose a sequential scan.
res = conn.exec("explain SELECT * FROM foo WHERE i = $1 OR $1 IS NULL", [nil])
res.each{ |r| puts r['QUERY PLAN'] }
puts

# Saves the plan. It doesn't know whether the argument is NULL or not
# yet (because the arguments aren't provided yet), so the plan might
# not be good.
conn.prepare("myplan", "SELECT * FROM foo WHERE i = $1 OR $1 IS NULL")

# We can execute this with:
res = conn.exec_prepared("myplan",[3])
puts res.to_a.length
res = conn.exec_prepared("myplan",[nil])
puts res.to_a.length

# But to see the plan, we have to use the SQL string form so that we
# can use EXPLAIN. With the argument 3, the plan could use the index, but
# because we're using a saved plan (made before the argument was known),
# it can't. Also notice that it wasn't able to simplify the conditions
# away like it did in the unsaved-plan examples above.
res = conn.exec("explain execute myplan(3)")
res.each{ |r| puts r['QUERY PLAN'] }
puts

# ...and it will use the same plan again, even with a different argument.
res = conn.exec("explain execute myplan(NULL)")
res.each{ |r| puts r['QUERY PLAN'] }
puts

conn.exec("DROP TABLE foo")

See? If you know what you are doing, and want to save a plan, then save it. If not, do the simple thing, and PostgreSQL will have the information it needs to make a good plan.
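
For reference, the saved-plan half of that experiment can also be reproduced in plain psql against the same foo table the script creates (a sketch; on the PostgreSQL versions current when this was written, EXECUTE reuses the one generic plan regardless of the argument):

PREPARE myplan(int) AS
  SELECT * FROM foo WHERE i = $1 OR $1 IS NULL;

-- Same saved plan either way, so neither call gets the simplification
-- that a freshly planned query would get.
EXPLAIN EXECUTE myplan(3);
EXPLAIN EXECUTE myplan(NULL);

DEALLOCATE myplan;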

My next article will be a simple introduction to database system architecture that will hopefully make SQL a little less mystical.

Flexible Schemas and PostgreSQL
By Jeff Davis, Thu, 06 May 2010
http://thoughts.davisjeff.com/2010/05/06/flexible-schemas-and-postgresql/

First, what is a “flexible schema”? It’s hard to pin down an exact definition, but it’s used to mean a data model that permits changes in application data structures without needing to migrate old data or incur other administrative hassles.

That’s a worthwhile goal. Applications often grow organically, especially in the early, exploratory stages of development. For example, you may decide to track when a user last did something on the website, so that you can adapt news and notices for those users (e.g. “Did you know that we added feature XYZ since you last visited?”). Developers need to produce a prototype quickly to work out the edge cases (do we update that timestamp for all actions, or only certain ones?), and probably need to put it into production so that users can benefit sooner.

A common worry is that ALTER TABLE will be a major performance problem. That’s sometimes a problem, but in PostgreSQL, you can add a column to a table in constant time (not dependent on the size of the table) in most situations. I don’t think this is a good reason to avoid ALTER TABLE, at least in PostgreSQL (other systems may impose a greater burden).
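
For instance, adding the last_updated column discussed below is, in PostgreSQL, a catalog-only change as long as you don’t attach a DEFAULT:

    -- Completes in roughly constant time, regardless of how many rows exist,
    -- because no existing rows need to be rewritten.
    ALTER TABLE users ADD COLUMN last_updated TIMESTAMPTZ;

    -- By contrast, adding a DEFAULT forced a rewrite of every row in the
    -- PostgreSQL versions current when this was written.
    -- ALTER TABLE users ADD COLUMN last_updated TIMESTAMPTZ DEFAULT now();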

There are good reasons to avoid ALTER TABLE, however. We’ve only defined one use case for this new “last updated” field, and it’s a fairly loose definition. If we use ALTER TABLE as a first reaction for tracking any new application state, we’d end up with lots of columns with overlapping meanings (all subtly different), and it would be challenging to keep them consistent with each other. More importantly, adding new columns without thinking through the meaning and the data migration strategy will surely cause confusion and bugs. For example, if you see the following table:

    CREATE TABLE users
    (
      name         TEXT,
      email        TEXT,
      ...,
      last_updated TIMESTAMPTZ
    );

you might (reasonably) assume that the following query makes sense:

    SELECT * FROM users
      WHERE last_updated < NOW() - '1 month'::INTERVAL;

Can you spot the problem? Old user records (before the ALTER TABLE) will have NULL for last_updated timestamps, and will not satisfy the WHERE condition even though they intuitively qualify. There are two parts to the problem:

  1. The presence of the last_updated field fools the author of the SQL query into making assumptions about the data, because it seems so simple on the surface.
  2. NULL semantics allow the query to be executed even without complete information, leading to a wrong result.
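
If the query author does notice the missing data, one fix is to state explicitly what a missing timestamp should mean:

    -- Treat rows that predate the ALTER TABLE as "not updated in the last month".
    SELECT * FROM users
      WHERE last_updated < NOW() - '1 month'::INTERVAL
         OR last_updated IS NULL;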

Let’s try changing the table definition:

    CREATE TABLE users
    (
      name       TEXT,
      email      TEXT,
      ...,
      properties HSTORE
    );

HSTORE is a data type that holds a set of key/value pairs. Some tuples might have the last_updated key in the properties attribute, and others may not. This accomplishes two things:

  1. There’s no need for ALTER TABLE or cluttering of the namespace with a lot of nullable columns.
  2. The name “properties” is vague enough that query writers would (hopefully) be on their guard, understanding that not all records will share the same properties.

You could still write the same (wrong) query against the second table with minor modification. Nothing has fundamentally changed. But we are using a different development strategy that’s easy on application developers during rapid development cycles, yet does not leave a series of pitfalls for users of the data. When a certain property becomes universally recorded and has a concrete meaning, you can plan a real data migration to turn it into a relation attribute instead.
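
For illustration, here is roughly what reading and writing such a property looks like with the hstore operators (the email value is made up; hstore stores text, so you cast on the way back out):

    -- hstore ships as a contrib module; it must be installed in the database first.
    UPDATE users
       SET properties = COALESCE(properties, '') || hstore('last_updated', now()::text)
     WHERE email = 'alice@example.com';

    -- Users that never recorded the property simply return NULL here.
    SELECT name, (properties -> 'last_updated')::timestamptz AS last_updated
      FROM users;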

Now, we need some guiding principles about when to use a complex type to represent complex information, and when to use separate columns in the table. To maximize utility and minimize confusion, I believe the best guiding principle is the meaning of the data you’re storing across all tuples. When defining the attributes of a relation, if you find yourself using vague nouns such as “properties,” or resorting to complex qualifications (lots of “if/then” branching in your definition), consider less constrained data types like HSTORE. Otherwise, it’s best to nail down the meaning in terms of appropriate nouns, which will help keep the DBMS smart and queries simple (and correct). See Choosing Data Types and further guidance in reference [1].

I believe there are three reasons why application developers feel that relational schemas are “inflexible”:

  1. A reliance on NULL semantics to make things “magically work,” when in reality, it just makes queries succeed that should fail. See my previous posts: None, nil, Nothing, undef, NA, and SQL NULL and What is the deal with NULLs?.
  2. The SQL database industry has avoided interesting types, like HSTORE, for a long time. See my previous post: Choosing Data Types.
  3. ORMs make a fundamental false equivalence between an object attribute and a table column. There is a relationship between the two, of course; but they are simply not the same thing. This is a direct consequence of “The First Great Blunder”[2].

EDIT: I found a more concise way to express my fundamental point — During the early stages of application development, we only vaguely understand our data. The most important rule of database design is that the database should represent reality, not what we wish reality was like. Therefore, a database should be able to express that vagueness, and later be made more precise when we understand our data better. None of this should be read to imply that constraints are less important or that we need not understand our data. These ideas mostly apply only at very early stages of development, and even then, prudent use of constraints often makes that development much faster.

[1] Date, C.J.; Darwen, Hugh (2007). Databases, Types, and the Relational Model. pp. 377-380 (Appendix B, “A Design Dilemma”).

[2] Date, C.J. (2000). An Introduction To Database Systems, p. 865.

None, nil, Nothing, undef, NA, and SQL NULL
By Jeff Davis, Wed, 13 Aug 2008
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/

In my last post, Why DBMSs are so complex, I raised the issue of type mismatches between the application language and the DBMS.

Type matching between the DBMS and the application is as important to successful application development as the types themselves. If a type behaves one way in the DBMS, and a “similar” type behaves slightly differently in the application, that can only cause confusion. It’s also a source of unnecessary awkwardness: you already have to define the types that suit your business best in one place; why should you have to redefine them somewhere else, on top of a different basic type system?

At least we’re using PostgreSQL, the most extensible database available, where you can define sophisticated types and make them perform like native features.
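
As a small illustration of that extensibility (a trivial sketch, not a sophisticated type), even a one-line domain participates in the normal type machinery alongside the built-in operators:

=> -- a user-defined domain, used directly with a built-in operator
=> create domain positive_int as int check (value > 0);
CREATE DOMAIN
=> select 5::positive_int + 1;
 ?column?
----------
        6
(1 row)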

But there are still problems. Most notably, it’s a non-trivial challenge to find an appropriate way to model NULLs in the application language. You can’t not use them in the DBMS, because the SQL spec generates them from oblivion, e.g. from an outer join or an aggregate function, even when you have no NULLs in your database. So the only way to model the same semantics in your application is to somehow make your application language understand NULL semantics.

Here’s how SQL NULL behaves:

=> -- aggregate with one NULL input
=> select sum(column1) from (values(NULL::int)) t;
sum
-----

(1 row)

=> -- aggregate with two inputs, one of them NULL
=> select sum(column1) from (values(1),(NULL)) t;
sum
-----
1
(1 row)

=> -- aggregate with no input
=> select sum(column1) from (values(1),(NULL)) t where false;
sum
-----

(1 row)

=> -- + operator
=> select 1 + NULL;
?column?
----------

(1 row)
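
And, as mentioned above, an outer join can manufacture a NULL even when neither input contains one:

=> -- outer join: no NULL stored anywhere, yet one appears in the result
=> select a.column1 as a, b.column1 as b from (values (1)) a left join (values (2)) b on a.column1 = b.column1;
 a | b
---+---
 1 |
(1 row)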

I’ll divide the “NULL-ish” values of various languages into two broad categories:

  1. Separate type, few operators defined, error early, no 3VL — Python, Ruby, and Haskell fall into this category, because their “NULL-ish” types (None, nil, and Nothing, respectively) usually result in an immediate exception, unless the operator to which the NULL-ish value is passed handles it as a special case. Few built-in operators are defined for arguments of these types. They fail to behave like SQL NULL because they employ no three-valued logic (3VL) at all, and thus fail on the fourth portion of the SQL example.
  2. Member of all types, every operator defined — Perl and R fall into this category. Perl’s undef can be passed through many built-in operators (like +), but it never uses 3VL, so it fails the fourth portion of the SQL example. R uses a kind of 3VL for its NA value, but it uses it everywhere, so sum(c(1,NA)) results in NA (thus failing the second portion of the SQL example). In R, you can omit NAs from the sum explicitly (not a very good solution, by the way), but then it will fail the first portion of the SQL example.

As far as I can tell (correct me if I’m mistaken), none of these languages support the third portion of the SQL example: the sum of an empty list in SQL is NULL. The languages that I tested with a built-in sum operator (Python, R, Haskell) all return 0 when passed an empty list.

Languages from the first category appear safer, because you will catch errors earlier rather than later. However, transforming SQL NULLs into None, nil, or Nothing in these languages is actually quite dangerous, because a change in the data you store in your database (inserting NULLs, or deleting records that may be aggregated) or even a change in a query (an outer join, or an aggregate that may have no input) can produce NULLs, and therefore exceptions, that evade even rigorous testing procedures and sneak into production.

Languages from the second category tend to pass the “undef” or “NA” along deeper into the application, which can cause unintuitive and difficult-to-trace problems. Perhaps worse, something will always happen, and the result will usually look like a correct answer even when it is wrong.

So where does that leave us? I think the blame here rests entirely on the SQL standard’s definition of NULL, and the inconsistency between “not a value at all” and “the third logical value” (both of which can be used to describe NULL in different contexts). Not much can be done about that, so I think the best strategy is to try to interpret and remove NULLs as early as possible. They can be removed from result sets before returning to the client by using COALESCE, and they can be removed after they reach the client with client code. Passing them along as some kind of special value is only useful if your application already must be thoroughly aware of that special value.
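
For example, the empty-aggregate case from above can be given an explicit default before it ever leaves the database:

=> -- sum over no rows would be NULL; COALESCE turns it into a real zero
=> select coalesce(sum(column1), 0) as total from (values(1),(NULL)) t where false;
 total
-------
     0
(1 row)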

Note: Microsoft has defined a “DBNull” value, and from browsing the docs, it appears a substantial amount of work went into making it behave like SQL NULL. This includes a special set of SQL types and operators. Microsoft appears to be making a lot of progress matching DBMS and application types more closely, but I think the definition of SQL NULL is a lost cause.
