A recent thread on pgsql-hackers warrants some more extensive discussion. In the past, I’ve criticized NULL semantics, but in this post I’d just like to explain some corner cases that I think you’ll find interesting, and try to straighten out some myths and misconceptions.
First off, I’m strictly discussing SQL NULL here. SQL NULL is peculiar in a number of ways, and the general excuse for this is that there is a need to represent “missing information” — which may be true. But there are lots of ways to represent missing information, as I pointed out in a previous post, and SQL’s approach to missing information is, well, “unique”.
- “NULL is not a value” — If you hear this one, beware: it’s in direct contradiction to the SQL standard, which uses the phrase “null value” dozens of times. It’s hard to imagine that NULL is not any kind of value at all; because it’s routinely passed to functions and operators, predicates can evaluate to NULL, and SQL uses a kind of three-valued logic (3VL) in some contexts. The phrase “NULL is not a value” also raises the question: “what is it, then?”.
- NULL means “unknown” (i.e. the third truth value) — This doesn’t hold up either. SUM of no tuples returns NULL, but clearly the SUM of no tuples is not unknown! SQL will happily generate NULLs from aggregates or outer joins without any NULLs at all in the database. Do you not know something you did know before, or do you now know that you don’t know something that you didn’t know you didn’t know before? Also, if NULL means “unknown”, how do you differentiate a boolean field for which you do not know the value, and a boolean field for which you do know the value, and it happens to be “unknown” (perhaps this is why boolean columns are a PostgreSQL extension and not part of the core SQL standard)?
- “NULL is false-like” — Don’t think of NULL as false-like, or “more false than true”. It’s a tempting rule of thumb, but it’s misleading. For instance, in a WHERE clause, a NULL predicate is treated like FALSE. However, in a constraint (like a
CHECK
constraint), NULL is treated like TRUE. Perhaps most importantly, when in a 3VL context (like a boolean expression), this misconception leads to problems when you try to invert the logic, e.g., useNOT
. - “Oh, that makes sense” — When you see individual behaviors of NULL, they look systematic, and your brain quickly sees a pattern and extrapolates what might happen in other situations. Often, that extrapolation is wrong, because NULL semantics are a mix of behaviors. I think the best way to think about NULL is as a Frankenstein monster of several philosophies and systems stitched together by a series of special cases.
p OR NOT p
— Everyone should know that this is not always true in SQL. But most people tend to reason assuming that this is always true, so you have to be very careful, and work against your intuition very deliberately, in order to form a correct SQL query.- SUM() versus + (addition) — SUM is not repeated addition. SUM of
1
andNULL
is1
, but1 + NULL
is NULL. - Aggregates ignore NULLs — According to the standard, aggregates are supposed to ignore NULLs, because the information is missing. But why is it OK to ignore the missing information in an aggregate, but not, say, with the + operator? Is it really OK to just ignore it?
- Aggregates return NULL — According to the standard, aggregates are supposed to return NULL when they have no non-NULL input. Just because you don’t have any input tuples, does that really mean that the result is undefined, missing, or unknown? It’s certainly not unknown! What about SUM over zero tuples, wouldn’t the most useful result be zero?
- SQL breaks its own rules — The aforementioned aggregate rules don’t work very well for
COUNT()
, the simplest of all aggregates. So, they have two versions of count:COUNT(*)
breaks the “aggregates ignore nulls” rule and the “aggregates return null” rule, and COUNT(x) only breaks the latter. But wait! There’s more:ARRAY_AGG()
breaks the former but not the latter. But no exception is made for SUM — it still returns NULL when there are no input tuples. - NULLs appear even when you have complete information — Because of OUTER JOIN and aggregates, NULLs can appear even when you don’t have any NULLs in your database! As a thought experiment, try to reconcile this fact with the various “definitions” of NULL.
WHERE NOT IN (SELECT ...)
— This one gets everyone at one point or another. If the subselect produces any NULLs, then NOT IN can only evaluate to FALSE or NULL, meaning you get no tuples. Because it’s in a WHERE clause, it will return no results. You are less likely to have a bunch of NULLs in your data while testing, so chances are everything will work great until you get into production.x >= 10 or x <= 10
— Not a tautology in SQL.x IS NULL AND x IS DISTINCT FROM NULL
— You probably don’t know this, but this expression can evaluate to TRUE! That is, if x = ROW(NULL).NOT x IS NULL
is not the same asx IS NOT NULL
— Ifx
isROW(1,NULL)
, then the former will evaluate to TRUE, and the latter will evaluate to FALSE. Enjoy.NOT x IS NULL AND NOT x IS NOT NULL
— Want to know if you have a value likeROW(1, NULL)
? To distinguish it from NULL and also from values likeROW(1,1)
andROW(NULL,NULL)
, this expression might help you.- NULLs can exist inside some things, but not others — If you concatenate:
firstname || mi || lastname
, and “mi” happens to be null, the entire result will be null. So strings cannot contain a NULL, but as we see above, a record can.
I believe the above shows, beyond a reasonable doubt, that NULL semantics are unintuitive, and if viewed according to most of the “standard explanations,” highly inconsistent. This may seem minor; that is, if you’re writing SQL you can overcome these things with training. But it is not minor, because NULL semantics are designed to make you think you understand them, and think that the semantics are intuitive, and think that it’s part of some ingenious consistent system for managing missing information. But none of those things are true.
I have seen lots of discussions about NULL in various forums and mailing lists. Many of the participants are obviously intelligent and experienced, and yet make bold statements that are, quite simply, false. I’m writing this article to make two important points:
- There is a good case to be made that NULL semantics are very counterproductive; as opposed to a simple “error early” system that forces you to write queries that explicitly account for missing information (e.g. with
COALESCE
). “Error early” is a more mainstream approach, similar to null pointers in java or None in python. If you want compile-time checking, you can use a construct like Maybe in haskell. SQL attempts to pass along the problem, hoping the next operator will turn ignorance into knowledge — but it does not appear that anyone thought through this idea, quite frankly. - You should not attempt to apply your intellect to NULL, it will lead you in the wrong direction. If you need to understand it, understand it, but always treat it with skepticism. Test the queries, read the standard, do what you need to do, but do not attempt to extrapolate.
One example I forgot:
SELECT NULL / 0;
Returns NULL. This is an interesting example because it dispels the misconception that NULL is “zero-like” (similar to the misconception that NULL is “false-like”). No matter what the numerator, dividing by zero results in something that is certainly not like zero.
Pingback: NULL By Mouth
The closest operational definition I have ever been able to come up with for NULL is, “This could be any value, but since I do not know what the value is, I will not treat it as being any value in particular.”
That works right up until you get to the “aggregates ignore NULLs unless all inputs are NULL” rule.
You are exactly right: It’s just a laundry list of behaviors. Attempting to come up with a mental model will, sooner or later, cause you to expect something that isn’t going to work.
> The phrase “NULL is not a value” also raises the question: “what is it, then?”.
The most interesting answer to this I’ve read is that NULL *is* a value. The reason is that NULL always has a data type when it’s in a query result set.
“SELECT foo FROM mytable”
The mytable.foo column has a defined data type. If foo is NULL on a given row, you get the NULL value in that query result, but that column of the result set is still defined with the data type of mytable.foo. A data type is a named set of values, so NULL must be a value that is a set member in every data type.
> NULL must be a value that is a set member in every data type.
If NULL is a value in some domain, then NULL = NULL should yield TRUE, but it does not.
The entire point of my post is that such philosophies as you mention above will lead to wrong results. It is impossible to extrapolate from your “NULL is a value” philosophy the variety of behaviors I list above.
Thank you for making such points about NULLs.
Pingback: Santiago Zarate (foursixnine) 's status on Monday, 03-Aug-09 04:59:44 UTC - Identi.ca
Pingback: Ben Finney (bignose) 's status on Monday, 03-Aug-09 05:05:31 UTC - Identi.ca
I prefer to think about NULLs just like about infinitesimal and infinity from math. They have a lot of common, I’m sure.
The concepts of infinity, epsilon, and NaN are all interesting, and a little confusing because they are outside the domain of real numbers (and complex, for that matter). However, there are basic rules about what you can and can’t do, and mathematicians treat such values with healthy skepticism.
Those things aren’t true of NULL. As soon as you say “I prefer to think about NULLs just like …,” I fear that unexpected results are in your future
For instance, how does your interpretation of NULL explain the fact that SUM of 1 and NULL is 1, and 1 + NULL is NULL?
Oh, we will find another L’Hospital’s Rules
I meant l’Hôpital’s rules (also called Bernoulli’s rule) of course. Sorry for misspelling
I was taught by Hugh Darwen at university, who ingrained in our souls the evil-ness of SQL’s Nulls. He has some interesting papers – http://www.dcs.warwick.ac.uk/~hugh/.
There’s lots of other ways SQL ignores the relational model, for example, returning scalar values from aggregates, returning duplicate rows from queries etc.
I had an idea recently to write a Relational-SQL mapper, which is similar to an Object-Relational mapper, but instead of mapping an object oriented syntax to an SQL backend, it maps a relational syntax (similar to something like Tutorial D) to an SQL backend. A kind of “fix all the non-relational stuff about SQL” mapper.
I’ve not got very far however
The main challenge with such a project is that you have to create a full language, or at least I think you do. People want to be able to sort stuff inside the DBMS, for instance, so just supporting relational algebra is probably not good enough to be useful. So, now you need non-relational operators, too, and the whole thing becomes a giant undertaking.
See my other post on the subject:
http://thoughts.davisjeff.com/2008/08/03/why-dbmss-are-so-complex/
FYI Darren Duncan has been working on just such a project for several years.
http://pugs.postgresql.org/node/404
http://www.mail-archive.com/dbi-users@perl.org/msg31943.html
Tim, thanks for the lip-service to my project. It also gives more support to what I’m doing when I hear that others have wanted or tried to do similar, as it shows more that demand is there.
First you think NULL is a value.
Then you think NULL is not a value.
Finally you realize NULL is NULL and there is no point in trying to understand it.
Well said!
Well, I still have doubts why an empty string (”) is not a NULL (like in Oracle in example).
Hi! Interesting post.
I know you warned us not to, but I can’t help trying to apply some wits against null. Besides, everybody in the comments seem to agree with you, so I guess I gotto be the shmuck to try and at least play devil’s advocate just once :p
First off, NULL and “the null value” are not the same thing – NULL is a keyword that denotes the generic null value, regardless of context. But usually, when talking or writing about some expression that evaluates to the null value, it ends up being denoted as NULL – the keyword. I will add “the keyword” when I mean that, and just NULL when I mean any null-value expression.
Discussing a few items you mentioned:
#1 “The phrase “NULL is not a value” also raises the question: “what is it, then?”.”
It is indeed confusing, but not as hard as it seems. ISO/IEC 9075-1:2003 (E) mentions it on two occasions:
3.1.1.11 null value: A special value that is used to indicate the absence of any data value.
4.4.2 The null value
Every data type includes a special value, called the null value, sometimes denoted by the keyword NULL. This
value differs from other values in the following respects:
— Since the null value is in every data type, the data type of the null value implied by the keyword NULL
cannot be inferred; hence NULL can be used to denote the null value only in certain contexts, rather than
everywhere that a literal is permitted.
— Although the null value is neither equal to any other value nor not equal to any other value — it is unknown
whether or not it is equal to any given value — in some contexts, multiple null values are treated together;
for example, the treats all null values together.
So, this just says: every data type has a collection of possible values, which includes a special one that behaves differently, with the explicit purpose of *indicating* the absence of a value.
Most people will agree that something cannot be there and at the same time not be there. So, saying that the null value is an absent value is nonsense. Saying that the null value is a *placeholder* that an appear in any place where you can normally expect a value isn’t that strange IMO.
It looks a bit more ambiguous when you read the part that goes: “…it is unknown whether or not it is equal to any given value…”, but this is just part of the “specialness” of the null value that allows it to properly indicate absence of value.
So, in short, NULL is a value, its just a special one with the explicit purpose of indicating the absence of value. Some argue that the entire concept of absence of value is non-sense, but it turns out to be mighty convenient: for example, how would outer join operations work if we wouldn’t have a concept of absence?
#2 “Aggregates ignore NULLs — According to the standard, aggregates are supposed to ignore NULLs, because the information is missing. But why is it OK to ignore the missing information in an aggregate, but not, say, with the + operator? Is it really OK to just ignore it?”
If anything, this is really a happy convenience, not an evil foe. If you look at aggregate functions like AVG, or CORR, or for that matter, any aggregate that somewhere presumes counting – these cases are far better of with the current null value behaviour than with any alternative I can think of. For SUM, you could get around not having NULL by using 0 (zero) for the missing values. But for AVG and CORR, this doesn’t work anymore, because you can’t distinguish between known measurements and unknown or absent ones. This just boils down to the concept of counting things: counting is keeping track of that which is there, not that which is not there. If the null value really indicates the absence of value, it stands to reason not to count it. (BTW: I do agree that by that same logic, 1 + NULL should probably disregard the NULL too as something which is not there, and evaluate to 1, not NULL)
You could argue that aggregate functions that encounter null values somewhere should evaluate to the null value too, just like + does. However it would complicate things like CORR more than it’s worth IMO. You would have to expand your queries to ensure you’re always including only the rows for which all required values are known. When calculating multiple aggregates over multiple columns in the same query, this would complicate things even more, whereas the current behaviour will usually give results that will be statistically meaningful most of the time (I don’t mean “statistical” in a formal sense, I’m just saying the outcome of the aggregates will generally provide a good ball park figure that characterizes the population)
#3
“The aforementioned aggregate rules don’t work very well for COUNT(), the simplest of all aggregates.”
I don’t see why not – from where I’m at, not counting the values that indicate absence makes perfect sense, in fact it allows derived aggregates like CORR and AVG to work just fine
#4
“They have two versions of count: COUNT(*) breaks the “aggregates ignore nulls” rule”
The part about two versions – that’s completely true. The standard says:
4.15.4 Aggregate functions “…The result of the aggregate function COUNT (*) is the number of rows in the aggregation…”
The part about breaking any rules is not true however – the rules are designed not to break each other, and from what I can tell, they did a good job.
I guess you could argue that COUNT(*) is an unhappy choice of denotation, since it reminds is so much of COUNT(x). Something like ROWCOUNT would have been more explicit, and would have removed any source of confusion with COUNT(x).
At any rate, if you treat COUNT(*) as a single token and ignore the similarity to COUNT(x), well, it all works out fine.
#5
“But wait! There’s more: ARRAY_AGG() breaks the former but not the latter”
As far as I can see, this is not a standard SQL aggregate function, so you can’t blame the standard for that one
I guess the rules for this one are up to whoever decides to implement it.
#6
“Because of OUTER JOIN and aggregates, NULLs can appear even when you don’t have any NULLs in your database! As a thought experiment, try to reconcile this fact with the various “definitions” of NULL.”
IMO, this fits very neatly with the “absensce” notion of the null value. Without it, it would be very hard to work with sets that do not have a counterpart in another set. Please suggest an alternative for that problem – I’d be interested to hear how this would work without the null value concept.
BTW – I can see one case where the OUTER JOIN null value semantics would break down: if the schema design is such that all columns of the rows in the outer joined set were allowed to be nullable. But this would also mean that that the outer joined set does not have a primary key, so I consider this an unimportant corner case.
#7
“WHERE NOT IN (SELECT …) — This one gets everyone at one point or another. If the subselect produces any NULLs, then NOT IN can only evaluate to FALSE or NULL, meaning you get no tuples.”
#8
“NOT x IS NULL is not the same as x IS NOT NULL — If x is ROW(1,NULL), then the former will evaluate to TRUE, and the latter will evaluate to FALSE. Enjoy.”
To me, this has more to do with the urge to make SQL look like English than anything else . The problem is that IS NOT NULL is three keywords, but really just one operator. We are tempted to treat the NOT in IS NOT NULL as a operator on its own, but as you correctly point out, its not. Similarly we are tempted to treat the NULL keyword in there as if it is the standalone NULL keyword, but really it all makes much more sense to treat it as a unit “IS NOT NULL”, denoting a unary “not null test operator”. And yes, it doesn’t really help that IS NULL is a similarly looking operator that does not have the inverse meaning of IS NOT NULL
Here are my responses. They’re somewhat brief, but I don’t mean to be dismissive of the rest of your comment.
> So, this just says: every data type has a collection of possible values, which includes a
> special one that behaves differently, with the explicit purpose of *indicating* the
> absence of a value.
If SQL stopped there, that might be OK. But they actually prescribe what to do with the absence of a value in general cases. SQL requires different behaviors for different general cases, e.g., do one thing for an aggregates, one thing for IN, etc.
But what you actually want to do with the absence of data is highly context-sensitive… you may want to ignore it, you may want to raise an error, you may want to produce some other special value, you may want to use a default, the list goes on. SQL does not give you the tools to detect the possibility of NULLs at compile time, nor produce effective errors at runtime (just silently produces wrong results), nor give you effective means to handle NULLs as special cases.
> AVG, or CORR
Good examples. I am not suggesting that it should be impossible to ignore NULLs, only that it should not be required to ignore them just because the operation you’re performing happens to be an aggregate.
> Something like ROWCOUNT would have been more explicit, and would have
> removed any source of confusion with COUNT(x)
But COUNT(x) returns 0 when there are no input tuples (or they are all NULL). Other aggregates returns NULL in that case.
> As far as I can see, this [ARRAY_AGG] is not a standard SQL aggregate function
In SQL:2008, see 10.9 General Rules 8, and in particular, (g) (ii) Note 267.
> Without it, it would be very hard to work with sets that do not have a counterpart in
> another set.
See my reply to your other comment.
Thanks for reply!
gives me somthing to think about.
BTW – thanks for the link to array_agg – I was still working with the 2003 version – I guess i should start looking at the 2008 version.
Hi again
separate comment to separate it from my noisy previous comment
Aside all the things we could say about NULL, the thing I am really interested in is how you would solve all problems we now solve with outer joins. I can’t escape the notion that there is some need for “absence” in SQL in order to deal with these problems. I can’t see how this would be solved with what you called a simple “error early” system.
Thanks in advance,
Roland.
Please explain, I’m genuinly interested.
My earlier post has some related information:
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/
and I plan to write about this topic in the future in greater detail. Here are two abbreviated answers:
1. The outer join itself would produce the same result, but if you aggregate over the NULLs you’d get an ERROR. So, you’d have some syntax like SUM(x SKIP NULLS) and that would give you the result you’re expecting. The difference is that it’s an explicit guard, and it’s explicitly telling the query how to handle NULLs in that specific context. Guarding against NULLs shouldn’t be a difficult process, but it should be explicit and consistent across operations. This allows you to do a similar thing for an IN (SELECT …) query, which right now behaves differently than an aggregate.
Compare to something like python. If you pass None to a python method, what happens? It calls your method, and you figure out what to do (or you get an exception). Imagine how frustrating it would be if passing a None to a method meant that the method was not called at all, and it just picked something to return for you! That’s what SQL does.
The idea is that when you get a NULL, you have to do something with it, or it will produce an error. SQL uses the philosophy of passing the NULL along, which is a horrible idea, because you’ve lost the context where you know enough to do something useful with it.
2. Outer joins are easy to do away with completely. Instead of a left join, simply collect the matching tuples on the right hand side into a multiset, and then you aggregate over that multiset. For instance:
1 | foo
2 | bar
3 | baz
left join
1 | 10
1 | 20
2 | 30
2 | 30
results in:
1 | foo | { 10, 20 }
2 | bar | { 30, 30 }
3 | baz | {}
And if you SUM over the third column, you get 30, 60, and 0.
The reason this isn’t done now is because of an interpretation of first normal form that is, in my opinion, wrong.
Hi!
Thanks again – I got to think your answer through. You definitely have a point about passing the NULL along.
Your solution for outer joins is interesting too. I haven’t worked with multisets so I guess I should start and try.
kind regards,
Roland
Pingback: SQL NULLs are Evil! Part IV « Manni Wood
> perhaps this is why boolean columns are a PostgreSQL extension
> and not part of the core SQL standard
boolean has been part of the ANSI SQL standard since 1999 (SQL3)
“boolean has been part of the ANSI SQL standard since 1999 (SQL3)”
Looks like you’re right. For some reason I thought it was defined as a type, but you couldn’t have a column of type boolean. I wonder where I got that idea?
Thanks.
What’s even more interesting is that for BOOLEAN they invented the keyword UNKNOWN and the 2003 standard states “The null value of the boolean data type is equivalent to the Unknown truth value.” So for BOOLEAN (and only BOOLEAN AFAICT) you’re supposed to say WHERE <boolean primary> IS [NOT] UNKNOWN. And in the definition of “literal”, which is supposed to “Specify a non-null value”, “boolean literal” is equated to TRUE, FALSE or UNKNOWN (but the latter is equivalent to a “null value” a few pages later).
Wow. Very strange.
Hi all,
Interesting discussion there. I would like to know if we could derive a list of good practices for those who have to use SQL now, do not plan to rewrite a query language, and want to avoid the dangers of NULL. I try to start with these:
#1 – Always add a not null constraint on all your fields when creating a table.
#2 – Lift the not-null constraint only when you are sure you have to allow a data value to be tagged as “unknown” or “unknown yet” (the difference is noticeable).
#3 – When you have a nullable field, go all over your client code and make sure you coalesce your comparisons, your strings concatenation, etc.
#4 – Never allow nulls on any “key-like” field.
#5 – Never use null as a meaningful value. e.g. if you want to store the date value “infinite time in the past”, use the lowest date available in the system, not “null value”.
#6 – Do not allow null on real booleans. E.g. it is ok to have a nullable client_say_he_owns_a_dog field, but not a soft delete flag, which is either on or off.
#7 – If you have tables with a significant proportion of null cells, you should refactor by write the nullable fields in a new id-key-value table (so absence of value in a cell becomes the absence of a row, which is better).
#8 – Be careful when counting, count(*) is the row count, not the same as count(x), which counts the not null x values. (May be we could say “never use count(*), use count(id) instead” ?)
…
PS: I have messed a bit with PostgreSQL and I think it works great. I did have some headache about nulls in the begining but I never felt SQL was faulty on that: it was only badly used. I don’t think it is feasible to design a real-world database without any nullable fields, in many domains at least. So the pragmatic way would be to have a set of dos and don’ts most agree upon, like I tried above.
Best regards,
> #1 – Always add a not null constraint on all your fields when creating a table.
Unfortunately, I don’t think that’s a universal solution. (a) NULLs will be generated anyway, by outer join and aggregates; and (b) most SQL implementations don’t effectively optimize physical designs that avoid NULLs, so there may be a significant performance penalty.
> #3 – When you have a nullable field, go all over your client code and make sure you coalesce your comparisons, your strings concatenation, etc.
That’s certainly good practice. When you have a query, look at any fields that can possibly be NULL, consider how those might affect your application differently, and handle them appropriately.
> #5 – Never use null as a meaningful value. e.g. if you want to store the date value “infinite time in the past”, use the lowest date available in the system, not “null value”.
I think that’s a good rule. Don’t use NULL as an arbitrary “special value”
In my opinion NULL’s behaviour is deeply considered and NULLs behave in exact way they should to provide required functionality. NULL can be considered as ‘does not apply for that row’ and aggregate functions should ignore NULL values. However, adding anything to something that ‘does not apply’ makes the whole expression ‘not applicant’.
NULLs are very logical, as long as you don’t look for similarities in more primitive (3rd generation) language. SQL is more sophisticated language, which makes it possible to write in one query the functionality that would require over 100 lines of code in Java f.g. NULLs behaviour is one of reasons this is possible.
> NULLs are very logical
Can you point me to a specific algebra of NULLs, similar to the standard 2VL boolean logic algebra?
> SQL is more sophisticated language, which makes it possible to write in one query the functionality that would require over 100 lines of code in Java
I am not at all suggesting that imperative languages like Java are a replacement for declarative languages like SQL. I am saying that SQL could be better by taking a different approach to NULLs specifically.
Pingback: tech: Four short links: 5 August 2009 | tech3bite
Pingback: NULL is not a value | nullwert
hmm. i think null is conveniently different in diff circumstances. if it is any one ‘thing’ then it would have to be a human concept, and one of blankness, otherness, the dark void, the something else other than is or is not. null is effectively the void filled by spirituality, androgyny, and in some older systems chr(20) to fill out the field length.
> blankness, otherness, the dark void
I don’t suppose you have an algebra of blankness, do you ?
Pingback: IT Security Blog» Blog Archive » delicious/cdman83
Pingback: Four short links: 6 August 2009 | Design Website
Why is it that no one seems to know that …
NULL is the empty set.
All operations in SQL are based on set theory, not algebra as we normally consider it. Set theory is the *basis* for (ordinary) mathematics and mathematical operations. Set operations can appear to be very similar to “normal” mathematical operations, but since they may be missing key properties of that operation (say closure), they act in what appears to be a non-intuitive fashion.
So the definition of NULL is always consistent, it is the definition of operations over a set which may treat NULL elements in the set differently.
What is often the case is that what appears to be a case of …
NULL == NULL
is really a case of …
{NULL} != NULL (i.e. the set containing NULL != NULL alone)
For example, Aggregates often appear to be mathematical operations, like SUM(), but it is really defined as the operation over a set of elements which are either a number or elements convertible to a number. And when converting a NULL to a number for the purposes of a “SUM”, it is easy to say that its number value is zero (which is the same as ignoring it). Otherwise, SUM would be largely useless as an operation.
As such the SUM *operator* is not “+” it is a different (albeit similar) operator, and it inherently ignores NULL elements in the set in its computation. It isn’t inconsistent with “+”, because it was never meant to be the same as “+”.
Does that help anyone?
> NULL is the empty set.
>
> Set operations can appear to be very similar to “normal” mathematical operations, but since they may be missing key properties of that operation (say closure), they act in what appears to be a non-intuitive fashion.
What set operations lack closure? As far as I know set operations are a well-defined mathematical system, and the empty set is just a value in that system. Sets are closed over the operations UNION, INTERSECT, DIFFERENCE, etc.
Set operations are nothing like NULL semantics, which involve 3VL and strange exceptions. You certainly can’t explain all of my examples by just saying “NULL is the empty set”.
> {NULL} != NULL (i.e. the set containing NULL != NULL alone)
If you do “SELECT * FROM foo WHERE NULL = NULL”, which NULL is “{NULL}” and which NULL is “NULL”? They look indistinguishable to me, but the predicate does not evaluate to TRUE.
> Does that help anyone?
I think that this line of reasoning will lead to mistakes. For instance, how do you explain the fact that COUNT(*) doesn’t ignore NULLs? Or that “x IS NOT NULL” is not the same as “NOT x IS NULL” for all x?
> SQL’s approach to missing information is, well, “unique”.
Not so unique. You might be interested in the way R handles NULL value. It is fairly similar (even though many examples are not applicable). For instance:
> sum(NULL, 3)
[1] 3
> NULL + 3
numeric(0)
You should waste your brainpower on something else then SQL. There is no logic and elegance to it, and I predict it will soon be superceded by something simpler.
Thanks for this post and discussion.
I think it is important to distinguish between the context of NULLs. Most of the time, NULLs are a problem with math, not the logic per se. So if they appear in a numeric column, certain rules apply, but in a logical sense, others apply, etc. Also, if they are generated by the engine, from say a JOIN where no rows exist, that’s another form of them.
If anyone is ineterested, please chime in on a SQL Server post I started called The Logic, Mathematics, and Utility of NULLs. Here is the link:
http://www.sqlservercentral.com/Forums/Topic970493-374-1.aspx.