Data Labels and Predicates

Here are a couple of common representations of data sets:

username | realname   | phone
jdavis   | Jeff Davis | 555-1212
jsmith   | John Smith | 555-2323


<realname>Jeff Davis</realname>
<realname>John Smith</realname>

We naturally think of the former as representing a relation, but both representations are logically equivalent, even if the latter is overly verbose. In both cases, we’ve essentially just labeled the data, by which I mean we’ve briefly documented the meaning of individual properties, independent of the other properties.

If we were to form a predicate out of this data set, it would look something like (call this P1):
There exists a user with user name [username] and real name [realname] and phone number [phone].

Notice that it’s just several simpler predicates connected with AND. We can freely omit parts of the predicate, such as phone, and still have true statements (although, as explained below, this is different from relational projection).

However, this does not hold true in the general case, where a predicate is more complicated than just a collection of simple predicates connected with AND. For example, let’s add a new property, called during, that represents the time interval over which the predicate is (or was) true. Now the predicate looks something like this (call this P2):
There exists (or existed) a user with user name [username] and real name [realname] and phone number [phone] during the interval of time [during].

Now, we can no longer freely remove the during portion of the predicate, because the statement will no longer be true.

The logical reason for this is that the relational projection operator doesn’t merely eliminate a part of the predicate; it turns the corresponding variable into a bound variable by existentially quantifying it. If we use projection to eliminate the phone attribute from P1, we get a new predicate (call it P3):
There exists a phone number [phone] such that there exists a user with user name [username] and real name [realname] and phone number [phone].

In this case, the variable [phone] is bound, and the tuples that satisfy that predicate only have two attributes: username and realname. This predicate is very similar to just removing phone from the predicate entirely, as though it never existed. However, this predicate does tell you that the user has a phone number (although not what the number is), so it is not identical.
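To make the quantification concrete, here is a minimal sketch in Python, assuming a relation is modeled as a set of tuples plus a heading. The names project and satisfies_p3 are hypothetical helpers invented for this illustration, not part of any library.

```python
# Relation for P1. Each tuple asserts: "There exists a user with user name
# [username] and real name [realname] and phone number [phone]."
P1_HEADING = ("username", "realname", "phone")
p1 = {
    ("jdavis", "Jeff Davis", "555-1212"),
    ("jsmith", "John Smith", "555-2323"),
}


def project(heading, relation, keep):
    """Relational projection: keep the named attributes and drop duplicates."""
    positions = [heading.index(name) for name in keep]
    return {tuple(row[i] for i in positions) for row in relation}


def satisfies_p3(username, realname, relation):
    """P3 read literally: there exists SOME phone such that
    (username, realname, phone) satisfies P1."""
    return any(u == username and r == realname for (u, r, _phone) in relation)


# Projecting away phone yields exactly the tuples that satisfy P3.
p3 = project(P1_HEADING, p1, ("username", "realname"))
print(sorted(p3))
# [('jdavis', 'Jeff Davis'), ('jsmith', 'John Smith')]
```

The any(...) call plays the role of "there exists a phone number": the phone value is consulted to decide membership, but it never appears in the result. That is the sense in which [phone] is bound rather than simply deleted.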

However, if we use projection to eliminate the during attribute from P2, we get the predicate (call it P4):
There exists some time interval [during] such that there exists (or existed) a user with user name [username] and real name [realname] and phone number [phone] during the interval of time [during].

In this case, the resulting predicate is very different from the predicate where during is removed entirely. That is, P4 is different from P1, even though the tuples that satisfy P4 are of the same form (3 attributes) as those that satisfy P1.
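The difference between P4 and P1 can be sketched the same way. The rows below, and in particular the intervals, are invented purely for illustration; intervals are assumed to be closed pairs of ISO date strings, and project and valid_on are hypothetical helpers.

```python
# Relation for P2. The during values are made-up (start, end) date pairs.
P2_HEADING = ("username", "realname", "phone", "during")
p2 = {
    ("jdavis", "Jeff Davis", "555-1212", ("2001-01-01", "2005-06-30")),
    ("jdavis", "Jeff Davis", "555-9999", ("2005-07-01", "2008-12-31")),
    ("jsmith", "John Smith", "555-2323", ("2003-03-01", "2004-02-29")),
}


def project(heading, relation, keep):
    """Relational projection: keep the named attributes and drop duplicates."""
    positions = [heading.index(name) for name in keep]
    return {tuple(row[i] for i in positions) for row in relation}


def valid_on(relation, day):
    """Tuples of P2 whose interval contains day, with during stripped off.
    ISO date strings compare correctly as plain strings."""
    return {
        (u, r, p)
        for (u, r, p, (start, end)) in relation
        if start <= day <= end
    }


# P4: "there exists SOME interval during which this was true".
p4 = project(P2_HEADING, p2, ("username", "realname", "phone"))

# What was actually true on one particular day.
snapshot = valid_on(p2, "2006-01-01")

print(len(p4))    # 3
print(snapshot)   # {('jdavis', 'Jeff Davis', '555-9999')}
```

Both p4 and snapshot hold three-attribute tuples that look like rows of P1, yet they answer different questions. Reading p4 as if it were P1 would assert that jdavis has two phone numbers and that jsmith is a user, none of which need be true at any single moment.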

The point of all this is that the simple labeling of data has the underlying assumption that the attributes are independent, i.e., you can simply omit attributes of the data and still have accurate information. This is not true in the general case, because attributes are not independent except in the simplest cases. And the meaning of an individual attribute is much more complex than the “obvious” meaning that a label might convey.

Predicates, which form propositions, are a much more complete way to represent the meaning of data, and there is no requirement that the attributes be independent. Attribute names (i.e., labels) are better used as a reminder than as a representation of the actual meaning. This is where the relational model succeeds and XML fails: the relational model provides operators that act on predicates with concrete logical meaning, whereas XML relies on data labels, which don’t accurately represent complex meaning.
