Database Formalities | Experimental Thoughts

Most application development can essentially be described as formalizing a business process. You take an informal process, remove the ambiguity that’s invariably present, adapt it to computer interfaces, and restate that process in a formal language. This has many benefits, among them:

Automation – This is the most obvious benefit of using a programming language to formalize a process. The computer can now understand the process, and can execute the process without unnecessary human intervention.
Unambiguous – Ambiguity can be removed without actually creating an application, but you can’t create an application without removing the ambiguity. Some of it is removed through discussions with others involved in the project, but most is resolved through the intuition of the application developer.
The ability to handle complexity – Simple processes can be handled correctly by humans using intuition, rules of thumb, and following patterns. But complex processes are much easier to solve with a formal process that has independently verifiable parts.

Consider arithmetic. People do simple multiplication all of the time, but when faced with complexity — e.g. multiplying several large numbers together — it’s much easier to follow formal steps to arrive at a correct answer. Not only is the person more likely to arrive at the correct result, they are also able to re-check the individual steps, and others are able to independently verify those steps. How do we know the formal arithmetic processes we use are correct? We devise them using another formal process: mathematics.Using the formal processes are certainly not required to achieve the correct answer. Ancient civilizations accomplished a lot without the benefit of all of the formal processes that we have today. The point is that formal processes will lead you to solutions as good or better than informal processes, and the formal process will be more effective and efficient at reaching that solution. Also note that not all systems are equal when it comes to formalized processes: try multiplying those large numbers together using Roman numerals!

So what’s this have to do with databases? Normalization is a formal process, but sometimes it’s discussed as though it were a logical design choice. For simple very simple databases, normalization is not important, because the “best” database design is obvious to developers and DBAs alike. I’m talking about logical design here — in specific products there may be limitations that make it very difficult to implement the best logical design efficiently. The point is that if you see two different database designs, both representing the exact same business rules, one in 2NF and one in 3NF, most people (at least those familiar with data management) would strongly agree that the 3NF design is better. The 3NF design enforces certain constraints and eliminates certain redundancies that may exist in the 2NF design, and the benefits of the 3NF design will be apparent.

What happens is that some people confuse two different concepts: formalizing business rules and normalization. Someone may consider two different designs, and perceive one to be “more normalized” and the other to be “less normalized”, but in reality the two designs are likely representing two different sets of business rules. If they choose the latter design, they may call it “denormalization”, even if both designs are in 5NF. Formalizing business rules does involve many design choices, but normalization is just a good way (correct, unambiguous, and mechanizable — that is, formal) to create a good design from a complex set of business rules, after the business rules have already been formalized. So, the design choices are in formalizing the business rules, and normalization is just a tool that creates a good database design based on those business rules.

Formalization is really what separates relational DBMSs from every other type of DBMS. Relations have predicates and tuples in the relation can instantiate those predicates to create propositions, and these predicates and propositions can be manipulated using formal logic. The relational operators operate on relations and produce new relations, and these relational operators have an associated logical meaning, making a formal connection between databases and logic. For instance, JOIN corresponds to logical AND.

If you want to answer a complex question of your DBMS, how would you do so with confidence without a formal system? First of all if your database constraints do not closely match the predicates of your relations, you have a lot of work to do. You’d need to examine application code, and the data already in the system, to determine the actual meaning of the data. Next, if you have only a handful of relations, and simple predicates, you may be able to make some simple inferences correctly, and be able to convince yourself that the result of your manipulations does indeed provide a correct answer to the question. If you have more relations or more complex predicates, it’s almost impossible to be sure that the inferences you’re making are correct, or to convince anyone else of their correctness. So what people do in practice is to ask simpler questions of their data, and rely less on the answers they get.

With a formal model, such as the relational model, you can ask very complex questions about very complex data and trust the results a great deal, because you can formalize the business rules to form predicates, and infer new information from the data in a way that can be independently verified and automated.