Most application development can essentially be described as formalizing a business process. You take an informal process, remove the ambiguity that’s invariably present, adapt it to computer interfaces, and restate that process in a formal language. This has many benefits, among them:
- Automation – This is the most obvious benefit of using a programming language to formalize a process. The computer can now understand the process, and can execute the process without unnecessary human intervention.
- Unambiguous – Ambiguity can be removed without actually creating an application, but you can’t create an application without removing the ambiguity. Some of it is removed through discussions with others involved in the project, but most is resolved through the intuition of the application developer.
- The ability to handle complexity – Simple processes can be handled correctly by humans using intuition, rules of thumb, and following patterns. But complex processes are much easier to solve with a formal process that has independently verifiable parts.
So what’s this have to do with databases? Normalization is a formal process, but sometimes it’s discussed as though it were a logical design choice. For simple very simple databases, normalization is not important, because the “best” database design is obvious to developers and DBAs alike. I’m talking about logical design here — in specific products there may be limitations that make it very difficult to implement the best logical design efficiently. The point is that if you see two different database designs, both representing the exact same business rules, one in 2NF and one in 3NF, most people (at least those familiar with data management) would strongly agree that the 3NF design is better. The 3NF design enforces certain constraints and eliminates certain redundancies that may exist in the 2NF design, and the benefits of the 3NF design will be apparent.
What happens is that some people confuse two different concepts: formalizing business rules and normalization. Someone may consider two different designs, and perceive one to be “more normalized” and the other to be “less normalized”, but in reality the two designs are likely representing two different sets of business rules. If they choose the latter design, they may call it “denormalization”, even if both designs are in 5NF. Formalizing business rules does involve many design choices, but normalization is just a good way (correct, unambiguous, and mechanizable — that is, formal) to create a good design from a complex set of business rules, after the business rules have already been formalized. So, the design choices are in formalizing the business rules, and normalization is just a tool that creates a good database design based on those business rules.
Formalization is really what separates relational DBMSs from every other type of DBMS. Relations have predicates and tuples in the relation can instantiate those predicates to create propositions, and these predicates and propositions can be manipulated using formal logic. The relational operators operate on relations and produce new relations, and these relational operators have an associated logical meaning, making a formal connection between databases and logic. For instance, JOIN corresponds to logical AND.
If you want to answer a complex question of your DBMS, how would you do so with confidence without a formal system? First of all if your database constraints do not closely match the predicates of your relations, you have a lot of work to do. You’d need to examine application code, and the data already in the system, to determine the actual meaning of the data. Next, if you have only a handful of relations, and simple predicates, you may be able to make some simple inferences correctly, and be able to convince yourself that the result of your manipulations does indeed provide a correct answer to the question. If you have more relations or more complex predicates, it’s almost impossible to be sure that the inferences you’re making are correct, or to convince anyone else of their correctness. So what people do in practice is to ask simpler questions of their data, and rely less on the answers they get.
With a formal model, such as the relational model, you can ask very complex questions about very complex data and trust the results a great deal, because you can formalize the business rules to form predicates, and infer new information from the data in a way that can be independently verified and automated.