I seem to remember that import and export used to be standard fare on any application that might hold your data. To an extent, that’s still true, but it seems like it’s becoming less so — especially with web applications. In my last post, I pointed out that the import and export capabilities of Serendipity are severely lacking. If you export, and then try to import into a fresh installation, you will lose all the comments! That simply doesn’t sound like a valid import/export feature at all, and users would probably be safer if it didn’t exist. WordPress is better, and the export file gives me more confidence that it’s correct, but it’s still not without flaw: the comment threading gets completely garbled on import to a fresh install.
I decided that I need to start posting again, it’s been a while since I’ve taken the time to write. Now I’m using WordPress, instead of Serendipity, because of the overall quality and because of the import/export capabilities (caveat: WP isn’t perfect here, either… more on this later). This decision took some consideration because, as a PostgreSQL advocate, I naturally would want to choose software that works with PostgreSQL (Serendipity does; WordPress is MySQL-only).
It got me thinking a little bit about why I like PostgreSQL, when it matters, and when it doesn’t.
In my last post, Why DBMSs are so complex, I raised the issue of type mismatches between the application language and the DBMS.
Type matching between the DBMS and the application is as important as types themselves for successful application development. If a type behaves one way in the DBMS, and a “similar” type behaves slightly differently in the application, that can only cause confusion. And it’s a source of unnecessary awkwardness: you already need to define the types that suit your business best in one place, why do you need to redefine them somewhere else, based on a different basic type system?
Developing a database application means converting your data into types understood by the DBMS, generating SQL code, sending it to the DBMS for evaluation, retrieving the result, and converting the result into types understood by your application language. It’s even more strange when you use parameterized SQL: you generate some code with placeholders instead of literals, and then you send the literals separately (by “more strange,” I certainly don’t mean “worse,” — parameterized SQL is a huge improvement).
If this design were suggested for any purpose other than interacting with a DBMS, others would react with immediate (and justifiable) suspicion. The use of “eval” is strongly avoided, and many popular languages don’t even support it. So why is this design so widely accepted for database management systems?
Here are a couple common representations of data sets:
username | realname | phone
jdavis | Jeff Davis | 555-1212
jsmith | John Smith | 555-2323
Most application development can essentially be described as formalizing a business process. You take an informal process, remove the ambiguity that’s invariably present, adapt it to computer interfaces, and restate that process in a formal language. This has many benefits, among them:
- Automation – This is the most obvious benefit of using a programming language to formalize a process. The computer can now understand the process, and can execute the process without unnecessary human intervention.
- Unambiguous – Ambiguity can be removed without actually creating an application, but you can’t create an application without removing the ambiguity. Some of it is removed through discussions with others involved in the project, but most is resolved through the intuition of the application developer.
- The ability to handle complexity – Simple processes can be handled correctly by humans using intuition, rules of thumb, and following patterns. But complex processes are much easier to solve with a formal process that has independently verifiable parts.
I recently read the following article, another opinion in the long-standing surrogate key vs. natural key “debate”:
I put “debate” in quotes because it’s filled with so much confusion, I don’t think most people understand what they’re arguing over. People use different definitions for surrogate keys, often in the same discussion, so for the purposes of this article I’ll define surrogate key to mean: “A system generated key that is not derived from the business rules, hidden from the business, but visible to the application”. The surrogate key is used to take the place of a natural key, which is derived from the business rules. In other words, a surrogate key is a pointer.
The solution ORMs provide is to “map” an object class, which is a type (or domain), onto a table, which is a relation variable (a.k.a. relvar). This supposedly abstracts away an “impedance mismatch” between the two. ORMs are already off to a bad start, mapping a type to a variable, but I’ll continue.
The real impedance mismatch is a set of very fundamental differences between application data and data in a well-designed relational database.
My employer is evaluating RT. So, we have a test instance set up, of course, and it’s in a working state. The application is written in Perl with the Mason framework, it has support for many databases (including PostgreSQL), and generally has a decent feature set.
We need to migrate from our previous system, so we need to import the data into RT. RT’s tools and APIs do not meet our importing needs, so we look into the schema to do the import directly. I decide the easiest way to figure out how the schema fits together is to just turn on query logging. So, we reload the page, and all of a sudden a deluge of output appears in our “tail -f”. At first I thought it was just the normal number of queries generated by some abstraction layer, and we could dig through them. But it was about 700 queries.