How to Bring Import and Export Back

I seem to remember that import and export used to be standard fare in any application that might hold your data. To an extent, that’s still true, but it seems to be becoming less so, especially with web applications. In my last post, I pointed out that the import and export capabilities of Serendipity are severely lacking. If you export, and then try to import into a fresh installation, you lose all the comments! That simply doesn’t sound like a valid import/export feature at all, and users would probably be safer if it didn’t exist. WordPress is better, and its export file gives me more confidence that it’s correct, but it’s still not without flaws: the comment threading gets completely garbled on import to a fresh install.

If you read the manuals for these pieces of software, the general suggestion is to copy the files and do a dump/reload of the database. Of course that will work, but it’s far from ideal for a number of reasons:

  • You can’t import into anything other than a fresh install: the auto-numbered fields will overlap and cause key conflicts.
  • Why should you have to be a DBA to be able to save your data?
  • What if it’s a hosted service managed by someone else? Maybe you don’t have direct access to the database system.
  • How do you move data to a new version? Do you just hope it works five versions later, or do you go through a chain of upgrades so that your data doesn’t get destroyed?

I think we are missing a lot of control over our data that we might have had before. Why is this happening?

  • It’s not a huge draw for new users; only existing users care about import/export.
  • With import/export, users can move their data to another application. Developers might not mind if a user leaves, but they also might not go out of their way to make leaving easier.
  • Getting import/export right is hard: it requires discipline in the release process and generally slows down development.
  • It also increases testing requirements.

I think that PostgreSQL can do something to help, though: make pg_dump and pg_restore callable in a way similar to COPY. Application developers don’t want to locate and invoke an executable somewhere on the filesystem, supply connection information all over again, and so on, just to give the user a file to save. Also, pg_dump and pg_restore should be improved to allow renaming identifiers on the fly, because that would allow the application developer to import the data into a “staging area”. From the staging area, they could make the modifications needed to integrate it with the rest of the database (changing auto-generated keys, for example).
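
For contrast, here is a minimal sketch of what an application typically has to do today, assuming the pg_dump executable is somewhere on the PATH and using illustrative connection details: it shells out and re-supplies connection information it already has in its database connection.

    import subprocess

    def export_backup(dbname, host, port, user, outfile):
        # Today's workaround: locate and invoke the pg_dump executable,
        # re-supplying connection details the application already knows.
        subprocess.run(
            ["pg_dump",
             "--format=custom",   # custom archive format, restorable with pg_restore
             "--host", host,
             "--port", str(port),
             "--username", user,
             "--file", outfile,
             dbname],
            check=True,           # raises if pg_dump is missing or fails
        )

If pg_dump were callable the way COPY is, the same bytes could instead come back over the connection the application already holds.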

With features like these, application developers would have a standard template to follow for offering import/export capabilities that would not be too onerous:

  1. Backups are simple: call pg_dump (telling it to prefix all schema names with the name of the backup) and send to client, using libpq the whole time.
  2. Restore puts all the data into the existing database, where it won’t conflict with anything (because of the schema name prefix).
  3. Perform all automatic schema upgrades to get it ready for the version of the software you’re importing into.
  4. Now you have a nice staging area already committed, and you can lead the user through a series of questions about how they’d like to integrate the imported data and the existing data. If there are auto-numbering conflicts, show them how they can map the objects to new auto-generated numbers; if there is a real conflict, show them how they can merge it.
  5. Move the data in (a sketch of steps 4 and 5 in code follows this list). Done.
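
To make steps 4 and 5 concrete, here is a rough sketch in Python using psycopg2; the schema, table, and column names (backup_2008, entries, comments, and so on) are hypothetical, and the client library is just one possible choice. Once the dump is sitting in a prefixed schema, remapping conflicting auto-generated keys and moving the rows is ordinary SQL.

    import psycopg2

    # Hypothetical layout: the backup was restored into schema "backup_2008",
    # the live data lives in "public", and entries.id is a serial column.
    conn = psycopg2.connect("dbname=blog")
    cur = conn.cursor()

    # Step 4: remap imported auto-generated keys onto fresh values from the
    # live sequence, remembering the mapping so child rows can follow.
    cur.execute("""
        CREATE TEMP TABLE id_map AS
        SELECT e.id AS old_id,
               nextval('public.entries_id_seq') AS new_id
        FROM backup_2008.entries e
    """)
    cur.execute("""
        UPDATE backup_2008.comments c
        SET entry_id = m.new_id
        FROM id_map m
        WHERE c.entry_id = m.old_id
    """)

    # Step 5: move the data into the live tables.
    cur.execute("""
        INSERT INTO public.entries (id, title, body, posted_at)
        SELECT m.new_id, e.title, e.body, e.posted_at
        FROM backup_2008.entries e
        JOIN id_map m ON m.old_id = e.id
    """)
    cur.execute("""
        INSERT INTO public.comments (entry_id, author, body, posted_at)
        SELECT c.entry_id, c.author, c.body, c.posted_at
        FROM backup_2008.comments c
    """)
    conn.commit()

Because the staging area is just ordinary tables, the user could also stop partway through and finish the rest later.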

Compare the aforementioned plan with some custom XML export format: where does the data go while you’re asking the user how to resolve the conflicts? You have to find a place to put it, try to upgrade it to the latest version, ask the user how to resolve the conflicts, convert it to SQL, and import it, all as one big awkward process. Not only that, the user doesn’t get access to any of these intermediate stages, so they can’t fix some data now, and then fix the rest later. And if the user is a DBA, they can’t use a clever SQL query involving the imported data and the existing data to save themselves time.
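
For example (reusing the hypothetical names from the sketch above), such a query might look like this:

    import psycopg2

    conn = psycopg2.connect("dbname=blog")   # connection details are illustrative
    cur = conn.cursor()

    # Spot imported entries that appear to duplicate live ones before merging;
    # "backup_2008" is the staging schema, "public" holds the live data.
    cur.execute("""
        SELECT i.id AS imported_id, l.id AS live_id, i.title
        FROM backup_2008.entries i
        JOIN public.entries l
          ON l.title = i.title AND l.posted_at = i.posted_at
    """)
    for imported_id, live_id, title in cur.fetchall():
        print("possible duplicate:", title, "(imported id", imported_id, "/ live id", live_id, ")")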

Are there other ways that PostgreSQL can make offering import/export capabilities from an application easier?

3 thoughts on “How to Bring Import and Export Back”

  1. The idea of making pg_dump a library, so that it’s easier to call from an application, has already been brought up here:

    http://archives.postgresql.org/pgsql-hackers/2008-07/msg01252.php

    I envision that it would be almost as easy as executing a COPY and sending the results back as a file. It would need a few more options, but as long as you don’t have to execute a separate program on the filesystem, it would work for me.

    pg_dump/restore would still need to be able to change names on the fly, however. I think that’s important for a number of reasons.
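
    (For illustration, here is roughly what the COPY model looks like from a client library today; the library, table, and file names are just one possible choice.)

        import psycopg2

        conn = psycopg2.connect("dbname=blog")   # connection details are illustrative
        cur = conn.cursor()

        # COPY already streams a table back over the same connection, with no
        # external executable involved; the envisioned pg_dump call would work
        # the same way, just with a few more options.
        with open("comments.csv", "w") as f:
            cur.copy_expert("COPY public.comments TO STDOUT WITH CSV HEADER", f)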
