Experimental Thoughts: Ideas on Databases, Logic, and Language, by Jeff Davis
http://thoughts.davisjeff.com

Taking a step back from ORMs
Jeff Davis, Sun, 26 Feb 2012
http://thoughts.davisjeff.com/2012/02/26/taking-a-step-back-from-orms/

Do object-relational mappers (ORMs) really improve application development?

When I started developing web applications, I used Perl. Not even all of Perl: mostly just a bunch of “if” statements and an occasional loop that happened to be valid Perl (aside: I remember being surprised that I was allowed to write a loop that would run on a shared server, because “what if it didn’t terminate?!”). I didn’t use databases; I used a mix of files, regexes to parse them, and flock to control concurrency (not because of foresight or good engineering, but because I ran into concurrency-related corruption).

I then made the quantum leap to databases. I didn’t see the benefits instantaneously[1], but it was clearly a major shift in the way I developed applications.

Why was it a quantum leap? Well, many reasons, which are outside of the scope of this particular post, but which I’ll discuss more in the future. For now, I’ll just cite the overwhelming success of SQL over a long period of time; and the pain experienced by anyone who has built and maintained a few NoSQL applications[2].

I don’t think ORMs are a leap forward; they are just an indirection[3] between the application and the database. Although it seems like you could apply the same kind of “success” argument, it’s not the same. First of all, ORM users are a subset of SQL users, and I think there are a lot of SQL users that are perfectly content without an ORM. Second, many ORM users feel the need to “drop down” to the SQL level frequently to get the job done, which means you’re not really in new territory.

And ORMs do have a cost. Any tool that uses a lot of behind-the-scenes magic will cause a certain amount of trouble — just think for a moment on the number of lines of code between the application and the database (there and back), and imagine the subtle semantic problems that might arise.

To be more concrete: one of the really nice things about using a SQL DBMS is that you can easily query the database as though you were the application. So, if you are debugging the application, you can quickly see what’s going wrong by seeing what the application sees right before the bug is hit. But you quickly lose that ability when you muddy the waters with thousands of lines of code between the application error and the database[4]. I believe the importance of this point is vastly under-appreciated; it’s one of the reasons that I think a SQL DBMS is a quantum leap forward, and it applies to novices as well as experts.

A less-tangible cost of ORMs is that developers are tempted to remain ignorant of the SQL DBMS and the tools that it has to offer. All these features in a system like PostgreSQL are there to solve problems in the easiest way possible; they aren’t just “bloat”. Working with multiple data sources is routine in any business environment, but if you don’t know about support for foreign tables in PostgreSQL, you’re likely to waste a lot of time re-implementing similar functionality in the application. Cache invalidation (everything from memcache to statically-rendered HTML) is a common problem — do you know about LISTEN/NOTIFY? If your application involves scheduling, and you’re not using Temporal Keys, there is a good chance you are wasting development time and performance; and likely sacrificing correctness. The list goes on and on.

Of course there are reasons why so many people use ORMs, at least for some things. A part of it is that application developers may think that learning SQL is harder than learning an ORM, which I think is misguided. But a more valid reason is that ORMs do help eliminate boilerplate in some common situations.

But are there simpler ways to avoid boilerplate? It seems like we should be able to do so without something as invasive as an ORM. For the sake of brevity, I’ll be using hashes rather than objects, but the principle is the same. The following examples are in Ruby using the ‘pg’ gem (thanks Michael Granger for maintaining that gem!).

First, to retrieve records as a hash, it’s already built into the ‘pg’ gem. Just index into the result object, and you get a hash. No boilerplate there.

Second, to do an insert, there is a little boilerplate. You have to build a string (yuck), put in the right table name, make the proper field list (unless you happen to know the column ordinal positions, yuck again), and then put in the values. And if you add or change fields, you probably need to modify it. Oh, and be sure to avoid SQL injection!

Fortunately, once we’ve identified the boilerplate, it’s pretty easy to solve:

# 'conn' is a PG::Connection object
def sqlinsert(conn, table, rec)
  table     = conn.quote_ident(table)
  rkeys     = rec.keys.map{|k| conn.quote_ident(k.to_s)}.join(",")
  positions = (1..rec.keys.length).map{|i| "$" + i.to_s}.join(",")
  query     = "INSERT INTO #{table}(#{rkeys}) VALUES(#{positions})"
  conn.exec(query, rec.values)
end

The table and column names are properly quoted, and the values are passed in as parameters. And, if you add new columns to the table, the routine still works; you just end up with defaults for the unspecified columns.
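To see what the routine actually sends to the server, here is a quick sanity check of the string-building logic, using a stub in place of a live PG::Connection so that no database is needed. The stub’s quote_ident is an assumption that mimics libpq’s simple double-quote identifier quoting, and the “users” table and its columns are invented for illustration.

```ruby
# A stub standing in for PG::Connection: quote_ident mimics libpq's
# identifier quoting (double quotes, with embedded quotes doubled),
# and exec just records what would have been sent to the server.
StubConn = Struct.new(:last_query, :last_params) do
  def quote_ident(s)
    '"' + s.gsub('"', '""') + '"'
  end

  def exec(query, params)
    self.last_query  = query
    self.last_params = params
  end
end

# Same routine as above.
def sqlinsert(conn, table, rec)
  table     = conn.quote_ident(table)
  rkeys     = rec.keys.map{|k| conn.quote_ident(k.to_s)}.join(",")
  positions = (1..rec.keys.length).map{|i| "$" + i.to_s}.join(",")
  query     = "INSERT INTO #{table}(#{rkeys}) VALUES(#{positions})"
  conn.exec(query, rec.values)
end

conn = StubConn.new
sqlinsert(conn, "users", {name: "Alice", age: 30})

puts conn.last_query           # INSERT INTO "users"("name","age") VALUES($1,$2)
p    conn.last_params          # ["Alice", 30]
```

Note that the values never appear in the query string at all; they travel separately as parameters, which is why SQL injection isn’t a concern here.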

I’m sure others can come up with other examples of boilerplate that would be nice to solve. But the goal is not perfection; we only need to do enough to make simple things simple. And I suspect that only requires a handful of such routines.

So, my proposal is this: take a step back from ORMs, and consider working more closely with SQL and a good database driver. Try to work with the database, and find out what it has to offer; don’t use layers of indirection to avoid knowing about the database. See what you like and don’t like about the process after an honest assessment, and whether ORMs are a real improvement or a distracting complication.

[1]: At the time, MySQL was under a commercial license, so I tried PostgreSQL shortly thereafter. I switched between the two for a while (after MySQL became GPL), and settled on PostgreSQL because it was much easier to use (particularly for date manipulation).

[2]: There may be valid reasons to use NoSQL, but I’m skeptical that “ease of use” is one of them.

[3]: Some people use the term “abstraction” to describe an ORM, but I think that’s misleading.

[4]: The ability to explore the data through an ORM from a REPL might resemble the experience of using SQL. But it’s not nearly as useful, and certainly not as easy: if you determine that the data is wrong in the database, you still need to figure out how it got that way, which again involves thousands of lines between the application code that requests a modification and the resulting database update.

SQL: the successful cousin of Haskell
Jeff Davis, Sun, 25 Sep 2011
http://thoughts.davisjeff.com/2011/09/25/sql-the-successful-cousin-of-haskell/

Haskell is a very interesting language, and shows up on sites like http://programming.reddit.com frequently. It’s somewhat mind-bending, but very powerful, and it has some great theoretical advantages over other languages. I have been learning it on and off for some time, never really getting comfortable with it, but being inspired by it nonetheless.

But discussion on sites like reddit usually falls a little flat when someone asks a question like:

If haskell has all these wonderful advantages, what amazing applications have been written with it?

The responses to that question usually aren’t very convincing, quite honestly.

But what if I told you there was a wildly successful language, in some ways the most successful language ever, and it could be characterized by:

  • lazy evaluation
  • declarative
  • type inference
  • immutable state
  • tightly controlled side effects
  • strict static typing

Surely that would be interesting to a Haskell programmer? Of course, I’m talking about SQL.

Now, it’s all falling into place. All of those theoretical advantages become practical when you’re talking about managing a lot of data over a long period of time, and trying to avoid making any mistakes along the way. Really, that’s what relational database systems are all about.

I speculate that SQL is so successful and pervasive that it stole the limelight from languages like haskell, because the tough problems that haskell would solve are already solved in so many cases. Application developers can hack up a SQL query and run it over 100M records in 7 tables, glance at the result, and turn it over to someone else with near certainty that it’s the right answer! Sure, if you have a poorly-designed schema and have all kinds of special cases, then the query might be wrong too. But if you have a mostly-sane schema and mostly know what you’re doing, you hardly even need to check the results before using the answer.

In other words, if the query compiles, and the result looks anything like what you were expecting (e.g. the right basic structure), then it’s probably correct. Sound familiar? That’s exactly what people say about haskell.

It would be great if haskell folks would get more involved in the database community. It looks like a lot of useful knowledge could be shared. Haskell folks would be in a better position to find out how to apply theory where it has already proven to be successful, and could work backward to find other good applications of that theory.

Competing directly in the web application space against languages like ruby and javascript is going to be an uphill battle even if haskell is better in that space. I’ve worked with some very good ruby developers, and I honestly couldn’t begin to tell them where haskell might be a practical advantage for web application development. Again, I don’t know much about haskell aside from the very basics. But if someone like me who is interested in haskell and made some attempt to understand it and read about it still cannot articulate a practical advantage, clearly there is some kind of a problem (either messaging or technical). And that’s a huge space for application development, so that’s a serious concern.

However, the data management space is also huge — a large fraction of those applications exist primarily to collect data or present data. So, if haskell folks could work with the database community to advance data management, I believe that would inspire a lot of interesting development.

Building SQL Strings Dynamically, in 2011
Jeff Davis, Sat, 09 Jul 2011
http://thoughts.davisjeff.com/2011/07/09/building-sql-strings-dynamically-in-2011/

I saw a recent post, Avoid Smart Logic for Conditional WHERE Clauses, which actually recommended, “the best solution is to build the SQL statement dynamically—only with the required filters and bind parameters”. Ordinarily I appreciate that author’s posts, but this time I think he let confusion run amok, as can be seen in a thread on reddit.

To dispel that confusion: parameterized queries don’t have any plausible downsides; always use them in applications. Saved plans have trade-offs; use them sometimes, and only if you understand the trade-offs.

When query parameters are conflated with saved plans, it creates FUD about SQL systems, because it mixes the fear around SQL injection with the mysticism around the SQL optimizer. Such confusion about the layers of a SQL system is a big part of the reason that some developers move to the deceptive simplicity of NoSQL systems (I say “deceptive” here because it often just moves an even greater complexity into the application — but that’s another topic).

The confusion started with this query from the original article:

SELECT first_name, last_name, subsidiary_id, employee_id
FROM employees
WHERE ( subsidiary_id    = :sub_id OR :sub_id IS NULL )
  AND ( employee_id      = :emp_id OR :emp_id IS NULL )
  AND ( UPPER(last_name) = :name   OR :name   IS NULL )

[ Aside: In PostgreSQL those parameters should be $1, $2, and $3; but that's not relevant to this discussion. ]

The idea is that one such query can be used for several types of searches. If you want to ignore one of those WHERE conditions, you just pass a NULL as one of the parameters, and it makes one side of the OR always TRUE, thus the condition might as well not be there. So, each condition can either be there and have one argument (restricting the results of the query), or be ignored by passing a NULL argument; thus effectively giving you 8 queries from one SQL string. By eliminating the need to use different SQL strings depending on which conditions you want to use, you reduce the opportunity for error.
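The per-row effect of passing a NULL can be mimicked in plain Ruby — a sketch of the logic only, since the database evaluates this quite differently: a nil argument makes its “IS NULL” disjunct true, so that filter simply stops restricting rows. The sample row below is invented for illustration.

```ruby
# Mimic the WHERE clause's per-row test: each nil argument makes its
# "OR ... IS NULL" branch true, so that condition drops out entirely.
# (A sketch of the logic only; the database evaluates this differently.)
def matches?(row, sub_id, emp_id, name)
  (row[:subsidiary_id]    == sub_id || sub_id.nil?) &&
  (row[:employee_id]      == emp_id || emp_id.nil?) &&
  (row[:last_name].upcase == name   || name.nil?)
end

row = { subsidiary_id: 7, employee_id: 42, last_name: "Davis" }

p matches?(row, 7, nil, nil)       # filter on subsidiary only => true
p matches?(row, 7, 42, "DAVIS")    # all three filters applied  => true
p matches?(row, 8, nil, nil)       # wrong subsidiary           => false
```

With each of the three filters independently present or ignored, the one string covers all 2³ = 8 search variations.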

The problem is that the article says this kind of query is a problem. The reasoning goes something like this:

  1. Using bind parameters forces the plan to be saved and reused for multiple queries.
  2. When a plan is saved for multiple queries, the planner doesn’t have the actual argument values.
  3. Because the planner doesn’t have the actual argument values, the “x IS NULL” conditions aren’t constant at plan time, and therefore the planner isn’t able to simplify the conditions (e.g., if one condition is always TRUE, just remove it).
  4. Therefore it makes a bad plan.

However, #1 is simply untrue, at least in PostgreSQL. PostgreSQL can save the plan, but you don’t have to. See the documentation for PQexecParams. Here’s an example in Ruby using the “pg” gem (EDIT: note that this does not use any magic query-building behind the scenes; it uses a protocol-level feature of the PostgreSQL server to bind the arguments):

require 'rubygems'
require 'pg'

conn = PGconn.connect("dbname=postgres")

conn.exec("CREATE TABLE foo(i int)")
conn.exec("INSERT INTO foo SELECT generate_series(1,10000)")
conn.exec("CREATE INDEX foo_idx ON foo (i)")
conn.exec("ANALYZE foo")

# Insert using parameters. Planner sees the real arguments, so it will
# make the same plan as if you inlined them into the SQL string. In
# this case, 3 is not NULL, so it is simplified to just "WHERE i = 3",
# and it will choose to use an index on "i" for a fast search.
res = conn.exec("explain SELECT * FROM foo WHERE i = $1 OR $1 IS NULL", [3])
res.each{ |r| puts r['QUERY PLAN'] }
puts

# Now, the argument is NULL, so the condition is always true, and
# removed completely. It will surely choose a sequential scan.
res = conn.exec("explain SELECT * FROM foo WHERE i = $1 OR $1 IS NULL", [nil])
res.each{ |r| puts r['QUERY PLAN'] }
puts

# Saves the plan. It doesn't know whether the argument is NULL or not
# yet (because the arguments aren't provided yet), so the plan might
# not be good.
conn.prepare("myplan", "SELECT * FROM foo WHERE i = $1 OR $1 IS NULL")

# We can execute this with:
res = conn.exec_prepared("myplan",[3])
puts res.to_a.length
res = conn.exec_prepared("myplan",[nil])
puts res.to_a.length

# But to see the plan, we have to use the SQL string form so that we
# can use EXPLAIN. This plan should use an index, but because we're
# using a saved plan, it doesn't know to use the index. Also notice
# that it wasn't able to simplify the conditions away like it did for
# the sequential scan without the saved plan.
res = conn.exec("explain execute myplan(3)")
res.each{ |r| puts r['QUERY PLAN'] }
puts

# ...and use the same plan again, even with different argument.
res = conn.exec("explain execute myplan(NULL)")
res.each{ |r| puts r['QUERY PLAN'] }
puts

conn.exec("DROP TABLE foo")

See? If you know what you are doing, and want to save a plan, then save it. If not, do the simple thing, and PostgreSQL will have the information it needs to make a good plan.

My next article will be a simple introduction to database system architecture that will hopefully make SQL a little less mystical.

None, nil, Nothing, undef, NA, and SQL NULL
Jeff Davis, Wed, 13 Aug 2008
http://thoughts.davisjeff.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/

In my last post, Why DBMSs are so complex, I raised the issue of type mismatches between the application language and the DBMS.

Type matching between the DBMS and the application is as important as types themselves for successful application development. If a type behaves one way in the DBMS, and a “similar” type behaves slightly differently in the application, that can only cause confusion. And it’s a source of unnecessary awkwardness: you already need to define the types that suit your business best in one place; why redefine them somewhere else, based on a different basic type system?

At least we’re using PostgreSQL, the most extensible database available, where you can define sophisticated types and make them perform like native features.

But there are still problems. Most notably, it’s a non-trivial challenge to find an appropriate way to model NULLs in the application language. You can’t not use them in the DBMS, because the SQL spec generates them from oblivion, e.g. from an outer join or an aggregate function, even when you have no NULLs in your database. So the only way to model the same semantics in your application is to somehow make your application language understand NULL semantics.

Here’s how SQL NULL behaves:

=> -- aggregate with one NULL input
=> select sum(column1) from (values(NULL::int)) t;
 sum
-----

(1 row)

=> -- aggregate with two inputs, one of them NULL
=> select sum(column1) from (values(1),(NULL)) t;
 sum
-----
   1
(1 row)

=> -- aggregate with no input
=> select sum(column1) from (values(1),(NULL)) t where false;
 sum
-----

(1 row)

=> -- + operator
=> select 1 + NULL;
 ?column?
----------

(1 row)

I’ll divide the “NULL-ish” values of various languages into two broad categories:

  1. Separate type, few operators defined, error early, no 3VL — Python, Ruby, and Haskell fall into this category, because their “NULL-ish” types (None, nil, and Nothing, respectively) usually result in an immediate exception, unless the operator to which the NULL-ish value is passed handles it as a special case. Few built-in operators are defined for arguments of these types. These fail to behave like SQL NULL, because they employ no three-valued logic (3VL) at all, and thus fail in the fourth portion of the SQL example.
  2. Member of all types, every operator defined — Perl and R fall into this category. Perl’s undef can be passed through many built-in operators (like +), but doesn’t ever use 3VL, so it fails the fourth portion of the SQL example. R uses a kind of 3VL for its NA value, but it uses it everywhere, so sum(c(1,NA)) results in NA (thus failing the second portion of the SQL example). In R, you can omit NAs from the sum explicitly (not a very good solution, by the way), but then it will fail the first portion of the SQL example.

As far as I can tell (correct me if I’m mistaken), none of these languages support the third portion of the SQL example: the sum of an empty list in SQL is NULL. The languages that I tested with a built-in sum operator (Python, R, Haskell) all return 0 when passed an empty list.
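Ruby, from the first category, shows the whole pattern in a few lines: nil supports almost no operators, so mixing it into arithmetic fails immediately rather than propagating a third logical value, and the built-in sum of an empty array is 0 rather than anything NULL-like.

```ruby
# Ruby's nil behaves like the other category-1 values: few operators
# are defined for it, so errors surface immediately instead of a
# three-valued logic propagating through the computation.
begin
  1 + nil
rescue TypeError => e
  puts "1 + nil raises #{e.class}"      # no 3VL; immediate error
end

p [1].sum        # => 1
p [].sum         # => 0, where SQL's sum over no rows is NULL

begin
  [1, nil].sum   # unlike SQL, which ignores the NULL and returns 1
rescue TypeError => e
  puts "[1, nil].sum raises #{e.class}"
end
```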

Languages from the first category appear safer, because you will catch the errors earlier rather than later. However, transforming SQL NULLs into None, nil, or Nothing in these languages is actually quite dangerous: a change in the data you store in your database (inserting NULLs, or deleting records that may be aggregated), or even a change in a query (an outer join, or an aggregate that may have no input), can produce NULLs, and therefore exceptions, which can evade even rigorous testing procedures and sneak into production.

Languages from the second category tend to pass the “undef” or “NA” along deeper into the application, which can cause unintuitive and difficult-to-trace problems. Perhaps worse, something will always happen, and usually the result will take the form of the correct answer even if it is wrong.

So where does that leave us? I think the blame here rests entirely on the SQL standard’s definition of NULL, and the inconsistency between “not a value at all” and “the third logical value” (both of which can be used to describe NULL in different contexts). Not much can be done about that, so I think the best strategy is to try to interpret and remove NULLs as early as possible. They can be removed from result sets before returning to the client by using COALESCE, and they can be removed after they reach the client with client code. Passing them along as some kind of special value is only useful if your application already must be thoroughly aware of that special value.
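On the client side, “interpret and remove as early as possible” can be as simple as deciding at the boundary what a NULL means for each column, before nil leaks any deeper into the application. A sketch in Ruby — the row data here is invented for illustration, shaped like the string-keyed hashes the ‘pg’ gem returns, with NULL already mapped to nil:

```ruby
# Rows as they might come back from the driver, with SQL NULL mapped
# to nil. (The data here is invented for illustration.)
rows = [
  { "name" => "Alice", "total" => "1500" },
  { "name" => "Bob",   "total" => nil },    # e.g. produced by an outer join
]

# Decide what a NULL means *here*, at the boundary: for this report,
# a missing total counts as zero.
totals = rows.map { |r| r["total"].nil? ? 0 : Integer(r["total"]) }
p totals        # => [1500, 0]
```

The equivalent server-side move is COALESCE(sum(total), 0) in the query itself; either way, no nil survives past the point where its meaning is decided.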

Note: Microsoft has defined some kind of “DBNull” value, and from browsing the docs, it appears a substantial amount of work went into making them behave as SQL NULLs. This includes a special set of SQL types and operators. Microsoft appears to be making a lot of progress matching DBMS and application types more closely, but I think the definition of SQL NULLs is a lost cause.

ruby-pg is now the official postgres ruby gem
Jeff Davis, Fri, 14 Dec 2007
http://thoughts.davisjeff.com/2007/12/14/ruby-pg-is-now-the-official-postgres-ruby-gem/

ruby-pg is now the official RubyForge project for the “postgres” ruby gem. See the project here:

http://www.rubyforge.org/projects/ruby-pg

or install the gem directly:

# gem install --remote postgres

The previous project had gone unmaintained for a long time, which led to the fork.

This gem includes some important fixes, most notably the ability to
compile against PostgreSQL 8.3.

The gem contains two modules:

  • ‘postgres’ — require this module as before, you can use it without
    making any changes to your application. This is essentially just a fork
    from version 0.7.1.2006.04.06, but contains some important fixes,
    including the ability to build against 8.3.
  • ‘pg’ — a new interface, designed to offer every feature available in
    libpq to Ruby, with a better API. This module is simpler, cleaner, and
    more portable. It is still unstable, so test before using.

PostgreSQL+Ruby users: please test and report any problems. I’d like to
make sure this is as stable as possible, and builds on all necessary
platforms.
