Joshua Drake’s recent article makes some interesting points, but there’s one thing in particular I find missing among many of these discussions. From the article:
It appeared they felt we should be impressed that Facebook runs on MySQL not PostgreSQL. … The problem I have, is that Facebook data is worthless.
All of the concentration is on the company, and whether their use case matters (of course it does, at least to them and their customers). But phrases like “runs on” and “uses” are used too loosely, in my opinion.
Even with celebrity endorsements — for example, a basketball player endorsing shoes — at least they use shoes in roughly the same manner as you might. The shoes might not help you play basketball in any appreciable way, but at least “use” means the same for both the basketball player and you.
However, do you think that running a query at [insert big company here] involves just using the “mysql” client, logging in, and running any ad-hoc query you want? I doubt it. I suspect that the data is always spread around in complex ways with complex caches, and there’s a lot of custom supporting code to get the right information from the right cache at the right time. For every new query, they can unleash a team of very good engineers to build the necessary caches, provision the necessary servers, distribute data to the right places, write the code to populate and read the caches appropriately, and integrate it into the general data-movement architecture.
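To make that concrete, here is a minimal sketch of about the simplest form such supporting code could take: a cache-aside read path in front of the database. Everything in it is an assumption for illustration. An in-memory SQLite database stands in for the DBMS, a plain dict stands in for the cache tier, and the names `get_user_name` and `rename_user` are hypothetical. It is not a description of how any particular company does it, only a hint at how quickly "using" the DBMS stops meaning "typing a query into the client".

```python
import sqlite3

# Hypothetical stand-ins: an in-memory SQLite database plays the DBMS
# and a plain dict plays the cache tier.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
db.commit()

cache = {}
db_reads = 0  # counts how often the database is actually queried


def get_user_name(user_id):
    """Cache-aside read: try the cache first, fall back to the database."""
    global db_reads
    if user_id in cache:
        return cache[user_id]
    db_reads += 1
    row = db.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    name = row[0] if row else None
    if name is not None:
        cache[user_id] = name  # populate the cache for later reads
    return name


def rename_user(user_id, new_name):
    """Write to the database, then invalidate the cached entry."""
    db.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
    db.commit()
    cache.pop(user_id, None)


first = get_user_name(1)    # cache miss: reads the database
second = get_user_name(1)   # cache hit: the database is not touched
rename_user(1, "alicia")
third = get_user_name(1)    # miss again, because the write invalidated it
```

Even in this toy version, the application never issues an ad-hoc query. Every read and write goes through code that knows about the cache, and a real deployment would add distribution, expiry, and failure handling on top.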
If your environment looks like that, then a lot of the little problems go away. One might complain that Slony is hard to set up; but in an environment like the one above, it’s insignificant. If there’s some missing feature, you can write it. If something is bothering you, you can fix it. People do that all the time with PostgreSQL, and many of those things get released in the community version. For MySQL, they tend to build up as “patch sets” (or forks, some might call them). I suspect that PostgreSQL gets more contributions because it does everything possible to make the process of community contribution smooth — clean code, no copyright assignment requirement, well-defined “commit fests”, community review, and a diverse group of core members, committers, and contributors. PostgreSQL also has a rock-solid foundation, giving developers more confidence to build the features they need without destabilizing the product.
If your environment doesn’t look like that, and you just want to use the product directly, then take advantage of that. Use the product that makes your life easier, helps you catch errors before they become problems, and keeps your data safe. By the time you scale up, you will be using the DBMS in such a radically different way that it almost doesn’t matter what DBMS you started with.
In an upcoming paper, Mike Stonebraker and Rick Cattell agree with you in a colorful way: “unless you squint, the dominant commercial vendors (Oracle, IBM, Microsoft) as well as the major open source systems (MySQL, PostgreSQL) all look about the same”
http://cattell.net/datastores/CACM-Paper.pdf
From the perspective of a large company with a lot of resources that’s operating on the edge of the possible, the systems all require a lot of work above/around (i.e. management, caching, loading, and data distribution all with certain kinds of queries in mind), and within (i.e. patch sets). In that case, the primary question is: “what is the best starting place?” From that perspective, I agree that they are quite similar, and differences tend to be quantitative rather than qualitative.
To organizations that are operating with data volumes and transaction/query rates that are well within the possible, and who have finite engineering resources, the DBMS does matter quite a lot. And in this case, differences tend to be more qualitative.
Having worked with all of the mentioned engines, Stonebraker and Cattell are quite full of it; likely attempting to sell whatever snake oil is their current batch.
There is a world of difference among these engines. No two of them even have semantically equivalent ACID implementations; although Oracle/PostgreSQL are the two closest. So far as knobs and switches go, the three commercial engines are by far the most flexible. But they should be; they have legions of coders.
“by far the most flexible” is not something you can sell me on either.
You mean flexible, as in requiring one data type for text data 2000 bytes (Oracle). Or flexible as in not running on anything but Windows, with known scalability problems (SQLServer). Or flexible like preventing read SQL while writes are taking place, and upgrading locks to table level if you lock too many rows (DB2).
Open source gives people what they need by delivering many small usability enhancements, and peer review of code means things work the way they should, not the way the marketing department thinks is probably OK.
flexible meaning: “you can take the defaults, but you can change storage and memory and blah to suit your needs”. DB2 is particularly flexible. As to ANSI isolation levels, there’s lots o folks who don’t find MVCC to be the Holy Grail. If nothing else, MVCC databases routinely eat servers for breakfast. Oh, and nothing prevents read on write in DB2, just choose Read Uncommitted. As to MVCC, it amounts to Read On Last Commit, and as is often discussed on PostgreSQL and Oracle groups, that’s why MVCC databases chew servers like peanuts. There Ain’t No Such Thing As A Free Lunch. It’s also why collisions and deadlocks become the user’s responsibility later rather than the engine’s responsibility early. “oops, you can’t really update that row; it’s already been updated by Sally.”
One can choose page size by table. One can choose buffer size by tablespace. One can assign tables to arbitrary tablespaces. One can have covering indexes. And the list goes on.
Again, the number of folks that Oracle/IBM/M$ dedicate to their engines is likely an order of magnitude greater than what PostgreSQL has.
How an Open Source product moves is, often, a Squeakiest Wheels rather than Grand Vision thing. Python excepted. In the case of PG, the storage model and memory model can be a lot more like DB2 (fur instance) if Enterprise is the way the community wants to go. Or not. But if PG community does want to be an Enterprise DB, then it needs to think seriously along such lines.
The full quote is:
“It appeared they felt we should be impressed that Facebook runs on MySQL not PostgreSQL. That somehow Facebook validates (or Google) the argument for MySQL.
The problem I have, is that Facebook data is worthless. It isn’t worthless to them obviously. They make money selling your data (same as Google in a horizontal way). ”
Other than that, very good points.
Hopefully I didn’t misrepresent your statements. I was trying to pick out a certain aspect that I thought was under-analyzed — in particular, loose phrases such as “runs on” and “uses”.
Also, this applies to every discussion of the form “Big Company runs on XYZ; therefore XYZ must be good.” I didn’t mean that yours was the only article that glossed over words like “runs on”. I think it’s quite common, which is why I posted this.
That statement requires context. It was made in response to JD’s claim that once you get big, you can’t run MySQL.
It doesn’t imply that I think MySQL is better or worse than PG. I think that PG is awesome and rocks (but I have yet to acquire my “PG Rocks” t-shirt).
That’s a little confusing, because you said “that statement requires context” without identifying the statement. I believe you’re referring to a statement you made at PG WEST, but I’m not 100% sure.
This post wasn’t so much about the statement itself, but how it is being analyzed. That’s why I tried to step away from specific company names, because I just don’t know enough of the details.
The main point is that “using” a DBMS means very different things to different organizations. That’s even more true when it’s an open source DBMS, because the large organization probably modifies it significantly; but it applies to the closed systems as well. While I don’t know much about Facebook specifically, I’m fairly sure that it takes a huge amount of custom code and engineering to go from stock MySQL to the architecture that supports Facebook’s data management.
Yes, I was ambiguous. “Company A uses product X at web-scale” was said in response to the claim in the rebuttal by JD that you can’t use MySQL once you get big.
I agree that these references get too much credit in the press. They should mean that you can be successful using product X but are misinterpreted to mean that you will be successful using it.
PG isn’t immune from having patches and forks. PGWest was hosted by the vendor of one (EnterpriseDB) and the rebuttal came from JD whose company does Mammoth Replicator, a PG replication fork/patch.
Skype was cited as a company that runs PG at web-scale. I haven’t found many details on that deployment (another company X uses product Y reference). But they also have patches (SkyTools) to make it work for them.
“they should mean that you can be successful using product X”
I would add to that “…along with a huge amount of other engineering effort around and within X to get X to do what you really need”.
“PG isn’t immune from having patches and forks.”
Absolutely true. I work for such a company. The difference, I think, is that postgres, as a community, seems to have a more authoritative mainline offering and less confusion over patches, forks, and even options. Equally important, the postgresql mainline accepts contributions without copyright assignment.
That being said, I get the impression that MySQL is standardizing on InnoDB for the vast majority of uses now that it’s the default in mainline. I think that’s a very positive step (observing mostly from outside, of course).
There are many things that make PG appealing to external developers like me: far more is discussed in the open (excluding the interesting conversations at Greenplum, AsterData and the like), the code compiles without warnings, all developers are external, and the community process is more mature.
Alas, that is not an option for me today.
It makes perfect sense for InnoDB to be the default storage engine now that Oracle owns MySQL. Not only for the technical reasons, but for the business reasons as well. Oracle has owned InnoDB for the last five years. Now they own the entire product.