In The Database or In The App?

I recently came across an interesting post on High Scalability. It talks about the architecture of one of my favorite tech (and other) news sites, Digg.

Digg is a monster. It has ~1.2 million users and does ~200 million page views per month. Ouch. It isn’t serving up flat data either - there are a lot of database read and write transactions going on. Web sites referred to on Digg are often blasted into oblivion when users read a summary and click a link over to the mentioned site. It happens so often the “Digg effect” has been added to the web developer’s vernacular.

The overall information is brief and interesting (LAMP stack, yo), but what struck me most were two little nuggets.

None of the scaling challenges we faced had anything to do with PHP. The biggest issues faced were database related.

There are some legitimate complaints to be made about PHP, but it can take a serious pounding. Also note they are talking about MySQL on the database end, which in my experience for many types of transactions is about as fast as you're going to get (particularly reads to MyISAM).

But this is the big one:

The lightweight nature of PHP allowed them to move processing tasks from the database to PHP in order to improve scaling. Ebay does this in a radical way. They moved nearly all work out of the database and into applications, including joins, an operation we normally think of as the job of the database.

Thank you, thank you, thank you.

One of my many hats has DBA emblazoned across the top of it. Lately those letters are in lower case as I’ve moved on to other things, but I still carry the hat with me. I’ve worn that hat for a lot of years now, managing Microsoft SQL Server, MySQL, and PostgreSQL databases. Many DBA’s will think this heresy, but: if it can be avoided, don’t put it in the database.

By ‘it’ I am referring to anything other than the data itself. In the last 15+ years DBA’s have gone way, way overboard in trying to put business logic/application functionality into the database. Your database is being asked to handle things an application or a middle tier used to take care of. And it is screwing you in a couple of very important ways:

Your flexibility to move your database to another platform begins to approach nil. There really isn't a standard for stored procedures et. al.. Once your logic is in the database, your are going to have a horrible time trying to move it to another database. And someday you are going to want to do that. Bank on it.
This stuff can kill your database server. When you had a million dollar mainframe and your app was running on an i386, putting the logic on the database made some sense. These days not only might your apps be on faster servers than your database, but scaling your app is generally a lot cheaper and easier and your lightweight app will probably process the business logic a lot faster.

There are always exceptions to every rule. I can think of a half-dozen reasons why some logic pieces really have to go in the database. And there are also many, many factors that go into database optimization and tuning, and if you haven't tuned in some other areas this one may be an insignificant blip. But over time the rule for DBA's has become put nearly everything in the database. Logic in the database is generally the case where I work, and it probably is where you work too.

I’ve been involved in two projects with our IT group where the web site created ended up getting a good bit of traffic. In each instance the database (Microsoft SQL Server) croaked. The answer always seemed to be to throw more money at the database. In one case GIS was able to keep hitting the “dead” database (we were using straight SQL and applying logic on the application end) while another group’s stored procedures were maxing out the CPU and timing out. I actually listened to a DBA in one postmortem meeting state SQL Server just can’t handle more than ~60-70 concurrent connections.* The solution to that one was to periodically spit out straight HTML from the database rather than hit the live data. Which will certainly work, but my brain cried a little when I heard that.

I’ve got a MS SQL server that handles a lot of data throughput (much of it hefty vector GIS data) and a ton of web traffic on a fairly old box supporting a few dozen web sites, and it has never crashed due to traffic. Right now is a plain-old day and it has 343 user connections. In the past minute it did 2,869 SQL transactions (~48 a second, damn you SDE). These numbers are down considerably, as some of our apps are hitting a new SQL box we’re transitioning to, and others still have been upgraded to PostgreSQL/PostGIS. Still, by all accounts it’s a busy little SQL Server box. I would like to credit our SDE administrator’s brilliance for its performance, but I give more credit to what the database server isn’t doing.**

Maybe your server isn’t going to get a whole lot of traffic and this stuff just won’t matter, or maybe other performance factors are much more important in your environment. But as a general rule of thumb, in my opinion, put in some serious thought before you drop anything in your database that doesn’t look like a table.

To his credit, he said it with a straight face.
*I would, however, rate her brilliance a close second.