Ok, ok, admittedly it's not really a "shootout". We all already know SQL-based data sources are generally less performant and less scalable than their NoSQL counterparts, but they are also generally more robust, with stronger ACID guarantees and automated mechanisms to ensure data integrity and disaster recoverability. These characteristics are deliberate design trade-offs. What I wanted to do, though, was quantify how much more performant and scalable NoSQL can be, and get my own numbers on where to draw the cutoff. So I went about doing some empirical testing.
I chose PostgreSQL and MongoDB to represent each side of the camp. Both open source database engines are pretty mature, pretty popular, and well-supported by the community. Rather than set up my own installs, I went with the DaaS (database-as-a-service) pros who do it day-in and day-out for customers around the globe. On the SQL side I selected Heroku's Postgres add-on with the Crane plan. On the NoSQL side I selected MongoHQ's 25GB SSD plan. Granted, Heroku's database is probably not SSD-backed, but I needed the storage space to support the kind of row count I planned to run my tests on.

For the test itself I created a simple table with five columns: an auto-increment ID, a string, two decimal values, and a timestamp, plus the corresponding document structure in MongoDB. I placed individual indexes on all columns. I then proceeded to fill both up with randomly generated rows in a loop running up to 100 million iterations. All queries were conducted over a remote shell connection to the database engines. Here is what I observed:
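To make the setup concrete, here is a sketch of the table shape and fill loop. It uses an in-memory SQLite database purely as a stand-in (the real runs were against Heroku Postgres and MongoHQ over remote connections), and the table name `samples` and column names are my own placeholders, not from the actual test:

```python
import random
import sqlite3
import string
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- auto-increment ID
        label TEXT,                               -- random string
        val_a REAL,                               -- first decimal value
        val_b REAL,                               -- second decimal value
        ts    TIMESTAMP                           -- insertion timestamp
    )
""")
# Individual (single-column) indexes on every non-key column.
for col in ("label", "val_a", "val_b", "ts"):
    conn.execute(f"CREATE INDEX idx_{col} ON samples ({col})")

def random_row():
    """One randomly generated row, matching the five-column schema."""
    return (
        "".join(random.choices(string.ascii_lowercase, k=12)),
        random.uniform(0, 1000),
        random.uniform(0, 1000),
        time.time(),
    )

# The real loop ran up to 100 million iterations; 1000 keeps this sketch quick.
for _ in range(1000):
    conn.execute(
        "INSERT INTO samples (label, val_a, val_b, ts) VALUES (?, ?, ?, ?)",
        random_row(),
    )
conn.commit()
```

The MongoDB side mirrored this: one document per row with the same five fields, and `ensureIndex` on each field individually.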
Not long after 50MM rows, Postgres would no longer return results for even the simplest queries that required scanning the table (e.g. SELECT * FROM table WHERE id > 50000000 LIMIT 1). The engine was still operational, since it would respond to O(1) statements like "SELECT 1", so most likely my connection was timing out before my queries could complete. I speculate that by that point the table had grown so large on disk (30+GB), and was spread across so many non-contiguous sectors, that disk access and/or disk contention with other tenants was timing me out. Extrapolating from the DB size, I estimate I was at about 60MM rows when this occurred.
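The diagnostic pattern here — a long scan dies while a constant-time liveness check still succeeds — can be sketched in miniature. Below, SQLite's progress handler stands in for the network timeout I was presumably hitting against Heroku; the table and budget numbers are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO t (payload) VALUES (?)",
    (("x" * 100,) for _ in range(200_000)),
)
conn.commit()

# Abort any statement that runs past a fixed budget of VM steps,
# roughly mimicking a client/connection timeout.
steps = {"n": 0}
def budget():
    steps["n"] += 1
    return 1 if steps["n"] > 50 else 0  # nonzero aborts the statement
conn.set_progress_handler(budget, 1000)

scan_timed_out = False
try:
    # Forces a pass over the whole table, like the stuck queries above.
    conn.execute("SELECT COUNT(*) FROM t WHERE payload LIKE '%y%'").fetchone()
except sqlite3.OperationalError:
    scan_timed_out = True

steps["n"] = 0
alive = conn.execute("SELECT 1").fetchone() == (1,)  # O(1) liveness check
```

The scan blows through the budget and is interrupted, while "SELECT 1" comes back fine — the same asymmetry that told me the Postgres engine itself was still up.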
I also noticed MongoHQ had throttled my inserts to a more consistent 400-600/s. While inserting at this rate into a collection already 60MM documents deep in MongoHQ:
So there you have it: with a commodity yet production-grade DaaS setup, 50MM rows is a good rule-of-thumb absolute maximum to keep in mind for Postgres, and you'll probably hit performance headaches well before that. The NoSQL performance gains are real, to the tune of an entire order of magnitude on common queries.

Remember, however, that there are many variables to consider when architecting your own solution, and you may see quite different numbers yourself. A dedicated server (DaaS or otherwise) removes any chance of resource contention with other tenants and will likely give you better averages. The nature of my test also didn't really exercise the caching mechanisms of either engine, so your production DB may yield more performance as well. On the other hand, I had a very simple usage pattern with a brain-dead schema: just appending rows at the end of the table/collection. Your schema and I/O will likely be much more complicated, leading to different kinds of DB lock contention and increased fragmentation on disk. I also had only 2-3 connections open to the DB; if you are building a site with a hefty number of concurrent users, you'll experience further degraded performance.
Hopefully you have found your own takeaways in these statistics and have some new data points to guide you in your future database architecture decisions. Happy Querying!