VERSUS

Ok, ok, admittedly it's not really a "shootout". We all already know SQL-based data sources are generally less performant and less scalable than their NoSQL counterparts, but they are also generally more robust and more ACID-compliant, with automated mechanisms to ensure data integrity and disaster recoverability. These characteristics are deliberate by design. What I wanted to do, though, was quantify how much more performant and scalable NoSQL can be and get my own numbers on where to draw the cutoff. Thus, I went about doing some empirical testing.
I decided to choose PostgreSQL and MongoDB to represent each side of the camp. Both open-source database engines are pretty mature, pretty popular, and well-supported by the community. Rather than set up my own installs, I went with the DaaS (database-as-a-service) pros who do it day-in and day-out for customers around the globe. On the SQL side I selected Heroku's Postgres addon with the Crane plan. On the NoSQL side I selected MongoHQ's 25GB SSD plan. Granted, Heroku's database is probably not SSD-backed, but I needed the storage space to support the kind of row count I planned to run my tests at.
For the test itself I created a simple table with five columns: an auto-increment ID, a string, two decimal values, and a timestamp, along with the corresponding document structure in MongoDB. I placed individual indexes on all columns. I then filled both up with randomly generated rows in a loop running up to 100 million iterations. All queries were conducted over a remote shell connection to the database engines.
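For reference, here is a minimal sketch of that setup. The language (Python with psycopg2 and pymongo) and all names (samples, label, amount_a, amount_b, created_at, the connection environment variables) are illustrative assumptions, not my actual scripts, which ran from the remote shells.

    import os, random, string
    from datetime import datetime, timezone

    import psycopg2
    from pymongo import MongoClient, ASCENDING

    # Hypothetical connection strings as provided by Heroku Postgres / MongoHQ.
    pg = psycopg2.connect(os.environ["DATABASE_URL"])
    coll = MongoClient(os.environ["MONGOHQ_URL"])["shootout"]["samples"]

    # Postgres: five columns, an individual index on each (the PK covers id).
    with pg, pg.cursor() as cur:
        cur.execute("""
            CREATE TABLE samples (
                id         BIGSERIAL PRIMARY KEY,
                label      TEXT,
                amount_a   NUMERIC(12,4),
                amount_b   NUMERIC(12,4),
                created_at TIMESTAMPTZ
            );
            CREATE INDEX samples_label_idx      ON samples (label);
            CREATE INDEX samples_amount_a_idx   ON samples (amount_a);
            CREATE INDEX samples_amount_b_idx   ON samples (amount_b);
            CREATE INDEX samples_created_at_idx ON samples (created_at);
        """)

    # MongoDB: the equivalent document shape; _id is indexed automatically,
    # so add secondary indexes on the remaining fields.
    for field in ("label", "amount_a", "amount_b", "created_at"):
        coll.create_index([(field, ASCENDING)])

    # Fill both engines with randomly generated rows, one per iteration.
    for _ in range(100_000_000):
        label = "".join(random.choices(string.ascii_lowercase, k=16))
        a = round(random.uniform(0, 10000), 4)
        b = round(random.uniform(0, 10000), 4)
        ts = datetime.now(timezone.utc)
        with pg, pg.cursor() as cur:
            cur.execute("INSERT INTO samples (label, amount_a, amount_b, created_at) "
                        "VALUES (%s, %s, %s, %s)", (label, a, b, ts))
        coll.insert_one({"label": label, "amount_a": a, "amount_b": b, "created_at": ts})

Here is what I observed: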
- Postgres grows on disk a lot quicker! At 10MM rows, Postgres took up over 6.3GB on disk, while MongoDB only consumed 1.2GB of storage. Storage growth was more or less linear on both engines as row count went up.
- With nothing else going on in the system, I/O topped out at about 3,000 inserts/s into the indexed collection on MongoHQ, though the inserts seemed to come in regular pulses ranging from 0-3,000/s. It was harder to pin down the same statistic on the Postgres side, as its insert rate seemed even more erratic. Inserting many millions of rows continuously would eventually time out the shell connection in this setup, even while the loop continued.
- At 50MM rows and growing (the insert loops were still running), MongoDB returned a simple query on an indexed column in 70-80ms, while Postgres returned in 780-1700ms. 1.7s sounds decent, but once you consider joins and the multiple queries often required to render a single web page, I realized this was no longer acceptable for most OLTP applications.
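To be concrete, the "simple query on an indexed column" I was timing looks roughly like the following, continuing the illustrative names from the sketch above:

    # Equality lookup on an indexed column, first page of results only.
    with pg, pg.cursor() as cur:
        cur.execute("SELECT * FROM samples WHERE label = %s LIMIT 10",
                    ("abcdefghijklmnop",))
        rows = cur.fetchall()

    docs = list(coll.find({"label": "abcdefghijklmnop"}).limit(10))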
Not long after 50MM rows, Postgres would no longer return even the simplest of queries requiring a full table scan (e.g. SELECT * FROM table WHERE id > 50000000 LIMIT 1). The engine was still operational, since it would respond to O(1) statements like "select 1", so most likely my connection was timing out before my queries could complete. I speculate that by that point the table had grown so large on disk (30+GB) and was spread across so many non-contiguous sectors that disk access and/or disk contention with other tenants was timing me out. Extrapolating from the DB size, I estimate I was at about 60MM rows when this occurred.
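The liveness check and the size estimate are easy to reproduce. Something along these lines is enough, again against the illustrative samples table/collection (pg_total_relation_size counts the heap plus its indexes and TOAST data; MongoDB's dbstats command is the rough equivalent on the other side):

    with pg, pg.cursor() as cur:
        cur.execute("SELECT 1")   # the O(1) statement that still answered
        cur.execute("SELECT pg_size_pretty(pg_total_relation_size('samples'))")
        print(cur.fetchone()[0])  # on-disk size of the table and its indexes

    stats = coll.database.command("dbstats")
    print(stats["dataSize"], stats["storageSize"], stats["indexSize"])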
I also noticed MongoHQ had throttled my inserts to a more consistent 400-600/s. While inserting rows at this rate into a collection already 60MM rows deep on MongoHQ, I measured the following (the corresponding commands are sketched after the list):
- Simple query on indexed column with MongoDB: 91ms.
- Simple range query count on _id: 4,945 ms. Count returned 10MM+.
- Simple range query count on date attribute: 4,155 ms. Count returned 10MM+.
- Same simple range query on date attribute returning first page of data: 15 ms.
- Deleted ~240,000 rows from the middle of the collection: 505,558 ms. Inserts crawled to 10-20/s during the delete and shot back up afterward. Simple queries were still near instant.
- Drop an index: 89ms.
- Ensure an index: 443,003 ms. This action somehow stopped the insert loop. I could not run a simple collection count or connect via another shell. The insert loop resumed after the index command finished, so ensuring an index appears to block the entire server. (I learned later that I had not set the background flag to true; with it set, the build would not have been so jarring to the server.)
- Aggregate function (max) grouped by the string column across the entire data set: this crashed the session and terminated the insert loop running in another session.
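For reference, here is roughly what those measured operations look like, again via pymongo and the illustrative field names from the first sketch (my actual session used the mongo shell, and the cutoff values below are hypothetical):

    from datetime import datetime, timezone
    from pymongo import ASCENDING

    # Simple query on an indexed column.
    doc = coll.find_one({"label": "abcdefghijklmnop"})

    # Range count on _id (ObjectIds sort roughly by insertion time).
    oldest_id = coll.find_one(sort=[("_id", ASCENDING)])["_id"]
    n_by_id = coll.count_documents({"_id": {"$gt": oldest_id}})

    # Range count on the date attribute, then the first page of matching data.
    since = datetime(2013, 1, 1, tzinfo=timezone.utc)   # hypothetical cutoff
    n_by_date = coll.count_documents({"created_at": {"$gte": since}})
    first_page = list(coll.find({"created_at": {"$gte": since}}).limit(50))

    # Delete a slice out of the middle of the collection.
    coll.delete_many({"amount_a": {"$gte": 4000, "$lt": 4010}})

    # Drop an index, then rebuild it; background=True is the flag that keeps
    # the build from blocking the whole server the way my foreground build did.
    coll.drop_index("created_at_1")
    coll.create_index([("created_at", ASCENDING)], background=True)

    # Aggregate max of a numeric column over rows sharing the same string value.
    pipeline = [{"$group": {"_id": "$label", "max_a": {"$max": "$amount_a"}}}]
    results = list(coll.aggregate(pipeline))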
So there you have it: with a commodity yet production-level DaaS setup, 50MM rows as an absolute max in Postgres is a good rule of thumb to keep in mind, and you'll probably have performance headaches well before that. NoSQL performance gains are real, to the tune of an entire order of magnitude on common queries. Remember, however, that there are many variables to consider when architecting your own solution, and you may achieve quite different numbers. Getting a dedicated server (DaaS or otherwise) means you won't have any resource contention with other tenants and will likely see better averages. My test also didn't really exercise the caching mechanisms of either engine, so your production DB may yield better performance as well. On the other hand, I had a very simple usage pattern with a brain-dead schema: just appending rows to the end of the table/collection. Your schema and I/O will likely be much more complicated, leading to different kinds of DB lock contention and increased fragmentation on disk. I also had only 2-3 connections open to the DB; if you are building a site with a hefty number of concurrent users, you'll see further degraded performance.
Hopefully you have found your own takeaways in these statistics and have some new data points to guide you in your future database architecture decisions. Happy Querying!