
VERSUS

Ok, ok, admittedly it's not really a "shootout". We all already know SQL-based data stores are generally less performant and less scalable than their NoSQL counterparts, but they are also generally more robust and ACID-compliant, with automated mechanisms to ensure data integrity and disaster recoverability. These characteristics are deliberate design choices. What I wanted to do, though, was quantify how much more performant and scalable NoSQL can be and get my own numbers on where to draw the cutoff. So I went about doing some empirical testing.

I decided to choose PostgreSQL and MongoDB to represent each side of the camp. Both open-source database engines are mature, popular, and well supported by their communities. Rather than set up my own installs, I went with DaaS (database-as-a-service) providers who do this day in and day out for customers around the globe. On the SQL side I selected Heroku's Postgres add-on with the Crane plan; on the NoSQL side, MongoHQ's 25GB SSD plan. Granted, Heroku's database is probably not SSD-backed, but I needed the storage space to support the row counts I planned to test at. For the test itself I created a simple table with five columns: an auto-increment ID, a string, two decimal values, and a timestamp, along with the corresponding document structure in MongoDB. I placed individual indexes on all columns, then filled both up with randomly generated rows in a loop running up to 100 million iterations. All queries were issued over a remote shell connection to the database engines. Below is a rough sketch of the setup, followed by what I observed:
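The sketch is Python (psycopg2 for Postgres, pymongo for MongoDB) rather than the actual shell sessions I used; the table, column, and database names are placeholders of my own, the connection strings are omitted, and in the real test the insert loops ran separately per engine.

```python
import random
import string
from datetime import datetime, timezone

import psycopg2
from pymongo import MongoClient, ASCENDING

# Postgres side: five columns, individual index on each (names are placeholders).
pg = psycopg2.connect("<heroku-postgres-connection-string>")
cur = pg.cursor()
cur.execute("""
    CREATE TABLE samples (
        id     BIGSERIAL PRIMARY KEY,
        label  TEXT,
        val_a  NUMERIC,
        val_b  NUMERIC,
        ts     TIMESTAMPTZ
    )
""")
for col in ("label", "val_a", "val_b", "ts"):
    cur.execute(f"CREATE INDEX ON samples ({col})")
pg.commit()

# MongoDB side: equivalent document structure with the same indexes.
mongo = MongoClient("<mongohq-connection-string>")
coll = mongo["testdb"]["samples"]
for field in ("label", "val_a", "val_b", "ts"):
    coll.create_index([(field, ASCENDING)])

# Fill both with randomly generated rows, up to 100 million iterations.
def random_row():
    return (
        "".join(random.choices(string.ascii_lowercase, k=16)),
        random.uniform(0, 1_000_000),
        random.uniform(0, 1_000_000),
        datetime.now(timezone.utc),
    )

for i in range(100_000_000):
    label, a, b, ts = random_row()
    cur.execute(
        "INSERT INTO samples (label, val_a, val_b, ts) VALUES (%s, %s, %s, %s)",
        (label, a, b, ts),
    )
    coll.insert_one({"label": label, "val_a": a, "val_b": b, "ts": ts})
    if i % 1000 == 0:
        pg.commit()
```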

  • Postgres grows on disk a lot quicker! At 10MM rows, Postgres took up over 6.3GB on disk, while MongoDB consumed only 1.2GB. Storage growth was more or less linear in both engines as the row count went up.
  • With nothing else going on in the system, I/O topped out at about 3,000 inserts/s into the indexed collection on MongoHQ. The inserts seemed to come in regular pulses, though, ranging from 0 to 3,000/s. It was harder to pin down the same statistic on the Postgres side, as its insert rate seemed even more erratic. Inserting many millions of rows continuously would time out the shell connection in this setup, even while the loop itself kept running.
  • At 50MM rows and growing (the insert loops were still running), MongoDB returned a simple query on an indexed column in 70-80ms while Postgres returned in 780-1700ms. 1.7s might sound decent, but once I started considering joins and the multiple queries often required to render a single web page, it was clear this was no longer acceptable for most OLTP applications (a sketch of this kind of timing comparison follows the list).
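For reference, the timing comparison behind that last bullet looks roughly like this, continuing with the cur and coll handles from the setup sketch above (the filter value is made up; the real rows held random strings):

```python
import time

# Same indexed-column lookup against both engines; the label value is illustrative.
start = time.perf_counter()
cur.execute("SELECT * FROM samples WHERE label = %s LIMIT 10", ("qwertyuiopasdfgh",))
cur.fetchall()
print("postgres: %.0f ms" % ((time.perf_counter() - start) * 1000))

start = time.perf_counter()
list(coll.find({"label": "qwertyuiopasdfgh"}).limit(10))
print("mongodb:  %.0f ms" % ((time.perf_counter() - start) * 1000))
```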

Not long after 50MM rows, Postgres would no longer return results for even simple queries against the table data (e.g. SELECT * FROM table WHERE id > 50000000 LIMIT 1). The engine was still operational, since it would respond to O(1) statements like "SELECT 1", so most likely my connection was timing out before my queries could complete. I speculate that by that point the table had grown so large on disk (30+GB), and was spread across so many non-contiguous sectors, that disk access and/or disk contention with other tenants was timing me out. Extrapolating from database size, I estimate I was at about 60MM rows when this occurred.
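If you hit the same wall, one way to tell a struggling engine apart from a client-side timeout is to probe with a trivial statement and put an explicit server-side cap on the heavy query. A sketch, again reusing the connection from the setup above (statement_timeout is a standard Postgres setting; the 30s value is arbitrary):

```python
import psycopg2

# Probe: if this returns, the engine itself is alive and the big query is what's stalling.
cur.execute("SELECT 1")
print(cur.fetchone())  # (1,)

# Make Postgres abort the statement server-side instead of waiting for the
# shell/client connection to drop.
cur.execute("SET statement_timeout = '30s'")
try:
    cur.execute("SELECT * FROM samples WHERE id > 50000000 LIMIT 1")
    print(cur.fetchone())
except psycopg2.errors.QueryCanceled:
    pg.rollback()  # clear the aborted transaction
    print("query exceeded statement_timeout")
```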

I also noticed MongoHQ had throttled my inserts to a more consistent 400-600/s. While inserting rows at this rate into a collection already 60MM rows deep on MongoHQ:

  • Simple query on indexed column with MongoDB: 91ms.
  • Simple range query count on _id: 4,945 ms. Count returned 10MM+.
  • Simple range query count on date attribute: 4,155 ms. Count returned 10MM+.
  • Same simple range query on date attribute returning first page of data: 15 ms.
  • Deleted ~240,000 rows from the middle of the collection: 505,558 ms. Inserts crawled to 10-20/s during the delete and shot back up afterward. Simple queries were still near-instant.
  • Drop an index: 89ms.
  • Ensure an index: 443,003 ms. This action stopped the insert loop; I could not run a simple collection count or connect via another shell. The insert loop resumed after the index command finished, so ensuring an index appears to block the entire server. (I learned later that I had not set the background flag to true; doing so would not have been so jarring to the server.)
  • Aggregate function (max) grouped by the string column across the entire data set: this crashed the session and terminated the insert loop running in another session. (A pymongo sketch of these operations follows this list.)
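In pymongo terms, those operations look roughly like the sketch below; field names follow the setup sketch above, and the filter values and index name are illustrative.

```python
from datetime import datetime, timezone

from bson import ObjectId
from pymongo import ASCENDING

cutoff = datetime(2013, 10, 1, tzinfo=timezone.utc)  # illustrative cutoff

# Range counts on _id and on the indexed timestamp field (ObjectIds encode a
# creation time, so a time-based cutoff works for _id too).
coll.count_documents({"_id": {"$gte": ObjectId.from_datetime(cutoff)}})
coll.count_documents({"ts": {"$gte": cutoff}})

# Same date range, but only the first page of documents -- near-instant.
list(coll.find({"ts": {"$gte": cutoff}}).sort("ts", ASCENDING).limit(50))

# Delete a slice out of the middle of the collection (slow, and it starved the
# insert loop while it ran).
coll.delete_many({"val_a": {"$gte": 100_000, "$lt": 101_000}})

# Dropping an index is cheap; rebuilding one is not. Building in the background
# avoids blocking the whole server the way my foreground build did.
coll.drop_index("label_1")  # default name for the ascending index on label
coll.create_index([("label", ASCENDING)], background=True)

# Max of a value grouped by the string column across the whole collection.
coll.aggregate(
    [{"$group": {"_id": "$label", "max_val": {"$max": "$val_a"}}}],
    allowDiskUse=True,  # without this, large aggregations can run out of memory
)
```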

So there you have it: with a commodity yet production-level DaaS setup, 50MM rows is a good rule-of-thumb absolute maximum to keep in mind for Postgres, and you'll probably have performance headaches well before that. NoSQL's performance gains are real, to the tune of an entire order of magnitude on common queries. Remember, however, that there are many variables to consider when architecting your own solution, and you may see quite different numbers on your own. Getting a dedicated server (DaaS or otherwise) means no chance of resource contention with other tenants and likely better averages. The nature of my test also didn't really exercise the caching mechanisms of either engine, so your production DB may yield better performance as well. On the other hand, I had a very simple usage pattern with a brain-dead schema: just appending rows to the end of the table/collection. Your schema and I/O will likely be much more complicated, leading to different kinds of lock contention and increased fragmentation on disk. I also had only 2-3 connections open to the DB; if you are building a site with a hefty number of concurrent users, you'll see further degraded performance.

Hopefully you have found your own takeaways in these statistics and have some new data points to guide you in your future database architecture decisions. Happy Querying!


Comment by Curtis J. Kropar on November 13, 2013 at 3:36am

Not sure if it's commonly available, but check into GT.M:

http://en.wikipedia.org/wiki/GT.M

NoSQL before there even was SQL... created in 1966:

http://en.wikipedia.org/wiki/MUMPS

It's sooo fast that many of the world's banks and hospitals are dropping their SQL databases and converting over to MUMPS / GT.M.

The database engine, made open source in 2000, is maintained by Fidelity Information Services. GT.M is used as the backend of their FIS Profile banking application, and it powers ING DIRECT banks in the United States, Canada, Spain, France, Italy, Holland, and the UK.

Plus other "little" companies like Scottrade & Ameritrade use it for all of their online stock transactions.

Comment by Justin Mayer on November 11, 2013 at 1:35pm

While speed comparison can be a fun exercise, I think the money quote here is this: "There are many variables to consider when architecting your own solution."

Any conclusions you might draw from this speed experiment should be far, far down on that list of variables. What's at the top? The type of data you need to store, and what you want to be able to do with it.

If you are dealing with arbitrary blobs that require no relations whatsoever, and speed is more important to you than data integrity, then by all means use a document data store. But realize that there are few use cases for this. Very few.

Many people talk about dropping MongoDB into their web applications as a replacement for relational databases such as PostgreSQL. This is almost always an inherently bad idea. I could go into detail as to why, but Sarah Mei has already covered this and thus saved me the trouble:

http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mo...

Even if you decide that the structure of your data has no value, you'd most likely be better off using PostgreSQL's hstore. For many operations, it's faster than MongoDB:

http://obartunov.livejournal.com/175235.html
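A minimal hstore round-trip, sketched here with psycopg2 (table, key, and value names are made up), looks like this:

```python
import psycopg2
from psycopg2.extras import register_hstore

conn = psycopg2.connect("<your-connection-string>")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")
cur.execute("CREATE TABLE docs (id BIGSERIAL PRIMARY KEY, data HSTORE)")
cur.execute("CREATE INDEX ON docs USING GIN (data)")
conn.commit()

register_hstore(conn)  # adapt Python dicts to hstore and back

cur.execute("INSERT INTO docs (data) VALUES (%s)", ({"status": "active", "plan": "paid"},))
cur.execute("SELECT data FROM docs WHERE data @> %s", ({"status": "active"},))
print(cur.fetchall())  # [({'status': 'active', 'plan': 'paid'},)]
conn.commit()
```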

Last but not least, remind yourself of the YAGNI principle whenever you can. Chances are, you're never going to run up against a PostgreSQL performance issue that can't be easily solved by caching and/or faster hardware. And if you ever do, congratulations — at that point you'll be successful enough that you'll have no trouble hiring excellent people to help you scale ever upward.

Comment by Joseph Lui on November 7, 2013 at 6:14pm
Comment by Daniel Leuck on November 5, 2013 at 4:15pm

Nice post! I look forward to your MongoDB vs. Riak and Key / Value vs. Big Table-style NoSQL shootout :-)

Comment by Joseph Lui on November 3, 2013 at 7:30pm

I think using both is certainly an option to have in your tool belt. There will probably be a price to pay in maintainability, but sometimes the gains far outweigh the costs. You might even think of it the other way around: Postgres handling the bulk of your data, with Mongo reserved for the parts that absolutely need to be fast.

Comment by Cameron Souza on November 3, 2013 at 3:14am

Interesting. Thank you for posting! What do you think of using both, with Postgres handling the few parts that actually need ACID guarantees and Mongo handling everything else?

At 50MM rows and growing (the insert loops were still running), MongoDB returned a simple query on an indexed column in 70-80ms while Postgres returned in 780-1700ms

That is an eye-opener. I knew it was faster, but I didn't know it was 10X+ faster (at least for this sort of thing). ACID is expensive!
