Benchmarking SSD with MongoDB and CouchDB, Part 1

Doing benchmarks is always complicated, because most of the time it is totally unclear what one wants to measure and what actually gets measured. There is an article describing this dilemma, see Benchmarks-You-are-Doing-it-Wrong. CouchDB: The Definitive Guide contains a longer version of this, see High Performance. I assume that this is also one of the reasons why MongoDB does not officially publish benchmarks, see Benchmarks. Kristina Chodorow, Software Engineer at 10gen, published a nice, unofficial benchmark at snailinaturtleneck.

As SSDs are getting cheaper and cheaper, I wanted to investigate what impact SSDs have on the performance of MongoDB and CouchDB. Let me stress that this is not a real-life example and not an exhaustive test. It is not a complete test suite measuring the performance of a document store. I did the test out of interest in the implications of SSDs for the NoSQL movement. There are also more complete benchmarks of SSDs, but as zorinaq points out, most of them are also flawed.

The very first question is “what do we actually measure?”. In order to answer that we need some baselines. CouchDB uses a REST/HTTP interface, so I am going to measure the throughput of this interface; just requesting the version information should give a good indication. The next step would be to reproduce the claims in High Performance. Basically I’m interested in the “reliable” setup, where each and every change is written back to disk. This should show the biggest difference in performance between hard disks and SSDs.

Again you might ask “what is the point?”. It is clear that only very few applications require that level of reliability, but I do not want to measure how well CouchDB or MongoDB handle main memory. That would be a completely different test setup. Presumably, you would want to set up some sort of two-server replication and feed it realistic requests with a certain create/read/update/delete ratio, because that is what most web applications look like. Or you want append-only tests with few reads and no replication, because you are using the database as a store for log files. Or you want an update-centric setup, because you are storing ticks for shares. Or, or, or … There are quite a lot of very different scenarios, and presumably MongoDB and CouchDB will behave differently in all of them.

The article High Performance suggests that there is a difference between Linux and Mac OS X in the way syncs are handled. Therefore the following hardware was used in the tests:

  • iMac 2010 running Mac OS X 11 (Core i3)
  • Dell Optiplex 980 running openSUSE 11.4 (Core i7)
  • OCZ Vertex 3 Max IOPS

I first wanted to establish a few cornerstones.

  1. What is the disk / SSD throughput?
  2. What is the throughput of msync?
  3. What is the throughput of fsync?
  4. What is the protocol overhead?

Throughput of the Hard Disk

Using dd to copy a 1 GByte file of random data gave the following results.

Mac / Hard Disk

> dd if=random of=test
2048000+0 records in
2048000+0 records out
1048576000 bytes transferred in 8.753879 secs (119784157 bytes/sec)

Linux / Hard Disk

> dd if=random of=test
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 10.0028 s, 105 MB/s

The write performance of the hard disk under Linux and Mac OS X seems to be comparable.

Memory Mapped Files

Because of the hints given by CouchDB concerning Mac OS and Linux, I tried to understand the difference between Linux and Mac OS X with respect to memory-mapped files. I wrote a program that creates (append-only) memory-mapped files, which can be found on github. The program repeatedly appends blocks of 674 bytes and calls msync after each append.
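
The core of the program looks roughly like the following sketch (the actual msync-bench takes the path, block size, and count as arguments and also times the loop; error handling is omitted here):

/* Minimal sketch of the append-only mmap / msync loop. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t block = 674;               /* bytes per "document" */
    const size_t count = 10000;             /* number of appends */
    const size_t total = block * count;
    const size_t page  = (size_t) sysconf(_SC_PAGESIZE);

    int fd = open("/tmp/test1", O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, total);                   /* pre-size the file */

    char *map = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char doc[674];
    memset(doc, 'x', sizeof doc);

    for (size_t i = 0; i < count; i++) {
        size_t off   = i * block;
        size_t start = off & ~(page - 1);   /* page-align the sync range */
        memcpy(map + off, doc, block);
        /* force the freshly written bytes to disk after every append;
           for the baseline run this call is moved after the loop */
        msync(map + start, off + block - start, MS_SYNC);
    }

    munmap(map, total);
    close(fd);
    return 0;
}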

Mac

The throughput under Mac OS is

> msync-bench "/tmp/test1" 674 10000
insert time: 2.277690 sec for 10000 documents (4390.413094 docs / sec, 0.000228 secs / doc)
2959138.425334 bytes / sec, 2.822054 mbyte / sec

Each memory page of size 4096 bytes is written roughly six times (4096 / 674 ≈ 6), because msync is called after every 674-byte append.

To get a baseline for the overhead introduced by my program, I switched off the per-append msync – only one msync at the end.

> msync-bench "/tmp/test1" 674 10000
insert time: 0.081228 sec for 10000 documents (123110.257547 docs / sec, 0.000008 secs / doc)
82976313.586448 bytes / sec, 79.132379 mbyte / sec

So msync is indeed doing something.

Linux

Under Linux the behavior is very different

> msync-bench /tmp/test1 674 1000
insert time: 26.022067 sec for 1000 documents (38.428923 docs / sec, 0.026022 secs / doc)
25901.093868 bytes / sec, 0.024701 mbyte / sec

In order to investigate further, I used a larger block size of 4096 bytes.

> msync-bench /tmp/test1 4096 1000
insert time: 25.632864 sec for 1000 documents (39.012418 docs / sec, 0.025633 secs / doc)
159794.863344 bytes / sec, 0.152392 mbyte / sec

The throughput seems to be independent of the block size; the number of msync calls is what counts.

The baseline without msync (i.e. only one msync at the end) is

> msync-bench /tmp/test1 674 1000
insert time: 0.044011 sec for 1000 documents (22721.592329 docs / sec, 0.000044 secs / doc)
15314353.229874 bytes / sec, 14.604905 mbyte / sec

Using FDATASYNC

Instead of msync I also tried syncing a regular file. Basically the same setup as before: open a file, append a block of 674 bytes, and call fdatasync on the file descriptor.
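
Again, the core loop presumably looks like the following sketch (the actual fsync-bench takes the path, block size, and count as arguments; error handling is mostly omitted):

/* Minimal sketch of the append / fdatasync loop. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t block = 674;               /* bytes per "document" */
    const size_t count = 10000;             /* number of appends */

    int fd = open("/tmp/test2", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char doc[674];
    memset(doc, 'x', sizeof doc);

    for (size_t i = 0; i < count; i++) {
        if (write(fd, doc, block) < 0)
            break;
        /* flush the file data after every append; for the baseline
           run this call is moved after the loop */
        fdatasync(fd);
    }

    close(fd);
    return 0;
}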

Mac

Under Mac OS I get

> fsync-bench /tmp/test2 674 10000
insert time: 34.173727 sec for 10000 documents (292.622458 docs / sec, 0.003417 secs / doc)
197227.536815 bytes / sec, 0.188091 mbyte / sec

Using fdatasync is roughly 15 times slower than msync. Again, to get a baseline without fdatasync:

> fsync-bench /tmp/test2 674 10000
insert time: 0.412287 sec for 10000 documents (24254.948616 docs / sec, 0.000041 secs / doc)
16347835.367111 bytes / sec, 15.590511 mbyte / sec

Linux

The same under Linux

> fsync-bench /tmp/test2 674 1000
insert time: 11.934775 sec for 1000 documents (83.788760 docs / sec, 0.011935 secs / doc)
56473.624346 bytes / sec, 0.053857 mbyte / sec

and the baseline

> fsync-bench /tmp/test2 674 1000
insert time: 0.046587 sec for 1000 documents (21465.215618 docs / sec, 0.000047 secs / doc)
14467555.326593 bytes / sec, 13.797336 mbyte / sec

The MSYNC Mac Mystery

                    MAC      LINUX
msync             4 390         38
msync base      123 110     22 721
fdatasync           292         83
fdatasync base   24 255     21 465

(all figures in documents per second)

So, either memory mapped files are much better under Mac OS X than under Linux, or Mac OS X is somehow lying about msync, or I am missing some magic madvise flags, or one needs a different filesystem. I’m using EXT4 under Linux. Strangely, the fdatasync base case (writing the file and then one fdatasync at the end) is roughly equally fast on both machines.
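
If someone wants to experiment, the obvious knobs are the madvise hints and the msync flags, roughly along these lines (tune_mapping is a hypothetical helper; I have not verified that any of this changes the Linux numbers):

/* Hypothetical knobs to try on the Linux mapping. */
#include <stddef.h>
#include <sys/mman.h>

void tune_mapping(void *map, size_t len) {
    /* tell the kernel the mapping is written sequentially */
    madvise(map, len, MADV_SEQUENTIAL);

    /* only schedule the write-back instead of blocking on it */
    msync(map, len, MS_ASYNC);
}

Of course MS_ASYNC gives up the “every change is on disk” guarantee this test is about, so it would only tell us whether the flag handling itself is the culprit.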

Any suggestions about what is going on are welcome. Why is msync under Linux more than 100 times slower? With only one final msync, the Linux version is still more than five times slower.

Next Steps

I’ve ordered an OCZ Vertex 3 MAX IOPS. It will be fun to see if and how this changes the figures. I still have to figure out how to connect the SSD to my Mac. Maybe I’ll try Firewire first before resorting to more drastic measures. After that we are ready to test MongoDB & CouchDB on this SSD.

To be continued … here
