Is that performance I smell? Ext2 vs Ext3 on 50 spindles, testing for PostgreSQL
Posted Wednesday Apr 23rd, 2008 10:31am
by Joshua Drake

There are few things I like better than when a customer tells the team, "I want the best machine I can buy for XXX dollars." It inspires a certain joy, not unlike what the average Slashdot reader feels walking into the local gadget store. It is particularly special because, as much use as you could make of such a machine yourself, you know you could never justify the expense.
In this case, the customer was willing to spend a modest but not excessive amount of money. I applaud this decision, because I run into far too many people who believe the only way to get real performance is to buy some ridiculous SAN at ten times the cost for the same performance.
Machine specs:

  • HP DL585
  • Four dual-core 8222 processors
  • 64GB of RAM

Storage:

  • (2) MSA70 direct attached storage arrays
  • 25 spindles in each array
  • Single HP P800 controller

Filesystem layout:

    /dev/cciss/c1d1p1    1693108576    201228 1606902380   1% /data2
    /dev/cciss/c1d0p1    1693104732    201292 1606898664   1% /data1
    /dev/cciss/c0d1p1     282181440    195616  267651768   1% /xlogs

Where each /data[n] is an MSA70 and /xlogs is a RAID 10 on the embedded controller.

Filesystem options:

    /dev/cciss/c1d0p1    /data1    ext3    data=writeback     1 2
    /dev/cciss/c1d1p1    /data2    ext3    data=writeback     1 2
    /dev/cciss/c0d1p1    /xlogs    ext2    defaults           1 2
    
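For reference, here is a minimal sketch of how these options would be applied by hand. The device names and mount points come from the layout above; the commands themselves are ordinary Linux administration, not something from the original post:

    # mount the data arrays as ext3 with metadata-only journaling
    # (data=writeback); note the caution about battery-backed
    # controller cache later in the post
    mount -t ext3 -o data=writeback /dev/cciss/c1d0p1 /data1
    mount -t ext3 -o data=writeback /dev/cciss/c1d1p1 /data2

    # mount the WAL volume as plain ext2 (no journal at all)
    mount -t ext2 /dev/cciss/c0d1p1 /xlogs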
Xlog performance

The PostgreSQL WAL is written sequentially, which negates the need for a large number of spindles to get reasonable performance; it is random writes that kill performance. Further, when the WAL is used for recovery, PostgreSQL replays up to the last known good transaction and discards everything after it. This ensures a consistent database regardless of how the crash happened, and it is also partly why we can forgo a journaling filesystem for the xlog files. Just for kicks, I ran the xlog tests on both ext3 and ext2. The benchmarking software used is IOzone, and the command was:

    /opt/iozone/bin/iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u
    
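For readers who don't speak fluent IOzone, the flags break down as follows (paraphrased from the IOzone documentation; this gloss is mine, not from the original post):

    -e        include flush (fsync/fflush) in the timing measurements
    -i0       run test 0: write/rewrite
    -i1       run test 1: read/re-read
    -i2       run test 2: random read/random write
    -i8       run test 8: mixed random workload
    -t1       throughput mode with a single process
    -s 1000m  use a 1000MB test file per process
    -r 8k     use an 8KB record size, matching PostgreSQL's block size
    -+u       report CPU utilization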
Here are the results.

xlogs ext3 with defaults (ordered journaling mode)
           Children see throughput for  1 rewriters 	=   87418.44 KB/sec
    	Parent sees throughput for  1 rewriters 	=   87395.65 KB/sec
    	Min throughput per process 			=   87418.44 KB/sec 
    	Max throughput per process 			=   87418.44 KB/sec
    	Avg throughput per process 			=   87418.44 KB/sec
    
    xlogs ext3 with data=writeback
           Children see throughput for  1 rewriters 	=   84712.55 KB/sec
    	Parent sees throughput for  1 rewriters 	=   83513.39 KB/sec
    	Min throughput per process 			=   84712.55 KB/sec 
    	Max throughput per process 			=   84712.55 KB/sec
	Avg throughput per process 			=   84712.55 KB/sec
    
    xlogs ext2 with defaults
           Children see throughput for  1 rewriters 	=  115378.34 KB/sec
    	Parent sees throughput for  1 rewriters 	=  115345.26 KB/sec
    	Min throughput per process 			=  115378.34 KB/sec 
    	Max throughput per process 			=  115378.34 KB/sec
    	Avg throughput per process 			=  115378.34 KB/sec
    
A pretty clear indicator that one should always consider putting /xlogs on its own filesystem and channel: plain ext2 comes in roughly 30% ahead of ext3 here (115MB/s vs. 87MB/s). The next series of tests I ran were with ext3 on the /data[n] partitions. Remember, each of the partitions is on its own direct attached storage array. For reference, ext3's data=journal mode journals file data as well as metadata, the default ordered mode flushes data before committing metadata, and data=writeback journals metadata only.

/data1 with data=journal
           Children see throughput for 1 random writers 	=   49444.73 KB/sec
    	Parent sees throughput for 1 random writers 	=   48709.89 KB/sec
    	Min throughput per process 			=   49444.73 KB/sec 
    	Max throughput per process 			=   49444.73 KB/sec
    	Avg throughput per process 			=   49444.73 KB/sec
    
    /data1 with data defaults (ordered mode)
           Children see throughput for 1 random writers 	=  142926.14 KB/sec
    	Parent sees throughput for 1 random writers 	=  142872.21 KB/sec
    	Min throughput per process 			=  142926.14 KB/sec 
    	Max throughput per process 			=  142926.14 KB/sec
    	Avg throughput per process 			=  142926.14 KB/sec
    
    /data1 with data=writeback
           Children see throughput for 1 random writers 	=  168948.55 KB/sec
    	Parent sees throughput for 1 random writers 	=  168867.03 KB/sec
    	Min throughput per process 			=  168948.55 KB/sec 
    	Max throughput per process 			=  168948.55 KB/sec
    	Avg throughput per process 			=  168948.55 KB/sec
    
The ext3 journaling mode of writeback is the obvious winner here. A note of caution, however: it is likely not safe to use data=writeback unless you have a battery-backed RAID controller cache. The overall bandwidth is respectable at ~170MB/s. How much of that is the cost of journaling?

/data1 with ext2
           Children see throughput for 1 random writers 	=  178404.45 KB/sec
    	Parent sees throughput for 1 random writers 	=  178320.32 KB/sec
    	Min throughput per process 			=  178404.45 KB/sec 
    	Max throughput per process 			=  178404.45 KB/sec
    	Avg throughput per process 			=  178404.45 KB/sec
    
Although ext2 is faster, I don't think the margin is large enough to justify the downside of running a non-journaled filesystem (long fsck times after a crash). What happens when we access both /data1 and /data2 at the same time?

/data1 and /data2 using separate processes
           Children see throughput for 1 random writers 	=   93932.16 KB/sec
    	Parent sees throughput for 1 random writers 	=   93909.48 KB/sec
    	Min throughput per process 			=   93932.16 KB/sec 
    	Max throughput per process 			=   93932.16 KB/sec
    	Avg throughput per process 			=   93932.16 KB/sec
           
           Children see throughput for 1 random writers 	=  105375.49 KB/sec
    	Parent sees throughput for 1 random writers 	=  105292.74 KB/sec
    	Min throughput per process 			=  105375.49 KB/sec 
    	Max throughput per process 			=  105375.49 KB/sec
    	Avg throughput per process 			=  105375.49 KB/sec
    
I am not actually buying these numbers. As I monitored the multi-thread results and how they interacted with each processor, whether I was running two separate processes or a single process with multiple threads, processor utilization was never correctly aggregated; I think this is a failure of the benchmark software. In theory I should see almost identical results for a single array as for the dual arrays. In an effort to get more accurate results across not only the arrays but all of the available processors, I wrote a quick script that fires the benchmark software as four independent processes, each with a single writer, and executed it on /data1 and /data2 simultaneously. This gave much better utilization of all processors and a more accurate representation of the performance as a whole; a rough sketch of the script follows.
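The original script was not published; a minimal reconstruction under the assumptions stated above (four concurrent IOzone processes, one writer each, same flags as before; the script name and run directories are my own invention) might look like this:

    #!/bin/bash
    # hypothetical reconstruction of the quick script described above:
    # fires the benchmark as four independent processes, each a single
    # writer, against the partition given as $1 (e.g. /data1)
    TARGET="$1"
    for i in 1 2 3 4; do
        mkdir -p "$TARGET/run$i"
        # each subshell gets its own working directory so the four
        # IOzone temp files do not collide
        ( cd "$TARGET/run$i" && \
          /opt/iozone/bin/iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u ) &
    done
    wait   # block until all four writers finish

Running it against both arrays at once is then just a matter of launching it twice, e.g. ./iobench.sh /data1 & ./iobench.sh /data2 &.

ext3 data=writeback, /data1 and /data2, four threads each

/data1: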
    	Children see throughput for 1 random writers 	=   50916.17 KB/sec
    	Parent sees throughput for 1 random writers 	=   50909.04 KB/sec
    	Min throughput per process 			=   50916.17 KB/sec 
    	Max throughput per process 			=   50916.17 KB/sec
    	Avg throughput per process 			=   50916.17 KB/sec
    
    	Children see throughput for 1 random writers 	=   51021.88 KB/sec
    	Parent sees throughput for 1 random writers 	=   51013.58 KB/sec
    	Min throughput per process 			=   51021.88 KB/sec 
    	Max throughput per process 			=   51021.88 KB/sec
    	Avg throughput per process 			=   51021.88 KB/sec
    
            Children see throughput for 1 random writers 	=   51048.78 KB/sec
    	Parent sees throughput for 1 random writers 	=   51040.33 KB/sec
    	Min throughput per process 			=   51048.78 KB/sec 
    	Max throughput per process 			=   51048.78 KB/sec
    	Avg throughput per process 			=   51048.78 KB/sec
    
    	Children see throughput for 1 random writers 	=   50755.62 KB/sec
    	Parent sees throughput for 1 random writers 	=   50746.71 KB/sec
    	Min throughput per process 			=   50755.62 KB/sec 
    	Max throughput per process 			=   50755.62 KB/sec
    	Avg throughput per process 			=   50755.62 KB/sec
    
    /data2:
    
    	Children see throughput for 1 random writers 	=   49711.77 KB/sec
    	Parent sees throughput for 1 random writers 	=   49704.75 KB/sec
    	Min throughput per process 			=   49711.77 KB/sec 
    	Max throughput per process 			=   49711.77 KB/sec
    	Avg throughput per process 			=   49711.77 KB/sec
    
    	Children see throughput for 1 random writers 	=   49708.98 KB/sec
    	Parent sees throughput for 1 random writers 	=   49695.55 KB/sec
    	Min throughput per process 			=   49708.98 KB/sec 
    	Max throughput per process 			=   49708.98 KB/sec
    	Avg throughput per process 			=   49708.98 KB/sec
    
    	Children see throughput for 1 random writers 	=   49713.46 KB/sec
    	Parent sees throughput for 1 random writers 	=   49691.86 KB/sec
    	Min throughput per process 			=   49713.46 KB/sec 
    	Max throughput per process 			=   49713.46 KB/sec
    	Avg throughput per process 			=   49713.46 KB/sec
    
    	Children see throughput for 1 random writers 	=   49707.78 KB/sec
    	Parent sees throughput for 1 random writers 	=   49699.04 KB/sec
    	Min throughput per process 			=   49707.78 KB/sec 
    	Max throughput per process 			=   49707.78 KB/sec
    	Avg throughput per process 			=   49707.78 KB/sec
    
That is more like it: roughly 51MB/s per writer, or ~200MB/s aggregate per array (4 x ~51MB/s on /data1, 4 x ~50MB/s on /data2). Seeing this improvement, I decided to run 8 threads per partition. I am only going to post one output, but all of the threads had similar performance. Per-thread throughput went down, but the aggregate for each partition was higher at ~280MB/s (8 x ~35MB/s); between the two arrays, that is ~560MB/s.
    	Children see throughput for 1 random writers 	=   35403.08 KB/sec
    	Parent sees throughput for 1 random writers 	=   35398.90 KB/sec
    	Min throughput per process 			=   35403.08 KB/sec 
    	Max throughput per process 			=   35403.08 KB/sec
    	Avg throughput per process 			=   35403.08 KB/sec
    
As a final note before I leave you to your regularly scheduled broadcasting: even with 16 processes running, I/O wait was moderate, in the range of 10%-20%. Clearly there is more headroom in these arrays; I simply ran out of time to explore it.
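The post doesn't say how the I/O wait was measured; a typical way to watch it during a run like this (my suggestion, not necessarily the author's tooling) is iostat from the sysstat package:

    # extended per-device statistics every 5 seconds; the %iowait
    # figure appears in the avg-cpu summary of each report
    iostat -x 5

vmstat's "wa" column shows the same number at a glance.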

Categories: PostgreSQL, OpenSource

