Every now and then I think up some crazy master plan; last night was one of those times. Sometimes they work out, sometimes they don't so much.
I was reading the Linux kernel's software RAID1 code because I was totally bored, and something caught my eye: the ability to prefer certain disks in the array for writes (mdadm --write-mostly). I decided I was going to find a use for it that was a bit outside the box.
After a little tweaking and configuring I came up with this Evil Plan:
mdadm --create --verbose /dev/md5 --level=1 --raid-devices=3 /dev/ram0 --write-mostly /dev/sda3 /dev/sdd3
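Note that --write-mostly only applies to the devices listed after it, so here the two hard disk partitions get the flag and the RAM disk doesn't. You can check it took effect; in /proc/mdstat, write-mostly members are marked with a (W):

```shell
# Inspect the array -- write-mostly members show up with a (W) marker
# next to their name, e.g. sda3[1](W).
cat /proc/mdstat

# More detail on the array and the state of each member:
mdadm --detail /dev/md5
```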
This config gives you an array as in the diagram below:
The smart people reading this will have figured out by now that what you have is a RAID1 array that tends to read from RAM and write to disk.
What they may not have figured out is that what we actually have here is a RAM disk that is mostly* safe on disk in the event of crashes and reboots. The RAM is effectively just a copy of what's on the disk, but an automatic copy: there's no manual syncing, it's all handled by the RAID code in the kernel. If something is written, it's marked as dirty, and reads will go to the source disk (the one the data was actually written to) until the changes have been synced across.
At this point you're probably wondering what the downsides are. Firstly, on a reboot you lose a disk from the array. I haven't actually rebooted the server yet, so I'm not completely sure how it's going to respond: whether it'll say "oh hey, here's a blank disk" and add it back to the array itself, or whether I'll need a boot script to re-add it. Either way, the data will be synced across automatically once that's done, and on a server you really don't want to be rebooting too much anyway.
*The other issue is that it may not be totally crash-proof all of the time. I can imagine a scenario where writes are blocked on the two hard disks, so the write lands in RAM first and gets copied back to the hard disks as soon as it can be. If you crash in that window, it could be a little risky. One answer is that if you tend to write a lot and it keeps falling back to RAM, add more disks to the array so the code doesn't block writes to the array.
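If you do go that route, growing the mirror is straightforward. A sketch, using a hypothetical /dev/sde3 as the new member (you'd probably also want to flag it write-mostly so reads keep hitting the RAM disk):

```shell
# Add the new disk (it joins as a spare), then grow the mirror
# from 3 members to 4 so it becomes an active copy.
mdadm --add /dev/md5 /dev/sde3
mdadm --grow /dev/md5 --raid-devices=4
```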
So what are the performance numbers?
Well, I've not done any write benchmarking, but I'd expect it to be roughly standard hard disk write speed until you start doing concurrent writes. The array will probably actually get faster in this example when you do three or more concurrent writes, which is very strange for a RAID array; usually you start losing performance with more concurrency. But with this array, at some stage the code will decide the disks are IO-blocked and write to the RAM drive. The speeds we're talking about start to get extreme, but judging by the read benchmarks below, it'll be 1000MB/sec plus.
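For anyone who wants to put a number on writes themselves, a quick-and-dirty sequential test (the mount point is hypothetical; oflag=direct bypasses the page cache so you measure the array rather than the cache):

```shell
# Write 1GB of zeros to a filesystem on /dev/md5
# (assumes it's mounted at /mnt/md5 -- adjust to taste).
dd if=/dev/zero of=/mnt/md5/ddtest bs=1M count=1024 oflag=direct conv=fsync
rm /mnt/md5/ddtest
```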
Reading from this array will prefer to hit the RAM disk, and this is where things get interesting. I'm starting to get the feeling that the speed may be bound by the filesystem code and other parts of the kernel, but here's a bog-standard hdparm benchmark:
hdparm -t /dev/md5
Timing buffered disk reads: 3986 MB in 3.00 seconds = 1328.17 MB/sec
Wow. If you think that’s useful, you should see the seeker results:
Benchmarking /dev/md5 [4102MB], wait 30 seconds…………………………
Results: 1149116 seeks/second, 0.00 ms random access time
My 4 (SATA HDD) disk RAID1 array for comparison:
Benchmarking /dev/md1 [200000MB], wait 30 seconds………………………..
Results: 99 seeks/second, 10.09 ms random access time
It’s not hard to see where the performance can come in useful. If you’re getting bogged down by a lot of random reads, want reasonably safe storage and have RAM to spare – this is going to take some beating. SSDs? Pah!
Why use this rather than relying on the filesystem cache? Mainly because it's targeted: the data you want is always in RAM, as opposed to the 1%-hit-rate-if-you're-lucky that the filesystem cache will give you. There are other reasons I could list too…
Like I said, sometimes my crazy ideas actually work. As far as I can tell nobody has done this before, or at least I can't find any evidence of it on Google. I'd love to hear if somebody has seen it done before though; it'd be nice to compare notes.
So, on reboot I now know for sure: it does totally kick the RAM disk out of the array. But this is an easy fix, just a case of having an init script that runs a command like:
mdadm --add /dev/md5 /dev/ram0
It’ll then rebuild the array automatically.
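As a sketch, that init script could be as simple as this (how you hook it into the boot sequence depends on your distro's init system):

```shell
#!/bin/sh
# Re-add the (now blank) RAM disk to the mirror at boot; the md code
# then resyncs it from the hard disks automatically.
# Assumes /dev/ram0 exists and is at least as big as the other members --
# the brd module's rd_size parameter controls the RAM disk size.
mdadm --add /dev/md5 /dev/ram0
```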