Saturday, 8 May 2010

Flash filesystems on block devices

Most flash that consumers can buy today comes packaged with block device translation. Essentially it emulates a hard disk. Sometimes this emulation is relatively good. Sometimes it is appallingly bad. Pretty much all flash cards you can buy for your camera belong to the latter category. I doubt that even a single exception exists - if it did, the manufacturer would advertise the fact broadly and independent journalists would pick it up and verify such lofty claims.

And this actually matters for filesystems. Something between the bare hardware and the data needs to deal with wear leveling, error correction and scrubbing. If the translation layer does not, or does a bad job of it, the filesystem has to step in. Or failing that, data will be lost.

Traditional wear leveling on cheap devices works on small areas of 1024 blocks, usually 128MB. Write 1000x anywhere within the area and the writes will be spread across 1000 blocks[1]. Write a million times and each block within the area will be written about 1000x. And no writes will ever go to any blocks outside this area. Oops!
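To make the failure mode concrete, here is a toy model of such zone-local wear leveling in C. It is an illustration, not any vendor's actual algorithm; the rotation scheme and the block size are made up.

#include <stdint.h>

#define BLOCKS_PER_ZONE 1024    /* 128KiB erase blocks -> 128MiB zones */

/* Toy zone-local remapper: a logical block may land on any physical
 * block of its own zone, but never outside it.  A million writes to
 * one zone wear out its 1024 blocks while the rest of the device
 * stays untouched. */
static uint32_t remap(uint32_t logical, uint32_t zone_write_count)
{
        uint32_t zone = logical / BLOCKS_PER_ZONE;

        return zone * BLOCKS_PER_ZONE
                + (logical + zone_write_count) % BLOCKS_PER_ZONE;
}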

Error correction is another problem. In most cases, one-bit errors can be corrected and two-bit errors can be detected. If there is a one-bit error, the hardware will read the data with an error, correct the error in RAM, and return the corrected data to the operating system. It will not write the corrected data back to flash. The next time you read from this location, there will still be an error that is corrected on the fly. Unless a second bit gets corrupted in the meantime. In that case you have an uncorrectable error and have effectively lost some data. Oops!
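In pseudo-C, the typical read path looks roughly like the sketch below. raw_read() and ecc_correct() are hypothetical stand-ins for the driver internals; ecc_correct() is assumed to return the number of repaired bits, or -1 if the error was uncorrectable.

#include <errno.h>
#include <stdint.h>

extern void raw_read(uint64_t ofs, void *buf);  /* hypothetical */
extern int ecc_correct(void *buf);              /* hypothetical */

int read_page(uint64_t ofs, void *buf)
{
        raw_read(ofs, buf);
        switch (ecc_correct(buf)) {
        case -1:
                return -EIO;    /* two bad bits - the data is gone */
        default:
                /* zero or one bit fixed, but only in RAM: the bad bit
                 * stays on flash and nobody above this layer hears
                 * about it */
                return 0;
        }
}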

It would be nice if the translation layer rewrote the corrected data. It would still be ok if it at least informed the operating system about the error and the operating system rewrote the data. As it stands, you can only suspect the worst and occasionally rewrite everything on the device, just in case.

And things get really messy when you look at read disturb and write disturb. Flash has the nasty tendency that if you write to block 4711, some of those electrons may spread to neighboring blocks and now you have errors in blocks 4710 and 4712. Even reading can produce errors somewhere nearby. People have experienced this with ext3, where data close to the journal was getting corrupted on a large number of devices. Oops!

The solution is called scrubbing. You simply read everything from beginning to end, once every so often. If you notice any errors, rewrite the data as long as the errors are correctable. Or if the device does not tell you about errors (see above), you rewrite everything, just to be sure.
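A minimal scrub loop could look like the sketch below. Unlike the read_page() above, it assumes a device that does report corrections - here read_page() returns the number of corrected bits, or a negative error code:

#include <stdint.h>

enum { PAGE_SIZE = 4096 };

extern int read_page(uint64_t ofs, void *buf);          /* #corrected bits or <0 */
extern int write_page(uint64_t ofs, const void *buf);   /* hypothetical */

void scrub(uint64_t device_size)
{
        static unsigned char buf[PAGE_SIZE];
        uint64_t ofs;

        for (ofs = 0; ofs < device_size; ofs += PAGE_SIZE) {
                if (read_page(ofs, buf) > 0)
                        write_page(ofs, buf);   /* refresh while still correctable */
                /* if the device hides corrections, drop the condition
                 * and rewrite everything, just to be sure */
        }
}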

So what can you take home from all this? A number of things:
  • Most flash devices suck.
  • If you are not sure, assume your device sucks.
  • Use a flash filesystem to work around the translation layer's shortcomings.
  • Translation layers hide important information (like corrected errors) from you.
  • Using a flash filesystem on raw flash would be better than using a flash filesystem on top of a translation layer.
[1] Or to just 25 blocks. Spreading the wear over 1024 blocks is called static wear leveling and often advertised as a huge improvement.

Sunday, 25 April 2010

Log2

It appears as if there will be a log2 filesystem fairly soon. The reason is compression, or rather some of the problems it created. In short, log2 will roughly double the random read performance for uncompressed data and speed up erases by a factor of 256 or thereabouts.

Erases in normal filesystems are a fairly simple thing. Pseudocode would look roughly like this:
for each block in file {
        free bytes
}


Once you add compression, things get a little more complicated. You no longer know the blocksize:
for each block in file {
        figure out how many bytes the compressed block fills
        free those bytes
}


And with the current logfs format, the block size is part of a header prepended to each block. We only need two bytes of that header. But reading from the device always happens at a granularity of 4096 bytes, so effectively we have to read 4096 bytes per deleted block. And the result doesn't feel like a greased weasel on caffeine.

So the solution will be to add those two bytes, and a couple of other fields from the header, to the block pointer in the indirect block. The whole block pointer will be 16 bytes, so the erase path reads 16 bytes per block instead of 4096 - thus explaining the 256x improvement (4096 / 16 = 256).
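Something like the following layout would fit in 16 bytes. The field names and sizes are guesses for illustration, not the final log2 format; the point is that the compressed size lives in the indirect block, where 256 such pointers fit into a single 4096-byte page.

#include <stdint.h>

/* Hypothetical 16-byte block pointer - not the final log2 layout */
struct block_pointer {
        uint64_t ofs;           /* device offset of the data block */
        uint16_t compr_len;     /* compressed size - the two bytes we need */
        uint8_t  compr_type;    /* compression algorithm */
        uint8_t  flags;
        uint32_t crc;           /* checksum over the data block */
} __attribute__((packed));      /* 8 + 2 + 1 + 1 + 4 = 16 bytes */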

The random read problem is - again - caused by the 4096 byte granularity. With compression, a data block will often span two 4096-byte pages. Uncompressed data will always do so, and will in rare cases span three if you include the header. So reading a random block from a file usually requires reading two pages from the device. Bummer.
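A little helper makes the arithmetic easy to check:

#include <stdint.h>

/* Number of 4096-byte pages a read of len bytes at offset ofs touches.
 * An unaligned 4096-byte block usually touches two pages, an aligned
 * one exactly one. */
static unsigned pages_touched(uint64_t ofs, unsigned len)
{
        return (ofs + len + 4095) / 4096 - ofs / 4096;
}

/* pages_touched(4000, 4096) == 2, pages_touched(4096, 4096) == 1 */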

The solution is simple: align the data. One problem with aligned data is where to find the header. But since we just solved that problem two paragraphs up, things should be relatively simple. We shall see.

So why create a new filesystem and not just change the existing logfs? Well, mainly to prevent bugs that new and intrusive code always brings from interfering with people's existing filesystems. The ext family has set an interesting precedent in this respect.

Saturday, 3 April 2010

Flash errors

We all know that NAND flash will occasionally return errors and requires some form of ECC to deal with this problem. But additional details are hard to come by. Manufacturers are tight-lipped, ECC is often hidden from users and I don't know of any independent studies. So I have done my own. To generate data, I have rewritten the flash with a test pattern of 0x55 (01010101 in binary) 100 times and read the data back. Whenever data was corrected by the driver, I noted the exact bit that was corrected. Raw results are here.
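The core of one such run could look roughly like the sketch below. raw_read_page() is a hypothetical helper that bypasses ECC so flipped bits stay visible; the actual test noted corrections from inside the driver instead.

#include <stdint.h>
#include <string.h>

enum { PAGE = 4096 };

extern void write_page(uint64_t ofs, const void *buf);  /* hypothetical */
extern void raw_read_page(uint64_t ofs, void *buf);     /* hypothetical */
extern void note_bad_bit(uint64_t bitno);               /* hypothetical */

void test_run(uint64_t dev_size)
{
        static unsigned char buf[PAGE];
        uint64_t ofs, i;

        memset(buf, 0x55, PAGE);        /* 01010101 test pattern */
        for (ofs = 0; ofs < dev_size; ofs += PAGE)
                write_page(ofs, buf);

        for (ofs = 0; ofs < dev_size; ofs += PAGE) {
                raw_read_page(ofs, buf);
                for (i = 0; i < PAGE * 8; i++)
                        if (((buf[i / 8] >> (i % 8)) & 1) != ((0x55 >> (i % 8)) & 1))
                                note_bad_bit(ofs * 8 + i);
        }
}

Loop this 100 times and tally the noted bits per offset, and you get data of the kind analyzed below.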

Several things stand out:
  • Almost always bits 0, 2, 4 and 6 are corrected. Only two exceptions exist among 21690 corrections.
  • A number of errors are related. There are four runs of 32 errors with similar offsets and incidence rate. A fifth run of 25 errors with low incidence rate exists.
  • Only 848 bits have ever been corrected out of a total of 128GiB.
While the first two details are likely particular to the flash chips used in this test, the last one gives us a useful insight. On average, each bit has a chance of 1:633M of being bad on any particular run. If bit errors were randomly distributed, we would expect each bad bit to be affected only once in the 100 runs. But on average, each of the 848 bad bits was affected 25 times. For a bit that has gone bad once, the error rate thus increased from 1:633M to 25% per run.

In other words, it makes sense to keep statistics of which bits required correction and how often. Anything that was corrected more than once is likely to be a manufacturing problem, not the infamous stray alpha particle. Avoiding affected areas is relatively cheap (as a percentage of storage lost) and can improve your data's life expectancy.
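As a toy sketch of such bookkeeping - the table size and the retire-after-two-corrections policy are made-up parameters, not logfs code:

#include <stdint.h>

#define NR_BLOCKS 65536         /* assumed number of erase blocks */

static uint16_t corrections[NR_BLOCKS];

/* Call whenever the driver reports a corrected bit in `block`.
 * Returns 1 if the block should be retired. */
int note_correction(uint32_t block)
{
        if (++corrections[block] > 1)
                return 1;       /* corrected twice: likely a manufacturing defect */
        return 0;               /* once could be the infamous alpha particle */
}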

Wednesday, 31 March 2010

I finally decided to write the 33 lines of code and automate my testing somewhat. Now I get a nightly email showing that LogFS is currently bug-free. Well, obviously it cannot be bug-free - what software ever is? - but at least my GC-heavy testcase runs perfectly for 256 iterations.

If anyone wants to set things up as well, have a look here and here. You have to install libunwind first, which probably requires the usual ./configure, make & make install procedure since your distribution doesn't ship it (mine didn't). It takes about 500MB of disk, most of which goes to a copy of the Linux kernel.

Sunday, 21 March 2010

I've never been much of a writer, but some people keep prodding me. So let's try and start a blog and see if it survives for more than a month.

The blog is called Solid State Storage for several reasons:
1. I have wanted that stuff in my private machine for nearly a decade now.
2. I have written my own filesystem to support such devices.
3. I strongly disagree with the industry-standard approach to SSDs.
4. http://lwn.net/Articles/378407/ asked me to pick a better name.

So assuming this blog survives, it should contain articles about non-mechanical storage devices (currently that practically always spells flash) and their handling. And more often than not, it will explain how these devices should behave, as opposed to buying the latest marketing panacea and explaining why company X has finally solved all your marital problems with their new product.