Saturday, May 8, 2010

Flash filesystems on block devices

Most flash that consumers can buy today comes packaged with block device translation. Essentially it emulates a hard disk. Sometimes this emulation is relatively good. Sometimes it is appallingly bad. Pretty much all flash cards you can buy for your camera belong to the latter category. I doubt that even a single exception exists - if it did, the manufacturer would advertise the fact broadly and independent journalists would pick it up and verify such lofty claims.

And this actually matters for filesystems. Something between the bare hardware and the data needs to deal with wear leveling, error correction and scrubbing. If the translation layer does not, or does a bad job of it, the filesystem has to step in. Or failing that, data will be lost.

Traditional wear leveling on the cheap devices works on small areas of 1024 blocks, usually 128MB. Write 1000x anywhere within the area and the writes will be spread to 1000 blocks[1]. Write a million times and each block within the area will be written about 1000x. And no writes to any blocks outside this area. Oops!
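
To make the arithmetic concrete, here is a small simulation. It is only a sketch: the round-robin policy, the function name `simulate` and the four-area device size are my assumptions for illustration, not a description of any real controller. It takes the optimistic case where wear really is spread over all 1024 blocks of the area.

```python
AREA_BLOCKS = 1024  # blocks per wear-leveling area, as in the text


def simulate(total_blocks, writes, target_block):
    """Hammer one logical block; the (assumed) controller rotates the
    physical block round-robin, but only inside the target's own area."""
    erase_counts = [0] * total_blocks
    area_start = (target_block // AREA_BLOCKS) * AREA_BLOCKS
    for i in range(writes):
        physical = area_start + (i % AREA_BLOCKS)
        erase_counts[physical] += 1
    return erase_counts


# Hypothetical device with four areas; all writes hit logical block 1500,
# which lives in the second area (blocks 1024..2047).
counts = simulate(total_blocks=4 * AREA_BLOCKS, writes=1_000_000, target_block=1500)
in_area = counts[1024:2048]
outside = sum(counts[:1024]) + sum(counts[2048:])

print(min(in_area), max(in_area))  # 976 977
print(outside)                     # 0
```

Every block in the area wears out at the same pace, while the other three quarters of the device are never touched.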

Error correction is another problem. In most cases, one-bit errors can be corrected and two-bit errors can be detected. If there is a one-bit error, the hardware will read the data with an error, correct the error in RAM, and return the corrected data to the operating system. It will not write the corrected data back to flash. The next time you read from this location, there will still be an error that is corrected on the fly. Unless a second bit gets corrupted in the meantime. In that case you have an uncorrectable error and have effectively lost some data. Oops!
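
The failure mode is easy to reproduce with a toy code. The sketch below uses an extended Hamming(8,4) code, which corrects one-bit errors and detects two-bit errors. Real devices use different and larger codes; the choice of code and the names `encode` and `decode` are assumptions made purely for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.
    Index 0 is the overall parity bit, indices 1..7 are Hamming positions."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble). The stored codeword is never modified,
    just like a translation layer that only corrects in RAM."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    fixed = list(code)  # corrected copy, the "RAM" version
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        fixed[syndrome] ^= 1  # one flipped bit; the syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None  # two flipped bits: detected, not fixable
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


flash = encode(0b1011)
flash[5] ^= 1           # first bit rots in the flash cell
first = decode(flash)   # corrected on the fly, flash still holds the error
flash[2] ^= 1           # a second bit rots before anyone rewrote the data
second = decode(flash)

print(first)   # ('corrected', 11)
print(second)  # ('uncorrectable', None)
```

Had the corrected data been written back after the first read, the second bit flip would have been just another correctable one-bit error.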

It would be nice if the translation layer rewrote the corrected data. It would still be ok if it at least informed the operating system about the error and the operating system rewrote the data. As it stands, you can only suspect the worst and occasionally rewrite everything on the device, just in case.

And things get really messy when you look at read disturb and write disturb. Flash has the nasty tendency that if you write to block 4711, some of those electrons may spread to neighboring blocks and now you have errors in blocks 4710 and 4712. Even reading can produce errors somewhere nearby. People have experienced this with ext3, where data close to the journal was getting corrupted on a large number of devices. Oops!

The solution is called scrubbing. You simply read everything from beginning to end, once every so often. If you notice any errors, rewrite the data as long as the errors are correctable. Or if the device does not tell you about errors (see above), you rewrite everything, just to be sure.
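
For a device that does not report corrected errors, a crude scrubber could look like the sketch below. It assumes that rewriting identical data makes the translation layer program fresh flash, which a given device may or may not do. The function name `scrub` and the 1 MiB chunk size are arbitrary choices, and the demo runs against a temporary file rather than a real device.

```python
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice


def scrub(path, chunk=CHUNK):
    """Read the device (or file) from beginning to end and rewrite every
    chunk in place. Returns the number of bytes rewritten."""
    rewritten = 0
    with open(path, "r+b") as dev:
        offset = 0
        while True:
            dev.seek(offset)
            data = dev.read(chunk)
            if not data:
                break
            dev.seek(offset)
            dev.write(data)  # rewrite exactly what was just read
            offset += len(data)
            rewritten += len(data)
        dev.flush()
        os.fsync(dev.fileno())  # push the rewrites down to the device
    return rewritten


# Demo on a throwaway file standing in for the flash device.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(3 * CHUNK + 123))
    demo_path = f.name

with open(demo_path, "rb") as f:
    before = f.read()
scrubbed = scrub(demo_path)
with open(demo_path, "rb") as f:
    after = f.read()
os.remove(demo_path)

print(scrubbed == len(before), before == after)  # True True
```

Pointing this at a real block device needs root and should only be done while nothing else is writing to it, since a concurrent write between the read and the rewrite would be overwritten with stale data.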

So what can you take home from all this? A number of things:
  • Most flash devices suck.
  • If you are not sure, assume your device sucks.
  • Use a flash filesystem to work around the translation layer's shortcomings.
  • Translation layers hide important information (like corrected errors) from you.
  • Using a flash filesystem on raw flash would be better than using a flash filesystem on top of a translation layer.

[1] Or to just 25 blocks. Spreading the wear over 1024 blocks is called static wear leveling and often advertised as a huge improvement.
