Sunday, 25 April 2010

Log2

It appears as if there will be a log2 filesystem fairly soon. The reason is compression, or rather some of the problems it created. In short, log2 will roughly double the random read performance for uncompressed data and speed up erases by a factor of 256 or thereabout.

Erases in normal filesystems are a fairly simple thing. Pseudocode would look roughly like this:
for each block in file {
	free bytes
}


Once you add compression, things get a little more complicated. You no longer know the block size:
for each block in file {
	figure out how many bytes the compressed block fills
	free those bytes
}


And with the current logfs format, the block size is part of a header prepended to each block. We only need two bytes of that header. But reading from the device always happens at a granularity of 4096 bytes, so effectively we have to read 4096 bytes per deleted block. And the result doesn't feel like a greased weasel on caffeine.
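
To make the cost concrete, here is a minimal sketch of that lookup under the current format. The function is hypothetical and not the real logfs code; only the two-byte length field and the 4096-byte read granularity come from the text above.

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Hypothetical sketch: to learn how many bytes a block occupies, the
 * current format forces a full 4096-byte read just to look at two bytes
 * of the per-block header (assumed here to sit at the start of it). */
static int compressed_block_bytes(int dev_fd, uint64_t block_ofs)
{
	uint8_t page[PAGE_SIZE];
	uint16_t len;

	/* one whole page read per deleted block */
	if (pread(dev_fd, page, PAGE_SIZE,
		  block_ofs & ~(uint64_t)(PAGE_SIZE - 1)) != PAGE_SIZE)
		return -1;
	memcpy(&len, page + (block_ofs % PAGE_SIZE), sizeof(len));
	return len;	/* endianness conversion omitted in this sketch */
}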

So the solution will be to add those two bytes, and a couple of other fields from the header, to the block pointer in the indirect block. The whole block pointer will be 16 bytes, and 4096 / 16 = 256 of them fit into one page, thus explaining the 256x improvement: one page read during an erase now covers 256 blocks instead of one.
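
What such a block pointer might look like is sketched below. Only the total size of 16 bytes and the two length bytes follow from the above; every field name and the exact layout are assumptions.

#include <stdint.h>

/* Hypothetical 16-byte block pointer as stored in an indirect block. */
struct log2_block_ptr {
	uint64_t ofs;		/* device offset of the data block */
	uint16_t len;		/* compressed length, formerly in the header */
	uint8_t  compr;		/* compression type (assumed field) */
	uint8_t  flags;		/* assumed field */
	uint32_t crc;		/* checksum (assumed field) */
};				/* 8 + 2 + 1 + 1 + 4 = 16 bytes */

_Static_assert(sizeof(struct log2_block_ptr) == 16, "pointer must stay 16 bytes");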

The random read problem is - again - caused by the 4096-byte granularity. With compression, a data block will often span two 4096-byte pages. Uncompressed data will always do so, and rarely even spans three if you include the header. So reading a random block from a file usually requires reading two pages from the device. Bummer.
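
A tiny helper makes the page-spanning argument concrete; the only input it relies on is the 4096-byte read granularity from above. The 18-byte header size in the example call is made up.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* number of 4096-byte pages touched when reading len bytes at offset ofs */
static uint64_t pages_touched(uint64_t ofs, uint64_t len)
{
	return (ofs + len - 1) / PAGE_SIZE - ofs / PAGE_SIZE + 1;
}

int main(void)
{
	/* unaligned uncompressed block plus a made-up 18-byte header: 2 pages */
	printf("%llu\n", (unsigned long long)pages_touched(6000, 4096 + 18));
	/* page-aligned uncompressed block: 1 page */
	printf("%llu\n", (unsigned long long)pages_touched(8192, 4096));
	return 0;
}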

The solution is simple: align the data. One problem with aligned data is where to find the header. But since we just solved that problem two paragraphs up, things should be relatively simple. We shall see.

So why create a new filesystem and not just change the existing logfs? Well, mainly to keep the bugs that new and intrusive code inevitably brings from interfering with people's existing filesystems. The ext family has set an interesting precedent in this respect.

Saturday, 3 April 2010

Flash errors

We all know that NAND flash will occasionally return errors and requires some form of ECC to deal with this problem. But additional details are hard to come by. Manufacturers are tight-lipped, ECC is often hidden from users, and I don't know of any independent studies. So I have done my own. To generate data, I have rewritten the flash with a test pattern of 0x55 (01010101 in binary) 100 times and read the data back. Whenever data was corrected by the driver, I noted the exact bit that was corrected. Raw results are here.
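
For the curious, a minimal sketch of one such read pass is below. It assumes a Linux MTD character device; the device path and the 4096-byte read size are assumptions. The ECCGETSTATS ioctl only narrows a correction down to a chunk, so pinpointing the exact bit additionally requires comparing against a raw, uncorrected read of the same data, which is left out here.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <mtd/mtd-user.h>

int main(void)
{
	enum { CHUNK = 4096 };			/* read size, an assumption */
	uint8_t buf[CHUNK];
	struct mtd_ecc_stats before, after;
	off_t ofs = 0;
	int fd = open("/dev/mtd0", O_RDONLY);	/* device path is an assumption */

	if (fd < 0)
		return 1;
	for (;;) {
		if (ioctl(fd, ECCGETSTATS, &before))
			break;
		ssize_t n = pread(fd, buf, CHUNK, ofs);
		if (n <= 0)
			break;
		if (ioctl(fd, ECCGETSTATS, &after))
			break;
		if (after.corrected != before.corrected)
			printf("corrections at 0x%llx: %u\n",
			       (unsigned long long)ofs,
			       after.corrected - before.corrected);
		ofs += n;
	}
	close(fd);
	return 0;
}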

Several things stand out:
  • Almost always, bits 0, 2, 4 and 6 are corrected (exactly the bits written as 1 by the 0x55 pattern). Only two exceptions exist among 21690 corrections.
  • A number of errors are related. There are four runs of 32 errors with similar offsets and incidence rate. A fifth run of 25 errors with low incidence rate exists.
  • Only 848 bits have ever been corrected out of a total of 128Gibit (16GiB).
While the first two details are likely particular to the flash chips used in this test, the last one gives us a useful insight. On average, each bit has a 1:633M chance of being bad on any particular run. If bit errors were randomly distributed, we would expect each bad bit to be affected only once per 100 runs. But on average, the 848 bad bits were affected 25 times each. For those bits, the error rate increased from 1:633M to 25%.
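
A quick back-of-the-envelope check of those numbers; the 128Gibit total from the list above and the correction counts are the only inputs.

#include <stdio.h>

int main(void)
{
	const double total_bits = 137438953472.0;	/* 128 Gibit = 2^37 */
	const double corrections = 21690, runs = 100, bad_bits = 848;

	/* chance of any given bit being corrected on one run: roughly the
	 * 1:633M quoted above */
	printf("1 in %.0f million per run\n",
	       total_bits / (corrections / runs) / 1e6);
	/* how often the average bad bit was hit across the 100 runs */
	printf("%.1f hits per bad bit\n", corrections / bad_bits);
	return 0;
}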

In other words, it makes sense to keep statistics of which bits required correction and how often. Anything that was corrected more than once is likely to be a manufacturing problem, not the infamous stray alpha particle. Avoiding affected areas is relatively cheap (as a percentage of storage lost) and can improve your data's life expectancy.
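
A sketch of what such bookkeeping could look like follows. Only the idea of counting corrections per bit comes from the text above; the names, the simple hash table and the threshold of more than one correction are assumptions.

#include <stdint.h>
#include <stdlib.h>

/* Key is the absolute bit address on the medium: byte offset * 8 + bit. */
struct bitstat {
	uint64_t bit_addr;
	uint32_t count;
	struct bitstat *next;
};

#define NBUCKETS 1024
static struct bitstat *buckets[NBUCKETS];

/* record one ECC correction at the given bit address */
void note_correction(uint64_t bit_addr)
{
	struct bitstat *s;

	for (s = buckets[bit_addr % NBUCKETS]; s; s = s->next) {
		if (s->bit_addr == bit_addr) {
			s->count++;
			return;
		}
	}
	s = calloc(1, sizeof(*s));
	if (!s)
		return;		/* statistics are best-effort */
	s->bit_addr = bit_addr;
	s->count = 1;
	s->next = buckets[bit_addr % NBUCKETS];
	buckets[bit_addr % NBUCKETS] = s;
}

/* a bit corrected more than once is probably a manufacturing defect and
 * the area around it a candidate for retirement */
int probably_defective(uint64_t bit_addr)
{
	struct bitstat *s;

	for (s = buckets[bit_addr % NBUCKETS]; s; s = s->next)
		if (s->bit_addr == bit_addr)
			return s->count > 1;
	return 0;
}

In practice the natural unit to retire would probably be a whole page or erase block rather than a single bit, but the counting stays the same.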