Saturday, 3 April 2010

Flash errors

We all know that NAND flash will occasionally return errors and requires some form of ECC to deal with this problem. But additional details are hard to come by. Manufacturers are tight-lipped, ECC is often hidden from users and I don't know of any independent studies. So I have done my own. To generate data, I wrote a test pattern of 0x55 (01010101 in binary) to flash 100 times and read the data back after each write. Whenever the driver corrected data, I noted the exact bit that was corrected. Raw results are here.

Several things stand out:
  • Almost all corrections hit bits 0, 2, 4 and 6. Only two exceptions exist among 21690 corrections.
  • A number of errors are related. There are four runs of 32 errors with similar offsets and incidence rate. A fifth run of 25 errors with low incidence rate exists.
  • Only 848 bits have ever been corrected out of a total of 128GiB.
While the first two details are likely particular to the flash chips used in this test, the last one gives us a useful insight. On average, each bit has a chance of 1:633M of needing correction on any particular run. If bit errors were randomly distributed, we would expect each bad bit to be affected only once in 100 runs. But on average, each of the 848 bad bits was corrected about 25 times (21690 corrections / 848 bits). So for a bit that has needed correction once, the chance of needing correction on any given run rises from 1:633M to about 25%.
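
For anyone who wants to check the arithmetic, here is a rough sketch in Python. The correction counts come straight from the list above. The device size is my assumption: the 1:633M figure comes out if the flash holds 2**37 bits (128 Gibit), so that is what the sketch uses. All variable names are made up for illustration.

```python
# Figures quoted in the post.
corrections = 21690   # total corrections over all runs
bad_bits = 848        # distinct bits that were ever corrected
runs = 100            # number of write/read passes

# ASSUMPTION: the device holds 2**37 bits (128 Gibit).
# This size is not stated outright; it is chosen because it
# reproduces the 1:633M figure.
total_bits = 2 ** 37

# Average odds that any one bit needs correction on any one run.
per_run_odds = total_bits * runs / corrections

# Average number of corrections per bit that was ever corrected.
repeats_per_bad_bit = corrections / bad_bits

# Per-run correction rate for a bit already known to be bad.
bad_bit_rate = repeats_per_bad_bit / runs

print(f"1:{per_run_odds / 1e6:.0f}M per bit per run")
print(f"{repeats_per_bad_bit:.1f} corrections per bad bit")
print(f"{bad_bit_rate:.0%} per run for a known bad bit")
```

The last figure is an average over all 848 bits. The raw data may well contain bits that fail far more or far less often than that.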

In other words, it makes sense to keep statistics of which bits required correction and how often. Anything that was corrected more than once is likely to be a manufacturing problem, not the infamous stray alpha particle. Avoiding affected areas is relatively cheap (as a percentage of storage lost) and can improve your data's life expectancy.
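
Such bookkeeping could look roughly like the Python sketch below. It is not what my test setup does. The function names, the threshold of two corrections and the 128 KiB erase-block size are all assumptions picked for illustration, and a real driver would need to persist the counts somewhere.

```python
from collections import Counter

# ASSUMPTION: hypothetical erase-block size of 128 KiB, expressed in bits.
ERASE_BLOCK_BITS = 128 * 1024 * 8


def suspect_bits(corrected_offsets, min_count=2):
    """Return {bit_offset: count} for bits corrected at least min_count times."""
    counts = Counter(corrected_offsets)
    return {off: n for off, n in counts.items() if n >= min_count}


def blocks_to_avoid(suspects, block_bits=ERASE_BLOCK_BITS):
    """Map suspect bit offsets to the sorted list of erase blocks containing them."""
    return sorted({off // block_bits for off in suspects})


# Made-up log of corrected bit offsets, one entry per correction.
log = [12, 12, 12, 5_000_000, 9_999_999, 9_999_999]

suspects = suspect_bits(log)
blocks = blocks_to_avoid(suspects)
print(suspects)
print(blocks)
```

Offset 5,000,000 appears only once in the made-up log, so it is treated as a one-off and its block stays in use. Only the two repeat offenders get their blocks marked.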
