Reallocated sectors/blocks and data corruption



Just wanted to share this: most people know it's common for HDDs and SSDs to reallocate sectors/blocks, but some probably don't know that there's a chance of data corruption when that happens.

 

It was my feeling that this was possible (likely?), but luckily I haven't had reallocated sectors in any of my server's HDDs for many years. I did, however, have an issue yesterday with one of my SSDs used for VMs. I got these notifications:

 

12-03-2017 22:53    unRAID device sdk SMART health [187]    Warning [TOWER7] - reported uncorrect is 26    SanDisk_SDSSDA120G_153910407249 (sdk)    warning    
12-03-2017 22:53    unRAID device sdk SMART health [5]    Warning [TOWER7] - retired block count is 1    SanDisk_SDSSDA120G_153910407249 (sdk)    warning    

There was nothing in the syslog, i.e., this was all handled by the SSD firmware.
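For anyone wanting to watch the same two attributes outside of unRAID's notifications, here is a rough Python sketch. It parses text laid out like a `smartctl -A` attribute table. The sample table is made up for illustration (it is not the real output from this SSD), and the function name and the watched-ID set are my own choices.

```python
# Hypothetical sample in the layout of a `smartctl -A` attribute table.
# The rows are illustrative only, not captured from the SSD in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Retired_Block_Count     0x0032   100   100   000    Old_age   Always       -       1
  9 Power_On_Hours          0x0032   253   100   000    Old_age   Always       -       4521
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       26
"""

# Attribute 5 (retired/reallocated count) and 187 (reported uncorrectable).
WATCHED_IDS = {5, 187}


def nonzero_watched(table_text, watched=WATCHED_IDS):
    """Return {attribute_id: (name, raw_value)} for watched attributes with raw value > 0."""
    hits = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this also skips the header row, whose first field is "ID#".
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) > 0:
            hits[attr_id] = (fields[1], int(raw))
    return hits


if __name__ == "__main__":
    for attr_id, (name, raw) in sorted(nonzero_watched(SAMPLE).items()):
        print(f"SMART [{attr_id}] {name} raw value is {raw}")
```

Keep in mind this only tells you after the fact that the counters moved; it says nothing about whether the data that came back was good.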

 

I have a script doing daily incremental backups of my vdisks, so today I looked at the log and sure enough, there was an error:

 

ERROR: send ioctl failed with -5: Input/output error
ERROR: unexpected EOF in stream.

Looking at the syslog I could see the reason for the errors:

 

Mar 13 00:09:59 Tower7 kernel: BTRFS warning (device sdk1): csum failed ino 262 off 21441933312 csum 2062942272 expected csum 1983964368
Mar 13 00:09:59 Tower7 kernel: BTRFS warning (device sdk1): csum failed ino 262 off 21441933312 csum 2062942272 expected csum 1983964368
Mar 13 00:10:00 Tower7 kernel: BTRFS warning (device sdk1): csum failed ino 262 off 21441933312 csum 2062942272 expected csum 1983964368
Mar 13 00:10:01 Tower7 kernel: BTRFS warning (device sdk1): csum failed ino 262 off 21441933312 csum 2062942272 expected csum 1983964368
Mar 13 00:10:02 Tower7 kernel: BTRFS warning (device sdk1): csum failed ino 262 off 21441933312 csum 2062942272 expected csum 1983964368
Mar 13 00:10:03 Tower7 kernel: BTRFS warning (device sdk1): csum failed ino 262 off 21441933312 csum 2062942272 expected csum 1983964368
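If you want to pull the inode and offset out of lines like these automatically, something along these lines should work. It is only a sketch: the regex is written against the exact wording of the kernel messages quoted above, and other kernel versions may word them differently.

```python
import re

# One of the kernel lines quoted above, copied verbatim.
LINE = ("Mar 13 00:09:59 Tower7 kernel: BTRFS warning (device sdk1): "
        "csum failed ino 262 off 21441933312 csum 2062942272 expected csum 1983964368")

# Pattern matched to the message wording seen in this thread (an assumption
# for any other kernel version).
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>[^)]+)\): csum failed "
    r"ino (?P<ino>\d+) off (?P<off>\d+) csum (?P<got>\d+) expected csum (?P<want>\d+)"
)


def parse_csum_failure(line):
    """Return the fields of a 'csum failed' line as a dict, or None if it doesn't match."""
    m = CSUM_RE.search(line)
    if m is None:
        return None
    return {
        "dev": m["dev"],
        "ino": int(m["ino"]),
        "off": int(m["off"]),    # byte offset inside the file
        "got": int(m["got"]),    # checksum of the data actually read
        "want": int(m["want"]),  # checksum stored by btrfs
    }


if __name__ == "__main__":
    info = parse_csum_failure(LINE)
    print(info)
    # The offset is a multiple of 4096, i.e. it points at a whole 4 KiB block.
    print("4 KiB block index in file:", info["off"] // 4096)
```

All six lines report the same inode, offset and computed checksum, which reads like the same bad block being returned on every retry.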

And a scrub confirmed the problem and the affected file:

 

Mar 13 10:24:49 Tower7 kernel: BTRFS warning (device sdk1): checksum error at logical 102987755520 on dev /dev/sdk1, sector 192759352, root 313, inode 262, offset 21441933312, length 4096, links 1 (path: Win8.1/vdisk1.img)
Mar 13 10:24:49 Tower7 kernel: BTRFS warning (device sdk1): checksum error at logical 102987755520 on dev /dev/sdk1, sector 192759352, root 407, inode 262, offset 21441933312, length 4096, links 1 (path: Win8.1/vdisk1.img)
Mar 13 10:24:49 Tower7 kernel: BTRFS error (device sdk1): bdev /dev/sdk1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Mar 13 10:24:49 Tower7 kernel: BTRFS error (device sdk1): unable to fixup (regular) error at logical 102987755520 on dev /dev/sdk1
Mar 13 10:24:49 Tower7 kernel: BTRFS warning (device sdk1): checksum error at logical 102987759616 on dev /dev/sdk1, sector 192759360, root 313, inode 262, offset 21441937408, length 4096, links 1 (path: Win8.1/vdisk1.img)
Mar 13 10:24:49 Tower7 kernel: BTRFS warning (device sdk1): checksum error at logical 102987759616 on dev /dev/sdk1, sector 192759360, root 407, inode 262, offset 21441937408, length 4096, links 1 (path: Win8.1/vdisk1.img)
Mar 13 10:24:49 Tower7 kernel: BTRFS error (device sdk1): bdev /dev/sdk1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Mar 13 10:24:49 Tower7 kernel: BTRFS error (device sdk1): unable to fixup (regular) error at logical 102987759616 on dev /dev/sdk1
Mar 13 10:24:49 Tower7 kernel: BTRFS warning (device sdk1): checksum error at logical 102987763712 on dev /dev/sdk1, sector 192759368, root 313, inode 262, offset 21441941504, length 4096, links 1 (path: Win8.1/vdisk1.img)
Mar 13 10:24:49 Tower7 kernel: BTRFS warning (device sdk1): checksum error at logical 102987763712 on dev /dev/sdk1, sector 192759368, root 407, inode 262, offset 21441941504, length 4096, links 1 (path: Win8.1/vdisk1.img)
Mar 13 10:24:49 Tower7 kernel: BTRFS error (device sdk1): bdev /dev/sdk1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Mar 13 10:24:49 Tower7 kernel: BTRFS error (device sdk1): unable to fixup (regular) error at logical 102987763712 on dev /dev/sdk1
Mar 13 10:24:50 Tower7 kernel: BTRFS warning (device sdk1): checksum error at logical 102987767808 on dev /dev/sdk1, sector 192759376, root 313, inode 262, offset 21441945600, length 4096, links 1 (path: Win8.1/vdisk1.img)
Mar 13 10:24:50 Tower7 kernel: BTRFS warning (device sdk1): checksum error at logical 102987767808 on dev /dev/sdk1, sector 192759376, root 407, inode 262, offset 21441945600, length 4096, links 1 (path: Win8.1/vdisk1.img)
Mar 13 10:24:50 Tower7 kernel: BTRFS error (device sdk1): bdev /dev/sdk1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Mar 13 10:24:50 Tower7 kernel: BTRFS error (device sdk1): unable to fixup (regular) error at logical 102987767808 on dev /dev/sdk1
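To get a feel for how much was damaged, the sector/offset pairs from the scrub warnings can be merged into runs. A small sketch, assuming the "sector" numbers in the log are 512-byte units (that is my assumption, the log doesn't say):

```python
# (device sector, file offset) pairs copied from the scrub warnings above.
ERRORS = [
    (192759352, 21441933312),
    (192759360, 21441937408),
    (192759368, 21441941504),
    (192759376, 21441945600),
]
BLOCK = 4096   # "length 4096" in each warning
SECTOR = 512   # assumption: the logged sector numbers are 512-byte units


def contiguous_runs(offsets, block=BLOCK):
    """Merge block-aligned byte offsets into (start, length_in_bytes) runs."""
    runs = []
    for off in sorted(set(offsets)):
        if runs and runs[-1][0] + runs[-1][1] == off:
            # This block starts exactly where the previous run ends: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + block)
        else:
            runs.append((off, block))
    return runs


if __name__ == "__main__":
    file_runs = contiguous_runs([off for _, off in ERRORS])
    dev_runs = contiguous_runs([sector * SECTOR for sector, _ in ERRORS])
    print("runs inside the file:", file_runs)
    print("runs on the device:  ", dev_runs)
```

Both views come out as a single 16 KiB run, so the four errors are one contiguous damaged region. That would fit a single retired flash block, but it doesn't prove it.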

 

Problem fixed by restoring the vdisk from a previous backup. I'll keep the SSD for now and keep an eye on it, but I hope this reminds users of the importance of backups and especially of having checksums, although in the case of vdisks, having them on a btrfs device is the only practical way of doing it.
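For files that don't change (not vdisks, which are rewritten constantly), checksums can also be kept by hand on any filesystem. A minimal sketch using SHA-256; the function names are mine and this isn't any particular plugin's method:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return relative paths that are missing or whose hash no longer matches."""
    root = Path(root)
    bad = []
    for rel, digest in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != digest:
            bad.append(rel)
    return bad
```

Build the manifest once while the data is known good, store it somewhere else, and run `verify` periodically. Unlike btrfs it can't tell you which block went bad, only which file.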

 

 

 

Link to comment

An interesting report, and your point is good, backups and checksums are vital.  But something about what happened bothered me, so I gave it some thought, and realized this is not at all how we are used to seeing normal hard drives behave.  Because of sector ECC, the drive knows whether the data read is correct or not, and even tries to correct it if it only has a few wrong bits.  But it never returns the data to ANY reader (including the file system code) if the data cannot be read perfectly, corrected or not.  This means you CANNOT get corrupted data back, you only get perfect data or an error code.

 

With normal hard drives, behaving the way we are familiar with, data that can't pass its ECC test cannot be read, but the sector is not reallocated until you give up on the sector and write to it.  That initiates the drive's testing of the sector, a possible remapping, and then the writing of the new data.

 

Based on what you have written, this SSD has done something different, and I don't think it should have.  It decided on its own when to test and reallocate, something that at first *sounds* like a good idea, but it also then wrote what it could recover of the old data to the new sector, corrupt data.  That's unacceptable.  But if the drive decides to help you by *fixing* the sector behind the scenes, it has to come up with *something* to write to the new sector, and that's why this corruption occurred.  If the old data was corrupt, you can't use it, and that's why you can't fix a sector until you're given new data for it.  I don't like this.  Sounds like an inexperienced developer with a *bright idea*, who did not think it through.  You CANNOT write to a sector with anything less than perfect data.  This time, the built-in checksumming caught it, but generally there isn't any checksumming, and this would result in silent corruption.  I don't see how we could recommend any SSD's with that firmware.

 

But I could be off-base, we could be misinterpreting what actually happened.  Or there was a terrible coincidence here - the sector went bad but was safely replaced, and then the same sector was corrupted somehow (might want to avoid that bad luck sector!).  Edit: in taking another look above, I don't think we can conclude for sure that the remapped sector was the one with the corruption, and that may change everything.  If it wasn't, then that takes SanDisk off the hook.

 

Off-topic, but this made me wonder why we ever worry about bit-rot.  Something that occurs on the order of one bit in a thousand terabytes would be caught and easily handled by the sector ECC info.  What am I missing?

Link to comment
4 hours ago, RobJ said:

But I could be off-base, we could be misinterpreting what actually happened.  Or there was a terrible coincidence here

 

While I cannot be 100% certain, I really doubt it was a coincidence. I make daily btrfs snapshot backups, then use btrfs send/receive to do an incremental backup to another disk. If there is a checksum error during the copy (send/receive), it will fail with a read error (btrfs always aborts a read if a checksum error is detected). All previous backups were successful, so it would be a very big coincidence for this to happen on the same day I got the retired block warning on the SSD.
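For reference, the snapshot plus send/receive flow looks roughly like this. This is a sketch, not the actual backup script: every path and snapshot name below is hypothetical, and only the `btrfs subvolume snapshot -r`, `btrfs send -p` and `btrfs receive` commands themselves are the standard ones.

```python
import subprocess


def incremental_backup_commands(subvol, snap_dir, parent_name, new_name, dest):
    """Build the three commands for one incremental backup (all paths are hypothetical)."""
    new_snap = f"{snap_dir}/{new_name}"
    parent_snap = f"{snap_dir}/{parent_name}"
    # Read-only snapshot of the live subvolume (send requires read-only).
    snapshot = ["btrfs", "subvolume", "snapshot", "-r", subvol, new_snap]
    # Send only the differences relative to the previous snapshot.
    send = ["btrfs", "send", "-p", parent_snap, new_snap]
    receive = ["btrfs", "receive", dest]
    return snapshot, send, receive


def run_backup(snapshot, send, receive):
    """Run snapshot, then pipe send into receive; raise if either side fails."""
    subprocess.run(snapshot, check=True)
    sender = subprocess.Popen(send, stdout=subprocess.PIPE)
    receiver = subprocess.run(receive, stdin=sender.stdout)
    sender.stdout.close()
    send_rc = sender.wait()
    if send_rc != 0 or receiver.returncode != 0:
        # A checksum failure on the source makes send abort with an I/O error.
        raise RuntimeError(f"backup failed: send={send_rc}, receive={receiver.returncode}")


if __name__ == "__main__":
    cmds = incremental_backup_commands(
        "/mnt/vms", "/mnt/vms/snaps", "2017-03-12", "2017-03-13", "/mnt/backup"
    )
    for cmd in cmds:
        print(" ".join(cmd))
```

The useful side effect is that send has to read every changed block, so each daily run doubles as a checksum test of whatever changed since the last snapshot.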

Link to comment

That puts SanDisk back on the hook.  Since most SSD users are probably Windows users, definitely not BTRFS or checksum users, there may well be a fair amount of undiscovered corruption out there, once the drives are old enough to have retired sectors.  This should perhaps be a news item.  Certainly needs more investigation.

Link to comment

I have said this many times before: SMART counters do not predict errors, they only report errors which have already occurred, after data loss.

 

ECC has a very limited ability to detect errors, and a more limited ability to correct. What a reported uncorrected error means, is the drive has data it knows is bad because it failed ECC, but was unable to correct. But a collision (easy with ECC) means you can not really trust data just because it passes ECC. Read up on SHA collisions if you want to know what this means or how bad it is.

 

Enter check-summing.

 

BTRFS is one filesystem which has checksums for the data (ZFS is another). XFS and others have checksums for the metadata, it's deemed more important. Filesystems like WAFL/CephFS/OneFS actually do correction in the event of a checksum error.

 

The general direction is less reliable storage devices, at lower costs. If this event had been a video file, the data loss would probably be unnoticeable. And if you don't want to lose data, a higher level protection scheme (Replication(backup)/RAID/EC) needs to be in place. Which is exactly what happened in this case. The data was protected by a higher level scheme, a backup.

 

One last thing, a reallocation does not corrupt or lose data. The reallocation is long after the loss. Reallocation is done on the write of new data to a location previously found to be unreliable. The loss is what caused the reallocation. Prior to a reallocation you might see a pending, or ECC corrected and uncorrected.

 

 

 

 

Edited by c3
Link to comment
1 hour ago, c3 said:

I have said this many times before: SMART counters do not predict errors, they only report errors which have already occurred, after data loss.

 

I don't think anyone was saying that, at least I didn't intend to.

 

1 hour ago, c3 said:

ECC has a very limited ability to detect errors, and a more limited ability to correct. What a reported uncorrected error means, is the drive has data it knows is bad because it failed ECC, but was unable to correct. But a collision (easy with ECC) means you can not really trust data just because it passes ECC. Read up on SHA collisions if you want to know what this means or how bad it is.

 

I think what you are saying is that you believe the problem above could have been caused by an ECC collision, that the data was corrupted in such a way that it still matched the ECC info, but was caught by checksum.  That's a plausible explanation.  It's the first time though that I've ever heard that ECC collisions could be statistically easy.  I'm still an amateur, and if you can point me to any studies about this, I would really appreciate it.

 

1 hour ago, c3 said:

One last thing, a reallocation does not corrupt or lose data. The reallocation is long after the loss. Reallocation is done on the write of new data to a location previously found to be unreliable. The loss is what caused the reallocation. Prior to a reallocation you might see a pending, or ECC corrected and uncorrected.

 

I don't believe I said that.  I'm sorry if I wasn't clear, but what I was trying to say was not that the reallocation caused the corruption, but that the reallocation caused the just previously or simultaneously corrupted data to be written to the new replacement sector.

 

We don't know if it occurred or not, but within the brief one day window, there was no report of pending or other issue.

Link to comment

Look at the OP. First line, SMART errors and the disconnect from them meaning data loss. For me, data corruption equals data loss.

 

The drive is reporting uncorrected errors, these are not ECC collisions, these are known bad/lost data.

 

The topic title talks about reallocation and corruption. I was trying to make it clear that reallocation is the best thing that happened in this whole mess. The drive has learned to not use the bad place. But the learning was costly. Again the OP, first line, the reallocation is not the problem. It was the errors before the reallocation.

 

I don't believe the SMART data was polled fast enough to state that there were no pending. Pending is not a permanent state.

 

I have not quoted anyone, and would hope that people will understand that this means nothing is directed at anyone directly.

Edited by c3
Link to comment
18 hours ago, c3 said:

...

ECC has a very limited ability to detect errors, and a more limited ability to correct. What a reported uncorrected error means, is the drive has data it knows is bad because it failed ECC, but was unable to correct. But a collision (easy with ECC) means you can not really trust data just because it passes ECC.

...

 

Untrue! Rather than (try to) get into the theory of error detection and correction [at the level implemented in HDD firmware] (which I doubt ANY reader of this forum is competent to do--I'm surely not!), consider this: If a collision was easy (Hell! forget easy; if a collision was even *possible*), hard drives would only be used by careless hobbyists.

 

(Regarding HDDs) I agree with RobJ (and I've written the same 2+ times on this board in the last few years):

On 3/13/2017 at 10:12 PM, RobJ said:

Because of sector ECC, the drive knows whether the data read is correct or not, and even tries to correct it if it only has a few wrong bits.  But it never returns the data to ANY reader (including the file system code) if the data cannot be read perfectly, corrected or not.  This means you CANNOT get corrupted data back, you only get perfect data or an error code.

Note that few is pretty large (8-12+, I think) for a 512/4096-byte sector. And the firmware will make many retry attempts to get a good read; I've seen evidence of 20. And then the OS driver will usually retry several times. Only then does the OS throw a UCE.

 

As for johnnie.black's  original experience/report, I'm intrigued/disturbed. Whose controller does Sandisk use?

 

[

Added: Note in the original post, there appear (I don't know unRAID's logging methodology) to have been 26 UCEs reported by this drive over its lifetime, prior to 12Mar2017:2253 (that is SMART code 187), and 1 reallocated sector (SMART code 5, which "somebody" is labeling `retired'). Do I assume, because of the way this logging is done, and the way that you are monitoring it, that you *know* that all of this bogosity (the 26 & 1) happened very recently? And that there was no sign of it in dmesg etc? If so, and if this SSD's firmware implements SMART correctly, where were those 26 errors REPORTED?

 

As I understand, there is a different category of SMART error for logging "implicit" errors (from self-diagnosis, Trim, etc.) that can't be "reported". Speaking of which ... what does a "smartctl -a /dev/sdX" show in the Error_Log? Or, does Sandisk (mis)behave like Western Digital and not bother to Log Errors? (Hitachi/HGST has spoiled me--they do [almost] everything right.) (Hey, who remembers the Samsung fiasco when they had a serious firmware bug in the F4 series (HD204, etc.)? As if the bug wasn't bad enough, when they released a fixed firmware, they HAD NOT CHANGED THE FIRMWARE VERSION/REVISION #

]

 

-- UhClem

 

Edited by UhClem
Link to comment
2 hours ago, UhClem said:

Do I assume, because of the way this logging is done, and the way that you are monitoring it, that you *know* that all of this bogosity (the 26 & 1) happened very recently? And that there was no sign of it in dmesg etc?

 

Yes to both, SMART was clean, all errors were reported at the same time, so they happened in a matter of seconds/minutes, whichever time the unRAID notification system uses to poll SMART, and there was nothing on the syslog about them.

 

Link to comment
Quote

Untrue! Rather than (try to) get into the theory of error detection and correction [at the level implemented in HDD firmware] (which I doubt ANY reader of this forum is competent to do--I'm surely not!), consider this: If a collision was easy (Hell! forget easy; if a collision was even *possible*), hard drives would only be used by careless hobbyists.

 

Unfortunately, I do see drives returning corrupt data on a regular basis. Enterprise storage manufacturers actually fight over who does this better. Enterprise storage uses drives which offer variable (512/520/524/528 bytes per sector) configurations for additional checksum storage. They don't do this just because they want you to buy more drives.
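To put numbers on those sector formats, here is a quick calculation of how many extra bytes each one leaves per 512-byte data sector. It is just arithmetic on the sizes listed above; what the extra bytes are actually used for is up to the array vendor.

```python
BASE = 512                       # user data bytes per sector
FORMATS = [512, 520, 524, 528]   # bytes per sector, as listed above


def extra_bytes(fmt, base=BASE):
    """Bytes per sector left over for checksums/metadata."""
    return fmt - base


def overhead_fraction(fmt, base=BASE):
    """Fraction of each formatted sector that is not user data."""
    return (fmt - base) / fmt


if __name__ == "__main__":
    for fmt in FORMATS:
        print(f"{fmt} B/sector: {extra_bytes(fmt):2d} extra bytes, "
              f"{overhead_fraction(fmt):.2%} of the sector")
```

So the cost in capacity tops out at about 3% for the 528-byte format.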

 

Additional sources

https://bartsjerps.wordpress.com/2011/12/14/oracle-data-integrity/

http://blog.fosketts.net/2014/12/19/big-disk-drives-require-data-integrity-checking/

 

Edited by c3
additional sources
Link to comment

It will be interesting if (when?) we see a transition from spinners to solid state drives for disk arrays; we may start to see more of this type of corruption occurring. It's not bitrot, but maybe the digital equivalent. BUT, I have to believe that in a world where accurate data is mandatory, we unRAIDers with our media servers are not the ones leading the charge. Although it may be true (not doubting c3) that data accuracy is an issue even with today's drives, UhClem's "untrue!" comment is the de facto understanding approaching fact. Bottom line, I have a high degree of confidence this will get sorted out if it has not already.

 

Of note, and I think this is a reality, typical spinner drive behavior of sectors slowly "wearing out" but maintaining ability to read with multiple retries (sometimes resorting to putting drives in the freezer) will be moving towards a binary state - either readable or not. This will make lives more complicated, especially for our users who occasionally poke themselves in the eye trying to recover, and wind up relying on a failing drive being largely readable as a safety net.

 

 

Link to comment

While everyone wants to talk about Annual Failure Rate (AFR), there are plenty of other measures, some more important depending on how the device is used. SSDs have a lower AFR than HDD. SSDs also have higher uncorrected errors. This means SSDs lose data more often, but keep working. You can read the study from last year.

 

Many years ago the BER numbers from drive manufacturers were confirmed by CERN. Pretty much everyone agrees, there is no guarantee that data will be written and read without corruption. CERN found 500 errors in reading 5x10^15 bits.
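Taking the two quoted figures at face value, the implied rate works out as below. This is only a unit conversion of the numbers in this post; it says nothing about where in the system those errors came from.

```python
errors = 500
bits_read = 5 * 10**15   # "5x10^15 bits" as quoted

bits_per_error = bits_read // errors      # bits read per observed error
tb_per_error = bits_per_error / 8 / 1e12  # same, in decimal terabytes
total_tb_read = bits_read / 8 / 1e12      # total volume read, in decimal TB

if __name__ == "__main__":
    print(f"one error per {bits_per_error:.0e} bits")
    print(f"one error per {tb_per_error} TB read")
    print(f"total read: {total_tb_read} TB")
```

That is one error per 10^13 bits, or about one per 1.25 TB read, over roughly 625 TB in total.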

 

"Silent data corruption is an unrecognized threat that may affect as many as 1 in 90 SATA drives." from NEC as an enterprise storage provider.

Edited by c3
additional source
Link to comment
11 hours ago, c3 said:

Unfortunately, I do see drives returning corrupt data on a regular basis.

Please convince me (not being argumentative--I'm sincere!). But, to convince me, you'll need a rigorous presentation, using a valid evidence trail. I appreciate your time/effort.

11 hours ago, c3 said:

Enterprise storage uses drives which offer variable (512/520/524/528 bytes per sector) configurations for additional checksum storage.

Oh yeah, I do recall seeing that, but always in a casual perusal of one of HGST Ultrastar manuals. Thanks for pointing it out to me in the context of a technical discussion, where I'm motivated to dig into it deeper. My first two minutes of digging has already added a few crumbs of new knowledge to my quest to understand the factory/low-level formatting.

11 hours ago, c3 said:

They don't do this just because they want you to buy more drives.

Agreed ... since the reduction in drive data capacity when going from 4k sectors to (4k+128) sectors is only about 3%. I believe they do it so you'll buy (the same # of) (much!) more expensive drives. After all, these are the same execs/bureaucrats that spent ~$500Billion on the Y2K scam; why not soak them for a measly extra $5-10B to protect their data (and cover their hiney). Remember, fear, and lawyers, (and fear of lawyers) are great motivators in such finagles.

 

Presently, I'm about 75% serious in the above. But I'm waiting, and very open, to be convinced otherwise.

 

-- UhClem

 

Link to comment
6 hours ago, c3 said:

Many years ago the BER numbers from drive manufacturers were confirmed by CERN. Pretty much everyone agrees, there is no guarantee that data will be written and read without corruption. CERN found 500 errors in reading 5x10^15 bits.

Not that CERN "study" again. c3, did you actually read it ? (not just casually)

 

I invite everyone who has participated in this thread to read it (this version is only 7 pages). See if you can find the numerous flaws in his presentation, and conclusions. Extra credit if you deduce the overall premise/motivation.

 

-- UhClem        "Gird your grid for a big one ..."

 

Link to comment

I spent some time going through the linked papers, and I do appreciate their provision, did learn a little.  But I'm afraid I have yet to find one bit of evidence against the strength of the ECC bits to preserve data integrity.  I have more reading to do, especially want to read up on the newer technologies like AF format, and how it does ECC, compared with the old 512 byte sectors.  All of the papers were old, roughly 10 years old, and based on older technologies, with 512 byte sectors.

 

The Toronto study was a good one, probably the best of the papers, but irrelevant to data corruption.  It was heavy on discussion of bit error rates, SSD's vs hard drives, and AFR's, but every bit of it was about drive reliability.  There was a clear assumption that while there are numerous bit errors from various sources, they are ALL caught, and are either corrected in 'transparent errors' or caught as 'non-transparent errors'.  If anything, the study reinforces the idea that bit errors are either corrected or caught.  Admittedly it's indirectly, as it's not directly stated, but no other possibility is entertained.

 

The CERN paper was frustrating, and I don't know why several sources are quoting it or referring to it.  CERN has great scientists, and I'll always be interested in whatever is published in connection with CERN, but this was more of a vague announcement of the discovery of data corruption.  It contains scientific data, but gathered and analyzed with a sad lack of rigor.  They produced some numbers, but nothing you could actually draw any conclusions from, at all.  And they openly admit that they deduced there were hardware and firmware issues that dramatically affected the numbers and results.  The tests were run against whole systems (CPU's, RAM, drives, etc), with little effort to determine and distinguish between the many sources of errors.  They give a statement that a vague 80% of the errors detected were probably from issues between their 3Ware controllers and their WD drives!  And that they were now going to replace the firmware on 3000 of them!  They thought that about 10% of the other errors were related to memory, but were detected by its being ECC RAM.  And the last 10% was from something that wasn't explained, no physical source mentioned.  Nowhere is there any mention, let alone a clear attribution to sector bit errors not being caught.  I don't think this paper should be cited at all.

 

The NEC paper is good, and worth reading by everyone for the ideas about other sources of silent data corruption.  At no point, does it implicate a weakness in ECC.  Rather, it points out other possibilities.  The 'Torn writes' I discount, as while it would corrupt data and parity, it would be detected, by a huge ECC mismatch.  I may be wrong, but I think I can ignore the 'Misdirected write' issue too, as I've never heard of its possibility EVER, and I feel fairly sure that it's a problem long solved by current drive vendors (NEC paper was from 2008).  That leaves the 'data path corruption' issue, the interesting one here, and the one that is consistent with our own experience.  It's about the many sources of corruption between the drive interface and the media surface and back again, including the memory buffers, busses, and registers, and firmware issues.  We had basically concluded we saw it quite awhile back when a few users had repeating parity errors, with the only possible source being the memory caches or registers on the physical drive itself.  This one needs some thought.  In this paper, see also the nice diagram of what they call 'parity pollution', something that's applicable to us.  I don't know how much confidence to put in their frequency numbers though, they're based on 512 byte technology.   My apologies for only draft quality writing here, a bit rushed.

Link to comment
5 hours ago, RobJ said:

I spent some time going through the linked papers, and I do appreciate their provision, did learn a little.  But I'm afraid I have yet to find one bit of evidence against the strength of the ECC bits to preserve data integrity. 

Thank you for your summary. (I hadn't bothered -- once I saw the CERN paper was one of the references, I lost all confidence ["...baby...bath water." :) ]).

5 hours ago, RobJ said:

The CERN paper was frustrating, and I don't know why several sources are quoting it or referring to it. 

....  [ (excellent synopsis of the flaws) ] ...

I don't think this paper should be cited at all.

A+ for you.

 

I'm surprised it was even published ... but I appreciate CERN's openness (or was it ignorance?). Personally, I would be totally embarrassed to admit that I had purchased, and deployed into production, 600 RAID controllers and 3000 drives, without first getting 3-4 controllers & 15-20 drives and beating the sh*t out of it all for a week or two (and not just 2GB every 2 hours). But, why should they care ... it's just the(ir) taxpayers' money. [And, in 2006, that probably represented ~US$750,000+ (in 2006 euros).] Did they even get competitive bids? [Make that $1M+]

 

5 hours ago, RobJ said:

The NEC paper is good, and worth reading by everyone for the ideas about other sources of silent data corruption.  At no point, does it implicate a weakness in ECC.  Rather, it points out other possibilities.

  ...

That leaves the 'data path corruption' issue, the interesting one here, and the one that is consistent with our own experience.  It's about the many sources of corruption between the drive interface and the media surface and back again ...

Those data path issues were formally addressed in 2007 when they were added to SMART, but had probably been implemented in drive firmware even earlier by the competent manufacturer(s).

 

--UhClem   (almost accepted a job offer from CERN in 1968 ... then my draft deferment came through)

 

Edited by UhClem
Link to comment

If all you are looking for is weakness in ECC, there are plenty of papers on Reed-Solomon (and its replacements). It comes down to a game of show me your ECC method and there is an error pattern which will go undetected. As mentioned early on, the more recent case of this is SHA. Widely used and thought to be collision "impossible", until 2005 when a non brute force method was published. Now the non brute force method has been improved and weaponized. It is important to note the "impossible" was always really just impractical. The time/probability was so large people just ignored it. Some organizations, like CERN, work at larger scale and understand that these cannot be ignored, but must be measured at scale.

 

If the data you are trying to protect is x bits long, protecting it with fewer bits, y, leaves a gap. In the case of the 512-byte sector, the ECC is 40 ten-bit symbol fields, so 512 bytes are protected by 50 bytes. This is all done at the DSP level, so the bits to bytes, baud rates, etc. apply. Now in the 4k sector the ECC is 100 bytes. Yeah, more data protected with a smaller portion allocated to protection. And remember, unlike SHA, these 100 bytes have to not only detect errors, but correct them.
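The "gap" can be put into numbers with the figures from this post (50 ECC bytes per 512-byte sector, 100 per 4k sector). The sketch below is a plain counting argument and deliberately ignores how the code is constructed, so it says collisions must exist, not how likely real-world errors are to land on one.

```python
def ecc_ratio(data_bytes, ecc_bytes):
    """ECC bytes as a fraction of the data bytes they protect."""
    return ecc_bytes / data_bytes


def log2_patterns_per_check(data_bytes, ecc_bytes):
    """log2 of the average number of distinct data patterns sharing one ECC value.

    Pure counting: 2**(8*data) possible sectors map onto 2**(8*ecc) check values.
    """
    return 8 * data_bytes - 8 * ecc_bytes


if __name__ == "__main__":
    for data, ecc in [(512, 50), (4096, 100)]:
        print(f"{data} B data / {ecc} B ECC: ratio {ecc_ratio(data, ecc):.2%}, "
              f"about 2^{log2_patterns_per_check(data, ecc)} patterns per ECC value")
```

Going from 512-byte to 4k sectors drops the ECC share from just under 10% to under 2.5% of the data size.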

 

This double-edged sword is played against the manufacturer's goal of quality vs cost. Each quality improvement becomes an opportunity to reduce cost. When the spindle motor wobble/vibration is improved, the track width is narrowed. The error rate goes down with one and up with the other. The current range is one error in 10^14 to 10^17 bits read. Some will take that as impossible.

 

I am sorry for the old papers. I'll stop doing that. They are from when I worked for an enterprise storage manufacturer. I have since moved up the stack to OS and filesystems, where we flat out tell disk drive manufacturers there is no expectation of perfect, and work to build robust systems on imperfect hardware.

Edited by c3
Link to comment
  • 2 weeks later...

I am sorry for the delay, I was busy working on things like data checksum on the XFS roadmap. Everyone there understood why, it was just the timing and who would do the work.

 

I did take time to have discussions with several disk drive manufacturers about the ECC performance, which remains RS. Two of them indicated I might get my hands on the details under NDA, of a non current device (like from 2TB). We spent a lot of time talking about the impact adding read heads (TDMR) will have on this whole process. There was a pretty good joke about going from helium to vacuum drives would help stabilize the head, but then how to fly in vacuum, maglev. I guess you had to be there.

 

Since I was told RS is still being used, Lemma 4 ends with (emphasis is mine);

"If that occurs, then Lemma 4 implies that there must have been more than e errors. We cannot correct them, but at least we have detected them. Not every combination of more than e errors can be detected, of course—some will simply result in incorrect decoding".

This is the foundation of UDE(paywalled), which drives the need for filesystem level checksum.
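The "incorrect decoding" case is easy to show with a toy code. The sketch below uses a Hamming(7,4) code, which is NOT Reed-Solomon and not what any drive uses; it is just the smallest code where the same effect is visible. It corrects e = 1 flipped bit per 7-bit word. With 2 flipped bits the decoder confidently "corrects" to the wrong data, and with certain 3-bit patterns it sees no error at all.

```python
from itertools import combinations


def encode(d):
    """Encode 4 data bits into a 7-bit Hamming codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(word):
    """Return (data_bits, syndrome); a non-zero syndrome means one bit was flipped back."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # "correct" the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


def flip(word, positions):
    """Return a copy of word with the bits at the given indexes inverted."""
    out = list(word)
    for i in positions:
        out[i] ^= 1
    return out


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    cw = encode(data)
    ok_single = all(decode(flip(cw, [i]))[0] == data for i in range(7))
    bad_double = sum(decode(flip(cw, p))[0] != data for p in combinations(range(7), 2))
    print("all 1-bit errors corrected:", ok_single)
    print("2-bit errors decoded to wrong data:", bad_double, "of 21")
    print("3-bit error on bits 0,1,2 ->", decode(flip(cw, [0, 1, 2])))
```

Real sector ECC is far stronger than this, but the structure of the problem is the same: past the design limit, the decoder can hand back wrong data without flagging anything, and only an independent checksum higher up will notice.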

 

UDE is a larger set of conditions, especially when talking spinning rust. You can see where TDMR will help.

 

To improve the chance of avoiding anything that might have my ideas, work, or decisions: use Microsoft products, but only the newer filesystems.

 

In other news, quietly slipped into unRAID 6.3.2 was f2fs. Which is very exciting (or worrisome), especially post 4.10, probably the next roll of unRAID. f2fs now has SMR support (took 2 years). But the stuff I work on takes a long time to do and even longer to get rid of.  SMR is just the beginning, but the whole concept of decoupling the write head from the read head/track width was fundamental to driving density. Other filesystems will follow, and/or use dm-zoned. Doing so will have benefits for flash write durability. Best of luck avoiding.

 

But things like filesystem checksum will be needed outside the spinning rust world, and the OP is probably grateful.

Edited by c3
added citations as requested.
Link to comment
  • 4 years later...

Reviving this thread since it happened again, now with a hard drive, a Seagate ST1000LM035. The disk already had some reallocated sectors when I began using it for this function, but it passed an extended SMART test and worked fine without any issues for a few weeks. This morning I had several emails from last night, the first ones about new reallocated sectors:
 

18-04-2021 01:21 Unraid Wbackups disk SMART health [5] Warning [TOWER1] - reallocated sector ct is 984 ST1000LM035-1RK172_ZDE5AFTA (sdd) 
18-04-2021 01:18 Unraid Wbackups disk SMART health [5] Warning [TOWER1] - reallocated sector ct is 968 ST1000LM035-1RK172_ZDE5AFTA (sdd) 
18-04-2021 00:17 Unraid Wbackups disk SMART health [5] Warning [TOWER1] - reallocated sector ct is 960 ST1000LM035-1RK172_ZDE5AFTA (sdd) 

 

Then, because I have a script monitoring all btrfs pools for errors, I got an email about that:

 

18-04-2021 01:47 Unraid Status ERRORS on wbackups pool 

 

In this case the errors detected were corruption errors:

 

root@Tower1:~# btrfs dev stats /mnt/wbackups/
[/dev/sdd1].write_io_errs    0
[/dev/sdd1].read_io_errs     0
[/dev/sdd1].flush_io_errs    0
[/dev/sdd1].corruption_errs  20
[/dev/sdd1].generation_errs  0
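The monitoring script itself isn't shown here, but a check of this kind can be as simple as flagging any non-zero counter in that output. A sketch in Python, run against the exact text pasted above; the function name is made up and this is not the actual script:

```python
import re

# Output of `btrfs dev stats /mnt/wbackups/` as pasted above.
STATS = """\
[/dev/sdd1].write_io_errs    0
[/dev/sdd1].read_io_errs     0
[/dev/sdd1].flush_io_errs    0
[/dev/sdd1].corruption_errs  20
[/dev/sdd1].generation_errs  0
"""

STAT_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)\s*$")


def nonzero_counters(text):
    """Return (device, counter, value) for every counter that is not zero."""
    hits = []
    for line in text.splitlines():
        m = STAT_RE.match(line)
        if m and int(m["value"]) > 0:
            hits.append((m["dev"], m["counter"], int(m["value"])))
    return hits


if __name__ == "__main__":
    for dev, counter, value in nonzero_counters(STATS):
        print(f"ERRORS on {dev}: {counter} = {value}")
```

Note the read and write error counters are all zero: the drive never reported a failed read, it just handed back data that didn't match the checksum.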

 

A scrub this morning confirmed the data corruption:

 

root@Tower1:~# btrfs scrub status /mnt/wbackups/
UUID:             9c12f50a-ad56-4a61-934a-4b1ee064cae9
Scrub started:    Sun Apr 18 12:26:01 2021
Status:           finished
Duration:         1:21:57
Total to scrub:   545.20GiB
Rate:             113.59MiB/s
Error summary:    csum=650
  Corrected:      0
  Uncorrectable:  650
  Unverified:     0
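The scrub summary can be read the same way. The sketch below parses the status text pasted above and gives a rough estimate of how much data is affected, assuming each counted csum error corresponds to one 4096-byte block (that matches the "length 4096" in the kernel messages, but it is still an assumption).

```python
# Output of `btrfs scrub status /mnt/wbackups/` as pasted above.
SCRUB = """\
UUID:             9c12f50a-ad56-4a61-934a-4b1ee064cae9
Scrub started:    Sun Apr 18 12:26:01 2021
Status:           finished
Duration:         1:21:57
Total to scrub:   545.20GiB
Rate:             113.59MiB/s
Error summary:    csum=650
  Corrected:      0
  Uncorrectable:  650
  Unverified:     0
"""

BLOCK = 4096  # assumption: one csum error per 4 KiB block


def scrub_fields(text):
    """Split 'Key: value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def uncorrectable_count(text):
    """Number of errors the scrub could not repair."""
    return int(scrub_fields(text)["Uncorrectable"])


if __name__ == "__main__":
    n = uncorrectable_count(SCRUB)
    print(f"{n} uncorrectable errors, roughly {n * BLOCK / 2**20:.2f} MiB of data")
```

Under that assumption the damage is in the low megabytes out of 545 GiB, which is exactly the kind of thing that goes unnoticed without checksums.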

 

Note that this is a single disk btrfs device, so there is no redundancy to fix data corruption (this is only used as another backup destination). The syslog identifies the corrupt file, one of several zip files from a remote backup that are synced to that disk:

 

Apr 18 12:30:47 Tower1 kernel: BTRFS warning (device sdd1): checksum error at logical 271712112640 on dev /dev/sdd1, physical 38710136832, root 5, inode 5879, offset 162140160, length 4096, links 1 (path: SageBackups/.stversions/503951269202104071502.zip)

 

So yeah, while this should never happen, i.e., devices should be able to reallocate sectors without corrupting data, here is proof once again that it can happen.

Link to comment
17 hours ago, hawihoney said:

Sure that this is not a BTRFS problem?

Yes, AFAIK there aren't, and never were, any issues with btrfs checksum verification. I also confirmed that the zip file really was corrupt: it gave an error on extraction.

 

And you want me to believe that, after using btrfs almost exclusively for the last 4 or 5 years and never finding any data corruption (or any other issues for that matter) other than twice after similar events with the devices, it was a coincidence? Or a false positive (which I can confirm it wasn't in this case)? It wouldn't make any sense.

Link to comment
39 minutes ago, hawihoney said:

I just said that BTRFS shouldn't be ignored during investigation.

btrfs was what allowed the data corruption to be detected, and since it was real and correctly detected it just confirms this is an issue, at least sometimes.

Link to comment
