HELP : Bad sector during 2 drives upgrade

April 22, 2015

Guys, I need some advice please...

I have an unRAID server with 19 drives that I've used for a while and upgraded over the years with no problems whatsoever. Lately I've been upgrading the drives again in batches; almost all of them are 3-4TB now, except for 2.

Last week I got 2 more 4TB drives to replace the two remaining 2TB drives. I ran preclear on the two new drives, which finished sometime last weekend. The preclear was alright, so I replaced disk4 with one of the 4TB drives, ran the drive expansion and everything was smooth. I ended up with a disk4 that has 2.2TB free. Over the next few days I moved some files to that drive to free up space on some of the other drives. No big deal, no read/write errors so far.

So earlier today I proceeded with upgrading the second drive (disk13). I stopped the array, replaced the last 2TB drive in my system and waited for it to expand the filesystem to 4TB. Everything went alright... so I left it for a bit. Then I came back to this dreadful sight:

disk4 Browse /mnt/disk4 WDC_WD40EFRX-68WT0N0_WD-WCC00000000 (sdf) 3907018532 36°C 4 TB 1.49 TB 299703 10 29824 <-- (last numbers = 'Errors')

Data-Rebuild in progress. Cancel will stop Data-Rebuild.

WARNING: canceling Data-Rebuild will leave the array unprotected!

Total size: 4 TB

Current position: 151.63 GB (4%)

Estimated speed: 564.92 KB/sec

Estimated finish: 113020 minutes
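
To put those figures in perspective, the estimate can be roughly reproduced from the numbers shown above. This is only a sketch: it assumes the sizes are decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes), that the 3907018532 figure in the disk4 line is a size in 1 KiB blocks, and that disk13 is the same size. None of that is confirmed in the thread.

```python
# Rough reproduction of the Data-Rebuild estimate from the status above.
# Assumptions (not confirmed by the thread): decimal units, and the
# 3907018532 figure is the disk size in 1 KiB blocks.

TOTAL_BYTES = 3907018532 * 1024   # roughly 4 TB
POSITION_BYTES = 151.63e9         # "Current position: 151.63 GB"
SPEED_BYTES_PER_SEC = 564.92e3    # "Estimated speed: 564.92 KB/sec"


def eta_minutes(total, position, speed):
    """Minutes left if the rebuild keeps running at the current speed."""
    return (total - position) / speed / 60


minutes = eta_minutes(TOTAL_BYTES, POSITION_BYTES, SPEED_BYTES_PER_SEC)
print(f"about {minutes:,.0f} minutes, or {minutes / 1440:.0f} days")
```

Under those assumptions the result lands within about one percent of the 113020 minutes reported, which works out to roughly eleven weeks if the speed never recovers.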

and my syslog is filled with these:

Result: hostbyte=0x00 driverbyte=0x08
sd 1:0:4:0: [sdf]
Sense Key : 0x3 [current]
Info fld=0x11417e78
sd 1:0:4:0: [sdf]
ASC=0x11 ASCQ=0x0
sd 1:0:4:0: [sdf] CDB:
cdb[0]=0x88: 88 00 00 00 00 00 11 41 7c d0 00 00 04 00 00 00
end_request: critical target error, dev sdf, sector 289504464
md: disk4 read error, sector=289504400
md: disk4 read error, sector=289504408
md: disk4 read error, sector=289504416
md: disk4 read error, sector=289504424
md: disk4 read error, sector=289504432
md: disk4 read error, sector=289504440
md: disk4 read error, sector=289504448
md: disk4 read error, sector=289504456
md: disk4 read error, sector=289504464
md: disk4 read error, sector=289504472
md: disk4 read error, sector=289504480
md: disk4 read error, sector=289504488
md: disk4 read error, sector=289504496
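
For anyone trying to read these lines, the command block and sense data can be decoded by hand. The sketch below assumes the standard SCSI READ(16) layout (opcode 0x88, bytes 2-9 the starting LBA, bytes 10-13 the transfer length) and the standard meanings of sense key 0x3 and ASC/ASCQ 0x11/0x00; the variable names are just illustrative.

```python
# Decode the READ(16) command block and sense data from the syslog excerpt.
# Layout assumed: byte 0 opcode, bytes 2-9 starting LBA (big-endian),
# bytes 10-13 transfer length in sectors.

CDB_HEX = "88 00 00 00 00 00 11 41 7c d0 00 00 04 00 00 00"

SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def decode_read16(cdb_hex):
    """Return (starting LBA, sector count) of a READ(16) command block."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) command block")
    lba = int.from_bytes(cdb[2:10], "big")
    length = int.from_bytes(cdb[10:14], "big")
    return lba, length


lba, length = decode_read16(CDB_HEX)
failed_lba = 0x11417E78  # "Info fld" from the sense data

print(f"request: {length} sectors starting at LBA {lba}")
print(f"failing LBA: {failed_lba} ({failed_lba - lba} sectors into the request)")
print(f"sense: {SENSE_KEYS[0x3]} / {ASC_ASCQ[(0x11, 0x00)]}")
```

The decoded start LBA matches the sector in the end_request line (289504464). The md-level errors begin 64 sectors lower (289504400), which would be consistent with the data partition starting at sector 64, though that is an inference and not something the log states. A medium error with an unrecovered read usually means the drive itself could not read the platter, rather than a transport problem.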

TL;DR: I replaced two 2TB drives with 4TB drives consecutively in my system, and during the replacement of the second drive, the first drive that I had just replaced crapped out.

What should I do???

1). Can I undo the file system expansion of the second drive, put the original 2TB drive back in its place so I can tend to disk4 which is dying?

2). Should I just wait it out and let this drive rebuild finish, before dealing with the bad sector?

3). Or should I move the files from the 'phantom disk' (the second drive that has just been expanded and being recreated), clear it, then remove that disk altogether from the array, then replace disk4 ?

Is there any similar case in the forum? I've tried searching for the error msg but didn't find any.

What are your thoughts?

Thanks in advance,

PS: No, I did not run a parity check in between the drive upgrades.

syslog_sdq.txt

smart_sdq.txt

April 22, 2015

Deep breath - don't do anything hasty.

Likely disk4 is fine, but the cabling to the drive might not be secure.

You still have the 2 2T disks, so you have that data backed up. The data I might be more worried about is any data you copied to disk4 after its rebuild finished. But likely data loss is minimal if at all.

First question - are the errors growing as the rebuild of disk13 continues? If not, I'd let it complete, and plan to do MD5 comparisons against the original disks after the rebuild (and the shutdown/cable tightening on disk4) are done.

But if the sector errors are growing at an alarming rate, I might suggest stopping the rebuild, stopping the array, powering down, and checking the cabling on disk4, and trying again. You'll still want to do the md5 comparisons after all is rebuilt.

It is always a good idea to do a parity check before and after a disk replacement. (Sorry, had to say it.) I admit to an occasional shortcut on the AFTER parity check: I will sometimes run a short parity check of a couple hundred gig (which runs pretty quickly) to give confidence that the rebuild operation was successful. But always, always run a full parity check within a couple of days of starting the rebuild.

[UPDATE] - just noticing the smart report

disk4 has over 600 pending sectors. This is not good and is not caused by loose cabling. Did you preclear the disk? Were these present after the preclear? Normally these types of issues are detected by the preclear. I am also noticing that a self-test failed. Did you initiate that self-test?

I might advise that you reinsert the 2 2T drives, do a new config, reassign all of your disks the way they were before you started, and rebuild parity. You'd be back to square one. You could return the new 4T (with the pending sectors) for a new drive, preclear the other disk (if you haven't), and do the rebuild of disk4 or disk13, and wait for the new drive which you'd preclear and then replace the other one.

But there is the issue of the files you copied to disk4 AFTER the rebuild. You would lose those if you did the above. So you may want to try to copy those files off the bad disk4, maybe to a workstation or another disk, before removing it from the array. There may be corruption, but 619 pending sectors is a very small number compared to the size of the disk, so there may not be much, if any, corruption of those specific files. You might also have these files backed up (or maybe you didn't delete them from their original locations). If these files are backed up or easily recovered, you can ignore them. But otherwise you need to do what you can while you still have that disk to get the data off it.
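
For a sense of scale, here is a quick back-of-the-envelope calculation. It assumes the 619 refers to the pending-sector count, that a pending sector is either a 512-byte logical or a 4096-byte physical sector (the thread doesn't say which), and that the 3907018532 figure from the first post is the disk size in 1 KiB blocks.

```python
PENDING_SECTORS = 619
DISK_BYTES = 3907018532 * 1024  # assumed: size in 1 KiB blocks, roughly 4 TB


def affected_bytes(pending, sector_size):
    """Bytes covered by the pending sectors for a given sector size."""
    return pending * sector_size


for sector_size in (512, 4096):
    affected = affected_bytes(PENDING_SECTORS, sector_size)
    share = affected / DISK_BYTES
    print(f"{sector_size}-byte sectors: {affected:,} bytes unreadable "
          f"({share:.1e} of the disk)")
```

Either way it comes to a few megabytes at most, but a single unreadable sector inside a file is enough to corrupt that file, so the amount alone doesn't tell you how many files are affected.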

April 22, 2015

Author

Thanks for taking the time to analyze my condition. Yes, I did switch the new drives to 'known good slots' of the replaced drives earlier when I was troubleshooting. I have done nothing but start/stop parity recreation process and reboot the server, fearing any move that would void my raid configuration. But as I think about my situation, I think I may have to just go back to the 2 working 2TB drives, recreate the raid and salvage what files I could from the messed up first 4TB drive (about 700MB was moved to it).

Yes, both new drives were precleared together. This is the report of the now-failing drive:

== invoked as: /boot/preclear_disk.sh /dev/sdw
==  WDC WD40EFRX-68WT0N0    WD-WCC4E1858715
== Disk /dev/sdw has been successfully precleared
== with a starting sector of 1 
== Ran 1 cycle
==
== Using :Read block size = 8225280 Bytes
== Last Cycle's Pre Read Time  : 10:40:37 (104 MB/s)
== Last Cycle's Zeroing time   : 10:15:31 (108 MB/s)
== Last Cycle's Post Read Time : 22:17:25 (49 MB/s)
== Last Cycle's Total Time     : 43:14:32
==
== Total Elapsed Time 43:14:32
==
== Disk Start Temperature: 32C
==
== Current Disk Temperature: 35C, 
==
============================================================================
** Changed attributes in files: /tmp/smart_start_sdw  /tmp/smart_finish_sdw
                ATTRIBUTE   NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE
      Temperature_Celsius =   117     120            0        ok          35
No SMART attributes are FAILING_NOW

0 sectors were pending re-allocation before the start of the preclear.
0 sectors were pending re-allocation after pre-read in cycle 1 of 1.
0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
0 sectors are pending re-allocation at the end of the preclear,
    the number of sectors pending re-allocation did not change.
0 sectors had been re-allocated before the start of the preclear.
0 sectors are re-allocated at the end of the preclear,
    the number of sectors re-allocated did not change. 
============================================================================

OK, thanks for the advice. As I pondered and accepted my fate, I came to the same conclusion.

April 22, 2015

Author

I left my computer running the doomed Data-Rebuild, and last I saw it was crawling at less than 1MB/sec at the 150GB mark... when I got back from dinner, it was blazing at 100MB/sec again. If the data rebuild ends without any more errors (I've noted the # of errors), should I just swap this bad disk and work from there?

April 22, 2015

I would do the md5 comparison of the files on the new disk against the original 2T disk. I am not sure if all those errors caused corruption or not. I would not be surprised to see some data errors. Since the failing disk4 was used to rebuild disk13, both are at risk for data errors. But I do think the ~700MB (that is a tiny amount, are you sure it is not 700G?) will likely be salvageable if you copy it to a good disk, although no promises.
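
As a rough illustration of that comparison, something along these lines would work. It is a minimal sketch, not a tested recovery tool: the function names are made up for this example, and it only checks files that exist on the original disk, so anything copied to the new disk afterwards is ignored.

```python
import hashlib
import os


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original_root, rebuilt_root):
    """Compare every file under original_root with its rebuilt counterpart.

    Returns (missing, different): relative paths absent from rebuilt_root,
    and relative paths whose MD5 digests differ.
    """
    missing, different = [], []
    for folder, _subdirs, files in os.walk(original_root):
        for name in files:
            original = os.path.join(folder, name)
            relative = os.path.relpath(original, original_root)
            rebuilt = os.path.join(rebuilt_root, relative)
            if not os.path.isfile(rebuilt):
                missing.append(relative)
                continue
            try:
                same = md5_of(original) == md5_of(rebuilt)
            except OSError:
                same = False  # an unreadable file counts as a mismatch
            if not same:
                different.append(relative)
    return sorted(missing), sorted(different)
```

You would call it as compare_trees("/mnt/old_disk13", "/mnt/disk13"), where the first path is a hypothetical mount point for the old 2T disk (mounted read-only, outside the array) and the second is the rebuilt disk. Any path in the second list is a file to copy back from the original.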

April 22, 2015

Author

Sorry, I meant 700GB.

I'm thinking... right now my disk4 has a black hole, and disk13 was created with fragments of that black hole littered on it... but the crap on disk13 doesn't affect the integrity of the whole raid set. So my main priority is to replace disk4. Once that is done, I will compare the files on disk4 and disk13 with the original 2TB drives and replace as needed.

After that, all I have to worry about is the 700GB, right? The rest should be 100% OK.
