5-disk array: one disk emulated, another disk throwing sector read errors 1.86TB into the rebuild. Best course of action?


iarp


Well. Probably the worst-case scenario has happened for me.

 

I'm not entirely sure where to go from here.

 

One 4TB parity drive, 4 data drives. One data drive is currently emulated and I bought a replacement, but 1.86TB into the 2TB rebuild another drive started throwing sector read errors.

 

While the array still starts, should I copy the emulated disk's contents onto another drive outside the array? Perhaps also attempt to copy off the other disk, the one with read errors? Or will that just not work at all?

storage-diagnostics-20210115-0628.zip


Disk 5 is emulated; disk 3 is the one throwing sector read errors.

 

I have syslog copied to flash, which I forgot diagnostics doesn't capture.

 

syslog(2)_trimmed is the latest one from the flash drive; it's 890MB in size. All I "trimmed" from the file was an unbelievable number of lines like "Jan 15 05:48:00 storage kernel: md: disk3 read error, sector=3679445936". The disk2 errors occurred from wiggling a SATA cable, thinking that might have been disk3's issue. After that happened I replaced the loose SATA cables and disk2 has not had issues since.
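(For what it's worth, the "trimming" was just a grep inversion along these lines; syslog and syslog_trimmed stand in for whatever the copied file is actually named.)

# Drop the endlessly repeated disk3 read-error lines, keep everything else
grep -v 'md: disk3 read error' syslog > syslog_trimmed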

syslog_20210115_0702.zip


Is there some way I can test the original disk3 and disk5 drives and tell whether they're still good?

 

disk3 is actually a rebuild from 4-5 days ago, so I have the very original disk3 as well as the currently active rebuilt disk3 in the array.

 

I'm starting to question whether loose cables were the original issue that started all this.


A SMART short self-test on both the original disk3 and disk5 returns "Completed without error".
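(For reference, I ran the tests from the console roughly like this, with /dev/sdX standing in for whichever device node each drive came up as.)

# Start a SMART short self-test (runs in a couple of minutes)
smartctl -t short /dev/sdX
# Then read back the full report, including the self-test log
smartctl -a /dev/sdX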

 

I'm thinking I should probably take the 4TB I was using to rebuild disk5 and see if I can successfully copy the original disk3 and disk5 onto it.


Looks like it'll be several hours before that completes.

 

Hypothetical thoughts/questions here:

So every Sunday I start up another machine running my second license of Unraid and run "rsync -avW ..." to basically image all shares from this machine except video and tvshows.
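(The actual command is along these lines; the destination host and paths here are placeholders from memory, and the excludes match the shares I skip.)

# Mirror every share to the second box, skipping the two huge video shares
rsync -avW --exclude='video/' --exclude='tvshows/' /mnt/user/ root@backup:/mnt/user/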

 

As it currently stands, disk5 is emulated and disk3 has read errors. If the array were started and I wanted to copy something off, as long as the file isn't in the error portion of disk3 I should get a clean file. But if disk3 hits a read error, would the copy fail, or would I get a corrupted file and just not know it?

 

I'm thinking of running out and grabbing 3 more 4TBs and basically just rebuilding from scratch. Mount and copy the contents of disk1 and disk2 onto the new array, and then somehow use rsync to copy the missing things back?

11 minutes ago, iarp said:

As it currently stands, disk5 is emulated and disk3 has read errors. If the array were started and I wanted to copy something off, as long as the file isn't in the error portion of disk3 I should get a clean file. But if disk3 hits a read error, would the copy fail, or would I get a corrupted file and just not know it?

You'll get an I/O error, i.e. the copy will fail.
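(In shell terms: a read failure makes the copy command abort and exit non-zero rather than silently writing a bad file; the path below is just an example.)

# cp aborts with an I/O error on an unreadable sector; the exit code reflects it
cp /mnt/disk3/some/file /tmp/ || echo "copy failed with exit code $?"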


OK. So I went for a walk and spent some time thinking.

 

In my panic, having never dealt with a multi-drive failure before, I completely forgot about my second machine that duplicates everything. So really I may have only lost what changed since Sunday morning.

 

I think I'll just cut my losses and give up on attempting to repair this array. There are too many errors and issues going on; I mean, how can I rebuild one disk that's missing while another is having read errors?

 

If I can, I'm going to start the array and at least try to grab a document I spent 6 hours working on that Sunday evening, and hope it's OK.

 

I'll be heading out to get some more new 4TB drives to set up a new array. I don't want to lose all my user/share configs; do I just run New Config and assign the new drives from there?

 

In the end, do you think there would have been any way to recover this mess?


I've attached both SMART reports. I should note that disk3 was rebuilt 2-3 days ago onto the current disk3 that's somehow now failing.

 

But this is the one and only disk5 that I've been trying to rebuild.

 

live-disk3-WDC - is the current disk3 in the array that's erroring

disk3-WDC - is the very original disk3 that was replaced

 

I have not yet bought those 4TB drives I was thinking of buying. I did, however, buy a bunch of new SATA cables and replaced all of them. They all fit very firmly and clicked into place with no movement.

 

disk3-WDC_WD20EARX-00MMMB0_WD-WCAWZ2814440-20210115-1413.txt disk5-WDC_WD20EARX-00PASB0_WD-WMAZA6594240-20210115-1445.txt

 

live-disk3-WDC_WD20EFRX-68AX9N0_WD-WMC300660429-20210115-2232.txt


So just to confirm: New Config, then reassign the original parity, disk1, and disk2, followed by the original disk5, and for disk3 I want WCAWZ2814440, which claims to have 9 power-on hours?

 

I have a hard time believing it actually has 9 hours, because I can look at my screenshots of the 2018, 2019, 2020, and 2021 disk configurations and that drive has been in use just as long as the rest.

 

If I added another parity drive after doing the New Config, would that be OK or would it mess something up?


I don't think I should've used drive 4440 for disk3. The rebuild is going at a blazing 208KB/s.

 

Edit: Actually, I stopped the rebuild and restarted, and as I'm watching the syslog it's disk5 that's taking for-bloody-ever to mount and be read.

 

Jan 16 11:48:48 storage emhttpd: shcmd (143): mount -t xfs -o noatime,nodiratime /dev/md4 /mnt/disk4
Jan 16 11:48:48 storage kernel: SGI XFS with ACLs, security attributes, no debug enabled
Jan 16 11:48:48 storage kernel: XFS (md4): Mounting V5 Filesystem
Jan 16 11:50:39 storage nginx: 2021/01/16 11:50:39 [error] 11003#11003: *36 upstream timed out (110: Connection timed out) w>
Jan 16 11:51:11 storage ntpd[1766]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 16 11:55:04 storage kernel: XFS (md4): Ending clean mount
Jan 16 11:55:28 storage emhttpd: shcmd (144): xfs_growfs /mnt/disk4
Jan 16 11:55:28 storage root: meta-data=/dev/md4               isize=512    agcount=4, agsize=122094660 blks
Jan 16 11:55:28 storage root:          =                       sectsz=512   attr=2, projid32bit=1
Jan 16 11:55:28 storage root:          =                       crc=1        finobt=1, sparse=0, rmapbt=0
Jan 16 11:55:28 storage root:          =                       reflink=0
Jan 16 11:55:28 storage root: data     =                       bsize=4096   blocks=488378638, imaxpct=5
Jan 16 11:55:28 storage root:          =                       sunit=0      swidth=0 blks
Jan 16 11:55:28 storage root: naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
Jan 16 11:55:28 storage root: log      =internal log           bsize=4096   blocks=238466, version=2
Jan 16 11:55:28 storage root:          =                       sectsz=512   sunit=0 blks, lazy-count=1
Jan 16 11:55:28 storage root: realtime =none                   extsz=4096   blocks=0, rtextents=0
Jan 16 11:55:28 storage emhttpd: shcmd (145): sync
Jan 16 11:55:29 storage emhttpd: shcmd (146): mkdir /mnt/user

 

Edit, 4 hours later: disk3 just vanished from the array. I stopped the rebuild and restarted, and it's back. I then used dd to test the read speeds of both disk3 and disk5 while mounted outside the array; disk3 can barely break 200KB/s and disk5 maxes out at 2.5MB/s.
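(The speed test was just a raw sequential read with dd, something like this, with sdX being the disk under test.)

# Read 1GB straight off the raw device, bypassing the page cache, and show throughput
dd if=/dev/sdX of=/dev/null bs=1M count=1024 iflag=direct status=progress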


Oh I just realized I never updated my last post.

 

After disk3 (sdb in the graphs above) vanished during the parity rebuild, it dawned on me that disk1 and disk2 still had all their data. So I ran out and bought new 4TBs.

 

In the graphs above, parity 1 and 2 and disks 1, 2, and 3 are all new 4TB drives. Disks 4 and 5 are the original disk1 and disk2 respectively. The parity rebuild completed early this morning.

 

It's been at least 6 years since I last used ddrescue. I'm going to give it a shot on the original disk5 (sdi in those graphs above) and see what it can save for me. The original disk3 (sdb) I think is lost, as it just vanishes after a couple of minutes of attempting to access any file. I'm surprised it lasted long enough to do the benchmark, really.
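(From memory the invocation is something like the below; the mapfile is what lets ddrescue resume and come back for the bad areas. Device names and the mapfile name are placeholders.)

# Pass 1: copy everything readable, skipping bad areas quickly (-n = no scraping)
ddrescue -f -n /dev/sdi /dev/sdX disk5.map
# Pass 2: go back and retry the skipped/bad sectors up to 3 times
ddrescue -f -r3 /dev/sdi /dev/sdX disk5.map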

 

Also, seeing those graphs really highlights just how badly one drive drags down the rest. The parity rebuild was going at 35MB/s in the 1.5TB-to-2TB range, and once it hit 2 of 4TB completed it instantly shot up to 151MB/s. So I'll be glad to get rid of those old drives just for the array speed alone.


After the saga of dealing with this, and with the idea of loose cables now in my mind, I'm a little worried about putting this machine back together.

 

I use 2 Supermicro 5-in-3 cages, but the case I have available is a bit tight, so to make room to work I just placed them on top of the case and ran the new SATA cables to the cages. Thankfully everything has been running fine, but I'd like to get the cages back in and hopefully not have more errors.

 

In the event that errors appear and a bad cable connection is the most likely suspect, where would I go from there? Does it require another rebuild or check, or do I just make another post with diagnostics attached and work things out from there?

 

disk5 has been reading at 1.08 to 1.53MB/s for over a day now. disk3 I think is lost. I did some googling about WD Greens and slow responsiveness, and I remember one of the YouTube videos mentioning it's a common issue with WD Greens: something to do with the firmware bugging out once the number of bad sectors grows too high. I think he was able to fix the drive easily, but of course the software and machine to do the fixing cost many thousands.

 

Update four days later, to put an end to this: Thankfully I had that second server duplicating data, which meant the bulk of my information was back up and running quickly. For the shares I did not have duplicated (because they're too big and I don't much care about losing them), the rebuild drive I had used for disk3 that started throwing sector errors still worked as long as you skipped the files sitting on bad sectors. So it was easy enough to run rsync and watch syslog for sector error messages; whenever a message appeared I would kill rsync, add the file's name to the ignore list, and restart rsync. Using this method I've lost only 16 files to sector issues.
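(Concretely, each pass was something like this; ignore.list just accumulates the bad file paths one per line, and the destination path is a placeholder for wherever the restore was going.)

# Copy the share, skipping every file known to sit on a bad sector
rsync -av --exclude-from=ignore.list /mnt/disk3/ /mnt/user/restore/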

 

Thanks to everyone for their help.

