iarp Posted January 15, 2021

Well, probably the worst-case scenario has happened for me, and I'm not entirely sure where to go from here. I have one 4TB parity drive and four data drives. One data drive is currently being emulated; I bought a replacement, but 1.86TB into the 2TB rebuild another drive started throwing sector read errors. While the array still starts, should I copy the emulated disk's contents onto a drive outside the array? Should I also attempt to copy off the disk with the read errors, or will that simply not work at all?

storage-diagnostics-20210115-0628.zip
JorgeB Posted January 15, 2021

The diagnostics are from after a reboot; which disk had the read error?
iarp Posted January 15, 2021

disk5 is emulated; disk3 is the one throwing sector read errors. I have the syslog copied to flash, which I forgot the diagnostics don't capture. syslog(2)_trimmed is the latest one from the flash drive; it's 890MB in size. All I "trimmed" from the file was an unbelievable number of repeats of "Jan 15 05:48:00 storage kernel: md: disk3 read error, sector=3679445936". The disk2 errors occurred when I wiggled a SATA cable, thinking that might have been disk3's issue. After that happened I replaced the loose SATA cables, and disk2 has not had issues since.

syslog_20210115_0702.zip
JorgeB Posted January 15, 2021

Disk3 appears to really be failing. The best bet to recover as much data as possible would be to clone disk3 with ddrescue (to a disk of the same size), then use the clone to rebuild disk5, though there will always be some data loss on both disks. Is the old disk5 completely dead?
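For reference, a typical two-pass ddrescue clone looks roughly like the sketch below. The device names are placeholders (check yours with lsblk first), and the $ECHO guard makes the script print the commands instead of running them until you remove it; this is an illustration of the approach, not a command to paste blindly.

```shell
# Placeholder devices -- verify with lsblk before doing this for real.
SRC=/dev/sdX        # the failing disk3
DST=/dev/sdY        # same-size clone target (its contents will be overwritten)
ECHO=echo           # delete "$ECHO" below to actually run ddrescue

# Pass 1: copy everything readable, skipping bad areas quickly (-n);
# the map file records progress so the rescue can be stopped and resumed.
$ECHO ddrescue -f -n "$SRC" "$DST" disk3.map

# Pass 2: go back and retry the remaining bad areas up to three times (-r3).
$ECHO ddrescue -f -r3 "$SRC" "$DST" disk3.map
```

The map file is what makes ddrescue safe to interrupt: rerunning the same command picks up where the previous run left off.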
iarp Posted January 15, 2021

Is there some way I can test and tell using the original disk3 and disk5 drives? disk3 is actually a rebuild from 4-5 days ago, so I have the very original disk3 in addition to the currently active rebuilt disk3 in the array. I'm starting to question whether a loose cable was the original issue that started all this.
JorgeB Posted January 15, 2021

4 minutes ago, iarp said:
Is there some way I can test and tell using the original disk3 and disk5 drives?

You can mount them and check SMART with the UD plugin; the array should be stopped because of the duplicate UUIDs.
iarp Posted January 15, 2021

A SMART short self-test on the original disk3 and disk5 returns "Completed without error". I'm thinking I should probably take the 4TB drive I was using to rebuild disk5 and see if I can successfully copy the original disk3 and disk5 onto it.
JorgeB Posted January 15, 2021

Better to run an extended test; the full SMART report might also give a better idea.
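From the command line this would be something along the lines of the sketch below; sdX is a placeholder for the disk under test, and the $ECHO guard prints the commands rather than running them so the sketch is safe to try as-is.

```shell
DEV=/dev/sdX     # placeholder -- the disk to test
ECHO=echo        # delete "$ECHO" below to really issue the commands

# Start the long (extended) self-test; it runs inside the drive's
# firmware and typically takes several hours on a 2TB disk.
$ECHO smartctl -t long "$DEV"

# Hours later: print the full SMART report, including the self-test log
# and the attribute table (reallocated/pending sectors, etc.).
$ECHO smartctl -a "$DEV"
```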
iarp Posted January 15, 2021

Looks like it'll be several hours before that completes. Some hypothetical thoughts/questions in the meantime:

Every Sunday I start up another machine running my second Unraid license and run "rsync -avW ..." to basically image all shares from this machine except video and tvshows.

As it currently stands, disk5 is being emulated and disk3 has read errors. If the array were started and I wanted to copy something off, then as long as the file isn't in the damaged portion of disk3 I should get a clean file. But if disk3 does hit a read error, would the copy fail, or would I get a corrupted file and just not know it?

I'm thinking of running out and grabbing 3 more 4TB drives and basically rebuilding from scratch: mount and copy the contents of disk1 and disk2 onto the new array, then somehow use rsync to copy the missing things back.
JorgeB Posted January 15, 2021

11 minutes ago, iarp said:
As it currently stands, disk5 is being emulated and disk3 has read errors. If the array were started and I wanted to copy something off, then as long as the file isn't in the damaged portion of disk3 I should get a clean file. But if disk3 does hit a read error, would the copy fail, or would I get a corrupted file and just not know it?

You'll get an I/O error, i.e. the copy will fail.
iarp Posted January 15, 2021

OK. So I went for a walk and spent some time thinking. In my panic at never having dealt with a multi-drive failure before, I completely forgot about my second machine that duplicates everything. So really I may have only lost what changed since Sunday morning.

I think I'll just cut my losses and stop trying to repair this array; there are too many errors and issues going on. I mean, how can I rebuild one disk that's missing while another is having read errors? If I can, I'm going to start the array and at least try to get a document I spent 6 hours working on that Sunday evening, and hope it's OK.

I'll be heading out to get some new 4TB drives to set up a new array. I don't want to lose all my user/share configs; do I just run New Config and assign the new drives from there?

In the end, do you think there would've been any way to recover this mess?
JorgeB Posted January 15, 2021

29 minutes ago, iarp said:
In the end, do you think there would've been any way to recover this mess?

Yes, the one I mentioned above, or doing a new config with the old disks if they really are OK.
iarp Posted January 15, 2021

I've attached both SMART reports. I should note that disk3 was rebuilt 2-3 days ago onto the current disk3 that's somehow now failing. But this is the one and only disk5 that I've been trying to rebuild.

live-disk3-WDC - the current disk3 in the array that's erroring
disk3-WDC - the very original disk3 that was replaced

I have not yet bought those 4TB drives I was thinking of buying. I did, however, buy a bunch of new SATA cables and replaced all of them. They're all very firmly fitted and clicked into place with no movement.

disk3-WDC_WD20EARX-00MMMB0_WD-WCAWZ2814440-20210115-1413.txt
disk5-WDC_WD20EARX-00PASB0_WD-WMAZA6594240-20210115-1445.txt
live-disk3-WDC_WD20EFRX-68AX9N0_WD-WMC300660429-20210115-2232.txt
JorgeB Posted January 16, 2021

Both "old" disks look fine, and disk3 is practically new. You can just do a new config with both of them plus the rest of the array and re-sync parity; obviously you will lose any data written to those disks after they were replaced.
iarp Posted January 16, 2021

So just to confirm: New Config, then reassign the original parity, disk1, and disk2, followed by the original disk5, and for disk3 I want WCAWZ2814440, which claims to have 9 power-on hours? I have a hard time believing it actually has 9 hours, because my screenshots of the 2018, 2019, 2020, and 2021 disk configurations show that drive has been in use just as long as the rest.

If I added a second parity drive after doing the new config, would that be OK or would it mess something up?
trurl Posted January 16, 2021

1 hour ago, iarp said:
If I added a second parity drive after doing the new config, would that be OK or would it mess something up?

Should be OK. You have to rebuild parity anyway, so you might as well do both at once.
iarp Posted January 16, 2021

I don't think I should've used drive 4440 for disk3. The rebuild is going at a blazing 208KB/s.

Edit: Actually, I stopped the rebuild and restarted, and as I'm watching the syslog it's disk5 that's taking forever to mount and be read.

Jan 16 11:48:48 storage emhttpd: shcmd (143): mount -t xfs -o noatime,nodiratime /dev/md4 /mnt/disk4
Jan 16 11:48:48 storage kernel: SGI XFS with ACLs, security attributes, no debug enabled
Jan 16 11:48:48 storage kernel: XFS (md4): Mounting V5 Filesystem
Jan 16 11:50:39 storage nginx: 2021/01/16 11:50:39 [error] 11003#11003: *36 upstream timed out (110: Connection timed out) w>
Jan 16 11:51:11 storage ntpd[1766]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 16 11:55:04 storage kernel: XFS (md4): Ending clean mount
Jan 16 11:55:28 storage emhttpd: shcmd (144): xfs_growfs /mnt/disk4
Jan 16 11:55:28 storage root: meta-data=/dev/md4 isize=512 agcount=4, agsize=122094660 blks
Jan 16 11:55:28 storage root: = sectsz=512 attr=2, projid32bit=1
Jan 16 11:55:28 storage root: = crc=1 finobt=1, sparse=0, rmapbt=0
Jan 16 11:55:28 storage root: = reflink=0
Jan 16 11:55:28 storage root: data = bsize=4096 blocks=488378638, imaxpct=5
Jan 16 11:55:28 storage root: = sunit=0 swidth=0 blks
Jan 16 11:55:28 storage root: naming =version 2 bsize=4096 ascii-ci=0, ftype=1
Jan 16 11:55:28 storage root: log =internal log bsize=4096 blocks=238466, version=2
Jan 16 11:55:28 storage root: = sectsz=512 sunit=0 blks, lazy-count=1
Jan 16 11:55:28 storage root: realtime =none extsz=4096 blocks=0, rtextents=0
Jan 16 11:55:28 storage emhttpd: shcmd (145): sync
Jan 16 11:55:29 storage emhttpd: shcmd (146): mkdir /mnt/user

Edit 4 hours later: disk3 just vanished from the array. I stopped the rebuild, restarted, and it's back. I then used dd to test the speeds of both disk3 and disk5 while mounted outside the array; disk3 can barely break 200KB/s and disk5 maxes out at 2.5MB/s.
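For anyone repeating that kind of dd read test, it looks roughly like this; /dev/zero is used here as a harmless stand-in so the sketch can be copy-pasted, and on a real drive you would read from /dev/sdX (reading is non-destructive) and add iflag=direct so the page cache doesn't inflate the number.

```shell
# DISK defaults to /dev/zero so this is safe to try anywhere; point it
# at /dev/sdX for a real drive test, ideally with iflag=direct added.
DISK=${DISK:-/dev/zero}
dd if="$DISK" of=/dev/null bs=1M count=256 2>/tmp/ddspeed.txt
tail -n 1 /tmp/ddspeed.txt   # last line reports bytes copied and the rate
```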
JorgeB Posted January 17, 2021

Run a test with the DiskSpeed docker and post the graph here.
iarp Posted January 17, 2021

[DiskSpeed benchmark graphs attached]
iarp Posted January 17, 2021

Oh, I just realized I never updated my last post. After disk3 (sdb in the graphs above) vanished during the parity rebuild, it dawned on me that disk1 and disk2 still had all their data. So I ran out and bought new 4TB drives. In the graphs above, parity 1 and 2 and disks 1, 2, and 3 are all new 4TB drives; disks 4 and 5 are the original disk1 and disk2 respectively. The rebuild of parity completed early this morning.

It's been at least 6 years since I used ddrescue. I'm going to give it a shot on the original disk5 (sdi in those graphs) and see what it can save for me. The original disk3 (sdb) I think is lost, as it just vanishes after a couple of minutes of attempting to access any file. I'm surprised it lasted long enough to do the benchmark, really.

Also, seeing those graphs really highlights just how badly one slow drive brings down the rest. The parity rebuild was going at 35MB/s in the 1.5TB to 2TB range, and once it passed 2TB of the 4TB it instantly shot up to 151MB/s. So I'll be glad to get rid of those old drives just for the array speed alone.
JorgeB Posted January 18, 2021

Yep, those disks are in very bad shape, but it's still worth trying ddrescue.
iarp Posted January 18, 2021

In the saga of dealing with this, and now with the idea of loose cables stuck in my mind, I'm a little worried about putting this machine back together. I use two Supermicro 5-in-3 cages, but the case I have available is a bit tight, so to make room to work I just placed the cages on top of the case and ran the new SATA cables to them. Thankfully everything has been running fine, but I'd like to get the cages back into the case and hopefully not cause more errors. If errors do appear and a bad cable connection is the likely suspect, where would I go from there? Does it require another rebuild or check, or should I just make another post with diagnostics attached and work it out from there?

disk5 has been reading at 1.08 to 1.53MB/s for over a day now. disk3 I think is lost. I did some googling about WD Greens and slow responsiveness, and one YouTube video mentioned it's a common issue with them: something to do with the number of bad sectors growing and the firmware bugging out once the count gets too high. He was able to fix the drive easily enough, but of course the software and machine needed to do the fixing cost many thousands.

Update four days later: To put an end to this. Thankfully I had that second server duplicating data, which meant the bulk of my information was back up and running quickly. For the shares I did not have duplicated (they're too big and I don't much care if I lose them), the rebuild drive I had used for disk3, the one that started throwing sector errors, worked as long as you skip the files sitting on bad sectors. So it was easy enough to use rsync and watch the syslog for sector-error messages: whenever a message appeared I would kill rsync, add the file's name to the exclude list, and restart rsync. Using this method I've only lost 16 files to sector issues. Thanks to everyone for their help.