New drive failing during rebuild


Ziggy

If it's not getting a lease, something is wrong with your network. You can rename network.cfg back, but I doubt it will make any difference.

Yeah, I'm starting to think you're right. I just got a DHCP lease, but pinging a Google DNS server shows about 73% packet loss, and I still can't load the web interface. I'm not sure why I'm not experiencing any issues on the other machine... Maybe the cable is dying... I'll move the hardware to my router and hook it up there to see what happens.
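
(For anyone wanting to measure this themselves: the loss figure comes from ping's summary line, which you can pull out with awk. The target address and ping count below are just examples, not anything specific to this setup.)

```shell
# Send a burst of pings and extract the loss percentage from the
# summary line. 8.8.8.8 (Google DNS) and the count of 50 are examples:
#   ping -c 50 8.8.8.8 | awk -F', ' '/packet loss/ {print $3}'
# The awk filter splits the summary on ", " and prints the third field.
# Demonstrated here on a sample summary line:
loss=$(echo "50 packets transmitted, 13 received, 74% packet loss, time 49076ms" \
    | awk -F', ' '/packet loss/ {print $3}')
echo "$loss"   # -> 74% packet loss
```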

Link to comment

Nope. It is definitely NOT an issue with my network. I moved the hardware and hooked it up straight to my router after testing the router and new Cat5 cable with another device. It's possible that the PCI NIC got damaged; it's been gathering dust for a couple of years now... I don't have another NIC to test :(.

 

EDIT: I'll go out to buy a new NIC, looks like I can find one el-cheapo for 10 bucks.

Link to comment

Replacing the NIC did not resolve the issue. I'm getting an IP lease, but there's loss when pinging other devices.

 

I am ready to start pulling my hair out at this point. I have NO idea what to try next.

 

EDIT: Putting the other motherboard in resolves the network issue. I am 100% positive it is not a physical issue.

Link to comment

Sorry for the quadruple post... So I slid my old motherboard back in and am having severe issues mounting my disks now... Multiple disks are unmountable, including my two RAID 0 SSD cache drives... https://gyazo.com/bfc06adeaa77e8a7dc2bc71b13a0aa2f

 

I'm assuming one of the filesystems became 'corrupt' because of all the hard resets I had to do to address the issues discussed above. I tried running xfs_repair on /dev/sde but it's been stuck at phase 1 for over an hour now...

 

Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
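
(As an aside for anyone landing here from a search: the usual xfs_repair sequence is sketched below. One common gotcha worth knowing: on unRAID the XFS filesystem lives on partition 1, so pointing xfs_repair at the whole device, /dev/sde, rather than the partition, /dev/sde1, will itself produce a "bad primary superblock" error. Device names here are placeholders; adjust them to your setup, and stop the array first so nothing holds the device open.)

```shell
# Placeholder device: replace sde1 with your disk's first partition.

# 1. Read-only check: reports problems without writing anything.
xfs_repair -n /dev/sde1

# 2. Actual repair. If the primary superblock is bad, the
#    "attempting to find secondary superblock..." scan can take as
#    long as reading the entire disk end to end.
xfs_repair /dev/sde1

# 3. Last resort if it refuses to run because of a dirty log: -L
#    zeroes the journal and may lose the most recent writes.
xfs_repair -L /dev/sde1
```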

ziggy_unraid-diagnostics-20160829-2141.zip

Link to comment

...it's been stuck at phase 1 for over an hour now...

 

This can take several hours, up to the same time as a full scan of the disk.

 

Thank you for putting up with me!

 

Alright, I'll let it run overnight. What about my two cache drives that are unmountable? They have BTRFS, how do I scan/repair those?

Link to comment

Try them one at a time to see if either mounts; the other one has to be physically disconnected from the server, as just unassigning it is not enough.

Nope, they're not mounting separately either (I tried disconnecting each of the SATA cables and then mounting the drives, no joy). For your information, I did put them in a RAID0.

Link to comment

Sorry, didn't notice that. The btrfs recovery tools are not the best; good luck:

 

https://lime-technology.com/wiki/index.php/Check_Disk_Filesystems#Drives_formatted_with_BTRFS
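
(The wiki page above boils down to roughly the following sequence. This is a sketch only: the device and mount paths are placeholders, and on newer kernels the `recovery` mount option has been renamed `usebackuproot`.)

```shell
# Placeholders throughout: /dev/sdf1 is one member of the cache pool,
# /mnt/recovery is any healthy location with enough free space.

# Read-only check: reports damage without modifying the filesystem.
btrfs check --readonly /dev/sdf1

# Try a read-only mount using backup roots ("recovery" on older
# kernels, "usebackuproot" on newer ones).
mount -o ro,degraded,recovery /dev/sdf1 /mnt/cache

# If it still won't mount, copy files straight off the raw device.
btrfs restore -v /dev/sdf1 /mnt/recovery
```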

 

Thanks a bunch. You weren't kidding about the btrfs tools, took me many hours to recover the contents...

 

I'm still stuck with disk two though. xfs_repair did manage to make it mountable again, but it's still spitting out errors in the CLI. On top of that, xfs_repair keeps crashing: https://gyazo.com/45901f8134be18ad90de7317af30b962 .

ziggy_unraid-diagnostics-20160830-1523.zip

Link to comment

If xfs_repair keeps crashing you can try two things: upgrade to v6.2 (it includes a newer xfs_repair), or move all data to other disk(s) and format that disk.

Hmm, upgrading to v6.2-rc4 did not resolve the issue. How would you recommend I go about moving the data to a new disk? Can I just mv everything, or should I use specialized tools?

Link to comment

I'm puzzled. I noticed that some of the disks in the array only went into error mode during a parity check. So I wiped the parity and rebuilt it, and after 48+ hours, the issues seem to have gone away...

 

Is it possible that a corrupt parity caused the parity check algorithm to crash?

Link to comment

Is it possible that a corrupt parity caused the parity check algorithm to crash?

 

Not really, did you grab the diagnostics when that happened?

 

Same behavior as originally described in this post. PCIe disks going offline during a parity check, formatting the parity disk **seems** to have resolved the issue... Exact same hardware, only an upgrade to Unraid 6.2

Link to comment

Is it possible that a corrupt parity caused the parity check algorithm to crash?

 

Not really, did you grab the diagnostics when that happened?

 

Same behavior as originally described in this post. PCIe disks going offline during a parity check, formatting the parity disk **seems** to have resolved the issue... Exact same hardware, only an upgrade to Unraid 6.2

Haven't been following the thread, but formatting the parity disk is completely pointless since it doesn't have a filesystem. Perhaps you meant something other than actually formatting. Format means "write an empty filesystem to this disk". That is what it has always meant on every operating system you have ever used.
Link to comment

Haven't been following the thread, but formatting the parity disk is completely pointless since it doesn't have a filesystem. Perhaps you meant something other than actually formatting. Format means "write an empty filesystem to this disk". That is what it has always meant on every operating system you have ever used.

Okay, I simply reconstructed the disk by starting the array without the parity disk prior to re-adding it. I was under the impression that this would format the drive since I did not realize the parity disk does not have a FS. So I guess it just overwrote the disk.

 

In any case, my box is still up and running after 3 days. I really thought it was a hardware issue and am bummed we'll probably never know what it was.

Link to comment

There's no difference between manual and scheduled parity checks.

 

Errors continue to be on the 4 disks on the Marvell controller. I would get a different controller; if a 4-port controller is enough you can get an Adaptec 1430SA for about $20 on eBay.

I'm just not inclined to believe that after 3 attempts, having randomly run between 1 and 3 manual parity checks before setting up a scheduled check, this is just a coincidence. First time, sure. Second time, eh, why not? But three times in a row, being able to reproduce it like that...? I don't know what to tell you.

 

It could also be that it has something to do with mover, which is scheduled to run every two hours.

Link to comment
