New drive failing during rebuild


Ziggy

If it's not getting a lease, something is wrong with your network. You can rename network.cfg back, but I doubt it will make any difference.

Yeah, I'm starting to think you're right. I just got a DHCP lease, but pinging a Google DNS server shows about 73% packet loss, and I still can't load the web interface. I'm not sure why I'm not experiencing any issues on the other machine... Maybe the cable is dying... I'll move the hardware to my router and hook it up there to see what happens.
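
(For anyone wanting to measure this themselves: the loss figure comes from ping's summary line, which you can pull out with awk. The target address and ping count below are just examples, not anything specific to this setup.)

```shell
# Send a burst of pings and extract the loss percentage from the
# summary line. 8.8.8.8 (Google DNS) and the count of 50 are examples:
#   ping -c 50 8.8.8.8 | awk -F', ' '/packet loss/ {print $3}'
# The awk filter splits the summary on ", " and prints the third field.
# Demonstrated here on a sample summary line:
loss=$(echo "50 packets transmitted, 13 received, 74% packet loss, time 49076ms" \
    | awk -F', ' '/packet loss/ {print $3}')
echo "$loss"   # -> 74% packet loss
```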

Link to comment

Nope. It is definitely NOT an issue with my network. I moved the hardware and hooked it up straight to my router after testing the router and new Cat5 cable with another device. It's possible that the PCI NIC got damaged; it's been gathering dust for a couple of years now... I don't have another NIC to test :(.

 

EDIT: I'll go out to buy a new NIC, looks like I can find one el-cheapo for 10 bucks.

Link to comment

Replacing the NIC did not resolve the issue. I'm getting an IP lease, but there's loss when pinging other devices.

 

I am ready to start pulling my hair out at this point. I have NO idea what to try next.

 

EDIT: Putting the other motherboard in resolves the network issue. I am 100% positive it is not a physical issue.

Link to comment

Sorry for the quadruple post... So I slid my old motherboard back in and am having severe issues mounting my disks now... Multiple disks are unmountable, including my two RAID 0 SSD cache drives... https://gyazo.com/bfc06adeaa77e8a7dc2bc71b13a0aa2f

 

I'm assuming one of the filesystems became 'corrupt' because of all the hard resets I had to do to address the issues discussed above. I tried running xfs_repair on /dev/sde but it's been stuck at phase 1 for over an hour now...

 

Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
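
(As an aside for anyone landing here from a search: the usual xfs_repair sequence is sketched below. One common gotcha worth knowing: on unRAID the XFS filesystem lives on partition 1, so pointing xfs_repair at the whole device, /dev/sde, rather than the partition, /dev/sde1, will itself produce a "bad primary superblock" error. Device names here are placeholders; adjust them to your setup, and stop the array first so nothing holds the device open.)

```shell
# Placeholder device: replace sde1 with your disk's first partition.

# 1. Read-only check: reports problems without writing anything.
xfs_repair -n /dev/sde1

# 2. Actual repair. If the primary superblock is bad, the
#    "attempting to find secondary superblock..." scan can take as
#    long as reading the entire disk end to end.
xfs_repair /dev/sde1

# 3. Last resort if it refuses to run because of a dirty log: -L
#    zeroes the journal and may lose the most recent writes.
xfs_repair -L /dev/sde1
```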

ziggy_unraid-diagnostics-20160829-2141.zip

Link to comment

...it's been stuck at phase 1 for over an hour now...

 

This can take several hours, up to the same time as a full scan of the disk.

 

Thank you for putting up with me!

 

Alright, I'll let it run overnight. What about my two cache drives that are unmountable? They have BTRFS, how do I scan/repair those?

Link to comment

Try them one at a time to see if either mounts; the other one has to be physically disconnected from the server, as just unassigning it is not enough.

Nope, they're not mounting separately either (I tried disconnecting each of the SATA cables and then mounting the drives, no joy). For your information, I did put them in a RAID0.

Link to comment

Sorry, didn't notice that. The btrfs recovery tools are not the best; good luck:

 

https://lime-technology.com/wiki/index.php/Check_Disk_Filesystems#Drives_formatted_with_BTRFS
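
(The wiki page above boils down to roughly the following sequence. This is a sketch only: the device and mount paths are placeholders, and on newer kernels the `recovery` mount option has been renamed `usebackuproot`.)

```shell
# Placeholders throughout: /dev/sdf1 is one member of the cache pool,
# /mnt/recovery is any healthy location with enough free space.

# Read-only check: reports damage without modifying the filesystem.
btrfs check --readonly /dev/sdf1

# Try a read-only mount using backup roots ("recovery" on older
# kernels, "usebackuproot" on newer ones).
mount -o ro,degraded,recovery /dev/sdf1 /mnt/cache

# If it still won't mount, copy files straight off the raw device.
btrfs restore -v /dev/sdf1 /mnt/recovery
```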

 

Thanks a bunch. You weren't kidding about the btrfs tools, took me many hours to recover the contents...

 

I'm still stuck with disk two though. xfs_repair did manage to make it mountable again, but it's still spitting out errors in the CLI. On top of that, xfs_repair keeps crashing: https://gyazo.com/45901f8134be18ad90de7317af30b962 .

ziggy_unraid-diagnostics-20160830-1523.zip

Link to comment

If xfs_repair keeps crashing you can try two things: upgrade to v6.2 (it includes a newer xfs_repair), or move all data to other disk(s) and format that disk.

Hmm, upgrading to v6.2-rc4 did not resolve the issue. How would you recommend I go about moving the data to a new disk? Can I just mv everything, or should I use specialized tools?

Link to comment

I'm puzzled. I noticed that some of the disks in the array only went into error mode during a parity check. So I wiped the parity and rebuilt it, and after 48+ hours, the issues seem to have gone away...

 

Is it possible that a corrupt parity caused the parity check algorithm to crash?

Link to comment

Is it possible that a corrupt parity caused the parity check algorithm to crash?

 

Not really, did you grab the diagnostics when that happened?

 

Same behavior as originally described in this post. PCIe disks going offline during a parity check, formatting the parity disk **seems** to have resolved the issue... Exact same hardware, only an upgrade to Unraid 6.2

Link to comment

Is it possible that a corrupt parity caused the parity check algorithm to crash?

 

Not really, did you grab the diagnostics when that happened?

 

Same behavior as originally described in this post. PCIe disks going offline during a parity check, formatting the parity disk **seems** to have resolved the issue... Exact same hardware, only an upgrade to Unraid 6.2

Haven't been following the thread, but formatting the parity disk is completely pointless since it doesn't have a filesystem. Perhaps you meant something other than actually formatting. Format means "write an empty filesystem to this disk". That is what it has always meant on every operating system you have ever used.
Link to comment

Haven't been following the thread, but formatting the parity disk is completely pointless since it doesn't have a filesystem. Perhaps you meant something other than actually formatting. Format means "write an empty filesystem to this disk". That is what it has always meant on every operating system you have ever used.

Okay, I simply reconstructed the disk by starting the array without the parity disk prior to re-adding it. I was under the impression that this would format the drive since I did not realize the parity disk does not have a FS. So I guess it just overwrote the disk.

 

In any case, my box is still up and running after 3 days. I really thought it was a hardware issue and am bummed we'll probably never know what it was.

Link to comment

There's no difference between manual and scheduled parity checks.

 

Errors continue to be on the 4 disks on the Marvell controller. I would get a different controller; if a 4-port controller is enough you can get an Adaptec 1430SA for about $20 on eBay.

I'm just not inclined to believe that after 3 attempts, having randomly run between 1 and 3 manual parity checks before setting up a scheduled check, this is just a coincidence. First time, sure. Second time, eh, why not? But three times in a row, being able to reproduce it like that...? I don't know what to tell you.

 

It could also be that it has something to do with mover, which is scheduled to run every two hours.

Link to comment
