Troubleshooting steps for 2 Disabled Drives


Solved by JorgeB


I was unable to access my unRaid via WebUI or SSH, so I needed to force a shutdown (I know :( , should've tried to plug in a keyboard/monitor). When it came back up, 2 of the drives were Disabled (1 of my 2 parities and Disk1). A year or so ago (it's a 3-year-old rig without any other issues I can think of) I got a UDMA CRC error count on a drive that resolved when I opened the box and pushed in all the connectors. Each of the disabled drives again shows a UDMA CRC error, but opening the box and pushing in the connectors didn't magically bring them back this time.

 

Upon closer inspection, it looks like the tension and heat may have stretched the blue sheathing of the SATA cables, exposing some metal underneath. I bought all new cables and they come in tomorrow. It's probably a good idea to replace the cables in any event, but I'd really like to get a better understanding of the problem to make sure I'm addressing the right one. Is there a way I can confirm this is the issue before doing it? Should I replace these drives while I'm rerunning the cables (I already have replacements)?

 

I'd also appreciate advice on a workflow for what to do inside unRaid after I've replaced the SATA cables et al. and, hopefully, the drives are .... re-enabled? rediscovered? I'm sorry, I'm not sure of the nomenclature...obviously :)

PXL_20230508_230037550.jpg

Screenshot 2023-05-09 225003.png

Screenshot 2023-05-09 224446.png

Link to comment
  • Solution

SMART looks fine for both disks. There's a single CRC error, and for one of the disks it's an old error. It's still a good idea to replace the cables to rule that out. Then, since the emulated disk is mounting, and assuming its contents look correct, you can rebuild on top and re-sync parity; both can be done at the same time.
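
For anyone who wants to read the CRC counter directly instead of from a screenshot, here is a minimal sketch. It assumes the text comes from `smartctl -A` in its usual attribute-table layout. The function name and the sample values are made up for illustration and are not taken from the disks in this thread.

```python
def udma_crc_count(smartctl_text):
    """Return the raw value of SMART attribute 199 (UDMA_CRC_Error_Count), or None."""
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is the 10th column.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Illustrative sample only (hypothetical values, not from this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
"""

print(udma_crc_count(SAMPLE))  # prints 1
```

The raw count is cumulative and does not reset, so the useful signal after swapping cables is whether the number keeps climbing, not whether it is non-zero.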

Link to comment
28 minutes ago, JorgeB said:

SMART looks fine for both disks. There's a single CRC error, and for one of the disks it's an old error. It's still a good idea to replace the cables to rule that out. Then, since the emulated disk is mounting, and assuming its contents look correct, you can rebuild on top and re-sync parity; both can be done at the same time.

Thanks for that color! Is there a particular order in which I should rebuild Disk1 and re-sync the parity? Or, since they can be done simultaneously, does it not matter (any benefit to doing them separately)?

 

And finally, when I'm rerunning those SATA cables, do I need to keep track of which cable goes to which drive/port on the card? Or will Unraid sort that out without me trying to replicate the same runs...?

Link to comment
11 minutes ago, vitovega said:

or since they can be done simultaneously it doesn't matter (any benefit to doing them separately)?

I would do both at the same time.

 

12 minutes ago, vitovega said:

And finally, when i'm rerunning those SATA cables, do I need to keep track of which cable goes to which drive/port on the Card?

No, it tracks the disks by serial.
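
To illustrate why the cable-to-port mapping does not matter when disks are identified by serial, here is a toy sketch. It is not Unraid's actual implementation; the serials, port names, and the `resolve_slots` helper are all hypothetical.

```python
# Slot assignments keyed by disk serial number (hypothetical serials).
assignments = {"SERIAL_AAA": "parity", "SERIAL_BBB": "disk1"}


def resolve_slots(detected):
    """Map each array slot to the port where its disk was found.

    `detected` maps port name -> serial number reported by the disk.
    """
    return {
        assignments[serial]: port
        for port, serial in detected.items()
        if serial in assignments
    }


before = {"port0": "SERIAL_AAA", "port1": "SERIAL_BBB"}
after = {"port0": "SERIAL_BBB", "port1": "SERIAL_AAA"}  # cables swapped

print(resolve_slots(before))
print(resolve_slots(after))
```

Both calls find the same disks for the same slots; only the port each one sits on changes.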

Link to comment

Got the cables, but didn't have the time/energy to do it last night; def by the close of the weekend (agh, Mother's Day!). But something has been bothering me... If one of the CRC errors on those 2 drives is old, why did the drive get disabled?

Edited by vitovega
Link to comment
13 hours ago, JorgeB said:

Check to see if there's a cable or something jamming one of the CPU fans.

Unfortunately nothing like that. I checked the IPMI and CPU2 is overheating, so I'm going to try a different fan header and see if I can hear the pump going on the AIO cooler, and if not, replace it? In order to recable the hard drives I had to remove the radiators from the fan wall and then the fan wall itself, and I think I knocked out the CPU header when I was reattaching everything...

Link to comment

Eep I didn't know that! What sort of catastrophic failure? Leaks? Unfortunately replacing it with anything but the same model will require me to remove the whole mobo and I'm not sure I'm ready to do that if it can be simply replaced... But thank you for the info! I might not even be able to replace with the same model without removing the mb. If it comes to that I'll just get a fan. Thanks once again for the advice 

Edited by vitovega
Link to comment
10 hours ago, vitovega said:

What sort of catastrophic failure?

The relatively low thermal mass of the water block can allow rapid temperature spikes if the fluid stops moving or is gone. Processors do attempt to save themselves from overheating, but the engineers assume a certain amount of mass is going to be available even without airflow, so the lack of mass can allow damaging heat in a matter of seconds.

 

Leaks can be very bad as well. Even if the fluid is clean, the boards have dust and particles that will mix with the fluid and cause corrosion. A slow, undetected leak is the worst, as it seeps into cracks and crevices, causing voltage to go where it's not supposed to. Worst case would be a slow leak above a bottom-mounted PSU: you could end up with mains voltage going to all your sensitive parts at once, blowing out circuit boards on drives.

 

Granted, that sort of failure is very rare, probably because most water coolers are in gaming rigs for show, and any leaks or failures are caught relatively quickly.

Link to comment

YIKES! Wow, thanks for that! I'm glad I shut down my rig so fast when it was beeping; hopefully there was no permanent damage done. We'll find out tonight! Do you have any suggestions for what signs et al. to look for if there WAS permanent damage? Is it like memory issues where it's super hard to diagnose?

Edited by vitovega
Link to comment

Well, I re-ran all the cables, reinstalled the fan wall, and triple-checked all the connections, and it beeped constantly from the second I turned it on (it had been off for 4 days, so overheating seemed very unlikely). I tried a bunch of different headers for the AIO, but eventually I just let it run while I was messing around in the IPMI (and because it seemed to be booting properly and didn't even feel warm). Well, lo and behold, before the Unraid GUI could be accessed the beeping stopped, and it's been running for over 2 hours without any further beeping. My server-related Discord server suggested I update the BMC (and possibly the BIOS). But I personally feel I should rebuild the array first, and I'm TERRIFIED. Can I just confirm with you that these are my steps?

  1. Select "no device" for both the Parity and Disk1 slots, then
  2. Start the array
  3. Stop the array
  4. Reassign both disks to their previous slots
  5. Start the array again
  6. Finally sleep easy?

 

I simply can't select "no device" and start the array unless someone smarter than me tells me that's the correct move.

image.thumb.png.ccb07cf9d7f93a5e2657ad0fc4b97117.png

Edited by vitovega
Link to comment

Oh fiddlesticks, I got the email this morning that you posted and it just showed "that's it..." and I didn't see the rest and went for it. How WOULD I have gone about checking that the emulated disk was still mounting? It seemed to have... and the parity sync/data rebuild is currently in progress and 9.8% done. What should I be on the lookout for?
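
One possible way to script that check, as a rough sketch: it assumes the array is started and that the emulated Disk1 would appear at `/mnt/disk1` (the usual Unraid mount point for array disks; adjust the path if yours differs). The helper name is made up for this example.

```python
import os


def emulated_disk_status(mount_point="/mnt/disk1", sample=5):
    """Return (is_mounted, first few top-level entries) for a mount point.

    The default path is an assumption about where the emulated disk shows up.
    """
    mounted = os.path.ismount(mount_point)
    # Only list contents when something is actually mounted there.
    entries = sorted(os.listdir(mount_point))[:sample] if mounted else []
    return mounted, entries


if __name__ == "__main__":
    mounted, entries = emulated_disk_status()
    print("mounted:", mounted)
    print("sample contents:", entries)
```

If it reports mounted and the listed folders look like what you expect on Disk1, that matches the "emulated disk is mounting and contents look correct" condition from the solution post.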

Link to comment

FWIW, my running theory on the beeping issue is that the server was (forcefully) shut down during an accurate alert about a hot CPU2/AIO cooler/fan header not working, and it kept alerting on boot until it got to a certain point in the boot process where something could be reset. I dunno if this is crazy or stupid, but it's the best I can come up with, because it's now been without a beep for 15 hours or so.

Link to comment

Screenshot_20230518-213519.thumb.png.26cc063b5cf5c27326a28ec3bab93b15.png

So I thought that during the rebuild I would be able to access my files, but I noticed that the network shares weren't connected on my PC and Plex was still telling me "files not found". So I rebooted my PC and Plex server (I did nothing with the unRaid box) and, oddly, was then unable to access Unraid over the GUI or SSH (I tried on my Android without luck as well). I had looked at the GUI several times today, including right before I restarted my PC, and it was about 51% of the way through rebuilding after 13+ hours. Do you have a suggestion for how to proceed from here? My current plan is to wait at least long enough for the array to be rebuilt (I figure tomorrow evening or Saturday), but then what should I do if I'm still unable to access the GUI or SSH? Hard reset? That's what I did before I had the disabled disks the first time....

 

I suppose I should plug a monitor into the server. FWIW, it appears to be running fine with no error lights or beeping...
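
For a rough idea of how long to wait, here is a back-of-the-envelope estimate from the figures above (about 51% after roughly 13 hours). It assumes a constant rebuild rate, which real disks usually don't hold, so treat the result as a ballpark only.

```python
def rebuild_eta(elapsed_hours, fraction_done):
    """Estimate (total, remaining) hours, assuming a constant rebuild rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_hours / fraction_done
    return total, total - elapsed_hours


total, remaining = rebuild_eta(13, 0.51)
print(f"{total:.1f} h total, {remaining:.1f} h remaining")
# prints: 25.5 h total, 12.5 h remaining
```

On that assumption, the rebuild would need about another half day from the last time the GUI was checked.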

Edited by vitovega
Link to comment
