Disappearing Disks - Replace one and another one goes....



Hi, I'm so confused that I'm not sure how to explain all the steps I've taken, but I'll try not to write a novel. I hope I have posted in the right area. I did lots of searching but couldn't find anything similar.

 

Setup: KCAS 850GM PSU, ASUS X570-P M/B, AMD Ryzen 9 3900X CPU, LSI SAS9201-16e 16 Port SAS/SATA 6Gb/s PCIe HBA, 2 8TB Parity, 9 Data Drives of various sizes.

 

Finally purchased a 4 port HBA card (LSI SAS9201-16e 16 Port SAS/SATA 6Gb/s PCIe HBA) and installed some extra drives a couple of weeks ago - all good (only using one port at this stage).

 

Last Friday Disk 2 was in error state (1st Pic) and being emulated - only had one parity at that time.

 

image.png.bf3d79d24d47d9adcdec7c6eeaf76bca.png

 

Jumped online and ordered 3 new 8 TB drives with the idea that I would use one to replace disk 2, one for a 2nd parity (which I have been planning on) and one as a spare.

 

Drives arrived yesterday and I used 1 to replace disk 2 - it was marked as having errors soon after parity rebuild began.

Tried another new drive - marked as errors soon after parity rebuild began.

Swapped SATA cable - same

Swapped power cable - same

Connected to HBA card - same

Swapped SATA port and started in maintenance mode - rebuilt at over 100 MB/s, but Disk 4 showed read errors.

 

This afternoon when the rebuild was complete, I simply restarted the server and Disk 4 still showed error state and being emulated.

Shut down and added 2nd Parity Drive

Restarted and now showing that Disk 2 is unmountable

Disk 4 is in error state and disabled

Started in maintenance mode and was able to rebuild 2nd parity.

Removed Disk 4, put it in a Windows computer, did a clean and a scan... no problems.

Put it back into Unraid and rebuilt Disk 4, and then Disk 9 went... I'm just chasing my tail.

 

... A few days later.

 

Swapped Disk 9 this morning and the array would not start, claiming that it was the wrong encryption passphrase. Checked a dozen times that it was copying and pasting correctly, swapped back to the original disk and tried another dozen times to get the array started. Ended up shutting down as I had other more urgent things to do... then started it up again and the array started first go with the same copy and paste from where I keep it. Total mystery to me, but I'm getting nervous about all these 'weird little issues' that keep happening.

Yesterday I had some time so I swapped Disk 9, but then Disk 4 was 'missing'. Disk 4 is recognised in a separate Ubuntu VM (physical disk attached) and I can even put in the encryption key and go through the directory, so the data seems like it's all there. Tried replacing Disk 4 with my last new 8TB and it allows me to select it and then just says 'wrong disk'.

 

Took Disk 4 back to the Windows PC. Did a clean and created a partition. Now the BIOS (on the Unraid machine) can see it, but the disk does not show up in Unraid with lsblk --scsi.
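
For anyone checking the same thing, one rough way to confirm whether a particular drive is visible to the OS is to look for its serial number in the `lsblk` output. The Python sketch below assumes an `lsblk` new enough to support `--json` with the `NAME,TYPE,SIZE,MODEL,SERIAL,TRAN` columns. The serial numbers, model names, and sample data are made up purely for illustration and are not from my diagnostics.

```python
import json
import subprocess


def list_disks(lsblk_json):
    """Return the whole-disk entries from `lsblk --json` output."""
    data = json.loads(lsblk_json)
    return [dev for dev in data.get("blockdevices", [])
            if dev.get("type") == "disk"]


def find_by_serial(lsblk_json, serial):
    """Return the disk entry with the given serial, or None if the OS cannot see it."""
    for dev in list_disks(lsblk_json):
        # lsblk reports null for a missing serial, hence the `or ""`.
        if (dev.get("serial") or "").strip() == serial:
            return dev
    return None


def read_lsblk():
    """Run lsblk on the local machine and return its JSON text."""
    cmd = ["lsblk", "--json", "--nodeps",
           "-o", "NAME,TYPE,SIZE,MODEL,SERIAL,TRAN"]
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout


# Hypothetical sample output, for illustration only.
SAMPLE = """{"blockdevices": [
  {"name": "sda", "type": "disk", "size": "7.3T",
   "model": "EXAMPLE-8TB", "serial": "SERIAL0001", "tran": "sata"},
  {"name": "sdb", "type": "disk", "size": "931.5G",
   "model": "EXAMPLE-1TB", "serial": "SERIAL0002", "tran": "sas"}
]}"""

missing = find_by_serial(SAMPLE, "SERIAL0003")  # None: drive not detected
present = find_by_serial(SAMPLE, "SERIAL0002")  # the sdb entry
```

On a real system you would pass `read_lsblk()` instead of `SAMPLE`. A result of `None` for a drive that the BIOS can see would point at detection by the OS or controller rather than at the data on the disk.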

 

Currently I can't start the array, not even in maintenance mode, as the option is greyed out. I just can't understand why it was working fine until I added the last couple of drives. It doesn't seem to have anything to do with the SATA leads, the power leads, or the HBA card, as I have interchanged all those things so many times. I'm not sure if there is any way to see whether the BIOS is detecting a drive that is attached to the HBA card, but again... it doesn't seem to matter how the disk is attached.

 

Any help would be greatly appreciated.

 

tower-diagnostics-20220218-1636.zip

 

image.thumb.png.a067c08f3490cb31d31cf63c437c118e.png

 

 

image.thumb.png.b9e99117eeb2e060dc86996a75131f93.png

13 hours ago, JorgeB said:

You need to enter the old encryption passphrase twice, just the 2nd one is filled.

 

About the disk errors: the diags are from after rebooting, so there's not much to see.

Wow, totally missed that. I've never had to put it in twice before... is there a reason that it is suddenly asking for that?

 

Thank you so much for your reply.

 

So I put in the passphrase twice and it was going to let me start the array. However, this would have been with 2 disks down, so I shut down, replaced Disk 4 with my last 8TB drive, and when I restarted I had Disks 4, 5, and 9 out of action (see pic). Attached is the diagnostics file from before the reboot. Going to return to the previous drive config and see what happens. I'm guessing that starting the array right now would be a bad, bad thing... but I'm also quite confident that the data is still intact on Disk 5. Disk 9 had nothing on it anyway, but I really need Disk 4 to rebuild.

 

image.png.0fa96571d305ea514e80b9c8e86374dd.png

tower-diagnostics-20220219-0922.zip


So, back to the previous drive config and Unraid sees the same as it saw before... I did not touch Disk 5 on either of the previous reboots, but after changing Disk 4 (which also moved from onboard SATA to HBA SATA) it's suddenly back. I don't really want to start the array like this, so I'm going to replace Disk 9... it's the original disk, but it has been cleaned and scanned (0 errors) on a Windows machine.

image.thumb.png.59d6b0f30790b309f20115a4dfeab1da.png

tower-diagnostics-20220219-0935.zip


Not sure if it might help, but around the same time that the first disk failed I noticed that write cache was disabled on a few disks. Manually enabled it, but it seems that each time I've restarted or changed the disk config more are disabled on boot so now I'm manually enabling most of them each time it boots.

tower-diagnostics-20220219-1101.zip
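
To keep track of which disks come up with write cache disabled after each boot, something like the Python sketch below could be used. It assumes `hdparm` is installed, that `hdparm -W <device>` (with no value) only queries the setting, and that the result is printed as a `write-caching = 0 (off)` / `write-caching = 1 (on)` line. The device names and sample strings are placeholders, not taken from my system.

```python
import re
import subprocess


def parse_write_cache(hdparm_output):
    """Return True if write caching is on, False if off, None if not reported."""
    match = re.search(r"write-caching\s*=\s*(\d)", hdparm_output)
    if match is None:
        return None
    return match.group(1) == "1"


def query_write_cache(device):
    """Query one device (e.g. '/dev/sdb'); needs root and does not change anything."""
    result = subprocess.run(["hdparm", "-W", device],
                            capture_output=True, text=True)
    return parse_write_cache(result.stdout)


# Placeholder output in the format assumed above, for illustration only.
SAMPLE_ON = "\n/dev/sdb:\n write-caching =  1 (on)\n"
SAMPLE_OFF = "\n/dev/sdc:\n write-caching =  0 (off)\n"

states = {
    "/dev/sdb": parse_write_cache(SAMPLE_ON),
    "/dev/sdc": parse_write_cache(SAMPLE_OFF),
}
disabled = sorted(dev for dev, on in states.items() if on is False)
```

Logging the `disabled` list after every boot would at least show whether the set of affected disks follows a particular controller or power lead.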

 

Diagnostics file is after the array has been running for a while. Preclear in progress, but stopped the rebuild on Disk 9 as it was running extremely slowly (40 odd days for a 1TB disk).

 

Thanks in advance for any light you can shed on this issue.

Edited by Philby1975
10 hours ago, Philby1975 said:

is there a reason that it is suddenly asking for that?

I don't use encryption but it has been known to happen when there's a missing/upgraded disk.

 

10 hours ago, Philby1975 said:

when I restarted I had Disk 4,5, and 9 out of action

 

Disks are not being detected, and there are also some issues with disk7 in this boot, so you likely have a power/connection problem. You should also update the LSI firmware since it's very old.

 

 

9 hours ago, Philby1975 said:

Diagnostics file is after the array has been running for a while.

 

These show errors on two disks connected to the onboard SATA controller, again suggesting a power/connection problem.

 

 

 


Thank you so much mate. This is really appreciated as I've invested a lot into Unraid and preached far and wide amongst friends about how great it is.

 

So power is more likely the issue than the HBA card? I will start with updating the firmware... any idea where I might find a good, simple walkthrough? I wouldn't put my hand up for being stupid, but Linux is not my native environment.

 

Have just put the third different disk in as Disk 9 for rebuild (after preclear) and it's still going at 300-400 KB/s rather than the 100+ MB/s I would expect. I'm guessing this may all be part of the same problem?
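
As a rough sanity check on those numbers, the arithmetic can be sketched in Python as below. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and takes 350 KB/s as a representative value for a rebuild running at a few hundred KB/s. Both are my assumptions, not measured figures.

```python
SECONDS_PER_DAY = 86_400


def rebuild_days(size_bytes, bytes_per_second):
    """Days needed to rebuild a disk of the given size at a constant speed."""
    if bytes_per_second <= 0:
        raise ValueError("speed must be positive")
    return size_bytes / bytes_per_second / SECONDS_PER_DAY


def implied_speed(size_bytes, days):
    """Average speed in bytes per second implied by an estimated duration."""
    return size_bytes / (days * SECONDS_PER_DAY)


ONE_TB = 1e12  # decimal terabyte (assumption)

slow_days = rebuild_days(ONE_TB, 350e3)         # at an assumed 350 KB/s
fast_hours = rebuild_days(ONE_TB, 100e6) * 24   # at 100 MB/s
implied = implied_speed(ONE_TB, 40)             # from the "40 odd days" estimate
```

At 350 KB/s a 1TB disk works out to roughly 33 days, against under 3 hours at 100 MB/s, and the "40 odd days" estimate implies an average of about 290 KB/s. So the slow rebuild estimate and the observed speed are at least consistent with each other.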


Hi JorgeB, I hope you are well.

 

I updated the HBA card (it was advertised as having P19... was surprised to see 9), and I have rigged another power supply to the hard drives. Had some good signs and got down to only 1 hard drive with issues at one point. Now back to three. I'm hoping to borrow some hardware tomorrow to swap the whole thing over to different gear and test as I'm starting to wonder if it's not a hardware issue.

  

The reason for this is that I have had disks show up as SSDs that aren't, and a disk that was selected for the array but was actually an unassigned device... also, I have removed a bunch of dockers but they are still showing up as 'needing updates' when I run Fix Common Problems.

 

I have attached the diags again the way it is now with the updated HBA and separate PSU for the hard drives. Hoping it might give you some more clues.

 

Appreciate your help.

 

tower-diagnostics-20220225-1903.zip

  • 2 weeks later...

Hi JorgeB, just wanted to update you. Actually fried 10 of my 11 disks while testing and trying... obviously had an impact so I completely forgot about all the help you'd provided.

 

Was very much appreciated so I didn't want to just disappear. Have shelved everything for now, and have all the important stuff on cloud backup... but one day soon I will start again. I might even swap the board on some of the drives and see if I can get anything back.

 

Again, thank you for all your help.

