Parity drive 'unassigned', shows in 'unassigned devices'.


dingd0ng


Hi

 

Tonight my parity drive started making some horrible noises during a move operation, so I panicked and unplugged the disk, since it sounded like something inside was skipping around. A few hours earlier I had also received some SMART errors; the one I remember was Offline Uncorrectable going from 0 to 8. Just as the noises started and I unplugged the disk, I also got a red warning in the Unraid GUI, which I think was about the parity drive being disconnected (as a precaution by Unraid?). I have seen similar things before with other disks, and in those cases I reconnected them and rebuilt parity.

After reconnecting the parity disk and stopping the array, the disk didn't show up in the GUI, so I restarted the server to get the disk to show up again (and to swap the SATA data cable as a precaution). When the GUI came back up, my parity disk couldn't be selected in the 'parity' slot: the only option in the drop-down menu was 'unassigned', meaning no disk could be chosen. The disk did show up further down under 'unassigned devices', with the option to format it or run a preclear session on it.

Unfortunately, since I had now restarted the server the attached syslog probably doesn't contain all the errors from before the restart. I have also attached a screenshot of the server config from before the restart. The parity drive in question is ST16000NM001G-2KK103_ZL235Z8D (sde). 
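
For anyone wanting to keep an eye on the counters mentioned above, here is a minimal Python sketch that scans the attribute table printed by `smartctl -A` for non-zero raw values on a few sector-health attributes. The sample text is made up for illustration (only the Offline Uncorrectable value of 8 comes from this post), and the function name `nonzero_watched` and the choice of watched IDs are assumptions, not anything Unraid itself does.

```python
# Illustrative attribute table in the layout printed by `smartctl -A`.
# The values are hypothetical, except the raw value 8 reported in the post.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

# Attribute IDs to watch (an assumption made for this sketch).
WATCHED = {5, 197, 198}


def nonzero_watched(text):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    hits = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least ten columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in WATCHED:
            raw = int(fields[9])
            if raw:
                hits[fields[1]] = raw
    return hits


print(nonzero_watched(SAMPLE))  # {'Offline_Uncorrectable': 8}
```

In practice the text would come from running `smartctl -A` against the device rather than from a hard-coded string.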

 

Previously, when disks have been disconnected from the array due to errors, I think I have followed the instructions in this thread (link), but I think this is the first time the parity drive itself has been disconnected, and now it already shows as 'unassigned'.

 

How should I proceed? I can't choose the parity drive as the new parity drive since the only option in the drop down menu is 'unassigned'. There are no disks to choose from, since the 'current' parity drive now is shown under 'unassigned devices'.

 

Unraid version: 6.9.1
CPU: Intel i3 8350K

Motherboard: Asus Rog Strix Z370-F Gaming. BIOS from 01.12.2017

Controller card: LSI SAS-9211-8i

Disks: Eight Seagate ranging from 8 to 16 TB, mainly Ironwolf-disks.

PSU: EVGA Supernova G3 750W

 

To complicate things further, I should mention that I have been having problems with disks for a long time now. I assumed most of them were due to old disks, but I also suspected that something in the server was damaging the disks. The failing parity drive is only about 6 months old, and I precleared it before use to stress-test it.

 

I assume either that the disks have become defective through mechanical failure (a fault present all along, or old age, both outside my control), or that something in my server is making the disks fail. I don't think Unraid or any other software can do this (?), so I am left with the cables connected to the disks, the LSI controller card, the motherboard, the PSU or overheating as the possible culprits.

  • I have now swapped all SATA data cables. I was mostly using two SAS to 4 SATA-plug cables from the controller card and have swapped both these. I was also connecting some disks directly to the motherboard using ordinary SATA-cables, and I got a lot of SMART errors on some of these disks, but these disks were a few years old and have now been removed from the array, and for the last few weeks all disks have been connected to the controller card. I was therefore expecting these errors to be done with after taking these SATA-cables out of the server and not using the motherboard SATA-ports any more.
    Is it even possible for a SATA data cable to produce these errors? I get that a loose or damaged cable can lead to data loss and certain SMART errors, and I believe some of the SMART errors I have seen have been due to bad cables, but I can't imagine a bad data cable producing the noises and vibrations I have noticed lately.
  • I have checked all SATA power cables from the PSU to the disks. I can't see anything wrong with them, but don't have any expertise in this. Is it even possible for a PSU or a PSU cable to damage a hard drive? The errors I have been getting seem to be mechanical, since I have heard some decidedly non-normal sounds from the damaged disks lately.
  • I haven't noticed anything weird with the LSI controller card, but don't know how to check this further. I have as mentioned replaced the SAS to SATA-cables connected to the controller.
  • I haven't noticed anything weird with the motherboard either, but again I don't know how this could be investigated further. Since the few disks that were connected to the motherboard's SATA ports were removed earlier, I also don't see how it can be the culprit, other than that a faulty motherboard might cause errors on a connected controller card, which in turn is connected to the disks in question. But that seems far-fetched.
  • Overheating: Not a problem, the disks are almost never above 40 degrees, and fans are blowing air directly over them.

 

 

 

Skjermbilde 2021-09-25 kl. 22.29.10.png

tower-diagnostics-20210925-2248.zip

Just now, trurl said:

The SMART report for that disk doesn't look like those for your other Exos X16 disks. Is there some controller difference between them?

Controller difference, as in which controller card or SATA-connection they are connected to in the server? No, all disks are connected to the same LSI controller card using two SAS-to-4-SATA-cables. Previously some disks have been connected directly to the motherboard SATA-ports, but not the parity drive I am currently having problems with.

 

I am also having trouble getting a fresh SMART-report from the disk. When clicking on the disk under Unassigned devices there is almost no data, see attached screenshot. Is this normal when a disk is connected to Unassigned devices? Shouldn't I be able to run a SMART self test? 

Skjermbilde 2021-09-26 kl. 00.52.06.png

Skjermbilde 2021-09-26 kl. 00.52.56.png

32 minutes ago, trurl said:

Still having connection problems with that disk.

Maybe the cause of everything else

Sep 25 22:48:01 Tower kernel: blk_update_request: I/O error, dev sdh, sector 31251758976 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Sep 25 22:48:01 Tower kernel: sd 1:0:6:0: [sdh] tag#1729 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Sep 25 22:48:01 Tower kernel: sd 1:0:6:0: [sdh] tag#1729 Sense Key : 0xb [current] 
Sep 25 22:48:01 Tower kernel: sd 1:0:6:0: [sdh] tag#1729 ASC=0x0 ASCQ=0x0 
Sep 25 22:48:01 Tower kernel: sd 1:0:6:0: [sdh] tag#1729 CDB: opcode=0x88 88 00 00 00 00 07 46 bf ff 80 00 00 00 08 00 00

 

Or maybe an actual disk problem. Do you have another you could try?

5 minutes ago, trurl said:

Maybe the cause of everything else

Sep 25 22:48:01 Tower kernel: blk_update_request: I/O error, dev sdh, sector 31251758976 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Sep 25 22:48:01 Tower kernel: sd 1:0:6:0: [sdh] tag#1729 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Sep 25 22:48:01 Tower kernel: sd 1:0:6:0: [sdh] tag#1729 Sense Key : 0xb [current] 
Sep 25 22:48:01 Tower kernel: sd 1:0:6:0: [sdh] tag#1729 ASC=0x0 ASCQ=0x0 
Sep 25 22:48:01 Tower kernel: sd 1:0:6:0: [sdh] tag#1729 CDB: opcode=0x88 88 00 00 00 00 07 46 bf ff 80 00 00 00 08 00 00

 

Or maybe an actual disk problem. Do you have another you could try?

Unfortunately I don't have any other disks of that size available, and a smaller disk won't work as a parity drive since some of the array disks are also 16 TB.

 

Could I disconnect the failed disk, reconnect it via an external drive bay and run a SMART check from Unassigned devices? Or are SMART checks not available from UD?


OK, this is weird, but I connected two smaller disks to the server via an external hard disk bay and ran SMART tests on them to see if that worked, and after doing this the failed parity drive now shows up as an option in the drop-down menu for Parity.

I can now also access the parity drive under Unassigned devices with more information than earlier. I still can't run a SMART check on the disk and I still can't see the attributes and capabilities of the disk, but I can download a SMART report. I have attached this to this post.

 

What are my possible options going forward?

  • Since I can now pick this drive as a parity drive I assume I could re-select it and start the array, which I think will trigger a parity rebuild. This seems risky seeing as this is the same disk making weird noises earlier tonight. And those noises were weird.
  • I could maybe try Tools - New Config, reassign the disks to their old configuration, and choose Parity is valid (I think that is what it is called). I think this would get the array online the fastest. But again, I don't trust this disk yet.
  • I could try removing the parity drive completely from the server, connecting it to the same external hard disk bay and see if I can run a proper SMART check on it from there. If that works I think I should get an updated overview of damaged sectors etc on the disk and could make a more informed decision about reusing the disk or not.
  • I could start the array without a parity drive assigned. This is obviously risky with regards to further data loss, but I would get access to my files and dockers? Is this even possible? Would I have to do the New Config-step to do this?

What do the pros suggest? First priority is obviously to find out if the drive is OK, then what made the errors appear in the first place, and then to get the array online as fast as possible.

ST16000NM001G-2KK103_ZL235Z8D-20210926-0158.txt


Since you have single parity and only one disabled disk, you should be able to start the array and access any files on any mounted data disks in the array. This is true whether the disabled disk is parity or data. If it were a data disk disabled, as long as the emulated disk was mountable its data could be accessed even if the disk were removed from the server.
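
As a rough illustration of why this works, here is a minimal Python sketch, assuming the usual bytewise XOR scheme for single parity across equally sized blocks. The disk contents and the helper name `xor_blocks` are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up "data disks", four bytes each (illustrative values only).
data = [b"\x10\x20\x30\x40", b"\x0f\x0e\x0d\x0c", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Case 1: the parity disk is disabled.
# Every data disk is still read directly; nothing has to be reconstructed.
readable_without_parity = data

# Case 2: data disk 1 is disabled instead.
# Its contents are emulated from parity plus the remaining data disks.
emulated = xor_blocks([parity, data[0], data[2]])
print(emulated == data[1])  # True
```

With parity disabled, the data disks remain readable but nothing can be reconstructed if a second disk fails.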

 

No point in doing New Config, no real difference between an array with no parity assigned and an array with a disabled parity disk.

 

Can you put the disk in another computer and try to run the manufacturer's tests on it? I'm sure those can be downloaded free from their website.


I have now run Seagate SeaTools on the disk, and none of the tests even seem able to run. 

Below is the log from SeaTools for the disk. All other disks in the server ran all tests without error.

I still can't access any SMART-info from the disk in Unraid, but the disk shows up in Unassigned Devices.

 

Does anyone have any idea what could cause such a catastrophic failure of the drive during an ordinary move-operation?

 

Quote

=============================================== 
SeaTools Bootable v2.1.2 
Model Number : ST16000NM001G-2KK103
Serial Number : ZL235Z8D
Firmware Revision : SN02
=============================================== 
09-27  | 22:16:58 |  S.M.A.R.T. Check Started
09-27  | 22:16:58 |  S.M.A.R.T. Check PASS
09-27  | 22:17:17 |  Short Drive Self Test Started
09-27  | 22:17:17 |  Short Drive Self Test FAIL
09-27  | 22:17:54 |  Short Generic Started
09-27  | 22:17:54 |  Outer Scan Started
09-27  | 22:17:54 |  Short Generic FAIL
09-27  | 22:22:01 |  Short Generic Started
09-27  | 22:22:01 |  Outer Scan Started
09-27  | 22:22:01 |  Short Generic FAIL
09-27  | 22:24:08 |  S.M.A.R.T. Check Started
09-27  | 22:24:08 |  S.M.A.R.T. Check PASS
09-27  | 22:24:19 |  Short Drive Self Test Started
09-27  | 22:24:19 |  Short Drive Self Test FAIL
09-27  | 22:24:22 |  Short Generic Started
09-27  | 22:24:22 |  Outer Scan Started
09-27  | 22:24:22 |  Short Generic FAIL
09-27  | 22:24:35 |  Short Generic Started
09-27  | 22:24:35 |  Outer Scan Started
09-27  | 22:24:35 |  Short Generic FAIL
09-27  | 22:24:40 |  Long Generic Started
09-27  | 22:24:40 |  Error LBA: 0
09-27  | 22:24:40 |  Error LBA: 1
09-27  | 22:24:40 |  Error LBA: 2
09-27  | 22:24:40 |  Error LBA: 3
09-27  | 22:24:40 |  Error LBA: 4
09-27  | 22:24:40 |  Error LBA: 5
09-27  | 22:24:40 |  Error LBA: 6
09-27  | 22:24:40 |  Error LBA: 7
09-27  | 22:24:41 |  Error LBA: 8
09-27  | 22:24:41 |  Error LBA: 9
09-27  | 22:24:41 |  Error LBA: 10
09-27  | 22:24:41 |  Error LBA: 11
09-27  | 22:24:41 |  Error LBA: 12
09-27  | 22:24:41 |  Error LBA: 13
09-27  | 22:24:41 |  Error LBA: 14
09-27  | 22:24:41 |  Error LBA: 15
09-27  | 22:24:41 |  Error LBA: 16
09-27  | 22:24:41 |  Error LBA: 17
09-27  | 22:24:41 |  Error LBA: 18
09-27  | 22:24:41 |  Error LBA: 19
09-27  | 22:24:41 |  Error LBA: 20
09-27  | 22:24:41 |  Error LBA: 21
09-27  | 22:24:41 |  Error LBA: 22
09-27  | 22:24:41 |  Error LBA: 23
09-27  | 22:24:41 |  Error LBA: 24
09-27  | 22:24:41 |  Error LBA: 25
09-27  | 22:24:41 |  Error LBA: 26
09-27  | 22:24:41 |  Error LBA: 27
09-27  | 22:24:41 |  Error LBA: 28
09-27  | 22:24:41 |  Error LBA: 29
09-27  | 22:24:41 |  Error LBA: 30
09-27  | 22:24:42 |  Error LBA: 31
09-27  | 22:24:42 |  Error LBA: 32
09-27  | 22:24:42 |  Error LBA: 33
09-27  | 22:24:42 |  Error LBA: 34
09-27  | 22:24:42 |  Error LBA: 35
09-27  | 22:24:42 |  Error LBA: 36
09-27  | 22:24:42 |  Error LBA: 37
09-27  | 22:24:42 |  Error LBA: 38
09-27  | 22:24:42 |  Error LBA: 39
09-27  | 22:24:42 |  Error LBA: 40
09-27  | 22:24:42 |  Error LBA: 41
09-27  | 22:24:42 |  Error LBA: 42
09-27  | 22:24:42 |  Error LBA: 43
09-27  | 22:24:42 |  Error LBA: 44
09-27  | 22:24:42 |  Error LBA: 45
09-27  | 22:24:42 |  Error LBA: 46
09-27  | 22:24:42 |  Error LBA: 47
09-27  | 22:24:42 |  Error LBA: 48
09-27  | 22:24:42 |  Error LBA: 49
09-27  | 22:24:42 |  Long Generic FAIL
09-28  | 01:35:28 |  S.M.A.R.T. Check Started
09-28  | 01:35:28 |  S.M.A.R.T. Check PASS
09-28  | 01:35:31 |  Short Drive Self Test Started
09-28  | 01:35:31 |  Short Drive Self Test FAIL
09-28  | 01:35:34 |  Short Generic Started
09-28  | 01:35:34 |  Outer Scan Started
09-28  | 01:35:35 |  Short Generic FAIL
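
To get a quick overview of a log like this, a minimal Python sketch could tally the outcomes per test and collect the reported error LBAs. It assumes the three-column `date | time | message` layout seen above; the function name `summarise` is made up, and the excerpt is a shortened copy of the log lines in this post.

```python
from collections import Counter

# Shortened excerpt of the SeaTools log quoted in this post.
EXCERPT = """\
09-27  | 22:16:58 |  S.M.A.R.T. Check Started
09-27  | 22:16:58 |  S.M.A.R.T. Check PASS
09-27  | 22:17:17 |  Short Drive Self Test Started
09-27  | 22:17:17 |  Short Drive Self Test FAIL
09-27  | 22:24:40 |  Long Generic Started
09-27  | 22:24:40 |  Error LBA: 0
09-27  | 22:24:40 |  Error LBA: 1
09-27  | 22:24:42 |  Long Generic FAIL
"""


def summarise(log_text):
    """Count (test name, outcome) pairs and collect the LBAs reported as errors."""
    results = Counter()
    error_lbas = []
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # skip headers, separators and blank lines
        message = parts[2]
        if message.startswith("Error LBA:"):
            error_lbas.append(int(message.split(":")[1]))
        elif message.endswith((" PASS", " FAIL")):
            name, outcome = message.rsplit(" ", 1)
            results[(name, outcome)] += 1
    return results, error_lbas


results, error_lbas = summarise(EXCERPT)
for (name, outcome), count in sorted(results.items()):
    print(f"{name}: {outcome} x{count}")
print("error LBAs:", error_lbas)
```

Run over the full log above, the same tally would show the S.M.A.R.T. check passing each time while every self test and generic scan fails, with the long scan reporting errors from LBA 0 through 49.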

 

