
Array / Drive Issues - Once a month disabled drives



Hi All,

 

I have been using Unraid for several years now, but I have had random drive errors throughout my entire Unraid history. I have undergone several hardware changes and can't figure out why the errors persist. I don't understand Linux well enough to troubleshoot via the command line, and the GUI only provides so much detail.

 

In summary:

  • Roughly every 4-6 weeks my server will randomly disable one or more drives after numerous errors are detected - sometimes up to 3 drives (often including parity).
  • I've performed S.M.A.R.T tests to confirm each drive's operability (see the smartctl sketch after this list) before either a) unassigning the drive and rebuilding it from parity, or b) creating a new array config and rebuilding parity, knowing the data is safe on the array - depending on how many drives get disabled.
  • I sell off old drives and upgrade every 12-18 months - originally starting with 4 x 1TB drives several years ago, now with a variety of 8-12TB drives. None of the drives from the original build remain.
  • I started with onboard SATA, upgraded to an 8-port SAS controller mixed with onboard, and now use a 16-port controller with some SSDs running from onboard SATA (plus NVMe) - every controller has been a RocketRaid (Marvell chipset).
  • I have undergone 4 motherboard changes, 4 CPU changes, 3 sets of RAM changes (including trying ECC with a Xeon CPU vs unbuffered with an i9 CPU), dedicated vs onboard GPUs, and even switched out 3 PSUs in case there was a power delivery issue.
  • I have added UPSes alongside my new high-efficiency PSU for the cleanest power I can provide.
  • I have upgraded the fans to cool the system better - dropping drive and device temperatures by 10-15 degrees Celsius (it's hot here in Australia).
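
For reference, the S.M.A.R.T checks I run are roughly the following (a rough sketch from the Unraid console; sdX is a placeholder for whichever drive was flagged):

smartctl -t short /dev/sdX     # kick off a short self-test
smartctl -l selftest /dev/sdX  # review the self-test log once it finishes
smartctl -a /dev/sdX           # full attribute/health report (reallocated sectors, CRC errors, etc.)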

 

I can't think of any other possible culprits besides the physical SAS-to-SAS cables or the Norco 4224 chassis, but I can't understand why the system would run fine for 4-6 weeks and then fail if the enclosure or cables were the cause.

 

The reason I'm reaching out now is that one of my parity drives was recently disabled overnight due to read errors. After performing the usual routine ("remove drive from array > start array > reboot server > re-assign disabled parity > start array > perform rebuild"), I immediately noticed that Disk 9 had become unmountable because its filesystem was not recognised, even though no errors were displayed for Disk 9 when the Parity 2 drive was first disabled. My assumption is that my HBA (a RocketRaid 2740) is causing these sporadic issues, but I can't confirm it, as the S.M.A.R.T checks on the drives that fail come back successful with no errors.

 

Currently I have the Parity 2 drive unassigned, and Disk 9 is being emulated because its filesystem is not recognised. I have found similar issues on the forums by searching, but not being an expert, I don't want to attempt a filesystem repair without confirming with others who have more experience. Will I lose any data if I force a filesystem repair in Maintenance mode for Disk 9? Is that my best step right now? Rebuilding Parity 2 while a drive in the array is unmountable seems risky, so I am staying in Maintenance mode until someone is able to respond.

 

Hope someone can assist, I'm freaking out!

**diagnostic logs attached**

tower-diagnostics-20200816-0956.zip


I think it might be worthwhile to explain a few things since your description has some slight misunderstandings.

1 hour ago, whitey88x said:

Roughly every 4-6 weeks my server will randomly disable a drive(s) after numerous errors are detected - sometimes up to 3 drives (often including parity).

Not entirely sure if you are using the word "disabled" correctly.

 

A disabled disk is marked with a red X. Unraid will only disable as many drives as you have parity disks. So with single parity max 1 drive could be disabled, and with dual parity, max 2 drives could be disabled. Note that parity drive(s) are included in this count, so with dual parity you could have both parity disks disabled, one parity and one data, or 2 data. But still no more than 2 total disks could be disabled.

 

A disabled disk can be rebuilt from the parity calculation by reading all remaining disks to calculate the data for the disabled disk.
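
For single parity that calculation is just an XOR across all the data disks (the second parity disk in a dual-parity setup uses a different calculation). A toy example with made-up byte values, runnable from any bash shell:

# three hypothetical data disks holding the bytes 0xA5, 0x3C and 0x0F
printf 'parity = 0x%02X\n' $(( 0xA5 ^ 0x3C ^ 0x0F ))   # parity is the XOR of all data bytes -> 0x96
# if the disk holding 0x3C fails, XOR parity with the surviving disks to get it back
printf 'disk2  = 0x%02X\n' $(( 0x96 ^ 0xA5 ^ 0x0F ))   # -> 0x3C, the missing byte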

 

If you unassign a disk in the array it might also be considered disabled if there aren't already too many disks disabled.

 

It is possible for any number of disks to be missing and/or unmountable, but these are not the same as disabled.

 

1 hour ago, whitey88x said:

one of my parity drives recently disabled itself overnight due to read errors

Unraid disables a disk when a write to it fails. A read error can cause Unraid to get the correct data from the parity calculation and attempt to write it back to the disk; if that write fails, the disk is disabled. Unraid never disables a disk simply due to a read error.

 

1 hour ago, whitey88x said:

Disk 9 being emulated due to the filesystem not being recognised.

A disabled or missing disk can be emulated from the parity calculation, but this is a completely different and independent situation from an unrecognized filesystem.

 

It is possible for a filesystem to be corrupt without the disk being disabled.

 

It is possible for a disk to be disabled while the filesystem of the emulated disk is fine. In fact, reads and writes of the emulated disk can continue even though the disabled disk itself will not be used by Unraid: all the other disks are read and the data for the emulated disk is calculated from them, and if a write is involved, parity is updated to emulate the write, but the disabled disk is never touched.

 

And it is possible for a disk to be disabled and its emulated filesystem also be corrupt.

 

 

It looks like parity2 is disabled and, as you say, unassigned. Was parity2 disabled before you unassigned it? I didn't see it being disabled in syslog. From syslog it seems the unassigned disk with serial ending 85G6 was parity2; SMART for that disk looks OK.

 

Disk9 is unmountable but not disabled; SMART for disk9 is also OK.

 

Please post a screenshot of Main - Array Devices to help confirm my understanding of your situation.

 

 


 

35 minutes ago, trurl said:

A disabled disk is marked with a red X. Unraid will only disable as many drives as you have parity disks. So with single parity max 1 drive could be disabled, and with dual parity, max 2 drives could be disabled. Note that parity drive(s) are included in this count, so with dual parity you could have both parity disks disabled, one parity and one data, or 2 data. But still no more than 2 total disks could be disabled.

Hey trurl, thanks for the response. In the past I have had a parity drive plus 2 other drives disabled (i.e. the red X indicator), and if Unraid won't allow that, I don't understand how it occurred - I had to force a new config to state the data was intact, as no drives had physically died or had any CRC errors.

 

35 minutes ago, trurl said:

It is possible for a disk to be disabled but the filesystem of the emulated disk is fine and in fact reads and writes of the emulated disk can continue even though a disabled disk will not be used by unraid; all the other disks are read and the data for the emulated disk is calculated, and if a write is involved, parity will be updated to emulate the write, but the disabled disk will not be used.

True - this is what I've experienced. More often than not, read errors seem to be the issue, but there are always several thousand write errors associated with a drive before it becomes disabled.

 

35 minutes ago, trurl said:

A disabled or missing disk can be emulated from the parity calculation, but this is a completely different and independent situation from an unrecognized filesystem.

My mistake - yes, you're correct. At the same time Parity 2 became disabled, Disk 9 became unmountable (filesystem error). I used the term "emulated" because the array was still accessible even though I had unassigned Parity 2 (in preparation for a rebuild); then, on server restart, ready to reassign Parity 2, I found Disk 9 was unmountable but its contents still accessible (as Parity 1 was still online).

 

I found it really odd that Parity 2 had errors and was disabled, yet after the reboot it was Disk 9 that turned out to be unmountable.

The logs may not show the errors or the disabling of Parity 2 because I rebooted the server in preparation for a rebuild, and I'm unsure how much historical data the logs retain.

 

In general I have four concerns/questions:

  1. Why did Disk 9 become unmountable after I unassigned Parity 2?
  2. Is it safe to perform a filesystem check/repair on Disk 9 to get it mountable, considering the drive itself is in good health?
  3. If I repair Disk 9, can I reassign Parity 2 and perform a rebuild safely?
  4. Does anyone have any ideas as to why I regularly get read/write errors that disable my drives every 4-6 weeks?

I have just bitten the bullet and purchased an LSI 9305-16i, as I keep reading that Marvell controllers are flaky. However, it's odd that I've been encountering these issues for several years, using onboard SATA ports alongside different RocketRaid controllers (unless all of them were flaky?). I do remember at one stage it went 3-4 months without errors, but eventually it failed again and I had to rebuild.


Now that I think about the last time I had 3 drives "disabled", I might be confusing it with 2 drives being disabled and a 3rd having similar counts of read/write errors. I recall rebuilding the array but can't recall which drives died, so my memory may have failed me there. However, most recently (start of July) I definitely had a parity drive plus a data disk become disabled under the same circumstances as my Parity 2 drive right now.

This is my first time encountering a drive being disabled + having a drive become unmountable, hence why I am freaking out. It's either my controller being a weirdo, or I've configured something incorrectly.

9 minutes ago, whitey88x said:

Disk 9 was then unmountable but contents still accessible

The contents of an unmountable disk, whether or not it is emulated, are not accessible, since it is not mounted. Perhaps you meant some files in your user shares were still accessible.

 

12 minutes ago, whitey88x said:

I rebooted the server in preparation for a rebuild, however I'm unsure how much historical data the logs contain. 

Syslog is in RAM and resets on reboot, but it is possible to have syslog written somewhere persistent (Settings → Syslog Server can mirror it to the flash drive or send it to another machine).
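
As a rough fallback (assuming the stock Unraid layout, where /boot is the flash drive), you can also just copy the live log off before a planned reboot:

mkdir -p /boot/logs
cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M).txt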

 

17 minutes ago, whitey88x said:

I keep reading that Marvell controllers are flakey, however it's odd that I've been encountering these issues for several years now with using onboard sata ports along-side RocketRaid controllers

This may be the root of your issues. You may have viewed your syslog and seen these multiple connection problems:

Aug 16 09:56:09 Tower emhttpd: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog
Aug 16 09:56:33 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Aug 16 09:56:33 Tower kernel: sas: ata11: end_device-9:2: cmd error handler
Aug 16 09:56:33 Tower kernel: sas: ata9: end_device-9:0: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata10: end_device-9:1: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata11: end_device-9:2: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata12: end_device-9:3: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata13: end_device-9:4: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata14: end_device-9:5: dev error handler
Aug 16 09:56:33 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Aug 16 09:56:33 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Aug 16 09:56:33 Tower kernel: sas: ata11: end_device-9:2: cmd error handler
Aug 16 09:56:33 Tower kernel: sas: ata9: end_device-9:0: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata10: end_device-9:1: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata11: end_device-9:2: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata12: end_device-9:3: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata13: end_device-9:4: dev error handler
Aug 16 09:56:33 Tower kernel: sas: ata14: end_device-9:5: dev error handler
Aug 16 09:56:33 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Aug 16 09:56:35 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Aug 16 09:56:35 Tower kernel: sas: ata15: end_device-10:0: cmd error handler
Aug 16 09:56:35 Tower kernel: sas: ata15: end_device-10:0: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata16: end_device-10:1: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata17: end_device-10:2: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata18: end_device-10:3: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata19: end_device-10:4: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata20: end_device-10:5: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata21: end_device-10:6: dev error handler
Aug 16 09:56:35 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Aug 16 09:56:35 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Aug 16 09:56:35 Tower kernel: sas: ata15: end_device-10:0: cmd error handler
Aug 16 09:56:35 Tower kernel: sas: ata15: end_device-10:0: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata16: end_device-10:1: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata17: end_device-10:2: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata18: end_device-10:3: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata19: end_device-10:4: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata20: end_device-10:5: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata21: end_device-10:6: dev error handler
Aug 16 09:56:35 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Aug 16 09:56:35 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Aug 16 09:56:35 Tower kernel: sas: ata16: end_device-10:1: cmd error handler
Aug 16 09:56:35 Tower kernel: sas: ata15: end_device-10:0: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata16: end_device-10:1: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata17: end_device-10:2: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata18: end_device-10:3: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata19: end_device-10:4: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata21: end_device-10:6: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata20: end_device-10:5: dev error handler
Aug 16 09:56:35 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Aug 16 09:56:35 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Aug 16 09:56:35 Tower kernel: sas: ata16: end_device-10:1: cmd error handler
Aug 16 09:56:35 Tower kernel: sas: ata15: end_device-10:0: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata16: end_device-10:1: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata17: end_device-10:2: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata18: end_device-10:3: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata19: end_device-10:4: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata20: end_device-10:5: dev error handler
Aug 16 09:56:35 Tower kernel: sas: ata21: end_device-10:6: dev error handler
Aug 16 09:56:35 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

 

I would still like to see this:

53 minutes ago, trurl said:

screenshot of Main - Array Devices

 

11 minutes ago, trurl said:

The contents of an unmountable disk, whether or not it is emulated, are not accessible, since it is not mounted. Perhaps you meant some files in your user shares were still accessible.

Ah, I assumed that because the shares were accessible, the contents were being emulated - I figured the shares wouldn't be accessible if there was an issue with the array. My terminology isn't the best with this Unraid/Linux stuff - can you tell? 😪

 

Also - missed your request for the screenshot (now attached).

Since I posted a while ago and I'm getting anxious to get the system back up and running, I found a tutorial on rebuilding the filesystem from SpaceInvader on YouTube and followed it closely.

Disk 9 is now mountable and I have started the array with Parity 2 back in place - it's currently performing a rebuild with no read/write errors.
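
For anyone reading later, the repair steps were roughly as follows (this assumes Disk 9 is XFS and the array is started in Maintenance mode so /dev/md9 maps to Disk 9 - adjust the disk number for your own setup):

xfs_repair -n /dev/md9   # check-only pass, makes no changes
xfs_repair /dev/md9      # actual repair once the check output looks sane
# xfs_repair -L /dev/md9 # last resort if it refuses to run due to a dirty log; -L discards the log and can lose recent metadata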

 

If this is like every other time I've had a drive error out and become disabled, the rebuild will complete successfully and I'll be back here again in 4-6 weeks.

 

Those errors you pointed out for my controller - what exactly do they indicate? That it can't talk to the drives correctly, or that the controller itself is having errors? Is it mainly driver/software related, or could something physical be causing the issues?


Unraid Array.JPG

6 minutes ago, whitey88x said:

Those errors you pointed out for my controller - what exactly do they indicate? That it can't talk to the drives correctly, or that the controller itself is having errors? Is it mainly driver/software related, or could something physical be causing the issues?

Each of those different numbers preceded by "ata" refers to a different port, so the controller itself is the suspect. I think the idea is that some manufacturers don't keep their Linux drivers updated.
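
If you want a rough feel for which ports are generating the errors, something like this (run from the console) counts the ata references in the current syslog:

grep -o 'ata[0-9]\+' /var/log/syslog | sort | uniq -c | sort -rn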

9 minutes ago, whitey88x said:

Since I posted a while ago and I'm getting anxious to get the system back up and running, I found a tutorial on rebuilding the filesystem from SpaceInvader on YouTube and followed it closely.

Disk 9 is now mountable and I have started the array with Parity 2 back in place - it's currently performing a rebuild with no read/write errors.

👍


Am I risking data loss by proceeding to rebuild the array using the RocketRaid 2740 and then migrating to the LSI 9305-16i when it arrives in a week or two? It should be a simple, clean switch-over, right?

 

The 2740 is currently running as JBOD, so Unraid just sees the disks as they are; however, if the controller is having issues talking to its own ports, could it potentially be writing bad data to the array? I assume my whole array would be cactus if that were the case, as I've been using the same controller for 3-4 years now, and before that I used a 2720.

 

Thanks so much for responding btw @trurl - Patience is not my virtue, and guiding me through the correct terminology has been helpful.

9 minutes ago, trurl said:

Do you have backups of anything important and irreplaceable? Parity isn't a substitute for a backup plan.

Only of my personal data. Media/Content is replaceable (which is 95% of the array).

I know Unraid isn't recommended as a backup plan in itself, but as a NAS for other devices on my network, there should be some expected reliability, right?

 


Parity only allows you to recover a missing or disabled disk. There are lots of more common ways to lose data, including user error.

 

Many of us don't try to keep complete backups of everything on our large capacity servers. My Unraid is the backup for other devices on my network, and I have offsite backups of anything important and irreplaceable. Everyone has to decide this for themselves.

 

Data loss is always possible, even with parity and even when everything is working well. It sounds like you have been able to recover so far, and that is the way things often work unless the user makes a mistake when trying to recover - such as formatting an unmountable disk and then expecting to rebuild it from parity, even though they were warned.


Makes total sense at the end of the day, considering I've spent a considerable amount changing every component in the system and still encounter the issues. My next thought was the SAS-to-SATA breakout backplane in my Norco 4224 chassis, but no matter which slot I placed a hard drive in, a different one would be dropped the next month, so there was no consistency to the fault.

 

Thanks to @trurl, I at least know where to look in the logs for future errors with the next controller I install, and hopefully I can go a few months without needing to reboot and rebuild drives.

 

Sorry again for the lack of knowledge, but I do appreciate your time helping me understand the issue better.

