Husker_N7242C

April 17, 2022

Hi everyone, I'm at a bit of a loss - every couple of days the server goes offline and is not responsive (including GUI and webGUI). Sometimes a disk is randomly disabled; once I add it back it usually stays enabled, but a different disk gets disabled a few days later.

I've replaced all SATA and power cables and swapped the drives around to try to find a pattern, but can't see one.

Someone suggested that 8TB Seagate Ironwolf drives have an issue with spinning down, so I disabled spin-down which didn't help.

I've tried running it with all dockers disabled and just one VM running (no hardware passthrough) and still get hangs.

I'm not sure the diagnostics will be much help, as I think the logs are cleared after each hang, but I've attached them just in case.

Any suggestions of what to try next??

Much appreciated guys!

Hardware basics:

AsRock x79 Extreme 11 with E5-2670

SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)

Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

32GB DDR3 1333MHz ECC Memory (4x 8GB)

7x 8TB Seagate Ironwolf ST8000VN04-2M2101 (Parity+6 data)

2x 3TB Seagate Barracuda ST3000DM007-1WY10G (data)

2x 2TB Seagate NAS ST2000VN000-1HJ164 (data)

1x 2TB Seagate Barracuda ST2000DM006-2DM164 (data)

Plus 3x Cache pools with 7 SSDs total (mix of Samsung and Crucial/Micron)

diagnostics-20220417-1647.zip

June 27, 2021

22 minutes ago, JorgeB said:

Syslog starts over after every reboot, enable syslog mirror to flash then post that log after a crash.

Thanks JorgeB! I've activated this to "local" server on a new cache-only share (my flash is about 10 years old so I won't tempt fate). I'll see what happens and upload the log. Thanks for the reply!
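
Because a mirrored syslog keeps growing across reboots, the interesting part is usually the last few lines written before each boot. A rough Python sketch of pulling those out is below. It assumes each boot shows up in the log as a kernel line containing "Linux version"; the function name and the context size are made up for illustration.

```python
from collections import deque

# Assumption: the first kernel message of every boot contains this text.
BOOT_MARKER = "Linux version"


def lines_before_each_boot(lines, context=20):
    """Return, for every boot after the first logged line, the last
    `context` lines that were written just before that boot."""
    recent = deque(maxlen=context)  # rolling window of the latest lines
    chunks = []
    for line in lines:
        if BOOT_MARKER in line and recent:
            # A new boot starts here, so the window holds the pre-crash lines.
            chunks.append(list(recent))
            recent.clear()
        recent.append(line.rstrip("\n"))
    return chunks


# Usage (the path is hypothetical, point it at wherever the syslog is mirrored):
#   with open("/mnt/user/syslog/syslog.log") as handle:
#       for chunk in lines_before_each_boot(handle):
#           print("\n".join(chunk), "\n---")
```

If the server hung and was reset, the chunk before the following boot is the place to look first; an empty or uneventful chunk would itself suggest the hang happened without the kernel logging anything.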

June 27, 2021

No response after 3 weeks 😞

I wish I knew where to start with analysing my own diagnostics file. Half the errors just say "post your diagnostics to the forum".

June 27, 2021

Hi everyone,

The Fix Common Problems plugin says the following:
Machine Check Events detected on your server - Your server has detected hardware errors. You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the unRaid forums. The output of mcelog (if installed) has been logged

It doesn't give me any hint on how to look into the issue myself, and I already have mcelog running, so my diagnostics are attached. I would appreciate it if anyone could give me some pointers.
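
As a starting point for looking into this yourself, the machine check lines can be filtered out of a syslog file. This is a minimal Python sketch, not the plugin's own method. It assumes the relevant lines mention "mce:", "mcelog", "Machine check" or "Hardware Error", and the file name in the usage comment is only a guess at where the syslog sits inside the diagnostics zip.

```python
import re

# Assumption: these substrings are what machine check related lines contain.
MCE_PATTERN = re.compile(r"mce:|mcelog|machine check|hardware error", re.IGNORECASE)


def find_mce_lines(lines):
    """Return every log line that looks related to a machine check event."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]


# Usage (hypothetical path inside the extracted diagnostics):
#   with open("logs/syslog.txt") as handle:
#       for hit in find_mce_lines(handle):
#           print(hit)
```

The matching lines normally name a CPU and a bank number, which is the detail worth quoting when asking for help.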

Background that might be related (or might not):

I recently removed a GPU and replaced it, if that helps. I had an original GTX Titan for transcoding for the past couple of years; however, it seemed to be crashing the whole system during transcodes, so I have removed it and I am waiting on a GTX 1660 Super to arrive. In the meantime I have an old GPU for GUI output only and am relying on the CPU for PLEX transcoding.

Thanks in advance

nas-diagnostics-20210627-1439.zip

June 5, 2021

Hi guys,
My server is randomly locking up completely (not even mouse/keyboard working on GUI).
I can't see anything obvious. Would someone be kind enough to have a look at my diagnostic file attached?

nas-diagnostics-20210605-2124.zip

August 18, 2020

I've ended up running the pre-clear script on both drives with pre and post read cycles. Both passed with no errors. I've added disk2 back to the array and copied 2TB of data back to it without an error. I'll add the parity back tomorrow and rebuild.

I saw a post in the Facebook group of (what looks like) the same thing happening to others. Parity gets disabled, nobody can work out why, they check the disk and put it back and all is good. Maybe it is a bug or UNRAID is disabling the drive too ruthlessly?

August 13, 2020

Thanks. I know the parity is on a PCIe HBA card in IT mode (2x SAS to 8x SATA). I did swap the SATA end from the SSD that is unassigned to the Parity which made no difference. I'll have a look and see which HDDs are on the HBA and see if there is a pattern.

August 13, 2020

Thanks again Johnnie (sorry I wrote tee-tee earlier, I was on my mobile and mis-read).

I attempted another parity sync but it failed and disabled the parity drive again (new diag attached).

Re: checksums, I do have Dynamix File Integrity Plugin installed. It runs monthly with SHA2. I have no clue how to use it to help my situation. I've attached a screenshot as it shows that disk2 "build" and "export" are not up to date.
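
The general idea behind using checksums here is to re-hash the files on the rebuilt disk and compare the results with hashes recorded while the data was known to be good. The Python sketch below shows only that idea, not how the plugin works internally. It assumes SHA-256 (one member of the SHA2 family) and a hypothetical dictionary of path to expected hash, which would have to be built from whatever the plugin exported.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(expected):
    """`expected` maps file path -> known-good hex digest (hypothetical input).
    Returns the paths whose current content no longer matches."""
    return [
        path
        for path, digest in expected.items()
        if sha256_of(path) != digest.lower()
    ]
```

Any path returned by find_mismatches would be a file that changed since the hash was recorded, which on a disk that should only have been rebuilt points at corruption. With the disk2 export out of date, files added after the last export would have no reference hash and cannot be checked this way.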

nas-diagnostics-20200813-1709.zip

August 12, 2020

Thanks tee-tee. I really appreciate the reply. I've rebooted and this time it has let me put the parity drive back and is trying a sync.
I don't think that disk2 can possibly have rebuilt correctly though. The rebuild failed after like 2 hours.
If this parity sync finishes, I'm not sure I can trust that I don't have a corrupt allocation table or something on disk2.

I've attached a fresh diagnostics in the hopes that it will now contain something useful?

nas-diagnostics-20200812-1906.zip

August 12, 2020

Hi guys, I've had a really unusual set of circumstances result in what appears to be multiple drive failures (but isn't). I've probably lost 2TB of data, I'm at risk of losing 8TB more, and I have no parity now. Please help if you can (diagnostics attached).

Order of events:

1. disk2 was disabled by UNRAID for SMART errors (7 year old 2TB)

2. Replaced disk2 with new Ironwolf 8TB

3. Rebuild failed for some reason and the parity disk was disabled by UNRAID. The parity is a fairly new 8TB Ironwolf, also with no SMART errors to date.

4. I tried to re-seat the cables and reboot just in case something came loose

5. After reboot disk2 appears to be rebuilt which is impossible in the short time it was rebuilding

6. Parity drive is still disabled and stopping the array and trying to remove and re-add it doesn't work, UNRAID doesn't want it.

7. I tried removing disk2 in the GUI because it is obviously corrupt. UNRAID tries putting data on it but the data just vanishes into oblivion.

8. Ran a read check on all drives hoping that UNRAID would see that disk2 is corrupt and let me do something with it but it got 0 errors from disk2

9. disk4 got 120,000 read errors from the test... another near-new Ironwolf 8TB with no SMART errors which was fine until now.
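
To see whether read errors like those in step 9 are confined to one disk or spread across several (which would point at a cable, controller or power problem rather than the drive), the syslog can be tallied per disk. This is a rough Python sketch. The line format "md: diskN read error" is an assumption about how the errors are logged and should be checked against the actual syslog before trusting the counts.

```python
import re
from collections import Counter

# Assumption: read errors are logged in the form "md: disk4 read error, ...".
READ_ERROR = re.compile(r"md: (disk\d+) read error")


def read_errors_per_disk(lines):
    """Count logged read errors for each array disk name."""
    counts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Usage (hypothetical path inside the extracted diagnostics):
#   with open("logs/syslog.txt") as handle:
#       print(read_errors_per_disk(handle).most_common())
```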

I don't even know where to start with this. I realise that 2TB from disk2 is probably gone. I can live with that I guess. I really don't want to lose another 8TB from disk4.

Any help would be received gratefully.

nas-diagnostics-20200812-1631.zip

February 9, 2020

Hi guys, as the title says, I had an RX570 passed through to a VM OK for the past few months (except it won't reset when you reboot the VM) and following a power failure it can't be allocated to a VM (isn't an option in the drop-down, just VNC). The RX570 DOES appear in Devices (as below) so I'm a bit miffed.

[1002:67df]09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)

[1002:aaf0]09:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]

The GTX Titan doesn't get used for VMs, that just does the GUI output and renders video for PLEX docker (Hardware Encoding).

Logs are attached so any advice would be appreciated. I've got trouble with USB devices also so if you see/know anything I can do to get USB passthrough working well with this hardware please let me know, I'd really appreciate it.

nas-diagnostics-20200209-0822.zip

November 3, 2019

I'm running v6.7.0. It takes a while to open the VM tab because it seems to cause many disks to spin-up.

What is the cause and is there a way to avoid this/reduce it?

It doesn't seem to matter if there are VMs running or not.

I have a few VMs that I rarely use with vdisks on the array but the main 2 VMs that are always running are on the cache.

Thanks guys.

June 14, 2019

Thanks Johnnie as always. That's a big bummer but at least I know what to do.

June 14, 2019

Guys, my cache pool was unbalanced, which was causing it to pause the VMs for no reason. I ran btrfs balance start -dusage=50 /mnt/cache from the terminal, which completed successfully, then I tried to run btrfs balance 75 but it gave an error that the cache was read only. The VMs and docker service crashed shortly after and I'm stuck.
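
For reference, the stepped balance described above can be written out as a small Python sketch. It assumes that "btrfs balance 75" was shorthand for the same command with -dusage=75, which is a guess. The helper names are made up, and the function that actually runs the commands needs root and a real btrfs mount.

```python
import subprocess


def balance_commands(mount_point, usage_steps=(50, 75)):
    """Build one `btrfs balance start -dusage=N` command per usage step."""
    return [
        ["btrfs", "balance", "start", f"-dusage={pct}", mount_point]
        for pct in usage_steps
    ]


def run_balance_steps(mount_point, usage_steps=(50, 75)):
    """Run the steps in order and stop at the first failure.
    check=True raises CalledProcessError if a step exits non-zero,
    for example when the filesystem has gone read-only."""
    for command in balance_commands(mount_point, usage_steps):
        subprocess.run(command, check=True)


# Usage: run_balance_steps("/mnt/cache")
```

Stopping at the first failed step matters here, since a pool that has already flipped to read-only will not accept a further balance.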

I ran docker safe permissions but it didn't help.

I'd appreciate some advice. Thanks!

nas-diagnostics-20190614-0801.zip


Posts posted by Husker_N7242C

Random Lock-ups and Disks Disabled

Server completely freezing - please help with my diagnostics attached

Server completely freezing - please help with my diagnostics attached

Machine Check Events detected on your server

Server completely freezing - please help with my diagnostics attached

UNRAID disabled "good disks" | Failed rebuild | Corrupted Disk | Problems keep escalating

UNRAID disabled "good disks" | Failed rebuild | Corrupted Disk | Problems keep escalating

UNRAID disabled "good disks" | Failed rebuild | Corrupted Disk | Problems keep escalating

UNRAID disabled "good disks" | Failed rebuild | Corrupted Disk | Problems keep escalating

UNRAID disabled "good disks" | Failed rebuild | Corrupted Disk | Problems keep escalating

GPU not available to VM after power failure

Disks spin up when opening the VM tab in webgui

Cache read only after btrfs balance - VMs and docker service won't start

Cache read only after btrfs balance - VMs and docker service won't start