My unraid box has been unstable and I have no idea why



I've been using Unraid at home since 2018. I started with Basic, moved to Plus, and then upgraded to Pro, each upgrade about six months apart. I started on a Dell R710 and had to reboot monthly, which was fine stability-wise. After I built a custom box with a Rosewill case, a Supermicro motherboard, an Intel 4770, 16 GB of RAM, and an SSD for cache/VMs/containers, things seemed a bit better; I'd only have to reboot every 2-3 months.

Now, for the past 6 months, stability has been garbage. VMs and containers randomly crash or fail to start, so I suspected a bad SSD. I swapped the SSD; same issue. I started rebooting weekly and was fine for a bit. Now I'm getting out-of-memory errors, and frankly, I'm unhappy.
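For reference, OOM kills usually leave traces in the kernel log; a quick check from the console while the box is still up might look something like this (commands are standard util-linux/procps tools, not anything Unraid-specific):

dmesg -T | grep -i 'out of memory'    # look for kernel oom-killer events
free -h                               # snapshot of current memory and swap usage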
Then today, after a reboot, I see an 'unrecognizable file system' error on my SSD, and of course all of my VMs and containers are gone. (Luckily, I've made backups of my container data on my array.)

I can't use my Unraid box for more than a basic NAS at this point.

I've memtested my RAM for 6 hours; all passes and no errors.

I’m not sure what to do now. I need to fix this stability issue.

 

I don't have a spare motherboard or CPU to rule those out as possible issues. The PSU is a Corsair 500 W unit, so in theory power is stable.

Attachments: ironman-diagnostics-20230807-0804.zip, Logs from RAM.txt

9 hours ago, djgizmo said:

 

" After I made a custom box with a Rosewill Case, a SuperMicro motherboard, Intel 4770, 16Gb of ram, SSD for cache / VMs/containers.. things seemed a bit bettee, I’d only have to reboot every 2-3 months.   

...

I’ve memtested my ram for 6 hours, all passes and no errors."

On my hardware, a single-threaded memtest won't complete a full cycle in 6 hours, but it is very accurate. It might be best to let it run a full cycle, and make sure you're using the single-threaded (older) Memtest86; in my experience, the newer multi-threaded version will miss some RAM issues.

17 minutes ago, rkotara said:

On my hardware, a single-threaded memtest won't complete a full cycle in 6 hours, but it is very accurate. It might be best to let it run a full cycle, and make sure you're using the single-threaded (older) Memtest86; in my experience, the newer multi-threaded version will miss some RAM issues.

Which version should I run?


Done, and the syslog file is now being created on disk. Thank you.
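For reference, once syslog mirroring is in place, the persisted log can be searched after the next crash. A rough sketch, assuming the common "mirror syslog to flash" setup (the /boot/logs path is the usual flash-mirror location, but it depends on how the syslog server was configured):

ls -lh /boot/logs/                                                      # confirm the mirrored log exists
grep -iE 'out of memory|call trace|xfs|i/o error' /boot/logs/syslog*    # common crash signatures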


Ran an xfs_repair -n (read-only, no-modify check) on the SSD:
 

root@IRONMAN:~# xfs_repair -n /dev/sdi1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
invalid start block 1627196033 in record 208 of cnt btree block 2/704825
invalid start block 1627196033 in record 234 of cnt btree block 2/704825
invalid start block 1627196033 in record 236 of cnt btree block 2/704825
agf_freeblks 2545590, counted 2545587 in ag 2
agi unlinked bucket 26 is 125929626 in ag 2 (inode=662800538)
sb_icount 841152, counted 841408
sb_ifree 4855, counted 4673
sb_fdblocks 20284605, counted 20803245
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
free space (2,14441748-14441748) only seen by one free space btree
free space (2,14447755-14447755) only seen by one free space btree
free space (2,14447890-14447890) only seen by one free space btree
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 1
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 662800538, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 662800538 nlinks from 0 to 1
No modify flag set, skipping filesystem flush and exiting.


Do you recommend that I try to mount the SSD manually via the command line?
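For what it's worth, the ALERT at the top of that output suggests exactly this: mounting the filesystem replays the XFS metadata log, which should clear the spurious inconsistencies. A rough sketch (the /mnt/temp mount point is just an example):

mkdir -p /mnt/temp
mount /dev/sdi1 /mnt/temp    # mounting replays the XFS metadata log
umount /mnt/temp
xfs_repair -n /dev/sdi1      # re-run the read-only check afterwards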

 


Did so; it said it couldn't read the log. I followed the SpaceInvader One video on XFS repair and used -L, and the file system has been repaired.
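For anyone finding this thread later, that repair sequence is roughly the following; note that -L zeroes the metadata log, so any metadata changes that hadn't been flushed at crash time are discarded:

umount /dev/sdi1           # the filesystem must not be mounted
xfs_repair -L /dev/sdi1    # zero the log and repair; a last resort when the log can't be replayed
xfs_repair -n /dev/sdi1    # verify the filesystem now checks clean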

 

Now that I have my base data back, I need to know why this happened and how I can prevent it from happening again.
