My unraid box has been unstable and I have no idea why



I've been using Unraid at home since 2018. I started with Basic, moved to Plus, and then upgraded to Pro, each upgrade about six months apart. I started on a Dell R710 and had to reboot monthly, which was fine stability-wise. After I built a custom box with a Rosewill case, a Supermicro motherboard, an Intel 4770, 16 GB of RAM, and an SSD for cache/VMs/containers, things seemed a bit better; I'd only have to reboot every 2-3 months.

Now, for the past 6 months, stability has been garbage. VMs and containers randomly crash or fail to start, so I suspected a bad SSD. I swapped the SSD; same issue. I started rebooting weekly and was fine for a bit. Now I'm getting out-of-memory errors, and frankly, I'm unhappy.
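For reference, OOM kills usually leave traces in the kernel log; a quick check from the console while the box is still up might look something like this (commands are standard util-linux/procps tools, not anything Unraid-specific):

dmesg -T | grep -i 'out of memory'    # look for kernel oom-killer events
free -h                               # snapshot of current memory and swap usage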
Then today, after a reboot, I see an 'unrecognizable file system' error on my SSD, and of course all of my VMs and containers are gone. (Luckily, I've made backups of my container data on my array.)

I can't use my Unraid box for more than a basic NAS at this point.

I've memtested my RAM for 6 hours; all passes and no errors.

I’m not sure what to do now. I need to fix this stability issue.

 

I don't have a spare motherboard or CPU to rule those out as possible issues. The PSU is a Corsair 500 W unit, so in theory power is stable.

Attachments: ironman-diagnostics-20230807-0804.zip, Logs from RAM.txt

9 hours ago, djgizmo said:

 

" After I made a custom box with a Rosewill Case, a SuperMicro motherboard, Intel 4770, 16Gb of ram, SSD for cache / VMs/containers.. things seemed a bit bettee, I’d only have to reboot every 2-3 months.   

...

I’ve memtested my ram for 6 hours, all passes and no errors."

On my hardware, a single-threaded memtest won't complete a full cycle in 6 hours, but it is very accurate. It might be best to let it run a full cycle, and make sure you're using the single-threaded (older) Memtest86; in my experience, the newer multi-threaded version will miss some RAM issues.

17 minutes ago, rkotara said:

On my hardware, a single-threaded memtest won't complete a full cycle in 6 hours, but it is very accurate. It might be best to let it run a full cycle, and make sure you're using the single-threaded (older) Memtest86; in my experience, the newer multi-threaded version will miss some RAM issues.

Which version should I run?


Done, and the syslog file is now being created on disk. Thank you.
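For reference, once syslog mirroring is in place, the persisted log can be searched after the next crash. A rough sketch, assuming the common "mirror syslog to flash" setup (the /boot/logs path is the usual flash-mirror location, but it depends on how the syslog server was configured):

ls -lh /boot/logs/                                                      # confirm the mirrored log exists
grep -iE 'out of memory|call trace|xfs|i/o error' /boot/logs/syslog*    # common crash signatures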


Ran an xfs_repair -n (read-only, no-modify check) on the SSD:
 

root@IRONMAN:~# xfs_repair -n /dev/sdi1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
invalid start block 1627196033 in record 208 of cnt btree block 2/704825
invalid start block 1627196033 in record 234 of cnt btree block 2/704825
invalid start block 1627196033 in record 236 of cnt btree block 2/704825
agf_freeblks 2545590, counted 2545587 in ag 2
agi unlinked bucket 26 is 125929626 in ag 2 (inode=662800538)
sb_icount 841152, counted 841408
sb_ifree 4855, counted 4673
sb_fdblocks 20284605, counted 20803245
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
free space (2,14441748-14441748) only seen by one free space btree
free space (2,14447755-14447755) only seen by one free space btree
free space (2,14447890-14447890) only seen by one free space btree
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 1
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 662800538, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 662800538 nlinks from 0 to 1
No modify flag set, skipping filesystem flush and exiting.


Do you recommend that I try to mount the SSD manually via the command line?
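For what it's worth, the ALERT at the top of that output suggests exactly this: mounting the filesystem replays the XFS metadata log, which should clear the spurious inconsistencies. A rough sketch (the /mnt/temp mount point is just an example):

mkdir -p /mnt/temp
mount /dev/sdi1 /mnt/temp    # mounting replays the XFS metadata log
umount /mnt/temp
xfs_repair -n /dev/sdi1      # re-run the read-only check afterwards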

 


Did so; it said it couldn't read the log. I followed the SpaceInvader One video on XFS repair and used -L, and the file system has been repaired.
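For anyone finding this thread later, that repair sequence is roughly the following; note that -L zeroes the metadata log, so any metadata changes that hadn't been flushed at crash time are discarded:

umount /dev/sdi1           # the filesystem must not be mounted
xfs_repair -L /dev/sdi1    # zero the log and repair; a last resort when the log can't be replayed
xfs_repair -n /dev/sdi1    # verify the filesystem now checks clean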

 

Now that I have my base data back, I need to know why this happened and how I can prevent it from happening again.
