6.10.2 Unraid is starting to get unreliable


Hi Guys,

 

Big fan of Unraid, I have 3 servers (so far), but my primary server is starting to flake.

No doubt it's not Unraid, it's likely a piece of hardware, but need some help identifying the culprit.

 

Still on 6.10.2 because my hall lights rely on a script that uses nc (netcat) to send raw packets to the controllers, and nc is not in the later versions of the NerdPack. Soon I will be replacing them with newer (Shelly) devices, which will eliminate this obstacle.

 

Symptoms:

Emby Docker starts misbehaving a bit, then the Unraid GUI becomes unresponsive, then SSH "reboot" commands (and all the variations I have tried) stop working, so it has to be a power cycle, with the inevitable parity check afterwards.

First happened in the first week of December, and again today.

 

I'm attaching the diagnostics after a fresh reboot; maybe there's a clue in there which will enable someone to point a finger (please not the motherboard!!!).

 

If there's any other info needed - please ask.

 

 

Thanks in advance, all the best

sdd

n1-diagnostics-20231228-1253.zip

Edited by salvdordalisdad
added timings
Link to comment

Hiya

 

 

OK, what a palaver...

 

There are two convincing-looking syslog dockers in the Unraid repositories. Neither seems to work, which takes some time to prove.

So I reverted to a known-good option: 3CDaemon. It runs on Windows as a TFTP/FTP/syslog server, from the days when 3Com were a switch manufacturer (yeah, the stone age).

But - importantly - it works.

 

So now waiting for some event logs to start populating it.

Not sure what events will generate a syslog entry, just to test that it's working.

So I used the 2nd server & stopped / started the array, BOY does it send messages?!

This process also stopped & started a few minor things so I got server-1 messages too.
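Rather than waiting for the server to generate an event, a hand-made test message can be pushed at the syslog server. A minimal sketch, assuming the server listens on the conventional UDP port 514 and accepts BSD-style (RFC 3164) lines; the function names, the "testhost" and "syslogtest" labels, and the address in the usage comment are made up for illustration:

```python
import socket
from datetime import datetime

# English month abbreviations, spelled out so the result does not depend on locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def build_syslog_message(text, hostname="testhost", tag="syslogtest",
                         facility=1, severity=6, when=None):
    """Format one BSD-style (RFC 3164) syslog line as bytes."""
    when = when or datetime.now()
    # PRI is facility * 8 + severity; facility 1 (user) and severity 6 (info) give 14.
    pri = facility * 8 + severity
    # RFC 3164 pads a single-digit day of month with a space, e.g. "Dec  3".
    stamp = f"{MONTHS[when.month - 1]} {when.day:2d} {when:%H:%M:%S}"
    return f"<{pri}>{stamp} {hostname} {tag}: {text}".encode("ascii", "replace")


def send_test_message(server, port=514, text="syslog test message"):
    """Send a single UDP datagram to the syslog server and return what was sent."""
    payload = build_syslog_message(text)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (server, port))
    return payload


# Usage (hypothetical address of the Windows box running the syslog server):
#   send_test_message("192.168.1.50")
```

UDP gives no delivery confirmation, so the only proof is the line showing up in the syslog server's window.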

 

NB - the "local" file method of syslog is NOT working. It doesn't save any log files in the stated location, at all.

Maybe someone could look at that one day? (or just take the option away if it doesn't work?)

 

Now we wait & see I guess.

Don't suppose anyone saw anything in the debug logs ?

I presume they're requested for a reason...

 

TIA

 

 

 

Link to comment

OK, 24 hours in & the syslog server is filled with these types of messages. All from this server, all "kernel" sourced.

 

Dec 30 10:58:44 n1 kernel: RSP: 0018:ffffc9000131fdb8 EFLAGS: 00010216
Dec 30 10:58:44 n1 kernel: RAX: 0000000000000000 RBX: ffff8881e54f3cc0 RCX: 0000000000100073
Dec 30 10:58:44 n1 kernel: RDX: 0000000000000000 RSI: ffff8881e54f3cc0 RDI: ffff88810658c960
Dec 30 10:58:44 n1 kernel: RBP: ffff8881d0ea6d18 R08: 000000000000d000 R09: 000014e7111f1000
Dec 30 10:58:44 n1 kernel: R10: 0000000000000002 R11: 0000000000000001 R12: ffff88810658c960
Dec 30 10:58:44 n1 kernel: R13: ffff88814c55d0c0 R14: ffff88810658c988 R15: ffff88810658c960
Dec 30 10:58:44 n1 kernel: FS:  0000150eb9581740(0000) GS:ffff8887fe8c0000(0000) knlGS:0000000000000000
Dec 30 10:58:44 n1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 30 10:58:44 n1 kernel: CR2: 00000000004aa000 CR3: 00000001d52ac000 CR4: 0000000000350ee0
Dec 30 10:58:49 n1 kernel: general protection fault, probably for non-canonical address 0xd16719a3d1666fb3: 0000 [#5162] SMP NOPTI
Dec 30 10:58:49 n1 kernel: CPU: 1 PID: 12418 Comm: lsof Tainted: G      D W         5.15.46-Unraid #1
Dec 30 10:58:49 n1 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C95/B550M PRO-VDH (MS-7C95), BIOS 2.80 06/22/2021
Dec 30 10:58:49 n1 kernel: RIP: 0010:show_map_vma+0x3c/0x134
Dec 30 10:58:49 n1 kernel: Code: 00 00 00 48 89 f3 4c 8b 6e 40 48 8b 4e 50 48 85 ed 74 1d 48 8b 45 20 4c 8b 86 98 00 00 00 48 8b 50 28 49 c1 e0 0c 48 8b 40 38 <44> 8b 4a 10 eb 08 45 31 c9 45 31 c0 31 c0 48 8b 53 08 50 4c 89 e7
Dec 30 10:58:49 n1 kernel: RSP: 0018:ffffc90001cf7db8 EFLAGS: 00010216
Dec 30 10:58:49 n1 kernel: RAX: b6b13a8300002709 RBX: ffff8881e54f3cc0 RCX: 0000000000100073
Dec 30 10:58:49 n1 kernel: RDX: d16719a3d1666fa3 RSI: ffff8881e54f3cc0 RDI: ffff888104f26348
Dec 30 10:58:49 n1 kernel: RBP: ffff88815da59748 R08: 000000000000d000 R09: 000014e7111f1000
Dec 30 10:58:49 n1 kernel: R10: 0000000000000002 R11: 0000000000000001 R12: ffff888104f26348
Dec 30 10:58:49 n1 kernel: R13: ffff88814c55d0c0 R14: ffff888104f26370 R15: ffff888104f26348
Dec 30 10:58:49 n1 kernel: FS:  000014a73d519740(0000) GS:ffff8887fe840000(0000) knlGS:0000000000000000
Dec 30 10:58:49 n1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 30 10:58:49 n1 kernel: CR2: 0000151f26fbe070 CR3: 00000001f6ae6000 CR4: 0000000000350ee0
Dec 30 10:58:49 n1 kernel: Call Trace:
Dec 30 10:58:49 n1 kernel: <TASK>
Dec 30 10:58:49 n1 kernel: show_map+0xa/0xd
Dec 30 10:58:49 n1 kernel: seq_read_iter+0x258/0x347
Dec 30 10:58:49 n1 kernel: seq_read+0xfc/0x11f
Dec 30 10:58:49 n1 kernel: vfs_read+0xa8/0x108
Dec 30 10:58:49 n1 kernel: ksys_read+0x76/0xbe
Dec 30 10:58:49 n1 kernel: do_syscall_64+0x83/0xa5
Dec 30 10:58:49 n1 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 30 10:58:49 n1 kernel: RIP: 0033:0x14a73d7cf3fe
Dec 30 10:58:49 n1 kernel: Code: c0 e9 e6 fe ff ff 50 48 8d 3d 4e 53 0a 00 e8 59 ea 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
Dec 30 10:58:49 n1 kernel: RSP: 002b:00007ffc803dd0f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Dec 30 10:58:49 n1 kernel: RAX: ffffffffffffffda RBX: 000000000042b2c0 RCX: 000014a73d7cf3fe
Dec 30 10:58:49 n1 kernel: RDX: 0000000000001000 RSI: 0000000000489250 RDI: 0000000000000004
Dec 30 10:58:49 n1 kernel: RBP: 000014a73d8a4520 R08: 0000000000000004 R09: 0000000000000000
Dec 30 10:58:49 n1 kernel: R10: 000014a73d854ac0 R11: 0000000000000246 R12: 000000000042b2c0
Dec 30 10:58:49 n1 kernel: R13: 0000000000000d68 R14: 000014a73d8a3920 R15: 0000000000000d68
Dec 30 10:58:49 n1 kernel: </TASK>
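To get a feel for how many of these there are and which processes trip them, the saved log can be scanned with a short script. This is only a sketch: the function name and regular expressions are my own, written against the line format shown above, and as far as I understand it the number in "[#5162]" is the kernel's running count of such faults since boot.

```python
import re
from collections import Counter

# Matches e.g. "kernel: general protection fault, ... [#5162] SMP NOPTI"
GPF_RE = re.compile(r"kernel: general protection fault.*\[#(\d+)\]")
# Matches e.g. "kernel: CPU: 1 PID: 12418 Comm: lsof Tainted: ..."
COMM_RE = re.compile(r"kernel: CPU: \d+ PID: \d+ Comm: (\S+)")


def summarise_faults(lines):
    """Return (faults seen, highest [#N] counter, Counter of process names)."""
    fault_count = 0
    highest_counter = 0
    processes = Counter()
    awaiting_comm = False  # True between a fault line and its "Comm:" line
    for line in lines:
        gpf = GPF_RE.search(line)
        if gpf:
            fault_count += 1
            highest_counter = max(highest_counter, int(gpf.group(1)))
            awaiting_comm = True
            continue
        comm = COMM_RE.search(line)
        if comm and awaiting_comm:
            processes[comm.group(1)] += 1
            awaiting_comm = False
    return fault_count, highest_counter, processes


# Usage (the file name is hypothetical; use whatever the syslog server saved):
#   with open("n1-syslog.txt", errors="replace") as fh:
#       print(summarise_faults(fh))
```

If one process name dominates, that says more about what was reading memory maps at the time than about the root cause, so treat the tally as a symptom count only.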

 

Server itself is fine, fully operational as far as I can tell, despite the "general protection fault" message in there...

 

If it fails, then I will upload the last messages etc. Meanwhile I'll let it be.

I will google that message (& probably end up down another rabbit hole...)
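For anyone else reading that message: on x86-64 with 48-bit virtual addresses (an assumption here, though it is the common case), an address is "canonical" only if bits 63 down to 47 are all the same. The faulting address in the log fails that test, while the ordinary kernel and user pointers in the register dump pass it. It also equals the RDX value in the dump plus 0x10, which suggests the kernel followed a garbage pointer. A small sketch to check this (the function name is my own):

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = addr >> (va_bits - 1)                 # bits 63..47 when va_bits is 48
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


FAULT_ADDR = 0xD16719A3D1666FB3   # from the "general protection fault" line
RDX_VALUE = 0xD16719A3D1666FA3    # RDX in the register dump that follows it
RBX_VALUE = 0xFFFF8881E54F3CC0    # RBX, a normal-looking kernel pointer
USER_RIP = 0x000014A73D7CF3FE     # user-space RIP from the same trace

print(is_canonical(FAULT_ADDR))        # False
print(is_canonical(RBX_VALUE))         # True
print(is_canonical(USER_RIP))          # True
print(RDX_VALUE + 0x10 == FAULT_ADDR)  # True
```

A pointer turning into noise like this fits corrupted memory, but a kernel or driver bug could produce the same thing, so it is a hint rather than proof.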

 

TIA

 

Link to comment

This rabbit hole begins to point towards a RAM problem...

 

There's a RAM test on the boot menu, so I'll have to add a graphics card to run that, maybe just re-seat the RAM to start with.

A 48-hour soak test would be a painfully long time to be without my prime server.

 

Any votes on this- yay or nay ?

 

The original RAM is still under warranty, but it needs to show up a failure...

Thanks for the sounding board!

;-)

Link to comment
If you have more than one RAM stick, you can try running with fewer for a while to see if the symptoms change or only occur with a specific stick. This covers both RAM sticks going bad and the case where too many RAM sticks are overloading the memory controller under load.

Link to comment

OK, well that was unexpected, but not unwelcome...

Removed one of the DIMM modules and rebooted (had to add a graphics card cos of the BIOS complaint, hurumpf).

18 hours later & very few such error messages in the syslog server (which I will now keep as it's good practice anyway!).

The parity check took exactly the same 11 hours, so that's a good sign, too.

In fact the memory stats page looks quite healthy with only a single 16GB DIMM module, so I am tempted to not put it back.

Of course I will put it back for completeness' sake & if it's still faulty, then it will need to be replaced - assuming I can get it through the Corsair Warranty System, which appears to be designed to avoid warranty claims!

Will need some more memory in the meantime, which is a bit pesky.

 

 

Will leave it for 48 hours to see if error messages resume.

Thanks for the sounding board.

<winky smile>

 

Link to comment

Update...

Single RAM stick = several days test = 0 errors. (Server WAS headless, no graphics card, but change in memory forced temp use of graphics card.)

Put the 2nd RAM stick back now that the memory looks good again; the BIOS recognised it, but the server refused to boot.

Long story short: fitted a new SATA PCIe x1 adapter, but now it refuses to boot without the graphics card. Slightly annoying, needs looking into, must be a BIOS setting, but it can wait.

 

Anyway, 12 hours after booting with both RAM sticks, still OK...no new GPF errors yet.

If it re-errors, that confirms the original diagnosis & the memory can go back for warranty; if not, end of job.

 

Update to follow.

Link to comment

Update...

4 days in & no General Protection Faults anymore...

So I will now close this as "maybe solved" by just re-seating the RAM sticks <?!> 

 

I'll also stop looking at syslog on a daily basis...

The syslog server is still running, so if there's a crash, I will look through & see if there's a clue...

fingers crossed!

YMMV

Link to comment
