salvdordalisdad Posted December 28, 2023 (edited)
Hi Guys,
Big fan of Unraid. I have 3 servers (so far), but my primary server is starting to flake. No doubt it's not Unraid; it's likely a piece of hardware, but I need some help identifying the culprit.
Still on 6.10.2 because my hall lights need to use the NC function in a script to send raw packets to the controllers, and it's not in the later versions of the NerdPack. Soon I will be replacing them with newer (Shelly) devices, which will eliminate this obstacle.
Symptoms: the Emby Docker starts misbehaving a bit, then the Unraid GUI becomes unresponsive, then SSH "reboot" commands (and all the variations I have tried) stop working, so it has to be a power cycle, with the inevitable parity check afterwards. It first happened in the first week of December, and again today.
I'm attaching the diagnostics after a fresh reboot; maybe there's a clue in there which will enable someone to point a finger (please not the motherboard!!!). If there's any other info needed, please ask.
Thanks in advance, all the best
sdd
n1-diagnostics-20231228-1253.zip
Edited December 28, 2023 by salvdordalisdad: added timings
trurl Posted December 28, 2023
Set up a syslog server.
salvdordalisdad Posted December 28, 2023
Good idea. Thanks. I checked and it's already set up, syslogging locally to one file in a directory, but there's nothing there. Tried an alternative location - still nothing being put in there. Also tried a remote syslog-ng docker on another server. Will wait and see if anything shows up.
...watch this space...
JorgeB Posted December 28, 2023
46 minutes ago, salvdordalisdad said: "Tried an alternative location - still nothing being put in there."
Make sure the remote server IP is filled in with your server's IP; alternatively, enable the mirror to flash drive option.
salvdordalisdad Posted December 29, 2023
Hiya
OK, what a palaver... there are two convincing-looking syslog dockers in the Unraid repositories. Neither seems to work, which takes some time to prove. So I reverted to a known-good option: 3CDaemon. It runs on Windows as a TFTP/FTP/syslog server, from the days when 3Com were a switch manufacturer (yeah, the stone age). But, importantly, it works.
So now I'm waiting for some event logs to start populating it. I wasn't sure what events would generate a syslog entry, just to test that it's working, so I used the 2nd server and stopped / started the array. BOY does it send messages! This process also stopped and started a few minor things, so I got server-1 messages too.
NB: the "local" file method of syslog is NOT working. It doesn't save any log files in the stated location at all. Maybe someone could look at that one day? (Or just take the option away if it doesn't work?)
Now we wait and see, I guess. Don't suppose anyone saw anything in the debug logs? I presume they're requested for a reason...
TIA
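A quicker way to confirm that a remote syslog listener is receiving anything, without stopping and starting an array, is to send it a hand-built test message. The following is only a minimal sketch under stated assumptions: it uses Python 3 on any machine on the LAN (not necessarily the Unraid box), assumes the listener accepts plain UDP on the conventional port 514, and builds a stripped-down BSD-syslog-style payload (priority and tag only, no timestamp or hostname), which many receivers accept but not necessarily all. The function names and the `syslog-test` tag are made up for illustration.

```python
import socket


def build_syslog_message(text, facility=1, severity=6, tag="syslog-test"):
    """Build a minimal BSD-syslog-style payload: <PRI>tag: text.

    PRI is facility * 8 + severity; the defaults (user, info) give 14.
    Timestamp and hostname are deliberately omitted in this sketch.
    """
    pri = facility * 8 + severity
    return f"<{pri}>{tag}: {text}".encode("utf-8")


def send_test_message(host, port=514, text="test message"):
    """Send one UDP datagram to the listener and return the bytes sent.

    UDP is fire-and-forget: no error here does not prove delivery,
    so check the receiving side for the message.
    """
    payload = build_syslog_message(text)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Calling `send_test_message("192.168.1.50")` (a placeholder address, substitute the machine running the syslog server) should make a single `syslog-test` line appear in the receiver if the path is working.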
JorgeB Posted December 29, 2023
42 minutes ago, salvdordalisdad said: "there are two convincing-looking syslog dockers"
You don't need a docker; it's included with Unraid: Settings - Syslog Server
salvdordalisdad Posted December 30, 2023
OK, 24 hours in and the syslog server is filled with these types of messages. All from this server, all "kernel" sourced.
Dec 30 10:58:44 n1 kernel: RSP: 0018:ffffc9000131fdb8 EFLAGS: 00010216
Dec 30 10:58:44 n1 kernel: RAX: 0000000000000000 RBX: ffff8881e54f3cc0 RCX: 0000000000100073
Dec 30 10:58:44 n1 kernel: RDX: 0000000000000000 RSI: ffff8881e54f3cc0 RDI: ffff88810658c960
Dec 30 10:58:44 n1 kernel: RBP: ffff8881d0ea6d18 R08: 000000000000d000 R09: 000014e7111f1000
Dec 30 10:58:44 n1 kernel: R10: 0000000000000002 R11: 0000000000000001 R12: ffff88810658c960
Dec 30 10:58:44 n1 kernel: R13: ffff88814c55d0c0 R14: ffff88810658c988 R15: ffff88810658c960
Dec 30 10:58:44 n1 kernel: FS: 0000150eb9581740(0000) GS:ffff8887fe8c0000(0000) knlGS:0000000000000000
Dec 30 10:58:44 n1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 30 10:58:44 n1 kernel: CR2: 00000000004aa000 CR3: 00000001d52ac000 CR4: 0000000000350ee0
Dec 30 10:58:49 n1 kernel: general protection fault, probably for non-canonical address 0xd16719a3d1666fb3: 0000 [#5162] SMP NOPTI
Dec 30 10:58:49 n1 kernel: CPU: 1 PID: 12418 Comm: lsof Tainted: G D W 5.15.46-Unraid #1
Dec 30 10:58:49 n1 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C95/B550M PRO-VDH (MS-7C95), BIOS 2.80 06/22/2021
Dec 30 10:58:49 n1 kernel: RIP: 0010:show_map_vma+0x3c/0x134
Dec 30 10:58:49 n1 kernel: Code: 00 00 00 48 89 f3 4c 8b 6e 40 48 8b 4e 50 48 85 ed 74 1d 48 8b 45 20 4c 8b 86 98 00 00 00 48 8b 50 28 49 c1 e0 0c 48 8b 40 38 <44> 8b 4a 10 eb 08 45 31 c9 45 31 c0 31 c0 48 8b 53 08 50 4c 89 e7
Dec 30 10:58:49 n1 kernel: RSP: 0018:ffffc90001cf7db8 EFLAGS: 00010216
Dec 30 10:58:49 n1 kernel: RAX: b6b13a8300002709 RBX: ffff8881e54f3cc0 RCX: 0000000000100073
Dec 30 10:58:49 n1 kernel: RDX: d16719a3d1666fa3 RSI: ffff8881e54f3cc0 RDI: ffff888104f26348
Dec 30 10:58:49 n1 kernel: RBP: ffff88815da59748 R08: 000000000000d000 R09: 000014e7111f1000
Dec 30 10:58:49 n1 kernel: R10: 0000000000000002 R11: 0000000000000001 R12: ffff888104f26348
Dec 30 10:58:49 n1 kernel: R13: ffff88814c55d0c0 R14: ffff888104f26370 R15: ffff888104f26348
Dec 30 10:58:49 n1 kernel: FS: 000014a73d519740(0000) GS:ffff8887fe840000(0000) knlGS:0000000000000000
Dec 30 10:58:49 n1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 30 10:58:49 n1 kernel: CR2: 0000151f26fbe070 CR3: 00000001f6ae6000 CR4: 0000000000350ee0
Dec 30 10:58:49 n1 kernel: Call Trace:
Dec 30 10:58:49 n1 kernel: <TASK>
Dec 30 10:58:49 n1 kernel: show_map+0xa/0xd
Dec 30 10:58:49 n1 kernel: seq_read_iter+0x258/0x347
Dec 30 10:58:49 n1 kernel: seq_read+0xfc/0x11f
Dec 30 10:58:49 n1 kernel: vfs_read+0xa8/0x108
Dec 30 10:58:49 n1 kernel: ksys_read+0x76/0xbe
Dec 30 10:58:49 n1 kernel: do_syscall_64+0x83/0xa5
Dec 30 10:58:49 n1 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 30 10:58:49 n1 kernel: RIP: 0033:0x14a73d7cf3fe
Dec 30 10:58:49 n1 kernel: Code: c0 e9 e6 fe ff ff 50 48 8d 3d 4e 53 0a 00 e8 59 ea 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
Dec 30 10:58:49 n1 kernel: RSP: 002b:00007ffc803dd0f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Dec 30 10:58:49 n1 kernel: RAX: ffffffffffffffda RBX: 000000000042b2c0 RCX: 000014a73d7cf3fe
Dec 30 10:58:49 n1 kernel: RDX: 0000000000001000 RSI: 0000000000489250 RDI: 0000000000000004
Dec 30 10:58:49 n1 kernel: RBP: 000014a73d8a4520 R08: 0000000000000004 R09: 0000000000000000
Dec 30 10:58:49 n1 kernel: R10: 000014a73d854ac0 R11: 0000000000000246 R12: 000000000042b2c0
Dec 30 10:58:49 n1 kernel: R13: 0000000000000d68 R14: 000014a73d8a3920 R15: 0000000000000d68
Dec 30 10:58:49 n1 kernel: </TASK>
Server itself is fine, fully operational as far as I can tell, despite the "general protection fault" message in there... If it fails, then I will upload the last messages etc. Meanwhile I'll let it be. I will google that message (and probably end up down another rabbit hole...)
TIA
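With a syslog file this noisy, it can help to reduce it to a few numbers before reading traces by hand. The sketch below is one possible way to do that in Python 3, not a tool that ships with Unraid: it counts the "general protection fault" lines, records the highest value seen in the `[#N]` field (which looks like the kernel's running count of such faults, 5162 in the excerpt above), and tallies which process names appear in the `Comm:` field. The function name and the regular expressions are assumptions based only on the log lines shown in this post.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt in this thread (an assumption,
# other kernel versions may format these lines differently).
GPF_RE = re.compile(r"general protection fault.*\[#(\d+)\]")
COMM_RE = re.compile(r"Comm: (\S+)")


def summarize_faults(lines):
    """Return (fault_count, highest_fault_number, Counter of process names)."""
    fault_count = 0
    highest_number = 0
    processes = Counter()
    for line in lines:
        match = GPF_RE.search(line)
        if match:
            fault_count += 1
            highest_number = max(highest_number, int(match.group(1)))
            continue
        match = COMM_RE.search(line)
        if match:
            # The "CPU: ... Comm: <name>" line names the process that was
            # running when the kernel logged the fault.
            processes[match.group(1)] += 1
    return fault_count, highest_number, processes
```

Feeding it an open log file, for example `summarize_faults(open("syslog.log"))` with a hypothetical file name, gives a rough picture of how often faults occur and whether one process (here `lsof`) is always the one that trips over them.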
salvdordalisdad Posted December 30, 2023
This rabbit hole begins to point towards a RAM problem... There's a RAM test on the boot menu, so I'll have to add a graphics card to run that; maybe just re-seat the RAM to start with. A 48-hour soak test would be a painfully long time to be without my prime server. Any votes on this: yay or nay? The original RAM is still under warranty, but it needs to show up a failure...
Thanks for the sounding board!
itimpi Posted December 30, 2023
11 minutes ago, salvdordalisdad said: "This rabbit hole begins to point towards a RAM problem... There's a RAM test on the boot menu, so I'll have to add a graphics card to run that; maybe just re-seat the RAM to start with. A 48-hour soak test would be a painfully long time to be without my prime server. Any votes on this: yay or nay? The original RAM is still under warranty, but it needs to show up a failure... Thanks for the sounding board!"
If you have more than one RAM stick, you can try running with fewer for a while to see if the symptoms change or only occur with a specific stick. This tends to cater both for RAM sticks going bad and for the case where too many RAM sticks are overloading the memory controller under load.
salvdordalisdad Posted December 30, 2023
Ooh, nice idea... thanks, I will do that this evening after (everyone else's) bedtime. (Assuming I remember)
salvdordalisdad Posted December 31, 2023
OK, well that was unexpected, but not unwelcome...
Removed one of the DIMM modules and rebooted (had to add a graphics card cos of the BIOS complaint, harrumph). 18 hours later and very few such error messages in the syslog server (which I will now keep, as it's good practice anyway!). The parity check took exactly the same 11 hours, so that's a good sign, too.
In fact the memory stats page looks quite healthy with only a single 16GB DIMM module, so I am tempted to not put it back. Of course I will put it back for completeness' sake, and if it's still faulty, then it will need to be replaced, assuming I can get it through the Corsair warranty system, which appears to be designed to avoid warranty claims! Will need some more memory in the meantime, which is a bit pesky.
Will leave it for 48 hours to see if error messages resume. Thanks for the sounding board. <winky smile>
salvdordalisdad Posted January 7
Update...
Single RAM stick = several days' test = 0 errors. (Server WAS headless, no graphics card, but the change in memory forced temporary use of a graphics card.)
Replaced the 2nd RAM stick, so now memory is good again; the BIOS recognised it, but refused to boot. Long story short, new SATA PCIe x1 adapter, but now it refuses to boot without the graphics card. Slightly annoying, needs looking into, must be a BIOS setting, but it can wait.
Anyway, 12 hours after booting with both RAM sticks, still OK... no new GPF errors yet. If it re-errors, it confirms the original diagnosis and the memory can go back for warranty; if not, end of job. Update to follow.
salvdordalisdad Posted January 11
Update... 4 days in and no General Protection Faults anymore... So I will now close this as "maybe solved" by just re-seating the RAM sticks <?!>
I'll also stop looking at syslog on a daily basis... It's still running, so if there's a crash, I will look through and see if there's a clue... fingers crossed! YMMV
salvdordalisdad Posted January 15
Update... fell over again this morning, NOTHING in the syslog. Maybe the GPF happened at a lower level than syslog was capable of? Locally connected screen says "kernel panic".
Have removed the offending (probably) memory module and rebooted, oh joy. Give it a week and then send the memory off for warranty.