
Parity Disk on new tower fails



2 hours ago, JorgeB said:

Diags are after a reboot so we cannot see what happened. Disk looks healthy; replace/swap cables and try again, and if there are more errors post new diags before rebooting.

I replaced all SATA cables and power cables and it still gives me an error. I can run all the disks just fine and healthy in an array, but if I create a new config with one of the disks as parity, it runs a parity check for about 10 minutes, then stops the array and reports a hardware error, so I'm not sure if there is a setting I should change.


OK, I'll try again now, but in the meantime I found the error that was filling up my syslog. Seems like my GPU is at fault with the VM, perhaps?

Feb 24 13:04:31 BigBoy kernel: vfio-pci 0000:07:00.0: BAR 1: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
Feb 24 13:04:31 BigBoy kernel: vfio-pci 0000:07:00.0: BAR 1: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
Feb 24 13:04:31 BigBoy kernel: vfio-pci 0000:07:00.0: BAR 1: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
Feb 24 13:04:31 BigBoy kernel: vfio-pci 0000:07:00.0: BAR 1: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
Feb 24 13:04:31 BigBoy kernel: vfio-pci 0000:07:00.0: BAR 1: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
Feb 24 13:04:31 BigBoy kernel: vfio-pci 0000:07:00.0: BAR 1: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
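When a single message floods the syslog like this, it can help to collapse identical lines and count them so the rarer entries are not buried. The following is only a minimal sketch: the function name `count_messages` and the regular expression are illustrative assumptions, and it assumes the classic "month day time host message" syslog layout seen in the lines above.

```python
import re
from collections import Counter

# Assumed layout: "Feb 24 13:04:31 BigBoy kernel: message text"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def count_messages(lines):
    """Count identical messages, ignoring the timestamp and hostname."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts


# Sample input: the vfio-pci line from this post, repeated six times.
sample = [
    "Feb 24 13:04:31 BigBoy kernel: vfio-pci 0000:07:00.0: BAR 1: "
    "can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]",
] * 6

if __name__ == "__main__":
    # Print the most frequent messages first.
    for message, n in count_messages(sample).most_common(5):
        print(f"{n:6d}  {message}")
```

To run it against a real log, pass an open file instead of `sample`, for example `count_messages(open(path))` with whatever path your syslog is saved under.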

 

5 minutes ago, AceBoyden said:

*Sorry, it reboots after around 10 minutes of being on; it is not just the array stopping. Not sure why it keeps rebooting after around 10 minutes. I fear it might have something to do with the cache drive and how I set it up in appdata, but I'm no genius.

If the system is rebooting unexpectedly then I would think the most likely culprits are either a thermal problem or a PSU issue.

Just now, itimpi said:

If the system is rebooting unexpectedly then I would think the most likely culprits are either a thermal problem or a PSU issue.

It seems to only reboot after the Rust server on Pterodactyl starts. I moved it to start first (rebooted immediately), moved it last (rebooted after some time), and now I have deleted it and it has not rebooted yet. I started the array in maintenance mode now to see if that works for the parity sync, but I still keep getting "machine check events - Your server has detected hardware errors" and I'm not sure where to find out what is wrong or where :/

On 2/24/2023 at 5:29 PM, JorgeB said:

Those suggest a hardware problem. At first I understood that parity was getting disabled, not that the server was crashing. I see you are overclocking the RAM, so start here:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173

 

I disabled overclocking on the RAM and set it to "auto" in the BIOS, and also set "Power Supply Idle Control" to "typical current idle". I swapped the RAM to check (removed the sticks one by one and tried a new set as well), yet I still have the same issues. It appears it doesn't happen if the array is started in maintenance mode (this only lasted 11 hours, as that was how long the parity sync took; then I started the array normally). It went fine for a few hours, then I saw the server was down again (not fully down, but the array had been stopped and auto start was set to "no" again).

It appears that the OS itself restarts but not the system: I was in my server room when a crash happened and saw on the monitor in there that the OS was restarting but not the machine itself. I tried checking the drives as well, unplugging them one by one and testing each one alone, and the same problem still occurred. It still happens with no VMs or containers running. None of my friends or colleagues can find any errors, physical or in the diagnostic logs I sent them from before and after the crashes.

Any other suggestions? (I also swapped the CPU for a spare of the same type and the problem is still there.) Could it be the USB stick I have Unraid on? Everything should be fine there as well.


Not sure if relevant, but the "crash" has happened 19 times in the span of 1 hour.

 

 

Edit: disabled C-states in the BIOS now, so I'll see tomorrow whether that helped.

Edit edit: after checking the syslog after disabling C-states, I'm now getting an error and I'm not fully sure what it means, whether it is the CPU or the RAM...

 

Quote

Feb 25 23:37:34 BigBoy kernel: mdcmd (36): check correct
Feb 25 23:37:34 BigBoy kernel: md: recovery thread: check P ...
Feb 25 23:37:37 BigBoy kernel: nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DVI-D-0
Feb 25 23:37:39 BigBoy kernel: traps: xdg-desktop-por[20026] trap int3 ip:14afc24a1ca7 sp:7ffda76f2590 error:0 in libglib-2.0.so.0.6600.8[14afc2465000+88000]
Feb 25 23:37:39 BigBoy kernel: traps: xdg-desktop-por[19950] trap int3 ip:14fb1992bca7 sp:7fffd2027c80 error:0 in libglib-2.0.so.0.6600.8[14fb198ef000+88000]
Feb 25 23:37:51 BigBoy kernel: nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DVI-D-0
Feb 25 23:39:04 BigBoy kernel: mce: Uncorrected hardware memory error in user-access at 135690e640
Feb 25 23:39:04 BigBoy kernel: kernel BUG at arch/x86/kernel/cpu/mce/core.c:1582!
Feb 25 23:39:04 BigBoy kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Feb 25 23:39:04 BigBoy kernel: CPU: 17 PID: 19512 Comm: Job.Worker 8 Tainted: P M O 5.19.17-Unraid #2
Feb 25 23:39:04 BigBoy kernel: Call Trace:
Feb 25 23:39:04 BigBoy kernel: mce: [Hardware Error]: Machine check events logged
Feb 25 23:39:04 BigBoy kernel: [Hardware Error]: Uncorrected, software restartable error.
Feb 25 23:39:04 BigBoy kernel: [Hardware Error]: CPU:17 (19:21:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
Feb 25 23:39:04 BigBoy kernel: [Hardware Error]: Error Addr: 0x000000135690e640
Feb 25 23:39:04 BigBoy kernel: [Hardware Error]: IPID: 0x001000b000000000
Feb 25 23:39:04 BigBoy kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.
Feb 25 23:39:04 BigBoy kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Feb 25 23:39:13 BigBoy root: Fix Common Problems: Error: Machine Check Events detected on your server
Feb 25 23:39:13 BigBoy root: mcelog: ERROR: AMD Processor family 25: mcelog does not support this processor.
Please use the edac_mce_amd module instead.
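To see which CPU keeps reporting machine check events across many crashes, the hardware error lines can be pulled out of a saved syslog. This is a rough sketch only: the helper names `hardware_error_lines` and `cpus_reporting` are made up for illustration, and the patterns are assumptions based on the message format in the excerpt above. It does not decode the MC0_STATUS value.

```python
import re

# Patterns assumed from the log excerpt above; other kernels may word these differently.
MCE_RE = re.compile(r"\[Hardware Error\]|mce:|Machine Check", re.IGNORECASE)
CPU_RE = re.compile(r"CPU:\s*(\d+)")


def hardware_error_lines(lines):
    """Return only the lines that look like machine check reports."""
    return [line.rstrip("\n") for line in lines if MCE_RE.search(line)]


def cpus_reporting(lines):
    """Return the sorted CPU numbers mentioned in machine check lines."""
    cpus = set()
    for line in hardware_error_lines(lines):
        cpus.update(int(n) for n in CPU_RE.findall(line))
    return sorted(cpus)


# Sample input: three lines copied from the excerpt in this post.
sample = [
    "Feb 25 23:37:51 BigBoy kernel: nvidia-modeset: WARNING: GPU:0: "
    "Unable to read EDID for display device DVI-D-0",
    "Feb 25 23:39:04 BigBoy kernel: mce: [Hardware Error]: Machine check events logged",
    "Feb 25 23:39:04 BigBoy kernel: [Hardware Error]: CPU:17 (19:21:2) "
    "MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135",
]

if __name__ == "__main__":
    for line in hardware_error_lines(sample):
        print(line)
    print("CPUs reporting errors:", cpus_reporting(sample))
```

If the same CPU number shows up after every crash, that is one more data point to include when posting diagnostics. If the numbers vary, that is worth mentioning too.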

 

bigboy-diagnostics-20230225-2341.zip bigboy-syslog-20230225-2241.zip

Edited by AceBoyden
