Posts posted by \V/

  1. Backstory:
    I experienced multiple two-disk failures over the summer, caused by a combination of a heat wave and old drives. In one of those rebuilds, the recreated disks were totally corrupt due to bad parity. As some of the data was not backed up, this resulted in permanent data loss.

    How did this happen:
    Mover ran on its schedule while the array was in emulated mode, even though two drives had errors.

    Feature request:
    Don't let mover run when the array is in emulated mode OR when there are disk errors.
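
    To make the request concrete, here is a rough sketch of the kind of check I have in mind. It reads "key=value" lines and refuses if any disk is not healthy or has errors. The key names (rdevStatus.N, rdevNumErrors.N) and the status values DISK_OK and DISK_NP are my assumptions about what `mdcmd status` prints, and array_is_healthy is a made-up name, so treat this as an illustration rather than how Unraid actually does it.

```shell
# Sketch only: succeed (exit 0) when the array looks healthy enough for mover.
# Assumed input: "key=value" lines such as rdevStatus.0=DISK_OK and
# rdevNumErrors.0=0. These names are assumptions; check your own system.
array_is_healthy() {
    awk -F= '
        # Any slot that is neither OK nor "not present" counts as unhealthy.
        /^rdevStatus\./    && $2 != "DISK_OK" && $2 != "DISK_NP" { bad = 1 }
        # Any non-zero error counter also counts as unhealthy.
        /^rdevNumErrors\./ && $2 + 0 > 0                         { bad = 1 }
        END { exit bad + 0 }
    '
}

# Hypothetical usage (not run here):
#   mdcmd status | array_is_healthy && echo "ok to run mover"
```

    The point is only that the check is cheap: one pass over the status output before the scheduled run starts.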

  2. Hi all,

     

    I am wondering if it is safe to put this in my go file so that Unraid will auto reboot when suffering from a kernel panic:

     

    echo 20 >/proc/sys/kernel/panic

     

    My server has been experiencing occasional kernel panics since 6.9.2 (I suspect it's related to: 1, 2). I work long hours and hate coming home to a furnace when it happens. Will the above quick fix work?
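
    For reference, /proc/sys/kernel/panic holds the number of seconds the kernel waits before rebooting after a panic, and 0 means it never reboots on its own. Below is a slightly more careful version of the one-liner as a sketch. The function name set_panic_timeout and its optional second argument are my own additions so it can be tried on a scratch file first, and this sketch only accepts non-negative whole numbers.

```shell
# Sketch: write a panic reboot timeout (in seconds) and confirm it was stored.
# The optional second argument is only there for dry runs on a scratch file;
# by default the real sysctl file is written, which needs root.
set_panic_timeout() {
    local seconds=$1 target=${2:-/proc/sys/kernel/panic}
    case $seconds in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    echo "$seconds" > "$target" || return 1
    # Read the value back so a failed write does not go unnoticed.
    [ "$(cat "$target")" = "$seconds" ]
}

# Hypothetical usage in the go file (not run here):
#   set_panic_timeout 20
```

    As far as I understand, this only helps with a real panic. A hard hang that never panics would still leave the box stuck.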

     

    Thanks

  3. 12 hours ago, John_M said:

    From the results of that Google search, did you happen to read this thread? The thread, incidentally, pre-dates Ryzen. Check temperatures and voltages in the BIOS. Is the cooler seated properly? Each core has its own dedicated L2 cache. I find it difficult to believe that you have bad silicon that affects so many different cores. It's much more likely to be something like a faulty VRM on the motherboard. Can you try the CPU in another motherboard?

     

    I suggested unplugging all but the essentials to break the problem into more manageable pieces. That's how I tackle complex problems.

     

    Motherboard manufacturers publish Qualified Vendor Lists that show the memory modules they've tested with their products. The Pinnacle Ridge (AMD's codename for 2000-series of CPUs) QVL is here. All the ADATA 8 GB 2400 MHz parts on the list begin with AX4U2400... so yours isn't on the list. I'm curious to know why you bought 2400 MHz RAM to use with a Pinnacle Ridge processor. I use 3000 MHz RAM with mine and clock it at 2933 MHz, which is Pinnacle Ridge's specified speed. It's possible that ADATA has tested and verified the modules with your motherboard but from what you said about your previous system I'm thinking it's probably left over from that.

     

    When you reset the BIOS to default, any XMP that you've set is lost and the RAM defaults to the fastest JEDEC profile of 2133 MHz (DDR, so the actual clock is half of that). It's a safe starting point but to run memory optimally with 2000-series Ryzen without overclocking either the RAM or the memory management unit you need 2933 speed (or faster) RAM and you need to select 2933 MHz manually - again, it's DDR so the actual clock is half of that. On 1000-series Ryzen the optimum speed is 2666. Some people use faster memory - 3200, for example - but running the MMU at that speed is technically an overclock, which might be ok for gaming but not for a 24/7 server.

     

    The HX850 should be fine. Corsair is my choice of power supply, too. I use RM models in regular builds and AX models in servers. The HX fits between the two and I'd have no problem using one.

     

     

    I do have a spare VGA card, RAM, and PSU, and can temporarily unplug the HBAs. Gonna try troubleshooting with them. Unfortunately I don't have a spare Ryzen CPU or MB available. It's gonna take some serious convincing of the service center guy, as he's pretty used to dealing with fraudsters attempting to get free replacements. Last time I had to argue with the lady from a different service center that a SMART error count means the SSD is bad even if it still functions. Story of my PC building life. Anyhow, thanks for your thoughtful replies.

  4. 4 hours ago, pwm said:

     

    It's enough that you run the following to produce lots of CPU load:

    
    cat /dev/urandom > /dev/null

     

     

     

     

    Just tried that. Didn't change anything, still got the errors.
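
    One caveat on my test: as far as I can tell, a single `cat /dev/urandom > /dev/null` keeps roughly one core busy, while the errors in my log name several different CPUs. A sketch for loading every core for a fixed time is below. The function name burn_all_cores and its defaults are made up for illustration; it relies on the coreutils `timeout` and `nproc` commands.

```shell
# Sketch: start one urandom reader per core and stop them all after N seconds.
# Arguments: seconds to run (default 60), number of workers (default: nproc).
burn_all_cores() {
    local seconds=${1:-60}
    local workers=${2:-$(nproc)}
    local i=0
    while [ "$i" -lt "$workers" ]; do
        # timeout kills each reader once the time is up.
        timeout "$seconds" cat /dev/urandom > /dev/null &
        i=$((i + 1))
    done
    # Block until every background reader has exited.
    wait
}

# Hypothetical usage (not run here):
#   burn_all_cores 300
```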

     

     

    2 hours ago, John_M said:

     

    Amusingly, you've already voided the warranty on the WD80EMAZ drives you shucked from WD EasyStores! I'm not blaming you - I'd do it too if I could buy them here. But seriously, you don't want to be editing your diagnostics. Potentially personal stuff is already properly anonymised if you leave the box checked at Tools -> Diagnostics.

     

    But you're right. Your problem is not disk related. It's manifesting as a hardware error and it appears to be CPU related. However, that doesn't mean that it's definitely the CPU that's faulty. It could be the motherboard or the BIOS.

     

    Your diagnostics show that you have BIOS P1.50, which, as I said, is no longer the latest. There's a beta L1.52 that you can try.

    https://www.asrock.com/mb/AMD/X470 Taichi/index.asp#BIOS

     

    You might well have a faulty motherboard though. It's much more likely than a faulty CPU, unless you somehow managed to break a pin. Have you tried booting it with just a single DIMM installed and all PCIe slots empty (except for the graphics card, of course)? Your graphics card is in one of the X470 slots, rather than the usual CPU x16_1 slot, isn't it? You might try relocating it, as a test.

     

    Since you're running BIOS defaults, your RAM is currently running at JEDEC 2133 DDR, yes? Is it on the Pinnacle Ridge QVL? What is the exact part number for it?

     

    FWIW I have two Asus, two Gigabyte and one ASRock AM4 motherboards and the ASRock has been the most troublesome by some margin.

     

    What power supply are you using?

     

     

     

    Okay, I am on the latest beta firmware now. The errors still came up.

     

    I've done some googling and it looks like I may have a bad L2 cache (per the line "uop cache tag parity error" in syslog).

     

    And I just now caught the error after the array had started. Looks like I might have to RMA my 2700X.

     

    Granted, the system hasn't crashed even once. It's just a nuisance to see those errors.

     

     

    Edit:

    To answer your questions

    - Could the PCIe devices really be a probable culprit? These are the same devices I used before I migrated from an Intel Coffee Lake system. They worked fine there, with no MCE errors.

    - I have no idea what those terms mean. The part number is AD4U240038G17-R

    - HX850 (2017 revised ver)

  5. 18 minutes ago, John_M said:

    One of my unRAID servers uses the same processor as yours, though it's in an Asus X470 motherboard. I do have some experience with ASRock AM4 BIOSes though and I'd like to try to help you but as you've tampered with your diagnostics I feel the information they contain is misleading.

     

    One thing I would suggest is updating to the latest BIOS (there was a new beta released less than a week ago that fixes a couple of issues with P1.50) and making sure, as a first step to diagnosing your problem, that all BIOS settings are at default.

     

    Hi,

     

    I took out the WD drive serials since 1) people from the /r/datahoarder subreddit claim that publishing them on the open internet can be risky for future warranty claims, and 2) the issue I have doesn't seem hard disk related. All other logs are untampered.

     

    I recently updated the BIOS, and flashing it already reset the settings to default. I guess I'll wait for the next BIOS update and see.

     

     

    I really hope this is not related to faulty hardware, as it will be very painful to prove a complex problem like this one to the guy at my local service center.

     

  6. Hello all, my first post here.

     

    I have been getting a lot of Machine Check Event errors ever since I swapped to a new CPU, MB, and RAM.

     

    Specifically, this error happens nonstop after system startup and before the array starts. If I start the array, the MCE error spam ceases.

     

    The Unraid version is 6.5.3. My rig currently has a Ryzen 2700X, an ASRock X470 Taichi, and 4 sticks of ADATA Premier 8GB 2400 MHz RAM.

     

    Aug 21 18:48:18 Lawpet avahi-daemon[9033]: Server startup complete. Host name is Lawpet.local. Local service cookie is 1169872464.
    Aug 21 18:48:19 Lawpet avahi-daemon[9033]: Service "Lawpet" (/services/ssh.service) successfully established.
    Aug 21 18:48:19 Lawpet avahi-daemon[9033]: Service "Lawpet" (/services/smb.service) successfully established.
    Aug 21 18:48:19 Lawpet avahi-daemon[9033]: Service "Lawpet" (/services/sftp-ssh.service) successfully established.
    Aug 21 18:53:07 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: CPU:1 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 18:53:07 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: CPU:11 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 18:53:07 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 18:53:23 Lawpet ntpd[2087]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
    Aug 21 19:04:03 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: CPU:1 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:04:03 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: CPU:11 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:04:03 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:04:22 Lawpet root: Fix Common Problems Version 2018.08.05
    Aug 21 19:04:25 Lawpet root: Fix Common Problems: Error: Array is not started
    Aug 21 19:04:38 Lawpet root: Fix Common Problems: Error: Machine Check Events detected on your server
    Aug 21 19:04:38 Lawpet root: mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
    Aug 21 19:04:38 Lawpet root: CPU is unsupported
    Aug 21 19:04:39 Lawpet root: Fix Common Problems: Warning: You have a Ryzen CPU, but Zenstates not installed. ** Ignored
    Aug 21 19:09:30 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: CPU:9 (17:8:2) MC3_STATUS[-|CE|MiscV|-|-|-|-|SyndV|-]: 0x9820000000000150
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:09:30 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: CPU:1 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000502
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: CPU:5 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: CPU:15 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: CPU:11 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:09:30 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:14:58 Lawpet kernel: mce_notify_irq: 3 callbacks suppressed
    Aug 21 19:14:58 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: CPU:5 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:14:58 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: CPU:13 (17:8:2) MC3_STATUS[-|CE|MiscV|-|-|-|-|SyndV|-]: 0x9820000000000150
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:14:58 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:20:26 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 19:20:26 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:20:26 Lawpet kernel: [Hardware Error]: CPU:15 (17:8:2) MC3_STATUS[-|CE|MiscV|-|-|-|-|SyndV|-]: 0x9820000000000150
    Aug 21 19:20:26 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:20:26 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:20:26 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:20:26 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:25:53 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: CPU:5 (17:8:2) MC3_STATUS[-|CE|MiscV|-|-|-|-|SyndV|-]: 0x9820000000000150
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:25:53 Lawpet kernel: mce: [Hardware Error]: Machine check events logged
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: CPU:11 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: CPU:13 (17:8:2) MC3_STATUS[-|CE|MiscV|-|-|-|-|SyndV|-]: 0x9820000000000150
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Corrected error, no action required.
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: CPU:15 (17:8:2) MC3_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd820000000000150
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
    Aug 21 19:25:53 Lawpet kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
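
    The excerpt above names CPUs 1, 5, 9, 11, 13 and 15. To see at a glance how the events spread across cores, something like the following sketch could tally them from a saved syslog. The function name mce_per_cpu is made up, and it assumes the lines look exactly like the ones pasted here.

```shell
# Sketch: count "[Hardware Error]: CPU:N" lines per CPU from syslog text on stdin.
# Output is one "CPU:N count" pair per line, sorted by CPU number.
mce_per_cpu() {
    grep -F 'Hardware Error]: CPU:' |   # keep only the MCE status lines
        grep -o 'CPU:[0-9][0-9]*' |     # reduce each line to its CPU:N token
        sort -t: -k2,2n |               # numeric sort on the part after the colon
        uniq -c |                       # count consecutive duplicates
        awk '{ print $2, $1 }'          # print as "CPU:N count"
}

# Hypothetical usage (not run here):
#   mce_per_cpu < /var/log/syslog
```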

     

    I don't know how to read all of this. Your help will be greatly appreciated. Thank you!

     

     

     

    Edit:

     

    Also worth noting is this

    Aug 21 19:58:22 Lawpet root: mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
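
    Since the kernel lines above are already decoded ("Decode Unit Error: uop cache tag parity error"), I assume the edac_mce_amd decoder that the message points to is already active on my system, which would make this mcelog complaint harmless. A small sketch for checking that is below. The function name module_present is made up, and a driver built into the kernel may not show up under /sys/module, so a negative result is not conclusive.

```shell
# Sketch: report whether a kernel module has a directory under /sys/module.
# The optional second argument only exists so the check can be tried
# against a scratch directory.
module_present() {
    local name=$1 root=${2:-/sys/module}
    [ -d "$root/$name" ]
}

# Hypothetical usage (not run here):
#   module_present edac_mce_amd && echo "decoder module is loaded"
```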

     

     

    Solution update:

    Turns out I had a faulty CPU. I exchanged it and everything is fine now. So faulty Ryzens do exist, though I am not sure whether it is related to the segfault defect.

     
