HP Proliant / Workstation & unRaid Information Thread


1812

Bios is set to legacy; I vaguely recall reading somewhere that's what I should have it set to for this hardware and unRAID, but I can try switching it.

 

I edited that config line so many times I can't believe I mistyped "iommo=on" and didn't catch it; but if it doesn't matter anyway I can remove it.

 

I will try your steps as soon as I can; I appreciate the help as this is all new to me.

Link to comment
6 hours ago, ddot said:

Bios is set to legacy; I vaguely recall reading somewhere that's what I should have it set to for this hardware and unRAID, but I can try switching it.

Legacy should be fine, but I've had some machines behave better using one vs the other for no apparent reason.

 

6 hours ago, ddot said:

I edited that config line so many times I can't believe I mistyped "iommo=on" and didn't catch it; but if it doesn't matter anyway I can remove it.

yeah, not a biggie, and not needed in my past experience with HP ProLiant/workstation/server hardware.

 

6 hours ago, ddot said:

I will try your steps as soon as I can; I appreciate the help as this is all new to me.

glad to try and help. this was all new to me once!

Link to comment
14 hours ago, ddot said:

I edited the config; HVM and IOMMU are both enabled. The USB card works in the VM. Shutdown/restart of the VM locks up the server. Requested files attached. About to hit submit and realized I haven't tried switching the bios; I'll try that next when I can, only so much time after work on weekdays :/

Attachments: unraidserver-diagnostics-20191004-0332.zip (75.03 kB), syslog.txt (629.51 kB), VMlog.txt (5.07 kB)

get rid of iommu=pt from your syslinux config. while it shouldn't cause any problems, we want to rule things out. do that, reboot, test for lockup.

 

if it still happens, change acs override to: "pcie_acs_override=downstream,multifunction", reboot, retest.

 

as it currently stands the usb card is still not in an isolated group. If the steps above don't isolate it, then leave the configuration on the last one listed above and try another slot if one is available.
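
For anyone following along, one way to check whether the USB card has ended up in its own IOMMU group is to walk /sys/kernel/iommu_groups, which is where the Linux kernel exposes the grouping. This is only a minimal sketch: the function name list_iommu_groups is made up for illustration, and it prints bare PCI addresses rather than device names.

```shell
# Print every PCI device address together with its IOMMU group number.
# The optional first argument overrides the sysfs root (handy for testing).
list_iommu_groups() {
  iommu_root="${1:-/sys/kernel/iommu_groups}"
  for iommu_group in "$iommu_root"/*; do
    [ -d "$iommu_group/devices" ] || continue   # nothing here if IOMMU is off
    for iommu_dev in "$iommu_group"/devices/*; do
      [ -e "$iommu_dev" ] || continue           # skip empty groups
      echo "group ${iommu_group##*/}: ${iommu_dev##*/}"
    done
  done
}

list_iommu_groups
```

A device is isolated when it is the only address listed for its group number. Running lspci -nn -s with one of the printed addresses shows which device that address belongs to.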

 

Link to comment
22 hours ago, 1812 said:

get rid of iommu=pt from your syslinux config. while it shouldn't cause any problems, we want to rule things out. do that, reboot, test for lockup.

 

if it still happens, change acs override to: "pcie_acs_override=downstream,multifunction", reboot, retest.

 

as it currently stands the usb card is still not in an isolated group. If the steps above don't isolate it, then leave the configuration on the last one listed above and try another slot if one is available.

 

Without iommu=pt, the disks are missing in the array.

Link to comment
19 hours ago, ddot said:

Without iommu=pt, the disks are missing in the array.

well that's a new one for me...

 

Doing some research, it appears there are issues with the Marvell disk controller in the box. The original fix a few years ago was disabling virtualization altogether, which is not what you want.

 

try changing iommu=pt to amd_iommu=pt, reboot. If that does not "fix" the reset/lockup issue, you'll have to do research on how to use the Marvell 88SE9230 in Unraid. The simplest option might be to get a host bus adapter for the disks and disable the onboard controller, or someone might have figured out another workaround.

 

more info here: 

 

 

 

Link to comment
6 hours ago, 1812 said:

well that's a new one for me...

 

Doing some research, it appears there are issues with the Marvell disk controller in the box. The original fix a few years ago was disabling virtualization altogether, which is not what you want.

 

try changing iommu=pt to amd_iommu=pt, reboot. If that does not "fix" the reset/lockup issue, you'll have to do research on how to use the Marvell 88SE9230 in Unraid. The simplest option might be to get a host bus adapter for the disks and disable the onboard controller, or someone might have figured out another workaround.

 

more info here: 

 

 

 

amd_iommu=pt results in missing disks; they only show with iommu=pt.

 

I was looking at the hardware compatibility list in the unRAID wiki, and assume I should avoid anything Marvell... do you think something like an LSI SAS 9201-8i HBA would work? Or is there a better one for use with the microserver Gen10?

Link to comment
1 hour ago, ddot said:

amd_iommu=pt results in missing disks; they only show with iommu=pt.

 

I was looking at the hardware compatibility list in the unRAID wiki, and assume I should avoid anything Marvell... do you think something like an LSI SAS 9201-8i HBA would work? Or is there a better one for use with the microserver Gen10?

any listed compatible HBA should work. I've never used that one but I believe a few folks on here do. I use h220 because it was what came with my first HP server years ago, and they are pretty cheap and keep it in the family.  Avoid Marvell, and make sure the card length will fit in your server before ordering.

Link to comment
19 hours ago, 1812 said:

any listed compatible HBA should work. I've never used that one but I believe a few folks on here do. I use h220 because it was what came with my first HP server years ago, and they are pretty cheap and keep it in the family.  Avoid Marvell, and make sure the card length will fit in your server before ordering.

"Keep it in the family" makes sense. Not sure how I'm going to proceed. Thank you for all your help and insight!

 

Link to comment
  • 2 weeks later...

Hi! So I've been getting recurrent system crashes on a DL360e Gen8.  

274	PCI Bus		10/20/2019 05:00	10/20/2019 05:00	1	Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 0, Function 0, Error status 0x00100000)
273	System Error	10/20/2019 05:00	10/20/2019 05:00	1	Unrecoverable System Error (NMI) has occurred.  System Firmware will log additional details in a separate IML entry if possible

I'm on the most recent firmware (2019.05.24) and I've been motoring through the HPE forums and found this discussion which links the issue to a Matrox graphics driver in Windows.  Since I'm clearly not running windows, I'm uncertain which direction to go in.  

 

For me, I'm also unable to remove the pcie riser as I run my cache through an NVME drive on a pcie adapter card.  

 

Any ideas where to look?  I've also attached my diagnostics in case that helps.

avasarala-diagnostics-20191021-1353.zip

Thanks!

Link to comment
2 hours ago, noja said:

Hi! So I've been getting recurrent system crashes on a DL360e Gen8.  


274	PCI Bus		10/20/2019 05:00	10/20/2019 05:00	1	Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 0, Function 0, Error status 0x00100000)
273	System Error	10/20/2019 05:00	10/20/2019 05:00	1	Unrecoverable System Error (NMI) has occurred.  System Firmware will log additional details in a separate IML entry if possible

I'm on the most recent firmware (2019.05.24) and I've been motoring through the HPE forums and found this discussion which links the issue to a Matrox graphics driver in Windows.  Since I'm clearly not running windows, I'm uncertain which direction to go in.  

 

For me, I'm also unable to remove the pcie riser as I run my cache through an NVME drive on a pcie adapter card.  

 

Any ideas where to look?  I've also attached my diagnostics in case that helps.

avasarala-diagnostics-20191021-1353.zip

Thanks!

shot in the dark here, but if Unraid accessing the video controller is causing the issue (as connected to the bios problems) then possibly try isolating it from Unraid via vfio to start (the same method you'd use for isolating for passthrough) 

Link to comment
2 hours ago, 1812 said:

shot in the dark here, but if Unraid accessing the video controller is causing the issue (as connected to the bios problems) then possibly try isolating it from Unraid via vfio to start (the same method you'd use for isolating for passthrough) 

Thanks! Since I haven't isolated something on the host before, I just want to confirm:

 

System Devices says this: 

[102b:0533] 01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH

 

Syslinux currently looks like this:

image.png.f3b34105c97792955c052c68508ef2b0.png

 

New Version will look like this?

image.png.7236240d3fa695f1f8d3302b0235eef9.png

 

Thanks again

Link to comment
2 minutes ago, noja said:

Thanks! Since I haven't isolated something on the host before, I just want to confirm:

 

System Devices says this: 

[102b:0533] 01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH

 

Syslinux currently looks like this:

image.png.f3b34105c97792955c052c68508ef2b0.png

 

New Version will look like this?

image.png.7236240d3fa695f1f8d3302b0235eef9.png

 

Thanks again

close, the changes go in between "append" and "initrd=/bzroot" on the existing line, not on a new append line. If that doesn't work, you might also consider adding the other iLO component, 103c:3381, which would go after the device ID you have, with a comma separating them and no space. Unraid doesn't really have any interaction with iLO, but we're trying anything at the moment!
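
To make the placement concrete, here is a small sketch that builds the device list and prints what the finished line might look like. It assumes the IDs are handed to the kernel through the vfio-pci.ids= parameter (the screenshots are not reproduced here, so that is a guess at what they show), and build_vfio_ids is just an illustrative helper name.

```shell
# Join PCI vendor:device IDs with commas and no spaces,
# then prefix them with the (assumed) vfio-pci.ids= kernel parameter.
build_vfio_ids() {
  vfio_ids=""
  for vfio_id in "$@"; do
    vfio_ids="${vfio_ids:+$vfio_ids,}$vfio_id"
  done
  printf 'vfio-pci.ids=%s\n' "$vfio_ids"
}

# The parameter sits between "append" and "initrd=/bzroot" on the same line.
echo "append $(build_vfio_ids 102b:0533 103c:3381) initrd=/bzroot"
```

The point of generating the list this way is only to show the comma-with-no-space rule; typing the line by hand in the syslinux configuration works just as well.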

Link to comment
1 minute ago, 1812 said:

close, the changes go in between "append" and "initrd=/bzroot" on the existing line, not on a new append line. If that doesn't work, you might also consider adding the other iLO component, 103c:3381, which would go after the device ID you have, with a comma separating them and no space. Unraid doesn't really have any interaction with iLO, but we're trying anything at the moment!

Fair enough! I might as well add them both now.  It'll probably be a long while before I notice whether or not the solution works as I usually go about 2 weeks between crashes.  I'll give myself a reminder to update this post in like a month if I don't have any new crashes.

 

Thanks again!

Link to comment
  • 2 weeks later...
On 10/21/2019 at 4:07 PM, 1812 said:

If that doesn't work, you might also consider adding the other iLO component, 103c:3381, which would go after the device ID you have, with a comma separating them and no space. Unraid doesn't really have any interaction with iLO, but we're trying anything at the moment!

Well, unfortunately, it has not worked.  Of course when I was looking at the BIOS health in iLO, I didn't clue in that the issues happen every Sunday at 5:00pm EST.  The first week, I only isolated the Matrox component and I had a crash that day.  The second week I did the other iLO components like so:

 

image.png.a3e35678660f43908b46e0a9ba39384d.png

 

Are there any normal events in Unraid that would kick in every 7 days that I should look out for?  I looked at my scheduler and the only thing I have that might be close is that SSD Trim is set to go on the same day as the issues, but it happens at 1AM while the server crashes every time that day (Sunday) at 5:00pm.  The reason I mention that is that I added that NVMe SSD through the PCIe adapter - it may be around the same time as the crashes started happening.

 

I also still have an old sata ssd attached as an unassigned device.  Would that factor in? I think for now, I'll try disabling TRIM to see what happens.

Link to comment
3 hours ago, noja said:

Well, unfortunately, it has not worked.  Of course when I was looking at the BIOS health in iLO, I didn't clue in that the issues happen every Sunday at 5:00pm EST.  The first week, I only isolated the Matrox component and I had a crash that day.  The second week I did the other iLO components like so:

 

image.png.a3e35678660f43908b46e0a9ba39384d.png

 

Are there any normal events in Unraid that would kick in every 7 days that I should look out for?  I looked at my scheduler and the only thing I have that might be close is that SSD Trim is set to go on the same day as the issues, but it happens at 1AM while the server crashes every time that day (Sunday) at 5:00pm.  The reason I mention that is that I added that NVMe SSD through the PCIe adapter - it may be around the same time as the crashes started happening.

 

I also still have an old sata ssd attached as an unassigned device.  Would that factor in? I think for now, I'll try disabling TRIM to see what happens.

remove the space after the comma for the last device id and reboot.

 

you would have to check your scheduler and the logs to see what is happening at that time.... could be trim, could be anything.
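
One rough way to see what the server was doing around a crash is to filter the syslog down to a single hour. A sketch, assuming the usual "Mon DD HH:MM:SS" syslog timestamp format and that the log lives at /var/log/syslog; syslog_at_hour is a made-up helper name.

```shell
# Print syslog lines from one hour of one day.
# $1 = month abbreviation (e.g. Nov), $2 = day of month,
# $3 = two-digit hour (e.g. 17), $4 = log file to read
syslog_at_hour() {
  awk -v mon="$1" -v day="$2" -v hour="$3" \
    '$1 == mon && $2 == day && substr($3, 1, 2) == hour' "$4"
}

# Example: everything logged between 17:00 and 17:59 on Nov 3
# syslog_at_hour Nov 3 17 /var/log/syslog
```

Comparing the output for the crash hour across two or three weeks should make any recurring job stand out.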

Link to comment
On 11/4/2019 at 5:16 PM, 1812 said:

remove the space after the comma for the last device id and reboot.

 

you would have to check your scheduler and the logs to see what is happening at that time.... could be trim, could be anything.

Fixed! Either disabling TRIM or fixing my error on the Syslinux file did the trick.  The system finally had no issues yesterday.  I may try setting TRIM to run later today to see if that was the culprit.  Thanks again @1812!

 

**EDIT** - Yep - it's TRIM.  Time to figure out what the hell then.  

Edited by noja
Link to comment
  • 4 weeks later...
On 11/11/2019 at 8:54 AM, noja said:

Fixed! Either disabling TRIM or fixing my error on the Syslinux file did the trick.  The system finally had no issues yesterday.  I may try setting TRIM to run later today to see if that was the culprit.  Thanks again @1812!

 

**EDIT** - Yep - it's TRIM.  Time to figure out what the hell then.  

 

 

It's TRIM with the HP controller.  I have a similar problem: running it will completely fuck up my server. Trying to get a process list via ps or top will just hang, and eventually the server will stop responding to just about everything.  It won't reboot itself, and it requires a manual reset via the BMC or by pushing the power button.

 

I managed to catch a kernel panic  about "btrfs unable to handle kernel NULL pointer" the last time I ran it, but I was an idiot and forgot to save the full syslog output before I rebooted it. 

 

As you can imagine, I'm not real keen on running it again :)

 

Link to comment
On 12/6/2019 at 10:09 AM, DaiTengu said:

It's TRIM with the HP controller. 

That's interesting.  So this entire time, I've had an NVMe SSD attached as my cache drive with a PCIe adapter which I don't think would be attached to the on board HP controller.  I also have a SATA SSD sitting in the box as an unassigned drive but going through the controller. 

 

So, for the sake of "science," I'll try pulling the SATA drive and leaving the NVMe drive and then running TRIM again.  Given that I don't know $h*t about how TRIM works, I have no idea if that will cause a crash again.  But I'm willing to give it a go! You know - for "science"!
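
If it helps with the experiment, TRIM can also be triggered by hand for one mount point at a time with fstrim instead of waiting for the scheduler. The wrapper below is a cautious sketch: trim_one and the RUN variable are invented for illustration, /mnt/cache is assumed to be where the cache pool is mounted, and by default it only prints the command, given that a real run may hang the box.

```shell
# Trim a single mount point.
# Without RUN=1 in the environment this only prints what it would do.
trim_one() {
  if [ "${RUN:-0}" = "1" ]; then
    fstrim -v "$1"                       # really issue TRIM for this mount
  else
    echo "would run: fstrim -v $1"       # dry run, nothing is touched
  fi
}

trim_one /mnt/cache
```

Before trimming anything, lsblk --discard lists DISC-GRAN and DISC-MAX for each device; zeros in those columns mean the device does not advertise TRIM support through that path.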

Edited by noja
grammar good.
Link to comment
  • 1 month later...

Hi

My logs are full of these alerts:

Quote

Jan 27 15:54:48 Tower kernel: ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180810/exfield-390)
Jan 27 15:54:48 Tower kernel: ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180810/psparse-514)
Jan 27 15:54:48 Tower kernel: ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180810/power_meter-338)

I already added rmmod acpi_power_meter to my go file, but nothing changed.

Is there another solution?


Thanks

Link to comment
  • 3 weeks later...
On 1/27/2020 at 9:55 AM, deadnote said:

Hi

My logs are full of these alerts:

I already added rmmod acpi_power_meter to my go file, but nothing changed.

Is there another solution?


Thanks

run the command from a console.  that seems to stop it. yes, I have to do it after every reboot.
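
A quick way to confirm whether the module is actually loaded before and after running the command is to look for it in /proc/modules. A minimal sketch; module_loaded is a hypothetical helper, and it reads the module list on stdin so it can be pointed at /proc/modules on the server.

```shell
# Succeeds (exit 0) if the named module appears in /proc/modules-style
# text supplied on stdin, fails (exit 1) otherwise.
# $1 = module name, e.g. acpi_power_meter
module_loaded() {
  awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}
```

On the server this could be used as `module_loaded acpi_power_meter < /proc/modules && rmmod acpi_power_meter`, which only attempts the unload when the module is present.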

Link to comment
  • 4 weeks later...

Has anyone ever seen this type of error? I am getting scores of them when writing to the array. Many appear when writing at maximum throughput to the cache, but the errors point at different ATA devices (ata1, ata2). I don't get many when write throughput is lower.

 

Mar 23 04:05:07 Ovi kernel: ata2.00: exception Emask 0x50 SAct 0x20198 SErr 0x4890800 action 0xe frozen
Mar 23 04:05:07 Ovi kernel: ata2.00: irq_stat 0x0c400040, interface fatal error, connection status changed
Mar 23 04:05:07 Ovi kernel: ata2: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Mar 23 04:05:07 Ovi kernel: ata2.00: failed command: READ FPDMA QUEUED
Mar 23 04:05:07 Ovi kernel: ata2.00: cmd 60/40:18:f0:8d:72/05:00:b9:00:00/40 tag 3 ncq dma 688128 in
Mar 23 04:05:07 Ovi kernel: res 40/00:00:f0:8c:72/00:00:b9:00:00/40 Emask 0x50 (ATA bus error)
Mar 23 04:05:07 Ovi kernel: ata2.00: status: { DRDY }
Mar 23 04:05:07 Ovi kernel: ata2.00: failed command: READ FPDMA QUEUED
Mar 23 04:05:07 Ovi kernel: ata2.00: cmd 60/40:20:30:93:72/00:00:b9:00:00/40 tag 4 ncq dma 32768 in
Mar 23 04:05:07 Ovi kernel: res 40/00:00:f0:8c:72/00:00:b9:00:00/40 Emask 0x50 (ATA bus error)
Mar 23 04:05:07 Ovi kernel: ata2.00: status: { DRDY }
Mar 23 04:05:07 Ovi kernel: ata2.00: failed command: READ FPDMA QUEUED
Mar 23 04:05:07 Ovi kernel: ata2.00: cmd 60/48:38:68:96:72/02:00:b9:00:00/40 tag 7 ncq dma 299008 in
Mar 23 04:05:07 Ovi kernel: res 40/00:00:f0:8c:72/00:00:b9:00:00/40 Emask 0x50 (ATA bus error)
Mar 23 04:05:07 Ovi kernel: ata2.00: status: { DRDY }
Mar 23 04:05:07 Ovi kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 23 04:05:07 Ovi kernel: ata2.00: cmd 61/00:40:f0:8c:72/01:00:b9:00:00/40 tag 8 ncq dma 131072 out
Mar 23 04:05:07 Ovi kernel: res 40/00:00:f0:8c:72/00:00:b9:00:00/40 Emask 0x50 (ATA bus error)
Mar 23 04:05:07 Ovi kernel: ata2.00: status: { DRDY }
Mar 23 04:05:07 Ovi kernel: ata2.00: failed command: READ FPDMA QUEUED
Mar 23 04:05:07 Ovi kernel: ata2.00: cmd 60/f8:88:70:93:72/02:00:b9:00:00/40 tag 17 ncq dma 389120 in
Mar 23 04:05:07 Ovi kernel: res 40/00:00:f0:8c:72/00:00:b9:00:00/40 Emask 0x50 (ATA bus error)
Mar 23 04:05:07 Ovi kernel: ata2.00: status: { DRDY }
Mar 23 04:05:07 Ovi kernel: ata2: hard resetting link
Mar 23 04:05:17 Ovi kernel: ata2: softreset failed (device not ready)
Mar 23 04:05:17 Ovi kernel: ata2: hard resetting link
Mar 23 04:05:18 Ovi kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 23 04:05:20 Ovi kernel: ata2.00: configured for UDMA/33
Mar 23 04:05:20 Ovi kernel: ata2: EH complete

I've googled and read almost everything I can.

Errors happen on different devices (ata1, ata2, etc).

Some pointed to power supply, so I bought a (reconditioned) unit but it did not change the errors. :(

I can't find a different integrated breakout ATA cable so I can't change the cable.

Two of the drives are brand new, two are relatively new, and none have any SMART errors.

My cache SSD _does_ have one SMART warning of a remapped sector but I am convinced that is not the problem here.

I have changed Interrupts in the BIOS (but some are linked individual ones).

I have reset all cables multiple times.

I have changed from AHCI to Legacy and get slightly different error reporting but similar errors (ATA resets).

 

I recently purchased this gen8 microserver used and am starting to think I purchased someone else's problem.

Not sure where else to post or ask.

Any ideas of anything else I can try?

 

Thanks

wes

ovi-diagnostics-20200323-1345.zip

Edited by crazyhorse90210
Added diagnostics zipfile
Link to comment

It looks like the RAID controller or cables might be the issue.  I never thought cables could be so fickle till SFF cables came about.

 

Maybe reset the BIOS to factory and start over.

 

Do the F8 thing when booting and go into your RAID controller and make sure it sees all the drives properly.

 

This is one of those moments that you need to be careful or you might lose some data.  I know in my Gen7 system, I switched to using unRAID's parity software RAID over hardware.  So each drive in the system is set as a RAID 0.  I had too much trouble with my controller and telneting to the controller.  Since this system is an hour away from me, software RAID was easier to work with.

 

I know my solution isn't much, but just cause you got someone else's problem, doesn't mean they knew what they were doing.  Take your hardware back to factory and make changes.  If you still can't get it, maybe look for a new RAID controller.  I had that issue years back with a gen5, got it from eBay seller that rebuilds.  Told them the SFF cable was bad and same with controller.  Replaced it and no issues.

 

One last note, if your RAID controller has a piggyback memory module, try removing it and see what happens.  It's only a hardware cache and the controller will still work without it.  In fact, I recommend not using it.
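
To help decide between a single bad port and a shared controller or cable problem, it can be useful to tally the exceptions per ATA port from the syslog. A minimal sketch, assuming the kernel lines look like the "ataN.NN: exception Emask ..." entries posted above; count_ata_exceptions is an illustrative name.

```shell
# Count "exception Emask" kernel messages per ATA port in a log file.
# $1 = path to the syslog; prints "<port> <count>" sorted by port name.
count_ata_exceptions() {
  awk '/exception Emask/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/) {
             sub(/:$/, "", $i)    # drop the trailing colon from the port
             hits[$i]++
           }
       }
       END { for (port in hits) print port, hits[port] }' "$1" | sort
}

# Example: count_ata_exceptions /var/log/syslog
```

Errors spread evenly across every port would point more towards the controller, the shared breakout cable or power, while errors concentrated on one port would point towards that drive or its connector.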

Edited by ReidS
Link to comment
  • 1 month later...

Wow - this thread is a monster, and I have just gone through hours and hours of getting a simple Nvidia GT710 passed through to a Windows 10 VM.  I'm pretty experienced with building VMs with GPUs, so this came as a surprise.  I certainly won't buy another HP for using with Unraid!  In the end, here is what I recommend which got me working.

  • For clarity, I have a Proliant DL360e Gen 8 running Unraid 6.8.3
  • Before you do anything, follow the instructions from 1812 to obtain the HP Proliant build of unraid.  Seriously - don't waste time on RMRR or Operation Not Permitted (what I got).  Try the HP Proliant build first as it sorted everything out for me straight away at the Unraid level.  Doing a manual patch when you want to upgrade in the future is nothing compared to the pain if you don't use this IMHO.

https://github.com/AnnabellaRenee87/Unraid-HP-Proliant-Edition

  • Set your boot options in Syslinux configuration (replacing the PCI devices where necessary) : pcie_acs_override=id:10de:128b,10de:0e0f vfio_iommu_type1.allow_unsafe_interrupts=1 (possibly you don't need the ACS override - I haven't dared take it out as I have a working solution)
  • The main problem I experienced was the code 43 error in Windows.  No matter what I tried, I could not find a solution, until I read on the forums about GOPupd.  At first I didn't have much luck, until I did the following...
    • Installed the GT710 in my Windows 10 desktop
    • Ran GPU-Z to export the BIOS (my model was not available at techpowerup).
    • Used GOPupd to inject the UEFI code into the bios file (https://www.win-raid.com/t892f16-AMD-and-Nvidia-GOP-update-No-requests-DIY.html)
    • Used HxD and the instructions from SpaceInvaderOne to remove the extra lines at the top of the rom file.
    • Copied this rom file to my Unraid box.
    • Moved the GT710 back into the proliant
    • Added the GPU & the Sound Card portion into the VM using the new rom file
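
As a sanity check after trimming the header, a PCI option ROM image should begin with the two signature bytes 0x55 0xAA. The sketch below only tests those first two bytes and says nothing about whether the rest of the file is good; rom_has_signature is an invented helper name and the example path is a placeholder.

```shell
# Succeeds if the file starts with the PCI option ROM signature 55 AA.
# $1 = path to the edited ROM file
rom_has_signature() {
  rom_sig=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$rom_sig" = "55aa" ]
}

# Example (hypothetical path):
# rom_has_signature /path/to/gt710.rom && echo "header looks trimmed"
```

If the check fails, there are probably still extra bytes in front of the real start of the ROM.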

Once I had this ROM file, it was all good.  I can operate the VM as much as I do in my other servers with GPU passthrough.  No restrictions or issues found yet.

 

Thanks to all contributors on this forum and elsewhere, or I would have been lost.

Link to comment
  • 6 months later...

ACPI Error Message Issues

 

I was getting the same ACPI error messages detailed towards the start of this thread on my Gen8 Microserver, which was already running the latest J06 BIOS dated 04/04/2019.  I followed the advice given there, namely:

 

"This is essentially a misreading and does not affect anything but spamming your system log. To remove, disable "acpi_power_meter" module from the kernel by adding this line to the /boot/config/go file:

 

rmmod acpi_power_meter"

 

After a reboot I am now seeing these messages in the system log:

Nov 25 12:21:25 Borg kernel: ACPI: Early table checksum verification disabled

Nov 25 12:21:25 Borg kernel: ACPI: RSDP 0x00000000000F4F00 000024 (v02 HP )

Nov 25 12:21:25 Borg kernel: ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Pm1aControlBlock: 16/32 (20180810/tbfadt-564)

Nov 25 12:21:25 Borg kernel: ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Pm2ControlBlock: 8/32 (20180810/tbfadt-564)

Nov 25 12:21:25 Borg kernel: ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20180810/tbfadt-669)

Nov 25 12:21:25 Borg kernel: ACPI BIOS Warning (bug): Invalid length for FADT/Pm2ControlBlock: 32, using default 8 (20180810/tbfadt-669)

 

I have attached a copy of the post-reboot system log for your reference.

 

Is there anything to worry about here or are we all good to go now?

borg-syslog-20201125-0124.zip

Edited by peppingc
Add system log
Link to comment
