HP Proliant / Workstation & unRaid Information Thread



2 hours ago, ReidS said:

Sorry for my delay in responding to this. I had to prioritize some other things over this project last week.

 

So I added the line above to allow unsafe interrupts, reselected the audio config for the Win10 system, and I'm still getting the same exact error as above. I went into the BIOS and tried to manually reassign the audio card to another IRQ and memory address, but the system keeps moving the other devices on the same interrupts along with it, like the RAID controller. I read earlier that the RAID controller doesn't play well with passthrough, so am I possibly a victim of circumstances here? I have 6 disks in this system running off the RAID controller, but not using the hardware RAID, so using the onboard SATA won't work with only 2 ports. I also have a PCIe external SATA card for an external array, but the array isn't plugged in at the moment and it's just a PnP card with no BIOS. I'm just tossing out some thoughts. I'm going to try to get this box working 100% by the end of the week so I can start the next leg of projects related to this system. If I need to order another audio card, I'm open to suggestions.

 

Thank you and everybody in advance for any help!

post your diagnostics zip file (tools>diagnostics)

 

 

Link to comment

Syslinux...

 

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS Test Settings
  menu default
  kernel /bzimage
  append vfio_iommu_type1.allow_unsafe_interrupts=1 initrd=/bzroot
label unRAID OS
  kernel /bzimage
  append initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label unRAID OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest
 

diagnostics-20180406-1252.zip

Link to comment
46 minutes ago, ReidS said:

Syslinux...

 

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS Test Settings
  menu default
  kernel /bzimage
  append vfio_iommu_type1.allow_unsafe_interrupts=1 initrd=/bzroot
label unRAID OS
  kernel /bzimage
  append initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label unRAID OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest
 

diagnostics-20180406-1252.zip

It's not booting with unsafe interrupts allowed (per the log). Delete everything you have in the syslinux config and replace it with the following:

 

 

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS
  menu default
  kernel /bzimage
  append vfio_iommu_type1.allow_unsafe_interrupts=1 initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot unraidsafemode
label unRAID OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest

 

 

Once done, verify the sound card is in its IOMMU group by itself and try again with the VM. If it fails, post your new diagnostics immediately after.
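
If you'd rather check the grouping from the console than from the System Devices page, here's a quick sketch that just walks /sys and prints each group's devices via lspci:

#!/bin/bash
# list every IOMMU group and the PCI devices assigned to it
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    echo "  $(lspci -nns "${d##*/}")"
  done
done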

 

Link to comment

Changed the syslinux 100% as requested. The audio card is in an IOMMU group (22) of its own now. I'm still getting the same error though, aside from the date part of the code. Attached is the diagnostics file taken immediately after. For some reason I thought I remembered seeing the audio card have two occurrences in System Devices, and now there's only one.

 

I'm doing this remotely and thankfully it hasn't locked up yet. Not sure what I did the one time, but that wasn't fun. I'll be on-site tomorrow (Tuesday, EST 9am-5pm) if there's any BIOS info needed. Again, I really appreciate the help. I've looked at the diag files myself, but I'm not experienced enough to know what I'm looking for. If there's somewhere I can do some reading to better understand the subject other than searches, I'd feel more useful. Currently I'm just 13th Warrior'ing this thing till it starts becoming common sense.

 

Thank you

diagnostics-20180416-2130.zip

Link to comment
10 hours ago, ReidS said:

Changed the syslinux 100% as requested. The audio card is in an IOMMU group (22) of its own now. I'm still getting the same error though, aside from the date part of the code. Attached is the diagnostics file taken immediately after. For some reason I thought I remembered seeing the audio card have two occurrences in System Devices, and now there's only one.

 

I'm doing this remotely and thankfully it hasn't locked up yet. Not sure what I did the one time, but that wasn't fun. I'll be on-site tomorrow (Tuesday, EST 9am-5pm) if there's any BIOS info needed. Again, I really appreciate the help. I've looked at the diag files myself, but I'm not experienced enough to know what I'm looking for. If there's somewhere I can do some reading to better understand the subject other than searches, I'd feel more useful. Currently I'm just 13th Warrior'ing this thing till it starts becoming common sense.

 

Thank you

diagnostics-20180416-2130.zip

 

Unsafe interrupts are now working, which allows us to unmask your real problem, that being:

 

vfio-pci 0000:09:00.0: Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.

 

This thread has some information regarding it. Some is in the first post, which may not be applicable on a G7 (it relies on using an older BIOS vs. an updated one, and works on G6 servers...). On the second page, another user posted the steps they used with a custom kernel they built to bypass the RMRR issue, based on a patch from the Proxmox community. Currently these are the only 2 options that I know of to work around the issue. I have not attempted to use the custom compiled kernel.
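
If it's useful, a quick generic check from the console will show which devices the platform is flagging; just a sketch, and the exact wording of the messages can vary by kernel version:

dmesg | grep -i -e rmrr -e 'ineligible for IOMMU'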

Link to comment

Dear 1812,

 

Thank you for your efforts. Your info has shed much light on my situation and, in a roundabout manner, accidentally fixed my issue. The audio card issue has not been resolved, but a workaround came out of these efforts. When my attempt to backdate the BIOS was thwarted, I decided to update to a more recent version instead: if I couldn't do one, why not try the other. Sure, unRAID doesn't see a USB audio adapter as a valid sound card, but now I can flag the USB audio device for the VM and it works through the remote desktop, which is just as good for me. This didn't work before my BIOS update and the allowance of unsafe interrupts, but now it does. The system before would see the USB audio device, but it wouldn't play through the remote desktop for reasons beyond my knowing. Now I can free up that slot for a USB 3.1 card and improve the dynamics of this box.

 

As an FYI, to restore the system's BIOS back to a pre-2011 version, you either have to have the original ROMPaq CD/DVD that came with the system (or have access to one) or have an active service agreement open with HP. I'm not putting this box under another service agreement, especially not with the flexibility that unRAID gives me. If this box dies: remove the unRAID USB and all the drives, and hello new equipment.

 

So the issue has not been resolved, but it has been worked around. I recommend that anybody having audio issues disable the sound card and use a USB audio device. It's not the fix we'd like to see, but it's cheap and it works. So: update the BIOS, update unRAID, allow unsafe interrupts, plug in a $10 USB audio device, and select that device as a USB device in the VM. Does one device work with multiple VMs? No, but if needed, purchase multiple USB audio devices from different manufacturers. It's still a cheap fix and it also keeps your system security up to date.
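
For anyone who ends up editing the VM XML by hand instead of using the USB device checkbox in the GUI, the passthrough stanza looks roughly like this; the vendor/product IDs below are placeholders, so run lsusb from the console to get the real pair for your adapter:

<hostdev mode='subsystem' type='usb' managed='no'>
  <source>
    <vendor id='0x0d8c'/>   <!-- placeholder: vendor ID from lsusb -->
    <product id='0x0014'/>  <!-- placeholder: product ID from lsusb -->
  </source>
</hostdev>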

 

Thank you 1812.  Maybe that helps someone else?

Link to comment
  • 3 weeks later...

The only issue is managing the drive. You can download the tools and run them, but I haven't fully tested that. I currently have two 120GB SSDs running in RAID 1; I did this for our database to increase response time. Runs fine. If I lose a drive, it'll just drop response back to normal. The only downside is that the system will not notify me if I lose a drive, because unRAID isn't managing it; it still sees it as a single drive. If that's a concern, run the drives as RAID 0 and just make sure you have a parity drive large enough. It should give you the same results, but you'll know when one of the drives fails. Just be aware that any maintenance to the RAID array may require you to take the system down.
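
Since unRAID won't alert you, you could also poll the controller yourself from the command line. A rough sketch, assuming you've installed HPE's ssacli (or the older hpssacli) tool, which stock unRAID does not include, and that the controller sits in slot 0 (run "ssacli ctrl all show config" to find your slot number):

ssacli ctrl all show status             # controller / cache / battery status
ssacli ctrl slot=0 ld all show status   # logical drive (the RAID 1) status
ssacli ctrl slot=0 pd all show status   # each physical drive behind it

Something like that dropped into a scheduled script could email you when a member drive drops out.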

 

I didn't cover RAID 5 because, even though you gain a good bit of performance with 4+ drives and also have redundancy, you don't get the integrated manageability with unRAID. It would run like a RAID 0, but with some perks.

 

If you really want speed and have a spare PCI slot, I'd go that route.

Link to comment

My video editing server uses the onboard RAID controller to manage 3 different higher-performing RAID groups without issue. But as stated, if something fails there is no notification.

 

In 6.2.4, using the onboard controller and an HBA together caused the machine to hang or not boot. I have not retested in subsequent versions (didn't have the need).

 

 

There has been a request (for about a year now) in the feature request section for unRAID to have multiple cache pools. Maybe someday we will get it...

Link to comment

Hmm, you are correct about the notification. As these are 'dummy' boxes where I run VMs and Dockers (inside the VM, for CUDA), I back up everything, so I'm not sure it's a problem.

On the other hand, my 2 SSDs are really too small to run as 2 separate devices in the unRAID array. 240GB...

 

I did ask because, since I moved to unRAID and a VM on this specific box, the responsiveness (I believe it's the read/write) is slow. Ubuntu runs quite slowly on this system...

So this HP has a 240GB SATA SSD from HP, and my new system with the same unRAID/VM setup has an Intel 960 Pro. Maybe I'm spoiled? The HP seems to hang...

Link to comment

I had the same issues with Ubuntu on my HP: it runs slow and/or laggy. I found this not to be a drive speed issue, but more of a configuration and driver issue. I switched from the default settings and went with SeaBIOS and the i440fx chipset.
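
For reference, this is roughly what the relevant bit of the VM XML ends up as with that combination: with no <loader> element present, libvirt falls back to SeaBIOS, and the i440fx machine type shows up in the <type> line. The version suffix below is just an example and depends on your unRAID/QEMU release:

<os>
  <type arch='x86_64' machine='pc-i440fx-2.11'>hvm</type>
  <boot dev='hd'/>
</os>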

 

If I used the defaults, the system would hang or not boot right. This could have been a fault of my own, but I tried many approaches and the default virtual hardware would still cause it to fail on my DL360 G7. I'm still experimenting on the older G5s I have.

Link to comment
8 hours ago, ReidS said:

I had the same issues with Ubuntu on my HP: it runs slow and/or laggy. I found this not to be a drive speed issue, but more of a configuration and driver issue. I switched from the default settings and went with SeaBIOS and the i440fx chipset.

 

If I used the defaults, the system would hang or not boot right. This could have been a fault of my own, but I tried many approaches and the default virtual hardware would still cause it to fail on my DL360 G7. I'm still experimenting on the older G5s I have.

 

Aha! I'll test it. It seemed weird, as for almost 2 years I had bare Ubuntu on that machine and it was super... and now it's very laggy.

cheers

Link to comment

Dear @1812, I have a problem with my Proliant. I would appreciate it if you or any other Proliant user here could shed some light.

 

I have an ML110 Gen9 which is working fine with unRAID, 3x LFF HDDs, and 2x SSDs. The SSDs are in 2.5"-to-3.5" caddies.

 

I needed a second unRAID server, so 3 weeks ago I bought an ML150 Gen9. I installed unRAID (on USB) with 2x LFF HDDs and 1x SSD (with a 2.5"-to-3.5" converter). All was fine. I then needed to install a second SSD to make a cache pool. As soon as I installed the second SSD (the 4th drive in the cage), something happened to the server. From that moment on, the server wouldn't detect any LFF HDD; there is no problem with the SSDs though. After a long conversation with HPE tech support and many resets (including SW6), it was not possible to correct. Then the HPE tech replaced the backplane and all was fine again (seemingly). But as soon as I installed 2x LFF HDDs and 2x SSDs, the problem repeated. HPE had to issue a DOA report.

 

What should I do? I am afraid the problem will repeat if I get another ML150 Gen9. Or should I go for an ML110 Gen9, as that one is tested? By the way, all the HDDs and SSDs are non-HP brands.

 

Thanks for your help.

 

Link to comment
2 hours ago, sse450 said:

Dear @1812, I have a problem with my Proliant. I would appreciate it if you or any other Proliant user here could shed some light.

 

I have an ML110 Gen9 which is working fine with unRAID, 3x LFF HDDs, and 2x SSDs. The SSDs are in 2.5"-to-3.5" caddies.

 

I needed a second unRAID server, so 3 weeks ago I bought an ML150 Gen9. I installed unRAID (on USB) with 2x LFF HDDs and 1x SSD (with a 2.5"-to-3.5" converter). All was fine. I then needed to install a second SSD to make a cache pool. As soon as I installed the second SSD (the 4th drive in the cage), something happened to the server. From that moment on, the server wouldn't detect any LFF HDD; there is no problem with the SSDs though. After a long conversation with HPE tech support and many resets (including SW6), it was not possible to correct. Then the HPE tech replaced the backplane and all was fine again (seemingly). But as soon as I installed 2x LFF HDDs and 2x SSDs, the problem repeated. HPE had to issue a DOA report.

 

What should I do? I am afraid the problem will repeat if I get another ML150 Gen9. Or should I go for an ML110 Gen9, as that one is tested? By the way, all the HDDs and SSDs are non-HP brands.

 

Thanks for your help.

 

 

What HBA are you using, and is the onboard RAID controller still enabled? Or are you using onboard RAID, and if so, what model?

Link to comment
3 hours ago, 1812 said:

 

What HBA are you using, and is the onboard RAID controller still enabled? Or are you using onboard RAID, and if so, what model?

 

The onboard controller is a B140i, but I disabled it and selected AHCI in the BIOS, so no RAID at all. Just for testing, I also installed an LSI 9207-8i in a PCIe slot. The result was the same, i.e. after the problem happens it shows only the SSDs and none of the LFF HDDs. It seems to me that something is wrong with the hot-plug backplane of the HDD cage, but the replacement backplane showed the same behaviour as well.

 

Thanks for helping.

Link to comment

anyone getting this flooding error?

 

[83718.488013] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170728/psparse-550)
[83718.488020] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170728/power_meter-338)
[83728.528274] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20170728/exfield-427)
[83728.528279] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170728/psparse-550)
[83728.528286] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170728/power_meter-338)

Link to comment
12 hours ago, 1812 said:

Ok. Replicate the problem again, making the drives disappear. Then, immediately after they are gone, grab the diagnostics zip (Tools > Diagnostics) and post that complete zip file. That might help us know what is going on, and maybe why.

 

The problem is not about drives disappearing after unRAID starts. It is a detection problem in the BIOS: the Proliant's BIOS doesn't see the 3.5" HDDs. There is nothing unRAID can do if the BIOS cannot detect the HDDs.

 

Anyway, a couple of weeks ago I asked about this problem on this forum:

https://lime-technology.com/forums/topic/71583-my-server-lost-all-35-hdds-but-keep-25-ssds/?tab=comments#comment-657485 

There is a diagnostics file attached in my post.

 

It happens in this sequence:

1. Perfectly working order with 2x LFF HDDs + 1x SSD

2. Poweroff the server

3. Insert a 2nd SSD (totalling 2x LFF HDDs + 2x SSD)

4. Power on the server

5. BIOS detects only the 2x SSDs. Nothing about the LFF HDDs. From this point on, the server is broken somehow.

6. Poweroff server

7. Insert just 1x LFF HDD and nothing else (for testing)

8. The server still cannot detect the HDD. It means the damage is done and irreversible; there is no way to get it to detect any LFF HDD.

9. Poweroff server.

10. Replace the backplane with a new one

11. Insert 2x LFF HDDs + 1x SSD

12. Go to the 1st step above. And so on.


I have never seen such a problem before. Stumped.

 

Thank you for taking time to help me.

 

Link to comment
3 hours ago, daemonix said:

anyone getting this flooding error?

 

[83718.488013] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170728/psparse-550)
[83718.488020] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170728/power_meter-338)
[83728.528274] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20170728/exfield-427)
[83728.528279] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170728/psparse-550)
[83728.528286] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170728/power_meter-338)

The fix for this log spam is on the first page.
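
For anyone landing here from a search: as I understand it, the workaround amounts to keeping the acpi_power_meter driver out of the running kernel, since HP's ACPI power-meter method is what trips these AE_AML_BUFFER_LIMIT errors (note the "power_meter" references in the log). A rough sketch of the kind of line people add; see the first page for the exact, tested version:

# in /boot/config/go: unload the ACPI power meter driver so it stops spamming the syslog
modprobe -r acpi_power_meter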

Link to comment

Curious... Once the LFF drives are no longer being detected, can you remove them, put them in another system, and see them there? The thought here is to work out whether this is a storage controller issue or a drive issue. Even if you can't see the drives in another system, it could still be a storage controller problem (card, cable, or backplane). I don't think this is an SSD issue.
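
A quick, generic way to do that check from another Linux box (or from the unRAID console, if the controller there can still see the drive); just a sketch, with sdX standing in for whatever device node the drive comes up as:

lsblk -o NAME,MODEL,SIZE    # does the kernel enumerate the drive at all?
smartctl -i /dev/sdX        # does the drive identify itself and respond?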

 

A thought I have: SSDs are always ready. Add power, go! HDDs take time to spin up, especially if they're 15k RPM SAS-style drives. When the controller gets power and says hello to the first drives, which would be the SSDs, it may only allow so much time for the other drives to say hello. If those drives aren't ready, the controller starts up and reports them as unavailable. I've seen this with older storage controllers that are pre-SSD, and some post. Usually newer controllers can take a firmware upgrade that will increase the boot-time allowance for this.

 

I had something like what you're having with an old Dell server and an Intel RAID controller about 5 years back. We had four SSDs and two HDDs, but we kept losing the HDDs when configuring the server and constantly had to rebuild the mirroring between the two. After a long discussion with Intel support, we changed the RAID controller to another controller and had to switch it to a specific firmware version. This was not a user error and there's nothing to configure, which adds to the frustration. It was all based on the ready state of the drives and the factory-configured allowance in the controller for ready time from the first drives to the last.

 

So once your drives fail, see if they can be seen in another system. If so, you might be having a similar issue, and you'll have to add another controller to separate the drives, or maybe see if there's a firmware version for the controller you can switch to that relieves the issue.

Link to comment
2 hours ago, sse450 said:

 

The problem is not about drives disappearing after unRAID starts. It is a detection problem in the BIOS: the Proliant's BIOS doesn't see the 3.5" HDDs. There is nothing unRAID can do if the BIOS cannot detect the HDDs.

 

Anyway, a couple of weeks ago I asked about this problem on this forum:

https://lime-technology.com/forums/topic/71583-my-server-lost-all-35-hdds-but-keep-25-ssds/?tab=comments#comment-657485 

There is a diagnostics file attached in my post.

 

It happens in this sequence:

1. Perfectly working order with 2x LFF HDDs + 1x SSD

2. Poweroff the server

3. Insert a 2nd SSD (totalling 2x LFF HDDs + 2x SSD)

4. Power on the server

5. BIOS detects only the 2x SSDs. Nothing about the LFF HDDs. From this point on, the server is broken somehow.

6. Poweroff server

7. Insert just 1x LFF HDD and nothing else (for testing)

8. The server still cannot detect the HDD. It means the damage is done and irreversible; there is no way to get it to detect any LFF HDD.

9. Poweroff server.

10. Replace the backplane with a new one

11. Insert 2x LFF HDDs + 1x SSD

12. Go to the 1st step above. And so on.


I have never seen such a problem before. Stumped.

 

Thank you for taking time to help me.

 

 

Very perplexing. I was just reading about some issues with SATA disk recognition on this gen/model, and some were resolved by a firmware update... have you looked into that?

 

I'll keep digging, but it's very weird. Once the "damage is done", can you switch the onboard RAID controller back on and see if it can see all the disks in the array creation menu? (Obviously don't create one, just see if they show up.)

Link to comment
3 hours ago, ReidS said:

Curious... Once the LFF drives are no longer being detected, can you remove them, put them in another system, and see them there?

Yes, of course. Even after replacing the hot-swap cage with a new one, the very same LFF drives are detected by the BIOS.

 

3 hours ago, ReidS said:

Usually newer controllers can take a firmware upgrade that will increase the boot-time allowance for this.

Perhaps this is possible. But the firmware is the latest; I even tried the spare ROM, with no solution.

 

4 hours ago, ReidS said:

After a long discussion with Intel support, we changed the RAID controller to another controller and had to switch it to a specific firmware version.

I tried with an LSI 9207-8i PCIe card. Same problem.

 

To me, the most important clue is that replacing the hot-swap LFF cage solves the problem, albeit only until I insert 2x LFF HDDs and 2x SSDs. As soon as I insert them, I'm back to square one with a damaged (or somehow corrupted) LFF cage backplane.

 

I am on the eve of a decision. If we believe that my ML150 Gen9 is really DOA, and not a design fault, then I will request another ML150 Gen9.

Link to comment
3 hours ago, 1812 said:

 

Very perplexing. I was just reading about some issues with SATA disk recognition on this gen/model, and some were resolved by a firmware update... have you looked into that?

 

I'll keep digging, but it's very weird. Once the "damage is done", can you switch the onboard RAID controller back on and see if it can see all the disks in the array creation menu? (Obviously don't create one, just see if they show up.)

 

The firmware is the latest.

 

No. Once the "damage is done", even the B140i doesn't see any LFF HDDs. No problems with SSDs.

 

Frankly speaking, I have asked this question on many HPE-related forums and nobody has seen such a problem. This strengthens the idea that it was a DOA unit. I'm getting more confident about ordering a new ML150 Gen9.

Link to comment
11 minutes ago, pwm said:

Any component burned by supplying power to four disks?

Not as far as I can see on the backplane.

 

Question: I know that HPE pushes us to use HPE-branded hard disks. My disks are not HPE-branded: I use 2x 10TB HGST Helium drives and 2x 512GB Samsung 850 Pros.

 

Could the server be locking itself up because of the consumer disks? I am aware that non-HPE disks may cause high fan noise. Could my problem be something like that? But even so, there would have to be other similar complaints on the internet, and I have never seen any problem like that reported.

Link to comment
