1812 Posted April 16, 2018 (Author)

2 hours ago, ReidS said:

Sorry for my delay in response to this. I had to prioritize some other things over this project last week. So I added the line above to allow unsafe interrupts, reselected the audio config for the Win10 system, and I'm still getting the exact same error as above. I went into the BIOS and tried to manually reassign the audio card to another IRQ and memory address, but the system keeps moving the other devices on the same interrupts along with it, like the RAID controller. I read earlier that the RAID controller doesn't play well here; am I possibly a victim of circumstances? I have 6 disks in this system running off the RAID controller, though not using the hardware RAID, so falling back to the onboard SATA won't work with only 2 ports. I also have a PCIe external SATA card for an external array, but the array isn't plugged in at the moment, and it's a plain plug-and-play card with no BIOS. I'm just tossing out some thoughts. I'm trying to get this box working 100% by the end of the week so I can start the next leg of projects related to this system. If I need to order another audio card, I'm open to suggestions. Thank you, and everybody in advance, for any help!

Post your diagnostics zip file (Tools > Diagnostics).
ReidS Posted April 16, 2018

Syslinux...

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS Test Settings
  menu default
  kernel /bzimage
  append vfio_iommu_type1.allow_unsafe_interrupts=1 initrd=/bzroot
label unRAID OS
  kernel /bzimage
  append initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label unRAID OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest

diagnostics-20180406-1252.zip
1812 Posted April 16, 2018 (Author)

46 minutes ago, ReidS said:

Syslinux... [config trimmed, quoted above]

diagnostics-20180406-1252.zip

It's not booting with unsafe interrupts allowed (per the log). Delete everything you have in the syslinux config and replace it with the following:

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS
  menu default
  kernel /bzimage
  append vfio_iommu_type1.allow_unsafe_interrupts=1 initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot unraidsafemode
label unRAID OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest

Once done, verify the sound card is in its IOMMU group by itself and try again with the VM. If it fails, post your new diagnostics immediately after.
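For anyone wanting to double-check the grouping from the unRAID terminal, a small script along these lines lists every IOMMU group and the PCI addresses it contains. This is a sketch, not an official unRAID tool: the sysfs layout it reads is the standard kernel one, and the optional base-path argument exists only to make the function easy to test.

```shell
#!/bin/sh
# List each IOMMU group and the PCI device addresses it contains.
# An alternate sysfs root can be passed as $1 (useful for testing);
# the default is the kernel's real /sys/kernel/iommu_groups tree.
list_iommu_groups() {
    base=${1:-/sys/kernel/iommu_groups}
    for dev in "$base"/*/devices/*; do
        [ -e "$dev" ] || continue
        group=${dev#"$base"/}   # strip prefix -> "22/devices/0000:09:00.0"
        group=${group%%/*}      # keep just the group number
        printf 'IOMMU group %s: %s\n' "$group" "${dev##*/}"
    done
}

list_iommu_groups "$@"
```

If the sound card's PCI address (0000:09:00.0 in this case) is the only device printed for its group, it can be passed through on its own.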
ReidS Posted April 17, 2018

Changed the syslinux config 100% as requested. The audio card is in an IOMMU group (22) of its own now. I'm still getting the same error, though, aside from the date part of the output. Attached is the diagnostics file grabbed immediately after. For some reason I thought I remembered seeing the audio card appear twice in System Devices, and now there's only one occurrence. I'm doing this remotely, and thankfully it hasn't locked up yet. Not sure what I did the one time it did, but that wasn't fun. I'll be on-site tomorrow (Tuesday, 9am-5pm EST) if any BIOS info is needed.

Again, I really appreciate the help. I've looked at the diag files myself, but I'm not experienced enough to know what I'm looking for. If there's somewhere I can do some reading to better understand the subject, other than searches, I'd feel more useful. Currently I'm just 13th Warrior'ing this thing until it starts becoming common sense. Thank you.

diagnostics-20180416-2130.zip
1812 Posted April 17, 2018 (Author)

10 hours ago, ReidS said:

Changed the syslinux config 100% as requested. The audio card is in an IOMMU group (22) of its own now. I'm still getting the same error, though... [trimmed]

Unsafe interrupts is now working, which unmasks your real problem:

vfio-pci 0000:09:00.0: Device is ineligible for IOMMU domain attach due to platform RMRR requirement. Contact your platform vendor.

This thread has some information regarding it: some in the first post, which may not be applicable on a G7 (it involves using an older BIOS vs. an updated one, and works on G6 servers). On the second page, another user posted the steps they took using a custom kernel they built to bypass the RMRR issue with a patch borrowed from Proxmox users. Currently these are the only two workarounds I know of. I have not attempted the custom-compiled kernel myself.
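If you want to confirm from a log that RMRR is what's blocking a device, the telltale strings are easy to grep for. A sketch: the default path is unRAID's live syslog, but the syslog file from inside a diagnostics zip works just as well as an argument.

```shell
#!/bin/sh
# Scan a system log for RMRR passthrough-failure messages.
# $1: log file to scan (defaults to the live /var/log/syslog).
# Prints matching lines and returns 0 if found; returns 1 if clean.
check_rmrr() {
    grep -iE 'RMRR|ineligible for IOMMU domain attach' "${1:-/var/log/syslog}"
}
```

Run `check_rmrr /path/to/syslog`; any hit means the platform is declaring RMRRs for that device and the BIOS or patched-kernel workarounds discussed above apply.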
ReidS Posted April 17, 2018

Dear 1812,

Thank you for your efforts. Your info shed much light on my situation and accidentally fixed my issue in a roundabout manner. The audio card issue itself has not been resolved, but a workaround came out of these efforts.

My attempt to backdate the BIOS was thwarted, so I instead decided to update to a more recent version — if I couldn't do one, why not try the other. unRAID still doesn't see a USB audio device as a valid sound card, but I can now flag the USB audio device for the VM, and it works through the remote desktop, which is just as good for me. This didn't work before the BIOS update and the allowance of unsafe interrupts: the system would see the USB audio device, but it wouldn't play through the remote desktop for reasons beyond my knowing. Now I can free up that slot for a USB 3.1 card and improve the dynamics of this box.

As an FYI, to restore the system's BIOS to a pre-2011 version, you either have to have the original ROMpaq CD/DVD that came with the system (or access to one) or have an active service agreement with HP. I'm not putting this box under another service agreement, especially not with the flexibility that unRAID gives me. If this box dies: remove the unRAID USB stick and all drives, and hello new equipment.

So the issue has not been resolved, and yet it has. To anybody having audio issues, I recommend disabling the sound card and using a USB audio device. It's not the fix we'd like to see, but it's cheap and it works. So: update the BIOS, update unRAID, allow unsafe interrupts, plug in a $10 USB audio device, and select that device as a USB device in the VM. Does one device work with multiple VMs? No, but if needed, purchase multiple USB audio devices from different makers. It's still a cheap fix, and it also keeps your system security up to date.

Thank you, 1812. Maybe this helps someone else?
daemonix Posted May 14, 2018

Hi all, any known issues with using HP's hardware RAID to merge two SSDs into one (as opposed to leaving them as two devices for the unRAID array to manage)? Performance, mainly. Thanks.
ReidS Posted May 14, 2018

The only issue is managing the drives. You can download the tools and run them, but I haven't fully tested that. I currently have two 120GB SSDs running in RAID 1; I did this for our database to increase response time. It runs fine, and if I lose a drive, response just drops back to normal. The only downside is that the system will not notify me if I lose a drive, because unRAID isn't managing the array — it still sees the pair as a single drive. If that's a concern, run the drives as RAID 0 and just make sure you have a parity drive large enough. That should give you the same results, but you'll know when one of the drives fails. Just be aware that any maintenance on the RAID array may require taking the system down.

I didn't cover RAID 5. Even though you gain a good bit of performance with 4+ drives and also get redundancy, you don't have integrated manageability with unRAID; it would behave like the RAID 0 case, but with some perks. If you really want speed and have a spare PCI slot, I'd go that route.
1812 Posted May 14, 2018 (Author)

My video editing server uses the onboard RAID controller to manage 3 different higher-performing RAID groups without issue. But as stated, if something fails, there is no notification.

In 6.2.4, using the onboard controller together with an HBA caused the machine to hang or not boot. I have not retested in subsequent versions (didn't have the need).

There has also been a feature request open (for about a year now) for unRAID to support multiple cache pools. Maybe someday we'll get it...
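Since unRAID can't see behind a hardware RAID volume, the notification gap can be papered over with a scheduled check. The sketch below assumes HPE's `ssacli` utility is available — it is not part of stock unRAID, and the `mail` command and address in the commented wrapper are likewise assumptions for illustration. The status parsing is split into its own function so it can be tested against captured output:

```shell
#!/bin/sh
# Hedged sketch of a health check for a hardware-RAID-backed device.
# raid_status_bad: returns 0 ("bad") if the captured controller status
# text mentions a failed, degraded, rebuilding, or interim-recovery state.
raid_status_bad() {
    grep -qiE 'fail|degrad|rebuild|interim recovery' "$1"
}

# Example cron-able wrapper -- commented out because it needs real hardware,
# the (assumed) HPE ssacli tool, and a working mail setup:
# ssacli ctrl all show status > /tmp/raid_status.txt
# if raid_status_bad /tmp/raid_status.txt; then
#     echo "hardware RAID problem detected" | mail -s "RAID alert" admin@example.com
# fi
```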
daemonix Posted May 14, 2018

Hmm, you are correct about the notification. As these are 'dummy' boxes where I run VMs and Dockers (inside the VM, for CUDA), and I back up everything, I'm not sure it's a problem for me. On the other hand, my 2 SSDs are really small to run as 2 separate devices in the unRAID array — 240GB...

I asked because since I moved this specific box to unRAID with a VM, the response (I believe the read/write) has been slow. Ubuntu runs quite slowly on this system. This HP has a 240GB SATA SSD from HP, and my new system with the same unRAID/VM setup has an Intel 960 Pro. Maybe I'm spoiled? The HP seems to hang...
ReidS Posted May 14, 2018

I had the same issues with Ubuntu on my HP: runs slow and/or laggy. I found this not to be a drive speed issue, but more of a configuration and driver issue. I switched from the default settings to SeaBIOS and the i440fx chipset. If I used the defaults, the system would hang or not boot right. This could have been a fault of my own, but I tried many methods, and the default virtual hardware would consistently cause it to fail on my DL360 G7. I'm still experimenting on the older G5s I have.
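For reference, picking SeaBIOS and i440fx in the VM settings corresponds to a libvirt `<os>` fragment roughly like the one below. This is a sketch: the machine version suffix depends on the QEMU build the unRAID release ships, and it is the absence of a `<loader>` element that selects SeaBIOS rather than OVMF.

```xml
<os>
  <!-- i440fx machine type; exact version suffix varies with the QEMU build -->
  <type arch='x86_64' machine='pc-i440fx-2.11'>hvm</type>
  <!-- no <loader> element here: SeaBIOS is the default firmware -->
  <boot dev='hd'/>
</os>
```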
daemonix Posted May 15, 2018

8 hours ago, ReidS said:

I had the same issues with Ubuntu on my HP: runs slow and/or laggy. ... I switched from the default settings to SeaBIOS and the i440fx chipset. [trimmed]

Aha! I'll test it. It seemed weird, as I ran bare Ubuntu on that machine for almost 2 years and it was super — and now it's very laggy. Cheers.
sse450 Posted May 20, 2018

Dear @1812,

I have a problem with my ProLiant. I would appreciate it if you, or any other ProLiant user here, could shed some light on it.

I have an ML110 Gen9 which is working fine with unRAID, 3x LFF HDDs, and 2x SSDs. The SSDs are in 2.5"-to-3.5" caddies. I needed a second unRAID server, so 3 weeks ago I bought an ML150 Gen9 and installed unRAID (on USB) with 2x LFF HDDs and 1x SSD (in a 2.5"-to-3.5" converter). All was fine.

Then I needed to install a second SSD to make a cache pool. As soon as I installed the second SSD (the 4th drive in the cage), something happened to the server: from that moment on, it wouldn't detect any LFF HDD. There is no problem with the SSDs, though. After a long conversation with HPE tech support and many resets (including SW6), it could not be corrected. The HPE tech then replaced the backplane, and all was (seemingly) fine again — until I installed 2x LFF HDDs and 2x SSDs, and the problem repeated. HPE had to issue a DOA report.

What should I do? I am afraid the problem will repeat if I get another ML150 Gen9. Or should I go for an ML110 Gen9, since that model is tested? By the way, all the HDDs and SSDs are non-HP brands.

Thanks for your help.
1812 Posted May 20, 2018 (Author)

2 hours ago, sse450 said:

I have a problem with my ProLiant. ... As soon as I installed the second SSD (the 4th drive in the cage), something happened to the server: from that moment on, it wouldn't detect any LFF HDD. [trimmed]

What HBA are you using, and is the onboard RAID controller still enabled? Or are you using onboard RAID — and if so, what model?
sse450 Posted May 20, 2018

3 hours ago, 1812 said:

What HBA are you using, and is the onboard RAID controller still enabled? Or are you using onboard RAID — and if so, what model?

The onboard controller is a B140i, but I disabled it and selected AHCI in the BIOS, so no RAID at all. Just for testing, I also installed an LSI 9207-8i in a PCIe slot. The result was the same: after the problem happens, it shows only the SSDs and none of the LFF HDDs. It seems to me that something is wrong with the hot-plug backplane of the HDD cage — but the replacement backplane showed the same behaviour as well.

Thanks for helping.
1812 Posted May 20, 2018 (Author)

Ok. Replicate the problem again, making the drives disappear. Then, immediately after they're gone, grab the diagnostics zip (Tools > Diagnostics) and post the complete zip file. That might help us figure out what is going on, and maybe why.
daemonix Posted May 21, 2018

Anyone getting this flooding error?

[83718.488013] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170728/psparse-550)
[83718.488020] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170728/power_meter-338)
[83728.528274] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20170728/exfield-427)
[83728.528279] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170728/psparse-550)
[83728.528286] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170728/power_meter-338)
sse450 Posted May 21, 2018

12 hours ago, 1812 said:

Ok. Replicate the problem again, making the drives disappear. Then, immediately after they're gone, grab the diagnostics zip (Tools > Diagnostics) and post the complete zip file.

The problem isn't drives disappearing after unRAID starts — it's a detection problem at the BIOS level. The ProLiant's BIOS doesn't see the 3.5" HDDs, and unRAID can do nothing if the BIOS cannot detect them. Anyway, a couple of weeks ago I asked about this problem on this forum: https://lime-technology.com/forums/topic/71583-my-server-lost-all-35-hdds-but-keep-25-ssds/?tab=comments#comment-657485 — there is a diagnostics file attached in that post.

It happens in this sequence:

1. Perfect working order with 2x LFF HDDs + 1x SSD
2. Power off the server
3. Insert a 2nd SSD (totalling 2x LFF HDDs + 2x SSDs)
4. Power on the server
5. BIOS detects only the 2x SSDs — nothing about the LFF HDDs. From this point on, the server is somehow broken.
6. Power off the server
7. Insert just 1x LFF HDD and nothing else (for testing)
8. The server still cannot detect the HDD. The damage is done and irreversible; no way to get it to detect any LFF HDD.
9. Power off the server
10. Replace the backplane with a new one
11. Insert 2x LFF HDDs + 1x SSD
12. Go back to step 1 above, and so on.

I have never seen such a problem before. Stumped. Thank you for taking the time to help me.
1812 Posted May 21, 2018 (Author)

3 hours ago, daemonix said:

Anyone getting this flooding error?

[83718.488013] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170728/psparse-550)
[83718.488020] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170728/power_meter-338)
[...]

The fix for this log spam is on the first page.
ReidS Posted May 21, 2018

Curious... once the LFF drives are no longer being detected, can you remove them, put them in another system, and see them there? The thought here is to determine whether this is a storage controller issue or a drive issue. Even if you can't see the drives in another system, it could still be a storage controller problem (card, cable, or backplane).

I don't think this is an SSD issue as such. One thought: SSDs are always ready — add power, go. HDDs take time to spin up, especially if they're 15k+ RPM SAS drives. When the controller powers up and says hello to the first drives — which would be the SSDs — it may only allow so much time for the remaining drives to say hello. If those drives aren't ready, the controller starts up and reports them as unavailable. I've seen this with older, pre-SSD storage controllers and with some later ones. Usually newer controllers can take a firmware upgrade that increases the boot-time allowance for this.

I had an issue like the one you're having with an old Dell server and an Intel RAID controller about 5 years back. We had four SSDs and two HDDs, but we kept losing the HDDs when configuring the server and constantly had to rebuild the mirror between the two. After a long discussion with Intel support, we changed to another RAID controller and had to switch it to a specific firmware version. This was not a user error, and there's nothing to configure, which adds to the frustration. It all came down to the ready state of the drives and the factory-configured allowance in the controller for ready time from the first drive to the last.

So once your drives "fail", see if they can be seen in another system. If so, you might be having a similar issue, and you may have to add another controller to separate the drives — or see if there's a controller firmware you can switch to that relieves the issue.
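One way to make that before/after comparison concrete (at least once an OS is booted) is to snapshot what each machine enumerates under /dev/disk/by-id and diff the lists. A sketch: the by-id directory is the standard udev-maintained location on Linux, and the optional path argument exists only for testing.

```shell
#!/bin/sh
# Print the whole-disk entries the OS currently enumerates, one per line,
# so snapshots can be diffed before/after drives vanish, or across machines.
snapshot_disks() {
    base=${1:-/dev/disk/by-id}
    for link in "$base"/*; do
        [ -e "$link" ] || continue
        case ${link##*/} in
            *-part*) continue ;;   # skip partition entries, keep whole disks
        esac
        printf '%s\n' "${link##*/}"
    done
}

snapshot_disks "$@"
```

Run it on the suspect server and on the second system with the same drives attached; a drive present in one list but not the other points at the controller or backplane rather than the drive.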
1812 Posted May 21, 2018 (Author)

2 hours ago, sse450 said:

The problem isn't drives disappearing after unRAID starts — it's a detection problem at the BIOS level. The ProLiant's BIOS doesn't see the 3.5" HDDs... [sequence trimmed]

Very perplexing. I was just reading about some issues with SATA disk recognition on this generation/model, some of which were resolved by a firmware update — have you looked into that? I'll keep digging, but it's very weird.

Once the "damage is done", can you switch the onboard RAID controller back on and see whether it can see all the disks in the array creation menu? (Obviously don't create one — just see if they show up.)
sse450 Posted May 21, 2018

3 hours ago, ReidS said:

Curious... once the LFF drives are no longer being detected, can you remove them, put them in another system, and see them there?

Yes, of course. Even after replacing the hot-swap cage with a new one, the very same LFF drives are detected by the BIOS.

3 hours ago, ReidS said:

Usually newer controllers can take a firmware upgrade that increases the boot-time allowance for this.

Perhaps this is possible, but the firmware is already the latest. I even tried the spare ROM, with no solution.

3 hours ago, ReidS said:

After a long discussion with Intel support, we changed to another RAID controller and had to switch it to a specific firmware version.

I tried with an LSI 9207-8i PCIe card. Same problem. To me, the most important clue is that replacing the hot-swap LFF cage solves the problem — albeit only until I insert 2x LFF HDDs and 2x SSDs. As soon as I do, I am back to square one with a damaged (or somehow corrupted) LFF cage backplane.

I am on the eve of a decision: if we believe my ML150 Gen9 is really DOA, and not a design fault, then I will request another ML150 Gen9.
sse450 Posted May 21, 2018

3 hours ago, 1812 said:

Very perplexing. ... Once the "damage is done", can you switch the onboard RAID controller back on and see whether it can see all the disks in the array creation menu?

The firmware is the latest. And no — once the "damage is done", even the B140i doesn't see any LFF HDDs. No problems with the SSDs.

Frankly speaking, I have asked this question on many HPE-related forums, and nobody has seen such a problem. This strengthens the idea that it was a DOA unit. My courage to order a new ML150 Gen9 is growing.
pwm Posted May 21, 2018

Any component burned by supplying power to four disks?
sse450 Posted May 21, 2018

11 minutes ago, pwm said:

Any component burned by supplying power to four disks?

Not as far as I can see on the backplane.

Question: I know that HPE pushes us to use HPE-branded hard disks. My disks are not HPE-branded: 2x 10TB HGST Helium and 2x 512GB Samsung 850 Pro. Could the server be locking itself up because of the consumer disks? I am aware that non-HPE disks may cause high fan noise — could my problem be something like that? Even then, there would surely be similar complaints on the internet, and I have never seen any problem like this.

Edited May 21, 2018 by sse450