** Possible SERIOUS UnRAID Problem with 6.7.x **



Hi All

 

I am writing this on the fly as I am in the middle of a major dilemma and the only common denominator is UnRAID...

 

So... I have a couple of UnRAID servers which have been working fine for years until now. It started when my HP Proliant DL380p G8, which is in a datacenter, flagged up an error in the ILO - There were 2 critical flags... "Unrecoverable System Error (NMI) has occurred." and "Uncorrectable PCI Express Error". I assumed that this was a hardware failure on the machine and whilst it limped on with this fault I prepared for the worst and ordered a new machine to replace it.

 

Whilst I was waiting to get access to the datacenter to swap the machine over, my HP Proliant Microserver G8 flagged a similar critical error - and I thought what bad luck to have two failed machines at the same time. The microserver wouldn't boot from the USB stick while fitted to the internal USB port and I also had no keyboard or mouse working when connected to it over ILO. But as this machine is at home I was able to fit the USB stick in one of the external USB ports and it booted OK.

 

Now read on... this is where I run out of believing these are all coincidences...

 

Today I fitted my replacement DL380p G8 in the datacenter and powered it up - everything looked good when I left the datacenter and headed home. Then when I got home I couldn't get the keyboard / mouse working when connected to the ILO and eventually, out of desperation, I gave the machine a cold boot to see if things would sort themselves out... then I saw the same critical error on the ILO as on the machine it had just replaced. The only common hardware is the SAS disks, everything else has been changed!!! Also this machine now won't boot with the USB key fitted to the internal USB socket... it says that it is a non-system disk and reboots. Exactly the same as my Microserver. Unfortunately as I have to book to get into the datacenter it will be a few days before I can do anything - which causes me a real problem!!

 

To try and prove some of my thoughts I took the UnRAID USB key out of my Microserver and fitted a normal bootable Linux USB key and it booted fine from the internal port.

 

So... between all 3 of these machines, the only common factor is that they are all running UnRAID and I usually update them as soon as a new version comes out. They are all running 6.7.1 or 6.7.2 - so as I am able to physically get to my Microserver I am going to downgrade to 6.6.7 as these problems only seem to have started from 6.7.0 (although I can't be 100% sure it was after this release that the problem started). One thing I have just thought of is that this might have started in an earlier build as it only came to light in the last few weeks as I decided to power my Microserver off when not needed to save power as this is at home. So until about 6-8 weeks ago this machine was powered on permanently!! But as 6.6.7 is the next real step back I will start here!

 

I will post how things go - but would be keen to know if anyone else has had similar problems and am happy to work with someone to provide logs to identify the problem. The only issue I really have is that the DC is a 160 mile round trip and as the UnRAID licence is locked to the GUID of the USB key, I can't just send the DC a new USB key to fit without having to buy another licence ☹️

 

I might be, wrongly, blaming UnRAID - but 3 identical errors on different machines seems too much to all be hardware issues. I suspect that as most people with UnRAID on ProLiant servers leave it powered on all the time, they may not know that they have this issue - but if there is an HP bug in 6.7.0 - it would be a shame to keep building and developing new releases with an inherent HP bug!

 

 I am guessing this might be an HP / UnRAID issue but would be grateful for any URGENT assistance.

 

Thanks

 

Tony

P.S. I just noticed this perhaps needs to be in the "bug report" forum - perhaps a mod could move it?

Edited by SliMat
additional info
9 minutes ago, Squid said:

 

Wow - looks like that's tonight over with as I have some reading to do... interestingly in one of the first posts it mentions "The 2017.04.0 SPP is the last production SPP to contain components for the G7 and Gen8 server platforms"... I'm sure I have SPP-2018-03 installed.

 

Thanks for the link ;-)

51 minutes ago, SliMat said:

Wow - looks like that's tonight over with as I have some reading to do... interestingly in one of the first posts it mentions "The 2017.04.0 SPP is the last production SPP to contain components for the G7 and Gen8 server platforms"... I'm sure I have SPP-2018-03 installed.

 

Thanks for the link ;-)

 

They may have added new SPP for some mitigations and other exploits. Since I no longer have G6/7 hardware, I stopped following updates for those machines. I've updated that line to reflect that.

 

 

2 hours ago, SliMat said:

Also this machine now won't boot with the USB key fitted to the internal USB socket... it says that it is a non-system disk and reboots.

Most likely a BIOS setting, since your previous machine did it without issue.

 

2 hours ago, SliMat said:

There were 2 critical flags... "Unrecoverable System Error (NMI) has occurred." and "Uncorrectable PCI Express Error"

 

My assumption is that you're using an onboard raid controller with the disks, which is causing the errors (this guess is made without seeing any info from your machines.) I would have similar problems early on using the onboard or even using a standalone hba card with the onboard raid controller still enabled.

 

 


Thanks for all the input... the original DL380p G8 was identical to the replacement - the disks are set up in RAID 0 on the P420i / 2GB onboard controller - presenting 8 arrays to UnRAID which then sees the individual disks. The internal USB port was working as I booted it this morning when I was in the DC and everything seemed to be working OK. When I got home I realised I had no keyboard/mouse in ILO - but could get on the UnRAID GUI OK and the VMs were working OK too. I rebooted it from the UnRAID GUI to try and clear some errors and it was only then that it wouldn't boot up and is saying non system boot disk. So, I don't think it's a BIOS setting.

 

I have played around with my Microserver and there was an Adaptec RAID card fitted as I had a couple of extra disks installed for a while, which are now out... so I removed this and also downgraded to UnRAID 6.5.0 (as this is the latest pre 6.7.x image I have) and touch wood it all seems OK now.

 

Obviously the DL380p is harder to sort out as I can't physically go and remove the USB key... so I will pay for hands-on tomorrow.

 

Thanks for all the info - I am hoping that somewhere in the posts from 1812 he may recommend a RAID Card to use in the Gen8 DL servers -  as if this is possibly the cause, I need to remedy it as it is a production machine.

 

Cheers

  • 2 months later...

Just bumping this with an update... I have been using 6.5.0 since these problems in July and everything has been sweet - but yesterday I upgraded my Microserver, at the house, to 6.7.2 again as I was trying to set up a reverse proxy and it seems that 6.5.0 has a bug that means when you create a new docker network it isn't seen in the drop down box on the docker!

 

@j0nnymoe suggested upgrading to the latest version (which was right as this fixed that problem). After the upgrade the network appeared and so I set up the reverse proxy OK.

 

BUT... tonight I needed to reboot and again the server says that there is no bootable device and keeps cycling through boot saying no device found. In the HP ILO I have no keyboard or mouse working. So, I removed the USB boot disk from the internal USB port on the Microserver and tried in the rear ports - same issue. Inserted it in one of the front USB ports and it was recognised. It took an age to boot and when it did it sat on "starting services" for about 5 mins, but has now booted.

 

So, first thing tomorrow I will be downgrading, again, to 6.5.0 - I am just so grateful I didn't risk trying an upgrade on my DL380 in the datacenter! 

 

I'm at a loss with this, but it has to be more than coincidence - I'm not planning to explore this any more as I have no idea where to even begin!

 

Thanks

13 minutes ago, SliMat said:

BUT... tonight I needed to reboot and again the server says that there is no bootable device and keeps cycling through boot saying no device found.

Wild guess but are you using UEFI? I have an hp workstation that is finicky with booting uefi and exhibits the same issue of not finding the unRaid usb boot device unless I go to the boot manager/selection (or whatever it’s called) and manually select it. Then it operates as expected.

18 minutes ago, 1812 said:

Wild guess but are you using UEFI? I have an hp workstation that is finicky with booting uefi and exhibits the same issue of not finding the unRaid usb boot device unless I go to the boot manager/selection (or whatever it’s called) and manually select it. Then it operates as expected.

Hi 1812 - to be honest I don't know... I have just powered the Microserver down in order to downgrade the USB key to 6.5.0 and it stalled for about 10 minutes while shutting down... I ended up forcing a power off! The only way I can access the BIOS (perhaps) is to plug in a physical USB keyboard/mouse... so I am just off to find one. It's interesting that it stalled during shutdown as it did the same thing earlier when this problem started. So it's 2 out of 2 for failing to power off. I took a screen shot of where it hung in case it's relevant. I will try and get into BIOS and let you know what I find.

Screenshot 2019-10-02 at 23.45.13.png

13 minutes ago, cybrnook said:

You can check in Main -> Flash -> Scroll down... (If you have a window where you can get into the GUI)

image.thumb.png.e04c23d7a9aa0b008ad5958cf0e61a0e.png

Thanks

 

@1812 I managed to get into BIOS by connecting a physical keyboard/mouse/screen. Initially it wouldn't accept any input - so I did a complete power off (forced) and removed the power cable. Pressed power button to force a complete discharge and then powered on again and this time I could use keyboard... so I looked in BIOS and there is no mention of UEFI boot - I remember reading somewhere that UEFI support is only available in G9+.

 

So as the machine has started again (with USB key plugged into mainboard) I looked in syslinux and it is set to legacy boot, but the "allow UEFI boot" box is ticked... so I will untick this... screenshot below.
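For anyone who wants to check this setting without the GUI, here is a rough Python sketch. It assumes (this is not stated anywhere in the thread) that the tick box simply corresponds to the name of the EFI folder on the flash drive, `EFI` when UEFI boot is permitted and `EFI-` when it is not, and that the flash is mounted at `/boot`. The function name `uefi_boot_permitted` is made up for illustration.

```python
from pathlib import Path


def uefi_boot_permitted(flash_root):
    """Guess the 'permit UEFI boot' state from the flash folder layout.

    Assumption: a folder named 'EFI' means UEFI boot is permitted and a
    folder named 'EFI-' means it is disabled.
    Returns True, False, or None if neither folder (or the path) exists.
    """
    root = Path(flash_root)
    if not root.is_dir():
        return None
    # FAT is case-insensitive, so compare upper-cased folder names.
    names = {p.name.upper() for p in root.iterdir() if p.is_dir()}
    if "EFI" in names:
        return True
    if "EFI-" in names:
        return False
    return None
```

Calling `uefi_boot_permitted("/boot")` on the server, or pointing it at the USB key mounted on another machine, should return `False` once the box has been unticked, if the assumption above holds.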

 

I don't think this will be the final fix... but will reboot and see whether I get ILO keyboard/mouse back and indeed if it will reboot without going into a boot failure cycle.

 

Update in a bit...

 

 

Screenshot 2019-10-03 at 00.24.57.png


OK, it rebooted OK with the USB key in the internal motherboard socket - but for some reason I have now lost the ILO HTML5 console so can't see anything other than the UnRAID GUI... so will have a play some more tomorrow.

 

I can't believe that nobody else has had this problem (with 6.7.2 ?!?!) as it seems fairly constant with my DL380p G8 and my Microserver G8.

 

I'll play some more tomorrow and update with anything new - any thoughts appreciated as at the moment ESXi is looking appealing... which is something I never thought I'd hear myself say as I have been using UnRAID for about 13-14 years!

9 hours ago, SliMat said:

OK, it rebooted OK with the USB key in the internal motherboard socket - but for some reason I have now lost the ILO HTML5 console so can't see anything other than the UnRAID GUI... so will have a play some more tomorrow.

 

I can't believe that nobody else has had this problem (with 6.7.2 ?!?!) as it seems fairly constant with my DL380p G8 and my Microserver G8.

 

I'll play some more tomorrow and update with anything new - any thoughts appreciated as at the moment ESXi is looking appealing... which is something I never thought I'd hear myself say as I have been using UnRAID for about 13-14 years!

 

If the system can't find the USB device to boot from, it doesn't have anything to do with unraid; it's something with your system. The BIOS is looking for a device that has the correct boot files/flags, but can't find any.

Looking at the screenshot of your UEFI setting in unraid you have it set to boot in UEFI, but as you say, your system doesn't support UEFI. Even though it might seem like it's a problem with unraid, the reason it doesn't boot then is that your system doesn't support UEFI.
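To make the "boot files/flags" point a bit more concrete, here is a minimal sketch of the kind of check a legacy BIOS typically makes on the first sector of a drive: the `0x55 0xAA` boot signature at the end of the sector and an active (bootable) flag on one of the four MBR partition entries. This is a generic illustration, not a diagnosis of this server; whether the Gen8 firmware actually insists on an active partition is an assumption, and the helper names `inspect_mbr` and `read_first_sector` are invented for the example.

```python
SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446  # start of the four 16-byte MBR partition entries
PARTITION_ENTRY_SIZE = 16


def inspect_mbr(sector):
    """Report the boot signature and active partitions of an MBR sector."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least the first 512 bytes of the device")
    # Bytes 510-511 must be 0x55 0xAA for the sector to count as bootable.
    has_signature = sector[510:512] == b"\x55\xaa"
    active = []
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * PARTITION_ENTRY_SIZE
        status = sector[start]        # 0x80 marks the partition as active
        part_type = sector[start + 4]  # 0x00 means the entry is unused
        if part_type != 0 and status == 0x80:
            active.append(index + 1)
    return {"boot_signature": has_signature, "active_partitions": active}


def read_first_sector(device_path):
    """Read the first sector of a device or image file (usually needs root)."""
    with open(device_path, "rb") as handle:
        return handle.read(SECTOR_SIZE)
```

For example, `inspect_mbr(read_first_sector("/dev/sdX"))`, with `/dev/sdX` standing in for whatever device node the USB key gets, would show whether the signature and an active partition are present. A key that passes both checks and still is not found points more towards the port or firmware than towards the contents of the key.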

  • 2 months later...

I think I have hit the same problem. If you choose a forced reboot, the reboot takes a long time and gets stuck, so I just pressed the shutdown button, and then the server can't boot again. I am not sure whether it changed the config to UEFI mode. Last time I had this problem, I just made my USB key bootable again and everything was OK. Today I hit the same problem again.

1 hour ago, niccoming said:

I think I have hit the same problem. If you choose a forced reboot, the reboot takes a long time and gets stuck, so I just pressed the shutdown button, and then the server can't boot again. I am not sure whether it changed the config to UEFI mode. Last time I had this problem, I just made my USB key bootable again and everything was OK. Today I hit the same problem again.

This thread is a few months old now. It's unclear whether you have the same problem or not. Sounds like you might have corrupted Flash. Do you have a backup of Flash? 
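If you don't have a backup yet, one simple way to make one by hand is to zip the whole flash drive to another location. Below is a minimal Python sketch, assuming the flash is mounted at `/boot` and that the destination folder is somewhere other than the flash itself; the function name `backup_flash` and the archive naming are just made up for the example.

```python
import shutil
import time
from pathlib import Path


def backup_flash(flash_root, dest_dir):
    """Zip everything under flash_root into a timestamped archive in dest_dir."""
    src = Path(flash_root)
    if not src.is_dir():
        raise FileNotFoundError(f"flash root not found: {src}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = dest / f"flash-backup-{stamp}"
    # make_archive appends '.zip' and returns the path of the archive it wrote.
    return Path(shutil.make_archive(str(base), "zip", root_dir=str(src)))
```

Something like `backup_flash("/boot", "/mnt/user/backups")` would do it, where `/mnt/user/backups` is only a placeholder for wherever you keep backups. Copy the zip off the server as well, since a backup that lives only on the array is no help when the server won't boot.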

  • 3 weeks later...
On 12/30/2019 at 12:07 PM, trurl said:

This thread is a few months old now. It's unclear whether you have the same problem or not. Sounds like you might have corrupted Flash. Do you have a backup of Flash? 

Hi - sorry, I haven't checked here recently - but there has been a development today - new details here...

 

 

  • 2 months later...

I also have a Microserver Gen8 - recently upgraded to 6.7.2 and guess what happened to my NAND / ILO4 ??

 

Exactly the same thing as you.

 

I had been using openmediavault and even used ESXi without issues....

Then after I moved to a new house - I decided to give UNRAID a go.

 

Guess who has a HP Gen8 server with no Intelligent Provisioning and a dead nand?
I am now trying to downgrade ILO back to 2.55 from 2.73 and have forced a NAND format 4x already and I am now trying to make it work...
