HP DL380e Gen8 - fans rev at 100% at ALL times.



Hey guys,

 

so when fully booted into unRAID the fans will spin to what sounds like 100% load.

 

This is at absolute idle, nothing going on, vanilla install. The fans are so loud that any bug hunting becomes a burden and I feel rushed the whole time. I have attached the syslog. It's two days old, but I don't think I changed anything during that time, and I wouldn't know what. I have hardly booted into unRAID at all.

 

I have a separate issue concerning the P420 card in HBA mode, if anyone feels like helping me troubleshoot that as well, I'd be forever thankful. :)

 

Link to thread.

 

Cheers

tower-syslog-20190426-1810.zip

Link to comment

You should probably post a proper diagnostics file rather than just the syslog. Looking at the syslog, something in your system seems to be causing a crash, which may be related; I'll leave that to others to analyze. As for the fans: as you may know, these servers are designed to be used with the vendor's (in this case HP) certified storage devices. If you are using normal off-the-shelf drives, it is possible that the system cannot read the drive temperature from the device and runs the fans at high speed as a result. If so, you may see some kind of complaint during POST.
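For anyone wanting to check the drive-temperature theory from the OS side, here is a minimal Python sketch. It assumes smartmontools is installed and that you feed it the text printed by `smartctl -A /dev/sdX`. The function name `drive_temperature` and the sample text are made up for illustration. It only shows whether Linux can read a temperature, not whether the HP controller or iLO can, so treat the result as a hint only.

```python
import re

# SMART attribute row as printed by `smartctl -A` for ATA drives, e.g.
# 194 Temperature_Celsius 0x0022 112 099 000 Old_age Always - 38
ATTR_ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+\S+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)
# SAS/SCSI drives report the temperature on a line of its own instead.
SCSI_TEMP = re.compile(r"Current Drive Temperature:\s+(\d+)\s+C")

# Attribute IDs commonly used for temperature (190 = airflow, 194 = drive).
TEMPERATURE_IDS = (190, 194)


def drive_temperature(smartctl_output):
    """Return the temperature in Celsius, or None if none is reported."""
    for line in smartctl_output.splitlines():
        row = ATTR_ROW.match(line)
        if row and int(row.group("id")) in TEMPERATURE_IDS:
            return int(row.group("raw"))
        scsi = SCSI_TEMP.search(line)
        if scsi:
            return int(scsi.group(1))
    return None


# Hypothetical sample output, not taken from the attached diagnostics.
SAMPLE_ATA = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
194 Temperature_Celsius     0x0022   112   099   000    Old_age   Always       -       38
"""

print(drive_temperature(SAMPLE_ATA))
```

A drive for which this returns `None` would be the first candidate to pull when testing the theory.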

Link to comment
35 minutes ago, WizADSL said:

You should probably post a proper diagnostics file rather than just the syslog. Looking at the syslog, something in your system seems to be causing a crash, which may be related; I'll leave that to others to analyze. As for the fans: as you may know, these servers are designed to be used with the vendor's (in this case HP) certified storage devices. If you are using normal off-the-shelf drives, it is possible that the system cannot read the drive temperature from the device and runs the fans at high speed as a result. If so, you may see some kind of complaint during POST.

I'll pull out the WD and see if that fixes it; after that the system would be all-HP. That being said, if this were the cause, wouldn't it also affect other areas, like booting into Smart Array Administrator, Intelligent Provisioning or such?

 

The system is running pretty okay as long as I don't load up unRAID.

 

I'll boot into unRAID and quickly pull the diagnostics with the WD drive out tomorrow morning. :)

Link to comment

do you have any pcie cards in it? that tended to trigger a 60% minimum spin on my g6 and g7 proliants.

 

also you have a call trace: 

 

Apr 26 10:37:34 Tower kernel: Call Trace:
Apr 26 10:37:34 Tower kernel: con_scroll+0xb7/0xea
Apr 26 10:37:34 Tower kernel: lf+0x28/0x5e
Apr 26 10:37:34 Tower kernel: vt_console_print+0x1cc/0x308
Apr 26 10:37:34 Tower kernel: console_unlock+0x2df/0x3cd
Apr 26 10:37:34 Tower kernel: vprintk_emit+0x166/0x17d
Apr 26 10:37:34 Tower kernel: printk+0x53/0x6a
Apr 26 10:37:34 Tower kernel: sas_bsg_initialize+0x45/0xf5 [scsi_transport_sas]
Apr 26 10:37:34 Tower kernel: sas_rphy_add+0x81/0x141 [scsi_transport_sas]
Apr 26 10:37:34 Tower kernel: hpsa_scan_start+0x29a2/0x2aec [hpsa]
Apr 26 10:37:34 Tower kernel: ? do_scsi_scan_host+0x2e/0x72
Apr 26 10:37:34 Tower kernel: ? hpsa_scsi_queue_command+0x1e6/0x1e6 [hpsa]
Apr 26 10:37:34 Tower kernel: do_scsi_scan_host+0x2e/0x72
Apr 26 10:37:34 Tower kernel: scsi_scan_host+0x19f/0x1b4
Apr 26 10:37:34 Tower kernel: hpsa_init_one+0x176e/0x1985 [hpsa]
Apr 26 10:37:34 Tower kernel: local_pci_probe+0x39/0x7a
Apr 26 10:37:34 Tower kernel: work_for_cpu_fn+0x11/0x17
Apr 26 10:37:34 Tower kernel: process_one_work+0x16e/0x24f
Apr 26 10:37:34 Tower kernel: ? cancel_delayed_work_sync+0xa/0xa
Apr 26 10:37:34 Tower kernel: process_scheduled_works+0x22/0x27
Apr 26 10:37:34 Tower kernel: worker_thread+0x1f9/0x2ac
Apr 26 10:37:34 Tower kernel: kthread+0x10b/0x113
Apr 26 10:37:34 Tower kernel: ? kthread_flush_work_fn+0x9/0x9
Apr 26 10:37:34 Tower kernel: ret_from_fork+0x35/0x40
Apr 26 10:37:34 Tower kernel: NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
Apr 26 10:37:34 Tower kernel: CPU: 0 PID: 1496 Comm: kworker/0:5 Not tainted 4.18.20-unRAID #1
Apr 26 10:37:34 Tower kernel: Hardware name: HP ProLiant DL380e Gen8, BIOS P73 05/21/2018
Apr 26 10:37:34 Tower kernel: Workqueue: events work_for_cpu_fn
Apr 26 10:37:34 Tower kernel: RIP: 0010:vgacon_scroll+0xb1/0x23f
Apr 26 10:37:34 Tower kernel: Code: 00 00 89 ea 48 01 d0 48 3b 05 c7 5e b7 00 0f 82 90 00 00 00 48 8b 05 c2 5e b7 00 49 8d 34 14 8b 8b a8 01 00 00 48 89 c7 29 e9 <f3> a4 48 89 83 78 01 00 00 eb 76 83 78 0c 00 74 c0 0f b7 93 60 01 
Apr 26 10:37:34 Tower kernel: RSP: 0018:ffffc900079dba70 EFLAGS: 00010006
Apr 26 10:37:34 Tower kernel: RAX: ffff8800000b8000 RBX: ffff88081f018c00 RCX: 0000000000000225
Apr 26 10:37:34 Tower kernel: RDX: 00000000000000a0 RSI: ffff8800000bfd5b RDI: ffff8800000b8cdb
Apr 26 10:37:34 Tower kernel: RBP: 00000000000000a0 R08: 00000000ffffffff R09: ffff8800000bf080
Apr 26 10:37:34 Tower kernel: R10: 000000000000052f R11: 000000000001ba6c R12: ffff8800000befe0
Apr 26 10:37:34 Tower kernel: R13: ffff88081f018c00 R14: 0000000000000000 R15: 0000000000000000
Apr 26 10:37:34 Tower kernel: FS:  0000000000000000(0000) GS:ffff88081f600000(0000) knlGS:0000000000000000
Apr 26 10:37:34 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 10:37:34 Tower kernel: CR2: 0000000000733000 CR3: 0000000001e0a003 CR4: 00000000000606f0
Apr 26 10:37:34 Tower kernel: Call Trace:
Apr 26 10:37:34 Tower kernel: con_scroll+0xb7/0xea
Apr 26 10:37:34 Tower kernel: lf+0x28/0x5e
Apr 26 10:37:34 Tower kernel: vt_console_print+0x1cc/0x308
Apr 26 10:37:34 Tower kernel: console_unlock+0x2df/0x3cd
Apr 26 10:37:34 Tower kernel: vprintk_emit+0x166/0x17d
Apr 26 10:37:34 Tower kernel: printk+0x53/0x6a
Apr 26 10:37:34 Tower kernel: sas_bsg_initialize+0x45/0xf5 [scsi_transport_sas]
Apr 26 10:37:34 Tower kernel: sas_rphy_add+0x81/0x141 [scsi_transport_sas]
Apr 26 10:37:34 Tower kernel: hpsa_scan_start+0x29a2/0x2aec [hpsa]
Apr 26 10:37:34 Tower kernel: ? do_scsi_scan_host+0x2e/0x72
Apr 26 10:37:34 Tower kernel: ? hpsa_scsi_queue_command+0x1e6/0x1e6 [hpsa]
Apr 26 10:37:34 Tower kernel: do_scsi_scan_host+0x2e/0x72
Apr 26 10:37:34 Tower kernel: scsi_scan_host+0x19f/0x1b4
Apr 26 10:37:34 Tower kernel: hpsa_init_one+0x176e/0x1985 [hpsa]
Apr 26 10:37:34 Tower kernel: local_pci_probe+0x39/0x7a
Apr 26 10:37:34 Tower kernel: work_for_cpu_fn+0x11/0x17
Apr 26 10:37:34 Tower kernel: process_one_work+0x16e/0x24f
Apr 26 10:37:34 Tower kernel: ? cancel_delayed_work_sync+0xa/0xa
Apr 26 10:37:34 Tower kernel: process_scheduled_works+0x22/0x27
Apr 26 10:37:34 Tower kernel: worker_thread+0x1f9/0x2ac
Apr 26 10:37:34 Tower kernel: kthread+0x10b/0x113
Apr 26 10:37:34 Tower kernel: ? kthread_flush_work_fn+0x9/0x9
Apr 26 10:37:34 Tower kernel: ret_from_fork+0x35/0x40
Apr 26 10:37:34 Tower kernel: NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
Apr 26 10:37:34 Tower kernel: CPU: 0 PID: 1496 Comm: kworker/0:5 Not tainted 4.18.20-unRAID #1
Apr 26 10:37:34 Tower kernel: Hardware name: HP ProLiant DL380e Gen8, BIOS P73 05/21/2018
Apr 26 10:37:34 Tower kernel: Workqueue: events work_for_cpu_fn
Apr 26 10:37:34 Tower kernel: RIP: 0010:vgacon_scroll+0xb1/0x23f
Apr 26 10:37:34 Tower kernel: Code: 00 00 89 ea 48 01 d0 48 3b 05 c7 5e b7 00 0f 82 90 00 00 00 48 8b 05 c2 5e b7 00 49 8d 34 14 8b 8b a8 01 00 00 48 89 c7 29 e9 <f3> a4 48 89 83 78 01 00 00 eb 76 83 78 0c 00 74 c0 0f b7 93 60 01 
Apr 26 10:37:34 Tower kernel: RSP: 0018:ffffc900079dba70 EFLAGS: 00010006
Apr 26 10:37:34 Tower kernel: RAX: ffff8800000b8000 RBX: ffff88081f018c00 RCX: 0000000000000225
Apr 26 10:37:34 Tower kernel: RDX: 00000000000000a0 RSI: ffff8800000bfd5b RDI: ffff8800000b8cdb
Apr 26 10:37:34 Tower kernel: RBP: 00000000000000a0 R08: 00000000ffffffff R09: ffff8800000bf080
Apr 26 10:37:34 Tower kernel: R10: 000000000000052f R11: 000000000001ba6c R12: ffff8800000befe0
Apr 26 10:37:34 Tower kernel: R13: ffff88081f018c00 R14: 0000000000000000 R15: 0000000000000000
Apr 26 10:37:34 Tower kernel: FS:  0000000000000000(0000) GS:ffff88081f600000(0000) knlGS:0000000000000000
Apr 26 10:37:34 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 10:37:34 Tower kernel: CR2: 0000000000733000 CR3: 0000000001e0a003 CR4: 00000000000606f0
Apr 26 10:37:34 Tower kernel: Call Trace:

 

 

I'm not great with these, but I would also get them on my g6/g7 when using the onboard raid controller. They just don't always play well with unraid.

 

and even more fun showing the controller lockup:

 

Apr 26 08:38:03 Tower kernel: ------------[ cut here ]------------
Apr 26 08:38:03 Tower kernel: hpsa 0000:0a:00.0: disabling already-disabled device
Apr 26 08:38:03 Tower kernel: WARNING: CPU: 0 PID: 1498 at drivers/pci/pci.c:1697 pci_disable_device+0x5d/0x80
Apr 26 08:38:03 Tower kernel: Modules linked in: md_mod bonding sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 ipmi_ssif crypto_simd cryptd glue_helper igb hpsa intel_cstate intel_uncore ahci intel_rapl_perf libahci i2c_algo_bit i2c_core scsi_transport_sas thermal button acpi_power_meter ipmi_si
Apr 26 08:38:03 Tower kernel: CPU: 0 PID: 1498 Comm: kworker/0:6 Not tainted 4.18.20-unRAID #1
Apr 26 08:38:03 Tower kernel: Hardware name: HP ProLiant DL380e Gen8, BIOS P73 05/21/2018
Apr 26 08:38:03 Tower kernel: Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
Apr 26 08:38:03 Tower kernel: RIP: 0010:pci_disable_device+0x5d/0x80
Apr 26 08:38:03 Tower kernel: Code: 48 85 ed 75 07 48 8b ab b0 00 00 00 48 8d bb a0 00 00 00 e8 b4 89 0d 00 48 89 ea 48 c7 c7 b1 50 d5 81 48 89 c6 e8 a6 46 d3 ff <0f> 0b 83 c8 ff f0 0f c1 83 c8 07 00 00 ff c8 75 0f 48 89 df e8 1e 
Apr 26 08:38:03 Tower kernel: RSP: 0018:ffffc900079ebe48 EFLAGS: 00010286
Apr 26 08:38:03 Tower kernel: RAX: 0000000000000000 RBX: ffff88081ad78000 RCX: 0000000000000007
Apr 26 08:38:03 Tower kernel: RDX: 0000000000000000 RSI: ffff88081f616470 RDI: ffff88081f616470
Apr 26 08:38:03 Tower kernel: RBP: ffff880c2b9bfcc0 R08: 0000000000000003 R09: 0000000000020400
Apr 26 08:38:03 Tower kernel: R10: 0000000000000600 R11: 00000076877f0e68 R12: 0000000000000000
Apr 26 08:38:03 Tower kernel: R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
Apr 26 08:38:03 Tower kernel: FS:  0000000000000000(0000) GS:ffff88081f600000(0000) knlGS:0000000000000000
Apr 26 08:38:03 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 08:38:03 Tower kernel: CR2: 00001548c4cc7fa0 CR3: 0000000001e0a003 CR4: 00000000000606f0
Apr 26 08:38:03 Tower kernel: Call Trace:
Apr 26 08:38:03 Tower kernel: detect_controller_lockup+0x144/0x21f [hpsa]

 

 

you can try going into iLO and seeing what the fan speed setting is (optimal vs whatever) and make sure it's on the lowest, but regardless, if you have any pcie devices in slots, they usually spin up the fans.

 

There's a few tips/tricks for proliants using unraid linked in my signature.
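If the syslog is too long to read by eye, a short Python sketch like the one below can pull out the NMI lines and count which kernel modules show up in the call traces quoted above. The regular expressions are written against the line format in this thread only, and `summarize` is a made-up helper name, so adjust as needed for other logs.

```python
import re
from collections import Counter

# e.g. "kernel: NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0."
NMI = re.compile(
    r"NMI: (?P<kind>.+?) for reason (?P<reason>[0-9a-f]+) on CPU (?P<cpu>\d+)"
)
# e.g. "kernel: ? hpsa_scsi_queue_command+0x1e6/0x1e6 [hpsa]"
MODULE = re.compile(
    r"kernel: \??\s*\S+\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]"
)


def summarize(lines):
    """Return (list of NMI events, Counter of modules seen in trace frames)."""
    nmis = []
    modules = Counter()
    for line in lines:
        nmi = NMI.search(line)
        if nmi:
            nmis.append((nmi.group("kind"), nmi.group("reason"), int(nmi.group("cpu"))))
            continue
        frame = MODULE.search(line)
        if frame:
            modules[frame.group("module")] += 1
    return nmis, modules


# A few lines copied from the trace in this thread.
SAMPLE_LOG = [
    "Apr 26 10:37:34 Tower kernel: sas_rphy_add+0x81/0x141 [scsi_transport_sas]",
    "Apr 26 10:37:34 Tower kernel: hpsa_scan_start+0x29a2/0x2aec [hpsa]",
    "Apr 26 10:37:34 Tower kernel: ? hpsa_scsi_queue_command+0x1e6/0x1e6 [hpsa]",
    "Apr 26 10:37:34 Tower kernel: NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.",
]

nmis, modules = summarize(SAMPLE_LOG)
print(nmis)
print(modules.most_common())
```

To run it on a real file, pass the open file object instead of the sample list, for example `summarize(open("syslog.txt"))` (the file name is just a placeholder).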

Link to comment

No PCIe cards, just the P420.

 

The onboard RAID controller has been disabled in the BIOS since at least the day before yesterday, because I think I read something like that myself.

 

I'll check out iLO as well, do you mean like defining the fan speed bias?

 

I'll go check out your links, already dabbled in them before.

 

Thank you so much already!

Link to comment
1 hour ago, Glassed Silver said:

I'll check out iLO as well, do you mean like defining the fan speed bias?

Yeah, you should find a setting for fan speed with a couple of options. If iLO doesn’t define them for you, then look them up because they can be a little misleading at times.

Link to comment
16 hours ago, 1812 said:

Yeah, you should find a setting for fan speed with a couple of options. If iLO doesn’t define them for you, then look them up because they can be a little misleading at times.

I wouldn't know where to look in iLO for SETTING my fans or anything like that. I can certainly see their speed, and all 6 of them run at 100%.

 

I could look in the BIOS, if that's what you mean?

 

Regardless, I have attached the full diagnostics.

 

I have seen something fishy in iLO's health log already.

 

See this bit and the screenshot:

BIOS/Hardware Health section reveals this:

Quote

 

1108 <CRITICAL> PCI Bus 04/29/2019 20:35 <INIT UPDATE: 04/29/2019 20:35> <COUNT: 1> Uncorrectable PCI Express Error (Slot 2, Bus 0, Device 3, Function 0, Error status 0x00000000)

1107 <CRITICAL> System Error 04/29/2019 20:35 <INIT UPDATE: 04/29/2019 20:35> <COUNT: 1> Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible

1106 <CAUTION> POST Message 04/29/2019 20:34 <INIT UPDATE: 04/29/2019 20:34> <COUNT: 1> POST Error: 1805-Slot X Drive Array - Cache Module Super-Cap is not installed; IMPORTANT: Unsupported Configuration: Cache Module cache functionality is limited. Action: Install the Super-Cap to remove these limitations.
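In case it helps with sorting through a longer log, here is a minimal Python sketch that splits entries like the three above into fields. The pattern is written against the text as pasted in this post, not against any official iLO export format, and it assumes the dates are month/day/year. `ImlEntry` and `parse_iml_line` are made-up names.

```python
import re
from datetime import datetime
from typing import NamedTuple


class ImlEntry(NamedTuple):
    number: int
    severity: str
    category: str
    logged: datetime
    count: int
    message: str


IML_LINE = re.compile(
    r"^(?P<number>\d+) <(?P<severity>[A-Z]+)> (?P<category>.+?) "
    r"(?P<logged>\d{2}/\d{2}/\d{4} \d{2}:\d{2}) "
    r"<INIT UPDATE: [^>]*> <COUNT: (?P<count>\d+)> (?P<message>.*)$"
)


def parse_iml_line(line):
    """Parse one pasted log line; return None if it does not match."""
    match = IML_LINE.match(line.strip())
    if match is None:
        return None
    return ImlEntry(
        number=int(match.group("number")),
        severity=match.group("severity"),
        category=match.group("category"),
        # Assumption: month/day/year, as the 04/29/2019 entries suggest.
        logged=datetime.strptime(match.group("logged"), "%m/%d/%Y %H:%M"),
        count=int(match.group("count")),
        message=match.group("message"),
    )


# The three entries quoted in this post.
SAMPLE_IML = """\
1108 <CRITICAL> PCI Bus 04/29/2019 20:35 <INIT UPDATE: 04/29/2019 20:35> <COUNT: 1> Uncorrectable PCI Express Error (Slot 2, Bus 0, Device 3, Function 0, Error status 0x00000000)
1107 <CRITICAL> System Error 04/29/2019 20:35 <INIT UPDATE: 04/29/2019 20:35> <COUNT: 1> Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible
1106 <CAUTION> POST Message 04/29/2019 20:34 <INIT UPDATE: 04/29/2019 20:34> <COUNT: 1> POST Error: 1805-Slot X Drive Array - Cache Module Super-Cap is not installed; IMPORTANT: Unsupported Configuration: Cache Module cache functionality is limited. Action: Install the Super-Cap to remove these limitations.
"""

entries = [parse_iml_line(line) for line in SAMPLE_IML.splitlines()]
for entry in entries:
    if entry is not None and entry.severity == "CRITICAL":
        print(entry.number, entry.category, entry.message)
```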

Storage section:

773176044_Anmerkung2019-04-29205741.thumb.png.bc7cc42bf1b885e1c49f59b6df9e60c0.png

tower-diagnostics-20190429-2048.zip

Link to comment
2 hours ago, Glassed Silver said:

I wouldn't know where to look in iLO for SETTING my fans or anything like that. I can certainly see their speed, and all 6 of them run at 100%.

 

I could look in the BIOS, if that's what you mean?

 

Regardless, I have attached the full diagnostics.

 

I have seen something fishy in iLO's health log already.

 

See this bit and the screenshot:

BIOS/Hardware Health section reveals this:

Storage section:

773176044_Anmerkung2019-04-29205741.thumb.png.bc7cc42bf1b885e1c49f59b6df9e60c0.png

tower-diagnostics-20190429-2048.zip

Your cache is missing a battery, or the battery is faulty.

 

also for iLO: http://hamedsalami.com/how-to-configure-ilo-in-hp-servers/

 

you're looking for cooling settings or similar

Link to comment
1 hour ago, 1812 said:

Your cache is missing a battery, or the battery is faulty.

 

also for iLO: http://hamedsalami.com/how-to-configure-ilo-in-hp-servers/

 

you're looking for cooling settings or similar

Thanks for the link, will check it out when I get home.

 

So far I thought I could ignore the caching, since in HBA mode it doesn't use it anyways if I understand the concept of HBAs correctly.

 

I'll remove the cache module then and see if that fixes it. Maybe I can run it with the P420 after all, at least until I get the HBA so I can already do some software configuration and get some things swinging. :)

Link to comment
2 hours ago, Glassed Silver said:

Thanks for the link, will check it out when I get home.

 

So far I thought I could ignore the caching, since in HBA mode it doesn't use it anyways if I understand the concept of HBAs correctly.

 

I'll remove the cache module then and see if that fixes it. Maybe I can run it with the P420 after all, at least until I get the HBA so I can already do some software configuration and get some things swinging. :)

I’ve run raid controllers without the battery backing, no big deal, just lose the ability to retain any data in the cache if your server loses power. 

Link to comment
4 hours ago, 1812 said:

I’ve run raid controllers without the battery backing, no big deal, just lose the ability to retain any data in the cache if your server loses power. 

I have good and bad news.

 

Bad news: Can't run the P420 without cache module apparently. (Source)

(Smart Storage Administrator in iLO returns a fatal error stating P420 is disabled due to the unattached cache module.)

 

Good news: With unRAID fully booted the server is MUCH less noisy without the cache module. (sweet sweet 32-54% fan spin!!! Wouldn't mind less during practically no load at all other than an idling unRAID with the P420 disabled even, but I'll take it over 100% any day of the week!) So since the P420 is disabled right now at the very least, I now know that the card needs addressing and there isn't really anything else getting in the way.

 

Unless I'm missing something here and there's a trick to get the P420 to behave and have it ignore that it's designed for caching use. :P

Link to comment

Update time...

 

Good news: The super cap had been in there all this time - just not connected. So I plugged it into the P420 cache module and lo and behold it now remains activated, still not working though, because...

 

Bad news: the fans spin at 100% again, just like before.

 

I'm also still getting this error in unRAID:

Quote

Tower emhttpd: device /dev/sdb problem getting id

 

As for cooling settings, it's set to Optimal, which is the lowest setting. I found it in the BIOS and had forgotten that I set it to that long ago.

Link to comment
  • 9 months later...

Hello,
I had very similar problems.
I solved them by mounting a 3D-printed tunnel with an additional fan on the controller.
Fans: 35-36%
Controller temperature: 80-83 °C

I haven't checked it yet, but it's possible that our problems are caused by bad thermal contact between the heat sink and the chip.

 

https://www.thingiverse.com/thing:4070343 

 

Good luck :)

Link to comment
58 minutes ago, Grzery said:

Hello,
I had very similar problems.
I solved them by mounting a 3D-printed tunnel with an additional fan on the controller.
Fans: 35-36%
Controller temperature: 80-83 °C

I haven't checked it yet, but it's possible that our problems are caused by bad thermal contact between the heat sink and the chip.

 

https://www.thingiverse.com/thing:4070343 

 

Good luck :)

Thanks for your contribution, but this was solved long ago (see the post above yours).

 

However it's never bad to make contributions even after an issue is solved, since people who google this problem might find the further tips helpful! :)

Link to comment

Thanks for the suggestions :)
Unfortunately, I do not have an HBA yet.
A P420 is running in the server (Gen8).
The installed operating system is Windows Server 2016 (Hyper-V) and must remain so.
iLO 4 shows no problems. Only after a failure does the system report an error identical to the one in the post by 1812 from April 30, 2019.
Three 512 GB SSDs (RAID 5) and eight 4 TB drives (RAID 6) are installed.
The 4 TB drives are from different manufacturers (Seagate, WD). This is probably the cause of my problems.
I recommend the article
http://dascomputerconsultants.com/HPCompaqServerDrives.htm
Its author has collected some interesting and very useful information about problems with "foreign" disks in HP servers.

Link to comment

The last time I had turbo fan problems (on an HP DL380 G6) it was because I had a HDD that wasn't reporting SMART temp data correctly, and would randomly cause the system to think it was overheating.  Replacing it and its mirror in the system with HP branded disks solved the problem.

Interestingly, the non-HP branded 4-port NIC (albeit an Intel card whose parts are identical to the HP version; it just doesn't have an HP sticker) and the generic 4-port USB 3.0 card don't cause the box to freak out. For the latter, someone else did extensive testing on which generic USB cards work in HP servers without freaking them out.

 

I'd start looking at disks and interface cards to see if they're the cause of the problem.

Link to comment
  • 1 month later...

The system has been stable for a month.
I did not replace hard drives.
The controller (P420) temperature is now 52 °C (it was 80-83 °C) at an ambient temperature of 17 °C.
I achieved this by:
1. Installing a tunnel with 2 fans (similar to those used on graphics cards).
2. Replacing the thermal paste under the controller heat sink. The original paste did not give good thermal coupling with the chip.
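To put those readings in perspective, here is the arithmetic as a tiny Python sketch. The only inputs are the temperatures reported above; the variable names are made up, and the "before" rise over ambient assumes the room was also at about 17 °C back then, which the posts do not state.

```python
# Readings from the posts above, in degrees Celsius.
ambient_c = 17
before_c = (80, 83)   # controller temperature range before the changes
after_c = 52          # controller temperature after tunnel + new paste

# Drop in controller temperature (low and high end of the old range).
drop_c = tuple(old - after_c for old in before_c)

# Rise over ambient after the changes.
rise_after_c = after_c - ambient_c

# Assumption: ambient was also about 17 C when the old readings were taken.
rise_before_c = tuple(old - ambient_c for old in before_c)

print("drop:", drop_c)
print("rise over ambient, after:", rise_after_c)
print("rise over ambient, before (assumed ambient):", rise_before_c)
```

So the controller runs roughly 28 to 31 °C cooler, at about 35 °C above room temperature.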

 

 

Greetings to all, and good health to you and your systems.

Edited by Grzery
Link to comment
  • 7 months later...

I can give you a tip on why the fans run so fast: HP servers have compatibility issues with some third-party parts. For example, if a PCI Express GPU card does not support Option Card Sensor Data (OCSD), a protocol agreed between HP and certain hardware vendors, the temperature sensor reading rises abnormally (it is not a true Celsius value), and the fan speed rises along with that abnormal reading.

That is why you can't control the fan speed. You can also check this at boot: press F10 to start Intelligent Provisioning. Shortly after Intelligent Provisioning has started, the fan speed goes back to normal.

I suspect that IP (Intelligent Provisioning) has some command that brings the fan speed back to normal.

So I think the way forward is to connect over SSH or serial, dump the server's real-time data while IP is running, and find out which command slows the fans down to normal.

Unfortunately, I am not familiar with doing this, so if you know how to capture such commands, or know someone who could easily solve it, please solve the problem in my place.

Many people suffer from this HP server fan noise issue.

 

 


Link to comment
