Glassed Silver

HP DL380e Gen8 - fans rev at 100% at ALL times.

Hey guys,

 

so when fully booted into unRAID the fans will spin to what sounds like 100% load.

 

This is in absolute idle mode, nothing going on, vanilla install. The fans are so loud it's actually a burden and considerable time pressure for any bug hunting. I have attached the syslog, it's two days old, but I don't think I changed anything during that time, wouldn't know what. Hardly booted into unRAID at all.

 

I have a separate issue concerning the P420 card in HBA mode, if anyone feels like helping me troubleshoot that as well, I'd be forever thankful. :)

 

Link to thread.

 

Cheers

tower-syslog-20190426-1810.zip


You should probably post a proper diagnostics file rather than just the syslog. Looking at the syslog, something in your system seems to be causing a crash, which may be related; I'll leave that to others to analyze. As for the fans, one possible reason is that, as you may know, these servers are designed to be used with the vendor's (in this case HP) certified storage devices. If you are using normal off-the-shelf drives, it is possible that the system cannot read the drive temperature from the device and runs the fans at high speed as a result. If this is the case, you may be seeing some kind of complaint during POST.

35 minutes ago, WizADSL said:

You should probably post a proper diagnostics file rather than just the syslog. Looking at the syslog, something in your system seems to be causing a crash, which may be related; I'll leave that to others to analyze. As for the fans, one possible reason is that, as you may know, these servers are designed to be used with the vendor's (in this case HP) certified storage devices. If you are using normal off-the-shelf drives, it is possible that the system cannot read the drive temperature from the device and runs the fans at high speed as a result. If this is the case, you may be seeing some kind of complaint during POST.

I'll pull out the WD and see if that fixes it; after that the system would be all-HP. That being said, if this were the cause, wouldn't it also affect other environments, like booting into Smart Array Administrator, Intelligent Provisioning or such?

 

The system is running pretty okay as long as I don't load up unRAID.

 

I'll boot into unRAID and quickly pull the diagnostics with the WD drive out tomorrow morning. :)


Do you have any PCIe cards in it? That tended to trigger a 60% minimum fan speed on my G6 and G7 ProLiants.

 

also you have a call trace: 

 

Apr 26 10:37:34 Tower kernel: Call Trace:
Apr 26 10:37:34 Tower kernel: con_scroll+0xb7/0xea
Apr 26 10:37:34 Tower kernel: lf+0x28/0x5e
Apr 26 10:37:34 Tower kernel: vt_console_print+0x1cc/0x308
Apr 26 10:37:34 Tower kernel: console_unlock+0x2df/0x3cd
Apr 26 10:37:34 Tower kernel: vprintk_emit+0x166/0x17d
Apr 26 10:37:34 Tower kernel: printk+0x53/0x6a
Apr 26 10:37:34 Tower kernel: sas_bsg_initialize+0x45/0xf5 [scsi_transport_sas]
Apr 26 10:37:34 Tower kernel: sas_rphy_add+0x81/0x141 [scsi_transport_sas]
Apr 26 10:37:34 Tower kernel: hpsa_scan_start+0x29a2/0x2aec [hpsa]
Apr 26 10:37:34 Tower kernel: ? do_scsi_scan_host+0x2e/0x72
Apr 26 10:37:34 Tower kernel: ? hpsa_scsi_queue_command+0x1e6/0x1e6 [hpsa]
Apr 26 10:37:34 Tower kernel: do_scsi_scan_host+0x2e/0x72
Apr 26 10:37:34 Tower kernel: scsi_scan_host+0x19f/0x1b4
Apr 26 10:37:34 Tower kernel: hpsa_init_one+0x176e/0x1985 [hpsa]
Apr 26 10:37:34 Tower kernel: local_pci_probe+0x39/0x7a
Apr 26 10:37:34 Tower kernel: work_for_cpu_fn+0x11/0x17
Apr 26 10:37:34 Tower kernel: process_one_work+0x16e/0x24f
Apr 26 10:37:34 Tower kernel: ? cancel_delayed_work_sync+0xa/0xa
Apr 26 10:37:34 Tower kernel: process_scheduled_works+0x22/0x27
Apr 26 10:37:34 Tower kernel: worker_thread+0x1f9/0x2ac
Apr 26 10:37:34 Tower kernel: kthread+0x10b/0x113
Apr 26 10:37:34 Tower kernel: ? kthread_flush_work_fn+0x9/0x9
Apr 26 10:37:34 Tower kernel: ret_from_fork+0x35/0x40
Apr 26 10:37:34 Tower kernel: NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
Apr 26 10:37:34 Tower kernel: CPU: 0 PID: 1496 Comm: kworker/0:5 Not tainted 4.18.20-unRAID #1
Apr 26 10:37:34 Tower kernel: Hardware name: HP ProLiant DL380e Gen8, BIOS P73 05/21/2018
Apr 26 10:37:34 Tower kernel: Workqueue: events work_for_cpu_fn
Apr 26 10:37:34 Tower kernel: RIP: 0010:vgacon_scroll+0xb1/0x23f
Apr 26 10:37:34 Tower kernel: Code: 00 00 89 ea 48 01 d0 48 3b 05 c7 5e b7 00 0f 82 90 00 00 00 48 8b 05 c2 5e b7 00 49 8d 34 14 8b 8b a8 01 00 00 48 89 c7 29 e9 <f3> a4 48 89 83 78 01 00 00 eb 76 83 78 0c 00 74 c0 0f b7 93 60 01 
Apr 26 10:37:34 Tower kernel: RSP: 0018:ffffc900079dba70 EFLAGS: 00010006
Apr 26 10:37:34 Tower kernel: RAX: ffff8800000b8000 RBX: ffff88081f018c00 RCX: 0000000000000225
Apr 26 10:37:34 Tower kernel: RDX: 00000000000000a0 RSI: ffff8800000bfd5b RDI: ffff8800000b8cdb
Apr 26 10:37:34 Tower kernel: RBP: 00000000000000a0 R08: 00000000ffffffff R09: ffff8800000bf080
Apr 26 10:37:34 Tower kernel: R10: 000000000000052f R11: 000000000001ba6c R12: ffff8800000befe0
Apr 26 10:37:34 Tower kernel: R13: ffff88081f018c00 R14: 0000000000000000 R15: 0000000000000000
Apr 26 10:37:34 Tower kernel: FS:  0000000000000000(0000) GS:ffff88081f600000(0000) knlGS:0000000000000000
Apr 26 10:37:34 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 10:37:34 Tower kernel: CR2: 0000000000733000 CR3: 0000000001e0a003 CR4: 00000000000606f0
Apr 26 10:37:34 Tower kernel: Call Trace:
Apr 26 10:37:34 Tower kernel: con_scroll+0xb7/0xea
Apr 26 10:37:34 Tower kernel: lf+0x28/0x5e
Apr 26 10:37:34 Tower kernel: vt_console_print+0x1cc/0x308
Apr 26 10:37:34 Tower kernel: console_unlock+0x2df/0x3cd
Apr 26 10:37:34 Tower kernel: vprintk_emit+0x166/0x17d
Apr 26 10:37:34 Tower kernel: printk+0x53/0x6a
Apr 26 10:37:34 Tower kernel: sas_bsg_initialize+0x45/0xf5 [scsi_transport_sas]
Apr 26 10:37:34 Tower kernel: sas_rphy_add+0x81/0x141 [scsi_transport_sas]
Apr 26 10:37:34 Tower kernel: hpsa_scan_start+0x29a2/0x2aec [hpsa]
Apr 26 10:37:34 Tower kernel: ? do_scsi_scan_host+0x2e/0x72
Apr 26 10:37:34 Tower kernel: ? hpsa_scsi_queue_command+0x1e6/0x1e6 [hpsa]
Apr 26 10:37:34 Tower kernel: do_scsi_scan_host+0x2e/0x72
Apr 26 10:37:34 Tower kernel: scsi_scan_host+0x19f/0x1b4
Apr 26 10:37:34 Tower kernel: hpsa_init_one+0x176e/0x1985 [hpsa]
Apr 26 10:37:34 Tower kernel: local_pci_probe+0x39/0x7a
Apr 26 10:37:34 Tower kernel: work_for_cpu_fn+0x11/0x17
Apr 26 10:37:34 Tower kernel: process_one_work+0x16e/0x24f
Apr 26 10:37:34 Tower kernel: ? cancel_delayed_work_sync+0xa/0xa
Apr 26 10:37:34 Tower kernel: process_scheduled_works+0x22/0x27
Apr 26 10:37:34 Tower kernel: worker_thread+0x1f9/0x2ac
Apr 26 10:37:34 Tower kernel: kthread+0x10b/0x113
Apr 26 10:37:34 Tower kernel: ? kthread_flush_work_fn+0x9/0x9
Apr 26 10:37:34 Tower kernel: ret_from_fork+0x35/0x40
Apr 26 10:37:34 Tower kernel: NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
Apr 26 10:37:34 Tower kernel: CPU: 0 PID: 1496 Comm: kworker/0:5 Not tainted 4.18.20-unRAID #1
Apr 26 10:37:34 Tower kernel: Hardware name: HP ProLiant DL380e Gen8, BIOS P73 05/21/2018
Apr 26 10:37:34 Tower kernel: Workqueue: events work_for_cpu_fn
Apr 26 10:37:34 Tower kernel: RIP: 0010:vgacon_scroll+0xb1/0x23f
Apr 26 10:37:34 Tower kernel: Code: 00 00 89 ea 48 01 d0 48 3b 05 c7 5e b7 00 0f 82 90 00 00 00 48 8b 05 c2 5e b7 00 49 8d 34 14 8b 8b a8 01 00 00 48 89 c7 29 e9 <f3> a4 48 89 83 78 01 00 00 eb 76 83 78 0c 00 74 c0 0f b7 93 60 01 
Apr 26 10:37:34 Tower kernel: RSP: 0018:ffffc900079dba70 EFLAGS: 00010006
Apr 26 10:37:34 Tower kernel: RAX: ffff8800000b8000 RBX: ffff88081f018c00 RCX: 0000000000000225
Apr 26 10:37:34 Tower kernel: RDX: 00000000000000a0 RSI: ffff8800000bfd5b RDI: ffff8800000b8cdb
Apr 26 10:37:34 Tower kernel: RBP: 00000000000000a0 R08: 00000000ffffffff R09: ffff8800000bf080
Apr 26 10:37:34 Tower kernel: R10: 000000000000052f R11: 000000000001ba6c R12: ffff8800000befe0
Apr 26 10:37:34 Tower kernel: R13: ffff88081f018c00 R14: 0000000000000000 R15: 0000000000000000
Apr 26 10:37:34 Tower kernel: FS:  0000000000000000(0000) GS:ffff88081f600000(0000) knlGS:0000000000000000
Apr 26 10:37:34 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 10:37:34 Tower kernel: CR2: 0000000000733000 CR3: 0000000001e0a003 CR4: 00000000000606f0
Apr 26 10:37:34 Tower kernel: Call Trace:

 

 

I'm not great with these, but I would also get them on my g6/g7 when using the onboard raid controller. They just don't always play well with unraid.
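
For anyone who wants to pull lines like these out of a longer syslog, here is a minimal Python sketch. It assumes the NMI lines always look like the two quoted above; the pattern and the function name `find_nmi_events` are made up for illustration and are not part of unRAID or the kernel.

```python
import re

# Pattern built only from the two NMI lines quoted in this post (an assumption,
# not a documented format).
NMI_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"NMI: IOCK error \(debug interrupt\?\) for reason (?P<reason>[0-9a-f]+) "
    r"on CPU (?P<cpu>\d+)"
)


def find_nmi_events(lines):
    """Return (timestamp, reason, cpu) for every NMI IOCK line found."""
    events = []
    for line in lines:
        match = NMI_RE.match(line)
        if match:
            events.append(
                (match.group("stamp"), match.group("reason"), int(match.group("cpu")))
            )
    return events


# Sample lines copied from the syslog excerpt above.
sample = [
    "Apr 26 10:37:34 Tower kernel: ret_from_fork+0x35/0x40",
    "Apr 26 10:37:34 Tower kernel: NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.",
    "Apr 26 10:37:34 Tower kernel: NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.",
]
print(find_nmi_events(sample))
```

To run it against a real file, pass an open file object instead of `sample`, for example `find_nmi_events(open("syslog.txt"))` (the file name is a placeholder).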

 

and even more fun showing the controller lockup:

 

Apr 26 08:38:03 Tower kernel: ------------[ cut here ]------------
Apr 26 08:38:03 Tower kernel: hpsa 0000:0a:00.0: disabling already-disabled device
Apr 26 08:38:03 Tower kernel: WARNING: CPU: 0 PID: 1498 at drivers/pci/pci.c:1697 pci_disable_device+0x5d/0x80
Apr 26 08:38:03 Tower kernel: Modules linked in: md_mod bonding sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 ipmi_ssif crypto_simd cryptd glue_helper igb hpsa intel_cstate intel_uncore ahci intel_rapl_perf libahci i2c_algo_bit i2c_core scsi_transport_sas thermal button acpi_power_meter ipmi_si
Apr 26 08:38:03 Tower kernel: CPU: 0 PID: 1498 Comm: kworker/0:6 Not tainted 4.18.20-unRAID #1
Apr 26 08:38:03 Tower kernel: Hardware name: HP ProLiant DL380e Gen8, BIOS P73 05/21/2018
Apr 26 08:38:03 Tower kernel: Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
Apr 26 08:38:03 Tower kernel: RIP: 0010:pci_disable_device+0x5d/0x80
Apr 26 08:38:03 Tower kernel: Code: 48 85 ed 75 07 48 8b ab b0 00 00 00 48 8d bb a0 00 00 00 e8 b4 89 0d 00 48 89 ea 48 c7 c7 b1 50 d5 81 48 89 c6 e8 a6 46 d3 ff <0f> 0b 83 c8 ff f0 0f c1 83 c8 07 00 00 ff c8 75 0f 48 89 df e8 1e 
Apr 26 08:38:03 Tower kernel: RSP: 0018:ffffc900079ebe48 EFLAGS: 00010286
Apr 26 08:38:03 Tower kernel: RAX: 0000000000000000 RBX: ffff88081ad78000 RCX: 0000000000000007
Apr 26 08:38:03 Tower kernel: RDX: 0000000000000000 RSI: ffff88081f616470 RDI: ffff88081f616470
Apr 26 08:38:03 Tower kernel: RBP: ffff880c2b9bfcc0 R08: 0000000000000003 R09: 0000000000020400
Apr 26 08:38:03 Tower kernel: R10: 0000000000000600 R11: 00000076877f0e68 R12: 0000000000000000
Apr 26 08:38:03 Tower kernel: R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
Apr 26 08:38:03 Tower kernel: FS:  0000000000000000(0000) GS:ffff88081f600000(0000) knlGS:0000000000000000
Apr 26 08:38:03 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 08:38:03 Tower kernel: CR2: 00001548c4cc7fa0 CR3: 0000000001e0a003 CR4: 00000000000606f0
Apr 26 08:38:03 Tower kernel: Call Trace:
Apr 26 08:38:03 Tower kernel: detect_controller_lockup+0x144/0x21f [hpsa]
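
A quick way to see which drivers a trace like this passes through is to count the module names in square brackets at the end of each frame. The sketch below is only an illustration: the regex is derived from the frame lines quoted in this thread, and `modules_in_trace` is a hypothetical helper name. Frames without a bracketed module are counted as "(built-in)".

```python
import re
from collections import Counter

# A frame looks like "symbol+0xOFF/0xLEN" with an optional "[module]" suffix
# and an optional leading "? " for unreliable frames (as seen in the excerpt).
FRAME_RE = re.compile(
    r"kernel: (?:\? )?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def modules_in_trace(lines):
    """Count how many stack frames belong to each kernel module."""
    counts = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            counts[match.group("module") or "(built-in)"] += 1
    return counts


# Sample lines copied from the two excerpts in this post.
sample = [
    "Apr 26 08:38:03 Tower kernel: Call Trace:",
    "Apr 26 08:38:03 Tower kernel: detect_controller_lockup+0x144/0x21f [hpsa]",
    "Apr 26 10:37:34 Tower kernel: sas_rphy_add+0x81/0x141 [scsi_transport_sas]",
    "Apr 26 10:37:34 Tower kernel: ? do_scsi_scan_host+0x2e/0x72",
    "Apr 26 10:37:34 Tower kernel: hpsa_init_one+0x176e/0x1985 [hpsa]",
]
print(modules_in_trace(sample).most_common())
```

In both excerpts the frames point at `hpsa`, the driver for the Smart Array controller, which fits the lockup message.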

 

 

You can try going into iLO and checking what the fan speed setting is (Optimal vs. whatever else) and make sure it's on the lowest one. Regardless, if you have any PCIe devices in slots, they usually spin up the fans.

 

There are a few tips/tricks for ProLiants running unRAID linked in my signature.


No PCIe cards, just the P420.

 

Onboard RAID controller is disabled in BIOS since the day before yesterday at least, because I read something like that myself I think.

 

I'll check out iLO as well, do you mean like defining the fan speed bias?

 

I'll go check out your links, already dabbled in them before.

 

Thank you so much already!

1 hour ago, Glassed Silver said:

I'll check out iLO as well, do you mean like defining the fan speed bias?

Yeah, you should find a setting for fan speed with a couple of options. If iLO doesn’t define them for you, then look them up, because they can be a little misleading at times.

16 hours ago, 1812 said:

Yeah, you should find a setting for fan speed with a couple of options. If iLO doesn’t define them for you, then look them up, because they can be a little misleading at times.

I wouldn't know where to look for SETTING my fans or anything like that in iLO. I can certainly see their speed, and all 6 of them run at 100%.

 

I could look in the BIOS, if that's what you mean?

 

Regardless, I have attached the full diagnostics.

 

I have seen something fishy in iLO's health log already.

 

See this bit and the screenshot:

BIOS/Hardware Health section reveals this:

Quote

 

1108 <CRITICAL> PCI Bus 04/29/2019 20:35 <INIT UPDATE: 04/29/2019 20:35> <COUNT: 1> Uncorrectable PCI Express Error (Slot 2, Bus 0, Device 3, Function 0, Error status 0x00000000)

1107 <CRITICAL> System Error 04/29/2019 20:35 <INIT UPDATE: 04/29/2019 20:35> <COUNT: 1> Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible

1106 <CAUTION> POST Message 04/29/2019 20:34 <INIT UPDATE: 04/29/2019 20:34> <COUNT: 1> POST Error: 1805-Slot X Drive Array - Cache Module Super-Cap is not installed; IMPORTANT: Unsupported Configuration: Cache Module cache functionality is limited. Action: Install the Super-Cap to remove these limitations.
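
In case it helps anyone filtering a longer log, the entries above can be split into fields with a short Python sketch. The field layout is inferred only from these three entries, and `parse_iml_entry` is a made-up name, so treat it as a starting point rather than an official format.

```python
import re
from datetime import datetime

# Layout inferred from the three entries quoted above (an assumption):
# id <SEVERITY> class MM/DD/YYYY HH:MM <INIT UPDATE: ...> <COUNT: n> message
IML_RE = re.compile(
    r"^(?P<id>\d+) <(?P<severity>[A-Z]+)> (?P<cls>.+?) "
    r"(?P<when>\d{2}/\d{2}/\d{4} \d{2}:\d{2}) "
    r"<INIT UPDATE: [^>]+> <COUNT: (?P<count>\d+)> (?P<message>.+)$"
)


def parse_iml_entry(line):
    """Split one log line into a dict, or return None if it does not match."""
    match = IML_RE.match(line.strip())
    if match is None:
        return None
    return {
        "id": int(match.group("id")),
        "severity": match.group("severity"),
        "class": match.group("cls"),
        "when": datetime.strptime(match.group("when"), "%m/%d/%Y %H:%M"),
        "count": int(match.group("count")),
        "message": match.group("message"),
    }


# The three entries from this post, split across string literals for width.
entries = [
    "1108 <CRITICAL> PCI Bus 04/29/2019 20:35 <INIT UPDATE: 04/29/2019 20:35> "
    "<COUNT: 1> Uncorrectable PCI Express Error (Slot 2, Bus 0, Device 3, "
    "Function 0, Error status 0x00000000)",
    "1107 <CRITICAL> System Error 04/29/2019 20:35 <INIT UPDATE: 04/29/2019 20:35> "
    "<COUNT: 1> Unrecoverable System Error (NMI) has occurred. System Firmware "
    "will log additional details in a separate IML entry if possible",
    "1106 <CAUTION> POST Message 04/29/2019 20:34 <INIT UPDATE: 04/29/2019 20:34> "
    "<COUNT: 1> POST Error: 1805-Slot X Drive Array - Cache Module Super-Cap is "
    "not installed; IMPORTANT: Unsupported Configuration: Cache Module cache "
    "functionality is limited. Action: Install the Super-Cap to remove these "
    "limitations.",
]
parsed = [parse_iml_entry(entry) for entry in entries]
critical_ids = [e["id"] for e in parsed if e and e["severity"] == "CRITICAL"]
print(critical_ids)  # [1108, 1107]
```

Both critical entries line up with the NMI seen in the syslog, and the PCI Express error names Slot 2.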

Storage section:

[Screenshot: iLO Storage section]

tower-diagnostics-20190429-2048.zip

2 hours ago, Glassed Silver said:

I wouldn't know where to look for SETTING my fans or anything like that in iLO. I can certainly see their speed, and all 6 of them run at 100%.

 

I could look in the BIOS, if that's what you mean?

 

Regardless, I have attached the full diagnostics.

 

I have seen something fishy in iLO's health log already.

 

See this bit and the screenshot:

BIOS/Hardware Health section reveals this:

Storage section:

[Screenshot: iLO Storage section]

tower-diagnostics-20190429-2048.zip

Your cache is missing a battery or is faulty.

 

also for iLO: http://hamedsalami.com/how-to-configure-ilo-in-hp-servers/

 

you're looking for cooling settings or similar

1 hour ago, 1812 said:

Your cache is missing a battery or is faulty.

 

also for iLO: http://hamedsalami.com/how-to-configure-ilo-in-hp-servers/

 

you're looking for cooling settings or similar

Thanks for the link, will check it out when I get home.

 

So far I thought I could ignore the caching, since in HBA mode it doesn't use it anyways if I understand the concept of HBAs correctly.

 

I'll remove the cache module then and see if that fixes it. Maybe I can run it with the P420 after all, at least until I get the HBA so I can already do some software configuration and get some things swinging. :)

2 hours ago, Glassed Silver said:

Thanks for the link, will check it out when I get home.

 

So far I thought I could ignore the caching, since in HBA mode it doesn't use it anyways if I understand the concept of HBAs correctly.

 

I'll remove the cache module then and see if that fixes it. Maybe I can run it with the P420 after all, at least until I get the HBA so I can already do some software configuration and get some things swinging. :)

I’ve run RAID controllers without the battery backing. It's no big deal; you just lose the ability to retain any data in the cache if your server loses power.

4 hours ago, 1812 said:

I’ve run RAID controllers without the battery backing. It's no big deal; you just lose the ability to retain any data in the cache if your server loses power.

I have good and bad news.

 

Bad news: Can't run the P420 without cache module apparently. (Source)

(Smart Storage Administrator in iLO returns a fatal error stating P420 is disabled due to the unattached cache module.)

 

Good news: With unRAID fully booted, the server is MUCH less noisy without the cache module (sweet, sweet 32-54% fan speed!!!). I wouldn't mind even less, given there's practically no load other than an idling unRAID with the P420 disabled, but I'll take it over 100% any day of the week! Since the P420 is disabled right now, I at least know that the card is what needs addressing and that nothing else is really getting in the way.

 

Unless I'm missing something here and there's a trick to get the P420 to behave and have it ignore that it's designed for caching use. :P


Update time...

 

Good news: The super cap had been in there all this time - just not connected. So I plugged it into the P420 cache module and lo and behold it now remains activated, still not working though, because...

 

Bad news: the fans spin at 100% again, just like before.

 

I'm also still getting this error in unRAID:

Quote

Tower emhttpd: device /dev/sdb problem getting id

 

As for cooling settings, it's set to Optimal, which is the lowest setting. I found it in the BIOS; I'd forgotten that I had set it to that long ago.


An H220 HBA card solved my issues all around. It's a shame you can't put a nice card like the P420 to good use, but it is what it is.

 

Happy I got it working and am now in well-supported territory rather than doing sketchy hacks, to be frank. :)
