Unable to run a Parity check


Good Afternoon,

 

I am having an issue when trying to manually run or schedule a parity check. Every time I try to kick one off, the entire system hangs. More specifically, when I am on the console and kick it off, all the drive lights light up as if it is doing a parity check; however, the web interface becomes unresponsive. Furthermore, when monitoring from a remote computer, I see that network connectivity (ping test) drops altogether.
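For anyone wanting to reproduce the remote ping test with timestamps, here is a minimal sketch. It assumes a Linux-style `ping` binary (`-c` for count, `-W` for timeout in seconds) on the monitoring computer; the function names and the host name in the usage note are hypothetical and not part of Unraid.

```python
import subprocess
import time
from datetime import datetime
from typing import Callable, List, Tuple


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Return True if one ICMP echo to host is answered (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(
    host: str,
    probe: Callable[[str], bool] = ping_once,
    interval_s: float = 1.0,
    count: int = 60,
) -> List[Tuple[str, bool]]:
    """Probe host `count` times and return (timestamp, reachable) samples."""
    samples = []
    for _ in range(count):
        stamp = datetime.now().isoformat(timespec="seconds")
        samples.append((stamp, probe(host)))
        time.sleep(interval_s)
    return samples


def find_transitions(samples: List[Tuple[str, bool]]) -> List[Tuple[str, str]]:
    """Return the samples at which reachability flips, labelled UP or DOWN."""
    transitions = []
    previous = None
    for stamp, reachable in samples:
        if previous is not None and reachable != previous:
            transitions.append((stamp, "UP" if reachable else "DOWN"))
        previous = reachable
    return transitions
```

Calling `find_transitions(monitor("unraid-1.example", count=600))` while starting the parity check would give the moment the server stops answering, which can then be lined up against the syslog in the diagnostics.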

 

I have verified this behavior on both my older Supermicro server and the newly provisioned (this past weekend) HP DL365 G7. Please note, I have tried safe mode, GUI mode without plugins, etc., all with the same results.

 

Current setup:

UNRAID Version 6.7.0 2019-05-08

HP DL365 G7

192G RAM

AMD Opteron™ 6282 SE @ 2600 MHz

Qty. (2) H200 flashed to IT Mode - Not connected to any drives

Qty. (1) H200E flashed to IT Mode - connected to external EMC SAS shelves

Qty. (30) 4TB Enterprise SAS Flash Drives (2 Parity, 28 Data)

Qty. (4) 4TB Enterprise SAS Flash Drives (BTRFS Cache Pool)

All plugins removed

Only modification: /boot/config/go => rmmod acpi_power_meter

 

I am attaching the diagnostics file for your review. 

 

All that said, I suspect that if I just let it run, it would complete in a day or so; however, none of the VMs and containers are available, nor can UNRAID be managed. This has been going on for quite some time, and the only way to recover is to perform a hard shutdown and power on. When it comes up, I have to be fairly quick to cancel the parity check.

 

Any help is appreciated.

 

Thanks again!

 

-MW

unraid-1-diagnostics-20190528-1430.zip

Edited by mfwade
Link to comment
41 minutes ago, mfwade said:

Qty. (2) H200 flashed to IT Mode - Not connected to any drives

Qty. (1) H200E flashed to IT Mode - connected to external EMC SAS shelves

Qty. (30) 4TB Enterprise SAS Flash Drives (2 Parity, 28 Data)

Qty. (4) 4TB Enterprise SAS Flash Drives (BTRFS Cache Pool)

Can the power supplies handle the massive current draw of all drives spinning up at the same time?

Link to comment

Server power supplies are rated @ 750watts x 2. Power draw at start up is 392watts. At idle 242watts.

 

The SAS trays each have 1150watt x 2 power supplies. At idle with no drives they are running @ 40watts total.

 

These are enterprise-rated disk trays, rated for 25 HDDs. The SSDs I am using consume much less than a spinner.

Edited by mfwade
Link to comment

To clarify a bit further...

 

Server contains no drives at this time.

 

H200E Connection 1 => SAS DAE tray 1 contains 14 data drives, 2 parity drives, 4 cache drives - (all 2.5")

 

H200E Connection 2 => SAS DAE tray 2 contains 14 data drives - (all 2.5")

DAE tray 2 SAS expansion => SAS DAE tray 3 contains 0 drives - will be used for Unassigned Devices (at some point)

 

The server and all SAS DAEs are connected to the same rackmount APC Smart UPS 1500, which is running at around 65% load.

 

-MW

Link to comment

The reason I was asking is that most true server equipment can get away with undersized power supplies because the controller cards support staggered spin-up to significantly lower the power draw * (which unRaid cannot take advantage of). If you only have problems when a parity check starts, then this would be my first suspect.

 

* assumes the drives are spinning 24/7

 

3 hours ago, mfwade said:

At idle with no drives they are running @ 40watts total.

Presumably, you're measuring this at the wall.

3 hours ago, mfwade said:

1150watt x 2 power supplies.

One of them is presumably redundant. But have you checked that one supply can manage the amperage for the drives' startup current, according to its specs, on whatever rail the drives are utilizing?

Link to comment

I would tend to agree with you on some server-type gear and the staggered spin-up of drives; however, in the storage world (Pure, NetApp, EMC, etc.), all drives are on all of the time. In this case they are all flash drives, so the overall power draw is significantly lower than that of a tray with spinners.

 

Power draw:

 

All DAE trays contain either dual 850 watt or dual 1150 watt power supplies.

The server contains dual 750watt power supplies currently configured for high efficiency - balanced mode.

 

Total combined usage detailed below.

 

Server (0 drives): In use - 268 watts

DAE Tray 1 (20 drives): Idle - 32 watts, In use - 161 watts

DAE Tray 2 (14 drives): Idle - 32 watts, In use - 118 watts

DAE Tray 3 (0 drives): Idle - 32 watts
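As a rough sanity check, the figures above can be put against the rating of a single supply in each chassis. This is only a sketch: it assumes the worst case for the trays (the 850 watt supplies), assumes one supply carries the whole load, and compares total wattage only.

```python
# In-use draw quoted above, in watts (Tray 3 is empty, so its idle figure is used).
loads_w = {
    "Server": 268,
    "DAE Tray 1": 161,
    "DAE Tray 2": 118,
    "DAE Tray 3": 32,
}

# Rating of ONE supply per chassis, in watts.
# Assumption: every tray has the smaller 850 watt supplies.
single_psu_w = {
    "Server": 750,
    "DAE Tray 1": 850,
    "DAE Tray 2": 850,
    "DAE Tray 3": 850,
}


def utilisation(load_w: float, rating_w: float) -> float:
    """Fraction of a single supply's rating consumed by the measured load."""
    return load_w / rating_w


for name, load in loads_w.items():
    pct = utilisation(load, single_psu_w[name]) * 100
    print(f"{name}: {load} W of {single_psu_w[name]} W ({pct:.1f}%)")

total_w = sum(loads_w.values())
print(f"Total measured draw: {total_w} W")  # 268 + 161 + 118 + 32 = 579
```

This says nothing about the amperage available on the individual rail feeding the drives, which is the question raised earlier; that still has to be checked against the supply's spec sheet.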

 

I don't believe this to be a power supply issue. As stated earlier, the parity check does kick off and does appear to run; however, while it is running I lose all network connectivity to the UNRAID server, and the web interface on the console becomes unresponsive.

 

-MW

Edited by mfwade
Link to comment

I am considering starting over, so to speak, by formatting my USB stick with a fresh UNRAID image. Thoughts? This is really only an exercise to rule out any 'weirdness' that may be going on with the currently installed image. Ultimately, my real concern is a disk failure and, as it currently stands, the inability to run a parity check and/or verify that parity is in fact correct. There are 'green' health indicators; however, erring on the side of caution, I need to ensure the data is intact and can be rebuilt.

 

If I do this, I would really like to preserve the following:

  • NO plugins will be installed - initially
  • Docker container / configuration
  • Virtual Machine configuration and layout
  • Disk layout
  • Shares
  • User names / passwords
  • Cache pool (4 disk BTRFS pool)
    • Appdata resides on cache
    • Domains reside on cache
    • Other shares reside here as well

I don't mind reconfiguring the hostname, network, etc.; it's simple enough to redo, as I have 4 connections set up in failover mode.

 

Is there a document that details the config files that I need to preserve and move over to the newly formatted USB?
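Until someone points to such a document, one cautious approach is to copy the whole config folder off the flash drive before reformatting, so nothing is lost whichever files turn out to matter. A minimal sketch, assuming the settings live under /boot/config (where the go file mentioned earlier sits); the destination path in the usage note and the function name are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_config(source: Path, dest_root: Path) -> Path:
    """Copy the entire source folder into a new timestamped folder under dest_root."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = dest_root / f"flash-config-{stamp}"
    # copytree creates the target (and any missing parents) and fails if it
    # already exists, so an earlier backup is never overwritten.
    shutil.copytree(source, target)
    return target
```

For example, `backup_config(Path("/boot/config"), Path("/mnt/user/backups"))` would leave a dated copy on the array. This does not identify which individual files need to be restored onto the new USB stick.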

 

Anything else I am missing? Should I even go down this path? Other than the inability to run or verify parity, UNRAID appears to be running phenomenally! And, it goes without saying... The new web GUI looks great!

 

-MW

 

 

 

Link to comment

Had some time this morning to work on the UNRAID server. I made one change in the Disk Settings section: Tunable (md_write_method) => Reconstruct Write. I wouldn't expect this to make a difference unless there is an issue with the current parity hash or checksums. So far, so good. Network connectivity to the array has not dropped; however, the web interface is still very sluggish. I am pleased that the parity check is currently running, even if it's at a staggering 44-47 MB/s per drive x qty. 30. I would expect significantly faster speeds from each of the SSDs.
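To put those speeds in perspective, here is a back-of-the-envelope estimate. It assumes decimal units (4 TB = 4 x 10^12 bytes, 1 MB = 10^6 bytes), a constant rate for the whole run, and that all drives are read in parallel, so the duration is set by one full pass over a single 4 TB drive.

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def hours_to_read(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to read one drive end to end at a constant rate."""
    return capacity_bytes / rate_bytes_per_s / 3600


drive_count = 30  # 2 parity + 28 data drives in the array

for rate_mb_s in (44, 47):
    hours = hours_to_read(4 * TB, rate_mb_s * MB)
    aggregate_mb_s = rate_mb_s * drive_count
    print(
        f"{rate_mb_s} MB/s per drive: about {hours:.2f} h per pass, "
        f"{aggregate_mb_s} MB/s across all {drive_count} drives"
    )
```

At these rates a full pass works out to roughly a day, and the combined throughput across the 30 drives is a little over 1.3 GB/s, which may be a more useful number than the per-drive figure when looking for a controller or link bottleneck.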

 

I know there are a few of you ( @johnnie.black is one? ) with all flash arrays. I suspect that most are all SATA (mine is all SAS); however, if you wouldn't mind posting your Disk Settings (or any other relevant settings) so that I can try to incorporate them to speed up my array, I would be sincerely grateful.

 

Thanks again,

 

-MW

Link to comment
26 minutes ago, mfwade said:

I know there are a few of you ( @johnnie.black is one? ) with all flash arrays.

Not currently, but I never needed any specific settings when I did. I'm not seeing any errors in the logs. You could try increasing the tunables, but these settings are more performance related:

 

Settings -> Disk Settings

Tunable (md_num_stripes): 4096
Tunable (md_sync_window): 2048
Tunable (md_sync_thresh): 2000

 

Note that in some cases a lower sync_thresh is needed to help with stalls, but if that's the case they are usually visible in the log.

Link to comment

Thanks Johnnie.Black,

 

I did try those settings with the same result: the server web interface hung and the network interfaces dropped, although the parity check appears to keep running. It seems the only setting that has made any difference is Tunable (md_write_method) => Reconstruct Write.

 

Nonetheless, it is running now, albeit at what I perceive to be slower speeds than I would expect from SSDs, and it should complete in under 24 hours. In the storage world, this same operation (background verify tasks) could take days or weeks to complete, so I guess I should be grateful for what I have :)

 

If @limetech is reading these, I would be more than happy to perform any type of testing you may need on this or another array. I can stand up another one relatively quickly with all SAS drives and a few scattered SATA SSDs and HDDs if you need these as well.

 

Thanks again!

 

-MW

 

 

Link to comment

Well, new information. I'm not sure whether it pertains to the finally running parity check, though. I am seeing the following in the logs.

 

Jun 1 10:55:12 unRAID-1 kernel: CPU: 24 PID: 5156 Comm: unraidd Tainted: G O 4.19.41-Unraid #1
Jun 1 10:55:12 unRAID-1 kernel: Call Trace:
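To see how often this fires and which process is involved, the syslog from the diagnostics zip can be scanned for these headers. A minimal sketch follows; the regular expression is written against the two lines above only, so other kernel messages may be laid out differently, and the function name is hypothetical.

```python
import re
from typing import Dict, List

# Matches the "CPU: .. PID: .. Comm: .." header line shown above.
TRACE_HEADER = re.compile(
    r"kernel: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+)"
)


def summarise_traces(lines: List[str]) -> List[Dict[str, object]]:
    """Return one record per header line, noting whether a Call Trace follows it."""
    events = []
    for index, line in enumerate(lines):
        match = TRACE_HEADER.search(line)
        if match is None:
            continue
        next_line = lines[index + 1] if index + 1 < len(lines) else ""
        events.append(
            {
                "cpu": int(match.group("cpu")),
                "pid": int(match.group("pid")),
                "comm": match.group("comm"),
                "has_call_trace": "Call Trace:" in next_line,
            }
        )
    return events


sample = [
    "Jun 1 10:55:12 unRAID-1 kernel: CPU: 24 PID: 5156 Comm: unraidd Tainted: G O 4.19.41-Unraid #1",
    "Jun 1 10:55:12 unRAID-1 kernel: Call Trace:",
]
for event in summarise_traces(sample):
    print(event)
```

Run over the full log, this shows whether the traces always come from the same process and CPU, which should help when comparing them against the parity check start time.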

 

Could this be a new issue or a symptom of the now-running parity check?

 

Latest diagnostics attached.

 

-MW 

unraid-1-diagnostics-20190601-1459.zip

 

Link to comment
