Help - New Build, having some problems.


Solved by JorgeB


I am in the process of upgrading my build.

 

New hardware:

Asus PRIME Z790-V WIFI

Intel i9 12900k

32GB GSkill Ripjaws DDR5

2x  LSI 9201-8i 6Gbps SAS PCI-E HBA P20 IT Mode ZFS

 

Once I got everything hooked up, I couldn't boot from my flash drive, so after some digging I saw that I needed to rename EFI- to EFI on my flash drive.

 

Saved, and I finally got it to boot. During the boot I noticed a lot of "Power-on or device reset occurred" errors in the startup logs. Sure enough, once the array came up, I noticed that a disk was missing. I had two unassigned disks on standby, so I decided to just run a rebuild and sort out the missing disk another day (it's late, I want my array up, and I am tired).

 

Well, after about 5 minutes of waiting on "Array Starting, Mounting in Progress..." it came up, but now one of my parity disks is also in a disabled state. Both disks are on the same HBA, so I think there is an issue with the HBA.

 

But I am also seeing incredibly slow parity speeds, around 1 MB/s. I know the HBAs I have aren't top of the line, but based on my math and research, I thought they would be faster than that.

 

What options do I have? I know I need to check the HBA, but now with potentially a missing parity disk and a missing data disk, I am afraid to do anything. I also don't like the idea of a parity build running for 30 days!

 

I would love some guidance - Diagnostics attached.  

 

Please let me know if there are any questions or more info needed.

nas-diagnostics-20240309-1846.zip
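
For a rough sense of what the 1 MB/s parity speed mentioned above means in practice, here is a minimal sketch of the arithmetic. The 4 TB disk size is a hypothetical placeholder (the thread never states the disk sizes), and the sketch assumes the reported speed is steady and in decimal MB/s.

```python
def rebuild_eta_hours(remaining_bytes: float, speed_mb_per_s: float) -> float:
    """Hours needed to process remaining_bytes at a steady speed (decimal MB/s)."""
    if speed_mb_per_s <= 0:
        raise ValueError("speed must be positive")
    seconds = remaining_bytes / (speed_mb_per_s * 1_000_000)
    return seconds / 3600


# Hypothetical 4 TB parity disk; the real size is not given in the thread.
DISK_BYTES = 4 * 10**12

# Speeds reported at different points in the thread.
for speed in (1, 90, 150):
    hours = rebuild_eta_hours(DISK_BYTES, speed)
    print(f"{speed:>4} MB/s -> {hours:8.1f} h ({hours / 24:.1f} days)")
```

The same function can be checked against any speed and remaining size the GUI reports; a large gap between the two usually means the speed is not steady.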

Link to comment

Looks more like a HBA problem, it keeps resetting:

 

Mar  9 18:40:08 NAS kernel: mpt2sas_cm0 fault info from func: _base_make_ioc_ready

 

Make sure the HBA is well seated and sufficiently cooled, you can also try a different PCIe slot, then post new diags after array start.

Link to comment

Update again - the P2 seek error rates have disappeared. If I check the drive details, it is no longer in a failing state, and the seek error raw value is at 0 (it had climbed to something like 31457). I don't know what that means.

Rebuild speed climbed up to 150 MB/s, which seems about normal. I went to sleep, but it looks like my parity tuning paused it overnight. I resumed it this morning, and it is running at 90 MB/s, due to complete in 4 hours.

My plan is to let the rebuild finish, shut down the array, reseat the HBA and check the cables, then start back up and rebuild P1. I'm going to buy a few new drives to replace the P2 that is acting strangely.

Link to comment
3 minutes ago, JorgeB said:

Looks more like a HBA problem, it keeps resetting:

 

Mar  9 18:40:08 NAS kernel: mpt2sas_cm0 fault info from func: _base_make_ioc_ready

 

Make sure the HBA is well seated and sufficiently cooled, you can also try a different PCIe slot, then post new diags after array start.

 

Thanks, I will do that!

Link to comment

Rebuild finished.

 

After messing around with the cables and the seating of the HBA, I was able to get everything into a usable state, with all drives being recognized.

 

However, tailing the log, I am still seeing the power-on error repeatedly for drive sdm. This drive is currently part of unassigned devices.

 

I am ready to rebuild the disabled parity disk. Should I shut down and pull sdm before I start the parity build?

Latest diagnostics attached. 

 

 

nas-diagnostics-20240310-1347.zip

Link to comment
I initiated the parity build of my P1. I am still seeing a lot of odd stuff in the logs, and I am not sure what it means. Latest diagnostics here.

 

nas-diagnostics-20240310-1424.zip

Link to comment
3 hours ago, JorgeB said:

Looks more like power/connection issues, can also be a weak PSU.

Things went from bad to worse.

 

I switched to brand new cables and pulled the disks that were throwing errors. I restarted and tried to rebuild using a new disk from unassigned.

 

Now both my parity disks are in a bad state, and I am still seeing very slow rebuild times. I do not know how to recover from here at this point... Both my parity disks were plugged directly into the motherboard.

 

image.thumb.png.7f4dd1078078f1018f8b37834e3545e3.png

 

image.thumb.png.8fa3fb58cefe3a539eb1c6878c7cf646.png

nas-diagnostics-20240311-0921.zip

Link to comment
  • Solution
Mar 11 09:22:21 NAS kernel: sd 9:0:0:0: Power-on or device reset occurred
Mar 11 09:22:21 NAS kernel: sd 9:0:6:0: Power-on or device reset occurred
Mar 11 09:22:21 NAS kernel: sd 9:0:1:0: Power-on or device reset occurred
Mar 11 09:22:21 NAS kernel: sd 9:0:2:0: Power-on or device reset occurred
Mar 11 09:22:21 NAS kernel: sd 9:0:5:0: Power-on or device reset occurred
Mar 11 09:22:21 NAS kernel: sd 9:0:3:0: Power-on or device reset occurred
Mar 11 09:22:21 NAS kernel: sd 9:0:4:0: Power-on or device reset occurred

 

Still looks like a power/connection issue.

 

2 minutes ago, dmoney517 said:

I do not know how to recover from here at this point

Unassign both parity drives, start the array, stop the array, reassign both parity drives, and start the array again. Ideally do this after trying to fix the underlying issue, or it will likely happen again.
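
To see where the "Power-on or device reset occurred" messages cluster, the lines can be tallied by SCSI address. This is only a sketch: the sample lines are the ones quoted in this post, and the regular expression assumes the kernel log format shown there. In the `sd H:C:T:L` prefix, the first number is the SCSI host, so resets that all share one host number point at one controller (or its cabling and power) rather than at individual disks.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: sd 9:0:3:0: Power-on or device reset occurred"
RESET_RE = re.compile(
    r"kernel: sd (\d+):(\d+):(\d+):(\d+): Power-on or device reset occurred"
)


def tally_resets(lines):
    """Count reset messages per SCSI host and per full H:C:T:L address."""
    per_host = Counter()
    per_device = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            per_host[int(match.group(1))] += 1
            per_device[":".join(match.groups())] += 1
    return per_host, per_device


# Lines copied from the log excerpt in this post.
sample = [
    "Mar 11 09:22:21 NAS kernel: sd 9:0:0:0: Power-on or device reset occurred",
    "Mar 11 09:22:21 NAS kernel: sd 9:0:6:0: Power-on or device reset occurred",
    "Mar 11 09:22:21 NAS kernel: sd 9:0:1:0: Power-on or device reset occurred",
    "Mar 11 09:22:21 NAS kernel: sd 9:0:2:0: Power-on or device reset occurred",
    "Mar 11 09:22:21 NAS kernel: sd 9:0:5:0: Power-on or device reset occurred",
    "Mar 11 09:22:21 NAS kernel: sd 9:0:3:0: Power-on or device reset occurred",
    "Mar 11 09:22:21 NAS kernel: sd 9:0:4:0: Power-on or device reset occurred",
]

per_host, per_device = tally_resets(sample)
print(per_host)            # every sample line is on host 9
print(sorted(per_device))  # seven distinct devices behind that host
```

To run it against a full log instead of the sample, pass an open file object to `tally_resets`; the path to the syslog depends on the system and is not shown here.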

Link to comment
At this point I am not sure what the issue is... I have reseated the HBAs and switched to all new cables. If the HBAs are bad, I won't get new ones until Wednesday. There is nowhere near me where I can pick them up.

 

I am going to get a new PSU as well. It won't be here until tomorrow. I am currently on a 500 W unit. I assumed it would be enough, as it was powering my previous system with no issues. I am getting a 750 W.

 

The board is brand new. If it is a bad board, I can get it replaced on Thursday, after I test the new HBAs and PSU.

 

I am also going to get new drives in case they are a problem. My simple, straightforward upgrade has become very expensive and a huge headache. I am tempted to put my old board back and return everything.

 

Link to comment
Posted (edited)

Both disks crashed during the parity build...

 

One of them was plugged directly into a SATA port on the motherboard. The other is plugged into an HBA.

 

I feel like the disks are bad?  Could that be the issue?  Or is it more likely power supply / cable related?

 

I'm afraid to do anything now. All my dockers are off and I am not writing anything to the array. If I power down and a drive comes up missing, I'm no longer protected by parity, correct? What is the safest way forward?

 

image.thumb.png.b8a79da5f0ab49d0a170b4592947df43.png

Edited by dmoney517
Link to comment
13 minutes ago, JorgeB said:

Seems unlikely to me that the problem is the disks, but reboot or power cycle to see if they come back online, then post new diags so that we can check SMART for both.

 

Restarted server.  I did not touch any cables or cards.  Did not start array.  Diagnostics attached.

 

I really appreciate your help!  I am in a bit of a panic right now.

 

 

nas-diagnostics-20240311-1107.zip

Link to comment

A 500 W PSU does seem a bit low with that many drives and HBA cards.
Older spinning drives of 4TB and under do take some power. When I ran six 4TB WD Reds, I made sure my PSU was 650 W.
I'm kind of surprised your dual Xeon box didn't have power issues, unless the CPUs just constantly ran in low-power mode, which can happen if your BIOS is not set to turn off C-states and a few other things, depending on what the BIOS lets you do. Glad to hear it sounds like you're up and running now. I would double-check your HBA firmware to make sure it's P20. I've gotten a few of those old 9201 cards that I had to reflash because whoever sold them didn't do it right or just claimed they did. But your log shows that's good. I don't know how to read everything in the logs yet; I'm still learning.
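
A back-of-the-envelope way to sanity-check the PSU sizing discussed above is to add up a worst-case peak load. This is only a sketch: the drive count, per-drive spin-up draw, per-HBA draw, and the "other" figure are all assumptions for illustration, not measurements from this system. The CPU figure is the maximum turbo power Intel lists for the i9-12900K, and the two HBAs come from the hardware list in the first post.

```python
def peak_load_w(n_drives, spinup_w_per_drive, n_hbas, w_per_hba, cpu_peak_w, other_w):
    """Worst-case draw in watts if every drive spins up while the CPU is at peak."""
    return n_drives * spinup_w_per_drive + n_hbas * w_per_hba + cpu_peak_w + other_w


# All values below are assumptions for illustration, not measurements.
N_DRIVES = 12            # placeholder; the thread never states the drive count
SPINUP_W_PER_DRIVE = 25  # rough rule of thumb for a 3.5" drive at spin-up
N_HBAS = 2               # from the hardware list in the first post
W_PER_HBA = 10           # guess; check the card's datasheet
CPU_PEAK_W = 241         # Intel's listed maximum turbo power for the i9-12900K
OTHER_W = 50             # guess for motherboard, RAM, and fans

load = peak_load_w(N_DRIVES, SPINUP_W_PER_DRIVE, N_HBAS, W_PER_HBA, CPU_PEAK_W, OTHER_W)

# Compare the estimate against the old and new PSU ratings from the thread.
for psu_w in (500, 750):
    print(f"{psu_w} W PSU: estimated peak {load} W, headroom {psu_w - load} W")
```

Swapping in the real drive count and datasheet numbers matters more than the exact formula; the point is only that simultaneous spin-up plus a CPU at peak can plausibly exceed a 500 W rating.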

Link to comment
