Several Simultaneous Failures

srfnmnk · September 3, 2020

Ok, I hear you. I will look around at the hardware and perhaps upgrade the firmware (but it's been working for years)...

How does one go about recovering the items in lost+found?

trurl · September 4, 2020

27 minutes ago, srfnmnk said:

How does one go about recovering the items in lost+found?

From backups may be the simplest. The problem is those files will have lost anything that tells their name or folder.

How much is there?

JorgeB · September 4, 2020

7 hours ago, srfnmnk said:

but it's been working for years

That's how it normally is before anything breaks; any hardware can go wrong at any time. In this case an LSI firmware update is unlikely to help, but it also won't hurt.

srfnmnk · September 4, 2020

12 hours ago, trurl said:

How much is there?

There are a LOT of folders, but disk 8 is still disabled, so I'm not sure whether a new config and rebuild would help. But to Johnnie's point, something may be going bad hardware-wise. I keep getting different read errors on different drives. As you can see from the image below, today it's disks 5 and 6, but I've seen them elsewhere. Do these read errors mean corrupted data, or that the read itself failed (so maybe not a data error)? If the latter, then I'm guessing this may be happening during the rebuild and causing the issues, and that it may be the intermittent hardware problem Johnnie is alluding to. Man...I have 3 LSI cards in there with dual SAS ports each, haha...hmm...

Below are some example logs from the errors.

Sep 4 04:40:01 pumbaa kernel: md: disk5 read error, sector=10804441552
Sep 4 04:40:01 pumbaa kernel: md: disk6 read error, sector=10804441552

Sep 4 04:40:01 pumbaa kernel: XFS (dm-4): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x21c009460 len 8 error 5
Sep 4 04:40:01 pumbaa kernel: XFS (dm-4): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x21c009460 len 8 error 5
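
A minimal sketch for checking whether read errors like the ones above cluster together: it groups `md` read-error lines by sector and flags sectors reported by more than one disk. The regular expression is written only against the two sample lines shown here, and the function name is hypothetical. Several disks failing at the same sector in the same second is at least consistent with a shared cause (power, controller, cabling) rather than independent drive faults, but that reading is an assumption, not something the log proves.

```python
import re
from collections import defaultdict

# Matches md read-error lines in the form seen in the excerpt above.
MD_READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")

def group_read_errors(log_lines):
    """Map each failing sector to the set of disks that reported it."""
    disks_by_sector = defaultdict(set)
    for line in log_lines:
        match = MD_READ_ERROR.search(line)
        if match:
            disk, sector = match.group(1), int(match.group(2))
            disks_by_sector[sector].add(disk)
    return dict(disks_by_sector)

sample = [
    "Sep 4 04:40:01 pumbaa kernel: md: disk5 read error, sector=10804441552",
    "Sep 4 04:40:01 pumbaa kernel: md: disk6 read error, sector=10804441552",
]

for sector, disks in group_read_errors(sample).items():
    if len(disks) > 1:
        # More than one disk reported the same sector.
        print(f"sector {sector}: {sorted(disks)} failed together")
```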

Michael_P · September 4, 2020

FWIW- I had random drives throwing errors when I had too many on one string to the PSU, different ones each time, so could very well be flaky power

JorgeB · September 4, 2020

Best bet for disk8 is to check if the actual disk is mounting with UD, same as you did before with another disk, if yes just do a new config and re-sync parity (ideally after finding out whatever is causing the read errors, or at least try doing it with a different PSU, controller or cables).

srfnmnk · September 4, 2020

Gotcha, thanks Johnnie. Michael, I hear you on that. I am using two rails from the PSU and it's been fine for many years. That said, the easiest thing to check/replace is the PSU (probably)...I will take a look and do the good ole jiggle everything to ensure everything is seated well.

For the errors on disk 5 / 6 (where I had the read errors), I did a check disk -nv from the UI and it noted that there were valuable metadata changes. So in maintenance mode (because it's grayed out when the array is on) I ran the xfs_repair "check" and it tells me it's not mounted. It is mountable, but in maintenance mode it seems as though I cannot do a repair. Should I enable the array and run xfs_repair -v /dev/mapper/md5 from cmd?

JorgeB · September 4, 2020

42 minutes ago, srfnmnk said:

For the errors on disk 5 / 6 (where I had the read errors), I did a check disk -nv

Don't do that, just reboot and they should go back to normal.

srfnmnk · September 4, 2020

That's true, and it's fine for a while, but then in a day or two I get errors somewhere else. I was wondering if I should try and do a parity check (and repair) and/or xfs_repair on all the disks to see if there's some array error.

trurl · September 4, 2020

What is the current state of your array? Is anything still disabled or unmountable?

JorgeB · September 4, 2020

1 hour ago, srfnmnk said:

I was wondering if I should try and do a parity check (and repair)

I would avoid correcting checks/rebuilds until you find the problem, non correcting checks and/or read checks are fine.

1 hour ago, srfnmnk said:

xfs_repair on all the disks to see if there's some array error.

You can still do that but if there are multiple disk read errors do it after rebooting, so that the filesystem errors are cleared, and possibly xfs_repair is not even needed anymore.

srfnmnk · September 6, 2020

On 9/4/2020 at 12:38 PM, trurl said:

What is the current state of your array?

@trurl still struggling. Below is before a reboot.

After a reboot and array start

Disk 8 is mountable via UD RO

and in fact, all of the files that are in lost+found on the emulated drive are here and accessible. Disk 8 also passes the SMART extended self-test.

trurl · September 6, 2020

On 8/31/2020 at 5:19 AM, JorgeB said:

You are having multiple disk errors, you need to fix that first or it will make any rebuild difficult, looks more like a power/connection problem, if you have it try another PSU and/or another controller, also check/replace all cables.

These sound like the avenues to pursue.

srfnmnk · September 7, 2020

Ok, will work on hardware troubleshooting today.

One more question. I have a suspicion that my parity is inaccurate and we keep trusting it. What would be an approach to invalidate the parity and trust the disks? All the disks have passed a SMART test and seem to have the proper files, but the emulated disks seem to have the wrong information. Is there a way to invalidate the parity and rebuild it from the disks and their data?

Since I have a disabled disk -- I'm thinking I can create a new config and just have the parity rebuild but wanted to confirm.

Thanks.

JorgeB · September 7, 2020

15 minutes ago, srfnmnk said:

I'm thinking I can create a new config and just have the parity rebuild but wanted to confirm.

You can.

On 9/4/2020 at 2:25 PM, JorgeB said:

Best bet for disk8 is to check if the actual disk is mounting with UD, same as you did before with another disk, if yes just do a new config and re-sync parity (ideally after finding out whatever is causing the read errors, or at least try doing it with a different PSU, controller or cables).

srfnmnk · September 7, 2020

Back in June (shortly before I started having these issues), I did add a new component. I ran out of pcie slots on my Gigabyte mobo and thus powered my SAS expander with a molex instead of from the pcie slot. I did a full power review and here's what I have running off a Seasonic Prime PD-1000 Platinum. Seems like I have plenty of headroom but I'm no expert on calculating watts / rail or anything. If anyone has any insights, I'd love to hear if you think there's a better way to rearrange the power. Meanwhile, a new config has been launched, everything is mounted, the array seems healthy and the parity is rebuilding.

I'm not entirely sure how to go about debugging where the power issue may be but I figured revert back the newest changes and start from there...perhaps adding the SAS expander to molex and/or adding the LSI 9201-16e to the pcie caused unstable power...so testing that now...after that, I'm not sure. I did the best I could to use the PSU calculators and it seems as though I have sufficient power but would love input if someone else has experience.

Seasonic Prime PD-1000 Platinum
GIGABYTE X570 AORUS Master
Ryzen 3900
H5-25379-00 SAS 9201-16e - powered via pcie4 (temporarily removed)
Intel RAID (SAS) Expander Card (RES2SV240) -- powered via pcie4 (temporarily moved -- was on molex as of June 2020)
HighPoint RocketRAID 2720SGL 8-Port SAS -- powered via pcie4
GTX - 1660 -- installed on pcie4 powered via 6 pin from PSU
1 Sabrent 1TB Rocket NVMe PCIe M.2 2280
3 samsung 850 pro 500GB
20 7200 HDD

Just had a thought as I was putting this list together. This PSU has power outputs for SATA/IDE/MOLEX, which is what is powering my HDDs and SSDs. To accommodate the new LSI 9201, I moved the SAS expander card to molex (from pcie power). I was powering it from IDE/SATA/Molex, NOT the CPU/PCI-E rails...wondering if I should have been powering the SAS expander from the PCIE/CPU rails instead?? Thoughts?

Michael_P · September 7, 2020

How many drives are hanging off of each molex connector?

3 hours ago, srfnmnk said:

I'm not entirely sure how to go about debugging where the power issue may be but I figured revert back the newest changes and start from there...perhaps adding the SAS expander to molex and/or adding the LSI 9201-16e to the pcie caused unstable power...so testing that now...after that, I'm not sure. I did the best I could to use the PSU calculators and it seems as though I have sufficient power but would love input if someone else has experience.

From my experience:

I have a Norco 24 bay case, 4 drives per backplane and 6 backplanes total. My Power supply had 4 molex connectors, so I split 2 off to get the needed power. Everything worked fine with 14 drives, but if I added a 15th outside of the array and assigned to my WHS VM with my WHS VM running, every parity check would fail random drives for read errors, and then shortly after the drives would show pending sectors. If I stopped the WHS VM before starting the parity check, it would finish no problem.

I was certain it was the Toshiba drives I was using. Every drive was stable as stable could be, as long as I didn't do a parity check with the VM running.

I went probably a few months or so, then I forgot about the parity check, so my VM was still running when it started and I got the email that a drive had dropped. At this point I was ready to stop doing parity checks altogether, and went a couple of months without one, until I stumbled on a thread about shitty splitters and a light bulb went off in my head. It's power. The drives weren't getting enough under load even though the power supply is more than capable of delivering; with more than 4 drives on a connector it was sagging just enough to reset.

I ordered some molex punchdown connectors on ebay, took one of the extra SATA lines that came with the power supply and created another molex string. Haven't had an issue since (knock on wood).

srfnmnk · September 9, 2020

And just like that the array is healthy and parity sync is complete. Wow. I will continue to poke around to see what is causing the power fluctuations. My guess is the molex power to the SAS expander, I will try to dedicate a rail or use the pcie power out from the PSU.

@Michael_P thank you so much for chiming in here! That's great to know and it lines up exactly with my situation. I too have the Norco 20 bay and I believe I am using 3 rails to power the 5 layers but I will double check. It seems that using the PSU SATA power out is causing power sags in the SAS expander linked above. I will check back in when I have more details. I will also post diagram of PSU setup after I get time to get back into the server.

Not entirely sure how to test this other than create new hardware config and then run full parity check to see if there are issues. This is going to be quite a fun ride but hopefully with the new power out from pcie it will just keep working. Anyone else have more ideas on how to test for power sags other than parity check?

@JorgeB -- great insights on the power/hardware issue. I hate that you seem to be right but props to you friend.

srfnmnk · September 9, 2020

Another update

Removed the SAS expander from PCIe power and put it back on MOLEX power from the SATA slots on the PSU.
I have not added the new LSI 9201-16e back in yet.
I had 2 SSDs running on one SATA power out from the PSU and 1 SSD on another. I have strung those together on a single cable, all now going into a single out on the PSU. This was to open a slot on the PSU to try another output.
I removed the MOLEX cable I was using for the SAS expander card (it was a 4x molex power out string) and have replaced it with a 2-output molex string where the first output in the series is connected directly to the PSU.
The new molex has been plugged into a different output port on the PSU in case there was an issue with the PSU out.
All 20 HDDs are on 2 PSU outs. The Norco 20 has a backplane with 5 rows, each powered by a single molex connector. One PSU output powers 3 and the other powers 2.
Running a parity check now to see if I run into any issues. If no issues, I may try to force a rebuild to see if that causes issues...if not, I'll have to surmise that one of the changes resolved the issue.

Additionally, I have pulled the data sheets for the PSU and the SAS expander to determine if there are any strange power requirements / considerations when running from molex. I found nothing unexpected: max power draw is 14.6W at 12V, with no jumpers or anything to switch. The PSU supplies up to 996W on 12V, and 3 SSDs, 20 HDDs, 1 gtx-1060, the Ryzen 3900, and the SAS expander card all running at full tilt "should" not exceed 592W as per Seasonic's calculator.
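
As a rough sanity check on those figures, here is a small sketch that converts the expander's rated draw to amps and compares the calculator's estimate with the 12V capacity. It uses only the numbers quoted above (14.6W, 996W, 592W); the helper name is hypothetical, and a whole-PSU budget like this says nothing about how the load is spread across individual cables.

```python
def amps(watts, volts=12.0):
    """Convert a power draw in watts to current in amps at the given voltage."""
    return watts / volts

EXPANDER_WATTS = 14.6     # max draw from the SAS expander data sheet
PSU_12V_WATTS = 996.0     # 12V capacity from the PSU data sheet
ESTIMATED_LOAD = 592.0    # full-load estimate from Seasonic's calculator

expander_amps = amps(EXPANDER_WATTS)
load_fraction = ESTIMATED_LOAD / PSU_12V_WATTS

print(f"expander draw: {expander_amps:.2f} A at 12 V")
print(f"estimated load: {load_fraction:.1%} of 12 V capacity")
```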

SAS Expander Data Sheet

PSU Data Sheet (1000W)

I realize this is not the right forum to troubleshoot power delivery but I figured I'd keep the main thread here for those that are curious in the future.

Michael_P · September 9, 2020

The total power that the supply can provide isn't as important as the limit on the current the connector/wire can sustain. That's what was tripping me up, too.

srfnmnk · September 9, 2020

4 minutes ago, Michael_P said:

current the connector/wire can sustain

Right, but a 14.6 watt max TDP pcie card pulling from a molex on a dedicated connection...the only thing I can think of is bad wire, bad PSU slot, bad molex connector on card, or just a bad power rail on the card itself. Do you have any other thoughts/ideas on what could cause it? If you look at the SAS Expander datasheet it shows max TDP is 12v / 14.6w

Thanks again @Michael_P


Michael_P · September 9, 2020

If it's the only thing powered on that string, it wouldn't be an issue - but if you add whatever amps it's pulling (~1.2A) to whatever else is connected on that string back to the PSU (drives @ ~1.0 to 1.8A each), it starts to add up
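
To put numbers on that, a minimal sketch that sums the current on one string for a given number of drives sharing it with the expander. The per-device figures are the approximate ones quoted above (~1.2A for the expander, ~1.0 to 1.8A per drive); the function name and the drive counts in the loop are illustrative assumptions, not a description of the actual wiring.

```python
EXPANDER_AMPS = 1.2                      # approximate expander draw
DRIVE_AMPS_MIN, DRIVE_AMPS_MAX = 1.0, 1.8  # approximate per-drive range

def string_current_range(n_drives, with_expander=True):
    """Return (low, high) total amps for one cable string back to the PSU."""
    base = EXPANDER_AMPS if with_expander else 0.0
    return base + n_drives * DRIVE_AMPS_MIN, base + n_drives * DRIVE_AMPS_MAX

for n in (0, 4, 8):
    low, high = string_current_range(n)
    print(f"expander + {n} drives: {low:.1f} A to {high:.1f} A")
```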

Decto · September 9, 2020

Reading with interest.

If I understand correctly you have two molex cables from the PSU: one powers 3x4 drives and one powers 2x4 drives.

As each drive can take at least 1A, possibly more, you have 12A+ on the 3x4 molex, which has 1 active connector.

PCI-E cables are only spec'd at 12.5A (150W) with 3 active connectors.

I would run a third molex as a precaution so you are powering 2,2,1.
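
A quick sketch comparing the current 3,2 layout with the suggested 2,2,1 layout, assuming 4 drives per backplane row and the 1A-per-drive lower bound mentioned above. The 12.5A figure is the PCI-E cable number quoted above and is used here only as a point of comparison; it is not a rating for these molex cables, and the function name is hypothetical.

```python
DRIVES_PER_ROW = 4
AMPS_PER_DRIVE = 1.0      # lower bound; real drives may pull more
REFERENCE_AMPS = 12.5     # PCI-E cable figure, for comparison only

def cable_loads(rows_per_cable):
    """Return the estimated amps carried by each cable for a row layout."""
    return [rows * DRIVES_PER_ROW * AMPS_PER_DRIVE for rows in rows_per_cable]

for layout in ([3, 2], [2, 2, 1]):
    loads = cable_loads(layout)
    worst = max(loads)
    print(f"layout {layout}: loads {loads} A, "
          f"worst cable at {worst / REFERENCE_AMPS:.0%} of reference")
```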

There is a possibility that the issue with the molex-powered expander is sensitivity to minor differential voltages, since it presents a load at the end of one power cable while you have drives at the end of another heavily loaded power cable.

Good luck

Edited September 9, 2020 by Decto

srfnmnk · September 9, 2020

1 hour ago, Decto said:

I would run a third molex as a precaution so you are powering 2,2,1.

I like this idea -- will do it.

Perhaps a few corrections.

I currently have 4 MOLEX/SATA strings from PSU (modular).

1 3X4 HDD (MOLEX --> Backplane)

1 2X4 HDD (MOLEX --> Backplane)

1 3-drive SSD String (SATA Power --> SSD direct)

1 string to expander card directly from modular output on PSU.

Screenshot below shows where all the outputs are coming. I have 1 (bottom left) still open, I will probably run another molex to backplane so that I am powering as suggested 2,2,1 on the backplane.

Parity check is still running, of course, but is at max speed and progressing.

Edited September 9, 2020 by srfnmnk

srfnmnk · September 10, 2020

Absolutely nuts...well, the parity sync finished without an issue given the new power config. I have now booted out a 4TB disk and am rebuilding it. If this succeeds, I'll re-install the LSI 9201-16e and do one more rebuild...

I will say, there's one more possibility. The SAS Expander card, when I had the issue, I had the card installed differently (not in the PCIe slot) since it was being powered by the molex and didn't need the PCIe slot anymore. The way I had it installed in the case made it possible for the pcie pins to make contact with the metal, so...I wrapped the pcie headers of the card with electrical tape to make sure nothing shorted the pins. I'm wondering if somehow the tape I had was allowing static to cause strange behaviors with the card...not sure...just an alternative theory to the power issues.
