Multiple disk read error failure



I'm pulling my hair out at this point. Diagnostics.zip attached.
So I had a disk start throwing SMART errors. No big deal, I followed the wiki and swapped it. Everything seemed happy until about halfway through the parity rebuild, when suddenly my parity drive started reporting read errors. So I followed the wiki's parity swap procedure and this time added 2 parity drives. I ran the rebuild again, considering the original failed drive an acceptable loss, and was prepared to move on. It "finished", but then dropped disk 1 and disk 5 from the array and said they had a bunch of read errors. So now I'm in panic mode. I took the array offline and started it in maintenance mode. I ran xfs_repair -v /dev/sdx on both disks; after several hours of just dots scrolling through the terminal, both reported that no secondary superblock could be found. So I started the array minus the 2 trouble disks and attempted to mount disk 1, partition 1, manually. I was able to browse files on the disk and started copying them to disk 2. It got about 3/4 of the way through before MC reported an input/output error, and now I can't manually browse disk 2:
```
root@TPC-Abraham:/mnt/disk2# ls
/bin/ls: cannot open directory '.': Input/output error
```
and the webGUI reports 2018 read errors on the disk. I'm running out of disks to move stuff to here.
Currently working on a solution to back up what data I have left to the cloud.
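
A note on the `xfs_repair` step above: the command was pointed at a whole-disk device (`/dev/sdx`), while the XFS filesystem on an Unraid data disk normally lives on the first partition. A long search for secondary superblocks that finds nothing is what one would expect in that situation, so the result may say little about the state of the data. This is only a possible explanation and is not confirmed anywhere in this thread. The Python sketch below uses hypothetical helper names; it flags whole-disk paths and builds a check-only command against `/dev/mdN`, which is assumed here to be the device that Unraid releases of this era expose for array disk N in maintenance mode (newer releases may name it differently).

```python
import re

# Matches whole-disk device nodes such as /dev/sdx or /dev/sdab,
# but not partitions (/dev/sdx1) or Unraid md devices (/dev/md1).
WHOLE_DISK = re.compile(r"/dev/sd[a-z]+")


def is_whole_disk(path: str) -> bool:
    """Return True if path names a whole disk rather than a partition."""
    return WHOLE_DISK.fullmatch(path) is not None


def suggested_repair_command(disk_number: int, dry_run: bool = True) -> str:
    """Build an xfs_repair command line for Unraid array disk N.

    Assumption: the array is started in maintenance mode and the disk is
    exposed as /dev/mdN. The -n flag is xfs_repair's no-modify mode, so the
    default only checks the filesystem and changes nothing.
    """
    flags = "-nv" if dry_run else "-v"
    return f"xfs_repair {flags} /dev/md{disk_number}"


print(is_whole_disk("/dev/sdx"))
print(suggested_repair_command(1))
```

Running the check-only form first seems the safer design choice when the controller itself is suspect, since a repair that writes through a misbehaving controller could make things worse.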
 

 

tpc-abraham-diagnostics-20191007-1318_1.zip

Edited by Starlord
Quote

### [PREVIOUS LINE REPEATED 11 TIMES] ###
Oct  7 12:28:01 TPC-Abraham sSMTP[11964]: Creating SSL connection to host
Oct  7 12:28:01 TPC-Abraham sSMTP[11964]: SSL connection using TLS_AES_256_GCM_SHA384
Oct  7 12:28:02 TPC-Abraham kernel: r750:[03:00 M0] PM Request failed. Error information 0x80000002
Oct  7 12:28:02 TPC-Abraham kernel: r750:[        ] H2D FIS: 02e48f27 0f000000 00000000 00000000
Oct  7 12:28:02 TPC-Abraham kernel: r750:[03:00 M0] PM Request of state 1 for 0/0 failed
Oct  7 12:28:02 TPC-Abraham kernel: r750:[03:00 00] disk removed (0).
Oct  7 12:28:02 TPC-Abraham kernel: r750:[03:00 01] disk removed (0).
Oct  7 12:28:02 TPC-Abraham kernel: r750:[03:00 02] disk removed (0).
Oct  7 12:28:02 TPC-Abraham kernel: r750:[03:00 03] disk removed (0).
Oct  7 12:28:02 TPC-Abraham kernel: sd 11:0:0:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Oct  7 12:28:02 TPC-Abraham kernel: sd 11:0:0:0: [sdd] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 47 bc 49 50 00 00 00 80 00 00
Oct  7 12:28:02 TPC-Abraham kernel: print_req_error: I/O error, dev sdd, sector 1203521872
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521808
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521816
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521824
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521832
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521840
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521848
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521856
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521864
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521872
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521880
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521888
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521896
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521904
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521912
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521920
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521928
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203522064
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203522072
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203522080
Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203522088

Don't tell me my Rocket 750's biting the dust...
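
The excerpt above is easier to reason about when tallied. Below is a minimal Python sketch, not part of the original post, that counts `md` errors per disk and lists the ports the `r750` driver reported as removed. The regular expressions are assumptions derived only from the lines shown here (the `read` alternative is included by analogy with the `write` lines), and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns modelled on the syslog excerpt; other driver versions may differ.
MD_ERROR = re.compile(r"kernel: md: (disk\d+) (read|write) error, sector=(\d+)")
R750_REMOVED = re.compile(r"kernel: r750:\[(\S+) (\S+)\] disk removed")


def summarize(lines):
    """Count md errors per (disk, kind) and collect r750 removal events."""
    errors = Counter()
    removed = []
    for line in lines:
        match = MD_ERROR.search(line)
        if match:
            errors[(match.group(1), match.group(2))] += 1
            continue
        match = R750_REMOVED.search(line)
        if match:
            # (controller address, port) as printed inside the brackets
            removed.append((match.group(1), match.group(2)))
    return errors, removed


sample = [
    "Oct  7 12:28:02 TPC-Abraham kernel: r750:[03:00 00] disk removed (0).",
    "Oct  7 12:28:02 TPC-Abraham kernel: r750:[03:00 01] disk removed (0).",
    "Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521808",
    "Oct  7 12:28:02 TPC-Abraham kernel: md: disk2 write error, sector=1203521816",
]
errors, removed = summarize(sample)
print(dict(errors))
print(removed)
```

Several ports on the same controller address dropping in the same second, followed by errors on a disk behind it, would point at the controller or its cabling rather than at individual drives, though a tally like this cannot prove it.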


That's terrible! Things that come to mind to check: swap the SATA cables; check whether the power supply has enough wattage or is failing; update or reset the BIOS (because you never know); reseat that Rocket card (very nice, haven't seen one of those before) in a different slot; test that nice 64GB of RAM; make sure you haven't seriously bumped or banged the computer while it was running recently; and disconnect all the drives bar one and try hammering it with reads and writes.
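
The "one drive at a time" read test suggested above could be approximated as follows. This is a rough, read-only Python sketch under stated assumptions: `read_test` is a hypothetical helper, the path you pass (a large file, or a device node such as `/dev/sdX`) is your choice, and reads may be served from the page cache, so it is no substitute for a proper surface scan or a SMART self-test.

```python
import os
import tempfile


def read_test(path, chunk_size=1024 * 1024, max_bytes=None):
    """Read path sequentially and return (bytes_read, error).

    error is None on success, otherwise the OSError that stopped the read
    (for example the "Input/output error" seen earlier in this thread).
    """
    total = 0
    try:
        with open(path, "rb", buffering=0) as handle:
            while max_bytes is None or total < max_bytes:
                want = chunk_size
                if max_bytes is not None:
                    want = min(chunk_size, max_bytes - total)
                chunk = handle.read(want)
                if not chunk:  # end of file or device
                    break
                total += len(chunk)
    except OSError as exc:
        return total, exc
    return total, None


# Self-check against a temporary file so the sketch can be run safely.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (3 * 1024 * 1024 + 5))
demo_bytes, demo_error = read_test(tmp.name)
os.unlink(tmp.name)
print(demo_bytes, demo_error)
```

Because the helper returns the byte offset reached along with the error, repeated runs against the same drive on different ports can show whether a failure follows the drive or the port.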

 

Hopefully there's something more telling in the logs though.

 

BTW your sig says you don't have a parity drive...


Oh my bad, the system in question is different from the one in my sig. It's a 45Drives Storinator, dual Xeon, real fancy. It's a client's machine, not "mine".
 

Quote

Model: 45Drives Storinator
M/B: Supermicro - X10DRL-i
CPU: Intel® Xeon® CPU E5-2620 v4 @ 2.10GHz
HVM: Enabled
IOMMU: Enabled
Cache: 512 kB, 2048 kB, 20480 kB
Memory: 64 GB Multi-bit ECC (max. installable capacity 512 GB)
Kernel: Linux 4.18.20-unRAID x86_64
OpenSSL: 1.1.1a

I need to update the personal rig in my sig too; it's out of date. Will do so now.

Edited by Starlord
BIOS is up to date as of last night. I'll try reseating the SAS cables into the Rocket, but the drives themselves are attached to a backplane and I've already tried moving slots. I'll run a RAM test too, and since this machine is rack-mounted in its own climate-controlled closet, I don't think bumping or moving is an issue. The problem is that this is a production server that handles a site-to-site VPN, so I have to be selective about when I bring it down. It might be a few hours before I report back.


Yeah, I've elected to keep all routing and VPN off my Unraid box for these kinds of reasons. Are you running 6.7.x or 6.6.x? Also, how long has the machine been running well in this config? With no evidence except my own experience, I suspect that read errors can occur on 6.7.x under load due to the bug currently slated for a fix in 6.8. I've had quite a few myself, though nothing to the extent you've experienced. Yours is likely something different, but you could try a downgrade to 6.6.7 to rule that out.

8 minutes ago, Marshalleq said:

Seems like hardware then, or potentially corrupted software. The trouble is replicating it safely and isolating it. I'd be combing the logs first.

The second post has the only relevant bits I was able to spot in the syslog; I'm not seeing anything else out of the ordinary.


Yeah, it does call that card of yours into question. Also, assuming it's behind the UPS and didn't desperately need a reboot after running for two years, potentially the power supply. You can test them, but it's not easy. If it's had a hard reboot and is still doing it, isolating the cause might be more expensive than buying a new set and transferring the drives over.

Most of the errors point to the Rocket 750 controller (P0, P1?). The 45Drives Storinator seems to use many 1-to-5 port multiplier backplanes, so you need to identify exactly how they are connected.


Hi there,

 

So we have a couple of 45Drives servers in our labs that we do testing on from time to time.  I can confirm from a recent test that both 6.7.2 and our 6.8-internal series of builds have a faulty R750 driver.  We have already contacted HighPoint about the problem, but their response was less than thrilling.  In short, they cannot yet give a timeframe for when they will be able to update their driver to fix the issue.  This is a huge frustration for us because they intentionally do not include the driver in the kernel tree, which means that as we update the kernel, if they don't keep their driver code in step, problems like this can occur.

 

All of that said, we did test version 6.6.7 and it worked just fine with the R750 controller.  I also don't believe there are any issues with the Storinator Workstation (AV15) product.  Those use a different (LSI) controller, I believe.

2 minutes ago, jonp said:

Hi there,

 

So we have a couple of 45Drives servers in our labs that we do testing on from time to time.  I can confirm in a recent test that both 6.7.2 and our 6.8-internal series of builds have a faulty R750 driver.  We have already contacted highpoint about the problem but their response was less than thrilling.  In short, they cannot give a timeframe yet on when they will be able to update their driver to fix the issue.  This is a huge frustration for us because they intentionally do not include the driver in the kernel tree, which means that as we update the kernel, if they don't keep their driver code in pace, problems like this can occur.

 

All of that said, we did test version 6.6.7 and it worked just fine with the R750 controller.  I also don't believe there are any issues with the Storinator Workstation (AV15) product.  Those use a different LSI controller I believe.

Oof, that's what I was afraid of. The out-of-tree driver has made using the Rocket 750 outside of Unraid nearly impossible, and I was worried that might be the issue here. I emailed them about it ages ago with no response. Good to know you guys haven't had any luck either. I'll be avoiding RocketRAID products like the plague from now on. IMO that's unacceptable. Thankfully, the 2 LSI controllers I ordered, similar to the ones in the new Storinators, came in today. I'll be swapping them in this afternoon. Will report back with results.


From looking at your logs, though, it looks like you are definitely having controller problems that aren't the same as the driver compatibility issues I mentioned with the 6.7+ series of Unraid OS releases.  E.g.:

 

Quote

Oct  7 10:04:12 TPC-Abraham kernel: r750:[82:00 02] Request failed. Error information 0x1000000
Oct  7 10:04:12 TPC-Abraham kernel: r750:[        ] H2D FIS: 00708227 40000000 00000000 00000000
Oct  7 10:04:12 TPC-Abraham kernel: r750:[        ] D2H FIS: 04514234 40000000 00000000 00000000 (Status 51 Error 4)
Oct  7 10:04:12 TPC-Abraham kernel: r750:[82:00 02] Device state a failed (21) retried 0
Oct  7 10:04:12 TPC-Abraham kernel: r750:[82:00 02] Request failed. Error information 0x1000000
Oct  7 10:04:12 TPC-Abraham kernel: r750:[        ] H2D FIS: 00708227 40000000 00000000 00000000
Oct  7 10:04:12 TPC-Abraham kernel: r750:[        ] D2H FIS: 04514234 40000000 00000000 00000000 (Status 51 Error 4)
Oct  7 10:04:12 TPC-Abraham kernel: r750:[82:00 02] Device state a failed (21) retried 1
Oct  7 10:04:12 TPC-Abraham kernel: r750:[82:00 02] Request failed. Error information 0x1000000
Oct  7 10:04:12 TPC-Abraham kernel: r750:[        ] H2D FIS: 00708227 40000000 00000000 00000000
Oct  7 10:04:12 TPC-Abraham kernel: r750:[        ] D2H FIS: 04514234 40000000 00000000 00000000 (Status 51 Error 4)
Oct  7 10:04:12 TPC-Abraham kernel: r750:[82:00 02] Device state a failed (21) retried 2
Oct  7 10:04:12 TPC-Abraham kernel: r750:[82:00 02] disk failed.

 

I would try reseating the card or pursuing a warranty replacement.  It's one thing if just a single drive is failing (disks do fail), but the kind of issues you're describing look like potential false positives caused by a controller problem.
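
For readers who want to pick apart the `D2H FIS` lines quoted above, here is a small Python sketch that decodes them. It assumes the driver prints the first dword of a Register Device-to-Host FIS with byte 0 in the lowest bits, which matches the log's own "(Status 51 Error 4)" annotation; the function names are hypothetical and only the commonly used register bits are listed.

```python
# Bit names from the ATA status and error registers (common bits only).
# Bit 4 of the status register is historically "device seek complete".
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_bits(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in names.items() if value & bit]


def split_d2h_dword0(dword):
    """Split the first dword of a Register D2H FIS as printed in the log."""
    return {
        "fis_type": dword & 0xFF,        # 0x34 marks a Register D2H FIS
        "pm_port": (dword >> 8) & 0x0F,  # port multiplier port
        "status": (dword >> 16) & 0xFF,  # ATA status register
        "error": (dword >> 24) & 0xFF,   # ATA error register
    }


fis = split_d2h_dword0(0x04514234)
print(hex(fis["fis_type"]), fis["pm_port"])
print(decode_bits(fis["status"], STATUS_BITS))
print(decode_bits(fis["error"], ERROR_BITS))
```

Under those assumptions the quoted line decodes to port multiplier port 2 with the ERR and ABRT bits set, meaning the command was reported as aborted. That on its own does not say whether the drive, the backplane or the controller is at fault.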

  • 2 weeks later...

Update from me on 6.8: so far, the HDD read errors have gone away.  That would be amazing, and preferable to whatever else could have been going on with my disks, cables or controllers.  Normally by now I would have had one or two of them.  Also, the unsafe shutdown count that kept appearing on my brand new enterprise SSD with power loss data protection under 6.7 seems to have stopped increasing.  I'm now quite sure that 6.7 was somehow killing my SSDs.  Again, time will tell whether that problem has gone.  So all positive on 6.8 so far.
