"Array has 9 disks with read errors"



In the past 24-48 hrs, I've suddenly gone from zero read errors ever to "Array has 9 disks with read errors", and I can't for the life of me figure out why.

What I've changed in the past few days:
 

  1. Created my first VM - Windows Server 2016, with all storage/ISOs/etc. located on an SSD mounted via Unassigned Devices (/mnt/disks/vms - aka not in my regular NAS shares). This has been a wild ride, as the VM underperforms when I throw 16 of my 32 cores at it - I then read up on CPU pinning and it's going better-ish with 8 cores, but that's a discussion for another time.
  2. I enabled disk spin-down, in an attempt to conserve power when disks are idle (I've never used this in the past 2 years of running UnRAID)
  3. A few system reboots for various things, all of which were 100% clean with no issues whatsoever.
  4. Ran a parity check, which found 256 errors, but finished successfully
     

Operationally, I haven't noticed anything wrong with the disks. No failures in Plex, no issues accessing files, etc. The SMART reports all seem okay, just the usual "your disks are old as dirt" errors - they are old, and I 100% expect them to die in time, but I'd expect them not to all die at once. Some of the disks that are throwing errors have 5+ years of power-on hours, some ~2 or less. I have two parity disks just in case things go south, but 9 disks failing will likely break things 😬

 

One thing that seems good is that I haven't seen any disks with Offline Uncorrectable or Pending Sectors in SMART, so the disks aren't *really* dying (in my layman's understanding of SMART).
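
For what it's worth, a minimal sketch of that kind of bulk check - it assumes smartctl is installed, that the array members enumerate as /dev/sd* devices, and that the drives report the common ATA attribute names (vendors vary):

#!/usr/bin/env python3
# Rough sketch: flag drives reporting non-zero reallocated / pending / offline-
# uncorrectable counts. Assumes smartctl is installed, the array members show up
# as /dev/sd? or /dev/sd?? devices, and the drives use the common ATA attribute names.
import glob
import subprocess

WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

for dev in sorted(glob.glob("/dev/sd?") + glob.glob("/dev/sd??")):
    try:
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True, timeout=30).stdout
    except (OSError, subprocess.TimeoutExpired):
        continue
    for line in out.splitlines():
        fields = line.split()
        # smartctl -A rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCH and fields[9] != "0":
            print(dev, fields[1], "raw value:", fields[9])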

 

Diagnostics are attached. Can anyone tell me if I should be overly concerned? Is this some weird transient error? What can I do?

nas01-diagnostics-20180910-1358.zip


 

26 minutes ago, ryanborstelmann said:

Operationally, I haven't noticed anything wrong with the disks. No failures in Plex, no issues accessing files, etc. The SMART reports all seem okay, just the usual "your disks are old as dirt" errors - they are old, and I 100% expect them to die in time, but I'd expect them not to all die at once. Some of the disks that are throwing errors have 5+ years of power-on hours, some ~2 or less. I have two parity disks just in case things go south, but 9 disks failing will likely break things 😬

 

You have more disks than I am inclined to wade through the SMART reports for at this point. Maybe you could tell us which 9.

 

What exactly do you mean by  "your disks are old as dirt" ERRORS?

 

You do know that reliable reconstruction of a disk requires reliably reading every bit of every other disk (minus the number of parity disks)?

 

Are all these problem disks on the same controller? How is your power supply?

Quote

You have more disks than I am inclined to wade through the SMART reports for at this point. Maybe you could tell us which 9.

Disks with errors:

  • Disk10/sdac | 256 errors
  • Disk4/sdu | 6 errors
  • Disk11/sdab | 256 errors
  • Disk13/sdy | 128 errors
  • Disk15/sdaa | 256 errors
  • Disk14/sdt | 6 errors
  • Disk16/sds | 6 errors
  • Disk22/sdad | 256 errors
  • Disk23/sdae | 128 errors
Quote

What exactly do you mean by  "your disks are old as dirt" ERRORS? 

Meaning most of the attributes in my SMART reports show "Pre-fail" and "Old_age" in the "Type" column. I presume it's because many of the disks are 5+ years old - though no sector failures that I can find.

 

Quote

Are all these problem disks on the same controller? How is your power supply? 

There is one M1015 flashed to IT mode (aka LSI SAS2008) handling all these disks across two SAS2 backplanes: SuperMicro BPN-SAS2-846EL & BPN-SAS2-826EL. All of the affected disks are on the 846EL, which currently has 16 disks attached to it. Power is 2x SuperMicro power supplies (I think 1200W each, but I'd have to double-check), each hooked up to its own APC BR1500G UPS for surge & UPS protection.

 


There are no SATA cables, as it's backplane-based - just a SAS cable or two to each backplane. Also, I can try re-seating the controller, but this server has been in the rack for ~2 years now, and I somehow doubt it's a physical seating/cabling issue that just magically started on Saturday. But I'll pull the server from the rack tonight and double-check it all to be sure. I can also order a new M1015 from eBay to be 1000% sure on that front as well.

 

If I reset the stat counters on the dashboard, the errors do come back in time as well.

 

If I were having true read errors, would those show up in a SMART report? Those reports are coming back clean.

 

One final thing that's interesting to me: the syslog shows the exact same sector throwing a read error on several disks. Example:
 

Sep 11 05:51:06 NAS01 kernel: md: disk10 read error, sector=1133863832
Sep 11 05:51:06 NAS01 kernel: md: disk22 read error, sector=1133863832
Sep 11 05:51:06 NAS01 kernel: md: disk23 read error, sector=1133863832
Sep 11 05:51:06 NAS01 kernel: md: disk10 read error, sector=1133863840
Sep 11 05:51:06 NAS01 kernel: md: disk22 read error, sector=1133863840
Sep 11 05:51:06 NAS01 kernel: md: disk23 read error, sector=1133863840
Sep 11 05:51:06 NAS01 kernel: md: disk10 read error, sector=1133863848
Sep 11 05:51:06 NAS01 kernel: md: disk22 read error, sector=1133863848
Sep 11 05:51:06 NAS01 kernel: md: disk23 read error, sector=1133863848
Sep 11 05:51:06 NAS01 kernel: md: disk10 read error, sector=1133863856
Sep 11 05:51:06 NAS01 kernel: md: disk22 read error, sector=1133863856
Sep 11 05:51:06 NAS01 kernel: md: disk23 read error, sector=1133863856

Notice how the sector is the same across three drives. That doesn't seem likely/normal. Thoughts on that?

 

Here is my entire syslog output, grepped for the read errors: https://hastebin.com/yusipoyula.pl
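
In case it's useful to anyone reading along, the per-disk counts can be pulled straight out of that grep - a quick sketch, assuming the log keeps the "md: diskN read error, sector=NNN" format shown above:

#!/usr/bin/env python3
# Sketch: tally unRAID md read errors per disk and show which sectors repeat.
# Assumes the syslog keeps the "md: diskN read error, sector=NNN" format shown above;
# pass the syslog path as the first argument (defaults to /var/log/syslog).
import re
import sys
from collections import Counter

pattern = re.compile(r"md: (disk\d+) read error, sector=(\d+)")
per_disk, per_sector = Counter(), Counter()

with open(sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog") as log:
    for line in log:
        m = pattern.search(line)
        if m:
            per_disk[m.group(1)] += 1
            per_sector[m.group(2)] += 1

print("errors per disk:", dict(per_disk))
print("sectors reported more than once:",
      {s: n for s, n in per_sector.items() if n > 1})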


One other thing I see in syslog right as the read errors are noted is this:
 

Sep 11 05:50:42 NAS01 kernel: sd 8:0:25:0: timing out command, waited 180s
Sep 11 05:50:42 NAS01 kernel: sd 8:0:25:0: [sdac] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 11 05:50:42 NAS01 kernel: sd 8:0:25:0: [sdac] tag#23 Sense Key : 0x2 [current]
Sep 11 05:50:42 NAS01 kernel: sd 8:0:25:0: [sdac] tag#23 ASC=0x4 ASCQ=0x2
Sep 11 05:50:42 NAS01 kernel: sd 8:0:25:0: [sdac] tag#23 CDB: opcode=0x28 28 00 43 95 5f f8 00 04 00 00
Sep 11 05:50:42 NAS01 kernel: print_req_error: I/O error, dev sdac, sector 1133862904
Sep 11 05:50:53 NAS01 kernel: sd 8:0:27:0: timing out command, waited 180s
Sep 11 05:50:53 NAS01 kernel: sd 8:0:27:0: [sdad] tag#15 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 11 05:50:53 NAS01 kernel: sd 8:0:27:0: [sdad] tag#15 Sense Key : 0x2 [current]
Sep 11 05:50:53 NAS01 kernel: sd 8:0:27:0: [sdad] tag#15 ASC=0x4 ASCQ=0x2
Sep 11 05:50:53 NAS01 kernel: sd 8:0:27:0: [sdad] tag#15 CDB: opcode=0x28 28 00 43 95 5f f8 00 04 00 00
Sep 11 05:50:53 NAS01 kernel: print_req_error: I/O error, dev sdad, sector 1133862904
Sep 11 05:51:06 NAS01 kernel: sd 8:0:28:0: timing out command, waited 180s
Sep 11 05:51:06 NAS01 kernel: sd 8:0:28:0: [sdae] tag#57 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 11 05:51:06 NAS01 kernel: sd 8:0:28:0: [sdae] tag#57 Sense Key : 0x2 [current]
Sep 11 05:51:06 NAS01 kernel: sd 8:0:28:0: [sdae] tag#57 ASC=0x4 ASCQ=0x2
Sep 11 05:51:06 NAS01 kernel: sd 8:0:28:0: [sdae] tag#57 CDB: opcode=0x88 88 00 00 00 00 00 43 95 5f f8 00 00 04 00 00 00
Sep 11 05:51:06 NAS01 kernel: print_req_error: I/O error, dev sdae, sector 1133862904

Will try physically re-seating everything tonight and see if the issue persists.

1 hour ago, ryanborstelmann said:

One final thing that's interesting to me: the syslog shows the exact same sector throwing a read error on several disks.

That's why it's not a disk problem - there's no chance several disks have the same bad sector - and why it points to a controller/connection issue.

 

It can also be a failing backplane, failing PSU, etc.

Just now, johnnie.black said:

Forgot to mention: sometimes these things happen one time and only one time, like a fluke, so keep using the server normally. If it happens again in the near future, you know there's a problem and can start troubleshooting. There's a Portuguese saying that roughly translates to "one time is no times".

Good point - that's why I reset the stats (and they reset on reboots too). I noticed the count incremented again this AM (logs above) after I reset it last night. The server had almost zero activity at that time (between 2AM and 6AM or so), which is strange too. So it's definitely continuing to increment.

 

I'll focus on re-seating everything, and might snag a new LSI HBA too. If it's not an issue with the HBA, I can at least have a spare for the day the HBA does fail.

 

Let's hope it's not the backplane, as SAS2 backplanes are relatively expensive ($300 on eBay at a quick glance). I only paid $600 for the server as it is 😀

5 hours ago, ryanborstelmann said:

I somehow doubt it's a physical seating/cabling issue that just magically started on Saturday.

Keep in mind the connections are subject to vibration, heat cycling, and corrosion. Sometimes things degrade slowly but don't show signs of failure until, well, they fail. Reseating the connections is a good troubleshooting step.


Again overnight, at 4:33AM, 6 of the drives all reported the same bad sectors. The only things that happened: CA Backup ran at 3:00 and finished at 3:30, and the drives all spun down at 3:58AM.
 

Full log: https://ghostbin.com/paste/ja7mh

 

Relevant snippet:

Sep 12 04:33:33 NAS01 kernel: sd 7:0:20:0: timing out command, waited 180s
Sep 12 04:33:33 NAS01 kernel: sd 7:0:20:0: [sdw] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 12 04:33:33 NAS01 kernel: sd 7:0:20:0: [sdw] tag#18 Sense Key : 0x2 [current]
Sep 12 04:33:33 NAS01 kernel: sd 7:0:20:0: [sdw] tag#18 ASC=0x4 ASCQ=0x2
Sep 12 04:33:33 NAS01 kernel: sd 7:0:20:0: [sdw] tag#18 CDB: opcode=0x28 28 00 43 95 86 78 00 04 00 00
Sep 12 04:33:33 NAS01 kernel: print_req_error: I/O error, dev sdw, sector 1133872760
Sep 12 04:33:33 NAS01 kernel: sd 7:0:23:0: timing out command, waited 180s
Sep 12 04:33:33 NAS01 kernel: sd 7:0:23:0: [sdz] tag#21 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 12 04:33:33 NAS01 kernel: sd 7:0:23:0: [sdz] tag#21 Sense Key : 0x2 [current]
Sep 12 04:33:33 NAS01 kernel: sd 7:0:23:0: [sdz] tag#21 ASC=0x4 ASCQ=0x2
Sep 12 04:33:33 NAS01 kernel: sd 7:0:23:0: [sdz] tag#21 CDB: opcode=0x88 88 00 00 00 00 00 43 95 86 78 00 00 04 00 00 00
Sep 12 04:33:33 NAS01 kernel: print_req_error: I/O error, dev sdz, sector 1133872760
Sep 12 04:33:33 NAS01 kernel: sd 7:0:24:0: timing out command, waited 180s
Sep 12 04:33:33 NAS01 kernel: sd 7:0:24:0: [sdaa] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 12 04:33:33 NAS01 kernel: sd 7:0:24:0: [sdaa] tag#22 Sense Key : 0x2 [current]
Sep 12 04:33:33 NAS01 kernel: sd 7:0:24:0: [sdaa] tag#22 ASC=0x4 ASCQ=0x2
Sep 12 04:33:33 NAS01 kernel: sd 7:0:24:0: [sdaa] tag#22 CDB: opcode=0x28 28 00 43 95 86 78 00 04 00 00
Sep 12 04:33:33 NAS01 kernel: print_req_error: I/O error, dev sdaa, sector 1133872760
Sep 12 04:33:33 NAS01 kernel: sd 7:0:25:0: timing out command, waited 180s
Sep 12 04:33:33 NAS01 kernel: sd 7:0:25:0: [sdab] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 12 04:33:33 NAS01 kernel: sd 7:0:25:0: [sdab] tag#23 Sense Key : 0x2 [current]
Sep 12 04:33:33 NAS01 kernel: sd 7:0:25:0: [sdab] tag#23 ASC=0x4 ASCQ=0x2
Sep 12 04:33:33 NAS01 kernel: sd 7:0:25:0: [sdab] tag#23 CDB: opcode=0x28 28 00 43 95 86 78 00 04 00 00
Sep 12 04:33:33 NAS01 kernel: print_req_error: I/O error, dev sdab, sector 1133872760
Sep 12 04:33:33 NAS01 kernel: sd 7:0:27:0: timing out command, waited 180s
Sep 12 04:33:33 NAS01 kernel: sd 7:0:27:0: [sdac] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 12 04:33:33 NAS01 kernel: sd 7:0:27:0: [sdac] tag#24 Sense Key : 0x2 [current]
Sep 12 04:33:33 NAS01 kernel: sd 7:0:27:0: [sdac] tag#24 ASC=0x4 ASCQ=0x2
Sep 12 04:33:33 NAS01 kernel: sd 7:0:27:0: [sdac] tag#24 CDB: opcode=0x28 28 00 43 95 86 78 00 04 00 00
Sep 12 04:33:33 NAS01 kernel: print_req_error: I/O error, dev sdac, sector 1133872760
Sep 12 04:33:33 NAS01 kernel: sd 7:0:29:0: timing out command, waited 180s
Sep 12 04:33:33 NAS01 kernel: sd 7:0:29:0: [sdae] tag#25 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 12 04:33:33 NAS01 kernel: sd 7:0:29:0: [sdae] tag#25 Sense Key : 0x2 [current]
Sep 12 04:33:33 NAS01 kernel: sd 7:0:29:0: [sdae] tag#25 ASC=0x4 ASCQ=0x2
Sep 12 04:33:33 NAS01 kernel: sd 7:0:29:0: [sdae] tag#25 CDB: opcode=0x88 88 00 00 00 00 00 43 95 86 78 00 00 04 00 00 00
Sep 12 04:33:33 NAS01 kernel: print_req_error: I/O error, dev sdae, sector 1133872760
Sep 12 04:34:33 NAS01 kernel: md: disk10 read error, sector=1133872696
Sep 12 04:34:33 NAS01 kernel: md: disk11 read error, sector=1133872696
Sep 12 04:34:33 NAS01 kernel: md: disk15 read error, sector=1133872696
Sep 12 04:34:33 NAS01 kernel: md: disk21 read error, sector=1133872696
Sep 12 04:34:33 NAS01 kernel: md: disk22 read error, sector=1133872696
Sep 12 04:34:33 NAS01 kernel: md: disk23 read error, sector=1133872696
Sep 12 04:34:33 NAS01 kernel: md: disk10 read error, sector=1133872704
Sep 12 04:34:33 NAS01 kernel: md: disk11 read error, sector=1133872704
Sep 12 04:34:33 NAS01 kernel: md: disk15 read error, sector=1133872704
Sep 12 04:34:33 NAS01 kernel: md: disk21 read error, sector=1133872704
Sep 12 04:34:33 NAS01 kernel: md: disk22 read error, sector=1133872704
Sep 12 04:34:33 NAS01 kernel: md: disk23 read error, sector=1133872704
Sep 12 04:34:33 NAS01 kernel: md: disk10 read error, sector=1133872712
Sep 12 04:34:33 NAS01 kernel: md: disk11 read error, sector=1133872712
Sep 12 04:34:33 NAS01 kernel: md: disk15 read error, sector=1133872712
Sep 12 04:34:33 NAS01 kernel: md: disk21 read error, sector=1133872712
Sep 12 04:34:33 NAS01 kernel: md: disk22 read error, sector=1133872712
Sep 12 04:34:33 NAS01 kernel: md: disk23 read error, sector=1133872712
Sep 12 04:34:33 NAS01 kernel: md: disk10 read error, sector=1133872720
Sep 12 04:34:33 NAS01 kernel: md: disk11 read error, sector=1133872720
Sep 12 04:34:33 NAS01 kernel: md: disk15 read error, sector=1133872720
Sep 12 04:34:33 NAS01 kernel: md: disk21 read error, sector=1133872720
Sep 12 04:34:33 NAS01 kernel: md: disk22 read error, sector=1133872720
Sep 12 04:34:33 NAS01 kernel: md: disk23 read error, sector=1133872720
Sep 12 04:34:33 NAS01 kernel: md: disk10 read error, sector=1133872728
Sep 12 04:34:33 NAS01 kernel: md: disk11 read error, sector=1133872728
Sep 12 04:34:33 NAS01 kernel: md: disk15 read error, sector=1133872728
Sep 12 04:34:33 NAS01 kernel: md: disk21 read error, sector=1133872728
Sep 12 04:34:33 NAS01 kernel: md: disk22 read error, sector=1133872728
Sep 12 04:34:33 NAS01 kernel: md: disk23 read error, sector=1133872728

Thoughts? My next course of action is to roll back some of the changes I made this past weekend (disable drive spin-down, etc) just in case something isn't happy.

 


They're hardware errors but they make no sense, so I'm looking to roll back the only things that changed just before I started seeing the errors.

 

There's no way the exact same sector on 6 disks failed at once, and there's even less of a chance that it happened 100 times in one second, all on the same sectors. Even a bad cable, bad seating, HBA failure, etc. would show different sectors going bad, or the entire drive going bad. Even if 6 or 9 drives went bad at once, there's about a 0% chance it's all the same sector, and even less of a chance it happens many times.

 

How is UnRAID even looking at the same exact sector on 6 drives at the same time to determine there's a read error? Is there a possibility that this is a kernel bug and it's not actually reporting correctly? How is SMART not seeing any read errors, but the UnRAID kernel is?

 

Sorry for the questions, but this is all so odd and I'd like to wrap my head around it before getting new hardware.

39 minutes ago, ryanborstelmann said:

How is UnRAID even looking at the same exact sector on 6 drives at the same time to determine there's a read error? Is there a possibility that this is a kernel bug and it's not actually reporting correctly? How is SMART not seeing any read errors, but the UnRAID kernel is?

When there's a read error on one disk, unRAID reads the same sector from all the other disks plus parity to calculate the correct data and write it back to the first one - that's why it's reading the same sector on several disks. And now that you know there's definitely a problem, you'll need to start swapping things around to find the cause: different PSU, different backplane (or temporarily connect the disks directly), etc.
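
For the curious, the single-parity case boils down to an XOR across the same sector of every data disk. A toy sketch (not the actual md driver code) of why reconstructing one sector needs that same sector from every other disk:

# Toy illustration (single parity, XOR) of why rebuilding one sector needs the
# same sector from every other disk - not unRAID's actual md driver code.
from functools import reduce

def parity(sectors):
    # XOR the same byte position across every sector passed in.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))

def rebuild(missing_idx, sectors, p):
    # Recover one disk's sector from parity plus ALL the other disks' sectors.
    # If any one of those other reads fails, the reconstruction fails too.
    others = [s for i, s in enumerate(sectors) if i != missing_idx]
    return parity(others + [p])

disks = [bytes([1, 2, 3, 4]), bytes([9, 9, 9, 9]), bytes([0, 7, 0, 7])]
p = parity(disks)
assert rebuild(1, disks, p) == disks[1]  # only works if every other read succeeds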

  • 2 weeks later...

Update: Spent ~10 days with disk spin-down disabled and saw zero read errors. Did a full parity check in there too, with zero errors reported.

 

Decided to turn disk spin-down back on last night, and within 13 minutes, saw the read errors hit: https://hastebin.com/ikejufadok.nginx

 

Disabled it, cleared the stats, and have seen zero errors since. This is clearly tied to disk spin-down, and I can't figure out why it'd be occurring. Can anyone shed some light on this for me?
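
In case it helps with the correlation, one rough way to line spin-down events up against the error timestamps is to poll the drives' power state - a sketch assuming SATA drives that answer hdparm -C, with the device list just an example pulled from the log above:

#!/usr/bin/env python3
# Sketch: log each array drive's power state once a minute so spin-down events
# can be lined up against the read-error timestamps in syslog. Assumes SATA
# drives that answer "hdparm -C" (run as root); the device list is an example
# taken from the log above. hdparm -C uses CHECK POWER MODE, which should not
# wake a sleeping drive.
import subprocess
import time

DEVICES = ["/dev/sdw", "/dev/sdz", "/dev/sdaa", "/dev/sdab", "/dev/sdac", "/dev/sdae"]

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for dev in DEVICES:
        out = subprocess.run(["hdparm", "-C", dev],
                             capture_output=True, text=True).stdout
        lines = out.strip().splitlines()
        state = lines[-1].split(":")[-1].strip() if lines else "unknown"
        print(stamp, dev, state, flush=True)
    time.sleep(60)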

  • 10 months later...
On 9/14/2018 at 9:30 AM, ryanborstelmann said:

Update: Since disabling disk spin-down, I have had 0 read errors. Honestly have no idea what the issue would be - maybe my HBA freaked out or something.

 

Will let it go for a week or two to ensure 100% that it's stable without errors, then might do some poking around to determine what could be the issue. Thanks for the guidance, folks!

 

I just had the same error show up - I think you are correct... it's a spin-up/spin-down error. Under certain conditions, unRAID will read a drive before it is ready and thus cause this error.


According to the SCSI error code guide I found on Oracle's website, the following code in the logs:

ASC=0x4 ASCQ=0x2

Logical Unit Not Ready, Becoming Ready

Indicates that the drive is OK - it is in the process of spinning up, but it is not yet ready.

 

https://docs.oracle.com/cd/E19105-01/storedge.6320/817-5918-10/817-5918-10.pdf

 

I'm seeing the same error on my SuperMicro 847 chassis. 
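
If anyone wants to decode these straight from the log, here's a minimal sketch - the lookup tables are intentionally tiny (just the "not ready" family seen here), with wording taken from the T10 ASC/ASCQ assignment list rather than that PDF:

#!/usr/bin/env python3
# Sketch: decode the Sense Key / ASC / ASCQ fields from kernel lines like the
# ones above. The tables are deliberately incomplete - just the "not ready"
# family relevant here, worded per the T10 ASC/ASCQ assignment list. Reads stdin.
import re
import sys

SENSE_KEYS = {0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
              0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {
    (0x04, 0x00): "Logical unit not ready, cause not reportable",
    (0x04, 0x01): "Logical unit is in process of becoming ready",
    (0x04, 0x02): "Logical unit not ready, initializing command required",
}

key_re = re.compile(r"Sense Key : (0x[0-9a-fA-F]+)")
asc_re = re.compile(r"ASC=(0x[0-9a-fA-F]+) ASCQ=(0x[0-9a-fA-F]+)")

for line in sys.stdin:
    m = key_re.search(line)
    if m:
        print(line.rstrip(), "->", SENSE_KEYS.get(int(m.group(1), 16), "other sense key"))
        continue
    m = asc_re.search(line)
    if m:
        code = (int(m.group(1), 16), int(m.group(2), 16))
        print(line.rstrip(), "->", ASC_ASCQ.get(code, "not in this table - check the full T10 list"))

Piping in something like grep -E 'Sense Key|ASC=' on the syslog gives a quick, readable summary of what the drives were actually complaining about.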

  • 10 months later...

 

On 9/25/2018 at 8:15 PM, ryanborstelmann said:

Will look into that in the coming week or two, and will report back findings. Thanks!

 

On 8/7/2019 at 4:06 PM, coolspot said:

I'm seeing the same error on my SuperMicro 847 chassis. 

 

I don't want to necropost, but I think I might have the same issue. How did you guys solve this?

In the next few days I'm going to update the firmware on my LSI 9305 from P14 to P16. Hopefully that helps.

I'd like to keep the disk spin-down function enabled to save some power.

 

 

