
Parity Scan was running and now a disk is disabled and multiple disk errors! Help!


Solved by trurl


SETUP ==
This is a new server build, and everything had been running great for the few weeks it has been up.
-DRIVES: 15x ST14000NM001G
-RAID CARDS: 2x ADAPTEC ASR-71605 SFF8643 16 PORT
-LOGIC BOARD: Gigabyte Technology Co., Ltd. Z690 GAMING X DDR4
-POWER: Corsair HX Series, HX1000
-CABLES: Brand new SAS cables
-UNRAID VERSION: 6.12.6

 

CURRENT STATE ==

-SMART CHECK: Good  (I don't suspect any drive issues)
-DISABLED DISK: One of my drives is disabled due to not being able to write to the disk

 

Jan  1 09:21:32 OrigamiNET kernel: md: disk8 write error, sector=7907877512
Jan  1 09:21:32 OrigamiNET kernel: md: disk8 write error, sector=7907877520
Jan  1 09:21:32 OrigamiNET kernel: md: disk8 write error, sector=7907877528
Jan  1 09:21:32 OrigamiNET kernel: md: disk8 write error, sector=7907877536
Jan  1 09:21:32 OrigamiNET kernel: md: disk8 write error, sector=7907877544
Jan  1 09:21:32 OrigamiNET kernel: sd 2:1:6:0: [sdk] tag#871 access beyond end of device
Jan  1 09:21:32 OrigamiNET kernel: md: disk8 write error, sector=7907877552
Jan  1 09:21:32 OrigamiNET kernel: md: disk8 write error, sector=7907877560
Jan  1 09:21:32 OrigamiNET kernel: md: disk8 write error, sector=7907877568
Jan  1 09:21:32 OrigamiNET kernel: md: disk8 write error, sector=7907877576

 

-ERRORS: See the disabled-disk write errors above. I'm also seeing a bunch of RAID card errors, exclusively on one of my two RAID cards

 

Jan  1 09:19:41 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:19:41 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,8,0):
Jan  1 09:19:41 OrigamiNET kernel: sd 2:1:8:0: [sdm] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
Jan  1 09:19:41 OrigamiNET kernel: sd 2:1:8:0: [sdm] 4096-byte physical blocks
Jan  1 09:19:41 OrigamiNET kernel: sdm: sdm1
Jan  1 09:19:49 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:19:49 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,11,0):
Jan  1 09:19:49 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:19:49 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,9,0):
Jan  1 09:19:49 OrigamiNET kernel: sd 2:1:9:0: [sdn] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
Jan  1 09:19:49 OrigamiNET kernel: sd 2:1:9:0: [sdn] 4096-byte physical blocks
Jan  1 09:19:49 OrigamiNET kernel: sdn: sdn1
Jan  1 09:19:53 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:19:53 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,4,0):
Jan  1 09:19:55 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:19:55 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,6,0):
Jan  1 09:19:55 OrigamiNET kernel: sd 2:1:6:0: [sdk] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
Jan  1 09:19:55 OrigamiNET kernel: sd 2:1:6:0: [sdk] 4096-byte physical blocks
Jan  1 09:19:55 OrigamiNET kernel: sdk: sdk1
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,2,0):
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,5,0):
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,2,0):
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,1,0):
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,4,0):
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,10,0):
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,0,0):
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,0,0):
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:03 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,11,0):
Jan  1 09:20:07 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:07 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,7,0):
Jan  1 09:20:10 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:10 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,5,0):
Jan  1 09:20:10 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:10 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,5,0):
Jan  1 09:20:11 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:11 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,8,0):
Jan  1 09:20:16 OrigamiNET crond[1576]: exit status 255 from user root /usr/local/emhttp/plugins/dynamix.day.night/scripts/dynamix.day.night &> /dev/null
Jan  1 09:20:19 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:19 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,9,0):
Jan  1 09:20:20 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:20 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,0,0):
Jan  1 09:20:24 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:24 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,3,0):
Jan  1 09:20:25 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:25 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,6,0):
Jan  1 09:20:26 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:26 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,1,0):
Jan  1 09:20:28 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:28 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,2,0):
Jan  1 09:20:34 OrigamiNET kernel: aacraid: Host adapter abort request.
Jan  1 09:20:34 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,5,0):
Jan  1 09:20:34 OrigamiNET kernel: aacraid: Host bus reset request. SCSI hang ?
Jan  1 09:20:34 OrigamiNET kernel: aacraid 0000:04:00.0: outstanding cmd: midlevel-0
Jan  1 09:20:34 OrigamiNET kernel: aacraid 0000:04:00.0: outstanding cmd: lowlevel-0
Jan  1 09:20:34 OrigamiNET kernel: aacraid 0000:04:00.0: outstanding cmd: error handler-7
Jan  1 09:20:34 OrigamiNET kernel: aacraid 0000:04:00.0: outstanding cmd: firmware-5
Jan  1 09:20:34 OrigamiNET kernel: aacraid 0000:04:00.0: outstanding cmd: kernel-0
Jan  1 09:20:34 OrigamiNET kernel: aacraid 0000:04:00.0: Controller reset type is 3
Jan  1 09:20:34 OrigamiNET kernel: aacraid 0000:04:00.0: Issuing IOP reset
Jan  1 09:21:12 OrigamiNET kernel: aacraid 0000:04:00.0: IOP reset succeeded
Jan  1 09:21:12 OrigamiNET kernel: aacraid: Comm Interface type2 enabled
Jan  1 09:21:16 OrigamiNET crond[1576]: exit status 255 from user root /usr/local/emhttp/plugins/dynamix.day.night/scripts/dynamix.day.night &> /dev/null
Jan  1 09:21:21 OrigamiNET kernel: aacraid 0000:04:00.0: Scheduling bus rescan
Jan  1 09:21:32 OrigamiNET kernel: sdk: detected capacity change from 27344764928 to 0
Jan  1 09:21:32 OrigamiNET kernel: sdn: detected capacity change from 27344764928 to 0
Jan  1 09:21:32 OrigamiNET kernel: sd 2:1:11:0: [sdp] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
Jan  1 09:21:32 OrigamiNET kernel: sd 2:1:11:0: [sdp] 4096-byte physical blocks
Jan  1 09:21:32 OrigamiNET kernel: sdp: sdp1
Jan  1 09:21:32 OrigamiNET kernel: sd 2:1:4:0: [sdi] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
Jan  1 09:21:32 OrigamiNET kernel: sd 2:1:4:0: [sdi] 4096-byte physical blocks
Jan  1 09:21:32 OrigamiNET kernel: sdi: sdi1
Jan  1 09:21:32 OrigamiNET kernel: sd 2:1:3:0: [sdh] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
Jan  1 09:21:32 OrigamiNET kernel: sd 2:1:3:0: [sdh] 4096-byte physical blocks
Jan  1 09:21:32 OrigamiNET kernel: sd 2:1:9:0: [sdn] tag#225 access beyond end of device
Jan  1 09:21:32 OrigamiNET kernel: I/O error, dev sdn, sector 7907873264 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 2
Jan  1 09:21:32 OrigamiNET kernel: md: disk12 read error, sector=7907873200
Jan  1 09:21:32 OrigamiNET kernel: md: disk12 read error, sector=7907873208
Jan  1 09:21:32 OrigamiNET kernel: md: disk12 read error, sector=7907873216
Jan  1 09:21:32 OrigamiNET kernel: md: disk12 read error, sector=7907873224
Jan  1 09:21:32 OrigamiNET kernel: md: disk12 read error, sector=7907873232
Jan  1 09:21:32 OrigamiNET kernel: md: disk12 read error, sector=7907873240
Jan  1 09:21:32 OrigamiNET kernel: md: disk12 read error, sector=7907873248
Jan  1 09:21:32 OrigamiNET kernel: md: disk12 read error, sector=7907873256
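
To check whether the aborts really are confined to one card, the syslog can be tallied per SCSI target. This is only a rough sketch: it assumes the tuple in "Outstanding commands on (2,1,8,0)" is host, channel, id, lun (it matches the "sd 2:1:8:0" lines above), and the function name and sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches lines like: "aacraid: Outstanding commands on (2,1,8,0):"
# Assumption: the four numbers are (host, channel, id, lun).
ABORT_TARGET = re.compile(
    r"aacraid: Outstanding commands on \((\d+),(\d+),(\d+),(\d+)\)"
)

def count_abort_targets(lines):
    """Count how often each SCSI target shows up in aacraid abort messages."""
    counts = Counter()
    for line in lines:
        match = ABORT_TARGET.search(line)
        if match:
            counts[tuple(int(n) for n in match.groups())] += 1
    return counts

# A few lines copied from the log above.
sample = [
    "Jan  1 09:19:41 OrigamiNET kernel: aacraid: Host adapter abort request.",
    "Jan  1 09:19:41 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,8,0):",
    "Jan  1 09:19:49 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,11,0):",
    "Jan  1 09:20:11 OrigamiNET kernel: aacraid: Outstanding commands on (2,1,8,0):",
]

counts = count_abort_targets(sample)
hosts = {target[0] for target in counts}  # distinct SCSI host numbers seen

for target, n in sorted(counts.items()):
    print(target, n)
print("hosts involved:", sorted(hosts))
```

If every target shares the same host number, as in the excerpt above, the problem is likely on that one controller (or its driver) rather than on individual disks.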

 

 

NEXT STEPS ==
I'm not sure what to do next outside of powering down the server and checking the cables and RAID card. 
Is there a way to restore the disabled disk if I feel that the disk is truly OK? I imagine it would just be a rebuild of the disk again, but I'll address that road once I get the RAID card errors sorted.

 

The system has paused my Parity Check at 28% - I'm guessing due to the disabled disk? 
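
As a sanity check, the failing sector can be compared against the disk size reported in the log. This is just back-of-the-envelope arithmetic using numbers from the syslog above; the md sector is treated as if it were an offset on the raw disk, which is only approximately true.

```python
# Values copied from the syslog above.
TOTAL_SECTORS = 27344764928   # "27344764928 512-byte logical blocks"
SECTOR_BYTES = 512
FAILED_SECTOR = 7907877512    # first "disk8 write error" sector

capacity_bytes = TOTAL_SECTORS * SECTOR_BYTES
capacity_tb = capacity_bytes / 10**12    # decimal terabytes
capacity_tib = capacity_bytes / 2**40    # binary tebibytes

# How far into the disk the first write error landed, in percent.
position_pct = 100 * FAILED_SECTOR / TOTAL_SECTORS

print(f"capacity: {capacity_tb:.1f} TB / {capacity_tib:.1f} TiB")
print(f"first write error at about {position_pct:.1f}% of the disk")
```

The capacity agrees with the kernel's "14.0 TB/12.7 TiB", and the error position lands just under 29%, which fits with the errors being hit at the point the parity check had reached rather than at some unrelated spot on the disk.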

I appreciate any assistance you can provide and I imagine everyone is feeling the Y2K24 spirit at the moment from last night :)

 

 

origaminet-diagnostics-20240101-1048.zip

Edited by DaveHavok

I think I may have found the issue - 

https://docs.unraid.net/unraid-os/release-notes/6.12.6/#known-issues

 

Adaptec 7 Series HBA not compatible

If you have an Adaptec 7 Series HBA that uses the aacraid driver, we'd recommend staying on 6.12.4 for now as there is a regression in the driver in the latest kernels. For more information see this bug report in the Linux kernel

 

I imagine the fix is rolling back to 6.12.4, but my concern now is that the parity check got 28% finished and "corrected" 34 errors.

 

QUESTIONS ==

Is there any file damage now due to error correction writes?
Is there a way for me to see what files were corrected so I can review them for issues?

NEXT STEPS ==
1.) Roll UNRAID back to 6.12.4
2.) Rebuild the disabled drive to restore it to normal status (unless there is another way?)
3.) Run the parity check again (is this needed after rebuilding a drive?)

 

 

 

41 minutes ago, trurl said:

Download the zip from the Download page and replace all bz* files in the top folder of the flash drive.

 

Thanks @trurl!  I just found that in the UNRAID docs and have successfully rolled back to 6.12.4
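
For reference, the manual rollback described in the quote boils down to copying the bz* files over the ones on the flash drive. Here is a minimal sketch, assuming the release zip has already been extracted somewhere; the helper name and the example paths are hypothetical, and it is worth backing up the flash drive first.

```python
import shutil
from pathlib import Path

def replace_bz_files(extracted_dir, flash_dir):
    """Copy every top-level bz* file from extracted_dir into flash_dir,
    overwriting files of the same name. Returns the copied file names."""
    copied = []
    for src in sorted(Path(extracted_dir).glob("bz*")):
        if src.is_file():
            shutil.copyfile(src, Path(flash_dir) / src.name)
            copied.append(src.name)
    return copied

# Example call (hypothetical paths, adjust to your own setup):
# replace_bz_files("/tmp/unraid-6.12.4", "/boot")
```

Only files whose names start with bz in the top folder are touched; everything else on the flash drive, including the config folder, is left alone. A reboot is needed afterwards for the older version to load.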

Now to sort out the 2 remaining issues:
1.) Restoring the disabled disk (I suspect it was disabled due to the Linux kernel bug I mentioned above and the disk itself is fine)
2.) Are the 34 "error correction" writes to the parity drive an issue?

 

 

I'm starting the drive rebuild process now to restore the disabled drive onto itself.

Now to research into point 2 above while the rebuild runs.
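
For context on what the rebuild is doing: with single parity, a missing disk's contents can be recomputed by XOR-ing parity with all the other data disks. The snippet below is a toy model with three tiny "disks" of integers, not Unraid's actual implementation, and the function name is made up.

```python
from functools import reduce

def rebuild_disk(surviving_disks, parity_disk):
    """Recompute a missing disk block by block: parity XOR all surviving disks."""
    return [
        reduce(lambda a, b: a ^ b, (disk[i] for disk in surviving_disks), parity_disk[i])
        for i in range(len(parity_disk))
    ]

# Toy array: three data disks and their XOR parity.
disk_a = [1, 2, 3]
disk_b = [4, 5, 6]   # pretend this one is the disabled disk
disk_c = [7, 8, 9]
parity = [a ^ b ^ c for a, b, c in zip(disk_a, disk_b, disk_c)]

rebuilt_b = rebuild_disk([disk_a, disk_c], parity)
print(rebuilt_b)
```

The same calculation is what lets the array present an emulated disk while the real one is disabled, which is why the rebuild is only as good as parity and the other disks were at the time.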

Edited by DaveHavok
9 hours ago, JorgeB said:

You can do a new config and run a correcting check or just rebuild the disk (if the emulated disk looks good)


Happy New Year @JorgeB!

I just finished rebuilding the previously disabled disk and the array is now back to normal operation.
However, I do have a few questions on my next steps / best practice from here that I hope you could help me with?

1.) Now that the array is back to "normal" status, should I run a parity check now?
2.) By "correcting check" - I'm assuming having the checkbox checked for "write corrections to parity"? 
3.) Should I be concerned with the previous 34 errors listed under the Parity Check History when the "Status" shows as canceled? (I'm guessing these errors were not corrected, since the operation was canceled when I rebooted the server to roll back from 6.12.6 to 6.12.4?)


[Attached screenshot: Parity Check History, 2024-01-02]

 

4.) Now that I'm eyeballing this a bit more - that looks like a Read-Check and not a Parity-Check - I assume there is a difference and that nothing was written to the parity disk given the difference in actions?

 

I'm kicking myself for missing that warning about the Adaptec Series 7 RAID cards issue in the Change Notes! 

Thank you!

24 minutes ago, trurl said:

Post new diagnostics


Happy New Year @trurl!

Attached as requested.

Looking at the syslog, I see a ton of lines from the Day & Night plugin flooding it:

exit status 255 from user root /usr/local/emhttp/plugins/dynamix.day.night/scripts/dynamix.day.night &> /dev/null

 

I'll dig into the issue above a bit later, but first I wanted to make sure I'm on the right path: what should I do about kicking off a parity check now that I've downgraded UNRAID from 6.12.6 to 6.12.4 because of the Linux kernel bug affecting my RAID cards?

 

I suspect I have 2 options:
1.) Run the parity check with corrections checked
2.) Run the parity check with corrections unchecked

I'm trying to get a better understanding of those two options and their impact on my scenario.
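
To make the difference between the two options concrete, here is a simplified model of a single-parity check. As I understand it, a non-correcting check only counts mismatches (sync errors), while a correcting check also rewrites the parity block to match the data; the data disks themselves are not modified either way. The function names and the toy data are illustrative only, not Unraid code.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR a sequence of integer blocks together."""
    return reduce(lambda a, b: a ^ b, blocks, 0)

def parity_check(data_disks, parity_disk, correct):
    """Compare stored parity with parity recomputed from the data disks.

    Returns the number of sync errors. When correct is True, mismatching
    parity blocks are overwritten in place; data disks are never touched.
    """
    sync_errors = 0
    for i in range(len(parity_disk)):
        expected = xor_parity(disk[i] for disk in data_disks)
        if parity_disk[i] != expected:
            sync_errors += 1
            if correct:
                parity_disk[i] = expected
    return sync_errors

data = [[1, 2, 3], [4, 5, 6]]
parity = [5, 0, 5]          # middle block is stale; it should be 2 ^ 5 = 7

errors_non_correcting = parity_check(data, parity, correct=False)
parity_after_non_correcting = list(parity)   # still stale

errors_correcting = parity_check(data, parity, correct=True)
errors_after_fix = parity_check(data, parity, correct=False)

print(errors_non_correcting, parity_after_non_correcting)
print(errors_correcting, parity, errors_after_fix)
```

This is why running a non-correcting check first is the cautious choice after a controller problem: if the mismatches came from bad reads rather than bad parity, nothing gets written based on them.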

Thank you for your help and guidance!

origaminet-diagnostics-20240102-1646.zip


Fixed the plugin spamming - easy enough - uninstalled the plugin and then reinstalled it via the Apps tab.
I had my address info entered into the original installation of the plugin, but disabled the plugin - the plugin didn't seem to acknowledge that and just kept spamming.

All fixed now. :)

That just leaves the parity topic to sort out.

16 minutes ago, trurl said:

I would run a non-correcting check just to see if everything is working well. If it doesn't have a lot of sync errors or any other problems then a correcting check.

 

Why do you have 200G docker.img?

 

That's a good idea. I was just hoping to avoid multiple checks and putting unneeded load on the disks with repeated parity scans - especially since it takes about a day to finish a parity check.

As for the docker.img being 200G - great question! I could probably scale it back down to 40GB. I was planning on adding additional docker apps in the near future as I get this new box up and running. 

 

 


Looks like everything has recovered nicely!

Thanks for the help everyone - I'm marking this as solved.

SOLUTION:
1.) Rolled back to unRAID 6.12.4 due to RAID card Linux kernel bug (Adaptec Series 7)
2.) Rebuilt disabled drive on top of itself after verifying it was good
3.) Ran non-corrective parity-check to verify if previously reported errors were false due to the Linux kernel bug
4.) Reinstalled Day & Night plugin to stop syslog spamming (unrelated issue)

 

[Attached screenshot]
