5.0.6 - Disk Rebuild Completed, Disk Redballed in ParChk (no obvious webGUI error message)



Subject says it all: I precleared an external 8TB through 3 cycles, shucked it, and the replace/rebuild process finished late last night. I immediately started the parity check and went to bed. Now the GUI is orange, the disk is redballed with 768 errors, and I can't make heads or tails of the error log.
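For reference, the preclear was run with Joe L.'s preclear script, roughly like this (from memory, so treat it as approximate; /dev/sdX stands in for whatever device name the drive had at the time):

preclear_disk.sh -c 3 /dev/sdX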

 

NOTE: This was a non-correcting parity check, and I still have the original 4TB drive that was in the Disk 10 slot, so I can put it back and restart everything... if the issue is just a disk that precleared 3x and rebuilt fine but died immediately after the rebuild.

 

Apologies for not being up to date on this box's version, but it's my "least featured" box, and needing to upgrade my SAS cards to work with newer versions of Unraid on other boxes kept me from upgrading this one as fast as I did the others. Really hoping that doesn't come back to bite me with a lost disk now...

 

I'm going to leave it powered up and where it was for now in case that helps with resolving this.  Thanks in advance for any help anyone can provide!

 

EDIT: Other weirdness: the GUI gives me the option to stop the array (which I feel like wouldn't be a bad idea to do?) and to Spin Up / Spin Down / Clear Statistics, but no other info.

EDIT 2: Tried running SMART on Disk 10 from the command line; the results say the user capacity is 600 petabytes, which seems a bit high. Logical block size: 774843950 bytes; physical block size: 3099375800 bytes; lowest aligned LBA: 14896.

On the scsiModePageOffset line(s), it says "response length too short, resp_len=47 offset=50 bd_len=46", followed by "Terminate command early due to bad response to IEC mode page."
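In case it helps with diagnosis: from what I've read, bogus capacity/block-size numbers and mode-page complaints like these usually mean smartctl is talking plain SCSI to a SATA drive instead of using SAT passthrough. Forcing the device type might produce a sane report (sdp is my guess at the device based on the log below; adjust as needed):

smartctl -a -d sat /dev/sdp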

 

EDIT 3: It looks like this stretch of the log is where things went wrong, right before the massive run of errors on Disk 10:

 

Jul 29 19:30:06 Tower2 kernel: md: sync done. time=113721sec

Jul 29 19:30:06 Tower2 kernel: md: recovery thread sync completion status: 0

Jul 29 19:34:42 Tower2 kernel: mdcmd (78): check NOCORRECT

Jul 29 19:34:42 Tower2 kernel: md: recovery thread woken up ...

Jul 29 19:34:42 Tower2 kernel: md: recovery thread checking parity...

Jul 29 19:34:42 Tower2 kernel: md: using 1536k window, over a total of 7814026532 blocks.

Jul 29 20:11:26 Tower2 kernel: sd 6:0:2:0: [sdp] command f72d9900 timed out

Jul 29 20:11:27 Tower2 kernel: sd 6:0:2:0: [sdp] command eeacbe40 timed out

Jul 29 20:11:27 Tower2 kernel: sd 6:0:2:0: [sdp] command eeacb3c0 timed out

Jul 29 20:11:27 Tower2 kernel: sd 6:0:2:0: [sdp] command eeacb480 timed out

Jul 29 20:11:27 Tower2 kernel: sd 6:0:2:0: [sdp] command eeacb840 timed out

Jul 29 20:11:27 Tower2 kernel: sd 6:0:2:0: [sdp] command f73059c0 timed out

Jul 29 20:11:27 Tower2 kernel: sas: Enter sas_scsi_recover_host busy: 6 failed: 6

Jul 29 20:11:27 Tower2 kernel: sas: trying to find task 0xf703f900

syslog-5.0.6-postrebuild.txt


From reading other threads, I'm thinking I should reboot so I can run a SMART check on Disk 10, but the fact that this is v5, not v6, has me wondering if something is wrong beyond the standard SASLP/Marvell issues that seem to happen a lot on v6 boxes. I'm also concerned that once I restart to get the SMART report, I may lose the chance to make changes or gather data that might be requested here.

 

I'm going to run a few errands and leave the array on and the box running; if nobody's jumped into this thread to say "wait, don't turn it off yet" by the time I'm back tonight, I'll go ahead with what seems like the standard post-redball procedure for the error messages I searched from my syslogs.
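Before shutting down, I'll also copy the syslog to the flash drive so it survives the reboot; something like this should do it on v5 (the destination filename is just my own naming):

cp /var/log/syslog /boot/syslog-postredball.txt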


OK, stopped the array and shut the system down without issue. Restarted after a few minutes, and the GUI says there's no device at all on Disk 10. I'm thinking I might need to disconnect and reconnect the drive in the hotswap cage for Unraid to detect it again, but I'm going to wait a little while before doing that in case anyone thinks I should try something else first.

 

EDIT: Actually, I may have made a mistake. When I booted back up, the 8TB drive that had just been rebuilt was showing in the italics below the dropdown box, but no drive was available to select in the dropdown itself. I started the array in maintenance mode to see if that would let it recognize the drive that threw all the errors (Disk 10) so I could run a SMART test. It didn't, so I exited maintenance mode and powered down.

I've started the system up again, and the GUI now shows "not installed" under the redballed Disk 10 entry (vs. the previous ST800DM... text). From what I'm reading, I don't think I screwed anything up by going into maintenance mode, but I'm going to hold off on further steps until someone from the community lets me know what's safe to try next.

 

Again, I can go back to the original 4TB drive that the 8TB replaced if necessary. I'm also thinking about putting the 8TB into an enclosure attached to another box and running SMART there.


So I booted back up after reseating the disk in the drive cage, and nothing changed: zero recognition of the disk being present at all.

 

Removed the disk, put it in an eSATA caddy attached to another Unraid box I occasionally use for preclearing, and powered it on. The drive had a very labored spin-up sound and loudly beeped twice almost immediately after starting up. The disk is detectable in the Unraid GUI, but when I try to manually run a SMART report using the disk's device name (sdo, in this case), a "scsi error device not ready" message kills the attempt. SMART reports run fine on everything else.
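If it's useful, I can retry with smartctl's more forgiving options before writing the drive off; these flags exist in stock smartctl, though whether they help on a drive that won't spin up cleanly is another question:

smartctl -a -d sat -T permissive /dev/sdo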

 

Amazingly, this 8TB drive survived 3 full preclear cycles and a full rebuild, then died about 40 minutes into the post-rebuild parity check, based on the timing in my logs.

 

I went out and got another 8TB and am in the process of preclearing it. Despite not successfully finishing the parity check after the successful rebuild, can I just drop the new 8TB where the old 8TB was and run the rebuild process again, or do I need to reuse the old 4TB that was there to return to the "old parity" before attempting to expand that slot again with the 8TB?


It was definitely a non-correcting parity check, as usual, which is what I figured would make it easy to pick up where I left off. Looking at the log I initially posted, there were no obvious errors during the rebuild.

 

Damn glad to hear it - thanks for the help/confirmation!


Well, hell. Nothing like staying up late to complete a rebuild, then starting the parity check immediately after and doing it half-cocked because I'm half-asleep.

 

Hopefully the slew of errors during the correcting check didn't screw me on a rebuild, but if I lost a few things, c'est la data.

1 hour ago, wheel said:

Hopefully the slew of errors during the correcting check didn't screw me on a rebuild

Parity wasn't updated during the check, so no harm done; but in some cases a correcting check is known to wrongly update parity, hence why it should always be non-correcting.

