2 drives Unmountable - AGAIN [SOLVED]


Homerr


[Solved]  TLDR - A SATA cable went bad on one of the parity drives.

 

 

I had a previous issue I *just* fixed and now it's back.

 

I had 3 drives drop out of the array per the thread below.  I have 2 parity drives, so I was able to rebuild two from parity; the third had data loss.

 

What I did:

  • I had a new precleared WD 10TB that I put in as Disk 8 (WD10...48LDZ).
  • I then precleared the pulled former Disk 8 (WD10...W43D) and added it as Disk 14; that's running fine now.
  • Disk 11 was ultimately reformatted and returned to the array after some checks per the other thread.

 

Disks 8 and 14 finished rebuilding about a week ago and the array was all good for about 24 hours.  Then a regularly scheduled parity check started Wednesday morning.  I came home to an unresponsive server.  I rebooted and now Disks 8 and 13 are fubared.  Disk 13 had not previously had an issue.  I turned the server off for the last couple of days as I was out of town for the holiday.

 

I can obviously rebuild from parity, but the reliability is seriously in question.

Any ideas on troubleshooting this?

 

I do have another motherboard, CPU, and PSU I could try if it is recommended.  I'm running a pair of Supermicro AOC-SASLP-MV8s that I've used for a couple of years with no issues previously.  I replaced the breakout cables per the previous thread.  All running in an early-gen Norco RPC-2040 4U case.

 

array1.JPG

array2.JPG

unraid-diagnostics-20191130-1651.zip

Edited by Homerr
Link to comment

Diags are after rebooting, but the most likely culprit is the SASLP controller: they are known to drop disks for no reason, and past reliability doesn't mean much since they can start causing problems at any time, usually after an Unraid release upgrade or some other hardware change.

 

Since disks 8 and 13 should be on the same miniSAS cable, it might also be the cable; you can try replacing that first, but IMHO you should replace the controllers with LSI HBAs ASAP.

 

P.S: parity2 needs a new SATA cable.

Link to comment

Ok, thanks for the direction.  I had previously ordered an LSI 9201-16i from eBay seller jiawen2018 in the midst of this and tried it, but it was wonky from the start and only ran at paltry KB/s speeds.  I went back to the SASLP controllers since they had previously worked.

 

But now I just ordered another LSI 9201-16i from Art of Server on eBay, and his pics look slightly different from jiawen2018's cards, so I wonder if the latter was genuine.  I did put in new cables in August when I first attempted to address this, so I'm concerned that one of my backplanes is bad.  I'm going to work out which drives are on which backplanes as well.  If one is bad, then it would seem the ol' Norco might need to be replaced too, as I can't seem to find replacement backplanes online.

Link to comment

I've figured out which backplane each drive is on.  Disks 13 and 8 were on the same backplane and same breakout cable.  I replaced the breakout cable with a different one and also moved the drives to my one unused slot to no avail.

 

Once drives are marked 'Unmountable: No file system', do they stay that way until a reformat and/or rebuild?  Or should they show up healthy again if there are no other issues?

Link to comment
Quote

Once drives are marked 'Unmountable: No file system', do they stay that way until a reformat and/or rebuild?  Or should they show up healthy again if there are no other issues?

The vast majority of the time this means there is file system corruption on the drive, which can usually be fixed by clicking on the drive on the Main tab and using the option in the resulting dialog to check/repair the file system.
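For anyone preferring the console, that GUI button is roughly equivalent to running xfs_repair against the md device while the array is in Maintenance mode. A minimal sketch, with the disk numbers assumed from this thread; the commands are echoed rather than executed so nothing touches a live disk by accident:

```shell
# Maintenance-mode file system check, sketched for the two affected disks.
# Unraid exposes array data disks as /dev/mdN; pointing xfs_repair at the raw
# /dev/sdX member instead would invalidate parity, so always use mdN here.
for n in 8 13; do
    echo "xfs_repair -n /dev/md$n"   # -n = check only, report problems without writing
done
```

Dropping the `-n` makes it actually write repairs, which is what the GUI does when run without the no-modify flag.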

Link to comment

I got the new LSI card from Art of Server and installed it.  I started in Maintenance mode, clicked on disks 8 and 13, and ran xfs_repair without -n on each.

 

It then said it needed to mount the drives to get information.  I stopped Maintenance mode and started the array in normal mode.  I don't know if this is really what it was asking for.  Stopped the array again.

 

Went into Maintenance mode and re-ran xfs_repair.  Its output wasn't definitive about what happened, just 'done'.  Stopped Maintenance mode and started the array as normal.  Disks 8 and 13 are still having issues.

 

Stopped, back to Maintenance mode.  Ran xfs_repair with -L to rebuild the log on each drive.  Stopped Maintenance mode, started the array as normal.  No joy.
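For what it's worth, the order xfs_repair itself recommends when it complains about a dirty log is: dry run first, mount and unmount to replay the log, repair, and only reach for -L as a last resort, since -L discards the log and can orphan recent metadata into lost+found. A sketch of that escalation, with the device and mount point as placeholders and every command echoed rather than run:

```shell
# Escalation order for a disk whose repair run complains about a dirty log.
# DEV is a placeholder -- in Maintenance mode it would be /dev/md8 or /dev/md13.
DEV=/dev/mdN
echo "xfs_repair -n $DEV"                         # 1. dry run: report, change nothing
echo "mount $DEV /mnt/test && umount /mnt/test"   # 2. mount/unmount to replay the log
echo "xfs_repair $DEV"                            # 3. real repair with a clean log
echo "xfs_repair -L $DEV"                         # 4. last resort: zeroes the log,
                                                  #    may lose recent metadata
```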

 

What do I do next?

 

In the jumble of all this, my hot spare is now smaller than the drives it would need to replace.  Can I force an array rebuild on these two drives without reassigning a different drive?

Link to comment

 

 

 

Disk 8:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 4
        - agno = 6
        - agno = 3
        - agno = 5
        - agno = 1
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 2
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (4:2) is ahead of log (1:2).
Format log to cycle 7.
done

 

 

And Disk 13:


Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval -1

fatal error -- Input/output error

 

Link to comment

There's a problem with parity2, and since 2 disks are disabled both can't be emulated correctly:

Dec  7 16:34:55 unRAID kernel: ata6.00: status: { DRDY SENSE ERR }
Dec  7 16:34:55 unRAID kernel: ata6.00: error: { ICRC ABRT }
Dec  7 16:34:55 unRAID kernel: ata6: hard resetting link
Dec  7 16:34:56 unRAID kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec  7 16:34:56 unRAID kernel: ata6.00: configured for UDMA/100
Dec  7 16:34:56 unRAID kernel: sd 6:0:0:0: [sdt] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Dec  7 16:34:56 unRAID kernel: sd 6:0:0:0: [sdt] tag#7 Sense Key : 0xb [current]
Dec  7 16:34:56 unRAID kernel: sd 6:0:0:0: [sdt] tag#7 ASC=0x47 ASCQ=0x0
Dec  7 16:34:56 unRAID kernel: sd 6:0:0:0: [sdt] tag#7 CDB: opcode=0x8a 8a 00 00 00 00 00 47 f9 57 d0 00 00 05 40 00 00
Dec  7 16:34:56 unRAID kernel: print_req_error: I/O error, dev sdt, sector 1207523280
Dec  7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523216
Dec  7 16:34:56 unRAID kernel: ata6: EH complete
Dec  7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523224
Dec  7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523232
Dec  7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523240
Dec  7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523248
Dec  7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523256
Dec  7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523264
Dec  7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523272

Looks more like a connection problem; replace both cables on that disk and post new diags after array start.
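ICRC/ABRT errors like the ones in that log are link-level CRC failures, i.e. corruption on the wire rather than on the platters. SMART attribute 199 (UDMA_CRC_Error_Count) tallies them, so comparing the raw value before and after the cable swap shows whether the new cable actually fixed it; a value that keeps climbing points at the backplane or port instead. A sketch using the /dev/sdt name from the log above (the command is printed, not run):

```shell
# Print the SMART attribute line for UDMA CRC errors on the parity2 device.
# Echoed here so nothing probes a real disk; run the inner command manually.
echo "smartctl -A /dev/sdt | awk '\$1 == 199 {print \$2, \$NF}'"
```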

Link to comment

Been sick this week, just getting back to this.  I bought new SATA cables for all 4 drives running off the mobo, including the 2 parity drives.  I started the array in Maintenance mode and ran xfs_repair with no modifiers.  It said to start the array and then rerun, and to use -L if that didn't work.

 

Here is the diagnostics file with the array back up.  I have not gone back into Maintenance mode yet or rerun anything.

unraid-diagnostics-20191214-2113.zip

Edited by Homerr
Link to comment

I let a check of the array finish while the array was started; it ran all day yesterday and finished without errors.  I'm attaching the diagnostics file from after it completed.

 

Restarted in Maint. mode and here are the xfs_repair outputs.  I reran with no modifier, but did not yet run the -L option.

 

Disk 8:


Phase 1 - find and verify superblock...
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
resetting superblock root inode pointer to 128
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap ino pointer to 129
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary ino pointer to 130
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

 

Disk 13:


Phase 1 - find and verify superblock...
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

 

unraid-diagnostics-20191216-1428.zip

Edited by Homerr
Link to comment
3 minutes ago, Homerr said:

I started the array and the disks now show up with Used and Free space (instead of Unmountable) but are still emulated.

That's expected; once a disk is disabled it needs to be rebuilt.  You can rebuild on top of the old disks, but before doing that make sure the data on the emulated disks looks correct and there are no lost+found folders with data.

Link to comment

Both disks have lost+found folders.  Disk 8 just has 2 folders with files named 132, 133, 134, etc., with no file extensions.  The files are large, multi-GB, probably movies...?
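Side note: lost+found entries are named by inode number, so extensionless multi-GB files can usually be identified from their magic bytes with file(1). A sketch, with the mount point assumed to be the standard Unraid /mnt/diskN path (echoed here; the real output depends on the actual content):

```shell
# Identify extensionless recoveries by magic bytes instead of file names.
# /mnt/disk8 is the assumed Unraid mount point for Disk 8; the two folders
# inside lost+found hold the numbered files, hence the double glob.
echo 'file /mnt/disk8/lost+found/*/*'
```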

 

Disk 13 has more recognizable folders and files.

 

Dumb question, but how do I recover each of them from the lost+found trees?

Link to comment

I would suggest rebuilding to new disks and keeping the old ones; they should mount fine with UD (after changing the UUID), and then you can compare data against the rebuilt disks and copy anything missing.
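On the UUID point: a rebuilt copy and the original end up with the same XFS UUID, so the old disk won't mount alongside it until it gets a new one. xfs_admin can generate a fresh UUID; a sketch with the partition name as a placeholder (printed rather than executed):

```shell
# Give the old (pre-rebuild) disk a fresh XFS UUID so UD will mount it
# alongside the rebuilt copy. /dev/sdX1 is a placeholder partition name.
echo "xfs_admin -U generate /dev/sdX1"
```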

 

Alternatively, do a new config and resync parity, but there's more risk involved since the array will be unprotected if one of the disks is really bad.

Link to comment
