Chasing down disabled drive / CRC errors



Hello everyone,

 

I'm chasing down some CRC errors that may or may not be contributing to some weird disk behavior recently. A couple of years ago, I picked up an LSI 9211-8i to increase the number of disks I could connect. I got CRC errors from the beginning, but from what I read on the forums they didn't seem to be a huge issue. Money was tight, so I couldn't afford to replace the card, and since everything seemed to be working I left it alone.

 

Last week disk 4 in my array was disabled after a bunch of read errors. I followed the steps from Unraid to stop the array, remove the disk, start/stop the array, then re-add the disk and let the system rebuild. All seemed ok.

 

Shortly after that, disk 7 went down. I tried the same process, but eventually it wouldn't let me re-add it to the array. Starting to worry now, I ordered new SFF-8087 to SATA cables and a new 4TB disk. I've been planning a new NAS build, so I also got 2x WD Gold 16TB disks as they were on a good sale. The plan so far is to copy all 32TB spread out across the array to these two disks as a local backup.

 

Once everything arrived, I pulled out the non-working drive, replaced the cables (making sure they weren't kinked or jammed up next to each other), and installed the 16TB disks, connected via SATA cables directly to the mobo. I ran a preclear on the 3 new disks; the 4TB passed with no problem and the 16TB disks are still running. I stopped the array, added the new disk in, and spun it back up again.

 

When I did, disk 4 started throwing a bunch of read errors again and got disabled. So I stopped the array, did the remove/re-add dance like above, and spun it back up. Disk 7 (new disk) went through a very fast parity sync / data rebuild at 3.2GB/s (very odd for a 7200rpm HGST drive) and then showed green in the array. Meanwhile disk 4 is moving through the data rebuild process seemingly fine, yet Fix Common Problems is showing "unable to write to disks 4 & 5 - drive mounted read only or completely full", with disk 5 at 341 million+ read errors and climbing. Disks 4 and 5 are at 92% and 90% capacity respectively. It's also showing /var/log is 100% full, probably due to all the errors.
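For anyone following along, a full /var/log can be confirmed from the console before digging into the disks. This is only a rough sketch: the helper names fs_usage and largest_files are made up for illustration, and it assumes a standard df, du, find, sort and head are available on the server.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of Unraid).

# Print the "use%" column for the filesystem that holds the given path.
# df -P forces the portable one-line-per-filesystem output format.
fs_usage() {
    df -P "$1" | awk 'NR == 2 { print $5 }'
}

# List the largest files under a directory, biggest first (sizes in KiB).
# The second argument is how many entries to show (default 5).
largest_files() {
    find "$1" -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "${2:-5}"
}
```

Running `fs_usage /var/log` prints the fill level, and `largest_files /var/log` shows which log files are taking the space. If the syslog is at the top, the flood of read errors is the likely cause.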

 

To me this points to a faulty HBA card that would just need to be replaced, but is there something else that it could be before I buy one? Diags posted below.

 

 

Build:

i7-960 @ 3.2

12GB DDR3 RAM

Gigabyte G1 Guerrilla mobo

LSI 9211-8i in PCIe x16 (maybe Gen 1?) slot

single parity drive

Antec 1000W PSU with 9 drives connected via the SATA power cables that came with the PSU, and 5 drives connected via Molex to 2x SATA adapters (6 connectors in total, 5 in use). All power cables are original except for the Molex to SATA adapters.

mediavault-diagnostics-20220501-1236.zip

Edited by Arcaeus
Link to comment

5 hours ago, Arcaeus said:

LSI 9211-8i

I don't know quite what answer you are going to get about your question but before you order a new card, read this thread carefully:  

 

       https://forums.unraid.net/topic/102010-recommended-controllers-for-unraid/

 

Realize that LSI is no longer a separate company but is now a part of Broadcom.  It is my understanding that the LSI 9211 series of cards has not been manufactured by them for several years, as they have designed a new chip set.  However, Broadcom is still selling the earlier chip sets.  The original LSI boards have been reverse engineered (including the artwork and logos), and there are one or more companies selling these knock-offs as an LSI product.  So buying 'new' LSI 9211 boards is a bit of a crap shoot...

Link to comment
On 5/2/2022 at 5:16 AM, JorgeB said:
LSISAS2308: FWVersion(20.00.00.00)

 

Known issue with that firmware, update to 20.00.07.00
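A quick way to see which firmware the HBA reports is to pull the FWVersion field out of the kernel log. The sketch below is an illustration only: fw_versions and is_flagged_fw are hypothetical helper names, and the only log format it assumes is the "FWVersion(...)" field quoted above.

```shell
#!/bin/sh
# Hypothetical helpers; the "FWVersion(x.x.x.x)" pattern is taken from the
# log line quoted in this thread.

# Read kernel log text on stdin and print each firmware version found,
# one per line.
fw_versions() {
    sed -n 's/.*FWVersion(\([0-9.]*\)).*/\1/p'
}

# Exit status 0 when the version is the release flagged above as having
# a known issue, non-zero otherwise.
is_flagged_fw() {
    [ "$1" = "20.00.00.00" ]
}
```

Typical use would be `dmesg | fw_versions`, run once before and once after flashing, to confirm the card now reports 20.00.07.00.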

 

Ok did this. Thank you so much for your detailed walkthrough, it was super helpful. I liked and thanked the post, not sure if that does anything for you.

 

I started the array, and disk 5 is showing that it can't be added to the array (see attached). Do I need to let Unraid do a parity sync / data rebuild before bringing that drive back in? I will start the array and see if there are more CRC errors.

disk 5 won't add into the array.png

Link to comment
6 minutes ago, JorgeB said:

Please post current diags, disk5 was enabled in the previous ones.

New diags attached. Array is currently stopped, does it need to be started?

 

6 minutes ago, trurl said:

That screenshot is Unassigned Devices saying the disk is an array disk. You have to assign it as disk5 in the array

The disk was previously in the array but Fix common problems was showing "unable to write to disks 4 & 5 - drive mounted read only or completely full". When I stop the array and attempt to select that disk, I can't choose it from the drop down. Main - array devices (screenshot attached).

 

Main - array devices.png

 

mediavault-diagnostics-20220504-1106.zip

Main - array devices - no disk5 option.png

Edited by Arcaeus
Link to comment
1 minute ago, trurl said:

SY8T (sdi) isn't giving SMART report. Looks like it has disconnected.

So this is interesting as I believe it was connected and working since I pulled out the other drive. Only after stopping the array and assigning the new disk to the array, then starting it up again did it start throwing a bunch of errors.

 

I can shut everything down and confirm that all the cables are connected properly if you think that may be an issue?

 

So far no CRC errors yet, but I don't think anything is really hitting the disks so I'm not sure.

Link to comment

Emulated disk5 is mounted and seems to have a lot of data, but syslog also indicates filesystem corruption.

 

Since you mentioned FCP report, I wonder if you don't have multiple problems, possibly corrupting emulation of disk5. And

5 minutes ago, trurl said:

SY8T (sdi) isn't giving SMART report. Looks like it has disconnected.

Shut down, check all connections at both ends: all disks, SATA and power, including splitters.

 

Then reboot, start the array, and post new diagnostics.

Link to comment
4 minutes ago, trurl said:

Emulated disk5 is mounted and seems to have a lot of data, but syslog also indicates filesystem corruption.

 

Since you mentioned FCP report, I wonder if you don't have multiple problems, possibly corrupting emulation of disk5. And

Shut down, check all connections at both ends: all disks, SATA and power, including splitters.

 

Then reboot, start the array, and post new diagnostics.

Got it. Doing that now.

Link to comment
38 minutes ago, trurl said:

Emulated disk5 is mounted and seems to have a lot of data, but syslog also indicates filesystem corruption.

 

Since you mentioned FCP report, I wonder if you don't have multiple problems, possibly corrupting emulation of disk5. And

Shut down, check all connections at both ends: all disks, SATA and power, including splitters.

 

Then reboot, start the array, and post new diagnostics.

 

Ok, all connections have been checked. SY8T was connected to one of the HBA card's SATA connections, so I swapped that for a direct SATA cable to the motherboard SATA ports. Now SY8T is on a mobo SATA port, and QWW3 (currently disk #8) is on the HBA card SATA connection.

 

When I logged in, SY8T showed 1 CRC error on startup, but I have not seen another one yet. Unassigned Devices shows it as mountable (see attached).
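Since the CRC count is a lifetime counter in the drive's SMART data, what matters after a cable or port swap is whether it keeps climbing. A minimal sketch for reading it is below; crc_count is a made-up helper name, and it assumes the drive reports the usual SMART attribute 199 (UDMA_CRC_Error_Count) in the standard `smartctl -A` table layout.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the raw
# value of SMART attribute 199 (UDMA_CRC_Error_Count).
# Prints nothing if the drive does not report that attribute.
crc_count() {
    # In the attribute table, column 1 is the ID and column 10 the raw value.
    awk '$1 == 199 { print $10 }'
}
```

For example, `smartctl -A /dev/sdi | crc_count` (sdi being the device letter mentioned earlier for SY8T, which may change between boots) can be noted down now and compared again after some disk activity.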

 

New diags are attached as well.

SY8T mountable.png

mediavault-diagnostics-20220504-1156.zip

Edited by Arcaeus
Link to comment
5 minutes ago, JorgeB said:

Sorry, actually it's both, I saw fs corruption in the log and assumed it was the emulated disk, but both disks 4 and emulated disk5 are showing the same.

Ran the filesystem check on disk 4, the output was pretty long so it's attached in a .txt file. 

 

Attempted to run the same process on the emulated disk 5, but clicking the "Check" button does nothing. 

disk 4 (FAYG) filesystem check .txt

Edited by Arcaeus
Link to comment
3 minutes ago, Arcaeus said:

Ran the filesystem check on disk 4, the output was pretty long so it's attached in a .txt file. 

Your only option is to fix it, run it again without -n

 

4 minutes ago, Arcaeus said:

Attempted to run the same process on the emulated disk 5, but clicking the "Check" button does nothing. 

That's strange, don't remember seeing that issue before, but you can always run it manually:

 

xfs_repair -v /dev/md5
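To make the difference between the two runs explicit: with -n, xfs_repair only reports problems and changes nothing, and without it the filesystem is actually repaired. The sketch below just prints the command for a given array slot rather than running it, so it can be reviewed first. The helper name xfs_cmd is hypothetical, and the /dev/mdN naming simply follows the command above.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the xfs_repair command for an
# array disk number. Device naming /dev/md<N> follows the post above.
xfs_cmd() {
    disk="$1"
    mode="$2"
    case "$mode" in
        check)  echo "xfs_repair -n /dev/md${disk}" ;;  # -n: no-modify, report only
        repair) echo "xfs_repair -v /dev/md${disk}" ;;  # -v: verbose, applies fixes
        *)      echo "usage: xfs_cmd <disk-number> check|repair" >&2; return 1 ;;
    esac
}
```

So `xfs_cmd 4 check` prints the read-only check for disk 4, and `xfs_cmd 5 repair` prints the same command given above for the emulated disk 5.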

 

Link to comment
10 minutes ago, JorgeB said:

Your only option is to fix it, run it again without -n

 

That's strange, don't remember seeing that issue before, but you can always run it manually:

 

xfs_repair -v /dev/md5

 

Ok, both have been run. Disk 4 is sitting on this as the last message:

 

resetting inode 2153387389 nlinks from 4 to 3
resetting inode 2153448744 nlinks from 3 to 2
resetting inode 2153448745 nlinks from 3 to 2
resetting inode 2153448746 nlinks from 3 to 2
resetting inode 2192650716 nlinks from 15 to 12
resetting inode 2219092096 nlinks from 11 to 9
resetting inode 633253422 nlinks from 27 to 20
done

 

Disk 5 showed this:

resetting inode 2801176 nlinks from 14 to 4
resetting inode 2801177 nlinks from 11 to 4
resetting inode 2801202 nlinks from 18 to 5
resetting inode 2801877 nlinks from 11 to 4
resetting inode 2802586 nlinks from 16 to 5
resetting inode 2802591 nlinks from 14 to 5
resetting inode 2810724 nlinks from 14 to 5
resetting inode 16435645 nlinks from 9 to 3
resetting inode 16435686 nlinks from 17 to 4
resetting inode 16458651 nlinks from 13 to 4
resetting inode 16458684 nlinks from 8 to 3
resetting inode 16458687 nlinks from 16 to 5
resetting inode 16460176 nlinks from 8 to 3
Maximum metadata LSN (1056220067:-1244195062) is ahead of log (1:1128485).
Format log to cycle 1056220070.
cache_purge: shake on cache 0x528de0 left 11 nodes!?
cache_purge: shake on cache 0x528de0 left 11 nodes!?
cache_zero_check: refcount is 1, not zero (node=0x150cb80b8e60)
cache_zero_check: refcount is 1, not zero (node=0x150cc8004810)
cache_zero_check: refcount is 1, not zero (node=0x150cc00cdfa0)
cache_zero_check: refcount is 1, not zero (node=0x150cb80b5410)
cache_zero_check: refcount is 1, not zero (node=0x150cb80a6e60)
cache_zero_check: refcount is 1, not zero (node=0x150cb409ef80)
cache_zero_check: refcount is 1, not zero (node=0x150cc8005c10)
cache_zero_check: refcount is 1, not zero (node=0x150cb40832e0)
cache_zero_check: refcount is 1, not zero (node=0x150cb8000ea0)
cache_zero_check: refcount is 1, not zero (node=0x150cb80ae1d0)
cache_zero_check: refcount is 1, not zero (node=0x150cc8002450)

        XFS_REPAIR Summary    Wed May  4 12:32:08 2022

Phase           Start           End             Duration
Phase 1:        05/04 12:29:38  05/04 12:29:39  1 second
Phase 2:        05/04 12:29:39  05/04 12:29:40  1 second
Phase 3:        05/04 12:29:40  05/04 12:29:46  6 seconds
Phase 4:        05/04 12:29:46  05/04 12:29:46
Phase 5:        05/04 12:29:46  05/04 12:29:47  1 second
Phase 6:        05/04 12:29:47  05/04 12:29:47
Phase 7:        05/04 12:29:47  05/04 12:29:47

Total run time: 9 seconds
done
root@MediaVault:~# xfs_repair -v /dev/md5
Phase 1 - find and verify superblock...
        - block cache size set to 542384 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 0 tail block 0
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 2
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...

        XFS_REPAIR Summary    Wed May  4 12:33:14 2022

Phase           Start           End             Duration
Phase 1:        05/04 12:33:12  05/04 12:33:13  1 second
Phase 2:        05/04 12:33:13  05/04 12:33:13
Phase 3:        05/04 12:33:13  05/04 12:33:14  1 second
Phase 4:        05/04 12:33:14  05/04 12:33:14
Phase 5:        05/04 12:33:14  05/04 12:33:14
Phase 6:        05/04 12:33:14  05/04 12:33:14
Phase 7:        05/04 12:33:14  05/04 12:33:14

Total run time: 2 seconds
done

 

Does that look right?

Link to comment
7 minutes ago, JorgeB said:

Yep, check for a lost+found folder on those disks. If there are a lot of lost files on the emulated disk5, don't rebuild on top of the old disk.

 

And to do that, I'm assuming I need to stop the array and restart it normally so the disks mount, correct? Or should I be looking somewhere else?

 

What is a lot? Is there a certain number I should be looking for?
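One way to put a number on it is to count what xfs_repair moved into lost+found on each disk once the array is started normally. This is only a sketch: lost_found_count is a made-up helper name, and the /mnt/disk4 and /mnt/disk5 mount points are an assumption based on the usual Unraid layout. The thread does not give a threshold for "a lot", so the count is just something to report back.

```shell
#!/bin/sh
# Hypothetical helper: print how many entries sit under <mount>/lost+found.
# Prints 0 when the folder does not exist (nothing was orphaned).
lost_found_count() {
    dir="$1/lost+found"
    if [ -d "$dir" ]; then
        # -mindepth 1 skips the lost+found folder itself.
        find "$dir" -mindepth 1 | wc -l | tr -d ' '
    else
        echo 0
    fi
}
```

For example, `lost_found_count /mnt/disk4` and `lost_found_count /mnt/disk5` give one number per disk.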

Edited by Arcaeus
Link to comment
