Please advise - First time dealing with "disk disabled"



Hi,

 

Seeking advice on what to do next. I ran mover today while the cache disk was about 92% full, and it looks like this has taken its toll on Disk 3. Mover apparently finished, but afterwards I got an error saying "Disk 3 is disabled, contents are emulated". I have attached the syslog and diagnostics for you to look at.

 

Running:

13x 2.5" Seagate 5TB array drives (dual parity)

1x 2.5" Samsung 500GB cache

2x M1015 HBA cards

 

 

What do you recommend I do next?

 

From what I've read online I have only one option: Reboot & Rebuild (by adding disk 3 again).

 

 

I have run a SMART short self-test on Disk 3 and it passed without any errors, so I don't see why I shouldn't add Disk 3 again. I feel it's not the drive itself. My best guesses are:

1. Cable is bad

2. HBA card is bad

3. Too much I/O at once due to mover (these being 2.5" SMR drives?), which resulted in a temporary glitch. (Gut feeling)
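(For reference, the smartctl equivalents for checking a drive like this would be roughly the following; /dev/sdX is just a placeholder for the actual device:)

smartctl -t short /dev/sdX      # start a short self-test
smartctl -l selftest /dev/sdX   # show the self-test log once it completes
smartctl -A /dev/sdX            # SMART attributes (reallocated sectors, UDMA CRC errors, ...)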

 

 

Your help and advice is much appreciated.

server-ur-syslog-20210618-1650.zip server-ur-diagnostics-20210618-1903.zip


The disk looks OK. The syslog has rotated so I can't see the LSI firmware version; make sure the cards are on 20.00.07.00. Then, if the emulated disk is mounting correctly, rebuild on top. It's a good idea to replace/swap cables/slot to rule that out.

 

 

P.S.: lots of ATA errors on disk 1, likely a power/connection issue.

14 minutes ago, JorgeB said:

The disk looks OK. The syslog has rotated so I can't see the LSI firmware version; make sure the cards are on 20.00.07.00. Then, if the emulated disk is mounting correctly, rebuild on top. It's a good idea to replace/swap cables/slot to rule that out.

 

 

P.S.: lots of ATA errors on disk 1, likely a power/connection issue.

 

Really appreciate the quick reply!

 

So next steps are: Shutdown, inspect & replace/reseat cable*, boot, rebuild.

 

* Thanks for noticing Disk 1. I will double-check the power cables and see which disks are connected to which SAS cable. Perhaps Disk 1 & 3 are running on the same cable. I have a spare SAS cable, so if that's the case I'll be sure to replace it. If not, what would you suggest is best: just connect them to the motherboard with SATA cables, or switch connectors with other drives and keep using the current SAS cable?

 

 

BTW, is there a way to look up the firmware version in Unraid? I started using the HBA cards a couple of months ago and flashed them myself, using the links in https://forums.serverbuilds.net/t/guide-updating-your-lsi-sas-controller-with-a-uefi-motherboard/131. As far as I can tell they point to 20.00.07.00. I'll double-check on reboot, but I'm pretty sure they are running the latest firmware version. I remember that being one of my concerns when flashing ;)

 

 

Just now, Squnkraid said:

Perhaps Disk 1 & 3

No, disk 1 is using the onboard SATA controller, disk 3 is on an LSI.

 

Just now, Squnkraid said:

If not, what would you suggest is best?

Replace the SATA cable for disk 1 and also replace the power cable; if one isn't available, swap the power cable with another disk and see whether the problem follows. For disk 3 you can just swap cables with a different disk on the LSI.

 

2 minutes ago, Squnkraid said:

BTW, is there a way to look up the firmware version in Unraid?

After booting, search the syslog for mpt2sas, then click next until you find the line that mentions the firmware.
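(If you want to grep for it instead of paging through, something along these lines should pull the line out; the exact wording of the driver message can vary a little between versions:)

grep -i mpt2sas /var/log/syslog | grep -i version    # boot messages from the LSI driver include the firmware version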

40 minutes ago, Squnkraid said:

and these being 2.5" SMR drives?

Forgot to mention, this is not a reason for a disk getting disabled. I've been using several SMR disks in some of my servers, including 8 disks like yours (but the 4TB model), and never had any issues other than the sometimes-expected performance penalty.

6 minutes ago, JorgeB said:

No, disk 1 is using the onboard SATA controller, disk 3 is on an LSI.

 

Replace the SATA cable for disk 1 and also replace the power cable; if one isn't available, swap the power cable with another disk and see whether the problem follows. For disk 3 you can just swap cables with a different disk on the LSI.

 

After booting, search the syslog for mpt2sas, then click next until you find the line that mentions the firmware.

 

OK, thank you so much. I know what to do now. And I'll be sure to write down any changes I make. I'll update this thread when I know more.

 

And I totally forgot: about a month and a half ago, Disk 1 started getting UDMA CRC error count errors. Re-seating the cable didn't help, so I went ahead and connected it directly to the motherboard with a SATA cable. That stopped all the errors, but apparently I didn't look hard enough and this cable is bad as well... Just my luck haha

 

6 minutes ago, JorgeB said:

Forgot to mention, this is not a reason for a disk getting disabled. I've been using several SMR disks in some of my servers, including 8 disks like yours (but the 4TB model), and never had any issues other than the sometimes-expected performance penalty.

 

Ah okay, good to know. And yeah, performance isn't great, especially when running mover. But overall they are great little drives, perfect for long term media storage + Unraid.

 

1 hour ago, JorgeB said:

After booting, search the syslog for mpt2sas, then click next until you find the line that mentions the firmware.

 

Changed the cables and booted the server again. Trying to stop the array gave me a "retry unmounting disk share(s)" loop.

The syslog showed a never-ending stream of the same messages:

 

Jun 18 21:45:11 Server-UR emhttpd: shcmd (331): exit status: 32
Jun 18 21:45:11 Server-UR emhttpd: Retry unmounting disk share(s)...
Jun 18 21:45:16 Server-UR emhttpd: Unmounting disks...
Jun 18 21:45:16 Server-UR emhttpd: shcmd (332): umount /mnt/cache
Jun 18 21:45:16 Server-UR root: umount: /mnt/cache: target is busy.
Jun 18 21:45:16 Server-UR emhttpd: shcmd (332): exit status: 32

 

Some googling later, I disabled Docker and rebooted. Now I can stop/start the array again.
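(For the record, and assuming the tools are installed, something like the following should have shown what was keeping /mnt/cache busy, typically a container or a shell sitting inside that path:)

fuser -vm /mnt/cache    # list processes holding the cache mount busy
lsof /mnt/cache         # list open files on that filesystem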

 

But how do I enable Disk 3 again? It's still red and I don't see an option to rebuild. Do I just remove Disk 3 from the array, start/stop the array again and add Disk 3 back?

 

Just making sure I'm doing this correctly.

1 hour ago, Squnkraid said:

But how do I enable Disk 3 again? It's still red and I don't see an option to rebuild. Do I just remove Disk 3 from the array, start/stop the array again and add Disk 3 back?


This is covered in the online documentation, accessible via the ‘Manual’ link at the bottom of the Unraid GUI.

On 6/18/2021 at 11:48 PM, itimpi said:


This is covered in the online documentation, accessible via the ‘Manual’ link at the bottom of the Unraid GUI.

 

Thank you. Always forget there is an entire wiki... Sorry about that.

 

Well, success after ~14 hours! But with a small hiccup.

 

I had swapped the SAS-to-SATA cable connectors between Disk 3 & 4. After that I started the rebuild procedure and noticed that the speeds were pretty low to begin with, around ~80 MB/s. The next day I woke up to an error: the rebuild had aborted after 2.5 hours at 15%, and Disk 3 again had 1024 errors. But SMART was still fine.

 

I re-ordered the power cables (just for the hell of it), unplugged the SAS-to-SATA cable from Disk 3 and connected that disk directly to the motherboard with a new, regular SATA cable. Booted the server and started the rebuild procedure again. The difference was huge: the starting speed was ~135 MB/s, steadily declining over the course of the rebuild, and after ~14 hours it completed with 0 errors at an average speed of 101 MB/s.

 

Conclusions:

Because I switched the SAS-to-SATA cable connectors between Disk 3 & 4, I guess I can rule out a faulty cable.

I don't see power being an issue, because it's all well within spec (and has worked for a long time without problems).

So that leaves the HBA card? Any idea how to find out whether it's bad? It's as if it couldn't handle the traffic to Disk 3. Maybe something to do with SMR/cache?

 

 

Final question:

Can I resume normal operations again, or should I first run another parity check (with or without write corrections)?

9 hours ago, Squnkraid said:

So, that leaves the HBA card?

Possibly, but it would likely affect more disks if that were the case. Did you check the firmware? The only way to be sure would be to test with another one.

 

9 hours ago, Squnkraid said:

Can I resume normal operations again

Should be fine; you can also run a non-correcting check.

1 hour ago, JorgeB said:

Possibly, but it would likely affect more disks if that were the case. Did you check the firmware? The only way to be sure would be to test with another one.

 

Should be fine; you can also run a non-correcting check.

 

Well, I started a non-correcting parity check, with immediate results: Disk 4 emulated, 1024 errors. Great... haha.

 

Disk 4 is on the connector that used to be on Disk 3, so probably the cable is at fault?

 

I have a spare SAS-to-SATA cable. Should I just replace the SAS-to-SATA cable altogether, or first just disconnect Disk 4 and connect it to the motherboard with a regular SATA cable?

 

My 2 cents:

First use a (new) SATA cable to the motherboard

Rebuild Disk 4

Parity check (non-correcting)

Replace the SAS-to-SATA cables, but leave out Disk 4

Parity check (non-correcting)

If that passes, connect Disk 3 too

Parity check (non-correcting)

If that passes: the cable was bad, the HBA is good, all is well again

 

8 minutes ago, Squnkraid said:

Disk 4 is on the connector that used to be on Disk 3, so probably the cable is at fault?

Could be.

 

9 minutes ago, Squnkraid said:

Should I just replace the SAS-to-SATA cable altogether, or first just disconnect Disk 4 and connect it to the motherboard with a regular SATA cable?

Either.


After 14 hours a successful rebuild. Next problem haha

 

Rebooted the server and checked the log to be sure; glad I did. Came across this repeating error:

 

kernel: XFS (dm-2): Metadata corruption detected at xfs_dinode_verify+0xa5/0x52e [xfs], inode 0x84 dinode
kernel: XFS (dm-2): Unmount and run xfs_repair
kernel: XFS (dm-2): First 128 bytes of corrupted metadata buffer:
kernel: 000000002169b579: 49 4e 41 ff 03 02 00 00 00 00 00 63 00 00 00 64  INA........c...d
kernel: 0000000019da1c05: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00  ................
kernel: 0000000072ff9089: 5e c5 9d 52 1f ea ff 25 5e c5 9e b0 07 59 9d 1c  ^..R...%^....Y..
kernel: 00000000be2b61f7: 85 68 fc fa 01 2f 5d 03 70 20 68 52 bc ed c8 04  .h.../].p hR....
kernel: 0000000045f635c4: 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01  ................
kernel: 0000000076bc70c3: 00 00 24 01 00 00 00 00 00 00 00 00 8f ea 0e ac  ..$.............
kernel: 000000003302a279: 69 f3 65 cb 49 98 8f 0f b1 c5 1a 58 1b a5 87 06  i.e.I......X....
kernel: 0000000024cd4b84: 2b 4d 62 2a 25 47 71 7f 30 ad a3 88 d0 ad e3 60  +Mb*%Gq.0......`

 

dm-2 is Disk 3 according to the log. Checking the wiki points to an XFS repair: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui
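(Side note for anyone reading later: to map a dm-N name to an actual device without scrolling through the log, something like one of these should do it:)

ls -l /dev/mapper                       # shows which dm-N each mapped device corresponds to
lsblk -o NAME,KNAME,SIZE,MOUNTPOINT     # tree view; the KNAME column shows the dm-N kernel name under its parent device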

So I went ahead with Check -nV as suggested. The outcome is this:

Phase 1 - find and verify superblock...
        - block cache size set to 1499720 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 221146 tail block 221146
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
bad CRC for inode 132
Bad flags2 set in inode 132
bad CRC for inode 132, would rewrite
Bad flags2 set in inode 132
would fix bad flags2.
directory inode 132 has bad size 8079572436068976644
would have cleared inode 132
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 0
        - agno = 2
bad CRC for inode 132, would rewrite
Bad flags2 set in inode 132
would fix bad flags2.
Would clear next_unlinked in inode 132
directory inode 132 has bad size 8079572436068976644
entry "Movie name" at block 0 offset 128 in directory inode 2147483776 references free inode 132
	would clear inode number in entry at offset 128...
would have cleared inode 132
        - agno = 3
        - agno = 4
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
entry "Movie name" in directory inode 2147483776 points to free inode 132, would junk entry
bad hash table for directory inode 2147483776 (no data entry): would rebuild
        - agno = 2
        - agno = 3
        - agno = 4
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 133, would move to lost+found
disconnected inode 134, would move to lost+found
disconnected inode 135, would move to lost+found
disconnected inode 136, would move to lost+found
disconnected inode 137, would move to lost+found
disconnected inode 138, would move to lost+found
disconnected inode 139, would move to lost+found
disconnected inode 140, would move to lost+found
disconnected inode 141, would move to lost+found
disconnected inode 142, would move to lost+found
disconnected inode 143, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 2147483776 nlinks from 284 to 283
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Mon Jun 21 03:02:29 2021

Phase		Start		End		Duration
Phase 1:	06/21 03:02:23	06/21 03:02:23
Phase 2:	06/21 03:02:23	06/21 03:02:24	1 second
Phase 3:	06/21 03:02:24	06/21 03:02:27	3 seconds
Phase 4:	06/21 03:02:27	06/21 03:02:27
Phase 5:	Skipped
Phase 6:	06/21 03:02:27	06/21 03:02:29	2 seconds
Phase 7:	06/21 03:02:29	06/21 03:02:29

Total run time: 6 seconds

 

Now I'm not sure what to do. The wiki says this:

 


If however issues were found, the display of results will indicate the recommended action to take. Typically, that will involve repeating the command with a specific option, clearly stated, which you will type into the options box (including any hyphens, usually 2 leading hyphens).

Then (if not BTRFS) click the Check button again to run the repair.

 

But I'm not sure how to interpret the outcome. I'm not seeing a clear "do this" message, or maybe I'm just too blind/tired. I tried googling this and found an answer saying "run Check -L", but I don't really understand what that does, so I thought I'd check with you guys first.

 

Run "Check -L" or something else?

 

 

36 minutes ago, JorgeB said:

Run it again without -n or nothing will be done; use -L only if it asks for it.

 

Ugh, yeah, I was too tired haha. I was expecting "-nv" to give me some sort of answer about what to do next.
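(For my own notes: as I understand it, the GUI's Check button essentially runs xfs_repair against the disk's device, so the command-line equivalent would be roughly the following, with the array in maintenance mode; the device name is a placeholder and depends on the disk number and whether the array is encrypted.)

xfs_repair -v /dev/mdX     # repair pass, same as Check without -n
xfs_repair -vL /dev/mdX    # only if xfs_repair refuses to run and explicitly asks for -L (zeroes the log; last resort)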

 

Ran Check

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
bad CRC for inode 132
Bad flags2 set in inode 132
bad CRC for inode 132, will rewrite
Bad flags2 set in inode 132
fixing bad flags2.
directory inode 132 has bad size 8079572436068976644
cleared inode 132
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 3
        - agno = 2
        - agno = 1
        - agno = 0
entry "Movie name" at block 0 offset 128 in directory inode 2147483776 references free inode 132
	clearing inode number in entry at offset 128...
        - agno = 4
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
bad hash table for directory inode 2147483776 (no data entry): rebuilding
rebuilding directory inode 2147483776
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 133, moving to lost+found
disconnected inode 134, moving to lost+found
disconnected inode 135, moving to lost+found
disconnected inode 136, moving to lost+found
disconnected inode 137, moving to lost+found
disconnected inode 138, moving to lost+found
disconnected inode 139, moving to lost+found
disconnected inode 140, moving to lost+found
disconnected inode 141, moving to lost+found
disconnected inode 142, moving to lost+found
disconnected inode 143, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 2147483776 nlinks from 284 to 283
done

 

Nothing left to do? Ran -nv again to see what that would give

 

Phase 1 - find and verify superblock...
        - block cache size set to 1499720 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 221146 tail block 221146
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 4
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Mon Jun 21 12:00:41 2021

Phase		Start		End		Duration
Phase 1:	06/21 12:00:35	06/21 12:00:35
Phase 2:	06/21 12:00:35	06/21 12:00:35
Phase 3:	06/21 12:00:35	06/21 12:00:38	3 seconds
Phase 4:	06/21 12:00:38	06/21 12:00:38
Phase 5:	Skipped
Phase 6:	06/21 12:00:38	06/21 12:00:41	3 seconds
Phase 7:	06/21 12:00:41	06/21 12:00:41

Total run time: 6 seconds

 

So everything should be fine now, I guess, and I can start the array up again?

27 minutes ago, JorgeB said:

Yep.

 

It looked like everything was normal again, nothing abnormal in the syslog, so I went ahead and changed out the SAS-to-SATA cable (only one drive, Disk 2, was still connected to the old SAS-to-SATA cable).

 

Aaaaaand next problem haha.

 

As soon as I started a non-correcting parity check I could hear a drive making abnormal sounds, and the speeds of the check were terrible. A look at the syslog showed:

 

Jun 21 13:33:53 Server-UR kernel: mdcmd (84): check nocorrect
Jun 21 13:33:53 Server-UR kernel: md: recovery thread: check P Q ...
Jun 21 13:34:35 Server-UR kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jun 21 13:34:35 Server-UR kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jun 21 13:34:35 Server-UR kernel: sd 8:0:6:0: Power-on or device reset occurred
Jun 21 13:34:36 Server-UR kernel: sd 8:0:6:0: Power-on or device reset occurred
Jun 21 13:34:36 Server-UR rc.diskinfo[11766]: SIGHUP received, forcing refresh of disks info.
Jun 21 13:34:40 Server-UR kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jun 21 13:34:40 Server-UR kernel: sd 2:0:2:0: Power-on or device reset occurred
Jun 21 13:34:41 Server-UR kernel: sd 2:0:2:0: Power-on or device reset occurred
Jun 21 13:34:41 Server-UR rc.diskinfo[11766]: SIGHUP received, forcing refresh of disks info.
Jun 21 13:34:45 Server-UR kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jun 21 13:34:45 Server-UR kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jun 21 13:34:45 Server-UR kernel: sd 8:0:6:0: Power-on or device reset occurred
Jun 21 13:34:45 Server-UR rc.diskinfo[11766]: SIGHUP received, forcing refresh of disks info.
Jun 21 13:34:45 Server-UR kernel: sd 8:0:6:0: Power-on or device reset occurred
Jun 21 13:34:49 Server-UR kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jun 21 13:34:49 Server-UR kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jun 21 13:34:50 Server-UR kernel: sd 2:0:2:0: Power-on or device reset occurred
Jun 21 13:34:50 Server-UR rc.diskinfo[11766]: SIGHUP received, forcing refresh of disks info.
Jun 21 13:34:50 Server-UR kernel: sd 2:0:2:0: Power-on or device reset occurred
Jun 21 13:34:51 Server-UR kernel: ata2.00: exception Emask 0x50 SAct 0x300000 SErr 0x4890800 action 0xe frozen
Jun 21 13:34:51 Server-UR kernel: ata2.00: irq_stat 0x0c400040, interface fatal error, connection status changed
Jun 21 13:34:51 Server-UR kernel: ata2: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Jun 21 13:34:51 Server-UR kernel: ata2.00: failed command: READ FPDMA QUEUED
Jun 21 13:34:51 Server-UR kernel: ata2.00: cmd 60/00:a0:40:f8:a2/04:00:00:00:00/40 tag 20 ncq dma 524288 in
Jun 21 13:34:51 Server-UR kernel:         res 40/00:a0:40:f8:a2/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Jun 21 13:34:51 Server-UR kernel: ata2.00: status: { DRDY }
Jun 21 13:34:51 Server-UR kernel: ata2.00: failed command: READ FPDMA QUEUED
Jun 21 13:34:51 Server-UR kernel: ata2.00: cmd 60/08:a8:40:fc:a2/00:00:00:00:00/40 tag 21 ncq dma 4096 in
Jun 21 13:34:51 Server-UR kernel:         res 40/00:a0:40:f8:a2/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Jun 21 13:34:51 Server-UR kernel: ata2.00: status: { DRDY }
Jun 21 13:34:51 Server-UR kernel: ata2: hard resetting link
Jun 21 13:34:54 Server-UR kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 21 13:34:54 Server-UR kernel: ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.SAT0.PRT1._GTF.DSSP], AE_NOT_FOUND (20180810/psargs-330)
Jun 21 13:34:54 Server-UR kernel: ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT1._GTF, AE_NOT_FOUND (20180810/psparse-514)
Jun 21 13:34:54 Server-UR kernel: ata2.00: supports DRM functions and may not be fully accessible
Jun 21 13:34:54 Server-UR kernel: ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.SAT0.PRT1._GTF.DSSP], AE_NOT_FOUND (20180810/psargs-330)
Jun 21 13:34:54 Server-UR kernel: ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT1._GTF, AE_NOT_FOUND (20180810/psparse-514)
Jun 21 13:34:54 Server-UR kernel: ata2.00: supports DRM functions and may not be fully accessible
Jun 21 13:34:54 Server-UR kernel: ata2.00: configured for UDMA/133
Jun 21 13:34:54 Server-UR kernel: ata2: EH complete

 

Reconnected the SAS-to-SATA cable again, at both ends, but got the same results when I started a parity check.

I then pulled the SAS-to-SATA cable altogether and connected Disk 2 to the motherboard with a SATA cable.

 

Jun 21 15:28:14 Server-UR kernel: mdcmd (83): check 
Jun 21 15:28:14 Server-UR kernel: md: recovery thread: check P Q ...
Jun 21 15:28:15 Server-UR kernel: ata3.00: exception Emask 0x50 SAct 0x40 SErr 0x4890800 action 0xe frozen
Jun 21 15:28:15 Server-UR kernel: ata3.00: irq_stat 0x0c400040, interface fatal error, connection status changed
Jun 21 15:28:15 Server-UR kernel: ata3: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Jun 21 15:28:15 Server-UR kernel: ata3.00: failed command: READ FPDMA QUEUED
Jun 21 15:28:15 Server-UR kernel: ata3.00: cmd 60/f8:30:48:0c:01/03:00:00:00:00/40 tag 6 ncq dma 520192 in
Jun 21 15:28:15 Server-UR kernel:         res 40/00:30:48:0c:01/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Jun 21 15:28:15 Server-UR kernel: ata3.00: status: { DRDY }
Jun 21 15:28:15 Server-UR kernel: ata3: hard resetting link
Jun 21 15:28:18 Server-UR kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 21 15:28:18 Server-UR kernel: ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.SAT0.PRT2._GTF.DSSP], AE_NOT_FOUND (20180810/psargs-330)
Jun 21 15:28:18 Server-UR kernel: ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT2._GTF, AE_NOT_FOUND (20180810/psparse-514)
Jun 21 15:28:18 Server-UR kernel: ata3.00: supports DRM functions and may not be fully accessible
Jun 21 15:28:18 Server-UR kernel: ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.SAT0.PRT2._GTF.DSSP], AE_NOT_FOUND (20180810/psargs-330)
Jun 21 15:28:18 Server-UR kernel: ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT2._GTF, AE_NOT_FOUND (20180810/psparse-514)
Jun 21 15:28:18 Server-UR kernel: ata3.00: supports DRM functions and may not be fully accessible
Jun 21 15:28:18 Server-UR kernel: ata3.00: configured for UDMA/133
Jun 21 15:28:18 Server-UR kernel: ata3: EH complete
Jun 21 15:28:22 Server-UR kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jun 21 15:28:22 Server-UR kernel: sd 8:0:6:0: Power-on or device reset occurred
Jun 21 15:28:23 Server-UR kernel: sd 8:0:6:0: Power-on or device reset occurred
Jun 21 15:28:23 Server-UR rc.diskinfo[11753]: SIGHUP received, forcing refresh of disks info.
Jun 21 15:28:23 Server-UR kernel: ata3.00: exception Emask 0x50 SAct 0x2 SErr 0x4890800 action 0xe frozen
Jun 21 15:28:23 Server-UR kernel: ata3.00: irq_stat 0x0c400040, interface fatal error, connection status changed
Jun 21 15:28:23 Server-UR kernel: ata3: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Jun 21 15:28:23 Server-UR kernel: ata3.00: failed command: READ FPDMA QUEUED
Jun 21 15:28:23 Server-UR kernel: ata3.00: cmd 60/00:08:c8:6e:03/04:00:00:00:00/40 tag 1 ncq dma 524288 in
Jun 21 15:28:23 Server-UR kernel:         res 40/00:08:c8:6e:03/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Jun 21 15:28:23 Server-UR kernel: ata3.00: status: { DRDY }
Jun 21 15:28:23 Server-UR kernel: ata3: hard resetting link
Jun 21 15:28:27 Server-UR kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 21 15:28:27 Server-UR kernel: ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.SAT0.PRT2._GTF.DSSP], AE_NOT_FOUND (20180810/psargs-330)
Jun 21 15:28:27 Server-UR kernel: ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT2._GTF, AE_NOT_FOUND (20180810/psparse-514)
Jun 21 15:28:27 Server-UR kernel: ata3.00: supports DRM functions and may not be fully accessible
Jun 21 15:28:27 Server-UR kernel: ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.SAT0.PRT2._GTF.DSSP], AE_NOT_FOUND (20180810/psargs-330)
Jun 21 15:28:27 Server-UR kernel: ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT2._GTF, AE_NOT_FOUND (20180810/psparse-514)
Jun 21 15:28:27 Server-UR kernel: ata3.00: supports DRM functions and may not be fully accessible
Jun 21 15:28:27 Server-UR kernel: ata3.00: configured for UDMA/133
Jun 21 15:28:27 Server-UR kernel: ata3: EH complete

 

And this just repeats. Speeds are terrible. First it was ata2.00 and now ata3.00.

 

Array starts without problems.

Unraid doesn't report any issues.

Disk 2, 3 and 4 all complete a SMART short self-test without errors.

Only when I start a parity check do problems emerge...

 

And I can't figure out what I should change out next. Any ideas?

 

1 hour ago, JorgeB said:

Still problems, likely cable or power related.

 

I think you were correct, although I still can't explain it all. It seems to have been a power issue after all.

 

I have a Be Quiet! Pure Power 400 CM. Across two drive power cables I had connected 13x 2.5" 5TB drives and 2x SSDs.

After I expanded a few months ago, it seems I had (unknowingly) connected a total of 11 drives to one drive cable.

All of them are Seagate 5TB drives, which should be more than fine when it comes to the total amount of power.

But possibly the length of the cable run was too much? I had two cables of 5 SATA power connectors each, linked together.

 

It never caused any issues, so I'm not sure why it did now, but rearranging the power (and the SATA cables, now all connected to the HBA again) seems to have done the job.

 

No errors on boot, no errors in the syslog, and the parity check started at full speed _and_ isn't filling the syslog with the errors I had before.

 

I'm going to get a third drive power cable for the PSU, so the drives are spread out even more evenly. Hopefully that will be the end of it.

 

Still strange that parity check worked before.

 

Keep you posted :)

 

5 hours ago, Squnkraid said:

disconnected inode 133, moving to lost+found
disconnected inode 134, moving to lost+found
disconnected inode 135, moving to lost+found
disconnected inode 136, moving to lost+found
disconnected inode 137, moving to lost+found
disconnected inode 138, moving to lost+found
disconnected inode 139, moving to lost+found
disconnected inode 140, moving to lost+found
disconnected inode 141, moving to lost+found
disconnected inode 142, moving to lost+found
disconnected inode 143, moving to lost+found

5 hours ago, Squnkraid said:

So everything should be fine now, I guess, and I can start the array up again?

Did you check your lost+found share?

 

 

2 hours ago, trurl said:

Did you check your lost+found share?

 

 

 

I did not actually! Thank you for mentioning this!

I didn't read the wiki entirely. My bad. Just opened the share and I see this:

 

inode	size	modified	disk / identified content
139	11.7 GB	2014-09-19 13:37	Disk 3 = not yet downloaded due to the parity check running, but guessing it's the .mkv = "movie name"
134	500 KB	2018-09-01 22:38	Disk 3 = .png = movie poster
136	500 KB	2018-09-01 22:38	Disk 3 = .png = movie poster (duplicate of 134)
137	500 KB	2018-09-01 22:38	Disk 3 = .png = movie poster (duplicate of 134)
133	373 KB	2018-09-01 22:38	Disk 3 = .png = background poster
138	163 KB	2017-09-05 07:20	Disk 3 = .srt = subtitle "movie name"
141	126 KB	2017-09-05 07:20	Disk 3 = .srt = subtitle "movie name"
140	8.08 KB	2021-06-08 03:20	Disk 3 = .nfo = "movie name" info & cast <movie>
142	4.42 KB	2017-09-05 06:57	Disk 3 = .nfo = release group info etc.
135	888 B	2018-09-15 09:38	Disk 3 = .nfo = "movie name" info <movie>
143	609 B	2018-09-15 09:38	Disk 3 = .nfo = "movie name" info <video>

 

The repair only mentioned one folder, called "movie name". That folder is not showing up when I browse Disk 3, so my guess is that all the content in that folder was somehow corrupted: the movie file, subtitles and thumbnails. Looking at the file sizes, that seems logical to me.

 

EDIT:

After downloading all the files to my PC, except the big one, I can confirm that my hunch was correct. After adding a file extension to each file I was able to see what they were. All the files seem to be undamaged? So what should I do with them? Just re-download, delete the lost+found folder and be done with it?
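(In hindsight, a quicker way to identify them than renaming one by one would probably have been the file utility; the path is just where my lost+found ended up:)

file /mnt/disk3/lost+found/*    # prints the detected type of each recovered file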

 

Also, for my understanding, can you say what exactly happened here, or give a best guess at what caused the file corruption in this situation? Because the rebuild was successful, without errors, and the "movie name" folder is really old and hasn't been written to in ages. It's a bit scary to see the limits of parity once the array needs rebuilding.

 

Straight up wake-up call to check all my backups!

 


Parity check finished with normal results: ~14 hours at ~100 MB/s, just like before.

 

The server is back online and Docker has been started. Radarr scanned all files and indeed, "Movie name" was no longer there. I changed the file extension of the large file in lost+found to .mkv and played it. It plays just fine... So all the moved files seem to be okay and not corrupted? That's something I don't understand. Corruption means loss of data and thus being unable to open/play a file, right? I figured it'd be best to just re-download it and delete the files in the lost+found share.

 

In the end it probably all came down to a lack of power: too many drives linked together and too long a cable run, hence a drop in power, resulting in the sound I heard (and have heard on a couple of occasions before), the bad speeds and ultimately the read errors. So far all seems well again.

17 minutes ago, Squnkraid said:

So all the moved files seem to be okay and not corrupted? That's something I don't understand. Corruption means loss of data and thus being unable to open/play a file, right? I figured it'd be best to just re-download it and delete the files in the lost+found share.

Sometimes just the filesystem structure is corrupt, i.e., the moved files are OK, but if you can get them again that would be better.
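(A quick way to verify, once a fresh copy is available, would be to compare checksums; the paths here are only examples:)

md5sum /mnt/disk3/lost+found/139 /mnt/user/Movies/movie-name.mkv    # identical hashes mean identical files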

