After 3 years, Unraid is being wonky: Dockers aren't working and drives keep failing



Ok, checked disk 7.

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 1
        - agno = 3
        - agno = 0
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

 

I ran it again without the -n flag and attached the diagnostics.
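
For anyone following along, the only difference between the two runs is the -n (no modify) flag. A minimal sketch, assuming disk 7 maps to /dev/md7 (the usual device when the Unraid array is started in maintenance mode; newer releases may use /dev/md7p1 instead):

# Read-only check: reports problems but changes nothing
xfs_repair -n /dev/md7

# Actual repair: run only with the array in maintenance mode
xfs_repair /dev/md7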

 

 

orthanc-diagnostics-20220808-1844.zip

20 hours ago, trurl said:

lots of syslog entries about Unassigned Device sdh

Still filling syslog; I almost forgot why, since I look at a lot of threads and a lot of diagnostics. It is currently using 25% of log space and will eventually fill it unless you reboot before then.
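
If you want to watch this yourself, /var/log on Unraid is a small RAM-backed filesystem, so standard commands show how full it is and which files are growing (nothing Unraid-specific here):

# How full is the log filesystem?
df -h /var/log

# Which files are taking the space, largest first?
ls -lhS /var/log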

10 hours ago, trurl said:

You should remove it if for no other reason than it is cluttering your syslog and we will wonder about it next time.

 

36 minutes ago, trurl said:

Did you capture the output so you could post it?

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
Metadata CRC error detected at 0x439496, xfs_agf block 0x1ffffffe1/0x200
agf has bad CRC for ag 4
block (2,184178199-184178280) multiply claimed by bno space tree, state - 1
block (2,226384586-226384833) multiply claimed by cnt space tree, state - 2
block (2,123204880-123205051) multiply claimed by cnt space tree, state - 2
block (2,40225617-40225622) multiply claimed by cnt space tree, state - 2
block (2,40225773-40225859) multiply claimed by cnt space tree, state - 2
block (2,102377537-102377552) multiply claimed by cnt space tree, state - 2
block (2,36413631-36413735) multiply claimed by cnt space tree, state - 2
block (2,61136756-61136865) multiply claimed by cnt space tree, state - 2
block (2,174373505-174373514) multiply claimed by cnt space tree, state - 2
block (2,9748180-9748193) multiply claimed by cnt space tree, state - 2
block (2,9748274-9748287) multiply claimed by cnt space tree, state - 2
block (2,104799337-104799352) multiply claimed by cnt space tree, state - 2
block (2,123180332-123180475) multiply claimed by cnt space tree, state - 2
block (2,162984111-162984288) multiply claimed by cnt space tree, state - 2
block (2,60006119-60006216) multiply claimed by cnt space tree, state - 2
block (2,102524589-102524613) multiply claimed by cnt space tree, state - 2
block (2,40224317-40224326) multiply claimed by cnt space tree, state - 2
block (2,40224419-40224432) multiply claimed by cnt space tree, state - 2
block (2,40224497-40224575) multiply claimed by cnt space tree, state - 2
block (2,107092002-107092006) multiply claimed by cnt space tree, state - 2
block (2,236311683-236311689) multiply claimed by cnt space tree, state - 2
block (2,107087496-107087499) multiply claimed by cnt space tree, state - 2
block (2,48692963-48693032) multiply claimed by cnt space tree, state - 2
block (2,106727218-106727508) multiply claimed by cnt space tree, state - 2
block (2,104837401-104837680) multiply claimed by cnt space tree, state - 2
block (2,9514183-9514243) multiply claimed by cnt space tree, state - 2
block (2,48042435-48042473) multiply claimed by cnt space tree, state - 2
block (2,40238304-40238370) multiply claimed by cnt space tree, state - 2
block (2,162968554-162968627) multiply claimed by cnt space tree, state - 2
block (2,47965621-47965668) multiply claimed by cnt space tree, state - 2
block (2,178636310-178636383) multiply claimed by cnt space tree, state - 2
block (2,102523728-102523738) multiply claimed by cnt space tree, state - 2
block (2,19627766-19627977) multiply claimed by cnt space tree, state - 2
block (2,95752502-95752512) multiply claimed by cnt space tree, state - 2
block (2,127294648-127294666) multiply claimed by cnt space tree, state - 2
agf_freeblks 2038004, counted 2048802 in ag 2
agf_freeblks 2116416, counted 2106811 in ag 4
agi unlinked bucket 24 is 344682520 in ag 1 (inode=2492166168)
sb_fdblocks 793886, counted 11056513
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
data fork in ino 1841992 claims free block 722810940
data fork in ino 1841992 claims free block 722818433
data fork in ino 1841992 claims free block 722766205
data fork in ino 1841992 claims free block 722745607
        - agno = 1
data fork in ino 2358538759 claims free block 721347041
data fork in ino 2358538759 claims free block 721149027
        - agno = 2
data fork in ino 4296520813 claims free block 556498438
data fork in ino 4296520813 claims free block 578099252
data fork in ino 4296520813 claims free block 773182227
data fork in ino 4296520813 claims free block 773182467
data fork in ino 4296520817 claims free block 573284307
data fork in ino 4296520817 claims free block 639394219
data fork in ino 4296520817 claims free block 639395162
data fork in ino 4296733237 claims free block 616345471
data fork in ino 4296733237 claims free block 639394507
data fork in ino 4296733237 claims free block 639395362
data fork in ino 4296739755 claims free block 641669917
data fork in ino 4296739755 claims free block 699854846
data fork in ino 4296739755 claims free block 699839130
data fork in ino 4296739755 claims free block 632622970
data fork in ino 4296739755 claims free block 632630486
data fork in ino 4296739755 claims free block 643598034
data fork in ino 4296739755 claims free block 632630142
data fork in ino 4296739755 claims free block 585572404
data fork in ino 4296739755 claims free block 585563566
data fork in ino 4296739755 claims free block 585563626
data fork in ino 4296744535 claims free block 711244085
data fork in ino 4296744537 claims free block 578099040
data fork in ino 4296744541 claims free block 596876767
data fork in ino 4296744541 claims free block 763252191
data fork in ino 4296744541 claims free block 763255439
data fork in ino 4296744541 claims free block 763255426
data fork in ino 4296744541 claims free block 763255442
data fork in ino 4296744541 claims free block 641708205
data fork in ino 4296744541 claims free block 584836169
data fork in ino 4296744541 claims free block 643965222
data fork in ino 4296744541 claims free block 643962849
data fork in ino 4296744541 claims free block 643962780
data fork in ino 4296744541 claims free block 643958036
data fork in ino 4296744541 claims free block 643962552
data fork in ino 4296744541 claims free block 639248129
data fork in ino 4296744541 claims free block 715506878
data fork in ino 4296744541 claims free block 660075640
data fork in ino 4296744541 claims free block 660051040
data fork in ino 4296744541 claims free block 584913266
data fork in ino 4296744541 claims free block 722709888
data fork in ino 4296744541 claims free block 722710684
data fork in ino 4296744541 claims free block 584912986
data fork in ino 4296744544 claims free block 546384767
data fork in ino 4296744546 claims free block 546619004
data fork in ino 4296744546 claims free block 546619106
data fork in ino 4296744546 claims free block 546619200
data fork in ino 4296744547 claims free block 577095121
data fork in ino 4296744547 claims free block 577095239
data fork in ino 4296744547 claims free block 577095345
data fork in ino 4296744547 claims free block 577096441
data fork in ino 4296744547 claims free block 577096535
data fork in ino 4296744547 claims free block 577108880
data fork in ino 4296744547 claims free block 585468837
data fork in ino 4296744553 claims free block 546619278
data fork in ino 4296744553 claims free block 598007436
        - agno = 3
data fork in ino 6446982893 claims free block 721391881
        - agno = 4
data fork in ino 8590740692 claims free block 664165116
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 3
        - agno = 1
        - agno = 5
        - agno = 4
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 2492166168, moving to lost+found
Phase 7 - verify and correct link counts...
done
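
Since the repair moved a disconnected inode to lost+found, it's worth checking what landed there so it can be renamed back or deleted. A hedged example, assuming this was disk 7 and the stock Unraid mount point:

ls -l /mnt/disk7/lost+found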

 

5 minutes ago, trurl said:

That attribute isn't usually monitored. Should it be added to custom attributes for some models?

I don't have enough experience with helium disks yet to say whether it's worth monitoring, though it wouldn't hurt. I only have a couple of WDs using helium, and their attribute is different. In this case, since there's a SMART attribute showing "failing now", the user gets a notification anyway, assuming notifications are enabled.
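
For anyone who wants to check their own drive, smartctl can list the raw SMART attributes; the helium attribute's ID and name vary by vendor (on many HGST/WD helium models it is ID 22, but treat that as an assumption and check your model's output):

smartctl -A /dev/sdX | grep -i helium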


I shut it down, checked the cables on the drives, and removed the disk that wasn't in use and had errors.

 

Restarted and things are looking better.

 

[screenshot attached]

 

Should I be worried about it leaking helium? I'll probably start the warranty process on it, since I doubt it can be recovered if that's the problem.

orthanc-diagnostics-20220810-1636.zip

15 minutes ago, trurl said:

Still having connection problems on parity2.

 

I stopped the array and hot-swapped the two parity drives into different drive bays, so they should now be running from the SATA controller on the motherboard instead of the expansion card.

 

If they don't have problems and the other drives now do, I will need to replace the SATA card.

orthanc-diagnostics-20220810-1708.zip

20 hours ago, thebedivere said:

For my own understanding, where do you see this in the diagnostics?

In logs/syslog there is a lot of this:

Aug 10 14:08:37 Orthanc kernel: ata10.00: failed command: READ FPDMA QUEUED
Aug 10 14:08:37 Orthanc kernel: ata10.00: cmd 60/40:38:b8:c8:d4/05:00:00:00:00/40 tag 7 ncq dma 688128 in
Aug 10 14:08:37 Orthanc kernel:         res 40/00:00:f8:ad:d4/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Aug 10 14:08:37 Orthanc kernel: ata10.00: status: { DRDY }
Aug 10 14:08:37 Orthanc kernel: ata10.00: failed command: READ FPDMA QUEUED
Aug 10 14:08:37 Orthanc kernel: ata10.00: cmd 60/d8:40:f8:cd:d4/04:00:00:00:00/40 tag 8 ncq dma 634880 in
Aug 10 14:08:37 Orthanc kernel:         res 40/00:00:f8:ad:d4/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Aug 10 14:08:37 Orthanc kernel: ata10.00: status: { DRDY }
Aug 10 14:08:37 Orthanc kernel: ata10: hard resetting link

In system/lsscsi.txt

[10:0:0:0]   disk    ATA      ST16000NM001G-2K SN04  /dev/sdi   /dev/sg8 
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/10:0:0:0  [/sys/devices/pci0000:00/0000:00:01.2/0000:02:00.2/0000:03:04.0/0000:06:00.0/ata10/host10/target10:0:0/10:0:0:0]

The smart folder shows which disk is sdi; you can also search for ata10 and sdi in syslog and figure it out.
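
For example, against the files in the diagnostics zip (adjust the path to /var/log/syslog when grepping a live system):

grep -E 'ata10|sdi' logs/syslog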

 

That entry in lsscsi also shows controller 06:00.0, which can be seen in system/lspci.txt:

06:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller [1b4b:9215] (rev 11)
	Subsystem: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller [1b4b:9215]
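
On a live system you can query that slot directly with standard lspci options:

lspci -nn -s 06:00.0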

 

 


The ata10 and ata12 errors are on the Marvell; ata2 is on another controller, probably the motherboard.

 

Replacing the Marvell should be progress even if it doesn't fix everything.

 

Can't remember if it was already mentioned: what are you replacing it with?

 

I skimmed the thread and didn't notice the link we usually give for recommended controllers.

20 minutes ago, trurl said:

Can't remember if it was already mentioned: what are you replacing it with?

I had grabbed another one off Amazon, but even though it's a different brand and product image, it's the exact same controller...

 

So right now I don't have anything to replace it with. Any suggestions?

19 minutes ago, trurl said:

Another thing I don't think we have mentioned: does your system have adequate cooling? Controllers have heatsinks for a reason.

 

I have just the case fans that came with it and a big box fan blowing over the whole server rack. There isn't anything pointing directly at the SATA controller. I can try to get another fan in the case pointing at it.

