SSD Error!?

knarf0007 · February 9, 2022

Hi All,

I have a single cache SSD in my Unraid setup on which all my Docker applications run. Over the last couple of months, sector reallocations have occurred at random on this drive. Last night I tried to move all the data to a "classical" data disk to save it. The Mover transferred part of the data and then stopped with these error messages:

Feb 9 08:40:09 UnRaid-Server kernel: sd 7:0:2:0: [sdj] tag#195 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=0s
Feb 9 08:40:09 UnRaid-Server kernel: sd 7:0:2:0: [sdj] tag#195 CDB: opcode=0x28 28 00 03 25 e7 a0 00 00 20 00
Feb 9 08:40:09 UnRaid-Server kernel: blk_update_request: I/O error, dev sdj, sector 52815776 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
Feb 9 08:40:09 UnRaid-Server kernel: BTRFS error (device dm-11): bdev /dev/mapper/sdj1 errs: wr 37, rd 260090, flush 0, corrupt 0, gen 0

I'm having trouble interpreting these errors. Can you help? Thank you very much!

Best regards

Frank

unraid-server-diagnostics-20220209-0854.zip

JorgeB · February 9, 2022

It could be a device problem, but the SASLP driver crashed, which is one of the reasons those controllers are not recommended. Connect that SSD to an Intel SATA port, swapping it with another device if needed, and try again. Post new diags if it fails again.
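As a reading aid for the kernel lines in the first post: the `CDB:` line is the raw SCSI command that failed. Opcode 0x28 is READ(10), and in the standard READ(10) layout bytes 2 to 5 hold the starting block address and bytes 7 to 8 the number of blocks. A minimal Python sketch (the function name `decode_read10` is only illustrative) shows that the command decodes to the same sector the block layer reports in the following `blk_update_request` line:

```python
def decode_read10(cdb_hex: str) -> dict:
    """Decode a 10-byte SCSI READ(10) command descriptor block."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        # Bytes 2..5: starting logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: transfer length in blocks, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# CDB bytes copied from the log line in the first post.
info = decode_read10("28 00 03 25 e7 a0 00 00 20 00")
print(info)  # {'lba': 52815776, 'blocks': 32}
```

So the failed command was a 32-block read starting at sector 52815776 on sdj. This only identifies what was being read; it does not say whether the drive or the controller caused the failure.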

knarf0007 · February 10, 2022

Hi All,

it's getting worse. My attempt to copy some data from my cache_VM drive to a data disk using the Mover ended in a lot of write errors on the data disk, which is now disabled. Is it possible that my 8-port SATA expansion card is somehow damaged? As a precaution, I stopped the array. I'm afraid that I could really lose data now.

Best regards

Frank

unraid-server-diagnostics-20220210-1947.zip

Edited February 10, 2022 by knarf0007

JorgeB · February 10, 2022

The diags were taken after rebooting, so we can't see what happened. Those controllers have not been recommended for a long time; one of the reasons is that they tend to drop devices for no apparent reason. Also, contrary to what I suggested, the SSD is still connected there.

JorgeB · February 10, 2022

Forgot to mention: the disabled disk looks healthy, so this is likely not a disk problem. Most likely it is the controller or a power/cable issue.

knarf0007 · February 10, 2022

I have 4 SSDs in my setup. I switched all of them to the onboard Intel SATA controller. Everything looks fine for the moment. Nevertheless, I find messages like these in the log:

Feb 10 21:28:31 UnRaid-Server kernel: XFS (dm-6): Metadata corruption detected at xfs_buf_ioend+0x51/0x284 [xfs], xfs_inode block 0xe91a8fb0 xfs_inode_buf_verify
Feb 10 21:28:31 UnRaid-Server kernel: XFS (dm-6): Unmount and run xfs_repair
Feb 10 21:28:31 UnRaid-Server kernel: XFS (dm-6): First 128 bytes of corrupted metadata buffer:
Feb 10 21:28:31 UnRaid-Server kernel: 00000000: dd ec fd 51 64 2e 70 e1 bd 28 2c 85 d6 fb eb 4c ...Qd.p..(,....L
Feb 10 21:28:31 UnRaid-Server kernel: 00000010: 2e 67 21 fe 25 1e 69 f8 ed d0 e2 7c fa 6f 55 ce .g!.%.i....|.oU.
Feb 10 21:28:31 UnRaid-Server kernel: 00000020: af ea 17 e8 de eb 9b f1 a1 e6 36 91 25 58 2f 7b ..........6.%X/{
Feb 10 21:28:31 UnRaid-Server kernel: 00000030: 02 5e 02 e4 f8 82 17 3d 2d 3c 6d d6 c5 0e 0c 31 .^.....=-<m....1
Feb 10 21:28:31 UnRaid-Server kernel: 00000040: db 77 59 bb 85 75 f3 81 fe 75 bd 9c fb 2f b8 55 .wY..u...u.../.U
Feb 10 21:28:31 UnRaid-Server kernel: 00000050: b9 07 a0 e4 32 7c 77 aa b4 a8 25 24 68 19 9c 6d ....2|w...%$h..m
Feb 10 21:28:31 UnRaid-Server kernel: 00000060: 55 79 86 07 a2 49 ff fd 6c d0 87 57 d1 6b 79 61 Uy...I..l..W.kya
Feb 10 21:28:31 UnRaid-Server kernel: 00000070: 1b f3 23 a3 b0 0d 1f 4b e7 d6 8f 9a be b2 a8 bd ..#....K........

I can't tell which disk has the problem. Should I stop the array and run xfs_repair?

And how do I revive disk 9? Thanks a lot!

Frank.

unraid-server-diagnostics-20220210-2136.zip

knarf0007 · February 10, 2022

Sorry, ignore the question about how to revive disk 9; I found the procedure. But is it wise to start a rebuild as long as I'm not sure that everything is OK with the controller?

Edited February 10, 2022 by knarf0007

JorgeB · February 11, 2022

dm-6 is disk9; check the filesystem.

10 hours ago, knarf0007 said:

But is it wise to start a rebuild as long as I'm not sure that everything is OK with the controller?

After checking the filesystem you can attempt the rebuild, but with that controller I would recommend rebuilding onto a spare disk, so you still have the old one if something goes wrong.
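When the kernel logs the same XFS corruption report many times, it can help to collapse the repeats and list only the affected device-mapper name and metadata block. The following is a rough Python sketch, assuming log lines in the format pasted earlier in the thread; the names `PATTERN` and `summarize` are illustrative. It does not map dm-6 to an array disk; that lookup still has to come from the diagnostics.

```python
import re
from collections import Counter

# Matches the "Metadata corruption detected" lines as they appear in the
# syslog excerpt; other kernel versions may word the message differently.
PATTERN = re.compile(
    r"XFS \((?P<dev>[^)]+)\): Metadata corruption detected at \S+ \[xfs\], "
    r"(?P<kind>\w+) block (?P<block>0x[0-9a-f]+)"
)


def summarize(log_text: str) -> Counter:
    """Count corruption reports per (device, metadata type, block)."""
    hits = Counter()
    for line in log_text.splitlines():
        m = PATTERN.search(line)
        if m:
            hits[(m["dev"], m["kind"], m["block"])] += 1
    return hits


sample = (
    "Feb 10 21:28:31 UnRaid-Server kernel: XFS (dm-6): Metadata corruption "
    "detected at xfs_buf_ioend+0x51/0x284 [xfs], xfs_inode block 0xe91a8fb0 "
    "xfs_inode_buf_verify\n"
    "Feb 10 21:28:31 UnRaid-Server kernel: XFS (dm-6): Unmount and run xfs_repair\n"
) * 2

for (dev, kind, block), count in summarize(sample).items():
    print(f"{dev}: {kind} block {block} reported {count} time(s)")
```

For the excerpt in this thread, every report points at the same inode block on dm-6, so there is one damaged metadata location rather than many.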

knarf0007 · February 11, 2022

Hi, the rebuild onto another spare disk is now running. So far everything works fine. During the drive swap I checked all cable connections; maybe there was a small fault there. Additionally, to be on the safe side, I ordered one of the recommended LSI SAS controllers. However, I doubt that the old controller type has a fundamental problem, because it has worked absolutely flawlessly for years, with hard drives and SSDs. But anyway, if I can further reduce the risk of failure of my "beloved" Unraid server with a controller change, then I'm happy to do so. 😉

Best Regards

Frank

knarf0007 · February 13, 2022

Hi All,

sorry, I'm back. Something is still wrong. Every time I try to copy all the data from my Docker SSD I get error messages and the copy process stops. I have moved the SSDs to a different controller (they are now all on the internal Intel controller), but I still get these strange error messages (11:30 am). Why do I want to copy all the data from the Docker SSD? Because I have two single SSDs, one for all Dockers and one for all VMs. I want to create a cache pool instead (for redundancy) and run Dockers and VMs from this pool. Any idea? Thanks!

Best regards

Frank

unraid-server-diagnostics-20220213-1145.zip

JorgeB · February 13, 2022

The cache device appears to be failing; sorry, I forgot to mention that before. The reason I asked you to swap controllers is that a failing device shouldn't drop and crash the controller, as happens with the mvsas driver. You can run an extended SMART test to confirm. If it is failing, try to copy everything you can, either manually or using, for example, ddrescue.
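The BTRFS line in the first post's log already carries the per-device error counters that btrfs keeps (write, read and flush I/O errors, plus corruption and generation errors). A small Python sketch, assuming the message format shown in that log and using the illustrative name `parse_btrfs_errs`, pulls them out so they are easier to read:

```python
import re

# Message text copied from the first post (timestamp and host removed).
LINE = (
    "BTRFS error (device dm-11): bdev /dev/mapper/sdj1 "
    "errs: wr 37, rd 260090, flush 0, corrupt 0, gen 0"
)


def parse_btrfs_errs(line: str) -> dict:
    """Extract the device path and error counters from a BTRFS error line."""
    m = re.search(r"bdev (\S+) errs: (.*)$", line)
    if m is None:
        raise ValueError("no btrfs device error counters found")
    counters = {}
    for item in m.group(2).split(","):
        name, value = item.split()  # e.g. "rd 260090" -> ("rd", "260090")
        counters[name] = int(value)
    return {"device": m.group(1), "counters": counters}


parsed = parse_btrfs_errs(LINE)
print(parsed["device"], parsed["counters"])
```

A read-error count of 260090 against 37 write errors is consistent with a device that can no longer read back parts of its data, but the counters also rise when a controller drops the device, so the extended SMART test remains the way to confirm.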
