metadata corruption?


tucansam

Never seen errors like these before:

 

Jul 25 06:54:49 ffs2 kernel: ata11.00: cmd 60/08:00:90:15:7b/00:00:d2:01:00/40 tag 3 ncq dma 4096 in
Jul 25 06:54:49 ffs2 kernel:         res 01/04:10:e8:7d:49/00:00:2f:03:00/40 Emask 0x3 (HSM violation)
Jul 25 06:54:49 ffs2 kernel: ata11.00: status: { ERR }
Jul 25 06:54:49 ffs2 kernel: ata11.00: error: { ABRT }
Jul 25 06:54:49 ffs2 kernel: ata11: hard resetting link
Jul 25 06:54:49 ffs2 kernel: sas: ata12: end_device-1:3: dev error handler
Jul 25 06:54:49 ffs2 kernel: sas: ata13: end_device-1:4: dev error handler
Jul 25 06:54:49 ffs2 kernel: sas: ata14: end_device-1:5: dev error handler
Jul 25 06:54:49 ffs2 kernel: sas: ata15: end_device-1:6: dev error handler
Jul 25 06:54:49 ffs2 kernel: sas: sas_ata_task_done: SAS error 8a
Jul 25 06:54:49 ffs2 kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x11)
Jul 25 06:54:49 ffs2 kernel: ata11.00: revalidation failed (errno=-5)
Jul 25 06:54:50 ffs2 kernel: mvsas 0000:01:00.0: Phy3 : No sig fis
Jul 25 06:54:54 ffs2 kernel: ata11: hard resetting link
Jul 25 06:54:56 ffs2 kernel: mvsas 0000:01:00.0: Phy3 : No sig fis
Jul 25 06:55:00 ffs2 kernel: ata11.00: qc timeout (cmd 0xec)
Jul 25 06:55:00 ffs2 kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Jul 25 06:55:00 ffs2 kernel: ata11.00: revalidation failed (errno=-5)
Jul 25 06:55:00 ffs2 kernel: ata11: hard resetting link
Jul 25 06:55:00 ffs2 kernel: check SRS 0 00000001.
Jul 25 06:55:02 ffs2 kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[2]:rc= 0
Jul 25 06:55:02 ffs2 kernel: mvsas 0000:01:00.0: Phy3 : No sig fis
Jul 25 06:55:08 ffs2 kernel: sas: sas_form_port: phy3 belongs to port2 already(1)!
Jul 25 06:55:08 ffs2 kernel: ata11.00: configured for UDMA/133
Jul 25 06:55:08 ffs2 kernel: ata11: EH complete
Jul 25 06:55:08 ffs2 kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jul 25 07:03:48 ffs2 kernel: mvsas 0000:01:00.0: Phy3 : No sig fis
Jul 25 07:03:54 ffs2 kernel: mvsas 0000:01:00.0: Phy3 : No sig fis
Jul 25 07:04:04 ffs2 kernel: XFS (md1): Metadata corruption detected at xfs_dir3_data_reada_verify+0x73/0x76, xfs_dir3_data_reada block 0x32f497da8
Jul 25 07:04:04 ffs2 kernel: XFS (md1): Unmount and run xfs_repair
Jul 25 07:04:04 ffs2 kernel: XFS (md1): First 64 bytes of corrupted metadata buffer:
Jul 25 07:04:04 ffs2 kernel: ffff880297d15000: 49 4e 00 00 03 02 00 00 00 00 00 63 00 00 00 64  IN.........c...d
Jul 25 07:04:04 ffs2 kernel: ffff880297d15010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jul 25 07:04:04 ffs2 kernel: ffff880297d15020: 58 86 fe 0e 0a 6b 02 7d 59 71 bf bd 24 0a d5 00  X....k.}Yq..$...
Jul 25 07:04:04 ffs2 kernel: ffff880297d15030: 59 71 bf bd 24 0a d5 00 00 00 00 00 00 00 00 06  Yq..$...........
Jul 25 07:04:06 ffs2 kernel: sas: sas_form_port: phy3 belongs to port2 already(1)!
Jul 25 07:04:06 ffs2 kernel: XFS (md1): Metadata CRC error detected at xfs_dir3_block_read_verify+0xa1/0xa9, xfs_dir3_block block 0x32f497da8
Jul 25 07:04:06 ffs2 kernel: XFS (md1): Unmount and run xfs_repair
Jul 25 07:04:06 ffs2 kernel: XFS (md1): First 64 bytes of corrupted metadata buffer:
Jul 25 07:04:06 ffs2 kernel: ffff880297d15000: 46 00 00 00 96 63 d8 8c 02 d9 a8 40 d1 cd 2c 35  F....c.....@..,5
Jul 25 07:04:06 ffs2 kernel: ffff880297d15010: 30 38 03 98 18 bf fe 06 a0 01 b8 8a 68 9b 00 1d  08..........h...
Jul 25 07:04:06 ffs2 kernel: ffff880297d15020: 18 3f 0f 80 06 f5 70 63 00 11 d0 0b 8d 18 bc 1c  .?....pc........
Jul 25 07:04:06 ffs2 kernel: ffff880297d15030: 3a 01 94 02 fb 66 62 88 98 06 e0 6b ff fb 82 ba  :....fb....k....
Jul 25 07:04:06 ffs2 kernel: XFS (md1): metadata I/O error: block 0x32f497da8 ("xfs_trans_read_buf_map") error 74 numblks 8
Jul 25 07:04:34 ffs2 kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jul 25 07:04:34 ffs2 kernel: sas: trying to find task 0xffff88026a123400
Jul 25 07:04:34 ffs2 kernel: sas: sas_scsi_find_task: aborting task 0xffff88026a123400
Jul 25 07:04:34 ffs2 kernel: sas: sas_scsi_find_task: task 0xffff88026a123400 is aborted
Jul 25 07:04:34 ffs2 kernel: sas: sas_eh_handle_sas_errors: task 0xffff88026a123400 is aborted
Jul 25 07:04:34 ffs2 kernel: sas: ata15: end_device-1:6: cmd error handler
Jul 25 07:04:34 ffs2 kernel: sas: ata9: end_device-1:0: dev error handler
Jul 25 07:04:34 ffs2 kernel: sas: ata10: end_device-1:1: dev error handler
Jul 25 07:04:34 ffs2 kernel: sas: ata11: end_device-1:2: dev error handler
Jul 25 07:04:34 ffs2 kernel: sas: ata12: end_device-1:3: dev error handler
Jul 25 07:04:34 ffs2 kernel: sas: ata13: end_device-1:4: dev error handler
Jul 25 07:04:34 ffs2 kernel: sas: ata14: end_device-1:5: dev error handler
Jul 25 07:04:34 ffs2 kernel: sas: ata15: end_device-1:6: dev error handler
Jul 25 07:04:34 ffs2 kernel: ata15.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen
Jul 25 07:04:34 ffs2 kernel: ata15.00: failed command: READ FPDMA QUEUED
Jul 25 07:04:34 ffs2 kernel: ata15.00: cmd 60/00:00:20:cb:52/02:00:c3:01:00/40 tag 9 ncq dma 262144 in
Jul 25 07:04:34 ffs2 kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jul 25 07:04:34 ffs2 kernel: ata15.00: status: { DRDY }
Jul 25 07:04:34 ffs2 kernel: ata15: hard resetting link
Jul 25 07:04:35 ffs2 kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[6]:rc= 0
Jul 25 07:04:35 ffs2 kernel: sas: sas_ata_task_done: SAS error 8a
Jul 25 07:04:35 ffs2 kernel: ata15.00: failed to IDENTIFY (I/O error, err_mask=0x11)
Jul 25 07:04:35 ffs2 kernel: ata15.00: revalidation failed (errno=-5)
Jul 25 07:04:36 ffs2 kernel: mvsas 0000:01:00.0: Phy7 : No sig fis
Jul 25 07:04:40 ffs2 kernel: sas: sas_form_port: phy7 belongs to port6 already(1)!
Jul 25 07:04:41 ffs2 kernel: ata15: hard resetting link
Jul 25 07:04:41 ffs2 kernel: ata15.00: configured for UDMA/133
Jul 25 07:04:41 ffs2 kernel: ata15: EH complete
Jul 25 07:04:41 ffs2 kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

Jul 25 13:56:50 ffs2 kernel: drivers/scsi/mvsas/mv_sas.c 1870:Release slot [4] tag[4], task [ffff8802b8363700]:
Jul 25 13:56:50 ffs2 kernel: sas: sas_ata_task_done: SAS error 8a
Jul 25 13:56:51 ffs2 kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jul 25 13:56:51 ffs2 kernel: sas: ata11: end_device-1:2: cmd error handler
Jul 25 13:56:51 ffs2 kernel: sas: ata9: end_device-1:0: dev error handler
Jul 25 13:56:51 ffs2 kernel: sas: ata10: end_device-1:1: dev error handler
Jul 25 13:56:51 ffs2 kernel: sas: ata11: end_device-1:2: dev error handler
Jul 25 13:56:51 ffs2 kernel: sas: ata12: end_device-1:3: dev error handler
Jul 25 13:56:51 ffs2 kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Jul 25 13:56:51 ffs2 kernel: sas: ata13: end_device-1:4: dev error handler
Jul 25 13:56:51 ffs2 kernel: ata11.00: failed command: SMART
Jul 25 13:56:51 ffs2 kernel: sas: ata14: end_device-1:5: dev error handler
Jul 25 13:56:51 ffs2 kernel: ata11.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 1 pio 512 in
Jul 25 13:56:51 ffs2 kernel:         res 01/04:81:82:00:00/00:00:00:00:00/40 Emask 0x12 (ATA bus error)
Jul 25 13:56:51 ffs2 kernel: sas: ata15: end_device-1:6: dev error handler
Jul 25 13:56:51 ffs2 kernel: ata11.00: status: { ERR }
Jul 25 13:56:51 ffs2 kernel: ata11.00: error: { ABRT }
Jul 25 13:56:51 ffs2 kernel: ata11: hard resetting link
Jul 25 13:56:51 ffs2 kernel: sas: sas_ata_task_done: SAS error 8a
Jul 25 13:56:51 ffs2 kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x11)
Jul 25 13:56:51 ffs2 kernel: ata11.00: revalidation failed (errno=-5)
Jul 25 13:56:52 ffs2 kernel: mvsas 0000:01:00.0: Phy3 : No sig fis
Jul 25 13:56:56 ffs2 kernel: ata11: hard resetting link
Jul 25 13:56:59 ffs2 kernel: mvsas 0000:01:00.0: Phy3 : No sig fis
Jul 25 13:57:02 ffs2 kernel: ata11.00: qc timeout (cmd 0xec)
Jul 25 13:57:02 ffs2 kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Jul 25 13:57:02 ffs2 kernel: ata11.00: revalidation failed (errno=-5)
Jul 25 13:57:02 ffs2 kernel: ata11: hard resetting link
Jul 25 13:57:04 ffs2 kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[2]:rc= 0
Jul 25 13:57:04 ffs2 kernel: mvsas 0000:01:00.0: Phy3 : No sig fis
Jul 25 13:57:09 ffs2 kernel: ata11.00: qc timeout (cmd 0x27)
Jul 25 13:57:09 ffs2 kernel: ata11.00: failed to read native max address (err_mask=0x4)
Jul 25 13:57:09 ffs2 kernel: ata11.00: HPA support seems broken, skipping HPA handling
Jul 25 13:57:09 ffs2 kernel: ata11.00: revalidation failed (errno=-5)
Jul 25 13:57:09 ffs2 kernel: ata11.00: disabled
Jul 25 13:57:09 ffs2 kernel: ata11: hard resetting link
Jul 25 13:57:11 ffs2 kernel: sas: sas_form_port: phy3 belongs to port2 already(1)!
Jul 25 13:57:12 ffs2 kernel: drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[2]:rc= 0
Jul 25 13:57:12 ffs2 kernel: ata11: EH complete
Jul 25 13:57:12 ffs2 kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jul 25 13:57:12 ffs2 kernel: sd 1:0:2:0: [sdn] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jul 25 13:57:12 ffs2 kernel: sd 1:0:2:0: [sdn] tag#2 CDB: opcode=0x88 88 00 00 00 00 01 d2 7b 15 90 00 00 00 08 00 00
 

 

 

Lots of metadata errors and repeated hard link resets. The array was running tip-top until now; now I have a red X'd disk and all this crazy stuff in the syslog.
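For anyone trying to see which ports are misbehaving in a syslog like this, here is a minimal Python sketch that counts libata error events per port. The function name `tally_port_errors`, the list of marker strings, and the sample lines are assumptions made for illustration; they are not part of any unRAID or kernel tool, and the marker list is only a rough guess at what counts as an "error".

```python
import re
from collections import Counter

# Matches libata lines such as "kernel: ata11: hard resetting link" or
# "kernel: ata11.00: failed to IDENTIFY (...)". Lines that start with
# "kernel: sas: ata12: ..." are deliberately not matched.
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings treated as error events (an assumption; extend as needed).
ERROR_MARKERS = (
    "hard resetting link",
    "failed to IDENTIFY",
    "revalidation failed",
    "qc timeout",
    "disabled",
)


def tally_port_errors(lines):
    """Return a Counter mapping port name (e.g. 'ata11') to error-line count."""
    counts = Counter()
    for line in lines:
        match = PORT_RE.search(line)
        if match and any(marker in match.group(2) for marker in ERROR_MARKERS):
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the log above, used as sample input.
sample = [
    "Jul 25 06:54:49 ffs2 kernel: ata11: hard resetting link",
    "Jul 25 06:54:49 ffs2 kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x11)",
    "Jul 25 07:04:34 ffs2 kernel: ata15: hard resetting link",
    "Jul 25 07:04:41 ffs2 kernel: ata15: EH complete",
]

print(tally_port_errors(sample))
```

To run it against a real log, pass an open file instead of `sample`, for example `tally_port_errors(open("syslog"))`, where the path is whatever your system uses. If the resets are spread across several ports on the same controller rather than concentrated on one disk, that would point more toward the controller than toward a single drive.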

 

 


Thanks to you both.  I have an LSI on order.

 

I guess my biggest curiosity is the randomness of the errors.  My old server ran for over a year with that controller and never had any issues.  The MB was too old to support VT-d; it wasn't even an option in the BIOS.

 

On two other motherboards, VT-d was supported but disabled.  Both of those systems (one of which is my present one giving me trouble) had all kinds of issues.  Still, sometimes I would get two red balls in a month, and sometimes I would go six or more months with no issues at all.

 

I'm trying to wrap my head around the exact cause, and why it doesn't happen with any regularity.


Mine ran for a few years without issue until I upgraded my MB. I updated the version of unRAID I was running at the same time, though, and I would presume that the unRAID version is the most likely cause. I suppose the best part is that I have learned a lot more (thanks @Johnnie.black) and the new controller is significantly cheaper.


That's common. They ran fine on v5 and still run fine on v6 for some users, but any hardware or software change can trigger an issue. IMO they are like a ticking time bomb, though I'm still using two SAS2LP on a backup server without issues so far. I had a problem once with a SASLP and retired them all.


I have enough ports on the MB and a spare (Marvell...) controller. I can run 100% of my unRAID disks on non-Marvell ports, and a mostly unused scratch disk outside the array on a Marvell, until my LSI arrives (Aug 2).

 

Can I attempt to rebuild Disk 1?  Is Disk 2 totally toast?  From the command line I am getting I/O errors; I tried 'ls /mnt/disk2' to see what data I fear I am going to lose.

 

How bad is this?

 

Stopped the array to power down and swap hardware, now Disk 2 is missing and Disk 1 is emulated.  I assume this means I am going to lose data.

 

unraid.jpg

unraid2.jpg


Thanks Johnnie.  Disk 2 came back up and Disk 1 is rebuilding.  All array devices are on my existing LSI and Asmedia/Intel ports on the MB.  My spare scratch disk is on an old PCI Promise Tech SATA1 controller, but speed is not necessary on that drive.  Array is rebuilding Disk 1, fingers are crossed.

 

As an aside, are there any four-port PCIe SATA controllers that DON'T use the Marvell chipset?  I have a spare server with 8 drives and the only other four-port controller I have is a Marvell; I would very much like to stay away from this chipset, forever.  I need something to add ports. I could go with another LSI I suppose, but that seems like a waste as I have five ports on the MB, and this is a backup server that will come on once a month.


There are some 4-port LSI controllers, but they usually cost almost the same as the 8-port version. Other than that, there's the older Adaptec 1430SA that still works well. It does use a Marvell controller, but it uses a specific driver and AFAIK there have never been any issues with it and unRAID.

