(SOLVED) Disk Errors


Drogon


I'm in the process of setting up my server for the first time and just ran into 114,590 disk errors while trying to copy data from an existing disk to a disk I've just added to the array. I had literally just finished running a preclear on the drive I'm copying to, and everything was OK, but I have to assume that with this many errors the drive has failed... I don't know how to confirm that. I've attached what I think is needed, but let me know if there's something else. I'm hoping that maybe I just did something wrong.

truesource-diagnostics-20191113-0056.zip

Edited by Drogon

Something came up last night, and when I got home from work today I had another notification saying that both drives in my array now have errors... I've attached the diagnostics that should have what we're looking for. Is someone able to tell me what the heck is going on? I'm really concerned that all of my more-or-less brand-new drives are failing.

 

EDIT: I can't seem to get my array back online. I set the drives to be encrypted and don't have an option to unlock them. unRAID is indicating that both disks need to be reformatted.

 

EDIT2: The array has "turned good" and I've attached what I finally think is the correct diagnostics. I don't want this to happen again and really want to know what caused it.

truesource-syslog-20191113-2256.zip

truesource-syslog-20191113-2312.zip

Edited by Drogon

The news broke in this thread

and at first some people were either not affected or could work round the problem by disabling IOMMU. With each new Linux kernel the situation has become worse and we are now at the point where anyone using one of the controllers on the list risks having the disks connected to it drop offline randomly. There is one Marvell controller that isn't on the list that seems less problematic than the rest - it's the 9235, which is the non-RAID capable version of the 9230, which ironically seems to be one of the more problematic. I still use the one I mention in this thread

in one of my servers, though I wouldn't want to appear to be encouraging anyone to use it because I fully expect it to break one day with a new Linux kernel. For the moment it's fine, but some people are worse affected than others. In other servers I used to use the popular SAS2LP-MV8 controller, which could control eight SATA disks straight out of the box, but I've given it up in favour of LSI-based controllers, which are the best choice for controlling eight SATA disks. For a simple 2-port SATA controller, the ones that use the ASMedia ASM1061 or ASM1062 chips are reliable.

 

Look at the syslog from the very first diagnostics you posted:

Nov 12 17:24:24 TrueSource kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 12 17:24:24 TrueSource kernel: ata11.00: failed command: WRITE DMA EXT
Nov 12 17:24:24 TrueSource kernel: ata11.00: cmd 35/00:40:d0:dd:24/00:05:05:00:00/e0 tag 9 dma 688128 out
Nov 12 17:24:24 TrueSource kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 12 17:24:24 TrueSource kernel: ata11.00: status: { DRDY }
Nov 12 17:24:24 TrueSource kernel: ata11: hard resetting link
Nov 12 17:24:25 TrueSource kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 12 17:24:30 TrueSource kernel: ata11.00: qc timeout (cmd 0xec)
Nov 12 17:24:31 TrueSource kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 12 17:24:31 TrueSource kernel: ata11.00: revalidation failed (errno=-5)
Nov 12 17:24:31 TrueSource kernel: ata11: hard resetting link
Nov 12 17:24:31 TrueSource kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 12 17:24:40 TrueSource kernel: ata12.00: exception Emask 0x0 SAct 0xfd80000 SErr 0x0 action 0x6 frozen
Nov 12 17:24:40 TrueSource kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 12 17:24:40 TrueSource kernel: ata12.00: cmd 60/00:98:38:e0:15/01:00:0d:00:00/40 tag 19 ncq dma 131072 in
Nov 12 17:24:40 TrueSource kernel:         res 40/00:00:00:b4:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 12 17:24:40 TrueSource kernel: ata12.00: status: { DRDY }

The SATA link between the controller and one disk begins to fail (ata11), and then the link to another disk also begins to fail (ata12). Eventually we see this:

Nov 12 17:25:16 TrueSource kernel: sd 12:0:0:0: [sdi] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 12 17:25:16 TrueSource kernel: sd 12:0:0:0: [sdi] tag#10 CDB: opcode=0x8a 8a 00 00 00 00 00 05 24 dd d0 00 00 05 40 00 00
Nov 12 17:25:16 TrueSource kernel: print_req_error: I/O error, dev sdi, sector 86302160
Nov 12 17:25:16 TrueSource kernel: md: disk2 write error, sector=86302096
Nov 12 17:25:16 TrueSource kernel: md: disk2 write error, sector=86302104

which fills the whole of the rest of the syslog. If we look at the SMART report for Disk2 we see that it's empty, indicating that the disk had dropped offline. That can happen with any controller, and it usually indicates a bad SATA cable, a bad power supply (the PSU itself, or the cabling, splitters, etc.), a bad controller, or bad drive electronics. It's usually a cable problem, unless the controller is a Marvell. In your most recent diagnostics Disk2 does have a SMART report, which shows it to be healthy and connected to the controller, but it's likely to drop again. If you search the forums or just type "unraid marvell" into Google you'll find numerous examples of people having the same problem as you.
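If you want to convince yourself that the write that timed out on ata11 is the same one that md later reports as failed on sdi (Disk2), you can decode the two log lines by hand. Here's a rough Python sketch of that cross-check. It assumes the `cmd` line uses the usual Linux libata layout (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), that opcode 0x8a is a SCSI WRITE(16) with an 8-byte LBA followed by a 4-byte transfer length, and that the disk uses 512-byte logical sectors. The function names are mine, purely for illustration.

```python
import re

# Two lines copied verbatim from the syslog excerpts in this post.
ATA_LINE = ("Nov 12 17:24:24 TrueSource kernel: ata11.00: "
            "cmd 35/00:40:d0:dd:24/00:05:05:00:00/e0 tag 9 dma 688128 out")
CDB_LINE = ("Nov 12 17:25:16 TrueSource kernel: sd 12:0:0:0: [sdi] tag#10 "
            "CDB: opcode=0x8a 8a 00 00 00 00 00 05 24 dd d0 00 00 05 40 00 00")


def decode_ata_cmd(line):
    """Return (lba, sector_count) from a libata 'cmd' line (48-bit command).

    Assumed layout: command / feature:nsect:lbal:lbam:lbah /
    hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device
    """
    fields = re.search(r"cmd ([0-9a-f/:]+)", line).group(1)
    _command, low, high, _device = fields.split("/")
    _feat, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hfeat, hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in high.split(":"))
    # The "hob" (high order byte) registers carry the upper half of each value.
    lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) | (lbah << 16) | (lbam << 8) | lbal
    count = (hnsect << 8) | nsect
    return lba, count


def decode_write16_cdb(line):
    """Return (lba, sector_count) from a logged 16-byte SCSI CDB.

    Assumes WRITE(16): bytes 2-9 are the LBA, bytes 10-13 the transfer length.
    """
    hex_bytes = re.search(r"CDB: opcode=0x[0-9a-f]{2}((?: [0-9a-f]{2})+)", line).group(1)
    cdb = bytes.fromhex(hex_bytes.strip())
    lba = int.from_bytes(cdb[2:10], "big")
    count = int.from_bytes(cdb[10:14], "big")
    return lba, count


ata_lba, ata_count = decode_ata_cmd(ATA_LINE)
scsi_lba, scsi_count = decode_write16_cdb(CDB_LINE)

print(ata_lba, ata_count)    # 86302160 1344
print(scsi_lba, scsi_count)  # 86302160 1344
print(ata_count * 512)       # 688128, assuming 512-byte sectors
```

Both lines decode to a 1344-sector write starting at LBA 86302160, which is the sector quoted in the `print_req_error` line, and 1344 × 512 bytes matches the `dma 688128 out` figure. So the timeout on ata11 and the I/O error on sdi look like the same request, which is consistent with the disk on that port dropping offline rather than with bad sectors on the disk.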

 

Edited by John_M
More detail
