(SOLVED) Disk Errors

Drogon · November 13, 2019

I'm in the process of setting up my server for the first time and just ran into 114,590 disk errors while trying to copy data over from an existing disk to a disk I've just added to the array. I literally just finished running a preclear on the drive I'm copying to and everything was OK, but I have to assume that with this many errors the drive failed... I don't know how to confirm that. I've attached what I think is needed, but let me know if there's something else. I'm hoping that maybe I just did something wrong.

truesource-diagnostics-20191113-0056.zip

Edited November 14, 2019 by Drogon

John_M · November 13, 2019

Disk 2 dropped offline so there's no SMART report. Shut down. Check/replace power and data cables to it. Then power up and post new diagnostics.

Drogon · November 13, 2019

Something came up last night and when I got home from work today I had another notification saying that now both drives in my array have errors... I've attached the diagnostics that should have what we're looking for. Is someone able to tell me what the heck is going on? I'm really concerned that all of my more or less brand new drives are failing.

EDIT: I can't seem to get my array back online. I set the drives to be encrypted and don't have an option to unlock them. unRAID is indicating that both disks need to be reformatted.

EDIT2: The array has "turned good" and I've attached what I finally think is the correct diagnostics. I don't want this to happen again and really want to know what caused it.

truesource-syslog-20191113-2256.zip

truesource-syslog-20191113-2312.zip

Edited November 13, 2019 by Drogon

John_M · November 14, 2019

Go to Tools -> Diagnostics and post the resulting zip file.

Drogon · November 14, 2019

Sorry about that. Attached!

truesource-diagnostics-20191114-0009.zip

John_M · November 14, 2019

Disk2 is back online but it's attached to a Marvell 9230 controller, which is known to be problematic with Linux and drops disks randomly. I would recommend using a different SATA or SAS controller.

Drogon · November 14, 2019

Yeah, I had trouble getting the disks recognized initially... So it looks like that's probably the problem?

What did you look for? How did you know? I'd like to troubleshoot on my own. Thank you for your help!

John_M · November 14, 2019

The news broke in this thread

and at first some people were either not affected or could work round the problem by disabling IOMMU. With each new Linux kernel the situation has become worse and we are now at the point where anyone using one of the controllers on the list risks having the disks connected to it drop offline randomly. There is one Marvell controller that isn't on the list that seems less problematic than the rest - it's the 9235, which is the non-RAID capable version of the 9230, which ironically seems to be one of the more problematic. I still use the one I mention in this thread

in one of my servers, though I wouldn't want to appear to be encouraging anyone to use it because I'm fully expecting it to break one day with a new Linux kernel. For the moment it's fine, but some people are worse affected than others. In other servers I used to use the popular SAS2LP-MV8 controller, which was capable of controlling eight SATA disks straight out of the box, but I've given it up in favour of LSI-based controllers. That's the best choice for controlling eight SATA disks. For a simple 2-port SATA controller, the ones that use the ASMedia ASM1061 or ASM1062 chips are reliable.

Look at the syslog from the very first diagnostics you posted:

Nov 12 17:24:24 TrueSource kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 12 17:24:24 TrueSource kernel: ata11.00: failed command: WRITE DMA EXT
Nov 12 17:24:24 TrueSource kernel: ata11.00: cmd 35/00:40:d0:dd:24/00:05:05:00:00/e0 tag 9 dma 688128 out
Nov 12 17:24:24 TrueSource kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 12 17:24:24 TrueSource kernel: ata11.00: status: { DRDY }
Nov 12 17:24:24 TrueSource kernel: ata11: hard resetting link
Nov 12 17:24:25 TrueSource kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 12 17:24:30 TrueSource kernel: ata11.00: qc timeout (cmd 0xec)
Nov 12 17:24:31 TrueSource kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 12 17:24:31 TrueSource kernel: ata11.00: revalidation failed (errno=-5)
Nov 12 17:24:31 TrueSource kernel: ata11: hard resetting link
Nov 12 17:24:31 TrueSource kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 12 17:24:40 TrueSource kernel: ata12.00: exception Emask 0x0 SAct 0xfd80000 SErr 0x0 action 0x6 frozen
Nov 12 17:24:40 TrueSource kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 12 17:24:40 TrueSource kernel: ata12.00: cmd 60/00:98:38:e0:15/01:00:0d:00:00/40 tag 19 ncq dma 131072 in
Nov 12 17:24:40 TrueSource kernel:         res 40/00:00:00:b4:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 12 17:24:40 TrueSource kernel: ata12.00: status: { DRDY }
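If you want to pull this pattern out of a long syslog yourself, here is a minimal Python sketch that counts libata exceptions, failed commands and link resets per port. It is not part of Unraid: the function name tally_ata_events and the regular expression are my own assumptions, based only on the message layout in the excerpt above, so adjust them if your kernel formats these lines differently.

```python
import re
from collections import Counter

# Assumed layout: "<timestamp> <host> kernel: ataN[.MM]: <message>"
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def tally_ata_events(lines):
    """Count exceptions, failed commands and hard resets per ATA port."""
    counts = {}
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # not a libata line (e.g. the indented "res ..." line)
        port, message = match.groups()
        per_port = counts.setdefault(port, Counter())
        if message.startswith("exception"):
            per_port["exception"] += 1
        elif message.startswith("failed command"):
            per_port["failed_command"] += 1
        elif message.startswith("hard resetting link"):
            per_port["hard_reset"] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "Nov 12 17:24:24 TrueSource kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Nov 12 17:24:24 TrueSource kernel: ata11.00: failed command: WRITE DMA EXT",
    "Nov 12 17:24:24 TrueSource kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
    "Nov 12 17:24:24 TrueSource kernel: ata11: hard resetting link",
    "Nov 12 17:24:31 TrueSource kernel: ata11: hard resetting link",
    "Nov 12 17:24:40 TrueSource kernel: ata12.00: exception Emask 0x0 SAct 0xfd80000 SErr 0x0 action 0x6 frozen",
    "Nov 12 17:24:40 TrueSource kernel: ata12.00: failed command: READ FPDMA QUEUED",
]

for port, events in sorted(tally_ata_events(sample).items()):
    print(port, dict(events))
```

To run it over a real log, pass the lines of the syslog file from your diagnostics zip instead of the sample list. Several ports on the same controller showing exceptions within seconds of each other points at the controller, cabling or power rather than at a single disk.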

The SATA link between the controller and one disk begins to fail (ata11), and then the link to another disk also begins to fail (ata12). Eventually we see this:

Nov 12 17:25:16 TrueSource kernel: sd 12:0:0:0: [sdi] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 12 17:25:16 TrueSource kernel: sd 12:0:0:0: [sdi] tag#10 CDB: opcode=0x8a 8a 00 00 00 00 00 05 24 dd d0 00 00 05 40 00 00
Nov 12 17:25:16 TrueSource kernel: print_req_error: I/O error, dev sdi, sector 86302160
Nov 12 17:25:16 TrueSource kernel: md: disk2 write error, sector=86302096
Nov 12 17:25:16 TrueSource kernel: md: disk2 write error, sector=86302104

which fills the whole of the rest of the syslog. If we look at the SMART report for Disk2 we see that it's empty, indicating that the disk had dropped offline. That can happen with any controller and it usually indicates a bad SATA cable, a bad power supply (the PSU itself, or the cabling, splitters, etc.), a bad controller, or bad drive electronics. It's usually a cable problem, unless the controller is a Marvell. Looking at your most recent diagnostics, Disk2 is showing its SMART report, which shows it to be healthy and connected to the controller, but it's likely to drop again. If you do a search of the forums or just type "unraid marvell" into Google you'll find numerous examples of people having the same problem as you.
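As a cross-check on the excerpts above, the CDB in the [sdi] line can be decoded by hand. Opcode 0x8a is the SCSI WRITE(16) command, whose bytes 2-9 hold the starting LBA and bytes 10-13 the transfer length, both big-endian. A minimal Python sketch, where the function name decode_write16 is hypothetical and 512-byte logical sectors are an assumption:

```python
def decode_write16(cdb_hex):
    """Return (lba, sector_count) from a WRITE(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")     # bytes 2-9: starting LBA
    count = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: sectors to write
    return lba, count


# CDB copied from the "[sdi] tag#10 CDB" syslog line above.
cdb = "8a 00 00 00 00 00 05 24 dd d0 00 00 05 40 00 00"
lba, count = decode_write16(cdb)

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors
print(lba, count, count * SECTOR_BYTES)  # 86302160 1344 688128
```

The decoded LBA is the sector named in the print_req_error line, and 1344 sectors of 512 bytes is 688128 bytes, the size of the WRITE DMA EXT that timed out on ata11. That suggests the I/O error on sdi is the same write that failed at 17:24:24 being reported up the stack, not a separate fault on the disk.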

Edited November 14, 2019 by John_M
More detail

John_M · November 14, 2019

More recent threads here:

and here:

Drogon · November 14, 2019

I really appreciate your help and walking me through what's happening. Thank you!
