Drogon Posted November 13, 2019 (edited)
I'm in the process of setting up my server for the first time and just ran into 114,590 disk errors while trying to copy data over from an existing disk to a disk I've just added to the array. I literally just finished running a preclear on the drive I'm copying to and everything was OK, but I have to assume that with this many errors the drive failed... I don't know how to confirm that. I've attached what I think is needed, but let me know if there's something else. I'm hoping that maybe I just did something wrong.
truesource-diagnostics-20191113-0056.zip
Edited November 14, 2019 by Drogon
John_M Posted November 13, 2019
Disk 2 dropped offline so there's no SMART report. Shut down. Check/replace power and data cables to it. Then power up and post new diagnostics.
Drogon (Author) Posted November 13, 2019 (edited)
Something came up last night, and when I got home from work today I had another notification saying that now both drives in my array have errors... I've attached the diagnostics that should have what we're looking for. Is someone able to tell me what the heck is going on? I'm really concerned that all of my more or less brand new drives are failing.
EDIT: I can't seem to get my array back online. I set the drives to be encrypted and don't have an option to unlock them. unRAID is indicating that both disks need to be reformatted.
EDIT2: The array has "turned good" and I've attached what I finally think is the correct diagnostics. I don't want this to happen again and really want to know what caused it.
truesource-syslog-20191113-2256.zip truesource-syslog-20191113-2312.zip
Edited November 13, 2019 by Drogon
John_M Posted November 14, 2019
Those are syslogs rather than full diagnostics. Go to Tools -> Diagnostics and post the resulting zip file.
Drogon (Author) Posted November 14, 2019
Sorry about that. Attached!
truesource-diagnostics-20191114-0009.zip
John_M Posted November 14, 2019
Disk2 is back online but it's attached to a Marvell 9230 controller, which is known to be problematic with Linux and drops disks randomly. I would recommend using a different SATA or SAS controller.
Drogon (Author) Posted November 14, 2019
Yeah, I had trouble getting the disks recognized initially... So it looks like that's probably the problem? What did you look for? How did you know? I'd like to troubleshoot on my own. Thank you for your help!
John_M Posted November 14, 2019 (edited)
The news broke in this thread and at first some people were either not affected or could work round the problem by disabling IOMMU. With each new Linux kernel the situation has become worse and we are now at the point where anyone using one of the controllers on the list risks having the disks connected to it drop offline randomly.

There is one Marvell controller that isn't on the list and seems less problematic than the rest - it's the 9235, which is the non-RAID capable version of the 9230. Ironically, the 9230 itself seems to be one of the more problematic. I still use the one I mention in this thread in one of my servers, though I wouldn't want to appear to be encouraging anyone to use it because I'm fully expecting it to break one day with a new Linux kernel. For the moment it's fine, but some people are worse affected than others anyway.

In other servers I used to use the popular SAS2LP-MV8 controller, which was capable of controlling eight SATA disks straight out of the box, but I've given it up in favour of LSI-based controllers. In my view those are the best choice for controlling eight SATA disks. For a simple 2-port SATA controller, the ones that use the ASMedia ASM1061 or ASM1062 chips are reliable.
Look at the syslog from the very first diagnostics you posted:

Nov 12 17:24:24 TrueSource kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 12 17:24:24 TrueSource kernel: ata11.00: failed command: WRITE DMA EXT
Nov 12 17:24:24 TrueSource kernel: ata11.00: cmd 35/00:40:d0:dd:24/00:05:05:00:00/e0 tag 9 dma 688128 out
Nov 12 17:24:24 TrueSource kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 12 17:24:24 TrueSource kernel: ata11.00: status: { DRDY }
Nov 12 17:24:24 TrueSource kernel: ata11: hard resetting link
Nov 12 17:24:25 TrueSource kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 12 17:24:30 TrueSource kernel: ata11.00: qc timeout (cmd 0xec)
Nov 12 17:24:31 TrueSource kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 12 17:24:31 TrueSource kernel: ata11.00: revalidation failed (errno=-5)
Nov 12 17:24:31 TrueSource kernel: ata11: hard resetting link
Nov 12 17:24:31 TrueSource kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 12 17:24:40 TrueSource kernel: ata12.00: exception Emask 0x0 SAct 0xfd80000 SErr 0x0 action 0x6 frozen
Nov 12 17:24:40 TrueSource kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 12 17:24:40 TrueSource kernel: ata12.00: cmd 60/00:98:38:e0:15/01:00:0d:00:00/40 tag 19 ncq dma 131072 in
Nov 12 17:24:40 TrueSource kernel: res 40/00:00:00:b4:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 12 17:24:40 TrueSource kernel: ata12.00: status: { DRDY }

The SATA link between the controller and one disk begins to fail (ata11), and then the link to another disk also begins to fail (ata12).
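For anyone who wants to spot this pattern without reading the whole syslog by eye, here is a minimal Python sketch that tallies libata events per port. The regular expression and the event labels are assumptions inferred from the excerpt above, not an official list of kernel messages, so other kernel versions may word things differently.

```python
import re
from collections import Counter, defaultdict

# Matches "kernel: ata11.00: <message>" and "kernel: ata11: <message>".
# The pattern is inferred from the excerpt above (an assumption).
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def classify(message):
    """Map a libata message to a coarse, hypothetical event label."""
    if message.startswith("exception"):
        return "exception"
    if message.startswith("failed command"):
        return "failed_command"
    if "hard resetting link" in message:
        return "link_reset"
    if "failed to IDENTIFY" in message or "revalidation failed" in message:
        return "revalidation_failure"
    if "qc timeout" in message:
        return "timeout"
    return None  # anything else (link up, status, cmd) is ignored


def summarize(lines):
    """Return {port: Counter of event labels} for the given syslog lines."""
    events = defaultdict(Counter)
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        label = classify(message)
        if label:
            events[port][label] += 1
    return events


# A few lines copied from the excerpt above.
SAMPLE = """\
Nov 12 17:24:24 TrueSource kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 12 17:24:24 TrueSource kernel: ata11.00: failed command: WRITE DMA EXT
Nov 12 17:24:24 TrueSource kernel: ata11: hard resetting link
Nov 12 17:24:30 TrueSource kernel: ata11.00: qc timeout (cmd 0xec)
Nov 12 17:24:31 TrueSource kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 12 17:24:31 TrueSource kernel: ata11.00: revalidation failed (errno=-5)
Nov 12 17:24:31 TrueSource kernel: ata11: hard resetting link
Nov 12 17:24:40 TrueSource kernel: ata12.00: exception Emask 0x0 SAct 0xfd80000 SErr 0x0 action 0x6 frozen
Nov 12 17:24:40 TrueSource kernel: ata12.00: failed command: READ FPDMA QUEUED
"""

if __name__ == "__main__":
    for port, counts in sorted(summarize(SAMPLE.splitlines()).items()):
        print(port, dict(counts))
```

Several ports on the same controller showing resets and timeouts within seconds of each other points at something the disks share (controller, power, backplane) rather than at a single bad drive.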
Eventually we see this:

Nov 12 17:25:16 TrueSource kernel: sd 12:0:0:0: [sdi] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 12 17:25:16 TrueSource kernel: sd 12:0:0:0: [sdi] tag#10 CDB: opcode=0x8a 8a 00 00 00 00 00 05 24 dd d0 00 00 05 40 00 00
Nov 12 17:25:16 TrueSource kernel: print_req_error: I/O error, dev sdi, sector 86302160
Nov 12 17:25:16 TrueSource kernel: md: disk2 write error, sector=86302096
Nov 12 17:25:16 TrueSource kernel: md: disk2 write error, sector=86302104

which fills the whole of the rest of the syslog. If we look at the SMART report for Disk2 we see that it's empty, indicating that the disk had dropped offline. That can happen with any controller and it usually indicates a bad SATA cable, or a bad power supply (the PSU itself, or the cabling, splitters, etc.), or a bad controller, or bad drive electronics. It's usually a cable problem, unless the controller is a Marvell.

Looking at your most recent diagnostics, Disk2 is showing its SMART report, which shows it to be healthy and connected to the controller, but it's likely to drop again. If you do a search of the forums or just type "unraid marvell" into Google you'll find numerous examples of people having the same problem as you.

Edited November 14, 2019 by John_M: More detail
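As a cross-check on how those lines relate to each other, the CDB bytes in the log can be decoded by hand. The sketch below assumes the standard SCSI WRITE(16) layout (opcode 0x8a, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13) and a 512-byte logical sector; the function name is made up for illustration.

```python
def decode_write16_cdb(cdb_hex):
    """Decode a SCSI WRITE(16) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Hypothetical helper for illustration only.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return lba, blocks


# CDB bytes copied from the syslog line above.
lba, blocks = decode_write16_cdb("8a 00 00 00 00 00 05 24 dd d0 00 00 05 40 00 00")

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors

if __name__ == "__main__":
    print("LBA:", lba)
    print("blocks:", blocks)
    print("bytes:", blocks * SECTOR_SIZE)
    print("offset to first md sector:", lba - 86302096)
```

The decoded LBA equals the sector reported by print_req_error (86302160), and the transfer length of 1344 blocks times 512 bytes gives 688128, the same size as the WRITE DMA EXT that timed out on ata11 in the earlier excerpt. So this is the same failed write surfacing at each layer. The md sector numbers are 64 lower than the device sector; that gap is presumably the start offset of the partition on the disk, but that interpretation is an assumption rather than something the log states.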
John_M Posted November 14, 2019
More recent threads here: and here:
Drogon (Author) Posted November 14, 2019
I really appreciate your help and walking me through what's happening. Thank you!