McWetty Posted January 18

I have 4x 18TB WD181KFGX drives in my Synology DS1019+ that have been doing fine for around 1 year. I opted to move them (and 3x 8TB Red Pros) to my Unraid server to combine all drives into one location (I plan to sell the Synology as my storage demands have increased). The 8TB drives moved over flawlessly and have been running for about a month with zero hiccups.

I pulled the first 18TB for transition and Unraid threw a SMART error flag. So I checked it in a number of SMART test environments and it came back clean. So I put it back in the Syno, rebuilt the array, and everything was fine. Then I pulled a different 18TB from the Syno for transition and stuck it in the Unraid box... it too threw SMART errors. Wut? Fine... I then ran it through Unraid's SMART tests (both short and extended) and it came back clean.

Why is Unraid showing it as having errors if the SMART tests are showing it clean? And why would it only be happening to the 18TB drives?

Here's the log for the second drive from Unraid:

ATA Error Count: 3
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 7465 hours (311 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 20 18 e0 0b 7d 40 08   2d+23:21:11.619  WRITE FPDMA QUEUED
  61 20 10 20 13 81 40 08   2d+23:21:11.617  WRITE FPDMA QUEUED
  61 20 08 20 11 81 40 08   2d+23:21:11.617  WRITE FPDMA QUEUED
  61 20 00 e0 0b 81 40 08   2d+23:21:11.617  WRITE FPDMA QUEUED
  61 20 f0 c0 0a 81 40 08   2d+23:21:11.617  WRITE FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 7465 hours (311 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 60 78 e0 10 7d 40 08   2d+23:19:43.782  WRITE FPDMA QUEUED
  61 20 38 60 08 81 40 08   2d+23:19:43.778  WRITE FPDMA QUEUED
  61 20 30 c0 06 81 40 08   2d+23:19:43.778  WRITE FPDMA QUEUED
  61 20 28 c0 05 81 40 08   2d+23:19:43.778  WRITE FPDMA QUEUED
  61 20 20 a0 00 81 40 08   2d+23:19:43.778  WRITE FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 7417 hours (309 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 20 78 c0 cf 80 40 08      23:20:30.907  WRITE FPDMA QUEUED
  61 20 b0 c0 dd 80 40 08      23:20:30.902  WRITE FPDMA QUEUED
  61 20 a8 c0 dc 80 40 08      23:20:30.902  WRITE FPDMA QUEUED
  61 20 a0 c0 db 80 40 08      23:20:30.902  WRITE FPDMA QUEUED
  61 40 98 a0 d9 80 40 08      23:20:30.902  WRITE FPDMA QUEUED

And here's the attribute page from the first drive (that's back in the Syno):

Final question: what can I do to fix this? I am already RMA'ing one of the drives in hopes that the replacement is fresh and doesn't trigger any SMART warnings. But I don't want to have to RMA every drive unless absolutely necessary, especially since this seems to be an Unraid-only issue. Suggestions?
Solution JorgeB Posted January 18 UDMA CRC errors are usually a SATA cable problem. That's a very low number, but if it keeps increasing, replace the SATA cable. Note that the attribute never resets or goes down.
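For anyone reading the log in the first post: the ER value 84 is what points at the link rather than the platters. Here is a minimal Python sketch of that decoding, assuming the usual ATA error register bit layout (bit 7 = ICRC, bit 2 = ABRT, and so on). The function name `decode_error_register` is made up for illustration and is not part of smartctl or Unraid.

```python
# Subset of ATA error register bits, keyed by bit position.
# Assumption: the standard layout smartctl uses for DMA commands.
ERROR_BITS = {
    7: "ICRC",  # interface CRC error (data corrupted on the cable/link)
    6: "UNC",   # uncorrectable data error (media problem)
    4: "IDNF",  # requested sector ID not found
    2: "ABRT",  # command aborted by the drive
}


def decode_error_register(value):
    """Return the names of the flags set in an ATA error register byte."""
    return [
        name
        for bit, name in sorted(ERROR_BITS.items(), reverse=True)
        if value & (1 << bit)
    ]


# ER = 84 (hex) in all three logged errors above.
print(decode_error_register(0x84))  # ['ICRC', 'ABRT']
```

UNC is not set in any of the three errors, which fits a transfer problem (cable, connector, backplane) rather than bad sectors, and would explain why the short and extended self-tests come back clean.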
ahab666 Posted January 18 Another idea: do a full (read-zeroing-read) Preclear on that drive and check the logs. That might take some 2 to 4 days. Besides the cables, it might be a PSU/PSA problem and, in the worst case, a failure of the mobo's onboard controller and/or a PCIe-installed HBA. The cable swapping might help as well, of course. Sorry JorgeB, could not resist... cheers
itimpi Posted January 18 It is also worth noting that if you click on the orange icon for a drive on the Dashboard tab and select the Acknowledge option from the menu displayed, Unraid will then only notify you again if the attribute changes. I suspect the reason you 'think' this is only an Unraid issue is that the other systems are not reporting CRC errors.
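If you want to keep an eye on the counter yourself, the same "acknowledge, then only alert on change" idea can be sketched in a few lines of Python. This is only an illustration, not how Unraid implements it. It assumes the usual `smartctl -A` table layout (ten whitespace-separated columns with the raw value last), and the `SAMPLE` text is a made-up example, not output from the drives in this thread.

```python
# Hypothetical sample in the usual `smartctl -A` column layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       7465
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       3
"""


def crc_error_count(smart_attributes):
    """Return the raw value of attribute 199 from `smartctl -A` text, or None."""
    for line in smart_attributes.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the first is the ID, the last the raw value.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def needs_attention(acknowledged, current):
    """True only if the counter has grown since it was last acknowledged."""
    return current > acknowledged


current = crc_error_count(SAMPLE)
print(current, needs_attention(acknowledged=3, current=current))
```

Because the attribute never decreases, a stable value after acknowledging means no new CRC errors. Only a growing value is a reason to swap the cable.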
McWetty Posted January 18 Thanks all. I have some other drives I can use to expand my Unraid server in the interim. I will be checking all the cables (both SATA and SAS connections) as suggested. The PSU is a brand new Corsair 800W Plat I got on BF so hopefully I can rule that out as the culprit. I'm pretty sure they've always had those write issues (a bad thunderstorm knocked my Syno offline repeatedly while we were out of town) and I've added a UPS to that server. Side note: does Unraid support UPS shutdown a la Synology? I'll report back if the replacement drive throws errors. Ultimately, if the replacements are ok, I'll just RMA all of them and move on. Thanks again.
Frank1940 Posted January 18 6 hours ago, McWetty said: "The PSU is a brand new Corsair 800W Plat I got on BF so hopefully I can rule that out as the culprit." A few years back, I had a brand new PS fail in the first two weeks after I installed it. As I remember, it was either spontaneously rebooting the server or shutting it down. Luckily, I was only doing an upgrade of the PS at the time, so, being the last thing I had changed, it became the prime suspect! (As a retired engineer, I learned during my career to make one change at a time and look for any problems that the change might introduce!) Thinking that a new component will always be good is not a justifiable assumption. Google "infant mortality bathtub curve" for insight into the failure modes of complex assemblies of components.
McWetty Posted January 19 16 hours ago, Frank1940 said: "A few years back, I had a brand new PS fail in the first two weeks after I installed it. As I remember, it was either spontaneously rebooting the server or shutting it down. Luckily, I was only doing an upgrade of the PS at the time, so, being the last thing I had changed, it became the prime suspect! (As a retired engineer, I learned during my career to make one change at a time and look for any problems that the change might introduce!) Thinking that a new component will always be good is not a justifiable assumption. Google "infant mortality bathtub curve" for insight into the failure modes of complex assemblies of components." As a physicist (hello fellow math nerd), I completely agree about changing one thing at a time. The PSU has been powering the 3x 8TB drives (and assorted peripherals) very well for the past few months. The only thing I changed was the 18TB install. Given that CRC errors are being reported, I will change that one variable first and see how it goes from there. If I see other CRC errors, I'll look at other suspects like cables. I'm also not going to sweat the CRC errors if they stay stagnant, as suggested by the mods.