WD 18TB Red Pros showing SMART errors in Unraid but not Synology


Go to solution Solved by JorgeB,

I have 4x 18TB WD181KFGX drives in my Synology DS1019+ that have been doing fine for around 1 year. I opted to move them (and 3x 8TB Red Pros) to my Unraid server to combine all drives into one location (I plan to sell the Synology as my storage demands have increased).

 

The 8TB drives moved over flawlessly and have been running for about a month with zero hiccups. I pulled the first 18TB for transition and Unraid threw a SMART error flag. I checked it in a number of SMART test environments and it came back clean, so I put it back in the Syno, rebuilt the array, and everything was fine. Then I pulled a different 18TB from the Syno for transition and stuck it in the Unraid box... it too threw SMART errors. Wut?

 

Fine... I then ran it through Unraid's SMART tests (both short and extended) and it came back clean. Why is Unraid showing it as having errors if the SMART tests are showing it clean? And why would it only be happening to the 18TB drives?

 

Here's the log for the second drive from Unraid:

ATA Error Count: 3
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 7465 hours (311 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 20 18 e0 0b 7d 40 08   2d+23:21:11.619  WRITE FPDMA QUEUED
  61 20 10 20 13 81 40 08   2d+23:21:11.617  WRITE FPDMA QUEUED
  61 20 08 20 11 81 40 08   2d+23:21:11.617  WRITE FPDMA QUEUED
  61 20 00 e0 0b 81 40 08   2d+23:21:11.617  WRITE FPDMA QUEUED
  61 20 f0 c0 0a 81 40 08   2d+23:21:11.617  WRITE FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 7465 hours (311 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 60 78 e0 10 7d 40 08   2d+23:19:43.782  WRITE FPDMA QUEUED
  61 20 38 60 08 81 40 08   2d+23:19:43.778  WRITE FPDMA QUEUED
  61 20 30 c0 06 81 40 08   2d+23:19:43.778  WRITE FPDMA QUEUED
  61 20 28 c0 05 81 40 08   2d+23:19:43.778  WRITE FPDMA QUEUED
  61 20 20 a0 00 81 40 08   2d+23:19:43.778  WRITE FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 7417 hours (309 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 20 78 c0 cf 80 40 08      23:20:30.907  WRITE FPDMA QUEUED
  61 20 b0 c0 dd 80 40 08      23:20:30.902  WRITE FPDMA QUEUED
  61 20 a8 c0 dc 80 40 08      23:20:30.902  WRITE FPDMA QUEUED
  61 20 a0 c0 db 80 40 08      23:20:30.902  WRITE FPDMA QUEUED
  61 40 98 a0 d9 80 40 08      23:20:30.902  WRITE FPDMA QUEUED

 

And here's the attribute page from the first drive (that's back in the Syno):

[Image: screenshot of the first drive's SMART attribute page, taken 2024-01-16]

 

Final question: what can I do to fix this? I am already RMA'ing one of the drives in hopes that the replacement is fresh and doesn't trigger any SMART warnings. But I don't want to have to RMA every drive unless absolutely necessary. Especially since this seems to be an Unraid-only issue. Suggestions?

Link to comment

Another idea: do a full (read, zero, read) Preclear on that drive and check the logs; it might take some 2 to 4 days. Besides the cables, it might be a PSU/PSA problem or, in the worst case, a failure of the mobo's onboard controller and/or a PCIe-installed HBA. Swapping the cables might help as well, of course.

 

Sorry JorgeB, could not resist...

 

cheers

Link to comment

It is also worth noting that if you click on the orange icon for a drive on the Dashboard tab and select the Acknowledge option from the menu displayed, Unraid will then only notify you again if the attribute changes.

 

I suspect the reason you 'think' this is only an Unraid issue is that the other systems are not reporting CRC errors.
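
For reference, the `84` in the ER column of the log above is a bit field, and decoding it shows why each entry is labelled `ICRC, ABRT`. The Python sketch below assumes the usual ATA error-register bit assignments (bit 7 = ICRC, bit 2 = ABRT, and so on). The function name and bit table are illustrative only; they are not part of smartctl or Unraid.

```python
# Assumed ATA error-register bit names (bit 0 is left out on purpose,
# since its meaning depends on the command that failed).
ERROR_BITS = {
    7: "ICRC",  # interface CRC error on the link
    6: "UNC",   # uncorrectable data
    5: "MC",    # media changed
    4: "IDNF",  # sector ID not found
    3: "MCR",   # media change request
    2: "ABRT",  # command aborted
    1: "NM",    # no media
}


def decode_error_register(er_hex):
    """Return the names of the bits set in a hex ER value such as '84'."""
    value = int(er_hex, 16)
    return [
        name
        for bit, name in sorted(ERROR_BITS.items(), reverse=True)
        if value & (1 << bit)
    ]


# ER value taken from the three log entries above.
print(decode_error_register("84"))  # ['ICRC', 'ABRT']
```

ICRC flags data that was corrupted in transit between the drive and the controller, which would be consistent with the suggestions in this thread to look at cables, power and the controller rather than at the disk surface.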

Link to comment

Thanks all. I have some other drives I can use to expand my Unraid server in the interim. I will be checking all the cables (both SATA and SAS connections) as suggested. The PSU is a brand new Corsair 800W Plat I got on BF so hopefully I can rule that out as the culprit. I'm pretty sure they've always had those write issues (a bad thunderstorm knocked my Syno offline repeatedly while we were out of town) and I've added a UPS to that server.

 

Side note: does Unraid support UPS shutdown a la Synology?

 

I'll report back if the replacement drive throws errors. Ultimately, if the replacements are ok, I'll just RMA all of them and move on. Thanks again. 

Link to comment
6 hours ago, McWetty said:

The PSU is a brand new Corsair 800W Plat I got on BF so hopefully I can rule that out as the culprit.

 

A few years back, I had a brand new PS fail in the first two weeks after I installed it. As I remember, it was either spontaneously rebooting the server or shutting it down. Luckily, I was only doing an upgrade of the PS at the time. So it being the last thing I had changed, it became the prime suspect! (As a retired Engineer, I had learned in my career in that profession to make one change at a time and look for any problems that the change might introduce!) Thinking that a new component will always be good is not a justifiable assumption. Google "infant mortality bathtub curve" for insight into the failure modes of complex assemblies of components.

Link to comment
16 hours ago, Frank1940 said:

 

A few years back, I had a brand new PS fail in the first two weeks after I installed it. As I remember, it was either spontaneously rebooting the server or shutting it down. Luckily, I was only doing an upgrade of the PS at the time. So it being the last thing I had changed, it became the prime suspect! (As a retired Engineer, I had learned in my career in that profession to make one change at a time and look for any problems that the change might introduce!) Thinking that a new component will always be good is not a justifiable assumption. Google "infant mortality bathtub curve" for insight into the failure modes of complex assemblies of components.

As a physicist (hello fellow math nerd), I completely agree about changing one thing at a time. The PSU has been powering the 3x 8TB drives (and assorted peripherals) very well for the past few months. The only thing I changed was the 18TB install. Given that the CRC errors are reported, I will change that one variable first and see how it goes from there. If I see other CRC errors, I'll look at other suspects like cables.

 

I'm also not going to sweat the CRC errors if they stay stagnant as suggested by the mods. 
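
One way to check that the count really stays put is to compare attribute 199 (UDMA_CRC_Error_Count) between two runs of `smartctl -A`. Below is a minimal Python sketch, assuming the standard ten-column attribute table. The sample text is made up for illustration and is not output from the drives in this thread, and the helper name `crc_error_count` is hypothetical.

```python
# Hypothetical sample of `smartctl -A` output (values invented for the example).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       7465
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       3
"""


def crc_error_count(smartctl_output):
    """Return the raw value of attribute 199, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


previous = 3  # value noted down at the last check (assumed)
current = crc_error_count(SAMPLE)
if current is not None and current > previous:
    print("CRC error count went up: check cables and controller")
else:
    print("CRC error count unchanged")
```

In practice the text would come from running `smartctl -A` against the drive's device node instead of the `SAMPLE` string. A raw value that keeps climbing points at an ongoing link problem, while a static value only records past events.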

Link to comment
