Disk2 and 3 read errors


Hi,

 

I have a lot of errors in my syslog:

Jan  8 15:31:12 Serveur root: error: /webGui/include/ProcessStatus.php: wrong csrf_token
Jan  8 15:34:36 Serveur kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan  8 15:34:36 Serveur kernel: ata3.00: error: { UNC }
Jan  8 15:34:36 Serveur kernel: print_req_error: I/O error, dev sde, sector 20482352
Jan  8 15:34:36 Serveur kernel: md: disk2 read error, sector=20482288
...

and

Jan  8 15:41:00 Serveur root: Fix Common Problems: Error: disk2 (WDC_WD40EFRX-68WT0N0_WD-WCC4E0FCJZXN) has read errors
Jan  8 15:52:21 Serveur kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan  8 15:52:21 Serveur kernel: ata4.00: error: { UNC }
Jan  8 15:52:21 Serveur kernel: print_req_error: I/O error, dev sdf, sector 2078284184
Jan  8 15:52:21 Serveur kernel: md: disk3 read error, sector=2078284120
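The two excerpts above follow the same pattern: a block-device I/O error on `sde` or `sdf`, immediately followed by an md read error on disk2 or disk3. A minimal Python sketch for pairing them up is shown here; the function name and regular expressions are illustrative, not part of Unraid. In both pairs the device sector is 64 higher than the md sector, which presumably reflects the partition start offset on the disk (an assumption, not something the log states).

```python
import re

# Lines copied from the syslog excerpts above.
SYSLOG = """\
Jan  8 15:34:36 Serveur kernel: print_req_error: I/O error, dev sde, sector 20482352
Jan  8 15:34:36 Serveur kernel: md: disk2 read error, sector=20482288
Jan  8 15:52:21 Serveur kernel: print_req_error: I/O error, dev sdf, sector 2078284184
Jan  8 15:52:21 Serveur kernel: md: disk3 read error, sector=2078284120
"""

DEV_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")
MD_RE = re.compile(r"md: (disk\d+) read error, sector=(\d+)")


def pair_errors(text):
    """Pair each block-device error with the md read error that follows it.

    Returns (array slot, device, device sector - md sector) tuples.
    """
    pairs = []
    pending = None  # last (device, sector) seen and not yet paired
    for line in text.splitlines():
        dev = DEV_RE.search(line)
        if dev:
            pending = (dev.group(1), int(dev.group(2)))
            continue
        md = MD_RE.search(line)
        if md and pending:
            offset = pending[1] - int(md.group(2))
            pairs.append((md.group(1), pending[0], offset))
            pending = None
    return pairs


for disk, dev, offset in pair_errors(SYSLOG):
    print(f"{disk} ({dev}): device sector - md sector = {offset}")
```

Run against a full syslog, the same pairing gives a quick per-disk list of failing sectors to compare with the SMART error log.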

 

and in the SMART error log of my disk2

ATA Error Count: 21 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 90 c4 95 e0  Error: UNC at LBA = 0x0095c490 = 9815184

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 90 c4 95 e0 08  24d+11:46:58.603  READ DMA
  ca 00 f0 a0 c2 95 e0 08  24d+11:46:58.588  WRITE DMA
  c8 00 00 90 c3 95 e0 08  24d+11:46:56.557  READ DMA
  ef 10 02 00 00 00 a0 08  24d+11:46:56.557  SET FEATURES [Enable SATA feature]

Error 20 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 a0 c2 95 e0  Error: UNC at LBA = 0x0095c2a0 = 9814688

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 90 c2 95 e0 08  24d+11:46:52.581  READ DMA
  c8 00 00 90 c1 95 e0 08  24d+11:46:52.148  READ DMA
  c8 00 00 90 c0 95 e0 08  24d+11:46:50.993  READ DMA
  ca 00 48 48 bf 95 e0 08  24d+11:46:50.993  WRITE DMA
  c8 00 00 90 bf 95 e0 08  24d+11:46:50.218  READ DMA

Error 19 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 48 bf 95 e0  Error: UNC at LBA = 0x0095bf48 = 9813832

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 90 be 95 e0 08  24d+11:46:46.118  READ DMA
  c8 00 00 90 bd 95 e0 08  24d+11:46:44.709  READ DMA
  c8 00 00 90 bc 95 e0 08  24d+11:46:42.665  READ DMA
  c8 00 00 90 bb 95 e0 08  24d+11:46:41.865  READ DMA
  c8 00 00 90 ba 95 e0 08  24d+11:46:40.233  READ DMA

Error 18 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d8 14 93 e0  Error: UNC at LBA = 0x009314d8 = 9639128

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 90 14 93 e0 08  24d+11:44:56.067  READ DMA
  c8 00 00 90 13 93 e0 08  24d+11:44:54.889  READ DMA
  c8 00 00 90 12 93 e0 08  24d+11:44:54.020  READ DMA
  c8 00 00 90 11 93 e0 08  24d+11:44:52.813  READ DMA
  c8 00 00 90 10 93 e0 08  24d+11:44:51.868  READ DMA

Error 17 occurred at disk power-on lifetime: 15294 hours (637 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 c5 95 e0  Error: UNC at LBA = 0x0095c580 = 9815424

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 60 c5 95 e0 08  22d+18:06:17.868  READ DMA
  ca 00 48 18 c3 95 e0 08  22d+18:06:17.868  WRITE DMA
  ef 10 02 00 00 00 a0 08  22d+18:06:17.851  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  22d+18:06:17.846  IDENTIFY DEVICE
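For readers unfamiliar with this log format, the LBA on each "Error: UNC" line is assembled from the register bytes printed just before it: the low four bits of DH, then CH, CL and SN. The sketch below (the function name is illustrative) reproduces the five LBAs reported above from those bytes.

```python
def lba28(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    Only the low nibble of DH carries address bits; the high nibble
    holds mode and device-select flags.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# (SN, CL, CH, DH) bytes and the LBA smartctl printed, errors 21 down to 17.
ERRORS = {
    21: ((0x90, 0xC4, 0x95, 0xE0), 9815184),
    20: ((0xA0, 0xC2, 0x95, 0xE0), 9814688),
    19: ((0x48, 0xBF, 0x95, 0xE0), 9813832),
    18: ((0xD8, 0x14, 0x93, 0xE0), 9639128),
    17: ((0x80, 0xC5, 0x95, 0xE0), 9815424),
}

for number, (registers, reported) in ERRORS.items():
    decoded = lba28(*registers)
    print(f"Error {number}: LBA {decoded} (0x{decoded:08x}), matches log: {decoded == reported}")
```

All five errors fall within a narrow LBA range (roughly 9.64 to 9.82 million), which suggests a localized bad area rather than errors scattered across the disk.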

 

and in Fix Common Problems (the disk is not disabled):

If the disk has not been disabled, then unRaid has successfully rewritten the contents of the offending sectors back to the hard drive. 

 

Any idea what my problem is?

 

Thanks,


Both disk2 and disk3 appear to be failing, though these errors can be intermittent: disk1 had similar errors when it was almost new and none since. I would recommend running an extended SMART test, and running it on all disks, since others are also showing some warning signs (mostly raw read errors). Then post new diagnostics.

1 minute ago, Jobine said:

What would be the best way to replace my disk 2 and my disk 3 later?

What is the procedure for replacing a disk and rebuilding?

 

Thanks,

Why do you say "later"? Certainly they have to be done one at a time, but if by "later" you mean some time in the not-so-near future, then you need to be aware that any disk not working well compromises the protection you get from parity.

3 minutes ago, Jobine said:

Should I manually move my files to other disks with Krusader to see which ones don't work?

Don't move anything on the array. Your array is already compromised. If you want you can copy data to someplace off the array, such as your PC or an external drive with Unassigned Devices.

 

Do you have another copy of anything important and irreplaceable? That should be your first priority.


Just to be sure...

  1. Turn off the server
  2. Install the new disk
  3. Turn on the server
  4. Preclear the new disk
  5. Turn off the server
  6. Remove disk2
  7. Install the new disk in place of disk2
  8. Turn on the server
  9. Assign the new disk to disk2
  10. Start the parity sync / data rebuild
  11. Wait and hope 🤔
  12. Repeat 1-11 for disk3

I thought I was safe from this problem with Unraid... 

Do you suggest that I add a second parity disk?

17 minutes ago, Jobine said:

I recently added some big files, 10 and 40 GB. Could this be my problem?

Unlikely

 

18 minutes ago, Jobine said:

How do I proceed to change a disk?

  1. shut down
  2. replace one disk with a new one of the same or larger size, but no larger than parity.
  3. boot up
  4. assign new disk to the slot for the missing disk
  5. start array to begin rebuild
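The size rule in step 2 can be written as a one-line check. This is only a sketch with a hypothetical helper name; sizes are in TB, and the 4 TB figures assume the WD40EFRX drives named in this thread are all 4 TB models.

```python
def can_replace(old_tb, new_tb, parity_tb):
    """True if the replacement is at least as large as the old disk
    and no larger than the parity disk."""
    return old_tb <= new_tb <= parity_tb


# With a 4 TB parity disk, a failing 4 TB data disk needs exactly 4 TB.
print(can_replace(old_tb=4, new_tb=4, parity_tb=4))  # same size: allowed
print(can_replace(old_tb=4, new_tb=6, parity_tb=4))  # larger than parity: not allowed
print(can_replace(old_tb=4, new_tb=3, parity_tb=4))  # smaller than the old disk: not allowed
```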

If for some reason it offers to Format anything, DON'T.

 

After the rebuild completes run a non-correcting parity check. Keep the original disk for now in case it is needed as backup for file recovery.

 

Repeat for other disk.

 

If you have any doubts, questions, or problems, come back for further advice.

 

2 minutes ago, Jobine said:

Just to be sure...

  1. Turn off the server
  2. Install the new disk
  3. Turn on the server
  4. Preclear the new disk
  5. Turn off the server
  6. Remove disk2
  7. Install the new disk in place of disk2
  8. Turn on the server
  9. Assign the new disk to disk2
  10. Start the parity sync / data rebuild
  11. Wait and hope 🤔
  12. Repeat 1-11 for disk3

 

That's OK too. Technically you don't have to preclear but it is one way to give a new disk a good testing.

 

3 minutes ago, Jobine said:

I thought I was safe from this problem with Unraid... 

Do you suggest that I add a second parity disk?

How long have you been getting errors on multiple disks? You need to attend to each problem so you don't get into a situation where you have multiple problems.

 

Do you have Notifications set up to alert you by email or another agent when Unraid detects a problem?

 

With only 5 data disks it is debatable whether parity2 is justified. If you are diligent you should be able to deal with problems one disk at a time.


You might have to turn on the Array Status notification to get messages about simple I/O errors that don't result in SMART warnings or disabled disks, like what you have experienced. But the status notification only goes out on schedule instead of instantly.

 

Maybe monitoring additional SMART attributes besides the default ones would help. For my WD Red disks, I have added monitoring of attributes 1 and 200, on @johnnie.black's recommendation.
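On WD Red drives, attribute 1 is Raw_Read_Error_Rate and attribute 200 is Multi_Zone_Error_Rate. As a rough illustration of what watching them means, the sketch below pulls their raw values out of text laid out like the attribute table printed by `smartctl -A`. The sample rows and their values are made up for the example and are not taken from the disks in this thread; the function name is also hypothetical.

```python
WATCHED = {1, 200}  # the two extra attribute IDs mentioned above


def watched_raw_values(smartctl_output, watched=WATCHED):
    """Return {attribute id: (name, raw value)} for the watched IDs."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in watched:
                # Assumes a plain integer raw value, as these two attributes report.
                values[attr_id] = (fields[1], int(fields[9]))
    return values


# Illustrative sample only: invented values in the usual column layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       3
"""

for attr_id, (name, raw) in sorted(watched_raw_values(SAMPLE).items()):
    status = "check this disk" if raw > 0 else "ok"
    print(f"{attr_id:>3} {name}: raw={raw} ({status})")
```

Treating any non-zero raw value as worth a look is a design choice for the sketch, not an Unraid rule; what matters in practice is usually whether the value keeps increasing.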

 

 


Your best bet would likely be to add a second parity drive now. If there are just a few read errors on each disk, they likely won't be on the same sectors, so with a little luck parity2 could build 100% successfully. There's a small chance that a disk could become disabled, but disks are much more likely to error on reads than on writes, so IMO there's a good chance it would work.
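The reasoning here is that a sector which cannot be read from one disk can still be reconstructed from parity and the remaining disks, so building parity2 only runs into trouble where two disks fail at the same sector. A small sketch of that overlap check follows, using the two md sectors reported in the syslog earlier in the thread; the function name and the simplified one-failure-per-sector model are assumptions made for illustration.

```python
def unrecoverable_sectors(bad_by_disk, parity_count=1):
    """Return sectors where more disks fail to read than parity can cover."""
    counts = {}
    for sectors in bad_by_disk.values():
        for sector in set(sectors):  # count each disk at most once per sector
            counts[sector] = counts.get(sector, 0) + 1
    return sorted(s for s, n in counts.items() if n > parity_count)


# md sectors from the syslog excerpts in the first post.
reported = {
    "disk2": {20482288},
    "disk3": {2078284120},
}
print(unrecoverable_sectors(reported))  # no overlap between the two disks

# Hypothetical worst case: both disks fail at the same sector.
print(unrecoverable_sectors({"disk2": {100}, "disk3": {100}}))
```

This only covers sectors already known to be bad; new read errors that appear during the build are the "little luck" part.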

2 hours ago, trurl said:

You might have to turn on the Array Status notification to get messages about simple I/O errors that don't result in SMART warnings or disabled disks, like what you have experienced. But the status notification only goes out on schedule instead of instantly.

 

Maybe monitoring additional SMART attributes besides the default ones would help. For my WD Red disks, I have added monitoring of attributes 1 and 200, on @johnnie.black's recommendation.

 

 

So I just need to add 1,200 under "SMART attribute notifications:", in addition to the ones below?


Last email received:

Quote

Event: Unraid Status
Subject: Notice [SERVEUR] - array health report [FAIL]
Description: Array has 7 disks (including parity & cache)
Importance: warning

Parity - WDC_WD40EFRX-68WT0N0_WD-WCC4E2TCRKH5 (sdc) - active 29 C [OK]
Disk 1 - WDC_WD40EFRX-68WT0N0_WD-WCC4E7JD0XVZ (sdd) - active 33 C [OK]
Disk 2 - WDC_WD40EFRX-68WT0N0_WD-WCC4E0FCJZXN (sde) - active 29 C (disk has read errors) [NOK]
Disk 3 - WDC_WD40EFRX-68N32N0_WD-WCC7K1VK78XA (sdf) - active 27 C (disk has read errors) [NOK]
Disk 4 - WDC_WD40EFRX-68WT0N0_WD-WCC4E5UFLDLZ (sdg) - active 29 C [OK]
Disk 5 - WDC_WD40EFRX-68WT0N0_WD-WCC4E6FT6N6E (sdh) - active 28 C [OK]
Cache - INTEL_SSDPEKKW512G7_BTPY63520CAY512F (nvme0n1) - active 38 C [OK]

Parity is valid
Last checked on Friday, 2019-01-04, 10:33 (6 days ago), finding 0 errors.
Duration: 10 hours, 33 minutes, 18 seconds. Average speed: 105.3 MB/s
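The average speed in this report is simply the parity disk size divided by the elapsed time. A quick check, assuming the parity disk is a nominal 4 TB (4 × 10^12 bytes), which the WD40EFRX model name suggests but the report does not state:

```python
def average_speed_mb_s(size_bytes, hours, minutes, seconds):
    """Average throughput in MB/s (1 MB = 10**6 bytes) over the whole run."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return size_bytes / elapsed / 1e6


PARITY_BYTES = 4 * 10**12  # assumption: nominal 4 TB parity disk

speed = average_speed_mb_s(PARITY_BYTES, hours=10, minutes=33, seconds=18)
print(f"{speed:.1f} MB/s")  # should land close to the 105.3 MB/s in the report
```

The result agrees with the reported figure to within rounding, so the check read the whole disk at a normal pace despite the read errors logged a few days later.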

 


Yes that is telling you about the read errors. Often those don't really indicate a bad disk though. Usually it is something else like the connection. But you could take that as an indication that you should investigate further and maybe run an extended SMART test. I guess sometimes the monitored SMART attributes aren't enough to know for sure if a disk is healthy.


Disk2 complete

Event: Unraid Disk 2 message
Subject: Notice [SERVEUR] - Disk 2 returned to normal operation
Description: WDC_WD40EFRX-68N32N0_WD-WCC7K6SRHYR0 (sdd)
Importance: normal

Event: Unraid Parity sync / Data rebuild
Subject: Notice [SERVEUR] - Parity sync / Data rebuild finished (254 errors)
Description: Duration: 11 hours, 11 minutes, 46 seconds. Average speed: 99.3 MB/s
Importance: warning

254 errors 🤔 (exactly the number I have on disk3)

Now for disk3.
