Disk2 and 3 read errors

Jobine · January 8, 2019

Hi,

I have a lot of errors in my syslog

Jan  8 15:31:12 Serveur root: error: /webGui/include/ProcessStatus.php: wrong csrf_token
Jan  8 15:34:36 Serveur kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan  8 15:34:36 Serveur kernel: ata3.00: error: { UNC }
Jan  8 15:34:36 Serveur kernel: print_req_error: I/O error, dev sde, sector 20482352
Jan  8 15:34:36 Serveur kernel: md: disk2 read error, sector=20482288
...

and

Jan  8 15:41:00 Serveur root: Fix Common Problems: Error: disk2 (WDC_WD40EFRX-68WT0N0_WD-WCC4E0FCJZXN) has read errors
Jan  8 15:52:21 Serveur kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan  8 15:52:21 Serveur kernel: ata4.00: error: { UNC }
Jan  8 15:52:21 Serveur kernel: print_req_error: I/O error, dev sdf, sector 2078284184
Jan  8 15:52:21 Serveur kernel: md: disk3 read error, sector=2078284120
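
(Side note for anyone reading these logs: the kernel line gives the sector on the raw device, sde or sdf, while the md line gives the sector within the array disk. Below is a minimal Python sketch that pairs the four lines quoted above and prints the difference. The function name pair_errors and the pairing-by-order logic are assumptions made for illustration, not anything Unraid provides.)

```python
import re

# Sample lines copied from the syslog excerpts above.
SYSLOG = """\
Jan  8 15:34:36 Serveur kernel: print_req_error: I/O error, dev sde, sector 20482352
Jan  8 15:34:36 Serveur kernel: md: disk2 read error, sector=20482288
Jan  8 15:52:21 Serveur kernel: print_req_error: I/O error, dev sdf, sector 2078284184
Jan  8 15:52:21 Serveur kernel: md: disk3 read error, sector=2078284120
"""

DEV_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")
MD_RE = re.compile(r"md: (disk\d+) read error, sector=(\d+)")

def pair_errors(text):
    """Pair each kernel I/O error with the md read error that follows it."""
    pairs = []
    pending = None  # last (device, sector) seen on a print_req_error line
    for line in text.splitlines():
        m = DEV_RE.search(line)
        if m:
            pending = (m.group(1), int(m.group(2)))
            continue
        m = MD_RE.search(line)
        if m and pending:
            dev, dev_sector = pending
            pairs.append((m.group(1), dev, dev_sector - int(m.group(2))))
            pending = None
    return pairs

for disk, dev, offset in pair_errors(SYSLOG):
    print(f"{disk} ({dev}): device sector - md sector = {offset}")
```

For both pairs the difference is 64 sectors, which would be consistent with the data partition starting at sector 64 of each device, though nothing in this thread confirms the partition layout.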

and in the SMART error log of disk2:

ATA Error Count: 21 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 90 c4 95 e0  Error: UNC at LBA = 0x0095c490 = 9815184

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 90 c4 95 e0 08  24d+11:46:58.603  READ DMA
  ca 00 f0 a0 c2 95 e0 08  24d+11:46:58.588  WRITE DMA
  c8 00 00 90 c3 95 e0 08  24d+11:46:56.557  READ DMA
  ef 10 02 00 00 00 a0 08  24d+11:46:56.557  SET FEATURES [Enable SATA feature]

Error 20 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 a0 c2 95 e0  Error: UNC at LBA = 0x0095c2a0 = 9814688

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 90 c2 95 e0 08  24d+11:46:52.581  READ DMA
  c8 00 00 90 c1 95 e0 08  24d+11:46:52.148  READ DMA
  c8 00 00 90 c0 95 e0 08  24d+11:46:50.993  READ DMA
  ca 00 48 48 bf 95 e0 08  24d+11:46:50.993  WRITE DMA
  c8 00 00 90 bf 95 e0 08  24d+11:46:50.218  READ DMA

Error 19 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 48 bf 95 e0  Error: UNC at LBA = 0x0095bf48 = 9813832

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 90 be 95 e0 08  24d+11:46:46.118  READ DMA
  c8 00 00 90 bd 95 e0 08  24d+11:46:44.709  READ DMA
  c8 00 00 90 bc 95 e0 08  24d+11:46:42.665  READ DMA
  c8 00 00 90 bb 95 e0 08  24d+11:46:41.865  READ DMA
  c8 00 00 90 ba 95 e0 08  24d+11:46:40.233  READ DMA

Error 18 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d8 14 93 e0  Error: UNC at LBA = 0x009314d8 = 9639128

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 90 14 93 e0 08  24d+11:44:56.067  READ DMA
  c8 00 00 90 13 93 e0 08  24d+11:44:54.889  READ DMA
  c8 00 00 90 12 93 e0 08  24d+11:44:54.020  READ DMA
  c8 00 00 90 11 93 e0 08  24d+11:44:52.813  READ DMA
  c8 00 00 90 10 93 e0 08  24d+11:44:51.868  READ DMA

Error 17 occurred at disk power-on lifetime: 15294 hours (637 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 c5 95 e0  Error: UNC at LBA = 0x0095c580 = 9815424

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 60 c5 95 e0 08  22d+18:06:17.868  READ DMA
  ca 00 48 18 c3 95 e0 08  22d+18:06:17.868  WRITE DMA
  ef 10 02 00 00 00 a0 08  22d+18:06:17.851  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08  22d+18:06:17.846  IDENTIFY DEVICE
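
(For reference, the "LBA = ..." value on each error line can be recomputed from the register dump next to it. The sketch below assumes the usual 28-bit LBA layout for READ DMA: SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low nibble of DH bits 24-27. The function name is made up for this example.)

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the SN, CL, CH and DH registers into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# (SN, CL, CH, DH) copied from the five error entries above.
errors = {
    21: (0x90, 0xC4, 0x95, 0xE0),
    20: (0xA0, 0xC2, 0x95, 0xE0),
    19: (0x48, 0xBF, 0x95, 0xE0),
    18: (0xD8, 0x14, 0x93, 0xE0),
    17: (0x80, 0xC5, 0x95, 0xE0),
}

for number, regs in errors.items():
    lba = lba28_from_registers(*regs)
    print(f"Error {number}: LBA = 0x{lba:08x} = {lba}")
```

The printed values match the LBAs smartctl reports in the log, so four of the five logged errors fall within about 1600 sectors of each other (9813832 to 9815424), with the fifth at 9639128.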

and in Fix Common Problems (the disk is not disabled):

If the disk has not been disabled, then unRaid has successfully rewritten the contents of the offending sectors back to the hard drive.

Any idea what my problem is?

Thanks,

JorgeB · January 9, 2019

Please post the diagnostics: Tools -> Diagnostics

Jobine · January 9, 2019

Thanks!

serveur-diagnostics-20190109-0617.zip

JorgeB · January 9, 2019

Both disks 2 and 3 appear to be failing, though these errors can be intermittent, similar to the disk1 errors from when it was almost new, which haven't recurred since. I would recommend running an extended SMART test, and running it on all disks since others are also showing some warning signs, mostly raw read errors. Then post new diags.

Jobine · January 9, 2019

Thanks,

serveur-smart-20190109-1009.zip

JorgeB · January 9, 2019

# 1  Extended offline    Completed: read failure       90%     12704         112335120

# 1  Extended offline    Completed: read failure       90%     15542         9640664

Both failed the test and need to be replaced, since there's only one parity there will likely be some data loss.
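
If you want to pull the fields out of those self-test lines yourself, here is a rough Python sketch. It assumes the usual smartctl column order (test number, description, status, percent remaining, lifetime hours, LBA of first error); the regex and the parse_selftest name are my own, not part of any tool.

```python
import re

# The two self-test log lines quoted above, copied verbatim.
LINES = [
    "# 1  Extended offline    Completed: read failure       90%     12704         112335120",
    "# 1  Extended offline    Completed: read failure       90%     15542         9640664",
]

ROW_RE = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)$"
)

def parse_selftest(line):
    """Split one self-test log row into named fields, or return None."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    lba = m.group("lba")
    return {
        "test": m.group("test"),
        "status": m.group("status"),
        "remaining_pct": int(m.group("remaining")),
        "lifetime_hours": int(m.group("hours")),
        # A passing test shows "-" here, so only convert real numbers.
        "first_error_lba": int(lba) if lba.isdigit() else None,
    }

for line in LINES:
    row = parse_selftest(line)
    failed = "read failure" in row["status"]
    print(row["lifetime_hours"], row["first_error_lba"], "FAILED" if failed else "ok")
```

Both rows stopped with 90% of the test remaining, meaning the first unreadable sector was hit early in the scan.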

Jobine · January 9, 2019

What would be the best way to replace my disk 2 and my disk 3 later?

What is the procedure for replacing a disk and rebuilding?

Thanks,

Jobine · January 9, 2019

Should I manually move my files to other disks with Krusader to see which ones don't work?

trurl · January 9, 2019

1 minute ago, Jobine said:

What would be the best way to replace my disk 2 and my disk 3 later?

What is the procedure for replacing a disk and rebuilding?

Thanks,

Why do you say "later"? Certainly they have to be done one at a time but if by later you mean some time in the not so near future then you need to be aware that any disk not working well compromises any protection you have from parity.

Jobine · January 9, 2019

One disk at a time: I'll replace disk 2, then disk 3 right after.

trurl · January 9, 2019

3 minutes ago, Jobine said:

Should I manually move my files to other disks with Krusader to see which ones don't work?

Don't move anything on the array. Your array is already compromised. If you want you can copy data to someplace off the array, such as your PC or an external drive with Unassigned Devices.

Do you have another copy of anything important and irreplaceable? That should be your first priority.

Jobine · January 9, 2019

I recently added two big files, 10 and 40 GB. Could this be my problem?

How do I proceed to change a disk? Hot swap? Or turn off, change the disk, turn back on?

Jobine · January 9, 2019

Just to be sure...

1. Turn off the server
2. Install the new disk
3. Turn on the server
4. PreClear the new disk
5. Turn off the server
6. Remove disk2
7. Install the new disk in place of disk2
8. Turn on the server
9. Assign the new disk to disk2.
10. Start the Parity sync
11. Waiting and hoping 🤔
12. Repeat 1-11 for disk3.

I thought I was safe from this problem with Unraid...

Do you suggest that I add a second parity disk?

trurl · January 9, 2019

17 minutes ago, Jobine said:

I recently added two big files, 10 and 40 GB. Could this be my problem?

Unlikely

18 minutes ago, Jobine said:

How do I proceed to change a disk?

shut down
replace one disk with a new one of the same or larger size, but no larger than parity.
boot up
assign new disk to the slot for the missing disk
start array to begin rebuild

If for some reason it offers to Format anything, DON'T.

After the rebuild completes run a non-correcting parity check. Keep the original disk for now in case it is needed as backup for file recovery.

Repeat for other disk.

If you have any doubts, questions, or problems, come back for further advice.
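
The size rule in the steps above (same size or larger than the old disk, but no larger than parity) boils down to a single comparison. A tiny Python sketch, with nominal sizes chosen only for illustration and a function name I made up:

```python
TB = 10**12  # nominal terabyte, used only for illustration

def can_replace(old_size, new_size, parity_size):
    """True if new_size is at least old_size and at most parity_size."""
    return old_size <= new_size <= parity_size

# Old disk and parity are both nominally 4 TB in these examples.
print(can_replace(4 * TB, 4 * TB, 4 * TB))  # same size as old disk and parity
print(can_replace(4 * TB, 8 * TB, 4 * TB))  # larger than parity
print(can_replace(4 * TB, 3 * TB, 4 * TB))  # smaller than the old disk
```

With a 4 TB parity and 4 TB data disks, only another 4 TB disk passes the check. A real check would compare the exact capacities reported by the disks, not nominal sizes.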

trurl · January 9, 2019

2 minutes ago, Jobine said:

Just to be sure...

1. Turn off the server
2. Install the new disk
3. Turn on the server
4. PreClear the new disk
5. Turn off the server
6. Remove disk2
7. Install the new disk in place of disk2
8. Turn on the server
9. Assign the new disk to disk2.
10. Start the Parity sync
11. Waiting and hoping 🤔
12. Repeat 1-11 for disk3.

That's OK too. Technically you don't have to preclear but it is one way to give a new disk a good testing.

3 minutes ago, Jobine said:

I thought I was safe from this problem with Unraid...

Do you suggest that I add a second parity disk?

How long have you been getting errors on multiple disks? You need to attend to each problem so you don't get into a situation where you have multiple problems.

Do you have Notifications set up to alert you by email or other agent when Unraid detects a problem?

With only 5 data disks it is debatable whether parity2 is justified. If you are diligent you should be able to deal with problems one disk at a time.

Jobine · January 9, 2019

I like your method, it's simpler...

The problems started two weeks ago, with the addition of the 2 files.

Yes, all notifications are set. (I have never received a message for disk3, just for disk2.)

Thanks,

trurl · January 9, 2019

You might have to turn on the Array Status notification to get messages about simple I/O errors that don't result in SMART warnings or disabled disks, like what you have experienced. But the status notification only goes out on schedule instead of instantly.

Maybe monitoring additional SMART attributes besides the default ones would help. For my WD Red disks, I have added monitoring of attributes 1 and 200 on @johnnie.black's recommendation.

JorgeB · January 9, 2019

Your best bet would likely be to add a 2nd parity drive now. If there are just a few read errors on each disk, they likely won't be on the same sectors, so with a little luck parity2 could build 100% successfully. There's a small chance that a disk could become disabled, but disks are much more likely to error on reads than on writes, so IMO there's a good chance it would work.
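
To illustrate the "not on the same sectors" point: with single parity, a sector can be reconstructed as long as only one disk is unreadable at that sector number. A rough Python sketch of that check follows. It assumes parity is computed across the same md sector number on every disk, it only uses the two md sectors that appear in the syslog excerpt earlier in the thread, and the function name is made up.

```python
from collections import defaultdict

def multi_failure_sectors(bad_sectors_by_disk):
    """Return the sectors that are unreadable on more than one disk."""
    disks_by_sector = defaultdict(set)
    for disk, sectors in bad_sectors_by_disk.items():
        for sector in sectors:
            disks_by_sector[sector].add(disk)
    return {s: sorted(d) for s, d in disks_by_sector.items() if len(d) > 1}

# md sectors taken from the two "read error" lines in the first post.
logged = {"disk2": {20482288}, "disk3": {2078284120}}
print(multi_failure_sectors(logged))

# Hypothetical case where both disks fail at the same sector number.
overlapping = {"disk2": {100, 200}, "disk3": {200, 300}}
print(multi_failure_sectors(overlapping))
```

The logged sectors do not collide, so the first result is empty. The real disks have more bad sectors than the two lines quoted here, so this is only an illustration of the idea, not a verdict on this array.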

Jobine · January 10, 2019

2 hours ago, trurl said:

You might have to turn on the Array Status notification to get messages about simple I/O errors that don't result in SMART warnings or disabled disks, like what you have experienced. But the status notification only goes out on schedule instead of instantly.

Maybe monitoring additional SMART attributes besides the default ones would help. For my WD Red disks, I have added monitoring of attributes 1 and 200 on @johnnie.black's recommendation.

So I just need to add 1,200 under "SMART attribute notifications:", in addition to the ones below?

Jobine · January 10, 2019

I don't think it's good, I get errors from all my disks...

Jobine · January 10, 2019

Last email received:

Quote

Event: Unraid Status
Subject: Notice [SERVEUR] - array health report [FAIL]
Description: Array has 7 disks (including parity & cache)
Importance: warning

Parity - WDC_WD40EFRX-68WT0N0_WD-WCC4E2TCRKH5 (sdc) - active 29 C [OK]
Disk 1 - WDC_WD40EFRX-68WT0N0_WD-WCC4E7JD0XVZ (sdd) - active 33 C [OK]
Disk 2 - WDC_WD40EFRX-68WT0N0_WD-WCC4E0FCJZXN (sde) - active 29 C (disk has read errors) [NOK]
Disk 3 - WDC_WD40EFRX-68N32N0_WD-WCC7K1VK78XA (sdf) - active 27 C (disk has read errors) [NOK]
Disk 4 - WDC_WD40EFRX-68WT0N0_WD-WCC4E5UFLDLZ (sdg) - active 29 C [OK]
Disk 5 - WDC_WD40EFRX-68WT0N0_WD-WCC4E6FT6N6E (sdh) - active 28 C [OK]
Cache - INTEL_SSDPEKKW512G7_BTPY63520CAY512F (nvme0n1) - active 38 C [OK]

Parity is valid
Last checked on Friday, 2019-01-04, 10:33 (6 days ago), finding 0 errors.
Duration: 10 hours, 33 minutes, 18 seconds. Average speed: 105.3 MB/s

trurl · January 10, 2019

Yes that is telling you about the read errors. Often those don't really indicate a bad disk though. Usually it is something else like the connection. But you could take that as an indication that you should investigate further and maybe run an extended SMART test. I guess sometimes the monitored SMART attributes aren't enough to know for sure if a disk is healthy.

Jobine · January 10, 2019

Disk 2 being rebuilt...

Jobine · January 11, 2019

Disk2 complete

Event: Unraid Disk 2 message
Subject: Notice [SERVEUR] - Disk 2 returned to normal operation
Description: WDC_WD40EFRX-68N32N0_WD-WCC7K6SRHYR0 (sdd)
Importance: normal

Event: Unraid Parity sync / Data rebuild
Subject: Notice [SERVEUR] - Parity sync / Data rebuild finished (254 errors)
Description: Duration: 11 hours, 11 minutes, 46 seconds. Average speed: 99.3 MB/s
Importance: warning

254 errors 🤔 (exactly what I have on disk3)

Now on to disk3.
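
(A quick sanity check on the rebuild report: duration times average speed should come out close to the size of the disk. The sketch below assumes the report's MB means 10^6 bytes; the helper name is made up.)

```python
def terabytes_processed(hours, minutes, seconds, mb_per_s):
    """Duration times average speed, in TB, assuming 1 MB = 10**6 bytes."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds * mb_per_s * 10**6 / 10**12

# Figures from the rebuild notification: 11 h 11 min 46 s at 99.3 MB/s.
rebuild_tb = terabytes_processed(11, 11, 46, 99.3)
# Figures from the earlier parity check: 10 h 33 min 18 s at 105.3 MB/s.
check_tb = terabytes_processed(10, 33, 18, 105.3)

print(f"rebuild: {rebuild_tb:.2f} TB, parity check: {check_tb:.2f} TB")
```

Both come out at about 4.00 TB, in line with the 4 TB WD40EFRX disks, so the rebuild ran across the whole disk despite the reported errors.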

JorgeB · January 11, 2019

That means there will be some corruption on the rebuilt disk, which will in turn cause corruption when rebuilding disk3. I still think you should have tried adding parity2; you'd likely have been able to replace both disks without errors.
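
To show why read errors on another disk corrupt a single-parity rebuild, here is a minimal Python sketch of XOR parity with three made-up data disks (the real array has five). Substituting zeros for the unreadable sector is an assumption of the sketch; the thread does not say what Unraid actually writes in that case.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings (single-parity math)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Made-up 4-byte "sectors" for three data disks.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0xDE, 0xAD, 0xBE, 0xEF])
disk3 = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks([disk1, disk2, disk3])

# Healthy case: disk2 is rebuilt from every other disk plus parity.
rebuilt_ok = xor_blocks([disk1, disk3, parity])

# Degraded case: disk3 cannot be read at this sector. Zeros stand in for
# the missing data (an assumption made only for this illustration).
disk3_unreadable = bytes(4)
rebuilt_bad = xor_blocks([disk1, disk3_unreadable, parity])

print("rebuild with all disks readable matches:", rebuilt_ok == disk2)
print("rebuild with disk3 unreadable matches:", rebuilt_bad == disk2)
```

The second rebuild does not match the original data, and whatever ends up on the new disk2 then becomes an input when disk3 is rebuilt in turn.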
