Jobine Posted January 8, 2019
Hi, I have a lot of errors in my syslog:
Jan 8 15:31:12 Serveur root: error: /webGui/include/ProcessStatus.php: wrong csrf_token
Jan 8 15:34:36 Serveur kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 8 15:34:36 Serveur kernel: ata3.00: error: { UNC }
Jan 8 15:34:36 Serveur kernel: print_req_error: I/O error, dev sde, sector 20482352
Jan 8 15:34:36 Serveur kernel: md: disk2 read error, sector=20482288
...
and
Jan 8 15:41:00 Serveur root: Fix Common Problems: Error: disk2 (WDC_WD40EFRX-68WT0N0_WD-WCC4E0FCJZXN) has read errors
Jan 8 15:52:21 Serveur kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 8 15:52:21 Serveur kernel: ata4.00: error: { UNC }
Jan 8 15:52:21 Serveur kernel: print_req_error: I/O error, dev sdf, sector 2078284184
Jan 8 15:52:21 Serveur kernel: md: disk3 read error, sector=2078284120
and in the SMART error log of my disk2:
ATA Error Count: 21 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 21 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 90 c4 95 e0 Error: UNC at LBA = 0x0095c490 = 9815184
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 00 90 c4 95 e0 08 24d+11:46:58.603 READ DMA
ca 00 f0 a0 c2 95 e0 08 24d+11:46:58.588 WRITE DMA
c8 00 00 90 c3 95 e0 08 24d+11:46:56.557 READ DMA
ef 10 02 00 00 00 a0 08 24d+11:46:56.557 SET FEATURES [Enable SATA feature]
Error 20 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a0 c2 95 e0 Error: UNC at LBA = 0x0095c2a0 = 9814688
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 00 90 c2 95 e0 08 24d+11:46:52.581 READ DMA
c8 00 00 90 c1 95 e0 08 24d+11:46:52.148 READ DMA
c8 00 00 90 c0 95 e0 08 24d+11:46:50.993 READ DMA
ca 00 48 48 bf 95 e0 08 24d+11:46:50.993 WRITE DMA
c8 00 00 90 bf 95 e0 08 24d+11:46:50.218 READ DMA
Error 19 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 48 bf 95 e0 Error: UNC at LBA = 0x0095bf48 = 9813832
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 00 90 be 95 e0 08 24d+11:46:46.118 READ DMA
c8 00 00 90 bd 95 e0 08 24d+11:46:44.709 READ DMA
c8 00 00 90 bc 95 e0 08 24d+11:46:42.665 READ DMA
c8 00 00 90 bb 95 e0 08 24d+11:46:41.865 READ DMA
c8 00 00 90 ba 95 e0 08 24d+11:46:40.233 READ DMA
Error 18 occurred at disk power-on lifetime: 15336 hours (639 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 d8 14 93 e0 Error: UNC at LBA = 0x009314d8 = 9639128
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 00 90 14 93 e0 08 24d+11:44:56.067 READ DMA
c8 00 00 90 13 93 e0 08 24d+11:44:54.889 READ DMA
c8 00 00 90 12 93 e0 08 24d+11:44:54.020 READ DMA
c8 00 00 90 11 93 e0 08 24d+11:44:52.813 READ DMA
c8 00 00 90 10 93 e0 08 24d+11:44:51.868 READ DMA
Error 17 occurred at disk power-on lifetime: 15294 hours (637 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 80 c5 95 e0 Error: UNC at LBA = 0x0095c580 = 9815424
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 00 60 c5 95 e0 08 22d+18:06:17.868 READ DMA
ca 00 48 18 c3 95 e0 08 22d+18:06:17.868 WRITE DMA
ef 10 02 00 00 00 a0 08 22d+18:06:17.851 SET FEATURES [Enable SATA feature]
ec 00 00 00 00 00 a0 08 22d+18:06:17.846 IDENTIFY DEVICE
and in Fix Common Problems (the disk is not disabled):
If the disk has not been disabled, then unRaid has successfully rewritten the contents of the offending sectors back to the hard drive.
Any idea what my problem is?
Thanks,
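For readers wondering how the "UNC at LBA" values in the SMART error log relate to the register columns, here is a minimal Python sketch. It assumes the standard 28-bit LBA layout of the ATA task-file registers (low nibble of DH, then CH, CL, SN) and that bit 6 of the error register is the UNC (uncorrectable data) flag. The function names are made up for illustration and are not part of smartctl or Unraid.

```python
# Sketch only: helper names are hypothetical, not part of any tool.
UNC_BIT = 0x40  # bit 6 of the ATA error register: uncorrectable data


def parse_register_row(row: str) -> dict:
    """Turn a row such as '40 51 00 90 c4 95 e0' into named integer registers."""
    names = ["ER", "ST", "SC", "SN", "CL", "CH", "DH"]
    values = [int(token, 16) for token in row.split()[:7]]
    return dict(zip(names, values))


def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Assemble a 28-bit LBA: DH low nibble is bits 24-27, then CH, CL, SN."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register row taken from "Error 21" in the post above.
regs = parse_register_row("40 51 00 90 c4 95 e0")
lba = lba28_from_registers(regs["SN"], regs["CL"], regs["CH"], regs["DH"])
print(hex(lba), lba, bool(regs["ER"] & UNC_BIT))  # 0x95c490 9815184 True
```

The same arithmetic reproduces the other LBAs in the log, for example `d8 14 93 e0` gives 9639128 for Error 18.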
JorgeB Posted January 9, 2019
Please post the diagnostics: Tools -> Diagnostics
Jobine Posted January 9, 2019
Thanks! serveur-diagnostics-20190109-0617.zip
JorgeB Posted January 9, 2019
Disks 2 and 3 both appear to be failing, though these errors can be intermittent: disk1 had similar errors when it was almost new and none since. I would recommend running an extended SMART test, and running it on all disks, since others are also showing some warning signs (mostly raw read errors). Then post new diags.
Jobine Posted January 9, 2019
Thanks, serveur-smart-20190109-1009.zip
JorgeB Posted January 9, 2019
# 1 Extended offline Completed: read failure 90% 12704 112335120
# 1 Extended offline Completed: read failure 90% 15542 9640664
Both failed the test and need to be replaced. Since there's only one parity disk, there will likely be some data loss.
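The two self-test rows quoted above pack several fields into one line. As a rough sketch, assuming the usual smartctl self-test log column order (test number, description and status, percent remaining, lifetime hours, LBA of first error), they could be split up in Python like this. The class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SelfTestEntry(NamedTuple):
    number: int
    description: str              # test type plus status text
    remaining_percent: int
    lifetime_hours: int
    first_error_lba: Optional[int]  # None when the column is not a number


# Description is matched lazily so the trailing numeric columns stay separate.
_ROW = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftest_row(row: str) -> SelfTestEntry:
    match = _ROW.match(row.strip())
    if match is None:
        raise ValueError(f"unrecognised self-test row: {row!r}")
    number, description, remaining, hours, lba = match.groups()
    return SelfTestEntry(
        int(number),
        description,
        int(remaining),
        int(hours),
        int(lba) if lba.isdigit() else None,
    )


def failed(entry: SelfTestEntry) -> bool:
    """Treat any status text mentioning 'failure' as a failed test."""
    return "failure" in entry.description.lower()


entry = parse_selftest_row(
    "# 1 Extended offline Completed: read failure 90% 15542 9640664"
)
print(entry.first_error_lba, entry.lifetime_hours, failed(entry))
```

A test that stops with 90% remaining and a first-error LBA, as both rows here do, hit an unreadable sector early in the surface scan.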
Jobine Posted January 9, 2019
What would be the best way to replace my disk 2 and my disk 3 later? What is the procedure for replacing a disk and rebuilding?
Thanks,
Jobine Posted January 9, 2019
Should I manually move my files to other disks with Krusader, to see which ones don't work?
trurl Posted January 9, 2019
1 minute ago, Jobine said: What would be the best way to replace my disk 2 and my disk 3 later? What is the procedure for replacing a disk and rebuilding? Thanks,
Why do you say "later"? They certainly have to be done one at a time, but if by "later" you mean some time in the not-so-near future, then you need to be aware that any disk that isn't working well compromises the protection you get from parity.
Jobine Posted January 9, 2019
One disk at a time: I will replace disk 2, and then disk 3 right after.
trurl Posted January 9, 2019
3 minutes ago, Jobine said: Should I manually move my files to other disks with Krusader, to see which ones don't work?
Don't move anything on the array. Your array is already compromised. If you want, you can copy data to someplace off the array, such as your PC or an external drive with Unassigned Devices. Do you have another copy of anything important and irreplaceable? That should be your first priority.
Jobine Posted January 9, 2019
I recently added two big files (10 and 40 GB). Could this be my problem? How do I proceed to change a disk? Hot swap? Turn off, change disk, turn back on?
Jobine Posted January 9, 2019
Just to be sure...
1. Turn off the server
2. Install the new disk
3. Turn on the server
4. PreClear the new disk
5. Turn off the server
6. Remove the disk2
7. Install the new disk instead of the disk2
8. Turn on the server
9. Assign the new disk to disk2
10. Start the Parity sync
11. Waiting and hoping 🤔
12. Repeat 1-11 for the disk3
I thought I was safe from this problem with Unraid... Do you suggest that I add a second parity disk?
trurl Posted January 9, 2019
17 minutes ago, Jobine said: I recently added a big 10 and 40 gig files. Can this be my problem?
Unlikely.
18 minutes ago, Jobine said: How I proceed to change a disk ?
1. Shut down.
2. Replace one disk with a new one of the same or larger size, but no larger than parity.
3. Boot up.
4. Assign the new disk to the slot for the missing disk.
5. Start the array to begin the rebuild.
If for some reason it offers to Format anything, DON'T.
After the rebuild completes, run a non-correcting parity check.
Keep the original disk for now in case it is needed as backup for file recovery.
Repeat for the other disk.
If you have any doubts, questions, or problems, come back for further advice.
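The size rule in the replacement step above ("same or larger size, but no larger than parity") can be written as a single comparison. This is only an illustrative sketch: the function name and the byte counts are made up, and Unraid performs its own check when the disk is assigned.

```python
def can_replace(old_bytes: int, new_bytes: int, parity_bytes: int) -> bool:
    """A replacement must be at least as large as the old data disk
    and no larger than the parity disk."""
    return old_bytes <= new_bytes <= parity_bytes


# Hypothetical capacities, using decimal terabytes for readability.
TB = 1000**4
print(can_replace(old_bytes=4 * TB, new_bytes=4 * TB, parity_bytes=4 * TB))  # True
print(can_replace(old_bytes=4 * TB, new_bytes=8 * TB, parity_bytes=4 * TB))  # False
print(can_replace(old_bytes=4 * TB, new_bytes=3 * TB, parity_bytes=4 * TB))  # False
```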
trurl Posted January 9, 2019
2 minutes ago, Jobine said: Just to be sure... 1. Turn off the server; 2. Install the new disk; 3. Turn on the server; 4. PreClear the new disk; 5. Turn off the server; 6. Remove the disk2; 7. Install the new disk instead of the disk2; 8. Turn on the server; 9. Assign the new disk to disk2; 10. Start the Parity sync; 11. Waiting and hoping 🤔; 12. Repeat 1-11 for the disk3.
That's OK too. Technically you don't have to preclear, but it is one way to give a new disk a good testing.
3 minutes ago, Jobine said: I thought I was safe from this problem with Unraid... Do you suggest that I add a second parity disk?
How long have you been getting errors on multiple disks? You need to attend to each problem as it appears so you don't get into a situation where you have multiple problems at once. Do you have Notifications set up to alert you by email or another agent when Unraid detects a problem?
With only 5 data disks, it is debatable whether parity2 is justified. If you are diligent, you should be able to deal with problems one disk at a time.
Jobine Posted January 9, 2019
I like your method, it's simpler... The problems started two weeks ago, with the addition of the 2 files. Yes, all notifications are set. (I have never received a message for disk3, only for disk2.)
Thanks,
trurl Posted January 9, 2019
You might have to turn on the Array Status notification to get messages about simple I/O errors that don't result in SMART warnings or disabled disks, like the ones you have experienced. But the status notification only goes out on schedule instead of instantly.
Maybe monitoring additional SMART attributes besides the default ones would help. For my WD Red disks, I have added monitoring of attributes 1 and 200 on @johnnie.black's recommendation.
JorgeB Posted January 9, 2019
Your best bet would likely be to add a 2nd parity drive now. If there are just a few read errors on each disk, they likely won't be on the same sectors, so with a little luck parity2 could build 100% successfully. There's a small chance that a disk could become disabled, but disks are much more likely to error on reads than on writes, so IMO there's a good chance it would work.
Jobine Posted January 10, 2019
2 hours ago, trurl said: You might have to turn on the Array Status notification to get messages about simple I/O errors that don't result in SMART warnings or disabled disks, like what you have experienced. But the status notification only goes out on schedule instead of instantly. Maybe monitoring additional SMART attributes besides the default ones would help. For my WD Red disks, I have added monitoring of attributes 1 and 200 @johnnie.black recommendation.
Do I just need to add 1,200 under "SMART attribute notifications", in addition to the ones below?
Jobine Posted January 10, 2019
I don't think this is good: I get errors from all my disks...
Jobine Posted January 10, 2019
Last email received:
Event: Unraid Status
Subject: Notice [SERVEUR] - array health report [FAIL]
Description: Array has 7 disks (including parity & cache)
Importance: warning
Parity - WDC_WD40EFRX-68WT0N0_WD-WCC4E2TCRKH5 (sdc) - active 29 C [OK]
Disk 1 - WDC_WD40EFRX-68WT0N0_WD-WCC4E7JD0XVZ (sdd) - active 33 C [OK]
Disk 2 - WDC_WD40EFRX-68WT0N0_WD-WCC4E0FCJZXN (sde) - active 29 C (disk has read errors) [NOK]
Disk 3 - WDC_WD40EFRX-68N32N0_WD-WCC7K1VK78XA (sdf) - active 27 C (disk has read errors) [NOK]
Disk 4 - WDC_WD40EFRX-68WT0N0_WD-WCC4E5UFLDLZ (sdg) - active 29 C [OK]
Disk 5 - WDC_WD40EFRX-68WT0N0_WD-WCC4E6FT6N6E (sdh) - active 28 C [OK]
Cache - INTEL_SSDPEKKW512G7_BTPY63520CAY512F (nvme0n1) - active 38 C [OK]
Parity is valid
Last checked on Friday, 2019-01-04, 10:33 (6 days ago), finding 0 errors.
Duration: 10 hours, 33 minutes, 18 seconds. Average speed: 105.3 MB/s
trurl Posted January 10, 2019
Yes, that is telling you about the read errors. Often those don't really indicate a bad disk, though; usually it is something else, like the connection. But you could take it as an indication that you should investigate further and maybe run an extended SMART test. I guess sometimes the monitored SMART attributes aren't enough to know for sure whether a disk is healthy.
Jobine Posted January 10, 2019
Disk 2 being rebuilt...
Jobine Posted January 11, 2019
Disk2 complete:
Event: Unraid Disk 2 message
Subject: Notice [SERVEUR] - Disk 2 returned to normal operation
Description: WDC_WD40EFRX-68N32N0_WD-WCC7K6SRHYR0 (sdd)
Importance: normal
Event: Unraid Parity sync / Data rebuild
Subject: Notice [SERVEUR] - Parity sync / Data rebuild finished (254 errors)
Description: Duration: 11 hours, 11 minutes, 46 seconds. Average speed: 99.3 MB/s
Importance: warning
254 errors 🤔 (exactly what I have on disk3).
Now disk3.
JorgeB Posted January 11, 2019
That means there will be some corruption on the rebuilt disk, which will in turn cause corruption when rebuilding disk3. I still think you should have tried adding parity2; you'd likely have been able to replace both disks without errors.
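To illustrate why read errors on disk3 end up as errors in the rebuilt disk2, here is a toy Python model. It assumes single parity is a plain XOR across the data disks at each sector position (my understanding of Unraid's first parity disk, not something stated in this thread), represents each sector as a small integer, and uses None for a sector that cannot be read. All names are hypothetical.

```python
from functools import reduce
from typing import Iterable, List, Optional, Tuple

Sector = Optional[int]  # None models an unreadable sector


def xor_all(values: Iterable[int]) -> int:
    return reduce(lambda a, b: a ^ b, values, 0)


def compute_parity(disks: List[List[int]]) -> List[int]:
    """Parity sector i is the XOR of sector i on every data disk."""
    return [xor_all(column) for column in zip(*disks)]


def rebuild(parity: List[int], survivors: List[List[Sector]]) -> Tuple[List[Sector], int]:
    """Rebuild the missing disk from parity plus the surviving disks.

    A position where any survivor is unreadable cannot be reconstructed
    with single parity, so it is returned as None and counted as an error.
    """
    rebuilt: List[Sector] = []
    errors = 0
    for i, parity_sector in enumerate(parity):
        column = [disk[i] for disk in survivors]
        if any(sector is None for sector in column):
            rebuilt.append(None)
            errors += 1
        else:
            rebuilt.append(parity_sector ^ xor_all(column))
    return rebuilt, errors


disk1 = [1, 2, 3, 4]
disk2 = [5, 6, 7, 8]          # the disk being replaced
disk3 = [9, 10, 11, 12]
parity = compute_parity([disk1, disk2, disk3])

# disk3 has one unreadable sector while disk2 is being rebuilt.
disk3_failing: List[Sector] = [9, 10, None, 12]
rebuilt, errors = rebuild(parity, [disk1, disk3_failing])
print(rebuilt, errors)  # [5, 6, None, 8] 1
```

In this model, each unreadable sector on a surviving disk costs exactly one sector on the rebuilt disk, which would be consistent with the rebuild reporting the same error count that disk3 showed. A second parity disk adds an independent equation per position, which is why it can tolerate one such unreadable sector during a rebuild.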