EdgarWallace Posted August 28, 2015 Share Posted August 28, 2015 The process finished with >400mio parity errors corrected. I wasn't smart enough to implement bunker yet but a quick check showed that the data was all ok. I removed disk 7, started the server and was running another parity check. These errors that were causing the parity check to stop: Aug 28 13:38:10 Tower kernel: md: parity incorrect, sector=2004194360 Aug 28 13:38:41 Tower kernel: sas: Enter sas_scsi_recover_host busy: 32 failed: 32 Aug 28 13:38:41 Tower kernel: sas: trying to find task 0xffff880205dec400 Next were read and write errors and finally disk 3 was disabled. I ordered a new drive but in what order should I run the process? Would that be correct: 1.) preclear the new drive 2.) swap disc 3 3.) run a new parity check I have the the diagnostic attached and would appreciate any further guidance. Thanks a lot. tower-diagnostics-20150828-1354.zip Link to comment
RobJ Posted August 28, 2015 Share Posted August 28, 2015 I removed disk 7, started the server and was running another parity check. These errors that were causing the parity check to stop: Aug 28 13:38:10 Tower kernel: md: parity incorrect, sector=2004194360 Aug 28 13:38:41 Tower kernel: sas: Enter sas_scsi_recover_host busy: 32 failed: 32 Aug 28 13:38:41 Tower kernel: sas: trying to find task 0xffff880205dec400 Next were read and write errors and finally disk 3 was disabled. I ordered a new drive but in what order should I run the process? Would that be correct: 1.) preclear the new drive 2.) swap disc 3 3.) run a new parity check I have the the diagnostic attached and would appreciate any further guidance. Thanks a lot. This kind of error is not usually a physical drive issue but an interface issue. It usually means contact with the drive is lost, so nothing wrong with the drive, but if you cannot talk to it, you will have a lot of errors including read and write errors and then the drive is dropped by the kernel. It's not the drive's fault, and nothing is wrong with the drive. It could be a cabling or power issue, or a controller issue. Sorry, I haven't looked at the diagnostics yet. Maybe later... Link to comment
EdgarWallace Posted August 28, 2015 Author Share Posted August 28, 2015 Thanks Rob. If you would kindly look into the diagnostics...that would be excellent. I have almost all data from disc 3 on my 2nd server and doing the remaining backup as we speak. The Smart Test I was starting is stuck at 10%...so I might need to first power down the server and checking the cabling. You are right..I might have damaged the cabling when I pulled off the empty disc drive but I'm going to install the new drive afterwards. Let's see how the "old" drive behaves when I preclear it....this will making sure that the drive is in proper condition. I will post the results here. Thanks a lot. [EDIT] The data rebuild ended without any further errors. I started a SMART extended self-test (the short self-test didn't reported any errors). I will upload the SMART report as soon as it finishes. However, 12 movies are damaged now. HUGE issue as I can't trust the data integrity any more. I saw this before but this is massive now. Link to comment
RobJ Posted August 29, 2015 Share Posted August 29, 2015 Your syslog shows unRAID 6.0.1, but it's not. Here is 6.0.1 - Aug 21 14:57:33 Tower kernel: Linux version 4.0.4-unRAID (root@develop64) (gcc version 4.8.2 (GCC) ) #5 SMP PREEMPT Fri Jun 19 22:47:24 PDT 2015 Your unRAID appears to have been built 8 days later, probably based on 6.0.1, looks like this - Aug 28 10:54:54 Tower kernel: Linux version 4.0.4-unRAID (root@Server) (gcc version 4.8.2 (GCC) ) #1 SMP PREEMPT Sat Jun 27 01:01:48 BST 2015 I have no idea whether that is significant or not. Your SAS card (using mvsas) quit on you at Aug 28 13:38:10, just over 2.5 hours into the parity check. First errors were - Aug 28 13:38:10 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 0FFFFDFF, slot [69]. Aug 28 13:38:10 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active FFFFFFFD, slot [21]. Aug 28 13:38:10 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active FFFFFFEF, slot [44]. Aug 28 13:38:10 Tower kernel: md: parity incorrect, sector=2004192336 Aug 28 13:38:10 Tower kernel: md: parity incorrect, sector=2004192344 ...snipped - more parity errors... Aug 28 13:38:41 Tower kernel: sas: Enter sas_scsi_recover_host busy: 32 failed: 32 Aug 28 13:38:41 Tower kernel: sas: trying to find task 0xffff880205dec400 Aug 28 13:38:41 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff880205dec400 Aug 28 13:38:41 Tower kernel: sas: sas_scsi_find_task: task 0xffff880205dec400 is aborted Aug 28 13:38:41 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff880205dec400 is aborted ...snipped - aborted the tasks... No clues above about what actually went wrong with the card. When I have seen this in the past, the next drives to be accessed are unresponsive, and are dropped. In your case, ata11 (Disk 3) was first and ata13 was second. Some how, the drive associated with ata13 recovers after a hard reset, but ata11 does not, and is finally dropped, disabled. That usually always means there is nothing wrong with the drive, just a lost connection, fixed by a reboot. A SAS controller crash like this has always been hard to diagnose. You can check for a firmware update. Some have just replaced the card. Wish I had better advice, but I don't. What I *can* say is the drives are probably all fine. Edit: I will probably split these last posts out, into a support topic in the General Support for v6 board. Link to comment
EdgarWallace Posted August 31, 2015 Author Share Posted August 31, 2015 Rob, yes please feel free to move the last posts wherever you think it should be posted. Only two comments: 1.) I'm running the unRAID 6.0.1 Media Build, that's why you see a different build date. 2.) I will check the firmware of my SAS card, but I do believe that it's the latest 4.0.0.1812 Thanks again, you advise is much appreciated! The server is up and running again, now I will check why some of the data got corrupted. Link to comment
EdgarWallace Posted September 7, 2015 Author Share Posted September 7, 2015 I'm facing new issues after having updated today to 6.1.1 Media Build today. Parity check is at around 60% and 137 errors have been corrected. This is the first time after the removal of the disc that I see parity errors: Sep 7 13:59:58 Tower kernel: mdcmd (27): check CORRECT Sep 7 13:59:58 Tower kernel: md: recovery thread woken up ... Sep 7 13:59:58 Tower kernel: md: recovery thread checking parity... Sep 7 13:59:58 Tower kernel: md: using 8192k window, over a total of 2930266532 blocks. Sep 7 14:00:10 Tower kernel: md: correcting parity, sector=3343440 Sep 7 14:00:25 Tower kernel: md: correcting parity, sector=7459984 Sep 7 14:00:27 Tower kernel: md: correcting parity, sector=8189472 Sep 7 14:00:38 Tower kernel: md: correcting parity, sector=11363248 Sep 7 14:00:58 Tower kernel: md: correcting parity, sector=16961608 Sep 7 14:01:26 Tower kernel: md: correcting parity, sector=25038880 Sep 7 14:01:37 Tower kernel: md: correcting parity, sector=28293928 Sep 7 14:02:22 Tower kernel: md: correcting parity, sector=41052152 Sep 7 14:02:41 Tower autofan: Highest disk temp is 35°C, adjusting fan speed from: 173 (67% @ 1757rpm) to: 185 (72% @ 1814rpm) Sep 7 14:07:49 Tower autofan: Highest disk temp is 36°C, adjusting fan speed from: 185 (72% @ 1856rpm) to: 197 (77% @ 1901rpm) Sep 7 14:12:58 Tower autofan: Highest disk temp is 38°C, adjusting fan speed from: 197 (77% @ 1939rpm) to: 221 (86% @ 2061rpm) Sep 7 14:14:55 Tower kernel: md: correcting parity, sector=252510240 Sep 7 14:14:58 Tower kernel: md: correcting parity, sector=253283392 Sep 7 14:15:18 Tower kernel: md: correcting parity, sector=259051408 Sep 7 14:15:57 Tower kernel: md: correcting parity, sector=269990640 Sep 7 14:15:57 Tower kernel: md: correcting parity, sector=270041944 Sep 7 14:16:21 Tower kernel: md: correcting parity, sector=276592496 Sep 7 14:16:33 Tower kernel: md: correcting parity, sector=280145144 Sep 7 14:16:47 Tower kernel: md: correcting parity, sector=283961064 Sep 7 14:17:03 Tower kernel: md: correcting parity, sector=288320288 Sep 7 14:18:05 Tower kernel: md: correcting parity, sector=304850328 Sep 7 14:18:28 Tower kernel: md: correcting parity, sector=311543808 Sep 7 14:18:41 Tower kernel: md: correcting parity, sector=314957544 Sep 7 14:18:48 Tower kernel: md: correcting parity, sector=317048336 Sep 7 14:18:52 Tower kernel: md: correcting parity, sector=318027648 Sep 7 14:19:11 Tower kernel: md: correcting parity, sector=323384584 Sep 7 14:19:57 Tower kernel: md: correcting parity, sector=336244240 Sep 7 14:20:05 Tower kernel: md: correcting parity, sector=338529168 Sep 7 14:20:12 Tower kernel: md: correcting parity, sector=340368640 Sep 7 14:20:14 Tower kernel: md: correcting parity, sector=340988768 Sep 7 14:20:17 Tower kernel: md: correcting parity, sector=341966792 Sep 7 14:20:50 Tower kernel: md: correcting parity, sector=351065160 Sep 7 14:21:34 Tower kernel: md: correcting parity, sector=363281072 Sep 7 14:21:34 Tower kernel: md: correcting parity, sector=363346856 Sep 7 14:22:00 Tower kernel: md: correcting parity, sector=370568344 Sep 7 14:22:02 Tower kernel: md: correcting parity, sector=371111200 Sep 7 14:22:12 Tower kernel: md: correcting parity, sector=374002624 Sep 7 14:23:10 Tower autofan: Highest disk temp is 39°C, adjusting fan speed from: 221 (86% @ 2057rpm) to: 233 (91% @ 2129rpm) Sep 7 14:38:27 Tower autofan: Highest disk temp is 40°C, adjusting fan speed from: 233 (91% @ 2173rpm) to: FULL (100% @ 2261rpm) Sep 7 14:43:21 Tower afpd[16689]: transmit: Request to dbd daemon (volume TMO) timed out. Sep 7 14:43:21 Tower afpd[16689]: afp_openvol(/mnt/user/TMO): Fatal error: Unable to get stamp value from CNID backend Sep 7 14:43:21 Tower afpd[16689]: dsi_stream_read: len:0, unexpected EOF Sep 7 15:05:28 Tower kernel: md: correcting parity, sector=980278608 Sep 7 15:05:57 Tower kernel: md: correcting parity, sector=985759880 Sep 7 15:07:18 Tower kernel: md: correcting parity, sector=1001025664 Sep 7 15:07:22 Tower kernel: md: correcting parity, sector=1001759392 Sep 7 15:07:39 Tower kernel: md: correcting parity, sector=1004835824 Sep 7 15:18:25 Tower kernel: md: correcting parity, sector=1161449848 Sep 7 15:19:46 Tower kernel: md: correcting parity, sector=1182265736 Sep 7 15:20:02 Tower kernel: md: correcting parity, sector=1186601712 Sep 7 15:20:15 Tower kernel: md: correcting parity, sector=1190034936 Sep 7 15:21:11 Tower kernel: md: correcting parity, sector=1205051528 Sep 7 15:23:25 Tower kernel: md: correcting parity, sector=1240723488 Sep 7 15:24:12 Tower kernel: md: correcting parity, sector=1250962064 Sep 7 15:24:22 Tower kernel: md: correcting parity, sector=1253593120 Sep 7 15:24:26 Tower kernel: md: correcting parity, sector=1254518840 Sep 7 15:24:35 Tower kernel: md: correcting parity, sector=1257111824 Sep 7 15:25:01 Tower kernel: md: correcting parity, sector=1263885848 Sep 7 15:25:33 Tower kernel: md: correcting parity, sector=1272204680 Sep 7 15:25:45 Tower kernel: md: correcting parity, sector=1275460168 Sep 7 15:25:48 Tower kernel: md: correcting parity, sector=1276331032 Sep 7 15:26:07 Tower kernel: md: correcting parity, sector=1281163872 Sep 7 15:26:32 Tower kernel: md: correcting parity, sector=1287914224 Sep 7 15:27:35 Tower kernel: md: correcting parity, sector=1304660520 Sep 7 15:27:39 Tower kernel: md: correcting parity, sector=1305574320 Sep 7 15:31:52 Tower kernel: md: correcting parity, sector=1372377848 Sep 7 15:32:00 Tower kernel: md: correcting parity, sector=1374544968 Sep 7 15:32:32 Tower kernel: md: correcting parity, sector=1382969976 Sep 7 15:32:35 Tower kernel: md: correcting parity, sector=1383982824 Sep 7 15:33:25 Tower kernel: md: correcting parity, sector=1397184600 Sep 7 15:33:37 Tower kernel: md: correcting parity, sector=1400384120 Sep 7 15:33:45 Tower kernel: md: correcting parity, sector=1402511136 Sep 7 15:42:07 Tower kernel: md: correcting parity, sector=1531778480 Sep 7 15:42:22 Tower kernel: md: correcting parity, sector=1535854792 Sep 7 15:44:31 Tower kernel: md: correcting parity, sector=1568516280 Sep 7 15:44:46 Tower kernel: md: correcting parity, sector=1572348816 Sep 7 15:45:22 Tower kernel: md: correcting parity, sector=1581746784 Sep 7 15:45:38 Tower kernel: md: correcting parity, sector=1586174096 Sep 7 15:45:45 Tower kernel: md: correcting parity, sector=1587889496 Sep 7 15:45:57 Tower kernel: md: correcting parity, sector=1590897576 Sep 7 15:46:02 Tower kernel: md: correcting parity, sector=1592287784 Sep 7 15:46:16 Tower kernel: md: correcting parity, sector=1596037288 Sep 7 15:46:40 Tower kernel: md: correcting parity, sector=1602296856 Sep 7 15:47:17 Tower kernel: md: correcting parity, sector=1611713448 Sep 7 15:47:24 Tower kernel: md: correcting parity, sector=1613579848 Sep 7 15:47:30 Tower kernel: md: correcting parity, sector=1615184976 Sep 7 15:48:03 Tower kernel: md: correcting parity, sector=1623562832 Sep 7 15:48:10 Tower kernel: md: correcting parity, sector=1625251352 Sep 7 15:49:04 Tower kernel: md: correcting parity, sector=1639365744 Sep 7 15:52:44 Tower kernel: md: correcting parity, sector=1695258144 Sep 7 15:52:53 Tower kernel: md: correcting parity, sector=1697609320 Sep 7 15:53:36 Tower kernel: md: correcting parity, sector=1708735264 Sep 7 16:11:20 Tower kernel: md: correcting parity, sector=1973717144 Sep 7 16:11:49 Tower kernel: md: correcting parity, sector=1979377664 Sep 7 16:12:08 Tower kernel: md: correcting parity, sector=1982562488 Sep 7 16:13:43 Tower kernel: md: correcting parity, sector=1999711824 Sep 7 16:14:12 Tower kernel: md: correcting parity, sector=2005403240 Sep 7 16:14:31 Tower kernel: md: correcting parity, sector=2008878720 Sep 7 16:14:36 Tower kernel: md: correcting parity, sector=2009968720 Sep 7 16:14:40 Tower kernel: md: correcting parity, sector=2010523232 Sep 7 16:15:12 Tower kernel: md: correcting parity, sector=2016596312 Sep 7 16:15:20 Tower kernel: md: correcting parity, sector=2018135696 Sep 7 16:15:21 Tower kernel: md: correcting parity, sector=2018352872 Sep 7 16:15:59 Tower kernel: md: correcting parity, sector=2025801360 Sep 7 16:16:15 Tower kernel: md: correcting parity, sector=2028925880 Sep 7 16:16:41 Tower kernel: md: correcting parity, sector=2033556864 Sep 7 16:16:45 Tower kernel: md: correcting parity, sector=2034392456 Sep 7 16:17:19 Tower kernel: md: correcting parity, sector=2040497560 Sep 7 16:18:07 Tower kernel: md: correcting parity, stopped logging Sep 7 16:33:02 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1 Sep 7 16:33:02 Tower kernel: sas: trying to find task 0xffff88010d2bce00 Sep 7 16:33:02 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff88010d2bce00 Sep 7 16:33:02 Tower kernel: sas: sas_scsi_find_task: task 0xffff88010d2bce00 is aborted Sep 7 16:33:02 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff88010d2bce00 is aborted Sep 7 16:33:02 Tower kernel: sas: ata7: end_device-1:0: cmd error handler Sep 7 16:33:02 Tower kernel: sas: ata7: end_device-1:0: dev error handler Sep 7 16:33:02 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Sep 7 16:33:02 Tower kernel: ata7.00: failed command: READ DMA EXT Sep 7 16:33:02 Tower kernel: ata7.00: cmd 25/00:00:e8:23:1f/00:02:86:00:00/e0 tag 10 dma 262144 in Sep 7 16:33:02 Tower kernel: res 40/00:00:2f:d2:d0/00:00:83:00:00/e3 Emask 0x4 (timeout) Sep 7 16:33:02 Tower kernel: ata7.00: status: { DRDY } Sep 7 16:33:02 Tower kernel: ata7: hard resetting link Sep 7 16:33:02 Tower kernel: sas: ata8: end_device-1:1: dev error handler Sep 7 16:33:02 Tower kernel: sas: ata9: end_device-1:2: dev error handler Sep 7 16:33:02 Tower kernel: sas: ata10: end_device-1:3: dev error handler Sep 7 16:33:02 Tower kernel: sas: ata11: end_device-1:4: dev error handler Sep 7 16:33:02 Tower kernel: sas: ata12: end_device-1:5: dev error handler Sep 7 16:33:02 Tower kernel: sas: ata13: end_device-1:6: dev error handler Sep 7 16:33:02 Tower kernel: sas: sas_form_port: phy0 belongs to port0 already(1)! Sep 7 16:33:04 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[0]:rc= 0 Sep 7 16:33:04 Tower kernel: ata7.00: configured for UDMA/133 Sep 7 16:33:04 Tower kernel: ata7: EH complete Sep 7 16:33:04 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1 Sep 7 17:35:55 Tower autofan: Highest disk temp is 39°C, adjusting fan speed from: FULL (100% @ 2246rpm) to: 233 (91% @ 2163rpm) Sep 7 17:42:46 Tower afpd[23316]: dsi_stream_read: len:0, unexpected EOF Is there any way to identify, which files have been affected by the parity error correction? I don't know if this is linked but here is an overview of files that are corrupted: Atonement (2007) 30.12.2010 Australia (2008) 05.01.2011 Flickan som lekte med elden (2009 05.01.2011 Unbreakable (2000) 05.01.2011 Lost in Translation (2003) 06.01.2011 Proposal, The (2009) 06.01.2011 Vanilla Sky (2001) 09.01.2011 Vertical Limit (2000) 09.01.2011 Blade Runner (1982) 04.02.2011 Kaerlighedshistorie, En (2001) 14.02.2013 Very similar modification dates. Link to comment
Fireball3 Posted September 8, 2015 Share Posted September 8, 2015 Is there any way to identify, which files have been affected by the parity error correction? I don't know if this is linked but here is an overview of files that are corrupted: Chances are good to find the corrupted files, if you have a backup. Run the rsync with the "checksum + dry run" option and you should get a list of files that will be copied from your backup. Link to comment
EdgarWallace Posted September 8, 2015 Author Share Posted September 8, 2015 Thanks Fireball. Is that the correct syntax? rsync -avH --stats --progress --checksum --ignore-existing --dry-run /mnt/user/Data/ rsync://tower2/mnt/user/Data/ It throws out a bunch of files that have been added/changed: Number of files: 9,465 (reg: 8,703, dir: 762) Number of created files: 37 (reg: 37) Number of deleted files: 0 Number of regular files transferred: 37 Total file size: 20,820,956,184 bytes Total transferred file size: 6,730,465 bytes Literal data: 0 bytes Matched data: 0 bytes File list size: 262,027 File list generation time: 0.757 seconds File list transfer time: 0.000 seconds Total bytes sent: 488,694 Total bytes received: 1,028 sent 488,694 bytes received 1,028 bytes 182.63 bytes/sec total size is 20,820,956,184 speedup is 42,515.87 (DRY RUN) Link to comment
Fireball3 Posted September 8, 2015 Share Posted September 8, 2015 Looks good, although I'm not sure about the --ignore-existing. Now it would be very comfortable to have a list of those files. Checking the manpage possible options are: --itemize-changes --list-only -vv -vvv Always keep the --dry-run in! You can specify a --log-file=[File] for later processing of the list. Link to comment
EdgarWallace Posted September 8, 2015 Author Share Posted September 8, 2015 Here is the next try....providing this output: root@Tower:/boot/custom# cat rsync.log 2015/09/08 17:20:51 [2097] opening tcp connection to tower2 port 873 2015/09/08 17:20:51 [2097] Connected to tower2 (192.168.178.10) 2015/09/08 17:20:51 [2097] msg checking charset: UTF-8 2015/09/08 17:20:51 [2097] (Client) Protocol versions: remote=31, negotiated=31 2015/09/08 17:20:51 [2097] building file list 2015/09/08 17:20:51 [2097] [sender] make_file(.,*,0) 2015/09/08 17:20:51 [2097] [sender] pushing local filters for /mnt/user/Fotos/ 2015/09/08 17:20:51 [2097] [sender] make_file(2001,*,2) ...followed by many many lines like this one: 2015/09/08 17:20:52 [2097] [sender] i=44 /mnt/user/Fotos 1970 - 1975/imm001.jpg mode=0100666 len=1,707,321 uid=99 gid=100 flags=0 Let's see how it looks like when there is any difference...hopefully that is visible then. Link to comment
EdgarWallace Posted September 15, 2015 Author Share Posted September 15, 2015 Not yet. I had to RMA my new Asrock board. Will report back as soon as the server is up and running again. Link to comment
Recommended Posts
Archived
This topic is now archived and is closed to further replies.