May 2, 2014 (Author)

IT WORKED!!!

OK, the reiserfsck process has completed and generated a huge amount of output. The full output is linked here: https://www.dropbox.com/s/ol2pjw82vzv0hqy/reiserfsck-output.txt

When I browsed to /mnt/disk4, all I got was a folder called "lost+found", which contains probably over a hundred folders with numbers for names (sectors?). The good news is that I browsed through each of those folders and managed to find all the missing movies and TV shows. I created the /mnt/disk4/movies subfolder and copied the movies into it, then created /mnt/disk4/tvshows and copied the TV shows into it. Now when I browse my /mnt/user/movies share, I see all of my movies. I'm still not sure whether this is for real, but it certainly looks like it.

I am wondering whether I should do a parity check now. If so, should it be a non-correcting check, or should I go ahead and do a correcting one? I was thinking that I might need a non-correcting check so that I know whether any of my movie files are corrupted. I spot-checked some of my movie files at random and they all seem to play well right from the share without a problem.

Now, going back to normal operation, I'd like to re-enable my cache drive and restart unRAID in normal mode (with plugins), but I'd like to disable all the dynamix plugins and keep only Sabnzbd, Sickbeard, Couchpotato and Crashplan. But I will wait for a parity sync (according to your instructions) before I do that.

Either way, I really want to thank you for all the help you gave me on this. I don't know what I would have done without your help. Thank you so much.

Below is the final screenshot.
May 2, 2014

When we rebuilt disk4, it used parity and all of the other disks to do the rebuild. So after the rebuild, parity would be perfect. (The rebuild is the technical equivalent of building parity, except we were building the data disk instead of the parity disk. Think of all of the disks as a set: if you pull out any one of them, you can rebuild it from the others.)

When you ran reiserfsck, you ran it on the "md4" device and not the "sdd1" device. The mdX devices update parity, so parity would have been maintained and no parity inconsistency should exist. If you'd like to confirm, you could run a correcting or non-correcting check. Both should come back with no problems.

I've been considering why disk4 did not come up perfect - clearly there were some writes to a disk that were not captured by parity. I am not 100% certain, but mounting disk3 in the cache slot might have done some updates in the housekeeping area. If I had it to do over, I think I'd put disk3 back in the array rather than in the cache slot. But at the time I didn't want to do any harm, as we had disk3 simulated there and I felt that some data could be saved from that if the real disk3 was toasted. So I wanted to leave the array alone while we tested it out.

But the result seems pretty good. I'm happy for you. Try to go back through the steps we did and understand the reasons why. It will help keep you out of trouble!

Hope you have a good weekend!
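To make the "set of disks" idea above concrete, here is a minimal sketch, assuming simple XOR parity computed byte by byte across the data disks. The tiny 4-byte "disks" and the helper name xor_blocks are made up for illustration and are not unRAID code. It also shows why a write that bypasses parity (for example, writing to the raw device instead of the mdX device) would leave a later rebuild slightly wrong.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR several equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "disks" (real disks are just much longer byte strings).
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0F, 0x0E, 0x0D, 0x0C])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])

# Building parity: XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# Rebuilding a missing disk is the same operation over the remaining members.
rebuilt_disk2 = xor_blocks([disk1, disk3, parity])
print("clean rebuild matches:", rebuilt_disk2 == disk2)

# A write to disk3 that does NOT update parity leaves parity stale...
disk3_changed = bytes([0xAB]) + disk3[1:]
bad_rebuild = xor_blocks([disk1, disk3_changed, parity])

# ...so rebuilding disk2 afterwards is wrong exactly where disk3 was changed.
print("rebuild after untracked write matches:", bad_rebuild == disk2)
```

The second case is only an illustration of the general mechanism, not a diagnosis of what actually happened to disk4.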
May 4, 2014 (Author)

So I ran into a problem. I finished a correcting parity check, and it corrected a number of inconsistencies. That was not the problem. The problem is that at the end of the check, the errors column shows 12 errors for the PARITY disk (/dev/sdd). I looked at the syslog, which was relatively quiet, but it contains the following errors for the parity disk. Does this mean that the parity disk is bad and that I need to replace it? This is worrisome, since we just used the parity disk to rebuild disk4. Can someone please tell me what this means?

May 3 21:19:24 Tower kernel: sd 1:0:0:0: [sdd] Unhandled sense code
May 3 21:19:24 Tower kernel: sd 1:0:0:0: [sdd]
May 3 21:19:24 Tower kernel: Result: hostbyte=0x00 driverbyte=0x08
May 3 21:19:24 Tower kernel: sd 1:0:0:0: [sdd]
May 3 21:19:24 Tower kernel: Sense Key : 0x3 [current] [descriptor]
May 3 21:19:24 Tower kernel: Descriptor sense data with sense descriptors (in hex):
May 3 21:19:24 Tower kernel: 72 03 11 00 00 00 00 0c 00 0a 80 00 00 00 00 01
May 3 21:19:24 Tower kernel: 7e cb 86 80
May 3 21:19:24 Tower kernel: sd 1:0:0:0: [sdd]
May 3 21:19:24 Tower kernel: ASC=0x11 ASCQ=0x0
May 3 21:19:24 Tower kernel: sd 1:0:0:0: [sdd] CDB:
May 3 21:19:24 Tower kernel: cdb[0]=0x88: 88 00 00 00 00 01 7e cb 86 70 00 00 00 18 00 00
May 3 21:19:24 Tower kernel: end_request: critical target error, dev sdd, sector 6422234736
May 3 21:19:24 Tower kernel: md: disk0 read error, sector=6422234672
May 3 21:19:24 Tower kernel: md: disk0 read error, sector=6422234680
May 3 21:19:24 Tower kernel: md: disk0 read error, sector=6422234688
May 3 21:26:24 Tower kernel: mdcmd (47): spindown 6
May 3 22:12:43 Tower kernel: sd 1:0:0:0: [sdd] Unhandled sense code
May 3 22:12:43 Tower kernel: sd 1:0:0:0: [sdd]
May 3 22:12:43 Tower kernel: Result: hostbyte=0x00 driverbyte=0x08
May 3 22:12:43 Tower kernel: sd 1:0:0:0: [sdd]
May 3 22:12:43 Tower kernel: Sense Key : 0x3 [current] [descriptor]
May 3 22:12:43 Tower kernel: Descriptor sense data with sense descriptors (in hex):
May 3 22:12:43 Tower kernel: 72 03 11 00 00 00 00 0c 00 0a 80 00 00 00 00 01
May 3 22:12:43 Tower kernel: 9d f8 4f b8
May 3 22:12:43 Tower kernel: sd 1:0:0:0: [sdd]
May 3 22:12:43 Tower kernel: ASC=0x11 ASCQ=0x0
May 3 22:12:43 Tower kernel: sd 1:0:0:0: [sdd] CDB:
May 3 22:12:43 Tower kernel: cdb[0]=0x88: 88 00 00 00 00 01 9d f8 4f a8 00 00 00 20 00 00
May 3 22:12:43 Tower kernel: end_request: critical target error, dev sdd, sector 6945263528
May 3 22:12:43 Tower kernel: md: disk0 read error, sector=6945263464
May 3 22:12:43 Tower kernel: md: disk0 read error, sector=6945263472
May 3 22:12:43 Tower kernel: md: disk0 read error, sector=6945263480
May 3 22:12:43 Tower kernel: md: disk0 read error, sector=6945263488
May 3 22:12:46 Tower kernel: sd 1:0:0:0: [sdd] Unhandled sense code
May 3 22:12:46 Tower kernel: sd 1:0:0:0: [sdd]
May 3 22:12:46 Tower kernel: Result: hostbyte=0x00 driverbyte=0x08
May 3 22:12:46 Tower kernel: sd 1:0:0:0: [sdd]
May 3 22:12:46 Tower kernel: Sense Key : 0x3 [current] [descriptor]
May 3 22:12:46 Tower kernel: Descriptor sense data with sense descriptors (in hex):
May 3 22:12:46 Tower kernel: 72 03 11 00 00 00 00 0c 00 0a 80 00 00 00 00 01
May 3 22:12:46 Tower kernel: 9d f8 4f c8
May 3 22:12:46 Tower kernel: sd 1:0:0:0: [sdd]
May 3 22:12:46 Tower kernel: ASC=0x11 ASCQ=0x0
May 3 22:12:46 Tower kernel: sd 1:0:0:0: [sdd] CDB:
May 3 22:12:46 Tower kernel: cdb[0]=0x88: 88 00 00 00 00 01 9d f8 4f c8 00 00 00 28 00 00
May 3 22:12:46 Tower kernel: end_request: critical target error, dev sdd, sector 6945263560
May 3 22:12:46 Tower kernel: md: disk0 read error, sector=6945263496
May 3 22:12:46 Tower kernel: md: disk0 read error, sector=6945263504
May 3 22:12:46 Tower kernel: md: disk0 read error, sector=6945263512
May 3 22:12:46 Tower kernel: md: disk0 read error, sector=6945263520
May 3 22:12:46 Tower kernel: md: disk0 read error, sector=6945263528
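For readers trying to make sense of the hex in the log above, here is a small sketch that decodes the first error by hand. It relies on the standard SCSI layouts: opcode 0x88 is READ(16), with an 8-byte big-endian LBA in bytes 2-9 and a 4-byte transfer length in bytes 10-13, and response code 0x72 is descriptor-format sense data. Sense key 0x3 is MEDIUM ERROR, and ASC 0x11 / ASCQ 0x00 is "unrecovered read error". The helper names are hypothetical, and the sense decoder assumes the first descriptor is an Information descriptor, as it is in this log.

```python
def decode_read16(cdb_hex):
    """Return (lba, transfer_length) from a READ(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")
    length = int.from_bytes(cdb[10:14], "big")
    return lba, length


def decode_descriptor_sense(sense_hex):
    """Return (sense_key, asc, ascq, information) from descriptor-format sense data.

    Assumption: the first descriptor (starting at byte 8) is an Information
    descriptor (type 0x00) with its VALID bit set, as in the log above.
    """
    sense = bytes.fromhex(sense_hex)
    if sense[0] & 0x7F != 0x72:
        raise ValueError("not current descriptor-format sense data")
    key, asc, ascq = sense[1] & 0x0F, sense[2], sense[3]
    info = None
    if len(sense) >= 20 and sense[8] == 0x00 and sense[10] & 0x80:
        info = int.from_bytes(sense[12:20], "big")
    return key, asc, ascq, info


# Bytes copied from the 21:19:24 entries in the log.
lba, length = decode_read16("88 00 00 00 00 01 7e cb 86 70 00 00 00 18 00 00")
key, asc, ascq, failed_lba = decode_descriptor_sense(
    "72 03 11 00 00 00 00 0c 00 0a 80 00 00 00 00 01 7e cb 86 80"
)

print("read started at sector", lba, "for", length, "sectors")
print("sense key", hex(key), "ASC", hex(asc), "ASCQ", hex(ascq))
print("sector reported in sense data:", failed_lba)

# In this log the md sector numbers are 64 lower than the sdd sector numbers
# (presumably the partition start offset; that interpretation is an assumption).
print("same read in md numbering starts at", lba - 64)
```

The decoded start sector matches the "end_request ... sector 6422234736" line, and subtracting 64 gives the first "md: disk0 read error" sector, so the kernel and md messages are describing the same failed read.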
May 4, 2014

Get a SMART report on the parity disk. My guess is that the cabling to that disk is loose or the cable is bad. Any parity errors the check found were false errors; parity should have been perfect. The good news is that parity doesn't have to be good as long as the data is good. Don't do any writing.
May 4, 2014 (Author)

Here is the SMART report for the parity disk. Any thoughts? I'm concerned that I have now gotten these "read" errors on more than one of my drives during the last few weeks. The only thing I can think of is a defective PSU. My PSU is a Corsair HX750, but a few weeks ago I had a major mishap when I accidentally plugged in a molex connector in the reverse orientation and fried a bunch of my hard drives. I recovered the hard drives, but the PSU in use today is the same one I had back then. Could a defective PSU power regulator be causing this problem?

root@Tower:~# smartctl -a /dev/sdd
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Desktop HDD.15
Device Model:     ST4000DM000-1F2168
Serial Number:    Z3006MBD
LU WWN Device Id: 5 000c50 04fbbb4f4
Firmware Version: CC51
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun May 4 00:42:17 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started.
    Auto Offline Data Collection: Disabled.
Self-test execution status: ( 249) Self-test routine in progress...
    90% of test remaining.
Total time to complete Offline data collection: ( 612) seconds.
Offline data collection capabilities: (0x73) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    No Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 532) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       97814225
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       255
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       2872
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       66650724
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9195
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       113
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   078   078   000    Old_age   Always       -       22
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   083   083   000    Old_age   Always       -       17
190 Airflow_Temperature_Cel 0x0022   073   051   045    Old_age   Always       -       27 (Min/Max 18/32)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       99
193 Load_Cycle_Count        0x0032   040   040   000    Old_age   Always       -       121557
194 Temperature_Celsius     0x0022   027   049   000    Old_age   Always       -       27 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6463h+55m+08.836s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       83740876584
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       267253903257

SMART Error Log Version: 1
ATA Error Count: 22 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 22 occurred at disk power-on lifetime: 9194 hours (383 days + 2 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00   4d+08:06:57.905  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+08:06:35.849  READ FPDMA QUEUED
  60 00 28 ff ff ff 4f 00   4d+08:06:35.813  READ FPDMA QUEUED
  60 00 50 ff ff ff 4f 00   4d+08:06:35.809  READ FPDMA QUEUED
  60 00 90 ff ff ff 4f 00   4d+08:06:35.809  READ FPDMA QUEUED

Error 21 occurred at disk power-on lifetime: 9193 hours (383 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 ff ff ff 4f 00   4d+06:32:56.147  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   4d+06:32:56.073  READ LOG EXT
  60 00 88 ff ff ff 4f 00   4d+06:32:36.821  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00   4d+06:32:36.820  READ FPDMA QUEUED
  60 00 78 ff ff ff 4f 00   4d+06:32:36.819  READ FPDMA QUEUED

Error 20 occurred at disk power-on lifetime: 9193 hours (383 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 88 ff ff ff 4f 00   4d+06:32:36.821  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00   4d+06:32:36.820  READ FPDMA QUEUED
  60 00 78 ff ff ff 4f 00   4d+06:32:36.819  READ FPDMA QUEUED
  60 00 88 ff ff ff 4f 00   4d+06:32:36.819  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00   4d+06:32:36.819  READ FPDMA QUEUED

Error 19 occurred at disk power-on lifetime: 9192 hours (383 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 ff ff ff 4f 00   4d+05:39:26.445  READ FPDMA QUEUED
  60 00 30 ff ff ff 4f 00   4d+05:39:26.435  READ FPDMA QUEUED
  60 00 58 ff ff ff 4f 00   4d+05:39:26.430  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00   4d+05:39:26.430  READ FPDMA QUEUED
  60 00 70 ff ff ff 4f 00   4d+05:39:26.429  READ FPDMA QUEUED

Error 18 occurred at disk power-on lifetime: 9072 hours (378 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00   5d+05:35:29.470  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   5d+05:35:29.263  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   5d+05:35:29.236  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   5d+05:35:29.228  READ FPDMA QUEUED
  e5 00 00 00 00 00 00 00   5d+05:35:29.070  CHECK POWER MODE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                        Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%        9195             -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
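When reading a report like the one above, it can help to pull out just the handful of attributes people usually look at first. The sketch below is only an illustration: it parses a few attribute rows copied from this report and lists the watched ones with a non-zero raw value. The choice of watched IDs (5, 187, 197, 198, plus 199, which is commonly associated with cabling or interface problems) is an assumption about what is worth checking, not a rule from smartmontools, and the function name is made up.

```python
# Rows copied from the attribute table in the report above.
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       2872
187 Reported_Uncorrect      0x0032   078   078   000    Old_age   Always       -       22
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Assumed watch list (hypothetical choice for this sketch).
WATCHED = {5, 187, 197, 198, 199}


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for each table row in the text."""
    attrs = {}
    for line in text.splitlines():
        # Ten columns; the tenth (RAW_VALUE) may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            first_raw = fields[9].split()[0]
            raw = int(first_raw) if first_raw.isdigit() else first_raw
            attrs[int(fields[0])] = (fields[1], raw)
    return attrs


attrs = parse_attributes(REPORT)
flagged = {
    attr_id: value
    for attr_id, value in attrs.items()
    if attr_id in WATCHED and isinstance(value[1], int) and value[1] > 0
}

for attr_id, (name, raw) in sorted(flagged.items()):
    print(attr_id, name, raw)
```

On the rows above this lists Reallocated_Sector_Ct and Reported_Uncorrect, while the pending-sector and CRC counters are zero. What that means for this particular disk is for the thread to judge; the script only does the bookkeeping.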
This topic is now archived and is closed to further replies.