November 23, 2015

Got an email from my server this morning indicating that Disk 8 is having problems. Main shows the following:
http://my.jetscreenshot.com/12412/20151123-ir3q-124kb.jpg

Details on the drive show the following:
http://my.jetscreenshot.com/12412/20151123-h3e4-91kb.jpg

I'm not able to run SMART on the drive from within the console. The SMART report follows:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.0.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: /1:0:4:0
Product:
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
Physical block size: 3807568608 bytes
Lowest aligned LBA: 14896
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

I'm assuming my only option is to replace the drive? Would taking it offline and reformatting do anything beneficial?

Thanks!
Jeff...
November 23, 2015 Community Expert

The disk is marked as disabled because a write to it failed. Once that happens, it stays in that state until you take appropriate recovery action.

It is possible the disk dropped offline due to a cabling/power problem. If so, rebooting the system should bring it back online and give you a chance to get a SMART status. If you still have problems, then the drive has probably really failed and needs replacing.
November 24, 2015 Author

Good call. Rebooted and the disk became responsive to SMART.

SMART ERROR LOG:

ATA Error Count: 327 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 327 occurred at disk power-on lifetime: 18071 hours (752 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 00  Error: ICRC, ABRT at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 07 00 00 40 ff      00:09:34.607  READ FPDMA QUEUED
  60 08 00 00 00 00 40 00      00:09:34.601  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      00:09:34.601  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      00:09:34.601  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:09:34.600  IDENTIFY DEVICE

Error 326 occurred at disk power-on lifetime: 18071 hours (752 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 07 00 00 00  Error: ICRC, ABRT at LBA = 0x00000007 = 7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 07 00 00 40 ff      00:09:34.297  READ FPDMA QUEUED
  60 08 00 00 00 00 40 00      00:09:34.260  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      00:09:31.588  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      00:09:31.588  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:09:31.587  IDENTIFY DEVICE

Error 325 occurred at disk power-on lifetime: 18071 hours (752 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:08:32.395  IDENTIFY DEVICE
  b0 da 00 00 4f c2 40 00      00:08:27.064  SMART RETURN STATUS
  25 00 01 ae 88 e0 40 00      00:08:26.622  READ DMA EXT
  ef 03 42 00 00 00 40 00      00:08:22.208  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 00 00      00:08:22.207  IDENTIFY DEVICE

Error 324 occurred at disk power-on lifetime: 18071 hours (752 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:07:30.212  IDENTIFY DEVICE
  b0 da 00 00 4f c2 40 00      00:07:24.880  SMART RETURN STATUS
  25 00 01 ae 88 e0 40 00      00:07:24.478  READ DMA EXT
  ef 03 42 00 00 00 40 00      00:07:20.045  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 00 00      00:07:20.044  IDENTIFY DEVICE

Error 323 occurred at disk power-on lifetime: 18071 hours (752 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:05:40.721  IDENTIFY DEVICE
  b0 da 00 00 4f c2 40 00      00:05:35.420  SMART RETURN STATUS
  25 00 01 ae 88 e0 40 00      00:05:34.977  READ DMA EXT
  ef 03 42 00 00 00 40 00      00:05:30.559  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 00 00      00:05:30.558  IDENTIFY DEVICE

Ran a long and a short test on the drive. No errors reported.
http://my.jetscreenshot.com/12412/20151124-lx58-35kb.jpg

Should I pull the drive, do a low-level format, reinsert it into the array, and let it get rebuilt?
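
For anyone reading the error log above: the ER column is a bit field, and the value 84 is what smartctl expands to "ICRC, ABRT". ICRC is an interface CRC error, which often points at the cable or connection rather than the platters, though that is not conclusive on its own. The sketch below is only an illustration of that decoding, not how smartctl itself is implemented. The flag table is deliberately partial, and the function names are made up for this example.

```python
# Partial table of ATA error-register flags (bit mask, abbreviation).
# Assumption: only the commonly documented bits are listed here; the
# remaining bits are left out rather than guessed.
ERROR_FLAGS = [
    (0x80, "ICRC"),  # interface CRC error on the link
    (0x40, "UNC"),   # uncorrectable data error
    (0x10, "IDNF"),  # requested sector ID not found
    (0x04, "ABRT"),  # command aborted
]


def decode_error_register(er):
    """Return the flag names set in an ER value, highest bit first."""
    return [name for mask, name in ERROR_FLAGS if er & mask]


def lifetime_days_hours(power_on_hours):
    """Split power-on hours into (days, hours), as the log header does."""
    return divmod(power_on_hours, 24)


if __name__ == "__main__":
    er = int("84", 16)  # ER column from the log, printed in hex
    print(", ".join(decode_error_register(er)))  # prints "ICRC, ABRT"
    print(lifetime_days_hours(18071))            # prints "(752, 23)"
```

All five logged errors carry the same ER value of 84, so they decode to the same two flags even where smartctl did not print the "Error:" text.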
November 24, 2015 Community Expert

That SMART report you posted only contains part of the SMART information. We really want to see the current values of the SMART attributes on all your drives to see if any look suspicious. A syslog is also a good idea, to see if any errors are being reported. If you post the file produced by Tools->Diagnostics, it will include both of these (plus other useful diagnostic information).

A low-level format is unlikely to help on modern drives, so there is not much point in doing that.

After checking the diagnostics output, if that all looks OK, you can force a disk to rebuild onto itself with the following process:

1. Stop the array and unassign the problem disk.
2. Start the array with the disk unassigned. You should get a warning about the array being unprotected, but it should start OK. The purpose of this step is to make unRAID 'forget' the serial number of the problem disk.
3. Stop the array and reassign the problem disk. unRAID will now treat the disk as if it were a new replacement disk and tell you that starting the array will trigger a rebuild.
4. Start the array and let the rebuild go ahead.
November 24, 2015 Author

Ah, sorry, thought I was including all that was needed. I've run the collection process. File attached.

hunternas-diagnostics-20151124-1048.zip
December 3, 2015 Author

OK, so the new drives arrived today (3 TB). I shut down the server, pulled the old drive, and inserted the new drive in the appropriate slot. When the server rebooted, it started a rebuild without my asking it to! I stopped the rebuild, shut down the array, and tried to do a preclear. Preclear could not see the disk (preclear_disk.sh -l). I get:

root@HunterNAS:/boot# pc.sh -l
====================================1.15
 Disks not assigned to the unRAID array
  (potential candidates for clearing)
========================================
 No un-assigned disks detected

I rebooted the server, pulled the new drive, and rebooted the server. Then I shut down the server again, inserted the new drive, and rebooted.

This is the Array Devices view showing Disk 8 unassigned:
http://my.jetscreenshot.com/12412/20151203-8dis-68kb.jpg

With the array offline, you see the new drive in the unassigned slot:
http://my.jetscreenshot.com/12412/20151203-kx81-31kb.jpg

But when I open the preclear plugin, no drive shows up:
http://my.jetscreenshot.com/12412/20151203-gic9-38kb.jpg

And when I run preclear_disk.sh -l, I see the following:
http://my.jetscreenshot.com/12412/20151203-lwhd-20kb.jpg

So I can't preclear the disk. However, if the disk gets assigned to Disk 8, then a rebuild is possible.

I must be missing something basic...
December 3, 2015 Author

Update: I used the second drive with the array stopped (so it would auto rebuild) and was able to get the disk found. Evidently the combination of events I was doing did not allow unRAID to forget the drive.

Is there a step-by-step process for replacing a drive when you have only one server and need to preclear the drive on the same server it's being installed in?

Here's what I think the steps are...

1. I have drive bays, so I can turn off power to each drive individually. So turn off power to the affected drive. Originally I powered down the server; this was my failure point, because I have autostart of the array in the go file.
2. Replace the bad drive with the new drive.
3. Power on the drive.
4. Load the GUI and navigate to Settings>Preclear Disk; the new drive is displayed. Alternatively, you can go old school and telnet in to run preclear_disk.sh.
5. When the preclear finishes, go to Main>Array Operation and stop the array.
6. Go to Main>Array Devices and assign the drive to the failed disk's slot.
7. Start the array (and be happy you didn't lose a second drive during this whole process! And also be happy that the next versions of unRAID have 2-disk tolerance on the roadmap!)

Correct?

If you don't have the ability to power down a drive individually, I suppose you'd have to edit the go file and remove cd /boot/unmenu & /uu??
December 3, 2015

"... When the server rebooted, it started a rebuild without my asking it to!"

Not sure what you did, but ... unRAID will NEVER rebuild a replacement disk automatically. You need to tell it which 'new' disk you want to use. This assignment is done when the array is offline, and once you bring the array online it starts the rebuild process. These are all steps a user has to take.
December 3, 2015 Author

Yep, sorry, I rushed documenting that step. I powered up and assigned the drive, and when I did that it started to rebuild. Then I realized that I needed to preclear first. After I had unassigned the drive, I could not do anything with it other than let the rebuild go.
December 3, 2015 Community Expert

"... I have autostart of the array in the go file."

Autostart is in Disk Settings. It is not in go. Starting the webUI is in go, and the webUI will start the array if it is set to autostart.

"... I supposed you'd have to edit the go file and remove cd /boot/unmenu & /uu??"

Those lines in go start unMenu, which has nothing to do with any of this. The emhttp line is the one that starts the webUI. You don't want to remove that line!
December 4, 2015 Author

Duh, of course on both points. Once I got the server going it's been flawless... how quickly we forget...