July 29, 2010

Background: I have a hard drive that is probably going bad (a Seagate ST32000542AS 2TB drive, firmware CC32) which is still in the array. Current SMART readings in unMENU look like this:

» reallocated_sector_ct=116
» reported_uncorrect=1
» high_fly_writes=31
» head_flying_hours=7.29071e+13
» attribute_241=1118686192
» attribute_242=1950323105
» ata_error_count=1

About 2.5 months ago, I replaced it with a new 2TB drive when its reallocated sector count was around 110 or so (reported_uncorrect and ata_error_count were already 1 each by that time). I hooked it up to a Windows machine and ran a full SeaTools scan on it, which didn't find any issues. I returned it to the unRAID server and did a preclear on it. I think the reallocated sector count increased to about 112 or 114 at that time. Since then it has increased to the current total of 116 during general use. When I returned it to the array, I changed controllers and cables.

Generally I've been able to write to the drive without problems. However, twice over the past two and a half months (once before I removed the drive, and again just yesterday) copying files to the drive has resulted in a full unRAID system hang. When a hang occurs, all the shares become inaccessible (they disappear from network view), the unRAID maintenance page does not respond (neither does unMENU), and I cannot telnet into the server or log in at the console, even after letting it sit for several hours to see whether some kind of drive timeout will occur. The server does answer ping.

I suspect a write error of some kind is occurring. Of course, I can't see anything in the syslog because I can't access the server, and I have to forcibly reboot, which results in a new log. A parity check starts on boot; the last one, which completed this morning, showed no sync errors.

Question: Is this ever the expected behavior of unRAID? Or should unRAID always recover more gracefully?

Is there anything I should try in order to address this issue? I can, of course, replace the problem drive, though I'm not sure how Seagate will feel about an RMA. I notice that during the RMA process they've highlighted the fact that they may return drives to you if they think the drive doesn't meet some unknown threshold of "not working".

Any thoughts/suggestions appreciated...
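One way to work around the lost-syslog problem described above is to copy the log to persistent storage at regular intervals, so the last copy survives a forced reboot. The following is only a minimal sketch, not something taken from the thread: the function name `snapshot_log` is hypothetical, the paths `/var/log/syslog` and `/boot/logs` mentioned afterwards are assumptions about where the live log and the flash drive are found, and it assumes a Python interpreter is available on the server, which may not be the case on a stock unRAID install.

```python
import shutil
import time
from pathlib import Path


def snapshot_log(src: Path, dest_dir: Path, keep: int = 5) -> Path:
    """Copy `src` into `dest_dir` under a timestamped name.

    Only the newest `keep` snapshots are retained so the destination
    (for example a small flash drive) does not fill up.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"syslog-{stamp}.txt"
    shutil.copy2(src, dest)

    # Timestamped names sort chronologically, so everything before the
    # last `keep` entries is an older snapshot that can be removed.
    snapshots = sorted(dest_dir.glob("syslog-*.txt"))
    for old in snapshots[:-keep]:
        old.unlink()
    return dest
```

Called every minute or so from a scheduler, for example as `snapshot_log(Path("/var/log/syslog"), Path("/boot/logs"))`, this would leave a copy of the log from shortly before the hang. It cannot capture anything the kernel logs after the system stops scheduling the copy, and frequent writes add wear to a flash drive.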
July 29, 2010

You have other issues going on if unRAID hangs. Typical behavior is that unRAID marks the drive as damaged but stays responsive, with the array simulating the drive.
July 29, 2010

I have seen a corrupted file system on a disk cause a kernel panic, which then crashes the server. You might want to check for corruption by following the instructions in the wiki: http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems
July 30, 2010 (Author)

Thanks for the feedback guys. I ran the check and got nothing of note:

    ###########
    reiserfsck --check started at Thu Jul 29 16:37:17 2010
    ###########
    Replaying journal: Done.
    Reiserfs journal '/dev/md11' in blocks [18..8211]: 0 transactions replayed
    Checking internal tree.. finished
    Comparing bitmaps..finished
    Checking Semantic tree: finished
    No corruptions found
    There are on the filesystem:
        Leaves 91631
        Internal nodes 566
        Directories 3077
        Other files 5554
        Data block pointers 92331333 (0 of them are zero)
        Safe links 0
    ###########
    reiserfsck finished at Thu Jul 29 16:45:55 2010
    ###########

When the server hangs it does answer pings, but that's it.

I've opened a case with Seagate to see if I can just RMA this drive. In the meantime, the drive is back online and I've written to it as if there is no issue.
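With the drive back in service, it may be worth tracking how the SMART counters move over time. Below is a rough sketch, assuming the readings are saved as `name=value` lines like the ones quoted in the first post. The helper names `parse_readings` and `changed_readings` are hypothetical, and the "earlier" values are the approximate figures given in the thread (about 110 reallocated sectors, with the two error counts already at 1), not an actual saved report.

```python
def parse_readings(text: str) -> dict[str, float]:
    """Parse 'name=value' lines (optionally prefixed with '»') into a dict."""
    readings: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip().lstrip("»").strip()
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a name=value line
        try:
            readings[name.strip()] = float(value)
        except ValueError:
            continue  # skip values that are not numeric
    return readings


def changed_readings(old: dict[str, float], new: dict[str, float]) -> dict[str, float]:
    """Return the change for every attribute present in both snapshots that moved."""
    return {
        name: new[name] - old[name]
        for name in new
        if name in old and new[name] != old[name]
    }


# Approximate earlier values as described in the thread (assumption).
earlier = parse_readings(
    "reallocated_sector_ct=110\nreported_uncorrect=1\nata_error_count=1"
)
# Current values as quoted from unMENU in the first post.
current = parse_readings(
    "» reallocated_sector_ct=116\n» reported_uncorrect=1\n» ata_error_count=1"
)
print(changed_readings(earlier, current))
```

Here only `reallocated_sector_ct` differs between the two snapshots, which matches the poster's description of a slowly growing reallocated sector count while the other error counters stay flat.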