Dougy Posted May 16, 2013
I was previously running rc4 but had been having problems with the server dropping off the network every so often and needing to be restarted. This became more frequent, so I ran a memory test overnight. This did not reveal any problems, so I decided to upgrade to rc12a. I previously had some plugins installed and did not remember the exact details, so I decided to start with a fresh install. This all seemed to run fine. I assigned the disks as they had been this morning and all looked good, so I brought it online and started a parity check.
I just got home from work and realized that one share that spans several disks, including disk 1, is missing. When I open the share for disk 1 it shows no files! The parity check is at 87% and shows 470258523 errors for disk 1!!!
Please help, is there any way I can recover the contents of disk 1?
The syslog is 35mb as a txt file and zips down to 635kb. What is the best way to share it?
Dougy Posted May 16, 2013
Just checked the server and found these errors at the command prompt.... doesn't look good.
garycase Posted May 16, 2013
Did you do a parity check BEFORE you upgraded, so you know your parity is good? And do all of the other disks show zero errors? If so, you can try 2 things:
(1) Reseat the cables for Disk 1 and try the parity check again (or just try to access it and see if you can read the contents okay). The errors you're seeing look like Reiser file system errors -- you need to run a reiserfsck check on it ... you need to wait for a Linux guru to advise you on how best to run this. [Or read the details here and be VERY careful: http://lime-technology.com/wiki/index.php/Check_Disk_Filesystems ]
(2) You should be able to read the contents of Disk 1 even if it's not able to provide that data => the fault tolerance will "take over" and the data will be reconstructed from your other disks and the parity drive ... IF the parity is good (thus my first question).
In any event, if you don't have backups of your data (you SHOULD, of course), then you should definitely back up what's on Disk 1 NOW before proceeding.
garycase Posted May 16, 2013
... just to be clear, I'd back up the contents of Disk 1 BEFORE running the reiserfsck tool on it. While reiserfsck shouldn't cause any data loss, I'm always VERY careful before doing anything that may alter the contents of a disk that contains data you need to recover.
Dougy Posted May 16, 2013
Thanks for your reply. Yes, parity was valid before the upgrade and all the other discs have 0 errors. Currently the disk 1 share shows no files on the disk, so I can't back up the data. What I don't understand is why the disk hasn't red balled if it's a loose cable etc. I will power down and try reseating the cables after the parity check finishes. I will need to wait for a Linux guru to chime in as I have little experience....
garycase Posted May 16, 2013
Whoa -- are you saying that Disk1 still shows as green ... AND that if you access \\Tower\Disk1 it shows as empty?? If that's the case, I'd do NOTHING until you either hear from a Linux guru (hopefully Joe L will chime in), or you can send a note to Limetech with a pointer to this thread and a copy of your syslog.
Dougy Posted May 16, 2013
Yep, that is the case. I'm not sure how to share the syslog as it is really big... probably because of all the errors. This looks bad; really hoping Joe can help.
garycase Posted May 16, 2013
Quoting Dougy: "the syslog is 35mb as a txt file and zips down to 635kb. what is the best way to share it?"
Wow! That's a BIG log. I'm not sure what the attachment size limit is here -- have you tried attaching the .zip file? If that's still too large, wait and see what sections of the log Joe L (or Limetech) need to see; then just save that section and zip it. But first, see if you can simply attach it to a reply -- it won't let you if it's too large, and I assume that if that's an issue, it will also tell you what the max size is.
Dougy Posted May 16, 2013
It zips down to 600kb with max rar compression. Still too large to attach here.
Wow..... the parity check finished with 470258525 errors on disc 1, but now the files are back!!! Can anyone explain what happened?
binhex Posted May 16, 2013
It could still be green balled, as the errors column in the webui is for read errors, not write errors; the drive will go red ball only on write errors. Sounds like the disk is in a bad way to me. Post a SMART report.
Dougy Posted May 16, 2013
Things keep getting weirder.... after a restart disk 1 is redballed, but I am able to restart the array and access the files on disk 1. I have attached the current syslog as it is smaller.
So, is the parity still valid and disc 1 is being emulated?
I don't remember how to run a SMART report without simple features... can somebody give me step by step instructions? Sorry.
Attachment: syslog2.txt
binhex Posted May 16, 2013
Yes, it will be emulated. You can get a SMART report by running the following from the command line (using putty):
    smartctl -a -d ata /dev/xxx >/boot/smart.txt
where xxx is the device in question, e.g. sda. Then attach smart.txt here.
If your parity is valid, and it looks like it is, then the worst case scenario is to rip out disk1, replace it and let parity rebuild it. But first a SMART report is required to confirm the disk is faulty; it could be PSU related.
Dougy Posted May 16, 2013
Here is the SMART report. I'm thinking maybe I bumped a cable when removing the thumb drive. I don't understand why the drive didn't redball during the parity check... but if I am currently running under emulation for disk 1, then does that mean the parity is still correct, even though I was getting errors for disk 1 and there appeared to be writes to the parity during the parity check?
Attachment: smart.txt
garycase Posted May 16, 2013
"... disk 1 is redballed...... but I am able to restart the array and access the files on disk 1" ==> That's what fault-tolerance is all about. When you read Disk 1, you're not actually using it at all. ALL of the other disks, plus the parity disk, are being read, and the data is being reconstructed. In RAID lingo, this is called "at risk" ... since you're running in a state where any more failures will result in data loss.
Note that if there were writes to the parity drive during the parity check, your data MAY not actually be correct. When a correcting parity check is done, UnRAID always assumes that any error is on the parity drive, and corrects the parity drive based on that assumption. Nevertheless, at this point, there's not much else you can do except hope that parity is in fact correct.
At this point, the issue could be (a) cables (re-seat them); (b) power (not likely, since it was fine until now, but always possible); or (c) a failing/failed disk. The SMART test says it Passed, but shows a lot of read errors, which may simply be due to cable seating issues. I'd reseat the cables, boot again, and try a rebuild of Disk 1. If it doesn't work with that disk, you'll need to replace it.
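For anyone wondering how a red-balled disk can still be read, here is a minimal Python sketch of the idea, assuming simple single-disk XOR parity across the data disks. The three tiny 4-byte "disks" and the `xor_blocks` helper are made up for illustration; this is not UnRAID's actual code. It also shows the risk described above: if parity were rewritten based on a bad read of Disk 1, the "emulated" Disk 1 would just reproduce the bad data.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "disks"; a real disk is just a much longer block.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
disk3 = b"\xaa\xbb\xcc\xdd"

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# Disk 1 is unavailable: rebuild its contents from the other disks plus parity.
emulated_disk1 = xor_blocks([disk2, disk3, parity])
print(emulated_disk1 == disk1)

# Assumed failure scenario: parity gets "corrected" from a bad read of disk 1.
bad_read = bytes(4)  # pretend disk 1 returned zeros instead of its data
rewritten_parity = xor_blocks([bad_read, disk2, disk3])
wrong_disk1 = xor_blocks([disk2, disk3, rewritten_parity])
print(wrong_disk1 == disk1)
```

The first comparison is true and the second is false, which is why it matters whether the parity check that logged the errors was also writing corrections.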
binhex Posted May 16, 2013
Just to reiterate, the errors column is only to do with READ errors, not write errors. When you click on parity check it will read the source drive and read the parity disk and compare. In your situation it was unable to read the source disk and thus logged it as an error. Write errors will be shown as a disk having a red ball, which is what eventually happened with this disk, so it is now failing reads and writes.
Looking at your SMART report, you have a large number of pending sectors, which is not a good sign:
    197 Current_Pending_Sector 0x0032 196 196 000 Old_age Always - 1446
I've also noticed that the drive heads have emergency retracted 102 times as power loss was detected by the drive:
    192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 102
You will need to check the power supply (and cabling). This could be an intermittent connection, a bad drive tray, a power supply splitter/connector, or just a power supply too small for the total number of drives. If power issues are ruled out then I would say remove the drive, replace it (with a pre-cleared drive) and let parity rebuild.
Edit - did you do a parity check with the option "Correct any Parity-Check errors by writing the Parity disk with corrected parity." ticked when you saw the errors for disk1?
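If you want to pull those numbers out of a saved report rather than eyeballing it, something like the Python sketch below could work. It assumes the usual ten-column attribute table that `smartctl -a` prints (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value); `parse_smart_attributes` and the `WATCH` list are names invented here, and which attributes are worth watching is a judgment call, not a rule.

```python
def parse_smart_attributes(report_text):
    """Return {attribute_name: raw_value} for rows of a smartctl attribute table."""
    attributes = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attributes[fields[1]] = int(fields[9])
            except ValueError:
                continue  # some drives report non-numeric raw values
    return attributes

# The two rows quoted in the post above.
sample = """\
197 Current_Pending_Sector 0x0032 196 196 000 Old_age Always - 1446
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 102
"""

attrs = parse_smart_attributes(sample)

# Attributes where a non-zero raw value deserves a closer look (illustrative list).
WATCH = ("Current_Pending_Sector", "Reallocated_Sector_Ct", "Power-Off_Retract_Count")
flagged = {name: attrs[name] for name in WATCH if attrs.get(name, 0) > 0}
print(flagged)
```

To run it against the real file, replace `sample` with the contents of /boot/smart.txt.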
Dougy Posted May 16, 2013
OK, so it's starting to look like a straightforward disc failure.... I just still don't understand why disc1 didn't red ball when the errors began. The current pending sectors are recent, as I had looked at SMART reports not too long ago and no discs had pending sectors. The power off retract count does look high, but this is one of my oldest drives, and as well as power outages this wouldn't be the first time I have had issues with loose cables.
As for the parity check, I did not change any options. Are you saying that by default the parity check does not write corrections to the parity drive? If not, that is good news and hopefully my data is still all intact.
garycase Posted May 16, 2013
If you're running 4.7, parity checks write corrections. If you're running v5.x, you have to be sure the box is checked to write the corrections. So it depends on whether you did the parity check before or after the upgrade.
Dougy Posted May 22, 2013
It seems I have bigger problems. I purchased a new 3tb hd to replace the failed 2tb drive. After a successful preclear I rebuilt the drive. Things seemed fine right up until the drive finished rebuilding, and then all of a sudden drive 6 showed errors and was red balled. I have no idea if this means the rebuild would have been full of errors?
An initial SMART report resulted in the following:
    Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)
    A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
It seems that the drive had gone 'offline'. After a reboot a SMART report results in:
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           3
      3 Spin_Up_Time            0x0027 205   165   021    Pre-fail Always  -           4733
      4 Start_Stop_Count        0x0032 099   099   000    Old_age  Always  -           1851
      5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
      7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
      9 Power_On_Hours          0x0032 078   078   000    Old_age  Always  -           16655
     10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
     11 Calibration_Retry_Count 0x0032 100   100   000    Old_age  Always  -           0
     12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           130
    192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           93
    193 Load_Cycle_Count        0x0032 179   179   000    Old_age  Always  -           63201
    194 Temperature_Celsius     0x0022 124   105   000    Old_age  Always  -           26
    196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
    197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
    198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
    199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
    200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           9
No sectors pending, but I have never seen multi zone error rate errors before... what do they mean?
I reseated the cables, then unassigned the disc, rebooted and began a rebuild, but almost straight away it red balled again and went offline again.
This is getting really frustrating, going on a week and I still can't use my server. Thanks for all your help so far, but I really need some more!
Joe L. Posted May 22, 2013
I think the issue is more related to power than the data cables. According to this line in the SMART report, the disk heads have been retracted in response to power loss 93 times. That suggests either a loose power connection, or a bad power splitter, or an overloaded power supply, or a bad drive tray (although it could also be the disk itself, I suppose).
    192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 93
Joe L.
Dougy Posted May 22, 2013
Prior to my current problems since I upgraded to RC12a, I was running RC4 with simple features. Sometimes the server was becoming unresponsive and I had to reboot it. This began every few weeks and by the end was up to a couple of times a day, which led to me upgrading. Could me restarting the server have resulted in the power off retract count?
Joe L. Posted May 22, 2013
Quoting Dougy: "prior to my current problems since I upgraded to RC12a, I was running RC4 with simple features. Sometimes the server was becoming unresponsive and I had to reboot it. This began every few weeks and at the end up to a couple of times a day which led to me upgrading. could me restarting the server have resulted in the power off retract count?"
Yes, it could be an alternate explanation, if you did it a lot.
Dougy Posted May 22, 2013
As for the power supply, I have a Seasonic S12II 430W. I am currently running 10 WD green discs with an atom board + controller card. This might sound like a small power supply, but according to the power supply thread it should have no problem powering this hardware, and I chose it so that it would still be efficient when idling. If there is a power problem then I would suspect the sata power splitters. I really have no idea how to proceed.
garycase Posted May 22, 2013
As Joe mentioned, the power retract count indicates a lot of head retractions due to power failures. This count does NOT increment when you do a normal shutdown ... so simply restarting the server wouldn't cause it unless it was "hung" and you did a forced power-down (holding the power button or simply unplugging the PC).
While I agree your Seasonic SHOULD have sufficient power, the issue certainly seems like it may be power-related. I've mentioned a couple of times to check all the cables, so I assume you've already done that. At this point trying a "heftier" power supply would be a good idea.
There IS a way to check the likelihood of this being a power issue without getting another PSU, but it would require that you "run at risk" ... and you COULD lose data if any disk has actually failed. That would be to simply remove your parity disk from the array, then create a smaller array configuration with just a few disks (e.g. 4 disks). Then add the parity disk back to the array and compute parity (so now this small array is protected), then run a parity check. If the array is running fine at that point, your problem is almost certainly power. But if anything goes wrong before you have a protected array, any failed disk can NOT be rebuilt ... so any data that can't be read from that disk is lost. i.e. it's "safer" to simply buy a nice new 650W power supply :-)
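As a rough way to sanity-check the "is the PSU big enough" question, the worst case is usually all drives spinning up at once on the 12 V rail. The Python sketch below is only that arithmetic. Every number in the example call is a placeholder, not a measured or datasheet value for this server, and `spinup_headroom_amps` is a name invented here. Plug in the 12 V spin-up current from the drive datasheet and the 12 V rating printed on the PSU label.

```python
def spinup_headroom_amps(drive_count, spinup_amps_per_drive, other_12v_amps, rail_capacity_amps):
    """Spare 12 V current (A) if every drive spins up at once; negative means overloaded."""
    peak_demand = drive_count * spinup_amps_per_drive + other_12v_amps
    return rail_capacity_amps - peak_demand

# Placeholder figures, NOT real values for this hardware: check the drive
# datasheet and the PSU label for the actual 12 V numbers.
headroom = spinup_headroom_amps(
    drive_count=10,             # number of drives on the 12 V rail
    spinup_amps_per_drive=2.0,  # assumed peak 12 V draw per drive at spin-up
    other_12v_amps=3.0,         # assumed board, controller card and fans
    rail_capacity_amps=30.0,    # assumed usable 12 V capacity of the PSU
)
print(f"12 V headroom at spin-up: {headroom:.1f} A")
```

A comfortable positive margin on paper does not rule out a bad splitter or a loose connector, which is why reseating or swapping the power cabling is still worth doing first.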
Dougy Posted May 22, 2013
It's been a while since I researched PSUs but I do recall that a single 12v rail was important. Is this PSU a good replacement? http://www.pccasegear.com/index.php?main_page=product_info&cPath=15_354&products_id=21713
garycase Posted May 22, 2013
Quoting Dougy: "its been a while since I researched PSU's but I do recall that a single 12v rail was important. Is this psu a good replacement http://www.pccasegear.com/index.php?main_page=product_info&cPath=15_354&products_id=21713"
No, I wouldn't buy a CX series unit. The Corsair AX & HX units are superb; the TX units are very good and are probably all you need to buy; but the low-end CX units aren't nearly as good as their big brothers. I'd buy a Corsair TX or HX, or a Seasonic M12 or X series unit. ~600 to 650W is a good range.