JustinAiken Posted September 30, 2010
Okay, my server is messed up... I haven't even turned it on in 3 weeks because I've been busy, but it's time to fix it so I can listen to music and watch movies again...
First, I have HPA on a drive:
Sep 29 18:22:29 Tower kernel: sde1
Sep 29 18:22:29 Tower kernel: ata5.00: HPA detected: current 2930275055, native 2930277168
Sep 29 18:22:29 Tower kernel: ata5.00: ATA-7: SAMSUNG HD154UI, 1AG01118, max UDMA7
Sep 29 18:22:29 Tower kernel: ata5.00: 2930275055 sectors, multi 0: LBA48 NCQ (depth 31/32)
I tried this:
root@Tower:~# hdparm -N p2930277168 /dev/sde
and got this:
/dev/sde:
setting max visible sectors to 2930277168 (permanent)
SET_MAX_ADDRESS failed: Input/output error
HDIO_DRIVE_CMD(identify) failed: Input/output error
More seriously, my parity is useless because I have two bad drives. I did a rebuild-tree on one of them, and even after that it's still not mountable. The other one gives me a "Bad root block 0" error every time I try any kind of reiserfsck.
I've tried juggling the drives around to different backplanes/cables/power cables, and the results have been identical; I'm thinking I just got unlucky, rather than having a problem in the machine. Any advice on data recovery for these two drives?
Lastly, I do get an odd warning in the syslog every time I turn it on:
Tower kernel: ACPI Warning: Incorrect checksum in table [OEMB] - 01, should be F4 (20090903/tbutils-314)
What does this mean? Is it dire?
bubbaQ Posted September 30, 2010
Before anything else, you need to disable the HPA (Save BIOS to disk option) in the BIOS... otherwise when one disk fails, the system can then wipe out the next one as it looks for some place to put the HPA.
JustinAiken Posted September 30, 2010
My motherboard doesn't have HPA... the drive was previously used in my desktop, which did.
Joe L. Posted September 30, 2010
Quote: My motherboard doesn't have HPA... the drive was used in my desktop before which did.
Then you don't need to do anything if you don't want to. The few megabytes occupied by the HPA do no harm.
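Editor's note: the size of the hidden area can be worked out from the two sector counts in the "HPA detected" syslog line of the first post. The sketch below is only illustrative arithmetic; the 512-byte sector size is an assumption (the thread does not state it), and the function name is made up for this example.

```python
# Sector counts copied from the "HPA detected" syslog line in the first post.
CURRENT_SECTORS = 2930275055  # sectors visible with the HPA in place
NATIVE_SECTORS = 2930277168   # sectors the drive natively has

# Assumption: 512-byte logical sectors (not stated anywhere in the thread).
SECTOR_BYTES = 512


def hpa_size_bytes(current, native, sector_bytes=SECTOR_BYTES):
    """Return the number of bytes hidden by the HPA."""
    return (native - current) * sector_bytes


hidden = hpa_size_bytes(CURRENT_SECTORS, NATIVE_SECTORS)
print(f"{NATIVE_SECTORS - CURRENT_SECTORS} sectors hidden, "
      f"{hidden} bytes (~{hidden / 2**20:.2f} MiB)")
# prints: 2113 sectors hidden, 1081856 bytes (~1.03 MiB)
```

Under that assumption the HPA on this drive hides roughly one megabyte, which is consistent with the point that the space lost is negligible on a 1.5 TB disk.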
JustinAiken Posted October 1, 2010
Okay cool... guess HPA only matters if it's changing, or on the parity drive (because it would make it smaller than the equivalent data drives?)
Anyways, magically, one of my disks is all good again!!! :D It was the most full too...
Going to tinker with the other problematic drive for a bit before I rebuild parity... And I need to CC35 up some of those drives on CC34 before they can go off...
JustinAiken Posted October 1, 2010
Just got a kernel panic while I was telnet'd in...
Message from syslogd@Tower at Thu Sep 30 23:13:54 2010 ...
Tower kernel: Oops: 0000 [#1] SMP
Message from syslogd@Tower at Thu Sep 30 23:13:54 2010 ...
Tower kernel: Stack:
Message from syslogd@Tower at Thu Sep 30 23:13:54 2010 ...
Tower kernel: Process emhttp (pid: 1584, ti=c41a6000 task=f779c000 task.ti=c41a6000)
Message from syslogd@Tower at Thu Sep 30 23:13:54 2010 ...
Tower kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/host0/port-0:3/end_device-0:3/target0:0:3/0:0:3:0/block/sde/stat
Message from syslogd@Tower at Thu Sep 30 23:13:54 2010 ...
Tower kernel: CR2: 00000000f95b4000
Message from syslogd@Tower at Thu Sep 30 23:13:54 2010 ...
Tower kernel: EIP: [<f95aa8e4>] md_cmd_proc_read+0x41/0x54 [md_mod] SS:ESP 0068:c41a7ef4
Message from syslogd@Tower at Thu Sep 30 23:13:54 2010 ...
Tower kernel: Call Trace:
Message from syslogd@Tower at Thu Sep 30 23:13:54 2010 ...
Tower kernel: Code: 55 f0 e8 6a c2 b8 c7 8d 50 01 29 f2 39 d3 7c 0b 8b 45 0c 89 d3 c7 00 01 00 00 00 8b 45 f0 89 d9 81 c6 ac fd 5a f9 c1 e9 02 89 38 <f3> a5 89 d9 83 e1 03 74 02 f3 a4 5a 89 d8 5b 5e 5f 5d c3 55 89
JustinAiken Posted October 2, 2010
Removed that drive... it was only 1TB anyways, I'd rather get a 2TB in there...
Anyways, I finished rebuilding the tree on the other drive. It did save most of the data, but the drive is slow now; it takes a long time to load directories and such. Is there some kind of defrag I should be doing? Or should I move the files to a freshly cleared drive, then reclear it?
JustinAiken Posted October 2, 2010
Okay, after deleting all the lost+found fragments, the speed is okay again. I have now fixed all the drives, and rebuilt the parity... while it was rebuilding the parity, it said it had 75 sync errors... but everything was fine, and I used the unRAID for 5-6 hours. Now I see a green circle flashing next to my disk2... 75 errors found. What does this mean?
Joe L. Posted October 2, 2010
Quote: Okay, after deleting all the lost+found fragments, the speed is okay again. I have now fixed all the drives, and rebuilt the parity... while it was rebuilding the parity, it said it had 75 sync errors... but everything was fine, and I used the unRaid for 5-6 hours. Now I see a green circle flashing next to my disk2... 75 errors found. What does this mean?
The flashing green indicator means the disk has gone to sleep. Now, do another parity check. No errors should be found.
JustinAiken Posted October 3, 2010
Okay, did a parity check overnight, it came up clean - 0 errors. However, disk2 is up to 199 errors (they popped up before I ran the parity check).
Here's part of the syslog, where all the errors occurred: http://pastebin.com/KCVUtg1y
Mostly a lot of:
handle_stripe read error: 24432/2, count: 1
md: disk2 read error
Joe L. Posted October 3, 2010
Quote: Okay, did a parity check overnight, it came up clean - 0 errors. However, disk2 is up to 199 errors (they popped up before I ran the parity check) Here's part of the syslog, where all the errors occurred: http://pastebin.com/KCVUtg1y Mostly a lot of: handle_stripe read error: 24432/2, count: 1 md: disk2 read error
Those errors represent a sector that was un-readable (an un-correctable media error).
Now, get a SMART report on that drive to see what has actually happened on it. The sector(s) have probably already been re-allocated.
I think disk2 is /dev/sdf. The command would then be:
smartctl -d ata -a /dev/sdf
It is good there were no additional parity errors. That is a good sign.
JustinAiken Posted October 3, 2010
Quote: Now, get a smart report on that drive to see what has actually happened on it. The sector(s) have probably already been re-allocated. I think disk2 is /dev/sdf the command would then be: smartctl -d ata -a /dev/sdf
Yes, it is sdf... here's the result: http://pastebin.com/pVA88hGY
Joe L. Posted October 3, 2010
It shows 1 sector pending re-allocation. (It should be re-allocated the next time the sector is written to.)
JustinAiken Posted October 3, 2010
Okay, thanks! From now on I'll be running parity checks at least weekly, and keeping a precleared drive ready to go the second a disk fails...
Joe L. Posted October 3, 2010
Quote: Okay, thanks! From now on I'll be running parity checks at least weekly, and keeping a precleared drive ready to go the second a disk fails...
Most of us do it monthly, and we do it in NOCORRECT mode. That way, if a data disk mis-behaves and returns trash it does not corrupt the parity that might be used to re-construct it.
GK20 Posted October 3, 2010
Quote: Most of us do it monthly, and we do it in NOCORRECT mode. That way, if a data disk mis-behaves and returns trash it does not corrupt the parity that might be used to re-construct it.
What if the misbehaving one is the parity disk?
Joe L. Posted October 3, 2010
Quote: Most of us do it monthly, and we do it in NOCORRECT mode. That way, if a data disk mis-behaves and returns trash it does not corrupt the parity that might be used to re-construct it.
Quote: What if the mis-behaves one is parity disk?
It is performed in a no-correction mode. You'll still see the "parity errors" occur on the interface, but they will not be automatically corrected.
Determining which drive is the cause of the parity errors is a whole different problem. It could be any one of them, but at least you get to attempt to figure it out from other errors/symptoms in the log files.
If you decide the data disk is correct, you can run the normal "Check", which will change parity to reflect what is being read from the data disks.
GK20 Posted October 3, 2010
Quote: It is performed in a no-correction mode. You'll still see the "parity errors" occur on the interface, but they will not be automatically corrected. Determining which drive is the cause of the parity errors is a whole different problem. It could be any one of them, but at least you get to attempt to figure it out from other errors/symptoms in the log files. If you decide the data disk is correct, you can run the normal "Check", which will change parity to reflect what is being read from the data disks.
This is actually what we need to think about. From what I can see so far, unRAID takes a "trust the data disks" approach: whenever there is a parity mismatch, the parity disk is corrected with new parity. However, with one parity disk for 20 data disks, chances are that those 20 data disks are more likely to produce trash data (an uncorrectable read error, for example) than the single parity disk during a parity check.
Yet even when taking a "trust the parity disk" approach by using a NOCORRECT parity check, there is not enough information in the syslog to tell us whether the parity disk is really that trustworthy.
Joe L. Posted October 3, 2010
Quote: However even in taking "trust parity disk" approach by using NOCORRECT mode parity check, there is no enough information from syslog can tell us parity disk is really that trustworthy.
We are not "trusting" parity by using the NOCORRECT mode. It is just performing a parity verify instead of a parity verify/correction.
There is absolutely no way for anything to know which disk(s) are incorrect if parity is not as expected. All we know is that adding all the bits together across a specific bit position comes out to an odd number instead of an even one.
Other clues in the syslog might let you identify a bad drive... for example, one where a read failure resulted in all zeros being returned.
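Editor's note: the point that a parity mismatch cannot say which disk is wrong can be illustrated with a small sketch of single-parity (XOR, i.e. even parity per bit position). This is a toy model, not unRAID's actual implementation; the three tiny "disks" and all function names are invented for the example.

```python
from functools import reduce
from operator import xor


def compute_parity(data_blocks):
    """XOR the same byte position across every data disk (even parity)."""
    return bytes(reduce(xor, column) for column in zip(*data_blocks))


def mismatched_positions(data_blocks, parity):
    """Return the byte offsets where stored parity disagrees with the data."""
    expected = compute_parity(data_blocks)
    return [i for i, (e, p) in enumerate(zip(expected, parity)) if e != p]


# Three hypothetical 3-byte data disks and their correct parity.
disks = [b"\x0f\xf0\xaa", b"\x01\x02\x03", b"\xff\x00\x55"]
parity = compute_parity(disks)

# Case A: one bit flips on a data disk (byte 2 of the second disk).
bad_data = [disks[0], b"\x01\x02\x83", disks[2]]

# Case B: the same bit flips on the parity disk instead.
bad_parity = bytes([parity[0], parity[1], parity[2] ^ 0x80])

print(mismatched_positions(disks, parity))      # [] : everything consistent
print(mismatched_positions(bad_data, parity))   # [2]
print(mismatched_positions(disks, bad_parity))  # [2]
```

Cases A and B produce the same report, so the check alone only says that position 2 is inconsistent. Deciding which disk to believe needs outside evidence such as read errors in the syslog or SMART data.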
GK20 Posted October 3, 2010
Quote: There is absolutely no way for anything to know which disk(s) are incorrect if parity is not as expected. All we know is that adding all the bits together across a specific bit position comes out to an odd number instead of an even one. Other clues in the syslog might let you identify a bad drive... for example, one where a read failure resulted in all zeros being returned.
That is actually the one thing that really troubles me. From time to time, when there were parity errors during a periodic parity check, nothing showed any HW issues like a bad drive, SATA errors, etc.; all I can see is perfect syslog content.
Joe L. Posted October 3, 2010
Quote: That is actually one thing really trouble me, From time to time, when there were parity error during periodical parity check, none of them shown there were any HW issues like bad drive, SATA error....etc, all i can see is perfect syslog content.
The reiserfs file systems on the data disks have journals that allow them to re-group after an un-expected power failure. The parity disk has no such journal. It is not uncommon to have a few parity errors if you take a power hit.
There is no simple solution other than checksums on the files themselves.
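Editor's note: "checksums on the files themselves" could look something like the sketch below, which records a SHA-256 digest per file and later reports files whose content changed. It is a minimal illustration using only Python's standard library, not a tool mentioned in the thread; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files present in both manifests whose digests differ."""
    return sorted(
        name
        for name, digest in old_manifest.items()
        if name in new_manifest and new_manifest[name] != digest
    )
```

A possible workflow, assuming the manifest is saved somewhere off the array: call build_manifest on a data disk's mount point while the data is known good, call it again after a parity error shows up, and pass both results to changed_files. A file that was not deliberately modified but now has a different digest points at the data disk rather than at parity.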
GK20 Posted October 3, 2010
Quote: The reiserfs file systems on the data disks have journals that allow them to re-group after an un-expected power failure. The parity disk has no such journal. It is not uncommon to have a few parity errors if you take a power hit.
That is true. However, I have a UPS for my server and always perform a clean shutdown, but I still get parity errors during checks from time to time, which certainly raises my concern.
Joe L. Posted October 3, 2010
Quote: That is true, however i have UPS for my server and always perform clean shutdown, but still have parity error during checking from time to time, that would for sure raise my concern.
Mine too... I've never had an un-expected parity error in 5 years. If you do, and you did not have a power failure, and you always cleanly shut down, then odds are you have some flaky hardware involved (and it could be absolutely anything, from the power supply to the RAM to the motherboard itself, to the disks).
Joe L.