CupCak3 Posted September 27, 2010

I’m currently using Unraid 4.5.3 and Unmenu 1.3.

Last night I queued a bunch of files to transfer via FTP and didn’t check again until the next morning. What I did not bother to verify beforehand was that I had enough free space. When I woke up this morning, I found the FTP client still transferring files. After watching it for a minute or two, I saw that the transfer was failing because the drives were full, and the client kept restarting the transfer! Looking at the transfer times in the syslog, I quickly calculated that this happened 900-1000 times.

So I then started a parity check without correction to make sure there were no problems with my drives. It found one error pretty close to the beginning and none after that.

How do I find which drive and file has the parity error? Could a non-hard-drive issue have caused this? Is there any way to verify whether a hard drive is going bad, aside from fixing the parity and waiting to see if I find more errors in future checks?

I was really hoping this was something I would not have to deal with until much later... the oldest drive is less than two years old and they all have active cooling on them.

Thanks!
SSD Posted September 27, 2010

CupCak3 wrote: "How do I find which drive and file has the parity error?"

You can't. Parity is simply the sum (XOR) of the corresponding bits across every disk. There is no way to tell which bit is wrong.

CupCak3 wrote: "Could a non-hard drive issue have caused this? Is there anyway to verify if a hard drive is going bad aside from fixing the parity and waiting to see if I find more errors in future checks?"

You should run smart reports on each of your disks. They could help identify a cabling problem or a failing disk.

When was the last time you ran a parity check and got no errors? Have you had any sort of server crash or hard shutdown since then?

Post your smart reports (follow the troubleshooting link in my sig for instructions).
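To make the point above concrete, here is a minimal Python sketch on toy data. The three "disks" and their byte values are made up for illustration, and `xor_parity` and `syndrome_after_flip` are hypothetical helper names, not anything unRAID provides. It flips the same bit on each disk in turn and shows that the mismatch seen by a parity check looks the same every time.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy "data disks", one 4-byte block each (made-up values).
disks = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(disks)

def syndrome_after_flip(disk_index):
    """Flip the lowest bit of byte 0 on one disk, then XOR all data with parity.

    A result of all zero bytes means parity agrees; anything else is a parity error.
    """
    damaged = [bytearray(d) for d in disks]
    damaged[disk_index][0] ^= 0x01
    return xor_parity(damaged + [parity])

# The three results are identical, whichever disk was damaged.
syndromes = [syndrome_after_flip(i) for i in range(len(disks))]
print(syndromes)
```

The check shows which bit position disagrees, but the result carries no trace of which disk (or the parity disk itself) supplied the bad bit.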
CupCak3 Posted September 27, 2010

SSD wrote: "When was the last time you ran a parity check and got no errors. Have you had any sort of server crash / hard shutdown since then? Post your smart reports (follow troubleshooting link in my sig for instructions)."

I think I may have had a power outage since my last parity check.

Would you rather see the results from the long or the short smart test?

Thanks for the help and your prompt response!
SSD Posted September 28, 2010

CupCak3 wrote: "I think I may have had a power outage since my last parity check. Would you rather see the results from the long or short smart tests?"

If you had a power outage (and assuming you do not have a UPS), the system would have run an automatic parity check the next time you booted. For the first several minutes after such a boot, the OS is replaying journaled transactions to deal with the fact that the filesystem was not properly shut down (RFS does an excellent job of this and of not losing data). But my experience has been that unRAID's parity check fights with the replaying of transactions, with the result that parity does not always get fully corrected the first time.

Take a look at this thread. It documents something very similar to what I believe happened to you:

http://lime-technology.com/forum/index.php?topic=1562.msg10612#msg10612

I would suggest that you run the smart reports. I'm not asking you to run either the long or the short test, just to take the reports.

    smartctl -a -d ata /dev/sd?

Where "sd?" is sda, sdb, sdc, etc.

All unRAID users should have a UPS!
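For anyone who would rather script this than type the command once per disk, here is a rough Python sketch. It assumes Python 3 is available on the machine the disks are attached to, which may not be the case on a stock server. `smartctl_command` and `collect_reports` are hypothetical helper names, and the device list (sda through sde) and the separator line are assumptions to adjust for your own system.

```python
import string
import subprocess

def smartctl_command(device):
    """Build the argument list for the report command quoted above."""
    return ["smartctl", "-a", "-d", "ata", device]

def collect_reports(devices, out_path):
    """Run the report for each device and write all of them to one text file."""
    with open(out_path, "w") as out:
        for device in devices:
            result = subprocess.run(
                smartctl_command(device), capture_output=True, text=True
            )
            # Arbitrary separator so the reports are easy to tell apart.
            out.write(f"----- {device} -----\n{result.stdout}\n")

# Assumed device list: /dev/sda through /dev/sde. Adjust to match your disks.
devices = [f"/dev/sd{letter}" for letter in string.ascii_lowercase[:5]]
```

Calling `collect_reports(devices, "smart_reports.txt")` would then produce a single file suitable for attaching to a post. Nothing runs until that call is made.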
GK20 Posted September 28, 2010

SSD wrote: "You can't. Parity is simply the summing of corresponding bits from across each disk. There is no way to tell what bit is wrong."

Under some assumptions we can actually identify which disk. As for which file: if the error is in a data block rather than in a file system block, then we probably can identify that too.

Assumption: this parity error is caused by one and only one disk.

How to do it: borrow the concept from data rebuild. Assume we already know which block has the parity error, let's say block 20000, and that we have n data disks.

    for (i = 1; i <= n; i++) {
        reset new_parity_block to zero;
        for (j = 1; j <= n; j++) {
            if (i == j) {
                continue;
            }
            new_parity_block = new_parity_block XOR (read block 20000 from disk j);
        }
        new_parity_block = new_parity_block XOR (read block 20000 from parity disk);
        old_data_block = read block 20000 from disk i;
        if (compare_memory(new_parity_block, old_data_block) == 0) {
            continue;
        } else {
            print("Old data and newly generated data are not the same at disk i");
            break;
        }
    }

If the error block is a file system block, then there is nothing we can do. Otherwise, browse through the file system on disk i and try to identify which file is using this block.

In the end, even if we can identify which disk and which file, we cannot correct the error itself. All we can do is trust that the data disks are good and write new parity to the parity disk to keep them in sync.
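The loop above is easy to try on toy data. Below is a small Python sketch of it that uses in-memory byte strings instead of real disk reads. The block values are made up, disks are numbered from 0, and `xor_blocks` and `disks_that_mismatch` are hypothetical names. One thing the experiment shows: with a single XOR parity, the rebuilt block differs from the stored block on every disk by the same bits, so the comparison on its own flags all of the disks rather than just the damaged one.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def disks_that_mismatch(data_blocks, parity_block):
    """For each disk i, rebuild its block from the other disks plus parity
    and collect the indices where the stored block differs from the rebuilt one."""
    flagged = []
    for i, stored in enumerate(data_blocks):
        others = [blk for j, blk in enumerate(data_blocks) if j != i]
        rebuilt = xor_blocks(others + [parity_block])
        if rebuilt != stored:
            flagged.append(i)
    return flagged

good = [b"\x10\x20", b"\x01\x02", b"\xaa\xbb"]   # made-up block contents
parity = xor_blocks(good)
bad = [good[0], b"\x01\x03", good[2]]            # one bit flipped on disk 1

print(disks_that_mismatch(good, parity))  # []
print(disks_that_mismatch(bad, parity))   # [0, 1, 2]
```

So some extra information is still needed to single out one disk, for example a smart report pointing at a failing drive, or a backup copy to compare against.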
vca Posted September 28, 2010

GK20 wrote: "If the error block is a file system block, then nothing we can do. otherwise browsing through file system at disk i and try to identify which file is using this block In the end, even we can id which disk and which file there is nothing we can do because we can not correct this error but trust data disks are good by writing new parity to parity disk in order to keep them in sync."

If you could identify the file that contains the block in error then, if you have backups, you could restore a good version of the file from your backup.

Even if I could only find out which file on each disk in my array contains this block, it would be useful: I could just restore those N files (even though N-1 of them are probably OK) and fix the issue. I'd probably have to update the parity disk too, in case none of the restored files was written back to the block with the problem.

So what's the magic command that I could use to take the block ID and find out, for a particular disk, what file (if any) uses the block?

Regards,
Stephen
Joe L. Posted September 28, 2010

vca wrote: "So what's the magic command that I could use to take the block ID and find out, for a particular disk, what file (if any) uses the block?"

To be honest, I do not know of one for the reiserfs file system.

To get around that, if you knew the block and could read it using "dd", I'd try to look at the contents and match them with a file. (That could be REALLY tough, since most of what I've got on my server is movies... Not much but bits...)
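A rough Python sketch of that "read the block, then look for it in a file" idea follows. Everything specific here is an assumption: the 4096-byte block size, the idea that the block number maps directly to a byte offset on the device you open, and the idea that file data sits on block-aligned boundaries (small files and file tails on reiserfs may not). `read_block` and `file_contains_block` are hypothetical helper names.

```python
def read_block(path, block_number, block_size=4096):
    """Read one block from a device or image file.

    Roughly what `dd if=PATH bs=4096 skip=N count=1` would return.
    """
    with open(path, "rb") as dev:
        dev.seek(block_number * block_size)
        return dev.read(block_size)

def file_contains_block(file_path, block, block_size=4096):
    """Return the byte offset of the first aligned chunk equal to `block`,
    or None if no aligned chunk of the file matches."""
    with open(file_path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                return None
            if chunk == block:
                return offset
            offset += len(chunk)
```

Scanning every file on a disk this way means reading the whole disk, so it is only practical if you can narrow down the candidate files first.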
GK20 Posted September 28, 2010

vca wrote: "So what's the magic command that I could use to take the block ID and find out, for a particular disk, what file (if any) uses the block?"

I don't think you will ever find a magic command for this purpose, not because it is not doable but because there is no need for it. End users aren't supposed to need to know those details; that is why there is such a thing as a file system.

However, for those who want to write their own commands or tools, the information is there. One good example is the reiserfsck command, which browses and repairs the file system based on information embedded in the file system itself.
CupCak3 Posted September 30, 2010

SSD wrote: "Post your smart reports (follow troubleshooting link in my sig for instructions)."

Sorry for the late reply, but work and family time have taken me away the past couple of days. I have 5 disks and put all the reports in one txt file. Each disk is separated by -----, starting with sda, then sdb, etc. Thanks for the help; it's much appreciated!

SSD wrote: "All unRAID users should have a UPS!"

I know, I know. My old APC died and I haven't bothered replacing it. I definitely need to get that taken care of sooner rather than later.

GK20 wrote: "How to do: By borrowing concept from data rebuild, assuming we already know which block has parity error, let's say it is block 20000 and we have n data disks."

Dumb question: how do I figure out the block #? I haven't "fixed" the error yet.

Attached: smarta.txt
SSD Posted September 30, 2010

Your disks look fine. No reallocated sectors. No logging to indicate cabling problems. Gold star!

Run a normal (correcting) parity check and you'll be fine.
GK20 Posted September 30, 2010

CupCak3 wrote: "Dumb question; how do I figure out the block #? I haven't "fixed" the error yet."

If I recall correctly, the 4.5.6 image now reports the first 20 parity errors in the syslog.

That information is available during the parity check process. Either limetech exports that information to a file, or you write your own parity check tool to capture it.
vca Posted September 30, 2010

GK20 wrote: "However for those who want to write their own commands/tools information are there. one of good example is reiserfsck command, what it does is browsing and repairing file system based on information embedded in file system."

The last time I programmed at that level was with the Commodore Amiga, so I'm probably not going to start now...

That said, I have backups of pretty much everything on the RAID array, so I could compare the backups to the current disk contents to discover problems. But that's going to be a long job, as it involves reading about 5TB of data...

Stephen
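The backup comparison described above could be sketched in Python along these lines. It is only an illustration: it assumes the backup mirrors the array's directory layout, and `file_digest` and `differing_files` are hypothetical helper names. It still has to read every byte on both sides, so it does nothing to shorten the job.

```python
import hashlib
import os

def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def differing_files(array_root, backup_root):
    """Yield relative paths that exist in both trees but differ in content.

    Files present only in the backup are skipped, not reported.
    """
    for dirpath, _dirs, filenames in os.walk(backup_root):
        for name in filenames:
            backup_path = os.path.join(dirpath, name)
            rel = os.path.relpath(backup_path, backup_root)
            array_path = os.path.join(array_root, rel)
            if not os.path.isfile(array_path):
                continue
            if file_digest(array_path) != file_digest(backup_path):
                yield rel
```

Usage would look like `for rel in differing_files("/mnt/disk1", "/mnt/backup/disk1"): print(rel)`, where both paths are placeholders for wherever the array disk and its backup are mounted.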
This topic is now archived and is closed to further replies.