October 29, 2010
OK, so I had what looked like a drive failure on disk5. I replaced the old 500 GB disk5 with a new 2 TB disk5 and did a rebuild. It went to 99.9% and got stuck. So I thought, OK, maybe the disk is bad. I installed another NEW 2 TB drive and repeated the rebuild: again 99.9% and STUCK!! That's 2 brand-new Western Digital Green 2 TB drives. The rebuild of the 2 TB drive took about 24 hours to get to 99%, then sat at 99.9% for hours. See attached image. The estimated speed was cranking at 3000 or so, but after sitting at 99.9% it drops. Any help is appreciated.
October 29, 2010
You have a better chance of getting help from the forum if you post a copy of your syslog, because no one has X-ray eyes to see through your web GUI and figure out what is wrong. Whatever went wrong, your syslog might (no guarantee) have some information about it.
October 29, 2010
Something is clearly wrong with disk6. Your syslog reports read errors for that disk, and the web interface shows a temperature of 0. I don't know the solution to this, so you'll have to wait for Joe L or someone else to guide you.
October 29, 2010 Author
Oh man, don't tell me it's a two-drive failure!!!

Result of FREE:
             total       used       free     shared    buffers     cached
Mem:       3082520    2992828      89692          0     217220    2503056
-/+ buffers/cache:     272552    2809968
Swap:            0          0          0

Result of PS:
  PID TTY          TIME CMD
 2783 pts/0    00:00:00 bash
 2820 pts/0    00:00:00 ps
October 29, 2010 Author
The machine is an AMD X2 64 with 3 GB of RAM on an ASUS A8N32-SLI (Socket 939):
5 drives on the onboard SATA ports (the parity drive is on one of these ports)
4 IDE PATA drives on a Promise controller (PCI bus)
2 SATA drives on a Promise controller (PCI bus)
PCI video card
October 29, 2010
"Machine is an AMD X2 64 with 3 GB of RAM on an ASUS A8N32-SLI (Socket 939). 5 drives on the onboard SATA ports (the parity drive is on one of these ports). 4 IDE PATA drives on a Promise controller (PCI bus). 2 SATA drives on a Promise controller (PCI bus). PCI video card."

Your syslog has wrapped around, so a lot of information is missing from this copy. You might need to post the previous copy of the syslog as well to give the whole picture. Regardless, a lot of things are going on in the background:
(a) unRAID has a problem spinning down disk6. Without the inventory table logged at boot time, I cannot figure out which physical disk is disk6.
(b) You have I/O errors on the IDE disk hdb. In unRAID, IDE disk names start with "h" and SATA disk names start with "s".
(c) The system has remounted md5 read-only because of an I/O error. Assuming you are rebuilding md5, that is why the rebuild never finishes.
(d) Because of the I/O errors, some data written to md5 has been declared MIA (missing in action).
At this moment I suspect your Promise controller might have problems, if disk6 is also an IDE disk.
------------ from your syslog ------------
Oct 29 13:34:48 theoracle kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5
Oct 29 13:34:58 theoracle kernel: mdcmd (7060): spindown 6
Oct 29 13:34:58 theoracle kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5
Oct 29 13:34:59 theoracle kernel: end_request: I/O error, dev hdb, sector 940049679
Oct 29 13:34:59 theoracle kernel: md: disk6: ATA_OP_SETIDLE1 ioctl error: -5
Oct 29 13:35:09 theoracle kernel: md: disk6 read error
Oct 29 13:35:09 theoracle kernel: handle_stripe read error: 940049616/6, count: 1
Oct 29 13:35:09 theoracle kernel: REISERFS error (device md5): vs-2100 add_save_link: search_by_key ([-1 8 0x1001 DIRECT]) returned -2
Oct 29 13:35:09 theoracle kernel: REISERFS (device md5): Remounting filesystem read-only
Oct 29 13:35:09 theoracle kernel: end_request: I/O error, dev hdb, sector 47655
Oct 29 13:35:22 theoracle kernel: md: disk6 read error
Oct 29 13:35:22 theoracle kernel: handle_stripe read error: 47592/6, count: 1
Oct 29 13:35:22 theoracle kernel: Buffer I/O error on device md5, logical block 5949
Oct 29 13:35:22 theoracle kernel: lost page write due to I/O error on md5
Oct 29 13:35:22 theoracle kernel: md: disk6 read error
Oct 29 13:35:22 theoracle kernel: handle_stripe read error: 47600/6, count: 1
Oct 29 13:35:22 theoracle kernel: Buffer I/O error on device md5, logical block 5950
Oct 29 13:35:22 theoracle kernel: lost page write due to I/O error on md5
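For anyone who wants to summarize a long syslog rather than read it line by line, here is a minimal Python sketch that tallies the error types seen in the excerpt above, per device. The regular expressions are written only against the message formats shown in this thread, and the function name `tally_errors` is made up for illustration; other kernels or unRAID versions may word these messages differently.

```python
import re
from collections import Counter

# A few lines copied from the syslog excerpt in this thread.
SAMPLE = """\
Oct 29 13:34:58 theoracle kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5
Oct 29 13:34:59 theoracle kernel: end_request: I/O error, dev hdb, sector 940049679
Oct 29 13:35:09 theoracle kernel: md: disk6 read error
Oct 29 13:35:09 theoracle kernel: REISERFS (device md5): Remounting filesystem read-only
Oct 29 13:35:09 theoracle kernel: end_request: I/O error, dev hdb, sector 47655
Oct 29 13:35:22 theoracle kernel: md: disk6 read error
Oct 29 13:35:22 theoracle kernel: lost page write due to I/O error on md5
"""

# Patterns are assumptions based solely on the messages quoted above.
PATTERNS = {
    "io_error": re.compile(r"end_request: I/O error, dev (\w+), sector \d+"),
    "read_error": re.compile(r"md: (disk\d+) read error"),
    "lost_write": re.compile(r"lost page write due to I/O error on (\w+)"),
}


def tally_errors(lines):
    """Count matching error messages, keyed by (error kind, device name)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
    return counts


if __name__ == "__main__":
    for (kind, device), n in sorted(tally_errors(SAMPLE.splitlines()).items()):
        print(f"{kind:<10} {device:<6} {n}")
```

To run it against a real log, pass the lines of the file (for example `open("/var/log/syslog")`) to `tally_errors` instead of the sample text.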
October 29, 2010 Author
Thanks GK, I appreciate this very much.
Disk 6 is IDE PATA on one Promise controller.
Disk 5 is SATA on another Promise controller.
Some more info: I just switched the motherboard/CPU combo as well. Can this have anything to do with it?
How do I get previous versions of the syslog?
Why does it complete up to 99.9% and always stop at exactly the same place?
October 29, 2010 Author
Also, is this problem somehow related? This is the same array: http://lime-technology.com/forum/index.php?topic=5921.0
October 29, 2010
"Some more info - I just switched motherboard/cpu combo as well - can this have anything to do with it?"

Well, if you had mentioned this at the beginning, I would definitely have thought this might be an issue, especially if you hadn't done any burn-in test on the new CPU/MB/memory. Even getting a new MB recommended by the unRAID Wiki doesn't mean you have a "get out of jail free" card. The quality of mass-produced hardware varies and can change from time to time. If you have a new CPU/MB, you should at least run it for a certain period of time, a couple of weeks maybe, with some heavy-duty operations such as multiple non-correcting parity checks and copying large amounts of data in and out, in order to exercise the new hardware. Also run memtest for at least a day or two. After that you can start doing disk swapping/replacement.

"How do I get previous versions of syslog?"

I believe they are under /var/log

"Why does it complete up to 99.9% always stop exactly at the same place?"

I have no idea. Judging from the sector numbers in the I/O errors, that doesn't look like the last sector.
October 29, 2010 Author
It's not a new MB/CPU. I mean, I have been using this CPU/MB in my main desktop for about 3 years with no problems. It's new to the unRAID server: I upgraded my desktop and moved the old MB/CPU to the unRAID server.
So what are my options? How do I move forward?
What do you mean by disk swapping/replacement?
October 30, 2010 Author
OK, running a memory check.
Also, I noticed that disk 6 was on an IDE chain with the original (PATA) disk 5. When I replaced disk 5, I installed a SATA drive. That left the primary position on the PATA IDE channel empty, i.e. the end of the cable did not have a device, and the only drive on that IDE chain was disk 6. I wonder if this caused some kind of problem. I moved disk 6 to the end of the chain.
When the memory check completes I will try another rebuild of disk 5.
Man, if nothing else I am learning some Linux the hard way.
October 30, 2010
"Its not a new MB/CPU - I mean I have been using this CPU/MB in my main desktop for like 3 years no problems. Its new to the Unraid server. I upgraded my desktop and moved the old MB/CPU to the unraid server."

If, when you used this CPU/MB in your desktop, you also had this many disks and every now and then read data from all of them at the same time, doing XOR calculations and comparing data for 6 to 10 hours non-stop each time, then your assumption might hold. A parity check or data rebuild is a heavy-duty job because it exercises almost every component of your system for many hours non-stop, unlike a desktop where you only do web browsing, document editing, etc.

"So what options? How do I move forward?"

Your option now is to eliminate all those I/O errors, because as long as those errors are there you have no chance of rebuilding md5. How to do that is a good question. You can:
(a) Go back to the "last known good hardware", that is, put everything back to the previous HW setup and try to rebuild the data on that platform.
(b) Keep using the "new" CPU/MB but make sure the components work together (maybe replace a controller card, or remove the SATA controller card and use the onboard SATA ports instead, etc.).
(c) Last but not least, make sure there are no loose cables (SATA cables as well as power cables).
While you are troubleshooting, always check the syslog for any warning/error messages.

"What do you mean disk swapping replacement?"

Originally I thought you were just replacing md5, but now I know you did that because the disk is "broken". However, given all those I/O errors during the data rebuild, I have my doubts about whether the old md5 disk is really broken.
October 30, 2010
"Also I noticed that Disk 6 was on an IDE chain with Disk 5 (original) PATA when I replaced Disk 5 - I installed a SATA. Which left the PATA IDE channel primary empty - ie: end of the cable did not have a device - and the only drive on that IDE PATA chain was disk 6 - I wonder if this caused some kind of problem. I moved it to the end of the chain."

IDE has a master/slave (aka primary/secondary) concept. Every time you rearrange disks, you have to make sure the IDE disks are jumpered for the correct mode. If you have only one disk on an IDE chain, make sure it is the master device.
October 30, 2010 Author
GK20, thanks for all the guidance.
The memory check ran 5 passes with 0 errors.
I will check all cables and try one more rebuild with the existing hardware.
If it fails, I will bring the old MB/CPU back online, but I am suspicious of that old MB/CPU because I was having the disappearing-share problem I linked above. Basically the server would work for a while and then die; it needed a reboot and would then be OK for a while. No one was able to help me, hence the new MB/CPU combo.
Again, thank you. I will keep posting until I get it all resolved.
October 30, 2010 Author
I tried to remount the old disk 5. unRAID is now looking for a 2 TB drive or bigger, so there is no going back. AND the old disk 5, a 500 GB PATA drive, makes some abnormal noises on boot-up, so I do believe it is failing.
OK, the rebuild has started again. Keeping my fingers crossed.
October 30, 2010 Author
FYI:
Total size: 1,953,514,552 KB
Current position: 754,020 (0.0%)
Estimated speed: 13,674 KB/sec
Estimated finish: 2379.8 minutes
October 30, 2010 Author
Is there a way to watch the syslog in real time?
No, I do not plan to watch it for 24 hours, but I want to spot-check it off and on.
October 30, 2010 Author
The rebuild seems to be running consistently:
Total size: 1,953,514,552 KB
Current position: 65,257,836 (3.3%)
Estimated speed: 19,129 KB/sec
Estimated finish: 1644.9 minutes
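The "Estimated finish" values in these status posts look like simple arithmetic: remaining KB divided by the estimated speed, converted to minutes. That is an assumption about how the web GUI computes the figure, not something confirmed in this thread, and the function name below is made up for illustration. A minimal Python sketch:

```python
def estimated_finish_minutes(total_kb, position_kb, speed_kb_per_sec):
    """Remaining work divided by current speed, expressed in minutes."""
    remaining_kb = total_kb - position_kb
    return remaining_kb / speed_kb_per_sec / 60.0


if __name__ == "__main__":
    # Figures taken from the two status posts above.
    total = 1_953_514_552
    for position, speed in [(754_020, 13_674), (65_257_836, 19_129)]:
        print(f"{estimated_finish_minutes(total, position, speed):.1f} minutes")
```

With the figures posted above, the results land within about a minute of the reported 2379.8 and 1644.9 minutes; the small gap is plausibly due to the displayed speed being rounded. The practical point is that the estimate is only as good as the current speed, so a drop in speed blows the estimate up rather than indicating how much work is actually left.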
October 30, 2010
"Is there a way to watch the sys log in real time? No I do not plan to watch it for 24hrs but wanted to spot check it off and on."

The easiest way is from a telnet session or from the console. Type:

tail -f /var/log/syslog

That will have the tail utility watch the file and print any new additions to the screen in real time. To stop monitoring, just press CTRL-C.
October 30, 2010 Author
New log attached. The speed seems to have dropped:
Total size: 1,953,514,552 KB
Current position: 381,260,264 (19.5%)
Estimated speed: 1,612 KB/sec
Estimated finish: 16250.6 minutes

AND I see error messages on drive 6:

Disk    Model / Serial No.                Temperature  Size           Free           Reads      Writes     Errors
parity  WDC_WD20EADS-00S_WD-WCAVY0864163  35°C         1,953,514,552  -              1,073,873  64         0
disk1   WDC_WD5000AAKS-2_WD-WCAS82451651  32°C         488,386,552    44,390,284     1,120,078  5          0
disk2   ST3500641AS_3PM1YPT7              35°C         488,386,552    73,378,604     1,123,186  7          0
disk3   WDC_WD20EADS-00S_WD-WCAVY0878118  36°C         1,953,514,552  8,128,452      1,083,450  5          0
disk4   ST3500641AS_3PM0LCW6              36°C         488,386,552    122,091,392    1,090,824  8          0
disk5   WDC_WD20EARS-00M_WD-WCAZA0967156  23°C         1,953,514,552  1,933,671,552  45         1,721,426  0
disk6   ST3500841A_3PM079GN               27°C         488,386,552    245,981,228    955,140    5          29,777
disk7   ST3200822A_3LJ3BJ39               *            195,360,952    2,922,704      394,103    5          0
disk8   ST3200822A_3LJ3A3Z8               *            195,360,952    4,284,900      388,041    5          0
disk9   Maxtor_6Y200P0_Y65GPWPE           *            199,148,512    56,608,624     435,458    5          0
disk10  ST3200822A_4LJ22G0B               *            195,360,952    56,558,920     393,822    5          0

Attachment: msgs1.txt
October 30, 2010
Disk    Model / Serial No.                Temperature  Size           Free           Reads    Writes     Errors
disk5   WDC_WD20EARS-00M_WD-WCAZA0967156  23°C         1,953,514,552  1,933,671,552  45       1,721,426  0
disk6   ST3500841A_3PM079GN               27°C         488,386,552    245,981,228    955,140  5          29,777

The log you attached doesn't show any errors. The errors in the table look like read errors from disk6. Unless disk6 takes part in the data rebuild error-free, you cannot rebuild disk5.

I think you can stop this data rebuild now and try to resolve the disk6 issue first. Check that the mode (primary/secondary) setting is correct, and maybe replace the IDE cable or use a different IDE port. Meanwhile, do a SMART check on disk6 to find out whether the reallocated sector count has increased.
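The reason a failing disk6 blocks the rebuild of disk5 follows from the XOR parity calculation mentioned earlier in the thread: with single XOR parity, a missing disk is reconstructed by XOR-ing the parity with every other data disk, so every other disk has to be read correctly. The Python sketch below illustrates this with hypothetical 4-byte "disks"; the function names and byte values are invented for the example, and it is not unRAID's actual code. A real read error returns no data at all; it is modeled here as one wrong byte purely to show the effect.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(parity, surviving):
    """Reconstruct the one missing block from parity plus all surviving blocks."""
    return xor_blocks([parity, *surviving])


# Hypothetical contents of three data disks (illustrative values only).
disk4 = bytes([0x10, 0x20, 0x30, 0x40])
disk5 = bytes([0xDE, 0xAD, 0xBE, 0xEF])  # the disk being rebuilt
disk6 = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks([disk4, disk5, disk6])

# Healthy case: every surviving disk reads back correctly.
good = rebuild_missing(parity, [disk4, disk6])

# Faulty case: disk6 cannot deliver the third byte correctly.
bad_disk6 = bytes([0x01, 0x02, 0xFF, 0x04])
bad = rebuild_missing(parity, [disk4, bad_disk6])

if __name__ == "__main__":
    print("rebuilt correctly:", good == disk5)
    print("rebuilt with faulty disk6:", bad == disk5)
```

In the faulty case only the positions that depend on the unreadable part of disk6 come out wrong, which matches the advice above: fix disk6 first, then rebuild disk5.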
October 30, 2010 Author
I already changed the IDE cable, just in case. I shall try a different IDE port.
What's the easiest way to do a SMART test on an unRAID machine?
OK, let's say disk 6 is crap, so now I have a crapped-out disk 5 and disk 6. At most I lose disk 5 (which did not hold much data, nothing important) and disk 6, which has some stuff I would really like to save.
What are my options?
October 30, 2010 Author
So this time it has failed at 19%, and disk 6 shows more errors than before. This looks like it's going from bad to worse. Current log attached.
I will attempt a port change and another rebuild.
Attachment: msg2.txt
October 30, 2010 Author
Error on console:
reiserfs abort (device md5): Journal write error in flush_commit_list
reiserfs abort (device md6): Journal write error in flush_commit_list