April 25, 2014

I'm really hoping someone can help. Joe L. has been helpful with some of these issues, but I really need the help of others as well.

This all started when one of my drives (md4) got red-balled. I did something stupid: I figured that since the drive was red-balled and off the array, I could simply pull it out of my hot-swappable case and replace it. Big mistake! As soon as I did that I started getting I/O errors in my array, and most of the shares were no longer browsable. I then followed the advice of Garycase: I shut down the array, started a new configuration with the old drives, replaced md4 with a new drive, and let the rebuild process continue.

The rebuild appeared to be going smoothly in the webui (I see progress, % complete, time remaining, all that), but the syslog keeps showing errors for another drive (md3). The rebuild is still running on md4, but my array is completely inaccessible: whenever I try to 'cd' into any protected share, I get 'I/O error'. However, I am able to access the '/mnt/disk*/share' folders, except for disk3 (md3) and disk4 (md4). I know that the parity is valid (and I made sure that the parity drive was not written to).

I just got the following status update, which shows md3 as problematic. Note that md4 is where the rebuild is currently ongoing. I'm freaking out, since I know that unRAID cannot protect the array if more than one drive fails. md4 previously failed and I am rebuilding it, and I am worried that a concomitant failure of md3 might screw up the rebuild. Joe L. suggested that I allow the rebuild to continue and then try a filesystem check on md3, but I am not sure whether there is a mechanical failure in md3 while the rebuild of md4 is ongoing.

I have posted below the automated status message I get daily from my server. Can someone please, please help me?
This message is a status update for unRAID Tower
-----------------------------------------------------------------
Server Name: Tower
IP: 192.168.1.10
Status: The unRaid array needs attention. One or more disks are disabled or invalid.
Date: Fri Apr 25 15:44:00 EDT 2014

Disk Temperature Status
-----------------------------------------------------------------
Parity Disk [sdg]: 29C (DiskId: ST4000DM000-1F2168_Z3006MBD)
Disk 1 [sdj]: 30C (DiskId: ST2000DL003-9VT166_5YD3XDQR)
Disk 2 [sdc]: 30C (DiskId: ST2000DL003-9VT166_5YD4QXMD)
Disk 3 [sdd]: Not-Reported (DiskId: SAMSUNG_HD204UI_S2H7J9AB703152)
Disk 4 [sdf]: 30C (DiskId: ST4000DM000-1F2168_Z301C7G6)
Disk 5 [sdi]: 33C (DiskId: HGST_HMS5C4040ALE640_PL2331LAGGY7DJ)
Disk 6 [sdh]: 30C (DiskId: ST3000DM001-9YN166_W1F0FN9P)
Disk 7 [sde1]: 29C (DiskId: ST1000DX001-1CM162_Z1D8Y3KK)

Disk SMART Health Status
-----------------------------------------------------------------
Parity Disk PASSED (DiskId: ST4000DM000-1F2168_Z3006MBD)
Disk 1 PASSED (DiskId: ST2000DL003-9VT166_5YD3XDQR)
Disk 2 PASSED (DiskId: ST2000DL003-9VT166_5YD4QXMD)
Disk 3 Not-Reported (DiskId: SAMSUNG_HD204UI_S2H7J9AB703152)
Disk 4 PASSED (DiskId: ST4000DM000-1F2168_Z301C7G6)
Disk 5 PASSED (DiskId: HGST_HMS5C4040ALE640_PL2331LAGGY7DJ)
Disk 6 PASSED (DiskId: ST3000DM001-9YN166_W1F0FN9P)
Disk 7 PASSED (DiskId: ST1000DX001-1CM162_Z1D8Y3KK)

Output of /proc/mdcmd:
-----------------------------------------------------------------
sbName=/boot/config/super.dat sbVersion=2.2.0 sbCreated=1398281816 sbUpdated=1398283286 sbEvents=6 sbState=0 sbNumDisks=7 sbSynced=0 sbSyncErrs=189005275
mdVersion=2.2.0 mdState=STARTED mdNumProtected=7 mdNumDisabled=0 mdDisabledDisk=0 mdNumInvalid=1 mdInvalidDisk=4 mdNumMissing=0 mdMissingDisk=0 mdNumNew=0
mdResync=3907018532 mdResyncCorr=1 mdResyncPos=1446328068 mdResyncDt=33 mdResyncDb=261548
diskNumber.0=0 diskName.0= diskSize.0=3907018532 diskState.0=7 diskId.0=ST4000DM000-1F2168_Z3006MBD rdevNumber.0=0 rdevStatus.0=DISK_OK rdevName.0=sdg rdevSize.0=3907018532 rdevId.0=ST4000DM000-1F2168_Z3006MBD rdevNumErrors.0=0 rdevLastIO.0=1398455040 rdevSpinupGroup.0=0
diskNumber.1=1 diskName.1=md1 diskSize.1=1953514552 diskState.1=7 diskId.1=ST2000DL003-9VT166_5YD3XDQR rdevNumber.1=1 rdevStatus.1=DISK_OK rdevName.1=sdj rdevSize.1=1953514552 rdevId.1=ST2000DL003-9VT166_5YD3XDQR rdevNumErrors.1=0 rdevLastIO.1=1398455040 rdevSpinupGroup.1=0
diskNumber.2=2 diskName.2=md2 diskSize.2=1953514552 diskState.2=7 diskId.2=ST2000DL003-9VT166_5YD4QXMD rdevNumber.2=2 rdevStatus.2=DISK_OK rdevName.2=sdc rdevSize.2=1953514552 rdevId.2=ST2000DL003-9VT166_5YD4QXMD rdevNumErrors.2=0 rdevLastIO.2=1398455040 rdevSpinupGroup.2=0
diskNumber.3=3 diskName.3=md3 diskSize.3=1953514552 diskState.3=7 diskId.3=SAMSUNG_HD204UI_S2H7J9AB703152 rdevNumber.3=3 rdevStatus.3=DISK_OK rdevName.3=sdd rdevSize.3=1953514552 rdevId.3=SAMSUNG_HD204UI_S2H7J9AB703152 rdevNumErrors.3=189107685 rdevLastIO.3=1398455040 rdevSpinupGroup.3=0
diskNumber.4=4 diskName.4=md4 diskSize.4=3907018532 diskState.4=6 diskId.4=ST4000DM000-1F2168_Z301C7G6 rdevNumber.4=4 rdevStatus.4=DISK_INVALID rdevName.4=sdf rdevSize.4=3907018532 rdevId.4=ST4000DM000-1F2168_Z301C7G6 rdevNumErrors.4=0 rdevLastIO.4=0 rdevSpinupGroup.4=0
diskNumber.5=5 diskName.5=md5 diskSize.5=3907018532 diskState.5=7 diskId.5=HGST_HMS5C4040ALE640_PL2331LAGGY7DJ rdevNumber.5=5 rdevStatus.5=DISK_OK rdevName.5=sdi rdevSize.5=3907018532 rdevId.5=HGST_HMS5C4040ALE640_PL2331LAGGY7DJ rdevNumErrors.5=0 rdevLastIO.5=1398455040 rdevSpinupGroup.5=0
diskNumber.6=6 diskName.6=md6 diskSize.6=2930266532 diskState.6=7 diskId.6=ST3000DM001-9YN166_W1F0FN9P rdevNumber.6=6 rdevStatus.6=DISK_OK rdevName.6=sdh rdevSize.6=2930266532 rdevId.6=ST3000DM001-9YN166_W1F0FN9P rdevNumErrors.6=0 rdevLastIO.6=1398455040 rdevSpinupGroup.6=0
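The /proc/mdcmd dump above is easier to read once it is split into fields. The following is only a rough sketch in Python: the helper names (`parse_mdcmd`, `rebuild_percent`, `disks_with_errors`) are made up for illustration, and reading `mdResyncPos / mdResync` as the fraction of the rebuild completed is an assumption based on the field names, not documented behavior.

```python
def parse_mdcmd(text):
    """Parse whitespace-separated key=value tokens into a dict of strings."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore anything that is not a key=value token
            fields[key] = value
    return fields


def rebuild_percent(fields):
    """Assumed progress: mdResyncPos as a percentage of mdResync."""
    total = int(fields["mdResync"])
    if total == 0:
        return None
    return 100.0 * int(fields["mdResyncPos"]) / total


def disks_with_errors(fields):
    """Map slot number -> rdevNumErrors for every slot with a nonzero count."""
    errors = {}
    for key, value in fields.items():
        if key.startswith("rdevNumErrors.") and int(value) > 0:
            errors[int(key.split(".", 1)[1])] = int(value)
    return errors


# A few fields copied from the status message above.
sample = """mdResync=3907018532 mdResyncPos=1446328068 mdInvalidDisk=4
diskName.0= rdevNumErrors.0=0 rdevNumErrors.3=189107685 rdevNumErrors.4=0"""

fields = parse_mdcmd(sample)
print(f"rebuild: {rebuild_percent(fields):.1f}% complete")
print("slots with errors:", disks_with_errors(fields))
```

Read this way, the dump says the rebuild was a bit over a third done while slot 3 (md3) had already logged roughly 189 million errors, which is the mismatch discussed in the rest of the thread.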
April 25, 201412 yr Author Yes, I can telnet the server, and the flash drive share is online. I can also individually look at each of the drives (/mnt/disk1... disk2... etc) and look at the folder contents for all the hard drives, except for disk3 and disk4. (again, disk3 is the one with the errors, and disk4 is the new disk that is being rebuilt). Here is my unraid main page. As you can see, all the drives (including the one giving me all the i/o errors) are green. The only yellow-balled one is the disk4 (/dev/sdf) which is being rebuilt. This is my rebuild progress. Funny thing is my syslog is no longer active (it is blank). The last entries on my syslog are the following: Apr 24 17:38:02 Tower kernel: md: disk3 read error, sector=1395582720 Apr 24 17:38:02 Tower kernel: md: disk3 read error, sector=1395582728 Apr 24 17:38:02 Tower kernel: md: disk3 read error, sector=1395582736 Apr 24 17:38:02 Tower kernel: md: disk3 read error, sector=1395582744 Apr 24 17:38:02 Tower kernel: md: disk3 read error, sector=1395582752 Apr 24 17:38:02 Tower kernel: md: disk3 read error, sector=1395582760... The syslog when the errors started, as the rebuild was going is posted here : (Line 461 is where the trouble starts). http://pastebin.com/WGjjawKx When I try to manually get a SMART report, I get the following (here, /dev/sdd = md3 where the errors are coming). root@Tower:/boot# smartctl -a -A -T permissive /dev/sdd smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build) Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org Short INQUIRY response, skip product id === START OF READ SMART DATA SECTION === SMART Health Status: OK Read defect list: asked for grown list but didn't get it Error Counter logging not supported scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46 Device does not support Self Test logging What I'm not sure is whether this is a hardware problem, or an issue with the file-system? 
Right now, I am allowing the drive rebuild to continue. On my case, all the LED hard drive indicators are flashing, but the indicator for md3 is solid and not flashing at all. Moreover, on the unraid webui main page, I am not seeing any progress in the number of "writes" applied to the drive that is supposed to be rebuilt. Thank you so much for your help!
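For anyone wanting to summarize read errors like the ones quoted in this post, here is a small Python sketch. The function name `summarize_read_errors` is hypothetical, and the regular expression only assumes the line format shown in the excerpt ("md: diskN read error, sector=S"); other kernel versions may word it differently.

```python
import re
from collections import defaultdict

# Matches the kernel lines quoted in the post, e.g.
# "md: disk3 read error, sector=1395582720"
READ_ERROR = re.compile(r"md: disk(\d+) read error, sector=(\d+)")


def summarize_read_errors(lines):
    """Return {disk: (error_count, lowest_sector, highest_sector)}."""
    sectors = defaultdict(list)
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            sectors[int(match.group(1))].append(int(match.group(2)))
    return {disk: (len(s), min(s), max(s)) for disk, s in sectors.items()}


# Three lines copied from the syslog excerpt in the post.
sample = [
    "Apr 24 17:38:02 Tower kernel: md: disk3 read error, sector=1395582720",
    "Apr 24 17:38:02 Tower kernel: md: disk3 read error, sector=1395582728",
    "Apr 24 17:38:02 Tower kernel: md: disk3 read error, sector=1395582736",
]

summary = summarize_read_errors(sample)
for disk, (count, low, high) in sorted(summary.items()):
    print(f"disk{disk}: {count} read errors, sectors {low}..{high}")
```

Run against a saved copy of the full syslog (passing the open file object as `lines`), a summary like this shows whether the errors are confined to one disk and whether they march sequentially through the sectors, which would suggest the disk stopped responding rather than a handful of bad sectors.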
April 26, 2014

Got your PM. I had actually seen your case earlier in the other thread and saw that several folks, including Joe L., were trying to help. Please keep all of your posts about a single problem in one thread. It makes it easier to get all the facts without jumping around. I've merged the two threads.

I am not sure how this whole thing started, but let me guess. Disk4 red balled and you decided to rebuild it? Is that correct? And then while it was rebuilding you started getting a ton of errors on disk3 and the rebuild of disk4 either stopped or slowed to a crawl? Please verify, and correct me if that's not right.

Please capture a syslog, and also a smart report from drive 3. To do this, run these two commands using telnet, or by typing into the command prompt on the unRAID server.

cp /var/log/syslog /boot/syslog1.txt
smartctl -A /dev/sdd > /boot/smartd3.txt

Once those are complete you should be able to access the files via Windows Explorer using the path "\\tower\flash". Upload the files. You can use a web site called pastebin.com if the syslog is too long to upload. Post a link to the pastebin if you use it. Or you can zip it and upload the zip file, since it is much smaller.

When you rebuild a disk, ALL of your other disks and parity have to be in good health to do the reconstruction. If you look at the error column on your main page you should see that disk3 has generated a monster number of errors. It is a world record. Like a heart rate of 497. I do not know why, but we are going to try to figure it out. I am hopeful it is a problem with the reading and not with the patient.

Is it true that you still have the disk4 that appeared to fail, and it is just not connected to the array? My guess is that the drive really did not fail, and that we will be able to read its data. I am really more worried about disk3 at this time.

Do not write anything to your array. Post the files I asked for. Ask questions if you have any.
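Why every other disk has to read correctly during a rebuild can be shown with a toy model. This is a sketch only: it assumes plain XOR parity across the data disks (the usual single-parity scheme), the disk contents are made up, and a real failing drive would more likely return read errors than silently wrong bytes.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy array: three data disks holding 4 bytes each (made-up contents).
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
disk4 = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks([disk1, disk3, disk4])

# Rebuilding disk4 from parity plus every other disk works when all reads are good.
rebuilt = xor_blocks([parity, disk1, disk3])
good_rebuild = rebuilt == disk4

# If disk3 supplies a wrong byte, the same byte of the rebuilt disk4 is wrong too.
bad_disk3 = bytes([0xAA, 0x00, 0xCC, 0xDD])
corrupt = xor_blocks([parity, disk1, bad_disk3])
bad_positions = [i for i, (a, b) in enumerate(zip(corrupt, disk4)) if a != b]

print("clean rebuild matches original:", good_rebuild)
print("corrupted byte positions:", bad_positions)
```

The point of the sketch is that parity cannot tell which input was wrong: every position where disk3 could not be read correctly becomes a damaged position on the rebuilt disk4.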
April 26, 2014 (Author)

Thank you for your help. I will keep everything in this thread.

"I am not sure how this whole thing started, but let me guess. Disk4 red balled and you decided to rebuild it? Is that correct?"

1. Yes, you are correct. Disk4 red-balled, and I tried to rebuild it, but I went about it the wrong way by removing disk4 from the array while the array was active (I figured that since the disk was red-balled it would be off the array). As I stated above, that started triggering "Input/output errors" on the array in general, and I could not write to any of the other drives.

"Please capture a syslog, and also a smart report from drive 3. To do this, run these two commands using telnet, or by typing into the command prompt on the unRAID server.
cp /var/log/syslog /boot/syslog1.txt
smartctl -A /dev/sdd>/boot/smartd3.txt"

2. Tried this approach. I was able to get a syslog from when the problems started. That syslog is posted above (http://pastebin.com/WGjjawKx), and if you scroll to lines 460-461 you will see the errors starting to pop up with disk3. Unfortunately, when I try to output the syslog NOW (more than 24 hrs after the problem started), I get an empty syslog. In other words, the command that you indicated above (copying the syslog to syslog1.txt) outputs an empty text file. Before the syslog disappeared on me, I was getting a list of "read errors" on disk3, but no write errors.

When I run smartctl I get the following error:

smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

When I re-run smartctl with the "-T permissive" option, I get the following output.
root@Tower:~# smartctl -a -T permissive /dev/sdd
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
Short INQUIRY response, skip product id
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Read defect list: asked for grown list but didn't get it
Error Counter logging not supported
Device does not support Self Test logging

The above SMART report is for /dev/sdd (which is the problematic disk3).

3. However, I think my disk4 really did fail, because when I tried to pre-clear it I started getting a ton of write errors. I don't know if this is good news, but the errors on disk3 are all read errors (see my post above, where I put an excerpt of what I was getting on the screen).

4. The thing that bothers me is that the web UI is behaving as if the rebuild is going smoothly, with the % complete, time remaining, and write speed all advancing properly. However, on the main web UI page the number of "writes" for disk4 (which should be getting rebuilt) is not increasing. It is stuck at a particular value. Meanwhile the number of errors on disk3 (all of which are read errors) is multiplying rapidly. The rebuild has about 2 more days to run, so I was wondering: should I stop the process right now to troubleshoot disk3, or should I let the rebuild finish and then troubleshoot disk3 and disk4?

5. Lastly, I have a question as a last resort. If both disk3 and disk4 have failed, I'm faced with the nightmare that my array is unrecoverable. But currently I have disk1, disk2, disk5, and disk6, all of which have data that I can browse by going to /mnt/disk*/ after I log in to the server as root. If there are 2 failed drives, does this mean that the data on all of the other disks is also unrecoverable? Or is there a way to rebuild the array without the data from disk3 and disk4 (so at worst I am only losing the data on those drives)?

I hope this covers the questions you asked. I cannot tell you how much I appreciate your assistance.
April 26, 2014

Sorry, did you try to remove a disk from a running server? Or did you just not remove it from the array configuration before powering down and pulling the disk?

You should not have tried to preclear disk4 before its rebuild had succeeded. You always want a backup recovery plan in case the primary one fails.

If you are popping disks in and out of your array without shutting down, that may be the cause of your ills. There is no telling what state the disks are in, though.

I would shut down, physically install all of the drives in the server (if you can), and then let's try the smart reports and syslog again.
April 26, 201412 yr A couple of thoughts ... Not helpful here, but for future use, do NOT physically remove a disk from a powered array !! Next, NEVER pre-clear a disk that may have data you need to recover on it. That wipes the disk and makes recovery effectively impossible. Finally, if you have in fact had 2 drives fail, then UnRAID can NOT rebuild either one. You MAY be able to restore the file systems on one or both using reiserfsck; but that won't be possible on the one you pre-cleared. The data on your other disks (1, 2, 5, and 6) is fine ... but clearly is not parity protected, so if any of those drives were to fail you'd potentially lose it as well. What to do next? Given what you've already done (especially the pre-clear) I'd let the rebuild complete. It's likely it won't be successful, but I'd wait to confirm that. The Disk3 status is very interesting ... I have to wonder if perhaps you hit the cable when you pulled out or reinserted disk4, and that's causing all the read errors. When the rebuild completes, I'd shut down the system; unplug/replug the cables for disks 3 and 4 (or use new ones); then boot to the system and see what the status is ... in particular what the data looks like on disk4 (the rebuilt one) and disk3. If all looks okay, do a NON-correcting parity check to see if everything matches your current parity. Next step depends on the results of that.
April 26, 2014 (Author)

Thanks to both of you, garycase and bjp999. I think I am going to let the rebuild continue and see what happens. I am at 55% now, and disk4 (being rebuilt) suddenly has a write count that is slowly increasing. Lots of lessons learned here, but I cannot thank you guys enough for your assistance. I will report back soon.
April 26, 2014

Can you estimate the completion time based on current progress? Just time it for 10 or 15 minutes, note the progress, and then extrapolate. Are we talking about days, weeks, months or longer to complete this rebuild?
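The extrapolation suggested above is simple arithmetic. Here is a minimal Python sketch of it; the function name is hypothetical and the 55.0% and 55.2% readings are invented purely to show the calculation, they are not measurements from this server.

```python
def estimate_remaining_hours(pct_start, pct_end, minutes_elapsed):
    """Linear extrapolation of the time left from two progress readings.

    Returns None when no progress was made, since no estimate is possible.
    """
    rate_per_minute = (pct_end - pct_start) / minutes_elapsed
    if rate_per_minute <= 0:
        return None
    return (100.0 - pct_end) / rate_per_minute / 60.0


# Invented readings: 55.0% at the start, 55.2% fifteen minutes later.
hours = estimate_remaining_hours(55.0, 55.2, 15)
print(f"about {hours:.0f} hours ({hours / 24:.1f} days) remaining")
```

A linear estimate like this only holds if the rate stays constant. A rebuild that is fighting read errors on another disk may speed up or slow down sharply, so it is worth repeating the measurement a few times.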
April 29, 2014 (Author)

OK, after literally 5 days, the so-called "rebuild" is complete. I am getting green balls on all the drives in the array (including the one that was rebuilt) except md3, the drive which failed in the middle of the rebuild, which is red-balled. Here's my main screen (all my other drives show up as spun down).

The array is reportedly active, and when I browse through my shares, I can access some but not all of them. For example, my movie share comes up blank (I/O error), but I get a good directory listing and accessibility with, say, my e-book share.

I'd love to know what I should do next. I feel that the rebuild was obviously a failure. It is also possible that md3 has not really failed. On the front chassis of the server, the LED indicator for md3 is "on" solid, stuck, like it's constantly being accessed.

I was thinking about doing the following:
1. Stop the array.
2. Restart the physical server.
3. Start the array in maintenance mode.
4. Get SMART reports on each of the drives (short test).
5. Run reiserfsck --check on each of the drives.

I will report the results back here.
April 29, 201412 yr Save a syslog before you do anything. Can you see files on disk4? Can you see files on disk3? (UnRaid will be simulating disk3 if it is red balled. What you see is what you'll get if you rebuild. ) I would definitely reboot the box if you haven't already. In fact I would power it down completely (unplugged for 30 seconds) before continuing. I'd replace the data cable on disk3 while it's down. Is the old disk4 in the server? If not I would put it in the server. Do not preclear it or write to it in any way. Get smart reports in all the disks and post them.
April 29, 201412 yr Author Save a syslog before you do anything. This is the most recent syslog. As the rebuild was ongoing, on 4/24 suddenly the syslog stopped. The last entries I have are from 4/24. It is posted here: https://www.dropbox.com/s/n6vz16aj39ib7gn/syslog.txt Can you see files on disk4? Yes. I can browse through /mnt/disk4 all the way to the individual file level. Can you see files on disk3? (UnRaid will be simulating disk3 if it is red balled. What you see is what you'll get if you rebuild. ) No, this is where the problem is. I can browse /mnt/disk3 upto the directory levels (I can even go into my movie folder and to individual movies' folder), but when I "ls" from the movie folder, I don't get a file listing. Instead I get input/output error. I would definitely reboot the box if you haven't already. In fact I would power it down completely (unplugged for 30 seconds) before continuing. I'd replace the data cable on disk3 while it's down. OK, I will try this. But before I do, I should get a directory listing of the contents of /mnt/disk3 and /mnt/disk4. That way, if I lose one or both of those drives, I know which data I need to retrieve. Is the old disk4 in the server? If not I would put it in the server. Do not preclear it or write to it in any way. Unfortunately, this is where I did something stupid earlier, that both you and Garycase had alluded to. I precleared the old red-balled disk4 before the data rebuild was complete. You'd think after 14 years of medical school and training that would prevent you from doing something stupid, but hell no it doesn't. Get smart reports in all the disks and post them. I will do this next. Thanks bjp999!
April 29, 2014

I thought you had said that the attempt to preclear disk4 failed. After the reboot, test it for a valid preclear signature (-t parameter, I think).

Disk3 is being simulated from parity and all other disks. Since you just rebuilt disk4, its contribution to the simulation may be flawed. This would tend to point to a bad rebuild of disk4.

Are you able to confirm more than file names? Can you try playing some movies, checking the integrity of zip files, etc. on disk4? Test files in various folders, including the newest added files if possible.

DO NOT WRITE TO THE ARRAY AT ALL. Do not run any reiser commands that are not 100% read-only.

My reasoning is that we may be successful at resuscitating disk3. If so, we may be able to rebuild disk4 again. Any update to any disk in any way will impact the quality of that rebuild.
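One way to do the zip integrity check mentioned above without writing anything to the disk is sketched below in Python. It assumes Python 3 is available somewhere that can read the disk (for example a client machine browsing the share); the function name `check_zip_files` is hypothetical. It relies on the standard library's `zipfile.ZipFile.testzip()`, which reads every member and verifies its CRC.

```python
import zipfile
import zlib
from pathlib import Path


def check_zip_files(root):
    """Return {path: problem} for every .zip under root that fails a read-only test."""
    problems = {}
    for path in sorted(Path(root).rglob("*.zip")):
        try:
            with zipfile.ZipFile(path) as archive:  # default mode is read-only
                bad_member = archive.testzip()      # reads all members, checks CRCs
            if bad_member is not None:
                problems[str(path)] = f"bad CRC in {bad_member}"
        except (zipfile.BadZipFile, zlib.error, OSError) as exc:
            # Covers a damaged archive structure as well as I/O errors on read.
            problems[str(path)] = f"unreadable: {exc}"
    return problems
```

Calling `check_zip_files("/mnt/disk4")` on the server, or the equivalent share path from a client, returns an empty dict when every archive reads back cleanly. Send any report output to the flash drive or another machine, not to the array.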
April 29, 201412 yr Author Yes, you were right. Those files on the rebuilt disk4 were only teasers.... Many of those don't play so I assume that meant the rebuild failed. Now disk3 is red-balled, so I am going to shut down the array. Like you said, I will set up a new configuration (with trust parity) and try to see if tightening the cables etc will help resuscitate disk3 (with disk4 set up to be missing on the array). Then I will go into maintenance mode and try SMARTCTL on each drive and post a report here. If the problem on disk3 was a reiserfs error, then this should not have impacted the rebuild process for disk4 right? I'll get back to you soon. Thank you again!
April 29, 2014

I did not say to do any of that. I said to shut down, make sure all of the drives are in the server, reboot, and get the smart reports.
April 29, 201412 yr Author OK, so here is what happened. STEP 1. I stopped the array after the rebuild, ran the power down script. Completely unhooked the power from the server. Opened the case, and tightened each one of the connections and restarted the server. STEP 2. During the boot process -- I received a couple of Reiserfs errors for md3 (this was the disk that was red-balled). I have attached the syslog for you to review from the boot up process, here : https://www.dropbox.com/s/3wbryfqy2xa454e/syslog-restart.txt STEP 3. Then when the boot up process was finally over, my webui shows the following: STEP 4. As you had indicated, I ran "smartctl -a" for the red-balled disk3(which is /dev/sdh), the newly rebuilt disk4 (which is /dev/sdg) and for disk5 which seems to be working (/dev/sde). I also ran a "short test" and "long test" for disk3 (long test still pending, and short test results posted), and a "short test" for disk4. The attached zip file contains the text files for these SMART reports. SMART-reports.zip : https://www.dropbox.com/s/8zchc1ctc3aahck/SMART-reports.zip smart-disk3.txt = Output of "smartctl -a /dev/sdh" (disk3) smart-disk3-short.txt = Output of the report generated from smartctl short test for disk3 smart-disk4.txt = Output of "smartctl -a /dev/sdg" (disk4) smart-disk4-short.txt = Output of the report generated from smartctl short test for disk4 smart-disk5.txt = Output of "smartctl -a /dev/sde" (disk5) I'm not great at interpreting smart reports but I'm not sure whether it is saying that disk3 is working (I think it maybe, but I'm not sure). I will post the "long result" separately when it is done in a few hours. I will await your advice for the next step. Thank you again, bjp999.
April 29, 2014

Are you running any plugins that write to the array (Sickbeard, anything nzb related, etc.)?

There are no smart errors that I see. There is a message about a possible firmware issue that was a little scary. Smartmontools is what you ran to see the reports.

Do you have a pro license or a plus license?

And what about the old disk4? Did you check for a preclear signature? Run a smart report?
April 29, 201412 yr Author Are you running any plugins that write to the array. (Sickbeard, anything nzb related, etc,) Yes, I am running Sabnzbd, Sickbeard and Couchpotato. But I have disabled all of them after the boot process since I am worried about them writing to the array. There are no smart errors that I see. There is some message about a possible firmware issue that was a little scarey. Smartmon tools is what you ran to see the reports. So does this mean that disk3 is "physically" OK and perhaps needs a file system correction? Looking over the syslog, I see one error that shows up as a reiserfs error. Here Apr 29 13:49:33 Tower kernel: REISERFS (device md3): found reiserfs format "3.6" with standard journal Apr 29 13:49:33 Tower kernel: REISERFS (device md3): using ordered data mode Apr 29 13:49:33 Tower kernel: reiserfs: using flush barriers Apr 29 13:49:33 Tower kernel: REISERFS (device md3): journal params: device md3, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 Apr 29 13:49:33 Tower kernel: REISERFS (device md3): checking transaction log (md3) Apr 29 13:49:33 Tower kernel: REISERFS (device md3): replayed 2 transactions in 0 seconds Apr 29 13:49:33 Tower kernel: REISERFS (device md3): Using r5 hash to sort names Apr 29 13:49:33 Tower logger: mount: block device /dev/md3 is write-protected, mounting read-only Apr 29 13:49:34 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 27119 does not match to the expected one 1 Apr 29 13:49:34 Tower kernel: REISERFS error (device md3): vs-5150 search_by_key: invalid format found in block 202261721. Fsck? 
Apr 29 13:49:34 Tower kernel: REISERFS (device md3): Remounting filesystem read-only Apr 29 13:49:34 Tower kernel: REISERFS error (device md3): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3 3385 0x0 SD] Apr 29 13:49:34 Tower kernel: REISERFS (device md3): found reiserfs format "3.6" with standard journal Apr 29 13:49:34 Tower kernel: REISERFS (device md3): using ordered data mode Apr 29 13:49:34 Tower kernel: reiserfs: using flush barriers Apr 29 13:49:34 Tower kernel: REISERFS (device md3): journal params: device md3, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 Apr 29 13:49:34 Tower kernel: REISERFS (device md3): checking transaction log (md3) *********************************** ERRORS START HERE ************************* Apr 29 13:49:34 Tower logger: mount: cannot mount block device /dev/md3 read-only Apr 29 13:49:34 Tower emhttp: _shcmd: shcmd (28): exit status: 32 Apr 29 13:49:34 Tower emhttp: disk3 mount error: 32 Apr 29 13:49:34 Tower emhttp: shcmd (29): rmdir /mnt/disk3 Apr 29 13:49:34 Tower emhttp: shcmd (30): mkdir /mnt/disk4 Apr 29 13:49:34 Tower emhttp: shcmd (31): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md4 /mnt/disk4 |& logger Apr 29 13:49:34 Tower kernel: REISERFS (device md3): Using r5 hash to sort names Apr 29 13:49:34 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 27119 does not match to the expected one 1 Apr 29 13:49:34 Tower kernel: REISERFS error (device md3): vs-5150 search_by_key: invalid format found in block 202261721. Fsck? 
Apr 29 13:49:34 Tower kernel: REISERFS error (device md3): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3 3385 0x0 SD] ****************************** SEE ERRORS ABOVE ******************************** Apr 29 13:49:34 Tower kernel: REISERFS (device md4): found reiserfs format "3.6" with standard journal Apr 29 13:49:34 Tower kernel: REISERFS (device md4): using ordered data mode Apr 29 13:49:34 Tower kernel: reiserfs: using flush barriers Apr 29 13:49:34 Tower kernel: REISERFS (device md4): journal params: device md4, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 Apr 29 13:49:34 Tower kernel: REISERFS (device md4): checking transaction log (md4) Apr 29 13:50:04 Tower kernel: REISERFS (device md4): replayed 162 transactions in 30 seconds Apr 29 13:50:05 Tower kernel: REISERFS (device md4): Using r5 hash to sort names Apr 29 13:50:05 Tower emhttp: shcmd (32): mkdir /mnt/disk5 Apr 29 13:50:05 Tower emhttp: shcmd (33): set -o pipefail ; mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/md5 /mnt/disk5 Do you have a pro license or plus license?m Pro license. And what about the old disk4? Did you check for a preclear signature? Run a smart report? I did run a pre clear signature test (preclear_disk.sh -t) and I got a message saying that the drive IS "successfully pre cleared" -- So I suppose that is bad news for me. What do you suggest I do at this point? Should I try to run reiserfsck on disk3? try a new rebuild process for disk4? I will wait to hear from you.
April 29, 2014

I am at work, and the next series of steps will take some attention. I will respond later this evening. I suggest you stop the array.

I do see a POSSIBLE path to recovery, but one mistake will throw us off. If you take another person's input, you will need to finish with them supporting you.
April 29, 2014 (Author)

Well, let me tell you that I am not even going to press a button without your approval. I greatly, greatly appreciate all that you are willing to do. I will stop the array and await your instructions. Thank you so much!
April 29, 2014 Author

I also wanted to point out something I noticed when I try to shut down the array in normal mode. It looks like some process is always holding the disks back from being unmounted. After a while, all the disks get unmounted and the array is stopped according to the web UI. However, the LED indicators on the array continue to blink every second. When I look at the syslog, the syslog is wildly active with the same set of messages repeating itself in a wild loop even when the web ui states that the array is shut down.

Apr 29 15:27:02 Tower kernel: md: unRAID driver removed
Apr 29 15:27:02 Tower kernel: md: unRAID driver 2.2.0 installed
Apr 29 15:27:02 Tower kernel: mdcmd (1): import 0 8,48 3907018532 ST4000DM000-1F2168_Z3006MBD
Apr 29 15:27:02 Tower kernel: md: import disk0: [8,48] (sdd) ST4000DM000-1F2168_Z3006MBD size: 3907018532
Apr 29 15:27:02 Tower kernel: mdcmd (2): import 1 8,80 1953514552 ST2000DL003-9VT166_5YD3XDQR
Apr 29 15:27:02 Tower kernel: md: import disk1: [8,80] (sdf) ST2000DL003-9VT166_5YD3XDQR size: 1953514552
Apr 29 15:27:02 Tower kernel: mdcmd (3): import 2 8,144 1953514552 ST2000DL003-9VT166_5YD4QXMD
Apr 29 15:27:02 Tower kernel: md: import disk2: [8,144] (sdj) ST2000DL003-9VT166_5YD4QXMD size: 1953514552
Apr 29 15:27:02 Tower kernel: mdcmd (4): import 3 8,112 1953514552 SAMSUNG_HD204UI_S2H7J9AB703152
Apr 29 15:27:02 Tower kernel: md: import disk3: [8,112] (sdh) SAMSUNG_HD204UI_S2H7J9AB703152 size: 1953514552
Apr 29 15:27:02 Tower kernel: mdcmd (5): import 4 8,96 3907018532 ST4000DM000-1F2168_Z301C7G6
Apr 29 15:27:02 Tower kernel: md: import disk4: [8,96] (sdg) ST4000DM000-1F2168_Z301C7G6 size: 3907018532
Apr 29 15:27:02 Tower emhttp: shcmd (826): /usr/local/sbin/emhttp_event driver_loaded
Apr 29 15:27:02 Tower emhttp_event: driver_loaded
Apr 29 15:27:02 Tower kernel: mdcmd (6): import 5 8,64 3907018532 HGST_HMS5C4040ALE640_PL2331LAGGY7DJ
Apr 29 15:27:02 Tower kernel: md: import disk5: [8,64] (sde) HGST_HMS5C4040ALE640_PL2331LAGGY7DJ size: 3907018532
Apr 29 15:27:02 Tower kernel: mdcmd (7): import 6 8,128 2930266532 ST3000DM001-9YN166_W1F0FN9P
Apr 29 15:27:02 Tower kernel: md: import disk6: [8,128] (sdi) ST3000DM001-9YN166_W1F0FN9P size: 2930266532
Apr 29 15:27:02 Tower kernel: mdcmd (8): import 7 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (9): import 8 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (10): import 9 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (11): import 10 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (12): import 11 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (13): import 12 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (14): import 13 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (15): import 14 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (16): import 15 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (17): import 16 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (18): import 17 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (19): import 18 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (20): import 19 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (21): import 20 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (22): import 21 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (23): import 22 0,0
Apr 29 15:27:02 Tower kernel: mdcmd (24): import 23 0,0
Apr 29 15:27:03 Tower emhttp: shcmd (827): rmmod md-mod |& logger
Apr 29 15:27:03 Tower emhttp: shcmd (828): modprobe md-mod super=/boot/config/super.dat slots=24 |& logger
Apr 29 15:27:03 Tower emhttp: shcmd (829): udevadm settle
Apr 29 15:27:03 Tower kernel: md: unRAID driver removed
Apr 29 15:27:03 Tower kernel: md: unRAID driver 2.2.0 installed
Apr 29 15:27:03 Tower emhttp: Device inventory:
Apr 29 15:27:03 Tower emhttp: ST2000DX001-1CM164_Z1E5L2EN (sda) 1953514584
Apr 29 15:27:03 Tower emhttp: ST1000DX001-1CM162_Z1D8Y3KK (sdc) 976762584
Apr 29 15:27:03 Tower emhttp: ST4000DM000-1F2168_Z3006MBD (sdd) 3907018584
Apr 29 15:27:03 Tower emhttp: HGST_HMS5C4040ALE640_PL2331LAGGY7DJ (sde) 3907018584
Apr 29 15:27:03 Tower emhttp: ST2000DL003-9VT166_5YD3XDQR (sdf) 1953514584
Apr 29 15:27:03 Tower emhttp: ST4000DM000-1F2168_Z301C7G6 (sdg) 3907018584
Apr 29 15:27:03 Tower emhttp: SAMSUNG_HD204UI_S2H7J9AB703152 (sdh) 1953514584
Apr 29 15:27:03 Tower emhttp: ST3000DM001-9YN166_W1F0FN9P (sdi) 2930266584
Apr 29 15:27:03 Tower emhttp: ST2000DL003-9VT166_5YD4QXMD (sdj) 1953514584
Apr 29 15:27:03 Tower kernel: mdcmd (1): import 0 8,48 3907018532 ST4000DM000-1F2168_Z3006MBD
Apr 29 15:27:03 Tower kernel: md: import disk0: [8,48] (sdd) ST4000DM000-1F2168_Z3006MBD size: 3907018532
Apr 29 15:27:03 Tower kernel: mdcmd (2): import 1 8,80 1953514552 ST2000DL003-9VT166_5YD3XDQR
Apr 29 15:27:03 Tower kernel: md: import disk1: [8,80] (sdf) ST2000DL003-9VT166_5YD3XDQR size: 1953514552
Apr 29 15:27:03 Tower kernel: mdcmd (3): import 2 8,144 1953514552 ST2000DL003-9VT166_5YD4QXMD
Apr 29 15:27:03 Tower kernel: md: import disk2: [8,144] (sdj) ST2000DL003-9VT166_5YD4QXMD size: 1953514552
Apr 29 15:27:03 Tower kernel: mdcmd (4): import 3 8,112 1953514552 SAMSUNG_HD204UI_S2H7J9AB703152
Apr 29 15:27:03 Tower kernel: md: import disk3: [8,112] (sdh) SAMSUNG_HD204UI_S2H7J9AB703152 size: 1953514552
Apr 29 15:27:03 Tower kernel: mdcmd (5): import 4 8,96 3907018532 ST4000DM000-1F2168_Z301C7G6
Apr 29 15:27:03 Tower kernel: md: import disk4: [8,96] (sdg) ST4000DM000-1F2168_Z301C7G6 size: 3907018532
Apr 29 15:27:03 Tower emhttp: shcmd (830): /usr/local/sbin/emhttp_event driver_loaded
Apr 29 15:27:03 Tower emhttp_event: driver_loaded
Apr 29 15:27:03 Tower kernel: mdcmd (6): import 5 8,64 3907018532 HGST_HMS5C4040ALE640_PL2331LAGGY7DJ
Apr 29 15:27:03 Tower kernel: md: import disk5: [8,64] (sde) HGST_HMS5C4040ALE640_PL2331LAGGY7DJ size: 3907018532
Apr 29 15:27:03 Tower kernel: mdcmd (7): import 6 8,128 2930266532 ST3000DM001-9YN166_W1F0FN9P
Apr 29 15:27:03 Tower kernel: md: import disk6: [8,128] (sdi) ST3000DM001-9YN166_W1F0FN9P size: 2930266532
Apr 29 15:27:03 Tower kernel: mdcmd (8): import 7 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (9): import 8 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (10): import 9 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (11): import 10 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (12): import 11 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (13): import 12 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (14): import 13 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (15): import 14 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (16): import 15 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (17): import 16 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (18): import 17 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (19): import 18 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (20): import 19 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (21): import 20 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (22): import 21 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (23): import 22 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (24): import 23 0,0
Apr 29 15:27:03 Tower emhttp: shcmd (831): rmmod md-mod |& logger
Apr 29 15:27:03 Tower emhttp: shcmd (832): modprobe md-mod super=/boot/config/super.dat slots=24 |& logger
Apr 29 15:27:03 Tower emhttp: shcmd (833): udevadm settle
Apr 29 15:27:03 Tower emhttp: Device inventory:
Apr 29 15:27:03 Tower emhttp: ST2000DX001-1CM164_Z1E5L2EN (sda) 1953514584
Apr 29 15:27:03 Tower emhttp: ST1000DX001-1CM162_Z1D8Y3KK (sdc) 976762584
Apr 29 15:27:03 Tower emhttp: ST4000DM000-1F2168_Z3006MBD (sdd) 3907018584
Apr 29 15:27:03 Tower emhttp: HGST_HMS5C4040ALE640_PL2331LAGGY7DJ (sde) 3907018584
Apr 29 15:27:03 Tower emhttp: ST2000DL003-9VT166_5YD3XDQR (sdf) 1953514584
Apr 29 15:27:03 Tower emhttp: ST4000DM000-1F2168_Z301C7G6 (sdg) 3907018584
Apr 29 15:27:03 Tower emhttp: SAMSUNG_HD204UI_S2H7J9AB703152 (sdh) 1953514584
Apr 29 15:27:03 Tower emhttp: ST3000DM001-9YN166_W1F0FN9P (sdi) 2930266584
Apr 29 15:27:03 Tower emhttp: ST2000DL003-9VT166_5YD4QXMD (sdj) 1953514584
Apr 29 15:27:03 Tower kernel: md: unRAID driver removed
Apr 29 15:27:03 Tower kernel: md: unRAID driver 2.2.0 installed
Apr 29 15:27:03 Tower kernel: mdcmd (1): import 0 8,48 3907018532 ST4000DM000-1F2168_Z3006MBD
Apr 29 15:27:03 Tower kernel: md: import disk0: [8,48] (sdd) ST4000DM000-1F2168_Z3006MBD size: 3907018532
Apr 29 15:27:03 Tower kernel: mdcmd (2): import 1 8,80 1953514552 ST2000DL003-9VT166_5YD3XDQR
Apr 29 15:27:03 Tower kernel: md: import disk1: [8,80] (sdf) ST2000DL003-9VT166_5YD3XDQR size: 1953514552
Apr 29 15:27:03 Tower kernel: mdcmd (3): import 2 8,144 1953514552 ST2000DL003-9VT166_5YD4QXMD
Apr 29 15:27:03 Tower kernel: md: import disk2: [8,144] (sdj) ST2000DL003-9VT166_5YD4QXMD size: 1953514552
Apr 29 15:27:03 Tower kernel: mdcmd (4): import 3 8,112 1953514552 SAMSUNG_HD204UI_S2H7J9AB703152
Apr 29 15:27:03 Tower kernel: md: import disk3: [8,112] (sdh) SAMSUNG_HD204UI_S2H7J9AB703152 size: 1953514552
Apr 29 15:27:03 Tower kernel: mdcmd (5): import 4 8,96 3907018532 ST4000DM000-1F2168_Z301C7G6
Apr 29 15:27:03 Tower kernel: md: import disk4: [8,96] (sdg) ST4000DM000-1F2168_Z301C7G6 size: 3907018532
Apr 29 15:27:03 Tower emhttp: shcmd (834): /usr/local/sbin/emhttp_event driver_loaded
Apr 29 15:27:03 Tower kernel: mdcmd (6): import 5 8,64 3907018532 HGST_HMS5C4040ALE640_PL2331LAGGY7DJ
Apr 29 15:27:03 Tower kernel: md: import disk5: [8,64] (sde) HGST_HMS5C4040ALE640_PL2331LAGGY7DJ size: 3907018532
Apr 29 15:27:03 Tower kernel: mdcmd (7): import 6 8,128 2930266532 ST3000DM001-9YN166_W1F0FN9P
Apr 29 15:27:03 Tower kernel: md: import disk6: [8,128] (sdi) ST3000DM001-9YN166_W1F0FN9P size: 2930266532
Apr 29 15:27:03 Tower kernel: mdcmd (8): import 7 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (9): import 8 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (10): import 9 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (11): import 10 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (12): import 11 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (13): import 12 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (14): import 13 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (15): import 14 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (16): import 15 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (17): import 16 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (18): import 17 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (19): import 18 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (20): import 19 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (21): import 20 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (22): import 21 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (23): import 22 0,0
Apr 29 15:27:03 Tower kernel: mdcmd (24): import 23 0,0
Apr 29 15:27:03 Tower emhttp_event: driver_loaded
Apr 29 15:27:04 Tower emhttp: shcmd (835): rmmod md-mod |& logger
Apr 29 15:27:04 Tower emhttp: shcmd (836): modprobe md-mod super=/boot/config/super.dat slots=24 |& logger
Apr 29 15:27:04 Tower emhttp: shcmd (837): udevadm settle
Apr 29 15:27:04 Tower emhttp: Device inventory:
Apr 29 15:27:04 Tower emhttp: ST2000DX001-1CM164_Z1E5L2EN (sda) 1953514584
Apr 29 15:27:04 Tower emhttp: ST1000DX001-1CM162_Z1D8Y3KK (sdc) 976762584
Apr 29 15:27:04 Tower emhttp: ST4000DM000-1F2168_Z3006MBD (sdd) 3907018584
Apr 29 15:27:04 Tower kernel: md: unRAID driver removed
Apr 29 15:27:04 Tower kernel: md: unRAID driver 2.2.0 installed
Apr 29 15:27:04 Tower emhttp: HGST_HMS5C4040ALE640_PL2331LAGGY7DJ (sde) 3907018584
Apr 29 15:27:04 Tower emhttp: ST2000DL003-9VT166_5YD3XDQR (sdf) 1953514584
Apr 29 15:27:04 Tower emhttp: ST4000DM000-1F2168_Z301C7G6 (sdg) 3907018584
Apr 29 15:27:04 Tower emhttp: SAMSUNG_HD204UI_S2H7J9AB703152 (sdh) 1953514584
Apr 29 15:27:04 Tower emhttp: ST3000DM001-9YN166_W1F0FN9P (sdi) 2930266584
Apr 29 15:27:04 Tower emhttp: ST2000DL003-9VT166_5YD4QXMD (sdj) 1953514584
Apr 29 15:27:04 Tower kernel: mdcmd (1): import 0 8,48 3907018532 ST4000DM000-1F2168_Z3006MBD
Apr 29 15:27:04 Tower kernel: md: import disk0: [8,48] (sdd) ST4000DM000-1F2168_Z3006MBD size: 3907018532
Apr 29 15:27:04 Tower kernel: mdcmd (2): import 1 8,80 1953514552 ST2000DL003-9VT166_5YD3XDQR
Apr 29 15:27:04 Tower kernel: md: import disk1: [8,80] (sdf) ST2000DL003-9VT166_5YD3XDQR size: 1953514552
Apr 29 15:27:04 Tower kernel: mdcmd (3): import 2 8,144 1953514552 ST2000DL003-9VT166_5YD4QXMD
Apr 29 15:27:04 Tower kernel: md: import disk2: [8,144] (sdj) ST2000DL003-9VT166_5YD4QXMD size: 1953514552
Apr 29 15:27:04 Tower kernel: mdcmd (4): import 3 8,112 1953514552 SAMSUNG_HD204UI_S2H7J9AB703152
Apr 29 15:27:04 Tower kernel: md: import disk3: [8,112] (sdh) SAMSUNG_HD204UI_S2H7J9AB703152 size: 1953514552
Apr 29 15:27:04 Tower kernel: mdcmd (5): import 4 8,96 3907018532 ST4000DM000-1F2168_Z301C7G6
Apr 29 15:27:04 Tower kernel: md: import disk4: [8,96] (sdg) ST4000DM000-1F2168_Z301C7G6 size: 3907018532

The full shutdown syslog is posted here: https://www.dropbox.com/s/vgc5c8z6ufv6iaj/syslog-close.txt

**** UPDATE 4PM: When I started the server in Safe Mode (without the plugins) and then stopped the array, the server behaved very well and stopped the array immediately, without those runs of repeated messages. I suspect one of the plugins is interfering with the proper shutdown of the array. I will keep it in Safe Mode so that I can do this job as cleanly as possible, without interference from Sickbeard, Couchpotato and other plugins running in the background.
April 29, 2014

"When I look at the syslog, the syslog is wildly active with the same set of messages repeating itself in a wild loop even when the web ui states that the array is shut down."

That particular set of messages looks to me like what happens whenever the main web UI is refreshed. So each time you hit refresh in the browser, it will generate all those messages again.
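One way to check this theory against the log is to count how often the md driver gets reinstalled per second. If each page refresh triggers one reload cycle, then more than one reload in the same second, with nobody touching the browser, would suggest something else is polling the web UI. The following is only a rough Python sketch: the regular expression is an assumption based on the lines quoted in this thread, and the function name `reloads_per_second` is made up for illustration.

```python
import re
from collections import Counter

# Assumed line format, taken from the syslog lines quoted in this thread:
#   "Apr 29 15:27:02 Tower kernel: md: unRAID driver 2.2.0 installed"
INSTALLED = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d) \S+ kernel: md: unRAID driver \S+ installed"
)

def reloads_per_second(lines):
    """Count 'unRAID driver ... installed' entries per syslog timestamp."""
    counts = Counter()
    for line in lines:
        match = INSTALLED.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# The driver install lines from the excerpt posted above, plus one
# 'removed' line that should be ignored by the counter.
sample = [
    "Apr 29 15:27:02 Tower kernel: md: unRAID driver removed",
    "Apr 29 15:27:02 Tower kernel: md: unRAID driver 2.2.0 installed",
    "Apr 29 15:27:03 Tower kernel: md: unRAID driver 2.2.0 installed",
    "Apr 29 15:27:03 Tower kernel: md: unRAID driver 2.2.0 installed",
    "Apr 29 15:27:04 Tower kernel: md: unRAID driver 2.2.0 installed",
]

counts = reloads_per_second(sample)
for stamp, n in sorted(counts.items()):
    print(stamp, n)
```

To run it on a saved log, pass an open file object (for example the downloaded syslog-close.txt) to `reloads_per_second` instead of `sample`.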
April 29, 2014 Author

jonathanm, thank you. That's quite interesting. I am not doing anything in the browser, yet somehow it keeps acting like that. When I started the server without the plugins, it shut down beautifully. The only plugins I am running are:

Sabnzbd
Sickbeard
Couchpotato
BitTorrent Sync
Crashplan
Dynamix
Cache_Dir (Dynamix)
Email (Dynamix)
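If a single plugin is to blame, halving the list on each test narrows eight plugins down in three boots. The Python sketch below only illustrates that search order. It assumes exactly one culprit that misbehaves on its own, and it ignores dependencies between plugins (the Dynamix add-ons may need Dynamix itself). `shutdown_hangs` is a hypothetical callback that stands in for the manual step of booting with only the given plugins enabled and trying to stop the array.

```python
# Plugin names as listed in the post above.
PLUGINS = [
    "Sabnzbd",
    "Sickbeard",
    "Couchpotato",
    "BitTorrent Sync",
    "Crashplan",
    "Dynamix",
    "Cache_Dir (Dynamix)",
    "Email (Dynamix)",
]

def find_culprit(plugins, shutdown_hangs):
    """Binary search for one misbehaving plugin.

    shutdown_hangs(enabled) is a hypothetical callback: it must return True
    if the array fails to stop cleanly with only `enabled` plugins active.
    Assumes exactly one culprit that triggers the problem by itself.
    """
    candidates = list(plugins)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if shutdown_hangs(half):
            # The problem shows up with the first half enabled.
            candidates = half
        else:
            # Otherwise the culprit must be in the remaining half.
            candidates = candidates[len(half):]
    return candidates[0]
```

In practice each call to `shutdown_hangs` is a reboot and a manual check, so the function is more a checklist of which halves to enable than something to automate.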