September 15, 2010
Hi All
OK... I have only recently got my unRaid (4.5.6) server working, and only YESTERDAY finally put my precious data on it and reused one 1TB SATA disk from my PC (which used to contain "My Documents") in the unRaid box. The setup is a 2TB parity disk, a 1TB data disk and 3 odd disks (320GB, 160GB & 80GB) to make up space, so I can copy data from my PC and then free up and transfer my newish disks to the unRaid server...
I copied all my movies over etc. and then last night DISASTER (well maybe!!). When I tried to access my data over the network I couldn't, so I logged on to the main page and it showed red spots by 3 out of 4 disks and the parity disk. I thought it must be a glitch in the software or a hardware failure, so I rebooted the server and all the disks came up green except the parity drive, which was orange and said it needed a rebuild. Obviously I was a bit worried, but ticked the "I'm sure" box and started the parity check. Shortly after starting it, I could hear what sounded like a disk spinning up then down repeatedly. So I stopped the parity check and rebooted the server. I got the same thing, except this time there was a clunking sound too, indicating a failed disk!!
So I played about connecting disks to my Adaptec 1430SA instead of the mainboard SATA ports in case it was a mobo problem, to no avail. Every time I tried something I was getting errors saying various disks were missing etc. Anyway, I was then in the best position yet: if I left one disk out, the other three disks were online and I could access the data held on them, but this one data disk (when re-assigned to the array) then had a blue spot and the parity drive had an orange spot. The only option I had was to tick the "I am sure" box and start the array, but the warning says that it will expand data into the "new disk".
I didn't select anything as I didn't want to format disks and lose data, but I fear that, having been given incorrect information by the unRaid main menu, I may have dumped what protection I had by starting a parity drive rebuild!!
Since writing the above (at 2am!!), the array was then showing the 320GB disk with a dark red spot and the 160GB with a brighter red spot, saying "Missing"... then every time I restart the server I get different reports on drive conditions... so all in all it appears to have become very flaky. My question is now around trying to work out what has gone wrong and why. I have noticed that occasionally two disks showed a temperature of zero degrees - almost as though the two are inextricably linked together!!
At 3am this morning I decided to go back to basics: I closed the unRaid server down, installed YAReG and the associated ReiserFS drivers on one of my Windows PCs, then connected the first of the data drives from the unRaid server to it to begin data recovery. The files on this first drive appear OK and it has taken about 7 hours to transfer 36GB of data back to my trusted Thecus NAS box. However, based on this transfer rate, the total 1.5TB+ which I am trying to recover from the array will take a long time!! So until this first disk is done I can't even start to look at the other 3 data disks, and I can't power the unRaid box back up until I put the disks back in (so long delay!)...
Can anyone offer any advice as to what the heck is going on and what would cause this real inconsistency with the array? Could a single disk failure cause so many different results in unRaid?? Whilst I can live with a disk failure, to have unRaid fall over and then incorrectly report the parity disk as needing a rebuild leaves me concerned as to whether this is the correct solution to replace my aging Thecus. As you will appreciate, I am a little annoyed as it looks like I will suffer some data loss, as my last solid backup was in July!! (My fault, I know.)
Any thoughts appreciated
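For a rough sense of scale, the figures in this post (36GB copied in about 7 hours, roughly 1.5TB to recover) can be turned into a time estimate. This is only a minimal sketch: it assumes decimal units (1TB = 1000GB) and that the average rate seen so far would hold for the whole copy, which a failing disk may well not manage.

```python
# Figures from the post: 36 GB copied in about 7 hours, roughly 1.5 TB to recover.
# Assumption: decimal units (1 TB = 1000 GB) and a constant average rate.

def estimate_hours(total_gb: float, copied_gb: float, elapsed_hours: float) -> float:
    """Return the hours needed to copy total_gb at the observed average rate."""
    rate_gb_per_hour = copied_gb / elapsed_hours
    return total_gb / rate_gb_per_hour


if __name__ == "__main__":
    hours = estimate_hours(total_gb=1500, copied_gb=36, elapsed_hours=7)
    rate_mb_s = 36 * 1000 / (7 * 3600)  # GB -> MB, hours -> seconds
    print(f"average rate: {rate_mb_s:.2f} MB/s")
    print(f"estimated time: {hours:.0f} hours (~{hours / 24:.1f} days)")
```

At that rate the full copy would take on the order of twelve days, which is why reading the disks over the LAN from a working unRaid box is so much more attractive than the Windows ReiserFS route.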
September 15, 2010
Drive cabling / power issue?
I can't comment on the steps you took and where you are right now, as I lost track. But I would say knee-jerk reacting and swapping the drives around controllers whilst clicking various "I'm sure" buttons, then cancelling them halfway through etc., probably hasn't helped matters.
Some screenshots / info of where you are now from the unRaid panel, the syslog and the drive -> array assignment page might help figure out where you are.
If your question is why unRaid did this, we might not be able to tell, as we're too far past the original event. But having random disks report failure is usually a physical issue with cabling or power. At that point your data should have been fine so long as the physical disks were actually working. But now that they have been shuffled around and various parity operations invoked, that's a little more difficult to unravel.
September 15, 2010 Author
Hi Boof
Thanks for your thoughts... I think the actions that have been taken are quite limited, and perhaps my initial post didn't reflect this. To summarise:
Initially the unRaid server reported that the parity drive needed syncing. It had an orange/light red button by the drive in the main menu, and all other drives were green. Whilst this did concern me, as all the others were green I assumed that the data disks were fine but the parity drive had somehow become corrupt. I ruled out a hardware failure of the parity disk as it is only a week old, so failure would be unlikely (although I realise not impossible). So at this stage I ticked the "I am sure" button and started a parity rebuild. After this point it sounded like one of the data drives was struggling, spinning up and down repeatedly. There was no warning of this in the unRaid console, just the sound it made!
So this is the only action which has been started by me. I have not written any data to any disks (apart from the start of the parity rebuild). Although I have tried putting disks onto the Adaptec card, this was only to try and rule out the onboard HDD controller failing, and I kept a record of the drive assignments so they have always been correctly assigned in unRaid.
I am hoping that whichever drive was making a clunking noise (sounds like slight platter contact, if such a thing exists) will spin up and run long enough in my Windows environment to allow me to recover my data, as it will have been able to cool down completely by the time I try accessing it. I have to admit that I had no idea that reading the data via the ReiserFS drivers would be so slow, but hey, if it gets my data back I won't complain!! I have just found a good backup of my movies folder, which is over 1TB in total, so hopefully recovering 500GB won't be too bad!!
I had already ordered a new Icybox SATA backplane for my disks, as the model I am using is quite old and slots 4 & 5 don't work, so it's possible that it could be a power issue if the backplane is failing more seriously... perhaps taking the disks out and connecting them individually to the mobo/Adaptec card might help, as it should at least rule out the caddy. I am sure that the PSU is OK, as I have only just bought it and it is a Corsair with a single 12V rail (33A).
One thing someone may be able to clarify is this: if a disk has come from an unRaid server and has a ReiserFS file system present, I assume that unRaid would recognise this and mount the drive to the array. The reason I ask is that now when I start the server, sometimes disks show as missing, sometimes they show a blue spot and say that they need formatting, etc. So I am thinking of taking my old test server, connecting these disks to it, configuring them in exactly the same way they are/were in this server, and seeing what unRaid recognises of them... obviously what I don't want is to do this and have it tell me that they all need formatting, etc. - not that I would do this!!
Any more thoughts welcomed and greatly appreciated.
September 15, 2010
Hi SliMat
You are doing the right thing; starting over from the beginning is the correct thing to do. You should be able to recover all your data.
I am new to unRaid and currently run a 3-drive system (2 data, 1 parity). I have done a lot of reading on the forums though, and thought I would put in my $0.02 worth. I have had over 20 years working in the computer industry (does that make me old?) and have lots of troubleshooting experience. Troubleshooting goes from unknown to known to success. It's pretty easy: you just need to think in a logical order, going from the easiest troubleshooting steps to more and more difficult and rare ones.
First, you have done the correct thing in copying all your data back to your NAS. As a suggestion (hindsight is 20/20), you should not delete the data from that NAS until after you have successfully troubleshot the problem and ensured that the server remains stable once you have copied your data back from the NAS to the unRaid server.
Next: after you have your data backed up on the NAS, put the unRaid server back together. If you have not already, you need to make sure that you document what slot belongs to what drive. Document this on paper as well as marking the drive. Also mark what SATA port you are using for that slot. They are all "linked" together and you can't switch ports without causing the unRaid software to freak out. I think this is the confusion you are seeing from the unRaid server.
If you have not documented this information, and you have a full backup, start over from the beginning. THIS TIME DOCUMENT!!!! This actually helps you troubleshoot your problems later on, when you have 22TB of data and everything is really on the line. I would keep an Excel spreadsheet that has the Make, Model, Interface (SATA or PATA), Size, Serial number, Port (IDE1 or IDE2 for PATA, SATA port number) as well as Role (SATA, Master, Slave).
In the Port column, since you are using both motherboard and add-on card ports, you can designate the motherboard SATA ports as 1-1, 1-2, 1-3. The number to the left of the dash indicates what card the port belongs to. The number on the right is the port number on that card. In this case all motherboard ports are 1 and all ports on your add-on card are 2, so the second port on the motherboard is 1-2 and the 4th port on the add-on card is 2-4. I think you get the idea.
Now to the problem at hand. The symptoms seem to indicate that you have a weak power supply. There is not enough current to power all the devices in your case, and I would recommend that you either drop one of the hard drives or replace the power supply with a more powerful one. Yes, I know you are going to say "but this is a 450W power supply and it's worked well in the past!" That may be true, but in a server such as this I would go with no less than a 650W power supply (for growth), and a good quality one. I would not put my valuable data on a 650W power supply that I only paid 20 bucks for. The reason I am pointing toward the power supply is that you are able to remove the hard drive and recover your data. Usually when a drive spins up and down, it's because it's not getting enough current to do the job.
Your other problems are coming from moving the drives from the motherboard ports to your add-on card. Doing this (in an effort to troubleshoot) has changed what slot and port they were originally on. When working with any RAID system, the drives need to go into the exact same port, and a replacement must be the same size or larger. That is the reason why you want to document everything.
One last thought: there is a test that you can run on a drive, something about a format/clear test, that will run for some time. I would suggest performing this test after you have the system together but before you copy your data back.
It will exercise the drive and let you know if a hard disk is about to fail, and it will allow you to make sure your power supply is in good shape if you have decided not to replace it.
I see as I preview my post that someone has posted additional information, so he may have already suggested that the power supply is bad. But I want to make sure my thoughts are organized before I post, so I am a little slower than most.
Good Luck!
Sideband Samurai
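The drive/port record suggested above does not have to live in Excel; a small script can keep the same columns in a CSV file. The following is only a sketch of that idea. The `DriveRecord` fields mirror the columns named in the post, `port_label` implements the "card-port" naming (1 = motherboard, 2 = add-on card), and the makes, models and serial numbers are made-up placeholders, not the drives from this thread.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class DriveRecord:
    slot: str        # unRaid slot, e.g. "parity" or "disk2"
    make: str
    model: str
    interface: str   # "SATA" or "PATA"
    size_gb: int
    serial: str
    port: str        # "card-port" label, see port_label()


def port_label(card: int, port: int) -> str:
    """Build the card-port label: card 1 is the motherboard, card 2 the add-on card."""
    return f"{card}-{port}"


def to_csv(records: list[DriveRecord]) -> str:
    """Render the inventory as CSV text with one header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(DriveRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical placeholder entries for illustration only.
    inventory = [
        DriveRecord("parity", "ExampleMake", "EX-2000", "SATA", 2000, "SERIAL-0001", port_label(1, 1)),
        DriveRecord("disk1", "ExampleMake", "EX-1000", "SATA", 1000, "SERIAL-0002", port_label(1, 2)),
        DriveRecord("disk2", "ExampleMake", "EX-0250", "SATA", 250, "SERIAL-0003", port_label(2, 4)),
    ]
    print(to_csv(inventory))
```

Keying the record on the serial number is the useful part: it identifies a physical drive no matter which port it is plugged into.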
September 15, 2010 Author
Hi Sideband
Thanks for your thoughts too. You are also pointing at the PSU, which is a shame as I have only just bought it, having read a lot on the unRaid Wiki about the hardware, and I calculated that 33A would be ample (BTW well done for recognising the 33A as 450W).
I too have been in IT for a while (about 15 years... but we're not old!) and tend to write on the data cables at each end so I know where they started life, so reconnecting everything back to where it started will be simple... but after this slow recovery process I think I will fit my new SATA backplane (IB-555SK) and get a new 650W PSU, to give everything the best chance.
I guess my reason for this post is my concern about the robustness of unRaid, having been bitten so soon after investing in the technology.
I did start to read the Wiki on clearing the disks, but not being too proficient with non-Windows OSes I decided to avoid this as it looked quite tricky. I think I will invest some time in this step next time round.
Thanks again to all.
September 15, 2010
SliMat,
I think the robustness is there; I have read some people have 3 22TB monsters in their house. I have a feeling that you were a little too quick to jump the gun.
With servers I always overestimate what I need. For example, if the server requires 2GB of RAM and 2TB of storage, I will always double that (that is in the case of Windows servers). For Linux, 1.5 times is good enough.
Having worked with regular server hardware, this is the reason Dell or HP servers have two and sometimes 3 PSUs. One reason is redundancy, but the other is load balancing. Since we are not using commercial-grade hardware here, and there is only one PSU, I will always go for the higher rating and much better quality.
I would try eliminating the backplane to see if that is the source of your power problems.
When calculating the amount of power you need, you also have to take into account the watts the CPU takes, as well as what the motherboard takes and what other peripherals you might have installed (the extra SATA card comes to mind). All of that must be factored into the size of PSU you need. Then you need to consider upgrades. I know you will be adding more drives in the future, and you don't want to be pulling the server apart again to replace the PSU because you have overloaded it by adding one or two drives.
Sideband Samurai
September 15, 2010 Author
Hi Sideband
Some good points. The CPU is the 84W P4/3.2. I chose this over the more efficient 65W C2D models as unRaid doesn't benefit from multi-core technology (so I read), so I opted for the slightly beefier P4 to help with processing power, but as you allude to, it is more power hungry too.
I had already ordered the new backplane before this all went ugly, so I should be able to have confidence that the new one will be a good investment.
I think that one of the disks is definitely failing because of the worrying noise, but because they are all side by side in the backplane I haven't identified which one it is for sure, although my money is on the 80GB as it's quite a few years old.
I am just going to order a 650W Corsair PSU so that I can get it for tomorrow.
September 15, 2010
Sorry you are having problems with your new array. If you look in the sig of this message you will see a troubleshooting link. I suggest you read it and understand the tools available, like syslogs and SMART reports, that can help you isolate your problem. It is easy to start replacing hardware while chasing a solution, but this can get expensive.
The symptoms you report sound like some sort of cabling / connection problem between the drives and your system. Cables, power splitters, and drive backplanes are all suspects. Bad connections can cause all of the symptoms you are reporting.
Based on your drive count, I would not expect a 450 watt PSU to be a problem. The symptoms of an underpowered PSU are sudden powerdowns, especially just as the computer is being turned on. The computer pulls the most current as drives are spun up (about 2x normal), so once you get past that stage the chances that the PSU is underpowered become much less.
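The spin-up argument above can be checked with back-of-the-envelope arithmetic. The sketch below uses the figures quoted in the thread (a single 12V rail rated at 33A, an 84W CPU, five drives, roughly 2x current at spin-up). The per-drive steady-state current is an assumed round number for illustration, not a measured or datasheet value, and the CPU is assumed to draw entirely from the 12V rail.

```python
RAIL_VOLTS = 12.0
RAIL_AMPS = 33.0        # from the thread: single 12 V rail rated at 33 A

# Assumptions for illustration only; check each drive's datasheet.
DRIVE_STEADY_AMPS = 1.0  # hypothetical steady 12 V draw per drive
SPINUP_FACTOR = 2.0      # "about 2x normal" at spin-up, per the post above
CPU_WATTS = 84.0         # P4 figure quoted earlier in the thread


def spinup_amps(drive_count: int) -> float:
    """Estimate 12 V current with every drive spinning up at once, plus the CPU."""
    drives = drive_count * DRIVE_STEADY_AMPS * SPINUP_FACTOR
    cpu = CPU_WATTS / RAIL_VOLTS  # assumes the CPU is fed entirely from 12 V
    return drives + cpu


if __name__ == "__main__":
    print(f"12 V rail capacity: {RAIL_VOLTS * RAIL_AMPS:.0f} W")
    need = spinup_amps(5)
    print(f"estimated spin-up draw: {need:.1f} A of {RAIL_AMPS:.0f} A available")
```

Under these assumptions the worst case is about half of what the rail can deliver, which fits the view that a connection fault is a likelier culprit than the PSU. The motherboard, fans and controller card are left out and would need adding for a real budget.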
September 15, 2010 Author
Thanks BJP... I have to say that I calculated 450W as ample for this relatively small array. But I have decided that if it does prove to be the PSU, I will go for the Corsair TX-750W (CMPSU-750TXUK)!
There are no power splitters; everything is powered from the PSU directly, and my suspect is definitely the backplane, as this was £58 on eBay (it's about £90 new!!) and only when I started using it did I find that two of the 5 bays don't work. Once my data recovery is complete I will have my brand new backplane to install, but I will first connect the drives directly to the mobo to discount either backplane.
When I get in tonight I will take a look at the tools available, as I think that my knee-jerk reaction was limited enough that there could still be some useful information to help diagnose exactly what failed. Obviously my concern is that if the mobo, CPU etc. is faulty, I don't want to come to rely on my shiny new NAS unit only to be let down again.
Thanks for the advice, and when I am in a position to power up the unRaid unit again I'll post some log files for analysis.
September 15, 2010 Author
OK - slight update...
I have reconnected all the disks exactly how they were and now have 3 green disks; the 250GB disk is blue (!!?!) and the parity disk is orange. Accessing the ReiserFS partition on the 250GB disk using Windows shows my files are there, but trying to copy them only recovered 35GB out of 244GB of data. So it does look like this is the failed disk!!
As the 250GB disk has data but is in trouble, I have removed it so the array will start without needing to format it, and it just shows as missing. I read somewhere that you can force unRaid into accepting that the parity disk is OK. If I do this, what are the chances of the data on the dead disk (disk 2 of the array) being rebuilt and accessible?? I am thinking that this might be worth a go, as I have the original data disk removed and the parity disk is orange...
Any thoughts??
September 15, 2010
Did you ever successfully complete an initial parity calculation? If not, then no, you cannot "force" it to be valid.
September 15, 2010 Author
Hi Joe
Yes, the server had been working for a week before it went ugly, so there had been a good parity check done.
September 15, 2010
If disk 2 is physically bad then you can force unRAID to know it is bad (and force it to trust that parity is good) by:
1. Stopping the array.
2. Powering down.
3. Installing a working replacement disk for the bad one, of equal or larger size, but not larger than parity.
4. Powering up.
5. Logging in as root on the system console, or via telnet, and typing:
initconfig
The initconfig command will prompt you to continue. Answer exactly as it says: "Yes" (capital "Y", lower case "es").
6. Then typing:
mdcmd set invalidslot 2
It should respond with Command OK.
Then go back to the management web page and refresh it. It should show disk2 as either red, if you do not install a working replacement, or blue (I think) if you do, and all the others as green.
When you start the array it will simulate any contents of disk2 (if it can, i.e. if you've not corrupted the file system on parity/other disks). It will also begin the process of reconstructing the contents onto the replacement drive. (It will do this even if there is corruption, so let the process finish, regardless of whether the disk mounts or not.)
Joe L.
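To see why a trusted parity disk lets the array "simulate" a missing disk, and why stale parity gives corrupt results, here is a toy sketch. It assumes single parity behaves as a bytewise XOR across the data disks, which is the usual single-parity scheme; the byte strings stand in for disk contents and the function names are hypothetical, not part of unRAID.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing block from the surviving blocks and parity."""
    return xor_blocks(surviving + [parity])


if __name__ == "__main__":
    disk1, disk2, disk3 = b"movies__", b"my_docs_", b"photos__"
    parity = xor_blocks([disk1, disk2, disk3])

    # disk2 removed: its contents are simulated from the other disks plus parity.
    assert rebuild_missing([disk1, disk3], parity) == disk2

    # If disk1 changed after parity was last in sync, the rebuild is wrong.
    disk1_changed = b"MOVIES__"
    assert rebuild_missing([disk1_changed, disk3], parity) != disk2
    print("rebuild matches with valid parity; differs with stale parity")
```

This is also why the interrupted parity rebuild earlier in the thread matters: any region of the parity disk that no longer matches the data disks produces garbage in the same region of the simulated disk.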
September 15, 2010 Author
Thanks Joe - that's pretty much what I thought...
But just thinking along a slightly different line... could I remove the parity disk, re-insert the data disk which has gone blue, and then force unRaid to assume that the blue disk is good? The reason I ask is that I can see my data when I mount the blue disk on a Windows box, but when I try to copy the data it is far too slow (30GB in a few hours, then no more for the rest of the day!). So if I force unRaid to assume this disk is OK I may be able to get my data. If this is no good, I can revert to leaving the blue data disk out, re-insert the parity disk and force the array to trust it, so that it would simulate the missing disk (with corruption if the parity drive was not up to date!).
Thanks for all the advice from everyone.
September 16, 2010
SliMat,
I would go with the parity disk as being reliable at this point and rebuild the data disk the way it was suggested. This is why you have a parity disk to begin with. In other words, I would follow the suggestion by Joe L. and see if you can get the array to rebuild.
Also, as a suggestion, download a copy of SpinRite (www.grc.com). I believe the cost is around 69 US dollars, and it is an excellent tool for attempting to recover the data on the drive. The program is written entirely in assembly language. It has gotten me out of a few binds in the past. If the drive is indeed failing, SpinRite will tell you. It does not matter what operating system the drive came from; this program works at the lowest level to take control of the hard drive and test every sector to see if it's bad, recover and move data to a good portion of the disk, and mark that sector permanently bad. You might just get the drive repaired enough to get a good copy of the disk onto other storage. Also, if you find that SpinRite did not solve your problem, there is a 30-day no-questions-asked money back guarantee.
Sideband Samurai
September 16, 2010
If you physically remove the drive you think is readable and force the array to think the rest are OK, you'll be able to see if it can be "simulated". If it cannot, then power down once more, put that drive back in, then use
initconfig
mdcmd set invalidslot 99
to force all the drives to be recognized as valid. It will start with a parity check... Stop it as soon as it begins, just in case. Then see if you can get to your files. If you can, then perform a full parity check and expect some parity errors.
Joe L.
September 17, 2010 Author
Joe
I followed your instructions exactly and this is what happened...
1. I used telnet to log in - all OK
2. I used the initconfig command and Yes'd it... it responded "Completed"
3. I typed "mdcmd set invalidslot 2" and got a two-line response:
cmdOper=set
cmdResult=ok
Now when I refreshed the main menu from the GUI, I had the parity disk and disks 1, 3 & 4 all blue, and disk 2 was greyed out as it's not installed...
I started the array and it is now just sitting there with "Starting" (with an orange spot) showing in the command area and saying that disks 1, 2 (the missing one), 3 & 4 are "Mounting". All the disks are showing flashing green spots except the missing number 2, which is solid red.
It has been like this for about 15 minutes and I am getting concerned, as I expected it to be much quicker. Also, the only share I can see on the LAN is the "Flash" drive.
Does this sound OK, or should I reboot?? Any help appreciated
September 17, 2010 Author
OK... I have just got home and the terminal screen on the unRaid server has loads of reports on the screen and still says it's mounting the disks...
The screen shows things like...
"sh: line 1: 1742 Segmentation fault mount -t reiserfs -o noacl,nouser_xattr,noatime,nodiratime /dev/md4 /mnt/disk4 2>&1 1744 Done : logger mdcmd (125): spindown 0 mdcmd (126): spindown1 mdcmd (126): spindown3 mdcmd (126): spindown4"
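When a console or syslog is full of messages like the one above, it can help to pull out just the mounts that crashed. The sketch below is a hypothetical helper, not an unRaid tool. Its regular expression is based only on the single line quoted in this post, so it would likely need adjusting for other message formats.

```python
import re

# Pattern guessed from the one console line quoted in the post: a
# "Segmentation fault" report followed by the mount command's device and mount point.
FAILED_MOUNT = re.compile(
    r"Segmentation fault\s+mount\b.*?(?P<device>/dev/md\d+)\s+(?P<mountpoint>/mnt/disk\d+)"
)


def failed_mounts(log_text: str) -> list[tuple[str, str]]:
    """Return (device, mountpoint) pairs for mount commands that crashed."""
    return [(m["device"], m["mountpoint"]) for m in FAILED_MOUNT.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "sh: line 1: 1742 Segmentation fault mount -t reiserfs "
        "-o noacl,nouser_xattr,noatime,nodiratime /dev/md4 /mnt/disk4 2>&1"
    )
    for device, mountpoint in failed_mounts(sample):
        print(f"mount crashed: {device} -> {mountpoint}")
```

For the line in this post it reports /dev/md4 and /mnt/disk4, i.e. the mount of disk 4 is the one that segfaulted.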
September 19, 2010 Author
Hi All
Thanks for the advice... I thought I'd let you know the outcome in case it helps anyone else...
I managed to copy the data from disks 1, 3 & 4 over the LAN from the server... I tried to mount the ReiserFS partition of the remaining disk on my Windows system to recover my data, but it was sooo slow and I only got 20GB back off a nearly full 250GB disk! So at this point I resigned myself to the fact that I had lost the contents of that disk.
The server had got to the point that even when I reset it, it just came up saying that the disks were mounting and sat there for days like this! So I re-downloaded unRaid 4.5.6 and reinstalled it on my USB key. I was surprised that when I booted it, the config settings were still there. So I did an initconfig and only put the "lost" disk in the array (in its original slot)... and lo and behold, when the server booted it mounted the disk and I could get the data over the LAN from the disk share.
So now, having recovered ALL my data, I have brought all the disks back to their original slots, and when booted it mounted them all and everything seems good!! It did show the parity drive as needing a rebuild, so I have left it doing that at the moment, and fingers crossed all will be well.
The only hardware change I have made is to install my shiny new Icy Box IB555SK backplane in place of the older one which I was using and had my doubts about.
So there you have it - some good news... whilst I can't explain it all, I thought I'd let you know the outcome in case it offers help to anyone else with a similar problem.
Oh yes, also I'll be doing a regular backup of my most important files to avoid this happening again!!