knalbone Posted June 6, 2015 I moved all my Drives and my USB stick to a new drive today. Upon bootup, the configuration looked fine, so I started the array and all was well. Just a few minutes later, drive 4 showed a grey exclamation mark (invalid data, I think). I stopped the array and replaced disk 4 with a precleared spare. It looks like a parity rebuild started, but I think it stopped. The GUI shows no activity, and the hourly reports simply state that one of the drives contains invalid data and the array requires attention. To make matters worse, disk 2 is showing many errors. They all appear to be write errors, and the disk passed a short SMART test, so I'm not sure what is going on. I feel like my "old" disk 4 is fine and wonder if it is possible to force it back into the array, trust parity and try to rebuild disk 2. This is the second time I have tried to open a topic. Last time I attached my syslog and it looks like the post never appeared. I will try to attach the syslog in a reply message.
knalbone Posted June 6, 2015 The syslog is huge. Every time I try to attach it, I just get a new, empty "post reply" page as if I did nothing. Here it is on Dropbox. https://www.dropbox.com/s/f4vho6s7fzkfk8a/syslog?dl=0
knalbone Posted June 6, 2015 Short or long? Just copy and paste what appears in the webGUI?
knalbone Posted June 6, 2015 Here are some SMART reports. sdd is disk2 (the one showing write errors in the webGUI log). sdg is the disk 4 that unRAID reports as having invalid data. sdh is my precleared spare that I tried to replace disk 4 with but was unsuccessful (the parity rebuild stopped almost immediately). SMARTsdd.txt SMARTsdg.txt SMARTsdh.txt
knalbone Posted June 7, 2015 I just powered down the server and tried moving the drives in question to different locations on the backplane. No change. Anyone have any ideas? Only shares with caching enabled appear to be writeable at the moment. All the data appears to be intact, I just can't write to the array and the GUI looks like this:
trurl Posted June 7, 2015 Your syslog has some disk4 write errors, and then later even more write errors on disk2. Normally unRAID disables a disk when it has write errors, which may be what the triangle on disk4 is about, but I am not sure whether it will also disable another disk if it has errors after that. Maybe it disables writes to the whole array, which would seem reasonable since parity won't be able to "absorb" the unwritten data from both drives. I think the SMART reports are OK, but I'm not sure how to proceed from here. Is rebuilding disk4 the way forward? Checking cables will probably be a good idea to try to get rid of whatever caused this, but don't shut down yet. Wait and see what others suggest.
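Editor's sketch: one way to see which disks the write errors belong to, and in what numbers, is to tally them from the syslog. The snippet below is a minimal example, not a tool from the thread. It assumes error lines contain text shaped like `md: disk4 write error`; that pattern, the host name, and the sample lines are illustrative assumptions and should be checked against the real syslog.

```python
import re
from collections import Counter

# Assumed line shape; verify against your own syslog before relying on it.
WRITE_ERROR = re.compile(r"md: (disk\d+) write error")


def count_write_errors(lines):
    """Return a Counter mapping disk name to the number of write-error lines."""
    counts = Counter()
    for line in lines:
        match = WRITE_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, not copied from the posted syslog.
sample = [
    "Jun  6 20:01:12 Tower kernel: md: disk4 write error, sector=1024",
    "Jun  6 20:05:40 Tower kernel: md: disk2 write error, sector=2048",
    "Jun  6 20:05:41 Tower kernel: md: disk2 write error, sector=2056",
    "Jun  6 20:06:00 Tower kernel: mdcmd (45): spindown 3",
]
errors = count_write_errors(sample)
print(errors.most_common())
```

To run it over a real file, pass the open file object instead of the sample list, for example `count_write_errors(open("syslog", errors="replace"))`.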
knalbone Posted June 7, 2015 Rebuild isn't even an option, apparently. I have no option for it when I stop the array. Too late to not shut down. As I mentioned earlier, I shut down the server and moved disks 2 and 4 to different spots on the backplane, but that had no effect.
trurl Posted June 7, 2015 The first thing to do is make sure nothing continues to try to write to the array. With the write errors you have already had, the disks and parity are not going to be consistent with each other. When a drive has a write error, the parity is updated anyway, so the disk can be rebuilt with the data that failed to be written. So it doesn't seem like a New Config with trust parity is the right approach. I think we need to figure out what caused the write errors and try to eliminate that before doing anything else. Then maybe we'll have to check the filesystems on disk2 and disk4 and repair if necessary, and then maybe rebuild parity. Something that would probably make sense and won't make anything worse is to do a memory test. You can select that from the boot menu. Have you checked all SATA and power cables and plugs at both ends? What model is your power supply?
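Editor's sketch: the point about parity being updated even when the disk write fails can be shown with a toy model. This assumes a single parity disk computed as the byte-wise XOR of the data disks; the 4-byte "disks" and their contents are made up for illustration and say nothing about the actual on-disk layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "disks"; real disks are just much longer blocks.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0F, 0x0E, 0x0D, 0x0C])
old_disk4 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
new_disk4 = bytes([0x01, 0x02, 0x03, 0x04])  # data whose write to disk4 failed

# Parity is updated with the new data even though disk4 still holds the old data.
parity = xor_blocks([disk1, disk2, new_disk4])

# Rebuilding disk4 from parity plus the other disks yields the intended new data.
rebuilt_disk4 = xor_blocks([parity, disk1, disk2])
print(rebuilt_disk4 == new_disk4, rebuilt_disk4 == old_disk4)

# With two disks missing, parity and the remaining disk only give disk2 XOR disk4,
# which is not enough to recover either one on its own.
combined = xor_blocks([parity, disk1])
print(combined == xor_blocks([disk2, new_disk4]))
```

This is also why trusting parity after forcing the old disk 4 back in would be risky in this model: the old disk contents and the updated parity no longer agree.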
knalbone Posted June 7, 2015 Something about the hardware move must have screwed things up. I don't think anything has been written anywhere other than the cache drive since this all started. I am running this as a VM on a Dell PowerEdge C2100 running ESXi 5.5 with an LSI 2008 SAS controller passed through to unRAID. I may just move everything back to the physical box it was on. For now I am going to bed.
knalbone Posted June 7, 2015 I have backups of all the "important" stuff via CrashPlan.
knalbone Posted June 7, 2015 I was able to start a rebuild this morning. I basically unmounted the shares and stopped all services that were trying to write to the array on other VMs/clients. The rebuild is very slow (about 4MB/s), but at least it didn't stop almost immediately as it did previously. unRAID says the rebuild will take >10 days. Should I just let it run?
trurl Posted June 7, 2015 Post a new syslog. I don't remember whether beta15 has a download button on the syslog page or not. In any case, please zip and attach.
knalbone Posted June 7, 2015 New syslog attached. Thanks. syslog.zip
knalbone Posted June 7, 2015 Looks like the rebuild failed again. New syslog attached. syslog.zip
knalbone Posted June 7, 2015 I guess something is truly wrong with /dev/sdd. I restarted the rebuild using my precleared spare and it is flying at ~140MB/s. I will let it finish (about 9 hours). What can I do to test /dev/sdd (bad disk) when the rebuild is finished?
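Editor's sketch: the two rebuild estimates in the thread (more than 10 days at about 4MB/s, about 9 hours at about 140MB/s) can be sanity-checked with simple arithmetic. The thread never states the disk size, so the 4 TB figure below is purely an assumption chosen for illustration, and the calculation treats the rate as constant.

```python
def rebuild_hours(disk_bytes, bytes_per_second):
    """Hours needed to process disk_bytes at a constant bytes_per_second."""
    return disk_bytes / bytes_per_second / 3600


# Assumption: a 4 TB disk. The disk size is not given anywhere in the thread.
DISK_BYTES = 4 * 10**12

slow_hours = rebuild_hours(DISK_BYTES, 4 * 10**6)    # about 4 MB/s
fast_hours = rebuild_hours(DISK_BYTES, 140 * 10**6)  # about 140 MB/s

print(f"{slow_hours / 24:.1f} days at 4 MB/s")
print(f"{fast_hours:.1f} hours at 140 MB/s")
```

Under that assumption the slow rate works out to between 11 and 12 days and the fast rate to roughly 8 hours. Real rebuilds usually slow down toward the end of the disk, so a somewhat longer estimate than the constant-rate figure is plausible.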
trurl Posted June 7, 2015 Quoting knalbone: "What can I do to test /dev/sdd (bad disk) when the rebuild is finished?" Preclear it.
knalbone Posted June 7, 2015 Sounds good. I will preclear and run a long SMART test after the rebuild is complete. Should the preclear and SMART test pass, do you think it would be safe to add the disk back to the array? Thanks for your time, trurl.
trurl Posted June 7, 2015 Yes, if it passes and doesn't have many power-on hours. It's not entirely clear the disk was the problem. Maybe just unplugging and replugging cleared something up. If it is working now, I guess we will go with it.
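Editor's sketch: when it is unclear whether the disk or the connection is at fault, the SMART attribute table can give a hint. The example below parses text shaped like the attribute table printed by `smartctl -A` and flags a few commonly watched attributes with nonzero raw values. The choice of attributes, the "hint" labels, and the sample text are assumptions for illustration; the sample is not taken from the attached SMART reports.

```python
# Attributes commonly watched; which ones matter is a judgment call.
WATCHED = {
    "Reallocated_Sector_Ct": "disk surface",
    "Current_Pending_Sector": "disk surface",
    "Offline_Uncorrectable": "disk surface",
    "UDMA_CRC_Error_Count": "cable or connection",
}


def flag_attributes(report):
    """Return {attribute: (raw_value, hint)} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = (int(raw), WATCHED[name])
    return flagged


# Hypothetical excerpt, made up for this example.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       37
"""
flagged = flag_attributes(sample)
print(flagged)
```

In this made-up excerpt only the CRC error count is nonzero, which would point toward cabling or the backplane rather than the disk surface. A clean table does not rule out a controller problem.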
Squid Posted June 7, 2015 I would also reseat the controller card.
knalbone Posted June 7, 2015 Quoting Squid: "I would also reseat the controller card" Probably a good idea too. I will make sure to do that the next time I am able to power the server down.
knalbone Posted June 8, 2015 Thanks to everyone who has read and helped on this. The rebuild finished yesterday around 7pm. I started using the array again and everything was well. I started preclearing /dev/sdd and all of a sudden disk 2 was disabled again! AAARRG! Something really screwy is going on. I'm going to preclear sdd for one cycle, then I will rebuild the array and hopefully all will be well. I will remove the supposedly bad disk so I can preclear it on another PC and see what happens. I think I'll order another drive or two as well and preclear them so I have them on hand. I just can't figure out what is going on with this server.
trurl Posted June 8, 2015 Good to have spare drives, but it is beginning to sound like drives are not your problem. I assume you reseated the controller. Do you have spare ports you could use to try and isolate the problem?
knalbone Posted June 8, 2015 I have not yet reseated the controller. That is a bit of an ordeal as it involves taking multiple VMs down. I agree that signs point to it being the problem though. This is a mezzanine card specific to this model of server, so it can only fit in one spot directly on the motherboard. I will make sure to reseat it when I pull the drive out.