Rajahal Posted October 10, 2011

I was helping guiri replace a failed disk when we saw this bit of odd behavior. Here's the story:

guiri was running unRAID 4.5.4. unRAID red-balled disk7. guiri replaced the disk with a used 2 TB WD EARS (with jumper installed). Upon the next boot, unRAID would say 'disabled disk replaced' and everything else looked normal. Upon pressing 'start', unRAID would say 'upgrading parity disk' and offer to start a parity sync. This is clearly wrong.

I thought it might be a bug with unRAID 4.5.4, so I remote connected to guiri's computer and upgraded his server to unRAID 4.7. The odd behavior continued with unRAID 4.7. Screenshot and syslog (after the upgrade to 4.7) attached.

I thought that his server might have some loose wiring inside, so I had him move disk7 into a different physical hot swap bay on the server. He did so (the drive is now in a completely different drive cage, so the cabling shouldn't be the issue at this point, though it is still a possibility).

Some odd things I notice in his syslog:

Oct 10 22:50:18 Tower kernel: 0MB HIGHMEM available.
Oct 10 22:50:18 Tower kernel: 766MB LOWMEM available.

Potentially a problem with the RAM?

Oct 10 22:50:18 Tower kernel: scsi 0:0:0:0: Direct-Access SanDisk U3 Cruzer Micro 8.02 PQ: 0 ANSI: 0 CCS
Oct 10 22:50:18 Tower kernel: [753]: scst_suspend_activity:599:suspend_count 0
Oct 10 22:50:18 Tower kernel: [753]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [753]: scst_suspend_activity:644:Waiting for 0 active commands finally to complete
Oct 10 22:50:18 Tower kernel: [753]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [753]: __scst_resume_activity:675:suspend_count 0 left
Oct 10 22:50:18 Tower kernel: [753]: scst: scst_register_device:792:Attached to scsi0, channel 0, id 0, lun 0, type 0

Potentially a problem with the flash drive?

Oct 10 22:50:18 Tower kernel: scsi 1:0:0:0: Direct-Access ATA WDC WD20EARS-00S 80.0 PQ: 0 ANSI: 5
Oct 10 22:50:18 Tower kernel: [770]: scst_suspend_activity:599:suspend_count 0
Oct 10 22:50:18 Tower kernel: [770]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [770]: scst_suspend_activity:644:Waiting for 0 active commands finally to complete
Oct 10 22:50:18 Tower kernel: [770]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [770]: __scst_resume_activity:675:suspend_count 0 left
Oct 10 22:50:18 Tower kernel: [770]: scst: scst_register_device:792:Attached to scsi1, channel 0, id 0, lun 0, type 0

Several of these errors on several different drives.

Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }

Tons of these errors.

Screenshot: You can see that the new disk also fails to report a temperature.

I'm thinking the next step should be obtaining a SMART report on the new disk7 replacement disk. However, even if disk7 did turn out to be bad it wouldn't explain unRAID's odd behavior... or would it? Any other thoughts?

guiris_syslog.txt

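For anyone wanting to see how many of these ATA errors a syslog contains and which ports they hit, here is a minimal sketch. It assumes the kernel line format shown in the excerpt above; the function name count_ata_errors and the sample list are illustrative only and are not part of unRAID or any of its tools.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in the excerpt, e.g.
#   Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
# (assumption: other kernels may format these lines differently)
ATA_ERROR_RE = re.compile(r"kernel: (ata\d+): error=(0x[0-9a-fA-F]+) \{ (.*?) \}")


def count_ata_errors(lines):
    """Tally ATA error lines, keyed by (port, error code, decoded text)."""
    counts = Counter()
    for line in lines:
        match = ATA_ERROR_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Sample built from lines quoted in the post (hypothetical two-repeat subset).
sample = [
    "Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00",
    "Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }",
    "Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }",
    "Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00",
    "Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }",
    "Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }",
]

for (port, code, text), n in count_ata_errors(sample).items():
    print(f"{port}: {n} x error={code} ({text})")
```

To run it against a real file, pass an open file object instead of the sample list, for example count_ata_errors(open("guiris_syslog.txt")). A tally like this only shows where the errors cluster; it does not say whether the drive, the cable, or the controller is at fault.
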
guiri Posted October 11, 2011
What? No one? C'mon smart people...

dgaschk Posted October 11, 2011
Do a memtest overnight.

mbryanr Posted October 11, 2011
Had a similar story here: <excuse the poor advice given by myself as it was the first time I was attempting to troubleshoot a major issue>
http://lime-technology.com/forum/index.php?topic=15088.0

guiri Posted October 12, 2011
Thanks bryan, let's see what Rajahal tells me to do

dlandon Posted October 12, 2011
Once you get your situation straightened out, I'd make disk6, which is an RE4 drive, the parity drive. It is a WD high-reliability RAID Edition drive. I'm currently using one of these as my parity drive.

guiri Posted October 12, 2011
Well, I was going to take it out as I had it sold/traded to my computer guy. I guess I can keep it if it's going to make a difference and yes, they are nice drives. I had 4 of them in my ReadyNAS before I got the unRAID.

guiri Posted October 12, 2011
"Thanks bryan, let's see what Rajahal tells me to do"
Actually, he might suggest suicide so maybe I shouldn't wait for his reply

Rajahal Posted October 12, 2011 (Author)

Thanks to dgaschk and mbryanr for your input. I do think a memtest is a good idea (certainly couldn't hurt).

mbryanr, I agree that the situation seems similar to the one you linked. While I certainly know how to guide guiri back into having a healthy array by backing up the contents of the emulated disk7 then using initconfig and starting anew, I was hoping that we would be able to figure out why this happened in the first place. If it does turn out to be bad RAM then I think that would sufficiently explain this odd behavior, as bad RAM can cause all sorts of weird things to happen (especially when the OS is loaded into RAM as unRAID is). However, if we don't find any bad hardware then I believe this could only be explained as an unRAID bug. A bug in 4.7 no less.

Anyway, I'll guide guiri through a memtest and we'll go from there. I believe he has already backed up the data from the emulated disk7, so the data should be safe.

guiri Posted October 12, 2011
As for it being a bug, it did it before we upgraded too, right? Personally, I'm hoping for a memory problem. I'm thinking it'll be much easier to rip that stick out and give it to one of my dogs than to do any of the other possible options.

Thanks guys,
George

Rajahal Posted October 12, 2011 (Author)
Quite right. If it is a bug, then it has existed in the unRAID code for quite some time. I agree, bad RAM would be one of the easiest (and least expensive) things to fix. We'll see what memtest turns up.

mbryanr Posted October 12, 2011

"I was hoping that we would be able to figure out why this happened in the first place."

Those were my thoughts as well. I'm not certain what caused it, but as you stated, intermittent hardware issues could cause this. Ultimately, I had to get the data protected again. If it were my server, I'd want to know why. I'll follow along and hope you find a cause.

I noticed these similarities in the linked thread:

"Just had a failed disk (Seagate 1.5tb) which I have replaced with a new drive (WD EARS 1.5tb - edit: jumpered over 7-. Edit: UnRAID version 4.5.1 (she's been shut down for a while! Gimme a break! ) I initially "unassigned" the faulty drive (disk7) so I could access a few "crucial" files - while the array remained unprotected (*gasp!*) I then replaced the drive, assigned the new drive to the same slot (disk7) and rebooted. Was prompted to push Start ("Start will bring the array on-line, start Data-Rebuild, and then expand the file system (if possible)") Pushed Start, and anticipated a loooong data rebuild... But instead, minimal disk activity, no data rebuild. The array is "Stopped", saying there is a new parity disk installed (but I never changed the parity drive!!), the new disk is orange balled (with temperature 0 degC), and I am asked to push Start again ("Start will bring the array on-line and start Parity-Sync")"

guiri Posted October 12, 2011
Alright boys, running a long memtest right now. Not in a hurry so I'll let it run 24 hours or so. Personally, I'm hopin' for the memory being bad

Rajahal Posted October 14, 2011 (Author)
OK, so the RAM is fine. I think the next step should be to restore your parity protection as soon as possible. Unfortunately this means that we may never figure out what caused this in the first place, but I expect you are more concerned with keeping your data safe than troubleshooting a potential unRAID bug. George, I'll call you soon and walk you through the 'initconfig' process.

guiri Posted October 14, 2011
Your suspicions are correct

mbryanr Posted October 14, 2011
"Your suspicions are correct"
What fun is that? Maybe Tom will see this and look into what may have caused it.

guiri Posted October 14, 2011
Oh yeah, easy to talk when it's not your files

guiri Posted October 14, 2011
We also had problems creating a cache drive but I won't go into that as I don't know if it's related and don't understand this stuff.