Replacing disk7 yet unRAID says 'upgrading parity'?

Rajahal · October 10, 2011

I was helping guiri replace a failed disk when we saw this bit of odd behavior. Here's the story:

guiri was running unRAID 4.5.4 when unRAID red-balled disk7. He replaced the disk with a used 2 TB WD EARS (with jumper installed). On the next boot, unRAID said 'disabled disk replaced' and everything else looked normal. On pressing 'start', however, unRAID said 'upgrading parity disk' and offered to start a parity sync. This is clearly wrong. I thought it might be a bug in unRAID 4.5.4, so I connected remotely to guiri's computer and upgraded his server to unRAID 4.7. The odd behavior continued with unRAID 4.7. Screenshot and syslog (after the upgrade to 4.7) attached.

I thought that his server might have some loose wiring inside, so I had him move disk7 into a different physical hot swap bay on the server. He did so (the drive is now in a completely different drive cage, so the cabling shouldn't be the issue at this point, though it is still a possibility).

Some odd things I notice in his syslog:

Oct 10 22:50:18 Tower kernel: 0MB HIGHMEM available.
Oct 10 22:50:18 Tower kernel: 766MB LOWMEM available.

Potentially a problem with the RAM?

Oct 10 22:50:18 Tower kernel: scsi 0:0:0:0: Direct-Access     SanDisk  U3 Cruzer Micro  8.02 PQ: 0 ANSI: 0 CCS
Oct 10 22:50:18 Tower kernel: [753]: scst_suspend_activity:599:suspend_count 0
Oct 10 22:50:18 Tower kernel: [753]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [753]: scst_suspend_activity:644:Waiting for 0 active commands finally to complete
Oct 10 22:50:18 Tower kernel: [753]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [753]: __scst_resume_activity:675:suspend_count 0 left
Oct 10 22:50:18 Tower kernel: [753]: scst: scst_register_device:792:Attached to scsi0, channel 0, id 0, lun 0, type 0

Potentially a problem with the flash drive?

Oct 10 22:50:18 Tower kernel: scsi 1:0:0:0: Direct-Access     ATA      WDC WD20EARS-00S 80.0 PQ: 0 ANSI: 5
Oct 10 22:50:18 Tower kernel: [770]: scst_suspend_activity:599:suspend_count 0
Oct 10 22:50:18 Tower kernel: [770]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [770]: scst_suspend_activity:644:Waiting for 0 active commands finally to complete
Oct 10 22:50:18 Tower kernel: [770]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [770]: __scst_resume_activity:675:suspend_count 0 left
Oct 10 22:50:18 Tower kernel: [770]: scst: scst_register_device:792:Attached to scsi1, channel 0, id 0, lun 0, type 0

Several of these errors on several different drives.

Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }

Tons of these errors.
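
A rough way to tally these instead of scrolling through them, assuming the syslog has been copied to a machine with Python 3. This is only a sketch: the function names are made up for illustration, and the bit names are the usual ATA status/error register abbreviations (the kernel prints 0x04 in the error register as DriveStatusError, which corresponds to ABRT, command aborted).

```python
import re
from collections import Counter

# ATA status register bits, by their standard abbreviations.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}

# ATA error register bits, by their standard abbreviations.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "TK0NF", 0x01: "AMNF"}

# Matches lines such as "... kernel: ata2: status=0x41 { DriveReady Error }".
LINE_RE = re.compile(r"kernel: (ata\d+): (status|error)=0x([0-9a-fA-F]{2})")


def decode(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


def summarize(lines):
    """Count (port, register, value) combinations found in syslog lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            port, kind, hex_value = match.groups()
            counts[(port, kind, int(hex_value, 16))] += 1
    return counts


def report(path):
    """Print one line per distinct status/error value per ATA port."""
    with open(path, errors="replace") as handle:
        counts = summarize(handle)
    for (port, kind, value), n in sorted(counts.items()):
        table = STATUS_BITS if kind == "status" else ERROR_BITS
        names = " ".join(decode(value, table))
        print(f"{port} {kind}=0x{value:02x} [{names}] x{n}")
```

Calling report("guiris_syslog.txt") on the attached syslog would show at a glance which ports are throwing the errors and how often. For the lines quoted above, status 0x41 decodes to DRDY plus ERR and error 0x04 decodes to ABRT.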

Screenshot:

You can see that the new disk also fails to report a temperature. I'm thinking the next step should be obtaining a SMART report on the new disk7 replacement disk. However, even if disk7 did turn out to be bad it wouldn't explain unRAID's odd behavior... or would it? Any other thoughts?
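
For the SMART report, smartctl's -A option prints the attribute table for a drive. Below is a minimal Python sketch for pulling out the attributes that usually matter on a suspect disk. It assumes smartctl and Python 3 are both available on the machine where it runs, /dev/sdX is a placeholder for whatever device the replacement disk7 shows up as, and the function names and the watch list are illustrative choices rather than anything official.

```python
import subprocess

# Attributes commonly checked first on a suspect drive (illustrative list).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable", "UDMA_CRC_Error_Count")


def parse_attributes(report_text):
    """Return {attribute_name: first raw-value token} from `smartctl -A` text."""
    attrs = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows look like: ID NAME 0xFLAG VALUE WORST THRESH TYPE
        # UPDATED WHEN_FAILED RAW_VALUE, so require an ID and a hex flag.
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x")):
            attrs[fields[1]] = fields[9]
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {name: raw for name, raw in attrs.items()
            if name in WATCHED and raw != "0"}


def smart_attributes(device):
    """Run `smartctl -A` on device (e.g. the placeholder /dev/sdX) and parse it."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_attributes(result.stdout)
```

Something like nonzero_watched(smart_attributes("/dev/sdX")) would then list any watched counters that are above zero. If Python is not available on the server, the output of smartctl can be saved to a file and passed to parse_attributes on another machine.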

guiris_syslog.txt

guiri · October 11, 2011

What? No one? C'mon smart people...

dgaschk · October 11, 2011

Do a memtest overnight.

mbryanr · October 11, 2011

Had a similar story here: <excuse the poor advice given by myself as it was the first time I was attempting to troubleshoot a major issue>

http://lime-technology.com/forum/index.php?topic=15088.0

guiri · October 12, 2011

Thanks bryan, let's see what Rajahal tells me to do

dlandon · October 12, 2011

Once you get your situation straightened out, I'd make disk6, which is an RE4 drive, the parity drive. It is a WD high-reliability RAID Edition drive. I'm currently using one of these as my parity drive.

guiri · October 12, 2011

Well, I was going to take it out as I had it sold/traded to my computer guy. I guess I can keep it if it's going to make a difference and yes, they are nice drives. I had 4 of them in my ReadyNAS before I got the unRAID.

guiri · October 12, 2011

Thanks bryan, let's see what Rajahal tells me to do

Actually, he might suggest suicide so maybe I shouldn't wait for his reply

Rajahal · October 12, 2011

Thanks to dgaschk and mbryanr for your input. I do think a memtest is a good idea (certainly couldn't hurt). mbryanr, I agree that the situation seems similar to the one you linked. While I certainly know how to guide guiri back into having a healthy array by backing up the contents of the emulated disk7 then using initconfig and starting anew, I was hoping that we would be able to figure out why this happened in the first place. If it does turn out to be bad RAM then I think that would sufficiently explain this odd behavior, as bad RAM can cause all sorts of weird things to happen (especially when the OS is loaded into RAM as unRAID is). However, if we don't find any bad hardware then I believe this could only be explained as an unRAID bug. A bug in 4.7 no less.

Anyway, I'll guide guiri through a memtest and we'll go from there. I believe he has already backed up the data from the emulated disk7, so the data should be safe.

guiri · October 12, 2011

As for it being a bug, it did it before we upgraded too, right?

Personally, I'm hoping for a memory problem. I'm thinking it'll be much easier to rip that stick out and give it to one of my dogs than to do any of the other possible options.

Thanks guys

George

Rajahal · October 12, 2011

Quite right. If it is a bug, then it has existed in the unRAID code for quite some time. I agree, bad RAM would be one of the easiest (and least expensive) things to fix. We'll see what memtest turns up.

mbryanr · October 12, 2011

I was hoping that we would be able to figure out why this happened in the first place.

Those were my thoughts as well. I'm not certain what caused it, but as you stated intermittent hardware issues could cause this.

Ultimately, I had to get the data protected again. If it were my server, I'd want to know why. I'll follow along and hope you find a cause.

I noticed these similarities in the linked thread:

Just had a failed disk (Seagate 1.5tb) which I have replaced with a new drive (WD EARS 1.5tb - edit: jumpered over 7-.

Edit: UnRAID version 4.5.1 (she's been shut down for a while! Gimme a break! )

I initially "unassigned" the faulty drive (disk7) so I could access a few "crucial" files - while the array remained unprotected (*gasp!*)

I then replaced the drive, assigned the new drive to the same slot (disk7) and rebooted.

Was prompted to push Start ("Start will bring the array on-line, start Data-Rebuild, and then expand the file system (if possible)")

Pushed Start, and anticipated a loooong data rebuild... But instead, minimal disk activity, no data rebuild. The array is "Stopped", saying there is a new parity disk installed (but I never changed the parity drive!!), the new disk is orange balled (with temperature 0 degC), and I am asked to push Start again ("Start will bring the array on-line and start Parity-Sync")

guiri · October 12, 2011

Alright boys, running a long memtest right now. Not in a hurry so I'll let it run 24 hours or so. Personally, I'm hopin' for the memory being bad

Rajahal · October 13, 2011

Any news?

guiri · October 14, 2011

120 PASS, 0 Errors

Rajahal · October 14, 2011

OK, so the RAM is fine. I think the next step should be to restore your parity protection as soon as possible. Unfortunately this means that we may never figure out what caused this in the first place, but I expect you are more concerned with keeping your data safe than troubleshooting a potential unRAID bug. George, I'll call you soon and walk you through the 'initconfig' process.

guiri · October 14, 2011

Your suspicions are correct

mbryanr · October 14, 2011

Your suspicions are correct

What fun is that.

Maybe Tom will see this and look into what may have caused it.

guiri · October 14, 2011

Oh yeah, easy to talk when it's not your files

guiri · October 14, 2011

We also had problems creating a cache drive but I won't go into that as I don't know if it's related and don't understand this stuff.
