Replacing disk7 yet unRAID says 'upgrading parity'?



I was helping guiri replace a failed disk when we saw this bit of odd behavior.  Here's the story:

 

guiri was running unRAID 4.5.4.  unRAID red-balled disk7.  guiri replaced the disk with a used 2 TB WD EARS (with jumper installed).  Upon the next boot, unRAID would say 'disabled disk replaced' and everything else looked normal.  Upon pressing 'start', unRAID would say 'upgrading parity disk' and offer to start a parity sync.  This is clearly wrong: replacing a data disk should start a data rebuild, and the parity disk was never touched.  I thought it might be a bug with unRAID 4.5.4, so I connected remotely to guiri's computer and upgraded his server to unRAID 4.7.  The odd behavior continued with unRAID 4.7.  Screenshot and syslog (after the upgrade to 4.7) attached.

 

I thought that his server might have some loose wiring inside, so I had him move disk7 into a different physical hot swap bay on the server.  He did so (the drive is now in a completely different drive cage, so the cabling shouldn't be the issue at this point, though it is still a possibility).

 

Some odd things I notice in his syslog:

 

Oct 10 22:50:18 Tower kernel: 0MB HIGHMEM available.
Oct 10 22:50:18 Tower kernel: 766MB LOWMEM available.

 

Potentially a problem with the RAM?

 

Oct 10 22:50:18 Tower kernel: scsi 0:0:0:0: Direct-Access     SanDisk  U3 Cruzer Micro  8.02 PQ: 0 ANSI: 0 CCS
Oct 10 22:50:18 Tower kernel: [753]: scst_suspend_activity:599:suspend_count 0
Oct 10 22:50:18 Tower kernel: [753]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [753]: scst_suspend_activity:644:Waiting for 0 active commands finally to complete
Oct 10 22:50:18 Tower kernel: [753]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [753]: __scst_resume_activity:675:suspend_count 0 left
Oct 10 22:50:18 Tower kernel: [753]: scst: scst_register_device:792:Attached to scsi0, channel 0, id 0, lun 0, type 0

 

Potentially a problem with the flash drive?

 

Oct 10 22:50:18 Tower kernel: scsi 1:0:0:0: Direct-Access     ATA      WDC WD20EARS-00S 80.0 PQ: 0 ANSI: 5
Oct 10 22:50:18 Tower kernel: [770]: scst_suspend_activity:599:suspend_count 0
Oct 10 22:50:18 Tower kernel: [770]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [770]: scst_suspend_activity:644:Waiting for 0 active commands finally to complete
Oct 10 22:50:18 Tower kernel: [770]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [770]: __scst_resume_activity:675:suspend_count 0 left
Oct 10 22:50:18 Tower kernel: [770]: scst: scst_register_device:792:Attached to scsi1, channel 0, id 0, lun 0, type 0

 

Several of these errors on several different drives.

 

Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }

 

Tons of these errors.
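
To put a number on "tons", the error lines can be tallied per ATA port.  This is only a rough sketch: it assumes the syslog lines have the same shape as the excerpt above, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches libata error lines shaped like the excerpt above, e.g.
#   "Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }"
ATA_ERROR = re.compile(r"kernel: (ata\d+): error=(0x[0-9a-fA-F]+)")


def count_ata_errors(lines):
    """Count error lines per (port, error code) pair."""
    counts = Counter()
    for line in lines:
        match = ATA_ERROR.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def count_ata_errors_in_file(path):
    """Same tally, reading the syslog from a file on disk."""
    with open(path, errors="replace") as handle:
        return count_ata_errors(handle)
```

Running count_ata_errors_in_file("guiris_syslog.txt") on the attached syslog would show whether the errors are confined to one port (pointing at a single drive or cable) or spread across several.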

 

Screenshot:

klqaW.png

 

You can see that the new disk also fails to report a temperature.  I'm thinking the next step should be obtaining a SMART report on the disk7 replacement.  However, even if the replacement disk did turn out to be bad, it wouldn't explain unRAID's odd behavior... or would it?  Any other thoughts?
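
For anyone who wants to script the SMART report, here is a minimal sketch.  It assumes smartctl (from smartmontools) is installed on the machine and that you already know the device node of the replacement disk; "/dev/sdX" below is a placeholder, not the real device name, and the helper names are my own.

```python
import subprocess


def smartctl_command(device):
    """Build the smartctl invocation; -a requests all SMART information."""
    return ["smartctl", "-a", device]


def smart_report(device):
    """Run smartctl and return its text output.

    The exit status is deliberately not checked: smartctl uses it as a
    bit mask and can return non-zero even when it printed a full report.
    """
    result = subprocess.run(
        smartctl_command(device), capture_output=True, text=True
    )
    return result.stdout


# Example (placeholder device name, replace with the real one):
# print(smart_report("/dev/sdX"))
```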

guiris_syslog.txt

Link to comment

Thanks to dgaschk and mbryanr for your input.  I do think a memtest is a good idea (certainly couldn't hurt).  mbryanr, I agree that the situation seems similar to the one you linked.  While I certainly know how to guide guiri back into having a healthy array by backing up the contents of the emulated disk7 then using initconfig and starting anew, I was hoping that we would be able to figure out why this happened in the first place.  If it does turn out to be bad RAM then I think that would sufficiently explain this odd behavior, as bad RAM can cause all sorts of weird things to happen (especially when the OS is loaded into RAM as unRAID is).  However, if we don't find any bad hardware then I believe this could only be explained as an unRAID bug.  A bug in 4.7 no less.

 

Anyway, I'll guide guiri through a memtest and we'll go from there.  I believe he has already backed up the data from the emulated disk7, so the data should be safe.

Link to comment

As for it being a bug, it did it before we upgraded too, right?

 

Personally, I'm hoping for a memory problem. I'm thinking it'll be much easier to rip that stick out and give it to one of my dogs than to do any of the other possible options.

 

Thanks guys

 

George

Link to comment

I was hoping that we would be able to figure out why this happened in the first place.

Those were my thoughts as well.  I'm not certain what caused it, but as you stated intermittent hardware issues could cause this.

Ultimately, the data has to be protected again.  If it were my server, I'd want to know why.  I'll follow along and hope you find a cause.

 

I noticed these similarities in the thread I linked:

Just had a failed disk (Seagate 1.5tb) which I have replaced with a new drive (WD EARS 1.5tb - edit: jumpered over 7-8).

 

Edit: UnRAID version 4.5.1 (she's been shut down for a while! Gimme a break! :) )

 

I initially "unassigned" the faulty drive (disk7) so I could access a few "crucial" files - while the array remained unprotected (*gasp!*)

 

I then replaced the drive, assigned the new drive to the same slot (disk7) and rebooted.

 

Was prompted to push Start ("Start will bring the array on-line, start Data-Rebuild, and then expand the file system (if possible)")

 

Pushed Start, and anticipated a loooong data rebuild... But instead, minimal disk activity, no data rebuild. The array is "Stopped", saying there is a new parity disk installed (but I never changed the parity drive!!), the new disk is orange balled (with temperature 0 degC), and I am asked to push Start again ("Start will bring the array on-line and start Parity-Sync")

Link to comment

OK, so the RAM is fine.  I think the next step should be to restore your parity protection as soon as possible.  Unfortunately this means that we may never figure out what caused this in the first place, but I expect you are more concerned with keeping your data safe than troubleshooting a potential unRAID bug.  George, I'll call you soon and walk you through the 'initconfig' process.

Link to comment
