Replacing disk7 yet unRAID says 'upgrading parity'?

October 10, 2011

I was helping guiri replace a failed disk when we saw this bit of odd behavior. Here's the story:

guiri was running unRAID 4.5.4 when unRAID red-balled disk7. guiri replaced the disk with a used 2 TB WD EARS (with jumper installed). On the next boot, unRAID said 'disabled disk replaced' and everything else looked normal. After pressing 'Start', however, unRAID said 'upgrading parity disk' and offered to start a parity sync. This is clearly wrong. I thought it might be a bug in unRAID 4.5.4, so I connected remotely to guiri's computer and upgraded his server to unRAID 4.7. The odd behavior continued with unRAID 4.7. Screenshot and syslog (after the upgrade to 4.7) attached.

I thought that his server might have some loose wiring inside, so I had him move disk7 into a different physical hot swap bay on the server. He did so (the drive is now in a completely different drive cage, so the cabling shouldn't be the issue at this point, though it is still a possibility).

Some odd things I notice in his syslog:

Oct 10 22:50:18 Tower kernel: 0MB HIGHMEM available.
Oct 10 22:50:18 Tower kernel: 766MB LOWMEM available.

Potentially a problem with the RAM?
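
As a side note on reading those two lines: on a 32-bit x86 kernel with the default 3G/1G split, roughly the first 896 MB of RAM is "low" memory and only RAM beyond that is reported as "high" memory. The sketch below is a rough check under that assumption (the 896 MB figure and the function names are illustrative, not taken from the syslog). If the total the kernel sees is below that ceiling, a 0MB HIGHMEM line is expected and does not by itself indicate bad RAM.

```python
import re

# Assumption: 32-bit x86 kernel with the default 3G/1G split, where roughly
# the first 896 MB of RAM is mapped as low memory.
LOWMEM_CEILING_MB = 896

def parse_mem_lines(lines):
    """Collect the HIGHMEM/LOWMEM figures (in MB) from kernel boot messages."""
    found = {}
    for line in lines:
        m = re.search(r"(\d+)MB (HIGHMEM|LOWMEM) available", line)
        if m:
            found[m.group(2)] = int(m.group(1))
    return found

def all_fits_in_lowmem(mem):
    """True when the total reported RAM is below the low-memory ceiling,
    in which case a 0MB HIGHMEM line is the expected result."""
    total = mem.get("LOWMEM", 0) + mem.get("HIGHMEM", 0)
    return total < LOWMEM_CEILING_MB

log = [
    "Oct 10 22:50:18 Tower kernel: 0MB HIGHMEM available.",
    "Oct 10 22:50:18 Tower kernel: 766MB LOWMEM available.",
]
mem = parse_mem_lines(log)
print(mem, "fits entirely in low memory:", all_fits_in_lowmem(mem))
```

This only interprets the boot messages; it says nothing about whether the installed memory is healthy, which is what a memtest would check.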

Oct 10 22:50:18 Tower kernel: scsi 0:0:0:0: Direct-Access     SanDisk  U3 Cruzer Micro  8.02 PQ: 0 ANSI: 0 CCS
Oct 10 22:50:18 Tower kernel: [753]: scst_suspend_activity:599:suspend_count 0
Oct 10 22:50:18 Tower kernel: [753]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [753]: scst_suspend_activity:644:Waiting for 0 active commands finally to complete
Oct 10 22:50:18 Tower kernel: [753]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [753]: __scst_resume_activity:675:suspend_count 0 left
Oct 10 22:50:18 Tower kernel: [753]: scst: scst_register_device:792:Attached to scsi0, channel 0, id 0, lun 0, type 0

Potentially a problem with the flash drive?

Oct 10 22:50:18 Tower kernel: scsi 1:0:0:0: Direct-Access     ATA      WDC WD20EARS-00S 80.0 PQ: 0 ANSI: 5
Oct 10 22:50:18 Tower kernel: [770]: scst_suspend_activity:599:suspend_count 0
Oct 10 22:50:18 Tower kernel: [770]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [770]: scst_suspend_activity:644:Waiting for 0 active commands finally to complete
Oct 10 22:50:18 Tower kernel: [770]: scst_susp_wait:578:wait_event() returned 0
Oct 10 22:50:18 Tower kernel: [770]: __scst_resume_activity:675:suspend_count 0 left
Oct 10 22:50:18 Tower kernel: [770]: scst: scst_register_device:792:Attached to scsi1, channel 0, id 0, lun 0, type 0

Several of these errors on several different drives.

Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }
Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }
Oct 10 22:50:18 Tower kernel: ata2: error=0x04 { DriveStatusError }

Tons of these errors.
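
To put a number on "tons", here is a minimal Python sketch that counts these messages per ATA port and decodes the two hex values. The bit names come from the ATA status and error registers: 0x41 is DRDY plus ERR, and 0x04 is ABRT, meaning the drive aborted the command it was given. That alone does not say which command was rejected or why. The function names and the two sample lines are only for illustration; the commented-out lines show how it might be pointed at the attached guiris_syslog.txt.

```python
import re
from collections import Counter

# Selected bit names from the ATA status and error registers.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

PATTERN = re.compile(
    r"kernel: (ata\d+): translated ATA stat/err 0x([0-9a-fA-F]+)/([0-9a-fA-F]+)"
)

def decode(value, names):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True) if value & bit]

def summarize(lines):
    """Count 'translated ATA stat/err' messages per (port, status, error)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[(m.group(1), int(m.group(2), 16), int(m.group(3), 16))] += 1
    return counts

sample = [
    "Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00",
    "Oct 10 22:50:18 Tower kernel: ata2: status=0x41 { DriveReady Error }",
    "Oct 10 22:50:18 Tower kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00",
]
counts = summarize(sample)
# To scan the real log instead of the sample lines:
# with open("guiris_syslog.txt") as f:
#     counts = summarize(f)
for (port, stat, err), n in counts.items():
    print(port, n, decode(stat, STATUS_BITS), decode(err, ERROR_BITS))
```

Grouping by port would show whether the messages come from one port (here ata2) or from several, which helps decide whether to suspect a single drive or something shared such as a controller or cabling.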

Screenshot:

You can see that the new disk also fails to report a temperature. I'm thinking the next step should be obtaining a SMART report on the new disk7 replacement disk. However, even if disk7 did turn out to be bad it wouldn't explain unRAID's odd behavior... or would it? Any other thoughts?
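
For the SMART step, a rough sketch of pulling the temperature out of a `smartctl -A` attribute table is below. It assumes smartmontools and Python are available on whatever machine the drive is attached to, that the drive reports temperature as attribute 194 (some drives use a different attribute), and `/dev/sdX` is a placeholder device name. The sample table is made up for illustration and is not output from guiri's drive.

```python
import shutil
import subprocess

def smart_attributes_text(device):
    """Run `smartctl -A <device>` and return its output, or None if smartctl is missing."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return result.stdout

def temperature_celsius(text):
    """Return the raw value of attribute 194 (Temperature_Celsius), or None if absent."""
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "194":
            return int(fields[9])
    return None

# Illustrative sample only (hypothetical values, not from the disk in this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   118   098   000    Old_age   Always       -       32
"""
print(temperature_celsius(SAMPLE))

# Against a real drive (placeholder device name):
# text = smart_attributes_text("/dev/sdX")
# if text is not None:
#     print(temperature_celsius(text))
```

If the drive returns no attribute table at all, that would be consistent with the missing temperature in the screenshot and would point toward the drive or its connection rather than the array configuration.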

guiris_syslog.txt

October 11, 2011

What? No one? C'mon smart people...

October 11, 2011

Do a memtest overnight.

October 11, 2011

Had a similar story here (excuse the poor advice I gave; it was my first time troubleshooting a major issue):

http://lime-technology.com/forum/index.php?topic=15088.0

October 12, 2011

Thanks bryan, let's see what Rajahal tells me to do

October 12, 2011

Once you get your situation straightened out, I'd make disk6, which is an RE4 drive, the parity drive. It is a WD high-reliability RAID Edition drive. I'm currently using one of these as my parity drive.

October 12, 2011

Well, I was going to take it out, as I had it sold/traded to my computer guy. I guess I can keep it if it's going to make a difference, and yes, they are nice drives. I had 4 of them in my ReadyNAS before I got the unRAID.

October 12, 2011

Thanks bryan, let's see what Rajahal tells me to do

Actually, he might suggest suicide so maybe I shouldn't wait for his reply

October 12, 2011

Author

Thanks to dgaschk and mbryanr for your input. I do think a memtest is a good idea (certainly couldn't hurt). mbryanr, I agree that the situation seems similar to the one you linked. While I certainly know how to guide guiri back to a healthy array by backing up the contents of the emulated disk7, then using initconfig and starting anew, I was hoping that we would be able to figure out why this happened in the first place. If it does turn out to be bad RAM, then I think that would sufficiently explain this odd behavior, as bad RAM can cause all sorts of weird things to happen (especially when the OS is loaded into RAM, as unRAID is). However, if we don't find any bad hardware, then I believe this could only be explained as an unRAID bug. A bug in 4.7, no less.

Anyway, I'll guide guiri through a memtest and we'll go from there. I believe he has already backed up the data from the emulated disk7, so the data should be safe.

October 12, 2011

As for it being a bug, it did it before we upgraded too, right?

Personally, I'm hoping for a memory problem. I'm thinking it'll be much easier to rip that stick out and give it to one of my dogs than to do any of the other possible options.

Thanks guys

George

October 12, 2011

Author

Quite right. If it is a bug, then it has existed in the unRAID code for quite some time. I agree, bad RAM would be one of the easiest (and least expensive) things to fix. We'll see what memtest turns up.

October 12, 2011

I was hoping that we would be able to figure out why this happened in the first place.

Those were my thoughts as well. I'm not certain what caused it, but as you stated, intermittent hardware issues could cause this.

Ultimately, I had to get the data protected again. If it were my server, I'd want to know why. I'll follow along and hope you find a cause.

I noticed these similarities in the thread I linked:

Just had a failed disk (Seagate 1.5tb) which I have replaced with a new drive (WD EARS 1.5tb - edit: jumpered over 7-.

Edit: UnRAID version 4.5.1 (she's been shut down for a while! Gimme a break! )

I initially "unassigned" the faulty drive (disk7) so I could access a few "crucial" files - while the array remained unprotected (*gasp!*)

I then replaced the drive, assigned the new drive to the same slot (disk7) and rebooted.

Was prompted to push Start ("Start will bring the array on-line, start Data-Rebuild, and then expand the file system (if possible)")

Pushed Start, and anticipated a loooong data rebuild... But instead, minimal disk activity, no data rebuild. The array is "Stopped", saying there is a new parity disk installed (but I never changed the parity drive!!), the new disk is orange balled (with temperature 0 degC), and I am asked to push Start again ("Start will bring the array on-line and start Parity-Sync")

October 12, 2011

Alright boys, running a long memtest right now. Not in a hurry so I'll let it run 24 hours or so. Personally, I'm hopin' for the memory being bad

October 13, 2011

Author

Any news?

October 14, 2011

120 PASS, 0 Errors

October 14, 2011

Author

OK, so the RAM is fine. I think the next step should be to restore your parity protection as soon as possible. Unfortunately this means that we may never figure out what caused this in the first place, but I expect you are more concerned with keeping your data safe than troubleshooting a potential unRAID bug. George, I'll call you soon and walk you through the 'initconfig' process.

October 14, 2011

Your suspicions are correct

October 14, 2011

Your suspicions are correct

What fun is that.

Maybe Tom will see this and look into what may have caused it.

October 14, 2011

Oh yeah, easy to talk when it's not your files

October 14, 2011

We also had problems creating a cache drive, but I won't go into that as I don't know if it's related and don't understand this stuff.
