5.0-rc11 immediate errors when replacing failed disk

February 10, 2013

Hello,

I have an interesting issue. I recently saw that one of my disks had redballed, with (I think) 768 errors, both read and write. So, I immediately went out and bought a new disk. I installed it, ran a full preclear, which completed successfully. I stopped the array, added the new disk, and restarted it to begin a rebuild. Unfortunately, the rebuild immediately stopped, with an error, and another redball with the new disk. I stopped the array, started it, stopped it and re-added the disk, only to see the same thing.

I powered down and reconnected the disk to a different SATA channel (I have a SAS controller with one free cable). I powered up again, started the array, stopped it, added the disk, and started the array once more, and saw the same thing again: an error and a redball.

Now I'm unsure what to do next. It seems like the disk is OK, given that the preclear didn't report any errors and I've now tried a different SATA channel. Does anyone have any ideas? I am attaching a syslog. This is disk6 (sdf), by the way.

Thank you,

Sean

[Moderator (RobJ): changed icon to report bug]

[Moderator (RobJ): changed icon - not bug, dependency conflict with addon]

syslog.txt

February 10, 2013

Author

I've now changed the power connector (just because I hadn't before), but am still having the same issue.

Reading through my syslog, I was searching Google and this forum for "kernel: attempt to access beyond end of device", and the results I have found seem to share something in common with me: they all concern Seagate drives. Mine is a ST3000DM001-1CH166_Z1F1YR94.

The other 3TB drives I have are WDs. I bought a Seagate (which was a bit more expensive) only because the drive that failed was a WD, and I thought I would try another brand.

I have 12 drives total, and I DO have 2 other Seagate drives, but they are ST31500341AS 1.5TB drives. So I am wondering if this issue is because of some odd issue with this 3TB Seagate drive?

I'd like to just go out and exchange it for another WD, but unfortunately we are buried under almost 3 feet of snow here in CT! I am going to try jumpering the drive for 1.5 Gbits/sec- maybe there is an auto-negotiation problem.

February 10, 2013

Author

One last thing- I have a Supermicro 8-port SAS Controller, an AOC-SAS2LP-MV8, which supposedly works fine with 3TB (or 4TB) drives. I know for sure that it works with Western Digital 3TB drives, anyway.

In any case, jumpering the Seagate didn't make any difference. Oh, well. Hopefully someone will have an idea, if not I suppose I will go and exchange this for another WD.

February 10, 2013

I'm not an expert, but no one else has replied, so ....

The disk is originally imported with (correctly) a size of 2.9TB.

The write which fails is addressed to a sector at approximately 2.5TB, with the report that the disk limit is approx 2.2TB (sector 4294967295).

Now, that sector number is the absolute limit for a 32bit address (the actual limit for disk size before unRAID was enabled for 3TB drives).

The question then is why?

The system recognises the drive as 3TB, but when it comes to write, it is limited at the old MBR 32 bit maximum. Perhaps the disk has something strange pre-recorded, or when you precleared, it failed to write a GPT? More investigation is required, and I leave that to someone with more expertise.
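The 32-bit limit described above can be checked with a little arithmetic. This is only a minimal sketch, assuming 512-byte logical sectors (the sector size is an assumption, it is not stated in this thread); the function name is made up for illustration. Under that assumption the limit works out to roughly 2.2 TB, or exactly 2 TiB.

```python
# Assumption: 512-byte logical sectors (not stated in the thread).
SECTOR_BYTES = 512

# Highest sector number that fits in 32 bits; this is the limit in the syslog.
MAX_32BIT_SECTOR = 2**32 - 1


def mbr_limit_bytes(sector_bytes: int = SECTOR_BYTES) -> int:
    """Bytes addressable with 32-bit sector numbers (sectors 0..2**32 - 1)."""
    return (MAX_32BIT_SECTOR + 1) * sector_bytes


limit = mbr_limit_bytes()
print(MAX_32BIT_SECTOR)          # highest addressable sector
print(limit)                     # limit in bytes
print(round(limit / 1e12, 2))    # limit in decimal TB
print(limit / 2**40)             # limit in binary TiB
```

Any write addressed past that sector, such as the one at about 2.5TB, would be rejected as "beyond end of device" if the kernel is treating the disk as MBR-sized.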

February 10, 2013

Feb 10 01:00:25 Tower emhttp: writing GPT on disk (sdf), with partition 1 offset 64, erased: 0

Feb 10 01:00:25 Tower emhttp: shcmd (160): sgdisk -Z /dev/sdf &> /dev/null

...

Feb 10 01:00:25 Tower emhttp: shcmd (161): sgdisk -o -a 64 -n 1:64:0 /dev/sdf |& logger

Feb 10 01:00:25 Tower logger: sgdisk: /usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.11' not found (required by sgdisk)

One more line that should have appeared (but did not) is Feb 10 01:00:25 Tower kernel: sdf: sdf1, indicating the kernel acknowledges finding one partition on this drive.

You are using RC11, and it appears one lib was not found (perhaps the wrong version?). Try going back temporarily to RC10 (replace bzroot and bzimage) and try this operation again. When it finishes, restore the RC11 bzroot and bzimage files.

Tom?
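The sgdisk error above says that the libstdc++ being loaded does not provide the symbol version GLIBCXX_3.4.11. One way to see which versions a given copy of the library provides is to scan the file for embedded GLIBCXX_ version strings, similar in spirit to running `strings` on it. This is a rough sketch, not an official unRAID tool; the function names are hypothetical, and the path is the one shown in the log.

```python
import re
from pathlib import Path
from typing import Set

# Matches version tags such as GLIBCXX_3.4 or GLIBCXX_3.4.11 in raw bytes.
VERSION_RE = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)*)")


def glibcxx_versions(data: bytes) -> Set[str]:
    """Return every GLIBCXX version number embedded in the library bytes."""
    return {m.group(1).decode("ascii") for m in VERSION_RE.finditer(data)}


def provides(data: bytes, wanted: str) -> bool:
    """True if the library bytes contain the tag GLIBCXX_<wanted>."""
    return wanted in glibcxx_versions(data)


def library_provides(path: str, wanted: str) -> bool:
    """Same check, reading the library from disk."""
    return provides(Path(path).read_bytes(), wanted)


# Small in-memory demo with made-up bytes standing in for a library file.
sample = b"\x00GLIBCXX_3.4\x00GLIBCXX_3.4.10\x00GLIBCXX_FORCE_NEW\x00"
print(sorted(glibcxx_versions(sample)))
print(provides(sample, "3.4.11"))
```

On the server itself, `library_provides("/usr/lib/libstdc++.so.6", "3.4.11")` returning False would be consistent with an older libstdc++ having replaced the stock one.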

February 10, 2013

Author

PeterB and RobJ,

Thank you both for your replies. I hadn't seen that, Peter, concerning the size. At least that explains the "beyond end of device" error.

Rob, when you say "try this operation again", do you mean the disk rebuild?

Thanks,

Sean

February 10, 2013

Rob, when you say "try this operation again", do you mean the disk rebuild?

Yes

February 10, 2013

Author

Rob,

Thank you (I confess, I was hoping you didn't mean another 30-hour preclear!). I restored my bzimage and bzroot from rc10 and tried again- but I get the same error. I've created another system log, which I will attach. Do you think it is worth attempting to swap cables with another drive that is using one of the motherboard ports, rather than my Supermicro card? Or would that create new difficulties?

Thank you,

Sean

February 10, 2013

Author

Aaaand, I forgot to attach the file. Sorry about that.

syslog2.txt

February 10, 2013

Which version of preclear did you use?

http://lime-technology.com/forum/index.php?topic=2817.0

February 10, 2013

Author

I'd thought I was using the latest version- "./preclear_disk.sh -v" returns "./preclear_disk.sh version: 1.13"

Thank you,

Sean

February 10, 2013

Yeah, the RC10 syslog reports exactly the same behavior. I checked other RC10 syslogs, and the same sgdisk command executed without problems, so that means the lib or sgdisk is being removed or replaced in your setup. You are loading Avahi support, so I have to think that it or some other unlogged addon is replacing either the stock sgdisk or the lib with a different version. I would return to RC11, disable any addons, and try again.

February 10, 2013

Author

Rob,

Ok- I used unmenu to disable package install upon reboot. As far as I know, I only had ds_store_cleanup, image_server.sh, screen, and UnRAID-Web installed (I don't believe I specifically installed Avahi, although it may well be a dependency). I left unmenu and mc- but so far, the rebuild appears to be working. For what it's worth, I recall installing ds_store_cleanup and screen recently, so one of those may have contributed to this issue, I am not certain.

I am only at:

Current position: 24.37 GB (1%)

Estimated speed: 72.19 MB/sec

Estimated finish: 687 minutes

So far, but it hasn't gotten this far since I replaced the disk. So- fingers crossed. I will update this thread after it completes. Thank you (and PeterB and mbryanr) so much for all your help!

Thank you,

Sean
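The progress figures in the post above are consistent with each other, which can be checked with simple arithmetic. This is a minimal sketch, assuming a nominal 3000 GB disk and decimal units (1 GB = 1000 MB); how unRAID actually computes its estimate is not stated in this thread, and the function name is hypothetical.

```python
def minutes_remaining(total_gb: float, done_gb: float, speed_mb_s: float) -> float:
    """Minutes left at a constant speed, using decimal units (1 GB = 1000 MB)."""
    remaining_mb = (total_gb - done_gb) * 1000
    return remaining_mb / speed_mb_s / 60


# Assumption: 3000 GB nominal capacity for the 3TB drive.
# Position and speed are the figures quoted in the post.
estimate = minutes_remaining(total_gb=3000.0, done_gb=24.37, speed_mb_s=72.19)
print(round(estimate))
```

Under those assumptions the result rounds to the 687 minutes that was reported.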

February 10, 2013

Very good! Those darn addons again!

I suspect ds_store_cleanup. You may want to post about this in the thread of that addon. Perhaps it just needs an update for RC10 and up.

February 11, 2013

Very good! Those darn addons again!

... and a very good justification for a 'vanilla' boot option. Perhaps that will encourage most people to take all the addons out before reporting a problem.

February 11, 2013

Author

Yes- I realized belatedly that I had recently installed the ds_store_cleanup. I just hadn't considered it as a possible cause. Still, point taken, with my apologies.

Now- one last question, to anyone who might know: I have successfully rebuilt the disk, and all appears to be well. I have not used the array since completion (haven't written any new data to it, nor deleted anything), and I am curious as to why I would suddenly have all these writes to my disks 7 and 8, including 17314573113 writes to an empty drive!

I'm attaching a screenshot. Hopefully this is nothing to worry about.

February 12, 2013

Now- one last question, to anyone who might know: I have successfully rebuilt the disk, and all appears to be well. I have not used the array since completion (haven't written any new data to it, nor deleted anything), and I am curious as to why I would suddenly have all these writes to my disks 7 and 8, including 17314573113 writes to an empty drive!

From a parity standpoint or a rebuilding standpoint, file contents are completely irrelevant, so the drive being empty does not matter at all. However, those Disk 7 and Disk 8 read and write numbers are clearly very wrong, and look like a variable overwrite. All other numbers look correct. Does the syslog show anything odd during this period? Have you done a long memory test?

If those numbers were right, then Disk 7 and Disk 8 were read a thousand times as often as the others, and written to 3000 times as much as the drive that was rebuilt, which would take roughly 3000 times as long as it took to rebuild Disk 6! (Maybe 3 or 4 years?)

I would shut down and reboot, in case there is internal corruption. Then when convenient for you, do a full parity check.
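To put the "3000 times as long" remark in perspective, here is the arithmetic as a quick sketch. It takes the 687-minute estimate reported earlier in the thread as the rebuild time; the actual duration was never posted, so that figure is an assumption.

```python
# Assumption: the rebuild took about as long as the 687-minute estimate.
REBUILD_MINUTES = 687
FACTOR = 3000  # "written to 3000 times as much"

MINUTES_PER_YEAR = 60 * 24 * 365

total_minutes = REBUILD_MINUTES * FACTOR
years = total_minutes / MINUTES_PER_YEAR
print(total_minutes)
print(round(years, 1))
```

That lands just under four years, in line with the "3 or 4 years" guess, which is why the counters cannot be real.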

February 13, 2013

Author

Rob,

Apparently my cat decided to pull out the power connection to those 2 drives at some point. I think the rebuild was complete, or almost complete, when he did this. I only noticed it when those 2 disks came up as "not installed"! I shut down, reconnected power, and have now run a full parity check, which came up with 6 errors. So, who knows. Not sure whether I lost anything or not, but everything SEEMS ok so far.
