Help first problem in about 1.5 years..

October 8, 2009

All,

I've been waiting for this day to come, so here it is. I tried some initial research but am not really making any progress. I recently bought two Seagate 2TB drives; I made one my new parity drive and the other my disk 6 (/dev/sdh). The new disk 6 now seems to be having some issues. Both of these disks, along with 2 other data disks, are in the same 5-disk SATA backplane. Everything was, and somewhat still is, working fine, but when I looked at the web interface I noticed the red ball next to disk 6, and it listed about 300 errors. I then tried running a SMART report ("smartctl") on /dev/sdh, but I kept getting an invalid device error. I stopped my array, and now disk 6 is showing as "not installed". I'm afraid to do anything next, so I'm just going to leave the array stopped and the machine on. I have attached a copy of the syslog as well. I truly appreciate the help in advance, guys.

October 8, 2009

That sdh disk started having problems on Sep 29 20:18:10...

I wonder why it took unraid eight days to remove that disk from the array, and only after you tried to stop the array.

You should replace that "sdh" Seagate as quickly as humanly possible (before a second disk craps out), so that there will be no data loss.

Or, if you don't have anything of value on that disk yet (you mentioned that it's new), then just rebuild the array without it.

October 8, 2009

I stopped my Array, and now disk 6 is showing as "not installed".

It could be that the drive has failed, or it could be that the drive has a bad connection via its tray/cabling.

Before considering the drive dead, you should re-seat it and check its cabling. If it comes back to life because it was just a bad connection, and you can again get a SMART report, then you might consider the "trust my parity" procedure as described in the wiki.
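For the SMART check after re-seating, here is a minimal Python sketch. It assumes smartmontools is installed and that the disk still shows up as /dev/sdh (the device letter can change after a reconnect, so confirm it first). The helper names `smart_command` and `run_smart_report` are made up for illustration; only the `smartctl -a` and `smartctl -t short` invocations come from smartmontools itself.

```python
import shutil
import subprocess


def smart_command(device, short_test=False):
    """Build a smartctl command line for the given device node.

    -a prints all SMART information; -t short starts a short self-test.
    """
    if short_test:
        return ["smartctl", "-t", "short", device]
    return ["smartctl", "-a", device]


def run_smart_report(device):
    """Return smartctl's text output, or None if smartctl is not on PATH."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        smart_command(device), capture_output=True, text=True
    )
    # smartctl uses non-zero exit codes as status bits, so the return code
    # is deliberately not treated as a hard failure here.
    return result.stdout


# Example use (device name is an assumption; adjust to your system):
#   print(run_smart_report("/dev/sdh"))
```

If this still reports an invalid device, the kernel no longer has a usable device node for the disk, which points back at the connection rather than at the SMART data.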

Joe L.

October 8, 2009

Author

Well, I think unRAID actually removed the drive before I stopped the array, because it had the red ball next to it and it wasn't listing the drive as up (I can't remember the exact status message it listed).

So I powered down the machine and then went in for surgery to make sure everything was plugged in right. I took the problem drive (sdh, disk 6) out of the backplane, rescrewed it into the drive tray (with better flat screws), and plugged it back in. Then I kicked the machine a few times, powered it all back up, and wham, the drive came back. I ran the short SMART report and everything seemed OK! Since I had not noticed the problem until late last night, I had actually added about 15 GB of data to the array, which seemed to work flawlessly even though unRAID had already removed the drive. Because I did some writing after the drive was removed, I decided to just rebuild the drive, as that's what unRAID wanted to do. Everything seems to be moving along fine.

Now my other question: since I did some writing to the array after the drive was taken out of rotation, could I have done the whole "trust-my-array" procedure to get it added back into the array quickly? I thought that because I did some writes (15 GB of data added) it would not work?

October 8, 2009

Now my other question: since I did some writing to the array after the drive was taken out of rotation, could I have done the whole "trust-my-array" procedure to get it added back into the array quickly? I thought that because I did some writes (15 GB of data added) it would not work?

You are correct, that procedure would not work if you had "written" to the failed drive.

You acted correctly. The rebuilding of the drive will put the 15gig of data you changed onto it. Using the "Trust-my-drive" process would have put the drive's data back to the time before the drive failed.

Hopefully, it was just a loose connection that caused all your problems with that drive. If not, you'll soon know as it will go off-line again.

You'll want to keep an eye on the syslog for errors for a few days...
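One way to do that without reading the whole log by eye is a small Python sketch like the one below. It assumes the syslog lives at /var/log/syslog and that drive trouble shows up as kernel `ataN` messages; the keyword list is an assumed, incomplete selection, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches kernel libata messages such as "kernel: ata7.00: disabled".
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings that usually indicate link or device trouble
# (an assumption for this sketch, not a complete list).
TROUBLE = ("exception", "failed", "link down", "hard resetting",
           "limiting", "disabled")


def count_ata_problems(lines):
    """Count suspicious libata messages per port (ata7.00 counts as ata7)."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match and any(word in match.group(2).lower() for word in TROUBLE):
            counts[match.group(1)] += 1
    return counts


def scan_syslog(path="/var/log/syslog"):
    """Scan a syslog file; the default path is an assumption."""
    with open(path, errors="replace") as handle:
        return count_ata_problems(handle)


# Example use:
#   for port, hits in scan_syslog().items():
#       print(port, hits)
```

A count that stays at zero for the port the drive sits on is what you are hoping for; any new hits mean the connection is acting up again.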

Joe L.

October 8, 2009

Looking at the syslog, you booted at 7:47am on September 29, and as he said, the trouble begins about 12 and a half hours later at 8:18pm. I do not see any errors related to the physical drive itself, so the drive should be fine. There are no CRC errors to implicate a bad cable either. In my inexpert opinion, the problems are related to the backplane, perhaps a loose connection, perhaps vibration. The SATA link remained up for a while, but communications were clearly bad, and both the SATA link speed and the transfer mode (initially UDMA/133) were slowed way down in the hope of achieving more reliable communication. Finally, around 8:35pm, the SATA link goes down too, and shortly thereafter the following line occurs, a very bad line and, in my experience, always fatal from unRAID's viewpoint.

   Sep 29 20:35:03 Tower kernel: ata7.00: disabled

All subsequent errors can be ignored. Once the drive is disabled, then (false) read errors accumulate.
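To separate the meaningful part of a log from that later noise, a minimal Python sketch could split the syslog at the first "disabled" line. The function name `split_at_disable` is made up for illustration, and the pattern only assumes the line format shown above.

```python
import re

# Matches lines like "Sep 29 20:35:03 Tower kernel: ata7.00: disabled".
DISABLED = re.compile(r"kernel: (ata\d+\.\d+): disabled")


def split_at_disable(lines):
    """Split log lines at the first 'disabled' message.

    Returns (device, before, after), where 'before' ends with the
    disabled line itself. If no such line exists, device is None and
    'after' is empty.
    """
    lines = list(lines)
    for index, line in enumerate(lines):
        match = DISABLED.search(line)
        if match:
            return match.group(1), lines[:index + 1], lines[index + 1:]
    return None, lines, []


# Example use (path is an assumption):
#   with open("/var/log/syslog", errors="replace") as handle:
#       device, before, after = split_at_disable(handle)
#   print(device, len(before), "lines worth reading,", len(after), "after")
```

Everything in the first part is worth reading for the cause; read errors in the second part are consequences of the drive already being disabled.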

The kernel is actually able to recover the drive one more time, but by then it is too late, and it identifies it as a new drive, sdi. UnRAID itself knows nothing of sdi; it only knows about sdh. Soon sdi is lost also, and by 8:50pm the kernel gives up completely on trying to recover the drive. You will notice that I differentiate between unRAID and the kernel, because they actually are separate and independent entities. In a sense, they represent different layers of communication, and unRAID itself does not know about the problems in the underlying layer. It only knows whether it can successfully request that blocks be read and written. So it continues to try to read blocks from the disabled sdh, but of course unsuccessfully.

It's hard to say what the actual problem is. The kernel module knew that the SATA link was still up, so it knew the drive was still there, but its requests for the identity of the device went unanswered and timed out. One possibility is that the loose or faulty component was related to the power connection, not the SATA connection. The errors certainly look like they could be a power-related disruption of the drive's 'computer', and it *was* able to maintain the SATA connection longer than anything else.

These kinds of issues are not uncommon with backplane-connected drives. I think special effort is needed in any tray mounting to make absolutely sure that both the SATA and power connections are tight, and will remain so even if there is substantial vibration.
