Well, this one is fun



So I logged into my server today to find a disk disabled. I also noticed a new device in my unassigned devices that I didn't expect. It took me an embarrassingly long time to realize that the serial numbers were the same. I am further confused by the fact that the kernel appears to have attempted to mount this same device on multiple dev nodes. I've never seen this before, but my first wild-a**ed-guess is that I may have had an interruption on either the power or data cable, and maybe the kernel got a little messed up trying to manage hotplug.

 

Anybody else have a smarter answer? (smarter-aleck answers accepted too, I could use a laugh)

 

Diagnostics:

http://tinyurl.com/j7raxsc

[Attached image: double_mount.png]


So I logged into my server today to find a disk disabled. I also notice a new device in my unassigned devices that I didn't expect. It took me an embarrassingly long time to realize that the serial numbers were the same. I am further confused by the fact that it appears that the kernel attempted to mount this same device on multiple dev nodes. I've never seen this before, but my fist wild-a**ed-guess is that i may have had a interruption on either the power or data cable and maybe the kernel got a little messed up trying to manage hotplug.

 

Anybody else have a smarter answer? (smarter-aleck answers accepted too, I could use a laugh)

 

Diagnostics:

http://tinyurl.com/j7raxsc

 

No real ideas, but I think the bold part could do with a spelling correction...  ::)  As it is, it doesn't sound much fun to me...  :o



I have not looked at your diagnostics to confirm, but whenever I have seen this situation, it's because a drive was dropped by the kernel, then reappeared.  As you can see in your pic, the drive is first recognized as sdp, so somewhere in your syslog is a statement about sdp being disabled (by its ata number).  The kernel does that when the drive does not respond, even after multiple attempts.  Then later the drive suddenly reappeared (like a hotplug), so the kernel assigned it the symbol sdi.  The array is still looking for sdp, not sdi, so it doesn't know the drive is there.  So on next array start, sdi is added to the Unassigned Drives list, a list of those drives that aren't a part of the array or cache or the boot flash.  What is also interesting in this is that there had to be an sdi already, which also had been dropped in order to make the sdi symbol available.  (Somewhere else in your syslog is a statement about the ata number of sdi being disabled.)  The answers will be in your diagnostics, if you or I or someone else has time.

 

It seems improbable that the drive was both disabled because of a temporary disconnection *and* that it has read errors.  It's more likely that read errors were reported when the drive did not respond, which means that the drive is probably fine, but its connection (either cable or backframe) is loose or, less likely, its power was insufficient.
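
The drop-and-reappear symptom described above (one serial number showing up under two device nodes) is easy to check mechanically. Here is a minimal Python sketch, assuming you have already collected (device node, serial) pairs by whatever means you prefer. The function names and the serials are made up for illustration; only the sdp and sdi node names come from this thread.

```python
from collections import defaultdict


def nodes_by_serial(devices):
    """Group device nodes by the drive serial number they report."""
    grouped = defaultdict(list)
    for node, serial in devices:
        grouped[serial].append(node)
    return dict(grouped)


def reappeared(devices):
    """Return only the serials that were seen under more than one node."""
    return {
        serial: nodes
        for serial, nodes in nodes_by_serial(devices).items()
        if len(nodes) > 1
    }


# Hypothetical inventory: the serial strings are placeholders, not real data.
observed = [
    ("sdp", "SERIAL-A"),  # node the array still expects
    ("sdh", "SERIAL-B"),  # an unrelated, healthy drive
    ("sdi", "SERIAL-A"),  # same drive after it dropped and came back
]

print(reappeared(observed))  # {'SERIAL-A': ['sdp', 'sdi']}
```

Any serial listed here is a drive the kernel has seen twice, which points at a dropped connection rather than a genuinely new device.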


So a reboot cleared the weird double-node thing. Unfortunately it looks like the drive is a lost cause, it fails SMART tests with read errors. :(

I decided to take a look at the SMART report in the diagnostics, and it looks fine, a brand new drive with 17 hours, no sector issues at all.  What is very unusual though is the Power_Cycle_Count of 1161.  That's really hard to do when the drive has only 17 operational hours!  So that seems to confirm that a drive connection is very loose, and the power connection is probably vibrating on and off.
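
To put those two SMART numbers side by side, here is a quick back-of-the-envelope calculation in Python. The function names and the one-cycle-per-hour threshold are my own assumptions for illustration, not anything SMART or unRAID defines; the 1161 and 17 are the values from the report above.

```python
def cycles_per_hour(power_cycle_count, power_on_hours):
    """Average number of power cycles per operational hour."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return power_cycle_count / power_on_hours


def looks_suspicious(power_cycle_count, power_on_hours, threshold=1.0):
    """Flag drives that power cycle more often than `threshold` per hour.

    The default threshold is an arbitrary assumption, not a standard.
    """
    return cycles_per_hour(power_cycle_count, power_on_hours) > threshold


rate = cycles_per_hour(1161, 17)
print(f"{rate:.1f} power cycles per operational hour")  # 68.3
print(looks_suspicious(1161, 17))  # True
```

More than one power cycle per minute of operation is far beyond anything normal use would produce, which is why a flaky power connection is the likely explanation.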


It looks like that's exactly what happened. When I swapped the drive, it started rebuilding excruciatingly slowly, the log was throwing a lot of "Link is slow to respond" errors, and ultimately the replacement drive was disabled too. I just finished reseating all of the drives and cables, and did some overdue cleaning inside the case. We'll see how things look when it comes back up.


So I reintroduced the original drive and allowed the array to rebuild back onto it. Everything is back to normal. This is the second time I've been bitten by a connection issue. I'm using locking SATA cables, so I'm unsure what else can be done to ensure a more reliable connection. Anybody have any ideas that are better than 'goober some hot glue onto the connectors'?


I'm using a single-rail, modular PSU with direct power to each drive cage. The supply and cables were carried over when I rebuilt my server last year, so they've been in service since 2011. I admit that raises the possibility that the PSU could be nearing the end of its useful life. However, after doing some log diving, I found that both of my recent drive issues have been isolated to the same cage slot, so I'm inclined toward a faulty SATA connection.
