my first red ball, after driving my server almost 3000 miles



The SATA error flags you are seeing (RecovComm and PHYRdyChg) are typical of a bad connection, perhaps loose and vibrating, or poor backplane connection, or perhaps poor or noisy power.  You have 2 separate drives with the same issues, connected to channels ata12 and ata14.  I understand you've reconnected drives to different ports, so that in itself may have improved all connections.  In cases like these, the drives themselves are usually completely fine.
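Flags like RecovComm and PHYRdyChg show up in the kernel log, so a quick grep is an easy way to see which ata channels are affected. A minimal sketch, using invented syslog lines in place of the real log (on the server you would pipe `dmesg` or read the syslog instead):

```shell
# Count occurrences of the link-level SATA error flags named above.
# The sample lines are made up for illustration; on a live server you would
# run something like: dmesg | grep -E 'RecovComm|PHYRdyChg'
log='ata12: SError: { RecovComm PHYRdyChg CommWake 10B8B }
ata14: SError: { RecovComm PHYRdyChg }
ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)'
printf '%s\n' "$log" | grep -c 'PHYRdyChg'
```

Channels that never log these flags are almost certainly fine; repeated hits on the same channel point at that cable, backplane slot, or power lead.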

 

That's good to hear.

 

I'll probably rebuild it onto the new drive tomorrow, then preclear this one, then see if the one that was acting up before is still having issues, and replace it if so.  Otherwise, I'll just upgrade a smaller drive with it.

 

I started the long smart test on the drive about an hour ago, and I'm not really sure how to see the results when done, but once it finishes, hopefully it will be okay.

 

root@media:~# smartctl --test=long /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.15.0-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 388 minutes for test to complete.
Test will complete after Wed Aug 27 03:33:11 2014

Use smartctl -X to abort test.


I'm not sure if I've got this right, but...

Now that the drive seems to be running again, why not assign it as disk9 and check whether your array comes up again?

 

Edit:

Q: For my understanding - once a drive gets red balled, will it change its status when the problem is resolved, or is a rebuild inevitable?

 


The only danger would be if it started rebuilding onto the drive and something else happened; you could lose the original data on that drive as well as the emulated drive from parity.  It would be safer to rebuild onto a new drive, and then you always have the original drive with the original data.

 

From my experience, once it is red balled, unRAID can't trust the integrity of that disk against parity, since it doesn't know what has changed on the missing disk since it dropped out.  I think there is a way to force it to accept the drive back and assume parity is correct, but you would have to be 100% sure nothing had been updated, or I guess you could run a correcting parity check to fix anything that doesn't match.  But that always seemed a bit risky to me.

From my experience, once it is red balled, it can't trust the integrity of the parity since it doesn't know what has changed on that missing disk since it dropped out.

 

Makes sense, although it's quite annoying if you get a red ball due to cabling issues.

 

Occasionally I carry my server to LAN parties at friends' places.

A red ball caused by a loose cable would be a real showstopper.

Next time I'll disable autostart before transport and convince myself that all drives are available before I start the array again.


Q: For my understanding - once a drive gets red balled, will it change its status when the problem is resolved, or is a rebuild inevitable?

 

I've never seen a case where it didn't rebuild the disk.    That's why I suggested using a different disk, so if anything goes awry the original disk is still available for data recovery.  [if the OP has a full set of backups that's less important, but I know a lot of folks don't have current backups.]

 

Q: For my understanding - once a drive gets red balled, will it change its status when the problem is resolved, or is a rebuild inevitable?
It's more helpful to think of the red ball as applying to the array slot, not specifically to the disk itself. Once the slot has a red ball, unRAID doesn't attempt any more file operations on the assigned disk, because it failed a write operation. Anything you write to that slot goes to the emulated drive, which is calculated from the rest of the drives, including the parity drive.

To get that slot back to green, you either write the contents of the emulated drive back to the same drive (or preferably a new one), or you throw all that updated information away and rebuild parity from what was on the drive the moment before it red balled by setting a new configuration. Anything written to that slot at or after the moment of the red ball will be lost if you do set a new config.
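The emulation described above is just XOR arithmetic across the other drives. A toy sketch with invented byte values (real parity is computed the same way for every byte position across all data disks):

```shell
# Toy illustration of single-parity emulation: the parity byte is the XOR of
# the same byte on every data disk, so any one missing disk can be recomputed
# from the survivors. All values here are made up.
d1=$(( 0xA5 )); d2=$(( 0x3C )); d3=$(( 0x5A ))
parity=$(( d1 ^ d2 ^ d3 ))            # written when parity is built
emulated_d2=$(( d1 ^ d3 ^ parity ))   # disk2 red-balls; recompute it
printf 'emulated disk2 byte: 0x%02X\n' "$emulated_d2"
# prints: emulated disk2 byte: 0x3C
```

This is also why a red ball on one slot is recoverable but a second failure during the rebuild is not: with one disk missing the XOR still has a unique solution, with two it doesn't.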

So, the new 3TB drive arrived today.  I'm trying to decide my best course of action now.  My current system is still running v6 beta6.  My red-balled drive is 2TB, and is 90% full.

 

I want to move to beta7, but I can't rebuild a 2TB drive onto a 3TB drive without losing the 1TB space with beta7 (from what I understand).  Also, the less I change before I repair/restore the drive, the better (from what I understand).

 

I would like to migrate to the XFS file system for my array drives, mostly just because it seems like a good move for future compatibility.  I suspect I'll eventually move to BTRFS, so I'm not sure it's worth taking the time/energy to move to XFS at this point in time.  I can't change the file system at all on beta6, so I suspect I just need to stick with reiserfs for the time being, and worry about file system changes later.

 

So, assuming I stay on beta6 for the repair work, does the new drive have to be pre-cleared before I can rebuild onto it?  I'd rather pre-clear it than have the server be unavailable for a day.  Also, is there any reason I should not disconnect the 'failed' drive, and use that spot to connect the new drive to pre-clear it?  I don't really have room in my server to install another drive (all bays are full now), but I do have room on my controller card for another SATA cable, so I could just lay the new drive inside the server (not actually installed) and pre-clear it that way.

 

Once it's pre-cleared (assuming it's necessary/good to do so), I can just stop the array, assign the new drive to the red-balled slot, then restart the array and it will begin to rebuild onto the new drive, correct?

 

Once that's all finished, and everything seems fine, could I assign the 'failed' drive to another slot, which currently holds a 1TB drive, to have those contents re-built onto the larger 2TB drive?

 

Sorry for being so 'thorough' (anal?) about this, I just want to be sure I'm going about this the right way.

 

Thanks again for everyone's help!!


If you run another smart report, you can see the results/progress of the test.

 

I had to shut down my laptop (which I used to start the test, via PuTTY) before the 3 hours had passed.  I assumed the test would continue, since I thought it was actually running on the server.

 

I just ran

smartctl --test=long /dev/sdb

on my laptop (in a new putty session), and it seems to have just restarted the test again, but didn't provide any results from the last test.

 

Do I just need to leave this session open on my laptop until it finishes, or how can I see the results of the long test?


So, the new 3TB drive arrived today.  I'm trying to decide my best course of action now.  My current system is still running v6 beta6.  My red-balled drive is 2TB, and is 90% full.

 

I want to move to beta7, but I can't rebuild a 2TB drive onto a 3TB drive without losing the 1TB space with beta7 (from what I understand).  Also, the less I change before I repair/restore the drive, the better (from what I understand).

 

I would like to migrate to the XFS file system for my array drives, mostly just because it seems like a good move for future compatibility.  I suspect I'll eventually move to BTRFS, so I'm not sure it's worth taking the time/energy to move to XFS at this point in time.  I can't change the file system at all on beta6, so I suspect I just need to stick with reiserfs for the time being, and worry about file system changes later.

 

So, assuming I stay on beta6 for the repair work, does the new drive have to be pre-cleared before I can rebuild onto it?  I'd rather pre-clear it than have the server be unavailable for a day.

Preclear is not necessary for a rebuild as far as unRAID operation is concerned. unRAID clears drives that are newly *added* to the array so parity remains valid; a rebuild, by contrast, writes the replacement drive into sync with parity, so it does not need to be clear first.

 

While unRAID only requires an added array drive to be cleared, you should still preclear any new drive because it is a good way to test the drive. You don't want to put a bad drive in the array.

Also, is there any reason I should not disconnect the 'failed' drive, and use that spot to connect the new drive to pre-clear it?  I don't really have room in my server to install another drive (all bays are full now), but I do have room on my controller card for another SATA cable, so I could just lay the new drive inside the server (not actually installed) and pre-clear it that way.

Yes, you can use the slot the old drive was in. It's probably best to set that drive aside anyway, in case something goes wrong with the rebuild.

Once it's pre-cleared (assuming it's necessary/good to do so), I can just stop the array, assign the new drive to the red-balled slot, then restart the array and it will begin to rebuild onto the new drive, correct?
Yes, that's right.
Once that's all finished, and everything seems fine, could I assign the 'failed' drive to another slot, which currently holds a 1TB drive, to have those contents re-built onto the larger 2TB drive?
You should preclear the old drive to make sure it is OK, then you can use it to rebuild your smaller drive.

 

Sorry for being so 'thorough' (anal?) about this, I just want to be sure I'm going about this the right way.

 

Thanks again for everyone's help!!

 


If you run another smart report, you can see the results/progress of the test.

 

I had to shut down my laptop (which I used to start the test, via PuTTY) before the 3 hours had passed.  I assumed the test would continue, since I thought it was actually running on the server.

 

I just ran

smartctl --test=long /dev/sdb

on my laptop (in a new putty session), and it seems to have just restarted the test again, but didn't provide any results from the last test.

 

Do I just need to leave this session open on my laptop until it finishes, or how can I see the results of the long test?

 

You need to install the screen package.  Then you run screen before issuing the command you want to run.  This allows you to disconnect and reconnect to your session; otherwise, if your PuTTY session dies, the commands running in it die too.


I want to move to beta7, but I can't rebuild a 2TB drive onto a 3TB drive without losing the 1TB space with beta7 (from what I understand).  Also, the less I change before I repair/restore the drive, the better (from what I understand).

Actually you CAN, as long as your parity drive is at least as large as the replacement drive.  unRAID would rebuild the 2TB drive onto the 3TB one and then extend the file system to use the rest of the disk.  What you cannot do is replace a drive with a smaller one.


I want to move to beta7, but I can't rebuild a 2TB drive onto a 3TB drive without losing the 1TB space with beta7 (from what I understand).  Also, the less I change before I repair/restore the drive, the better (from what I understand).

Actually you CAN, as long as your parity drive is at least as large as the replacement drive.  unRAID would rebuild the 2TB drive onto the 3TB one and then extend the file system to use the rest of the disk.  What you cannot do is replace a drive with a smaller one.

 

I don't think that's correct...

 

Known issues in this release

----------------------------

- emhttp: removed automatic file system expand when small drive is replaced by bigger drive.

  This will be put back in next release.

 

Tom, can you elaborate on this one ...

- emhttp: removed automatic file system expand when small drive is replaced by bigger drive.

 

Does this mean that if you are upsizing a disk, say from 2T to 4T, the 4T would continue to be limited to 2T? Can it be manually expanded? Will beta8 detect this condition and auto-expand any replaced disks? Would you advise users to boot back into 6b6 to do the replacement?

Code used to just do an unconditional "resize" upon every mount, and that was not an issue because with reiserfs it just was a no-op.  But with btrfs/xfs there needs to be a little more sophistication in the coding.  I didn't want to delay beta7 another week to get this in, so left auto-resize out until next release.  If someone needs to do an expansion before beta8 I will post instructions.

 

Somewhat related - if you add a btrfs or xfs disk in 6b7, can you boot back into 6b6? (I realize the XFS and BRTFS disks would appear unformatted and their contents not accessible, but would parity be maintained and therefore ability to do a disk rebuild?)

Yes.


I want to move to beta7, but I can't rebuild a 2TB drive onto a 3TB drive without losing the 1TB space with beta7 (from what I understand).  Also, the less I change before I repair/restore the drive, the better (from what I understand).

Actually you CAN, as long as your parity drive is at least as large as the replacement drive.  unRAID would rebuild the 2TB drive onto the 3TB one and then extend the file system to use the rest of the disk.  What you cannot do is replace a drive with a smaller one.

 

I don't think that's correct...

Well the User Guide says you can - see http://lime-technology.com/wiki/index.php?title=UnRAID_Manual#Replace_a_single_disk_with_a_bigger_one

 


I want to move to beta7, but I can't rebuild a 2TB drive onto a 3TB drive without losing the 1TB space with beta7 (from what I understand).  Also, the less I change before I repair/restore the drive, the better (from what I understand).

Actually you CAN, as long as your parity drive is at least as large as the replacement drive.  unRAID would rebuild the 2TB drive onto the 3TB one and then extend the file system to use the rest of the disk.  What you cannot do is replace a drive with a smaller one.

 

Not true.  The auto-extend was removed in Beta-7.  Tom plans to restore it in Beta-8, but if a rebuild was done now, only 2TB of the 3TB drive would be used.

 


If you run another smart report, you can see the results/progress of the test.

 

I had to shut down my laptop (which I used to start the test, via PuTTY) before the 3 hours had passed.  I assumed the test would continue, since I thought it was actually running on the server.

 

I just ran

smartctl --test=long /dev/sdb

on my laptop (in a new putty session), and it seems to have just restarted the test again, but didn't provide any results from the last test.

 

Do I just need to leave this session open on my laptop until it finishes, or how can I see the results of the long test?

 

It runs on the drive itself; you don't have to stay connected.  You should see output like this in the smart report:

 

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       7071         -
# 2  Extended offline    Aborted by host               60%       6850         -
# 3  Extended offline    Interrupted (host reset)      60%       3003         -
# 4  Short offline       Completed without error       00%       2939         -

 

 


It runs on the drive itself; you don't have to stay connected.  You should see output like this in the smart report:

 

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       7071         -
# 2  Extended offline    Aborted by host               60%       6850         -
# 3  Extended offline    Interrupted (host reset)      60%       3003         -
# 4  Short offline       Completed without error       00%       2939         -

 

How do I go and get that report?  I closed the putty window, so I'm just not sure how to see that report now that it should have finished.


It runs on the drive itself; you don't have to stay connected.  You should see output like this in the smart report:

 

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       7071         -
# 2  Extended offline    Aborted by host               60%       6850         -
# 3  Extended offline    Interrupted (host reset)      60%       3003         -
# 4  Short offline       Completed without error       00%       2939         -

 

How do I go and get that report?  I closed the putty window, so I'm just not sure how to see that report now that it should have finished.

 

That is a section of the output from the "smartctl -a /dev/sdX" report.  The test runs on the drive itself, not in the OS, and the results are saved internally on the drive, so they survive even a power cycle.  Some manufacturers' reports are a bit different, but they should contain mostly the same information.  The above was from a Seagate drive.
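If you only want the self-test results rather than the whole report, you can filter the saved output. A sketch, using the sample text quoted above in place of real `smartctl -a /dev/sdX` output (on the server, `smartctl -l selftest /dev/sdX` also prints just this log section):

```shell
# Extract the self-test result lines from a full SMART report.
# The string below stands in for real `smartctl -a /dev/sdX` output.
report='SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error     00%       7071         -
# 2  Extended offline    Aborted by host             60%       6850         -'
printf '%s\n' "$report" | grep -c 'Completed without error'
```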

