Error with disk drive. Automatically disabled. Next steps?



EDIT: I just realized I posted in the 5.x support forum instead of 6.x.  Can a moderator please move this for me?  Thank you!

 

 

 

Hello all.  A few minutes ago, while I was watching a TV show from my server, the video froze.  I thought it strange, so I closed the file and reopened it, but skipping ahead in the video produced the same freeze.  I thought maybe the server just needed a restart, and as I was taking the array offline a red toast popup appeared in the UnRAID GUI reporting errors on a disk, in this instance "disk 1".  As the array was already being stopped, I thought it best to reboot the server.

 

Upon booting, UnRAID displayed a large number of errors, all of which I assumed were on that disk.  It took an abnormal amount of time for UnRAID to finish booting, but eventually it did and I was able to get back into the GUI.  Once in the GUI, I saw that Disk 1 is being listed as not installed with "no device" and the disk model/serial number being listed under that.  When I click the drop down for Disk 1, it does not give me the option of remounting the drive.

 

My immediate thoughts are that the disk crapped out and I'm pretty much going to have to RMA it at this point.  Before I do that though, I wanted to upload my syslog to see if this is confirmed, and also to see what people suggest.  When I go to the dashboard it says that parity status is "invalid" which has me worried.  That might be normal, but this is the first time I've ever had an issue like this before.

 

I currently have the array stopped and will not be doing anything further until I get a bit of advice from the friendly community here.  Thank you!

syslog-2015-08-114.txt

Link to comment

Go to Tools - Diagnostics and post the result. For v6 it has more information than just the syslog.

 

Also, read this wiki while waiting for further advice.

Thanks for the reply!  Unfortunately I am running 6.0-rc3, the version before the diagnostics were added to UnRAID (just my luck.)  Looks like the best I can do is a syslog for now.  And due to the fact that I did reboot the server before collecting the log, I'm hoping that I didn't lose any vital information to share for diagnosing.

 

It looks like I should try and get a SMART report, however I receive "scsi error aborted command" when I try to do that.

 

I know it could be either the SATA cable itself going bad or just the connection coming loose.  I'm skeptical that's the case, but I'll wait to see what others say before I do anything.  Thanks again!

Link to comment

In Main, click on the drive and go to the Attributes and post them.

Thank you, but I can't actually click on the drive because it says Disk 1 is "not installed" in main with a red square next to it (device disabled.) :(

 

You could always upgrade to 6.0.1 final.

That's true.  Doing that would give me access to the new diagnostics feature.  I wonder if the diagnostics would even work, though, if it's not detecting the drive.

 

 

On a side note, I noticed it says in the syslog for sda1 (my UnRaid flash drive) that the "Volume was not properly unmounted. Some data may be corrupt. Please run fsck."  Not sure that has anything to do with my issue though!

Link to comment

Well, after doing a bit of research, it appears that it could be a number of things, including a bad cable, a bad SATA port, or a bad drive, with the drive appearing to me to be the most likely cause.  I did some Googling with the parts of my syslog shown below and found some threads here and on other forums that seem to describe a similar problem.

 

Aug 11 20:23:56 Tower kernel: ata7.00: status: { DRDY ERR }
Aug 11 20:23:56 Tower kernel: ata7.00: error: { ABRT }
Aug 11 20:23:56 Tower kernel: ata7.00: configured for UDMA/133
Aug 11 20:23:56 Tower kernel: ata7: EH complete
Aug 11 20:23:56 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 11 20:23:56 Tower kernel: ata7.00: irq_stat 0x40000001
Aug 11 20:23:56 Tower kernel: ata7.00: failed command: READ DMA
Aug 11 20:23:56 Tower kernel: ata7.00: cmd c8/00:02:00:00:00/00:00:00:00:00/e0 tag 28 dma 1024 in
Aug 11 20:23:56 Tower kernel:         res 51/04:02:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)

 

Aug 11 20:43:38 Tower kernel: sd 7:0:0:0: [sdf] tag#28 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Aug 11 20:43:38 Tower kernel: sd 7:0:0:0: [sdf] tag#28 Sense Key : 0xb [current] [descriptor] 
Aug 11 20:43:38 Tower kernel: sd 7:0:0:0: [sdf] tag#28 ASC=0x0 ASCQ=0x0 
Aug 11 20:43:38 Tower kernel: sd 7:0:0:0: [sdf] tag#28 CDB: opcode=0x28 28 00 00 00 00 04 00 00 04 00
Aug 11 20:43:38 Tower kernel: blk_update_request: I/O error, dev sdf, sector 4
Aug 11 20:43:38 Tower kernel: Buffer I/O error on dev sdf, logical block 2, async page read
Aug 11 20:43:38 Tower kernel: Buffer I/O error on dev sdf, logical block 3, async page read
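
For anyone who wants to pull the same kind of lines out of a long syslog before searching on them, here is a minimal Python sketch. The function name `tally_errors` and both regular expressions are made up for illustration; they are written against the messages quoted above and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Hypothetical patterns, modelled only on the kernel messages quoted in this post.
ATA_ERR = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (?:failed command|error:|exception)")
IO_ERR = re.compile(r"I/O error(?:,| on) dev (\w+)")

def tally_errors(lines):
    """Count error lines per ATA port (e.g. ata7) and per block device (e.g. sdf)."""
    counts = Counter()
    for line in lines:
        match = ATA_ERR.search(line) or IO_ERR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "Aug 11 20:23:56 Tower kernel: ata7.00: error: { ABRT }",
    "Aug 11 20:23:56 Tower kernel: ata7.00: failed command: READ DMA",
    "Aug 11 20:23:56 Tower kernel: ata7: EH complete",
    "Aug 11 20:43:38 Tower kernel: blk_update_request: I/O error, dev sdf, sector 4",
    "Aug 11 20:43:38 Tower kernel: Buffer I/O error on dev sdf, logical block 2, async page read",
]

for name, count in tally_errors(sample).items():
    print(name, count)
```

If every error lands on a single port and a single device, as it does here with ata7 and sdf, that narrows things down to one drive, its cable, or its port, but it cannot say which of the three is at fault.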

 

When I get home from work, I will try a new SATA cable first, then I suppose the next step is to try a different SATA port.  I unfortunately don't have any free SATA ports, so I'll have to swap out a drive to see if UnRAID detects the potentially problematic one.

 

Question:  Would it be okay to upgrade to UnRAID 6.0.1 final just to get access to the diagnostic tool in the UnRAID GUI?  If so, I can do that as well in order to post what I hope is more beneficial information.

 

Thank you all in advance!

Link to comment

Well after doing a bit of research, it appears that it could be a number of things, including bad cable, bad SATA port or a bad drive, with the drive appearing to me to be the most likely case.

The drive actually being bad is by far the least frequent cause of this sort of error!  Cables, ports, power are all far more frequent.

Link to comment

Okay so I tried two new SATA cables but no luck.  Next I'm going to try using a different SATA port.  Unfortunately I do not have any open SATA ports at this time.  Would there be any issues with unplugging one of my currently working drives and using its SATA and power cables?  That would allow me to test both the SATA port and the power if so.  Not sure if it would mess up anything with the UnRAID array configuration by doing so, so I wanted to play it safe first and ask.

 

Also, this is what happens when I try to run a SMART report on the drive from UnRAID:

 

=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001
Serial Number:    W1F1N7ZW
LU WWN Device Id: 5 000c50 05ce42712
Firmware Version: CC24
User Capacity:    137,438,952,960 bytes [137 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s
Local Time is:    Thu Aug 13 20:05:48 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Read SMART Data failed: Input/output error

=== START OF READ SMART DATA SECTION ===
Error SMART Status command failed: Input/output error
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.

Read SMART Log Directory failed: Input/output error

Read SMART Error Log failed: Input/output error

Read SMART Self-test Log failed: Input/output error

Selective Self-tests/Logging not supported

 

I've attached my latest syslog if it is of any help as well to my current issue.  Thank you again to everyone as always!

syslog.zip

Link to comment

Latest syslog was identical to the first, just multiplied many times, neither was complete, no setup section.  The Seagate (serial ending in 7ZW) appears to be bad, or it's attached to a very bad port.  You can swap cables with a different drive with no problems, but you don't want to just pull another drive, leaving the array with 2 drives wrong, and unable to start.

 

If after swapping the cables, the problem moves to the swapped drive, then you have a bad port, and need a new controller.  If the issue stays with the drive, it needs to be replaced.  I suspect it IS the drive this time as it cannot even report its own size correctly, so you may need a new 3TB drive, to rebuild Disk 1 onto.
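
The wrong size can be checked with a little arithmetic. The short Python sketch below takes the "User Capacity" and sector size from the SMART output posted earlier and compares them with the drive's nominal 3 TB. The constant names are made up for illustration. The byte count works out to exactly 2^28 - 1 sectors, the largest value a 28-bit sector count can hold. Why the drive fell back to that value is not established anywhere in this thread; the sketch only shows that the number is not a real 3 TB capacity.

```python
SECTOR_BYTES = 512                 # "Sector Size" line of the SMART output
REPORTED_BYTES = 137_438_952_960   # "User Capacity" line of the SMART output
NOMINAL_BYTES = 3 * 10**12         # nominal 3 TB size of the drive

reported_sectors = REPORTED_BYTES // SECTOR_BYTES
lba28_max_sectors = 2**28 - 1      # largest sector count that fits in 28 bits

print("reported sectors:", reported_sectors)
print("equals 28-bit maximum:", reported_sectors == lba28_max_sectors)
print(f"share of nominal capacity: {REPORTED_BYTES / NOMINAL_BYTES:.1%}")
```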

Link to comment

@RobJ - Thank you for your reply!  Once I get home from work I will give this a go.  I have a suspicion that it is the drive as well, mainly because I know this particular Seagate 3TB drive has a high failure rate.  Nonetheless, I will give the cable swap a try and see what happens when doing so.  I'm almost hoping it is the SATA port as I have been wanting to upgrade my server to something more powerful, and that would certainly give me a reason  ;D

 

I will report my results back, but the resolution to either a bad port or bad drive is pretty straightforward.  Thank you again!

Link to comment

Okay so I just swapped drives and it appears that the error has followed the drive and not the port.  As suspected, it looks like my drive has died.  Time to go purchase a 3TB or larger drive to replace this one.

 

Just so I'm clear about the rebuild process, I was going to follow the steps found on this wiki post (https://lime-technology.com/wiki/index.php/Replacing_a_Data_Drive):

 

A data drive needs to be replaced for whatever reason.
backup 'config/super.dat' and 'config/disk.cfg' files to workstation
Stop the array
Power down
Replace harddrive with new drive.
Turn on
Replaced drive appears with blue dot
Tick the "I'm sure" checkbox and press "Start"; this will bring the array on-line, start the Data-Rebuild, and then expand the file system.
Hefty disk activity and main page will show lots of reading on "the other" disks and writing on new disk as data is being rebuilt.
End

 

I just want to make sure that the process hasn't changed in any way since this was written.  I may also run preclear on the new drive to test it before rebuilding.  Thank you!

Link to comment

That wiki page was very old and needed changes, so I've revised it considerably and left the old instructions for anyone still running unRAID v4.

 

  Replacing a Data Drive

 

Let me know how it works for you, what changes still need to be made, what else needs to be said, etc...

 

And if you could provide a screen capture of what it looks like (the array operations section of the webGui), that would be great!  Or at least make a note of the prompts, so I can improve the wiki page with what a user would see.  Thanks!

Link to comment

@RobJ -  Thank you for taking the time to update the Wiki article!  I just received a replacement drive, and I wanted to ask a few questions before I start.

 

1. You mention in the updated instructions to "Unassign the old drive".  I'm not sure I am able to, or need to, because it is listed as "not installed."  I've attached an image of my UnRAID GUI to illustrate what I am seeing.

2. You also mention that all of the contents of the old drive will be copied onto the new drive, making it an exact replacement, except possibly bigger.  That last part worries me: 3TB is the largest drive I can install, since that is the size of my parity drive, and I (stupidly, I know) kept the old drive so full that only about 6GB were free before it died.  Do I need to worry about anything going wrong if it is indeed larger, since I suspect 6GB isn't a large buffer for any overflow?

 

Thank you again for all of your help as always, and I will begin the rebuild process as soon as I get confirmation on those two items!

disk1.JPG.acfce2b2e37d075e981f1ca2560d6586.JPG

Link to comment

The rebuild process works at the sector level and has no idea of the actual data on the disk, so it is irrelevant how full it is from a data perspective.  If unRAID accepts the disk as valid for rebuild purposes then it is happy that there are enough sectors to restore all the ones it thinks should be there.
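
To illustrate why fullness doesn't matter, here is a toy Python sketch of a sector-level rebuild. It assumes a plain single-parity XOR scheme, which is an assumption for illustration and not taken from unRAID's actual driver code; the helper name `xor_blocks` and the disk contents are made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy "data disks", each a single 4-byte sector with arbitrary contents.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
disk3 = bytes([0x00, 0xFF, 0x00, 0xFF])

# Parity is computed across the same sector position on every data disk.
parity = xor_blocks([disk1, disk2, disk3])

# Disk 1 fails: rebuild its sector from parity plus the surviving disks.
rebuilt = xor_blocks([parity, disk2, disk3])
print("rebuilt matches original:", rebuilt == disk1)
```

Nothing in the calculation looks at files or free space. Every sector is reconstructed the same way, whether it holds data or is unused.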

Link to comment

@itimpi - Thank you for the quick explanation.  I have attached the new drive and it is currently, seemingly with no issues, rebuilding the drive as we speak.  I expect this to take a fairly long time, maybe a full day, so I will report back once it is done if there are any issues.

 

@RobJ - I followed the instructions you laid out and have had no problems so far.  I admit I got a little excited and forgot to take a screenshot of what it said for the "I'm sure" portion of the instructions.  The wording mentioned something about "rebuilding the drive," but I forgot to copy it down before proceeding.  What I did was the following, starting with removing the old drive because my server was already shut down:

 

1. Removed the problem drive from the server

2. Installed the new drive with the same SATA cable and power cable, on the same port the troubled drive had been connected to

3. Booted up the server (noticed that the SuperMicro splash boot screen did not sit there for a minute as it previously had, presumably because the troubled drive was no longer delaying boot)

4. Opened the UnRAID GUI to the "Main" tab (default)

5. Clicked the drop-down list next to "Disk 1" (currently reporting "no device") and selected the new drive

6. Clicked the "Yes, I'm sure" check box next to the information explaining that the drive will be rebuilt upon hitting the "Start" button

7. Clicked the "Start" button and the rebuild started, with the UnRAID GUI displaying orange toast pop-ups in the top right corner.  The toast pop-ups said the following:

   unRAID Disk 1 error: 2015-08-16 04:11 PM
   Warning [TOWER] - Disk 1, drive not ready, content being reconstructed
   WDC_WD30EZRX-00D8PB0_WD-WMC4N0K9ALWF (sdf)

   unRAID Data rebuild:: 2015-08-16 04:11 PM
   Notice [TOWER] - Data rebuild: started
   Parity size: 3 TB

8. Data rebuild currently in progress

 

I've attached a screenshot of the current screen in case it is of any help for the wiki page.

 

In the meantime while the drive re-builds, I want to try and see if I can diagnose the drive that stopped working.  I have a SystemRescueCD USB drive that I will use to boot up into Linux and try and see if I can get the drive to be detected.  If anyone has any suggestions on things to do when trying to diagnose a potentially bad drive, that would be greatly appreciated.  In the meantime, off to Google I go.

 

Nonetheless, thank you to everyone who has been a help in this post.  I hope that this will be a simple solution to my issue! :)

data-rebuild.jpg.52298844d174e4c6ea3681620145d530.jpg

Link to comment

I've read conflicting arguments on here that it does and doesn't matter how much free space you have on your drives, so I'd love to hear what you suggest is an adequate amount :D

Static content that is read only after you put it on the drive (movies, etc) fill 'er up. The last few will probably take a while to start copying, but after it's on there, reading speeds should be fine. Drives that have changing content, I'd keep about 20% free, and every once in a while it would probably be a good idea to totally clean off the drive, format it, and start filling it up again. Fragmentation is still an issue, and I have not found a good way to deal with it short of reformatting every once in a while. Keeping 20% free will keep the effects to a minimum.

 

Others may have differing opinions, but this has been my experience so far with unraid.
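
If you want to keep an eye on that 20% figure, a minimal Python sketch like the one below will do. The function names and the default threshold are made up for illustration. The example checks the current directory so it runs anywhere; on a server you would point it at wherever the data disk is mounted.

```python
import shutil

def free_fraction(path):
    """Return the fraction of the filesystem at `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def below_threshold(path, min_free=0.20):
    """True when free space has dropped under the chosen fraction (20% here)."""
    return free_fraction(path) < min_free

# "." is only a stand-in so the example runs; substitute the disk's mount point.
print(f"free: {free_fraction('.'):.1%}, below 20%: {below_threshold('.')}")
```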

Link to comment
