Upgrading a disk -- having problems

March 3, 2009

Hi, I am replacing a 500GB drive with a 1TB. First I ran a parity check and it finished with no errors. Then I copied my entire unRAID flash drive in case I need a system config file. Then I powered down the server, replaced my drive. Powered on and the disk was ready to be upgraded. Started the rebuild process by using the Start button and it looked like everything was great. However the speed seemed pretty slow. About 4,000 KB/sec. I looked at my syslog and there were a lot of ata errors. I know this sometimes means that the cable or connection is bad but I had just completed the parity check. Attached is a copy of my syslog for your review.

One kind of strange thing that happened was I wanted to use the GREAT preclear_disk.sh script but I am out of available ports on my system. So I hooked up a laptop and used a USB/SATA dock for my 1TB drive. I set it up out of the way (in a bathroom we're remodeling) because it was to be unused for the next couple of days. Then, last minute, my wife needed to paint it so I killed the preclear process in the middle. When I put the drive in my real server I figured it wouldn't matter anyway, because the rebuild process should wipe the drive. Was this correct?

I still have the old 500GB drive available if necessary and a brand new 1TB still shrink wrapped if my existing drive is suspect.

Any suggestions on how to proceed? I think my system is currently unprotected.

Thanks.

March 4, 2009

"First I ran a parity check and it finished with no errors."

Excellent! Excellent!

"Then I copied my entire unRAID flash drive in case I need a system config file."

Excellent! Excellent! Great start here.

As someone else has said, we sure are having some nasty issues lately, especially a rash of hard drive and motherboard issues. Yours is a little unusual too. It looks to me as if the Jmicron chipset crashed hard about 2 and a half hours after the drive rebuild began, and both of its SATA links went down completely, at which point you lost sdq and sdr, your swap drive and cache drive, NOT any of your array drives! All of the rest of the syslog consists of the same repeated errors trying to contact the 2 lost drives.

The array and your drives are all fine, as far as I can tell, probably no issues at all. What happened to the Jmicron chipset, I don't know. You could disable it in the BIOS, but of course, you would lose your cache and swap drive, and you are already short of ports.

I'm not sure about the next step, as you did not indicate whether you killed the rebuild or not. You could continue the rebuild, but it is going to be a very long process at this speed, plus you have to worry about an 'exploded syslog' crashing your server. If you stop it, there are other complications to re-starting it, doable, with no data loss, but the steps have to be carefully followed, I think.

At some point, you will want to do some hardware testing, make sure of the status of the JMB ports.

March 4, 2009

Author

Thanks for the reply RobJ. I think you are saying that letting it continue and hope the rebuild completes would be the safest option. I have a 4GB flash and only 25% used so hopefully the syslog won't grow more than 3GB in the next 2993.4 minutes.

Let's say the syslog does crash the system and I do have to do a reboot. Any ideas how to avoid disaster? Even though the console says the array is started, my media player shows an error when trying to access the Movies share.

Thanks again.

March 4, 2009

I agree with RobJ. You have followed the recommended procedures and have done everything right!

A clarification: the syslog will not fill up your flash disk; it will fill up your main memory, because unRAID runs from a RAM-based filesystem and /var/log lives there. Once main memory is exhausted, unRAID will crash.

If it were me, I would put the old disk back in the array and use the "Trust my parity" procedure (link here).

That should get you back to a usable array. Let it complete its parity check. If all is well you know the problem is with the new disk somehow. But if you start to have problems with your existing array, you'll have more hints as to what might be the root cause.

I would then, hopefully in a test machine, run Joe L.'s preclear script on this drive. Although you don't need it to clear the disk, it will run the disk through its paces. If it completes the clearing process and the smart reports look clean, you'll have confidence that the drive is okay.

If it is clean, I would try rebuilding this disk again, using a fresh data cable and making sure that your connections are good. If it gives trouble again, you may be having a compatibility problem or could be dealing with a defective motherboard.

March 4, 2009

Author

Thanks bjp999. I think you are saying I should stop the rebuild, power down my server, replace the new 1TB with the old 500GB, and boot up the server. Then ensure that all drives are working and the syslog looks clean, and make sure the devices tab has the same assignments as before I started (I took a screen shot). Then stop the array, do the "trust my parity" procedure, and let the parity check complete.

Do I need to do anything with the configuration files I copied over before I started this fiasco?

Also, it says in big red print on the trust my parity procedure "It should NEVER be used if you are replacing a drive with one of a larger size.". I think this is okay since I will have already put back my old 500GB drive but I want to make sure.

Sorry, just a little uneasy about this one because it involves the Restore button, which I was hoping never to have to use.

Any additional words of wisdom? I'll be attempting this in about 3 or 4 hours from now.

Thanks as always.

March 4, 2009

What Brian is suggesting is exactly what I was going to suggest, but there is another way, that will probably be easier. I haven't done it, but others have just cleared the syslog, and bought time that way. I don't know if it is safe to just delete the syslog, and assume it will be recreated with the additional syslog lines. But it should be safe to clean it out, by something simple like:

echo "new syslog" >/var/log/syslog

That simply replaces the entire current syslog with a single dummy line, which is then appended to by subsequent syslog lines. You might check top before and after, to see if the memory available increases (sorry, not sure which quantity to monitor). You could probably do that every 8 hours and be safe. You already have a copy of the current syslog, so nothing is lost. (except for about 3000 more minutes of your time!)
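A slightly more careful version of that idea, as a minimal sketch: keep a timestamped copy of the log before overwriting it, so nothing is lost between trims. The function name trim_log and the backup directory argument are assumptions made for illustration; they are not part of unRAID.

```shell
# trim_log LOGFILE BACKUP_DIR
# Hypothetical helper: copy LOGFILE into BACKUP_DIR with a timestamp suffix,
# then replace its contents with a single marker line (the same effect as the
# echo command above).
trim_log() {
    log="$1"
    backup_dir="$2"
    stamp=$(date +%Y%m%d-%H%M%S)
    # Stop without touching the log if the copy fails.
    cp "$log" "$backup_dir/$(basename "$log").$stamp" || return 1
    echo "new syslog" > "$log"
}
```

Something like `free -m; trim_log /var/log/syslog /boot; free -m` would show memory before and after. Writing the copy to /boot (the flash drive) is an assumption; any location that is not held in RAM should do.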

I would not use the server during this time, for movies or anything else. And there is no guarantee other issues won't appear. I don't trust a machine that has had a crash. I would not trust it, until you have rebooted, then tested it.

March 4, 2009

Author

As a quick update --

I did the trust my parity process and put my old 500GB drive back in.

I only have access to one of my shares but the files are still on the disks so I think that if all goes well, when I reboot the shares will come back.

I can properly see that the cache and swap drives are mounted and there are no red entries in my syslog.

Progress on the Parity Check is going at the appropriate speed.

37,817 KB/sec and only 367.4 minutes left until finished.

However I did receive 24 (corrected) sync errors so far at 38.2% complete. Actually they happened right away on the parity check. This number hasn't increased since I first noticed. None of the disks have errors logged.

I think everything will turn out alright. I'm not sure how important the 24 sync errors are but I've got my fingers crossed. I'll update if things take a turn for the worse.

Thanks everybody for the help and advice. Invaluable!

March 4, 2009

Author

As a followup, my parity check ended up with 3751 errors corrected. I have no idea if that is significant or expected but my data seems to be intact. I am streaming movies and music with no problem. My latest downloads are there and play. There are no red entries in the syslog. Feels like all is good.

Thanks for all the help. That was scary.

March 4, 2009

I would run one more parity check and see if anything changes. Hopefully it does not!! If no errors are found on the next run then you should be back to normal.

I have not read through the entire thread, but when I get a chance I will see if I can comment any further on this problem, and hopefully we can get it figured out.

March 4, 2009

Author

Thanks prostuff1. What if there are parity errors? I just started another parity check and am only 0.5% complete at 28,376 KB/sec and 571.0 minutes until complete but with 1 sync error already. I imagine there will be more around the same place as last time (around 50 or 70 % complete). Should I run another check after that until there are no more errors? I'll save a syslog after each run just in case. Thanks for the help.

March 4, 2009

From my understanding you should not keep getting parity errors after a parity check is done. On my machine, which has been set up for a while now, I virtually never get parity errors. If this check also returns parity sync errors then something else might be up.

I would check cables (data and power) and try to get the stuff that is on the drive backed up somewhere else. Something is not being read correctly if there are always parity sync errors.

I will let RobJ or bjp999 comment on this also but I have not received any parity errors on my array for months now. After parity is built the parity check should run and come back with the message that nothing has changed. By coming back with parity errors it is saying that stuff is still changing.

March 4, 2009

Author

My parity check is almost complete. At this point I'm at 91.1% complete still with the one single sync error. I think the next step should be to run another parity sync and hope that there are 0 errors this time.

Is it bad practice to stream movies or use the server during this parity check? I've been leaving it alone for the last two days to get through some of this but am anxious to get back to working.

Thanks

March 5, 2009

Since a parity check is very IO intensive, you will generally get very poor performance while one is running. You can use the server during a check, but you would be better off copying the stuff you want to watch to your computer and playing it from there.

It is a good sign that there was only one error this time!! Hopefully the next run will produce zero errors!!

March 5, 2009

Author

I just ran Smart History and there was a message --

9TE0NDJZ: Warning - Reallocated_Sector_Ct has increased from 0 to 4 since yesterday
********************************************************
* DRIVE WD-WMAKE1332211: FAILED S.M.A.R.T. HEALTH TEST!!!! *
********************************************************

17 device(s) active, 0 sleeping, 1 did not return SMART data.

I think I need to watch this carefully. I'm going to run another parity check and then another Smart History. Hopefully the Reallocated Sector Count doesn't go up any more. -- Thanks

March 5, 2009

DO NOT RUN A PARITY CHECK!!!

Run a full smartctl report and post it.

March 5, 2009

AGREE WITH THIS MAN COMPLETELY!!!

You need to get the full SMART report on the drive first. Running a parity check on it right now could kill the drive if it really did fail its SMART health test. Hopefully there is nothing crucial on the drive; if there is, you might want to try copying it elsewhere first.

March 5, 2009

Run a smart report. If it really did fail, you will want to disconnect the drive. unRAID will simulate the failed disk until you can replace it with a good drive and then unRAID will rebuild the contents onto the good drive.

Running a parity check on a failed drive SHOULDN'T hurt anything, but I have seen reports from users that seem to imply that a drive can return bad data and corrupt parity. Once parity is corrupted, rebuilding a failed disk will result in garbage.

Like I said, don't run a parity check on a failed drive. Run a full smart report and let's see why the drive says it has failed.
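To make the corrupted-parity worry concrete, here is a toy sketch using single XOR parity on one byte per disk. The byte values and the xor_parity helper are made up for illustration; a real array does the same thing across every sector.

```shell
# xor_parity B1 B2 ... : print the XOR of the given byte values (decimal).
xor_parity() {
    p=0
    for b in "$@"; do
        p=$(( p ^ b ))
    done
    echo "$p"
}

d1=90; d2=60; d3=240              # made-up data bytes on three disks
good=$(xor_parity $d1 $d2 $d3)    # parity byte computed from correct reads
bad=$(xor_parity $d1 $d2 241)     # parity "corrected" after disk3 misreads 240 as 241

# Later disk2 dies and is rebuilt from parity plus the surviving disks.
echo "rebuild disk2 from good parity: $(xor_parity $good $d1 $d3)"
echo "rebuild disk2 from bad parity:  $(xor_parity $bad $d1 $d3)"
```

The first rebuild gives back 60, the original byte on disk2; the second gives 61, which is garbage. That is the scenario where a correcting parity check trusts a bad read and quietly bakes it into parity.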

March 5, 2009

Author

Thanks everybody for chiming in. I was about to start parity but luckily was waiting to see if anybody responded on the forum first. Thanks. Here's what I've done so far --

- ran this command in a terminal 'smartctl -d ata -tlong /dev/sdm'
- ran this command to check the status 'smartctl -a -d ata /dev/sdm'

I receive this data back -

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31000333AS
Serial Number:    9TE0NDJZ
Firmware Version: CC3H
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Mar 4 17:23:47 2009 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                 ( 625) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 210) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       185397006
  3 Spin_Up_Time            0x0003   095   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       34
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       4
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       320028
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       174
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       12
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       6
189 High_Fly_Writes         0x003a   095   095   000    Old_age   Always       -       5
190 Airflow_Temperature_Cel 0x0022   068   064   045    Old_age   Always       -       32 (Lifetime Min/Max 25/32)
194 Temperature_Celsius     0x0022   032   040   000    Old_age   Always       -       32 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   046   026   000    Old_age   Always       -       185397006
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       234994840633504
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1943668863
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1920700743

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                         Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress  90%        174              -
# 2  Extended offline    Aborted by host                90%        174              -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It doesn't seem as though the percent complete is advancing, and there is a line in the self-test log about an earlier test being "Aborted by host" at 90% remaining. Is there any way to ensure that this is actually running?
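One rough way to watch progress, sketched under assumptions: pull the "NN% of test remaining" figure out of the smartctl report and compare two readings taken some time apart. The helper name selftest_remaining is made up for illustration; it only parses text in the format shown in the report above.

```shell
# selftest_remaining: read smartctl output on stdin and print the number from
# the "NN% of test remaining" line, if a self-test is in progress.
# Prints nothing when no such line is present.
selftest_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p'
}
```

Usage would be along the lines of `smartctl -a -d ata /dev/sdm | selftest_remaining`. The drive estimates 210 minutes for the extended test, and the figure appears to be reported in coarse steps, so it may sit at the same value for a good while before it drops.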

Also, I have a brand new 1TB drive in the shrink wrap. Should I just ditch these two existing drives (old 500GB and the 1TB) and put in the new drive, rebuild from parity and hope for the best? Is that the safest course of action here?

Thanks very much.

March 5, 2009

This is not the drive that failed. 4 reallocated sectors may be something to worry about, but it may also be fine. Just keep an eye on it and see if the number increases.

The drive that failed has S/N WD-WMAKE1332211.

March 5, 2009

Author

Oh -- big misunderstanding. I was asking about the reallocated sectors and I wasn't even thinking about the Failed Health Test. I should have been more clear -- sorry. I may be totally wrong here but that Failed drive I think is working fine. It has always been that way and I didn't worry about it before because I thought it was a mis-read from the Smart Health script. The reason why is that drive is my swap drive. It is not actually assigned to the protected array. I figured it was that or the fact that this is an old Raptor (I think) that spins at 10K rpm. It is only 74GB and I only have one 4GB swap partition. The rest of the drive isn't even formatted. I figured that was why it would always show as failed.

Here is what it looks like in the middle of the main page from unMENU.

Drive Partitions - Not In Protected Array
Device Model/Serial Mounted File System Temp Size Used %Used Free

/dev/sdq scsi-SATA_WDC_WD740GD-00FWD-WMAKE1332211 nvidia_raid_member * 74.3G

/dev/sdq1 scsi-SATA_WDC_WD740GD-00FWD-WMAKE1332211 swap *

/dev/sdq2 scsi-SATA_WDC_WD740GD-00FWD-WMAKE1332211-part2 *

So, am I crazy or is there really nothing wrong with the swap drive? It seems to work as expected.

That being the case, if nothing is wrong with the Failed drive from the report (as I suspect that nothing is), then should I just do another parity check on the drive that received the 1 sync error and hope for no additional errors? I really don't believe I have a failed drive at this point.

Or should I just replace the drive with a new 1TB and try to rebuild from parity?

Any ideas are much appreciated. I'm not really sure what the smart way is to go at this point.

March 5, 2009

I would run a full smartctl report on your failed drive. Smartctl doesn't much care about what is on the disk and whether it is in the array or not - and if it thinks the drive has failed you would definitely want to know about it.

4 reallocated sectors is not the end of the world. If that is the only issue (and that failed drive above is not in your array), then your original concept of running another parity check is likely the right choice. After running it, see if the number of reallocated sectors has gone up, and if you get any disk errors on the unRAID console afterwards. If you are still sitting at 4 reallocated sectors then you can rest a bit easier (although I would suggest running 1-2 parity checks a week for the next month to convince yourself that everything is stable).

But suppose you run another parity check and the number goes from 4 to 7, then again and it goes from 7 to 24, and again from 24 to 53. (You get the idea.) Then you have a problem. Some have described the process as a radiating unformatting of the disk, kind of like pulling on a loose string in your shirt: the next thing you know you have a ball of string in your hand and the shirt is ruined. There is nothing you can do to stop this decay, unless somehow it stops on its own. But if, after running a burn-in process, you can't run 3 parity checks in a row without another reallocated sector occurring, the drive needs to be replaced IMO.
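For tracking that number between parity checks, a minimal sketch: pull the raw Reallocated_Sector_Ct value out of the smartctl attribute table and compare it with the previous reading. The helper names realloc_count and realloc_grew are made up for illustration, and the column position assumes the attribute table layout shown in the reports in this thread.

```shell
# realloc_count: read smartctl attribute output on stdin and print the raw
# Reallocated_Sector_Ct value (the 10th whitespace-separated column).
realloc_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# realloc_grew PREVIOUS CURRENT: exit status 0 if the count went up.
realloc_grew() {
    [ "$2" -gt "$1" ]
}
```

After each check, something like `now=$(smartctl -A -d ata /dev/sdm | realloc_count); realloc_grew 4 "$now" && echo "count rose from 4 to $now"` would flag any growth from the current value of 4.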

Hope this helps.

March 6, 2009

Author

Thanks. I have run another parity check and this time there were no errors. Everything working well. Watched movies, listened to some music. No problems.

Then all of a sudden, another disk is disabled. What is going on here? My server has been bulletproof for a long time but this is getting worrisome.

Here is a link to my syslog -- http://pastebin.com/m60cbc9b6

Here are some excerpts I copied out of it. Don't know if these are pertinent or not. Just guessing.

Mar  6 11:29:11 Tower kernel: ata3: EH complete
Mar  6 11:29:11 Tower kernel: sd 3:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
Mar  6 11:29:11 Tower kernel: sd 3:0:0:0: [sdc] Write Protect is off
Mar  6 11:29:11 Tower kernel: sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Mar  6 11:29:11 Tower kernel: sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar  6 11:29:11 Tower kernel: sd 3:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
Mar  6 11:29:11 Tower kernel: sd 3:0:0:0: [sdc] Write Protect is off
Mar  6 11:29:11 Tower kernel: sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Mar  6 11:29:11 Tower kernel: sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar  6 11:29:11 Tower kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x100000 action 0x6 frozen
Mar  6 11:29:11 Tower kernel: ata3: edma_err_cause=00000020 pp_flags=00000002, SError=00100000
Mar  6 11:29:11 Tower kernel: ata3: SError: { Dispar }
Mar  6 11:29:11 Tower kernel: ata3: hard resetting link
Mar  6 11:29:12 Tower kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar  6 11:29:12 Tower kernel: ata3.00: configured for UDMA/33
Mar  6 11:29:12 Tower kernel: ata3: EH complete

Mar  6 13:07:03 Tower kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar  6 13:07:03 Tower kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x40)
Mar  6 13:07:03 Tower kernel: ata3.00: revalidation failed (errno=-5)
Mar  6 13:07:03 Tower kernel: ata3.00: disabled
Mar  6 13:07:03 Tower kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x6 frozen t4
Mar  6 13:07:03 Tower kernel: ata3: edma_err_cause=00000020 pp_flags=00000002, SError=00000000
Mar  6 13:07:03 Tower kernel: ata3: hard resetting link

Mar  6 13:07:05 Tower kernel: ata3: EH complete
Mar  6 13:07:05 Tower kernel: sd 3:0:0:0: [sdc] Result: hostbyte=0x04 driverbyte=0x00
Mar  6 13:07:05 Tower kernel: end_request: I/O error, dev sdc, sector 4106648
Mar  6 13:07:05 Tower kernel: Buffer I/O error on device sdc, logical block 513331
Mar  6 13:07:05 Tower kernel: sd 3:0:0:0: [sdc] Result: hostbyte=0x04 driverbyte=0x00
Mar  6 13:07:05 Tower kernel: end_request: I/O error, dev sdc, sector 470968
Mar  6 13:07:05 Tower kernel:

Mar  6 13:53:45 Tower login[2504]: ROOT LOGIN  on `pts/0' from `192.168.1.202'
Mar  6 13:54:16 Tower kernel: sd 3:0:0:0: [sdc] Result: hostbyte=0x04 driverbyte=0x00
Mar  6 13:54:16 Tower kernel: end_request: I/O error, dev sdc, sector 9258447
Mar  6 13:54:16 Tower kernel: md: disk11 read error
Mar  6 13:54:16 Tower kernel: handle_stripe read error: 9258384/11, count: 1
Mar  6 13:54:17 Tower kernel: sd 3:0:0:0: [sdc] Result: hostbyte=0x04 driverbyte=0x00
Mar  6 13:54:17 Tower kernel: end_request: I/O error, dev sdc, sector 9258447
Mar  6 13:54:17 Tower kernel: md: disk11 write error
Mar  6 13:54:17 Tower kernel: handle_stripe write error: 9258384/11, count: 1
Mar  6 13:54:17 Tower kernel: md: recovery thread woken up ...
Mar  6 13:54:17 Tower kernel: md: recovery thread has nothing to resync
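To gauge whether a device keeps throwing errors (for example before and after a cable swap), one option is to count the kernel's I/O error lines for it. This is only a sketch: count_io_errors is a made-up helper, and it matches just the "end_request: I/O error, dev ..." lines seen in the excerpt above.

```shell
# count_io_errors DEVICE LOGFILE
# Print how many "I/O error, dev DEVICE," lines LOGFILE contains.
count_io_errors() {
    grep -c "I/O error, dev $1," "$2"
}
```

For instance, `count_io_errors sdc /var/log/syslog` run now and again an hour later would show whether the count is still climbing.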

I shut down the server, replaced the cable to this disk with a brand new one, verified that all other data and power cables were snug, made sure my controller cards were seated properly, and then restarted the server. I still have that disk as disabled.

Here is a quick smart report from unMENU --

smartctl -a -d ata /dev/sdc (disk11)

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.9 family
Device Model:     ST3500641AS
Serial Number:    3PM0DJRK
Firmware Version: 3.AAD
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Mar  6 14:53:05 2009 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 ( 430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   078   006    Pre-fail  Always       -       209636127
  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1602
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       128652140
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       19678
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       12
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   044   045    Old_age   Always   In_the_past 28 (Lifetime Min/Max 27/28)
194 Temperature_Celsius     0x0022   028   056   000    Old_age   Always       -       28 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   080   048   000    Old_age   Always       -       113282126
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Here is the hdparm --

hdparm -I /dev/sdc (disk11)


/dev/sdc:

ATA device, with non-removable media
Model Number:       ST3500641AS                             
Serial Number:      3PM0DJRK
Firmware Revision:  3.AAD   
Standards:
Supported: 7 6 5 4 
Likely used: 7
Configuration:
Logical		max	current
cylinders	16383	16383
heads		16	16
sectors/track	63	63
--
CHS current addressable sectors:   16514064
LBA    user addressable sectors:  268435455
LBA48  user addressable sectors:  976773168
device size with M = 1024*1024:      476940 MBytes
device size with M = 1000*1000:      500107 MBytes (500 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16	Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
     Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4 
     Cycle time: no flow control=240ns  IORDY flow control=120ns
Commands/features:
Enabled	Supported:
   *	SMART feature set
    	Security Mode feature set
   *	Power Management feature set
   *	Write cache
   *	Look-ahead
   *	Host Protected Area feature set
   *	WRITE_BUFFER command
   *	READ_BUFFER command
   *	DOWNLOAD_MICROCODE
    	SET_MAX security extension
   *	48-bit Address feature set
   *	Device Configuration Overlay feature set
   *	Mandatory FLUSH_CACHE
   *	FLUSH_CACHE_EXT
   *	SMART error logging
   *	SMART self-test
   *	General Purpose Logging feature set
   *	SATA-I signaling speed (1.5Gb/s)
   *	SATA-II signaling speed (3.0Gb/s)
   *	Native Command Queueing (NCQ)
   *	Phy event counters
   *	Software settings preservation
Security: 
Master password revision code = 65534
	supported
not	enabled
not	locked
	frozen
not	expired: security count
not	supported: enhanced erase
Checksum: correct
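
For anyone checking the size figures in the hdparm output above, they follow from the LBA48 sector count. A small Python sketch, assuming 512-byte sectors (as the kernel log reports for this drive); the variable names are just for illustration:

```python
# Sector count taken from the "LBA48 user addressable sectors" line above.
SECTOR_BYTES = 512          # assumption: 512-byte sectors, as the kernel log states
lba48_sectors = 976773168

size_bytes = lba48_sectors * SECTOR_BYTES
size_mib = size_bytes // (1024 * 1024)   # "M = 1024*1024" figure, truncated
size_mb = size_bytes // (1000 * 1000)    # "M = 1000*1000" figure, truncated

print(size_bytes, size_mib, size_mb)
```

The truncated values match the two hdparm size lines; the 500108 MB in the kernel log is the same number rounded rather than truncated.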

Here are the filtered syslog entries for this drive since I've rebooted.

cat /var/log/syslog|egrep "(sdc|scsi3 :|scsi 3:0:0:0| ata3:| ata3.00:|disk11[^0-9]| md11:)" (Filtered syslog for disk11)

Mar  6 14:34:14 Tower kernel: scsi3 : sata_mv
Mar  6 14:34:14 Tower kernel: ata3: SATA max UDMA/133 mmio m1048576@0xfd200000 port 0xfd226000 irq 16
Mar  6 14:34:14 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar  6 14:34:14 Tower kernel: ata3.00: ATA-7: ST3500641AS, 3.AAD, max UDMA/133
Mar  6 14:34:14 Tower kernel: ata3.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
Mar  6 14:34:14 Tower kernel: ata3.00: configured for UDMA/133
Mar  6 14:34:14 Tower kernel: scsi 3:0:0:0: Direct-Access     ATA      ST3500641AS      3.AA PQ: 0 ANSI: 5
Mar  6 14:34:14 Tower kernel: sd 3:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
Mar  6 14:34:14 Tower kernel: sd 3:0:0:0: [sdc] Write Protect is off
Mar  6 14:34:14 Tower kernel: sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Mar  6 14:34:14 Tower kernel: sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar  6 14:34:14 Tower kernel: sd 3:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
Mar  6 14:34:14 Tower kernel: sd 3:0:0:0: [sdc] Write Protect is off
Mar  6 14:34:14 Tower kernel: sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Mar  6 14:34:14 Tower kernel: sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar  6 14:34:14 Tower kernel:  sdc: sdc1
Mar  6 14:34:14 Tower kernel: sd 3:0:0:0: [sdc] Attached SCSI disk
Mar  6 14:34:14 Tower emhttp: pci-0000:01:00.0-scsi-2:0:0:0 (sdc) ata-ST3500641AS_3PM0DJRK
Mar  6 14:34:14 Tower kernel: md: import disk11: [8,32] (sdc) ST3500641AS                                          3PM0DJRK offset: 63 size: 488386552
Mar  6 14:34:15 Tower kernel: md: import disk11: [8,32] (sdc) ST3500641AS                                          3PM0DJRK offset: 63 size: 488386552
Mar  6 14:34:16 Tower kernel: md: import disk11: [8,32] (sdc) ST3500641AS                                          3PM0DJRK offset: 63 size: 488386552
Mar  6 14:34:16 Tower kernel: md11: running, size: 488386552 blocks
Mar  6 14:34:16 Tower emhttp: shcmd (15): mount -t reiserfs -o noatime,nodiratime /dev/md11 /mnt/disk11  >/dev/null 2>&1
Mar  6 14:34:16 Tower kernel: ReiserFS: md11: found reiserfs format "3.6" with standard journal
Mar  6 14:34:16 Tower kernel: ReiserFS: md11: using ordered data mode
Mar  6 14:34:16 Tower kernel: ReiserFS: md11: journal params: device md11, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Mar  6 14:34:16 Tower kernel: ReiserFS: md11: checking transaction log (md11)
Mar  6 14:34:17 Tower kernel: ReiserFS: md11: Using r5 hash to sort names
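
The egrep filter above can also be sketched in Python, which is handy if the syslog has been copied off the server. This is only an illustration: the pattern is the one from the command above, the function name `filter_syslog` is made up, and the `ata4` sample line is a hypothetical non-matching line added to show the filter dropping something.

```python
import re

# Same alternation as the egrep command above (disk11 on sdc / ata3).
DISK11_PATTERN = re.compile(
    r"(sdc|scsi3 :|scsi 3:0:0:0| ata3:| ata3.00:|disk11[^0-9]| md11:)"
)

def filter_syslog(lines, pattern=DISK11_PATTERN):
    """Return only the lines that mention the drive, as egrep would."""
    return [line for line in lines if pattern.search(line)]

sample = [
    "Mar  6 14:34:14 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    # Hypothetical line for a different port; it should be filtered out.
    "Mar  6 14:34:14 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "Mar  6 14:34:16 Tower kernel: md11: running, size: 488386552 blocks",
]

matched = filter_syslog(sample)
for line in matched:
    print(line)
```

To run it against a real log, pass `open("/var/log/syslog")` as `lines` instead of the sample list.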

Here is my new and most current syslog (with no errors). -- http://pastebin.com/m6eb4197a

Any ideas how to proceed?

Also, this is happening at the worst time. I'm having a poker game tonight and was going to stream movies and music. I'm sure the consensus is to not use the server while a disk is disabled but wanted to check. Is it really that dangerous? Should I definitely not use my server until this is worked out?

Thanks very much.

March 6, 2009

Once unRAID marks the disk as "red" it will not restore it unless you take the following actions.

Stop the array

un-assign the disk marked as red

reboot the server

if the array starts on its own when rebooted, stop it once more

re-assign the disk

start the array once more.

If you go through those steps it will think a new disk was installed and proceed to rebuild it. That will take a number of hours, more if you are serving movies too.

Or... instead of the above, you can use the "Trust My Parity" procedure as described in the wiki here. It sounds like you might have had a loose or defective cable. In that case the "Trust My Parity" procedure might just get you up and running fastest.

If you don't want to do this until after your guests leave, there is no problem with using the server while it has a defective disk. It is exactly why you have the server. You can even play movies on the "defective" disk. It will work just fine.
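
To see why reads from the "defective" disk still work: with single parity, each block on the missing disk can be recomputed from the matching blocks on the other disks plus parity. A minimal Python sketch, assuming plain XOR parity over equal-sized blocks. The disk contents and the names `xor_blocks`, `data_disks`, and `failed` are made up for illustration, and this is not unRAID's actual driver code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, one 3-byte block each.
data_disks = [b"\x10\x20\x30", b"\x0a\x0b\x0c", b"\xff\x00\x55"]
parity = xor_blocks(data_disks)

# Pretend disk index 1 is disabled: read everything except that disk.
failed = 1
survivors = [d for i, d in enumerate(data_disks) if i != failed]

# The missing block is the XOR of the surviving blocks and parity.
emulated = xor_blocks(survivors + [parity])
print(emulated == data_disks[failed])
```

The catch is that every read of the disabled disk touches all the other disks, and a second failure in that state cannot be recovered.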

Joe L.

March 6, 2009

Author

Thanks for the quick reply - especially since I have people arriving soon!

So my plan is to continue to use my server as is tonight to watch movies and listen to music. Great!

If I'm going to rebuild my array, could I use a brand new 1TB that I already have? Or is that not advised until all is working correctly for a while on the old drive?

I kind of understand the Trust My Parity procedure, but all that bold red type at the top makes me wonder if I'm better off just rebuilding. I'm trying to understand it but -- I had a parity check end this morning with 0 errors. I'm sure I was writing to the disks before, during, and after the disk was disabled. I don't know if it was that exact disk though, because I use user shares. Realistically, I don't care if I lose any data between my last parity check and the time this disk was disabled. To make it even more confusing, I use a cache drive. I suppose this means nothing is ever written to the array disks until 3:40am when cron starts up mover.

I'll probably just try to rebuild on that new 1TB. Please let me know if that is a bad idea.

Also, thanks JoeL for putting your response in such an easy to follow format. That is very COOL!

March 6, 2009

The "trust" procedure is lower risk (if you follow the directions properly) than the "rebuild" procedure.

In the "trust" procedure the disk is not updated at all. In the "rebuild" procedure it is. If parity is bad, the rebuild procedure will result in overwriting a good disk with garbage.
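
A toy illustration of that risk, assuming single XOR parity and two made-up data disks. The names `xor_blocks`, `disk_a`, `disk_b`, and `stale_parity` are hypothetical, and this is only a sketch of the idea, not how unRAID implements either procedure.

```python
def xor_blocks(*blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

disk_a = b"movie"            # a healthy disk
disk_b = b"music"            # the disk marked "red" (its contents are actually fine)
good_parity = xor_blocks(disk_a, disk_b)

# Simulate bad parity: flip every bit of the first parity byte.
stale_parity = bytes([good_parity[0] ^ 0xFF]) + good_parity[1:]

# "Rebuild": disk_b is recomputed from the other disk plus parity and overwritten.
rebuilt_from_good = xor_blocks(disk_a, good_parity)
rebuilt_from_stale = xor_blocks(disk_a, stale_parity)

# "Trust": disk_b is left exactly as it is.
trusted = disk_b

print(rebuilt_from_good == disk_b)    # rebuild is only as good as parity
print(rebuilt_from_stale == disk_b)
print(trusted == disk_b)
```

With good parity the rebuild reproduces the disk exactly; with the corrupted parity byte, the rebuilt copy differs from the original in that position, while the trusted disk is untouched.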
