Cable Popped Loose - UPDATE - Errors during Data Rebuild?? - Syslog added - unRAID Server 4.3 [No new topics]

December 16, 2008

Well I expanded my system by adding a drive only to notice that another drive now had a "red dot" next to it after the system was up and running.

I checked my cables and noticed that the SATA cable had popped loose to this drive. Even after hooking the cable back up the drive is still red. Well I searched the forum and came up with this advice from Joe L.

"Before you replace it, you should power down and check the cable to that drive.

If by chance it was loose, you can get unRAID to re-try that drive by going through these steps.

Stop the array

Un-assign the defective drive

go back to the main management page and reboot the server

When it comes back up go to the devices page and attempt to re-assign the drive to the slot.

If it comes back on-line, it was just the cable. "

I did this and indeed the drive then turned blue and allowed me to assign it. Upon starting the array, however, instead of just coming back online, the system started a "Data-Rebuild" on the disk. I expected that it would just re-add the drive since nothing had really failed. Is my system behaving normally? Did I do things correctly?

Thanks in advance and thanks to Joe L. for his original post!

December 16, 2008

You did things correctly, the system is behaving as expected.

By re-booting with the drive un-assigned, it forgot it ever knew it. (it did not forget the data, just the drive's model/serial)

Now that you again rebooted with it connected and re-assigned, it thinks the drive is a "new" drive replacing the old failed one, so it is rebuilding the old data onto it.

Just let it finish the rebuild process. There is a way to force it to think the drive is already valid, but unless you need to use it, just let the process complete. Odds are it will be done by morning.

Joe L.
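
For anyone wondering how a rebuild can recreate a whole disk from the others, here is a minimal sketch. It assumes single parity computed as the byte-wise XOR of all data disks, which is the usual single-parity scheme; the thread does not spell out unRAID's internals, so treat that as an assumption. The disk contents and the helper name `xor_blocks` are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three tiny hypothetical "data disks" (illustrative contents only).
disk1 = b"movie-A!"
disk2 = b"photos.."
disk3 = b"backups1"

# Assumption: parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# If disk2 is treated as a new, empty replacement, its contents can be
# recomputed from parity plus every surviving data disk.
rebuilt_disk2 = xor_blocks([parity, disk1, disk3])
print(rebuilt_disk2 == disk2)
```

The rebuild is only as good as every other disk it reads from, which is why a rebuild needs all remaining drives to be readable.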

December 16, 2008

Author

Thanks for replying so quickly, take care!

December 16, 2008

funny, same thing happened to me two days ago...got a red dot unassigned disk (no wonder the transfer to disk2 was going only 4MB/sec)....i didn't realize while transferring I thought I'd check and the disk was unassigned, but it was still writing to the parity i guess....so I let it finish writing

first i tried just a shutdown and restarted the system, it found the disk again but with a blue dot and asked me to rebuild, which i did, but again it failed after a while and appeared with a red dot unassigned disk....

then i remembered one of Joe's or Rob's old posts that said that the easiest and least expensive thing to check first is the cable, and sure enough, it was...although my cable wasn't loose, i replaced the cable, started the rebuild, and everything was working perfectly this morning

And through all of this, all my data was still available!!

how often do cables fail anyway...?

December 16, 2008

Author

Well, my Data Rebuild worked, but there are ~500 errors on the parity drive. I've never seen this before. Does this mean the rebuild failed? Is my data corrupt? Is the disk bad? I've posted a screenshot. Unfortunately I'm not at home so I can't post a syslog, but I can do so later if it would help.

Any ideas?

December 16, 2008

Syslog definitely needed. And a SMART report for the Parity drive.

Disk is probably not corrupt, but the data integrity of Disk 1 has to be considered suspect, will need testing. Sorry. I would first run a reiserfsck on it, see Check Disk Filesystems. Then browse it a bit, then try comparing files on it with good copies elsewhere, running archive integrity tests, such as zip file testing, etc. Hopefully there will be absolutely nothing wrong.

It's good that you had run a recent parity check, although I'm not sure I can trust the time on it. According to that screen shot, it finished in my future, and after checking your profile 'Local Time', it appears to have finished in your future too! Might want to correct the time on your server, or your profile.

This is one situation where I think it would have been best to have run the Make unRAID Trust the Parity Drive procedure. If you haven't run a parity check recently, then attempting a (forced in this case) data drive rebuild, could uncover a problem on another drive, and result in a failed or corrupted rebuild. Perhaps we should call it the Validate Array procedure. The Validate Array procedure lets you inform unRAID that the array is fully valid. It configures it as valid, then begins a parity check, to see for itself, which is what it should do.

You will need another parity check, to correct any possible problems, but not yet, until someone checks the syslog and SMART report, and understands what the problem with the parity drive was.

December 16, 2008

Author

Ok, thanks, I'll try to get that information when I get home tonight!

December 17, 2008

Author

Ok, here's the syslog (had to split it in two because the size was too large). I hope this allows us to see where the errors are? In the meantime I'll get the rest of the info

Thanks in advance

December 17, 2008

Author

2nd part of syslog

December 17, 2008

Author

reiserfsck results:

"Replaying journal..

Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

Checking internal tree. .finished

Comparing bitmaps..finished

Checking Semantic tree:

finished

No corruptions found

There are on the filesystem:

Leaves 227047

Internal nodes 1369

Directories 219

Other files 432

Data block pointers 229676227 (8 of them are zero)

Safe links 0

##############

reiserfsck finished at Wed . .. .

Ok that was it!

Hope that means something? "No corruptions found" sounds encouraging.

- Also I tried to run Smartctl

I have version 4.2, so I copied the unzipped file to //tower/flash/ and //tower/flash/config (b/c I wasn't sure which was my "root" directory).

My parity drive is "SDF", so on the command line I type:

"/boot/smartctl -a -d ata dev/sdf"

and it says:

"Smartctl open device: dev/sdf failed: No such file or directory"

??

December 17, 2008

"/boot/smartctl -a -d ata dev/sdf"

So close... just needed a slash in front of dev, try this:

/boot/smartctl -a -d ata /dev/sdf

That was a good reiserfsck report. Unfortunately, you have a series of media errors on the parity drive, not good news. You will need to check the SMART report, and make a decision about this drive. As best as I can tell, the parity check took about 9 hours, and the media errors came about 7 hours into it, so that makes it at about the 75% point of the drive. There were no more errors after that cluster of them. Because these read errors are accompanied by handle_stripe read errors, I have to assume that the data rebuild is BAD for that section of the rebuild. I don't see how the data could have been reconstructed correctly for that region. And since the drive was almost full, there are corrupt files.

It may be hard to determine what files are involved. If the files were added incrementally, with no modifications or deletions, then you could *try* to guess what files were added at around the 75% full mark, but this would not be reliable, even in an ideal situation. If files were modified and/or deleted occasionally, then it will be extremely difficult, if not impossible. I'm sorry. This is a worst case situation, where media errors appeared on one drive, while rebuilding another drive.
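
The "about 75%" figure above is just the timing turned into a fraction. A rough sketch of that arithmetic follows. It assumes the check ran at a constant rate (real drives are faster on the outer tracks, so this is only approximate) and assumes a 1 TB drive for the size conversion; both are assumptions, not something read from the syslog.

```python
def fraction_through_disk(hours_until_errors, total_hours):
    """Rough position of the errors, assuming a constant transfer rate."""
    return hours_until_errors / total_hours

fraction = fraction_through_disk(7, 9)   # the ~7 h and ~9 h figures quoted above
approx_offset_gb = fraction * 1000       # assumption: a 1 TB (1000 GB) drive
print(f"about {fraction:.0%} in, near the {approx_offset_gb:.0f} GB mark")
```

That lands in the same ballpark as the estimate above, but it only narrows down a region of the disk, not which files live there.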

December 18, 2008

Author

Ahh, Thanks Rob,

The new SMART command worked and I get the following:

"SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

No self-tests have been logged. [to run self-tests, use: smartctl -t]"

Is that it? Do I need to run a self-test or does the log keep track of all errors throughout the life of the drive?

Also with ~500 errors in a row, on the drive, is it possible to know approximately how much data has been affected? 1 gb? 10gb?, 100gb? etc. . ?

Also which drive has the faulty media? Do I need to search the whole system or just the 1tb drive that I was "Data - Rebuilding"?

Thanks so much!

December 18, 2008

That is only a small part of the SMART report, there should have been much more, especially an attribute table. Read the Troubleshooting#Hard drive failures section, where it mentions the smartctl command and how to copy it to a file. Or just use the following command, which saves the SMART report to a file called smart_sdf.txt on your flash drive.

/boot/smartctl -a -d ata /dev/sdf >/boot/smart_sdf.txt

It is the Parity drive that had the media errors, no other. Disk 1 is the drive that probably has corrupted files. I think the corrupted data is less than a gigabyte, probably much less, but it was not a single contiguous section. It was a series of close sections, that may have been evenly spaced. That makes me think the problem might be something like a small scratch on one platter. If I had more time, I think I could be more specific as to quantities of sectors involved, and how they are spaced.

That "No errors logged" really surprises me, need to see the SMART attributes.

A SMART long self-test may be a good idea, but the array will need to be down for awhile, perhaps close to 4 hours. The Troubleshooting link above discusses using the long test.

December 18, 2008

Author

Ok, once again nice instructions!

Here is the SMART report

What do you think?

December 18, 2008

There are 4 lines in that SMART report that immediately stand out to me.

The first two:

SMART overall-health self-assessment test result: FAILED!

Drive failure expected in less than 24 hours. SAVE ALL DATA.

Then, this line indicates that 904 sectors were already re-allocated by the SMART firmware when un-readable sectors were subsequently written.

5 Reallocated_Sector_Ct 0x0033 087 087 140 Pre-fail Always FAILING_NOW 904

Lastly, there are 36 other sectors that are un-readable and waiting for a subsequent "write" so they too can be re-allocated.

197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 36

I think you should RMA this drive... as soon as possible, maybe sooner.

Joe L.
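
For readers who want to check those two attribute rows themselves, here is a small sketch that splits a row of the smartctl attribute table into fields and compares the normalized value against the threshold. The column order is taken from the two rows quoted above; the function names are hypothetical, and the integer conversion of the raw value is an assumption that holds for these two rows but not necessarily for every attribute.

```python
def parse_attribute(line):
    """Split one row of the smartctl attribute table into named fields."""
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),    # normalized value, e.g. 087
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "type": fields[6],
        "when_failed": fields[8],
        "raw": int(fields[9]),      # assumes a plain integer raw value
    }

def below_threshold(attr):
    """True when the normalized value is at or below a non-zero threshold."""
    return attr["thresh"] > 0 and attr["value"] <= attr["thresh"]

rows = [
    "5 Reallocated_Sector_Ct 0x0033 087 087 140 Pre-fail Always FAILING_NOW 904",
    "197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 36",
]
attrs = [parse_attribute(r) for r in rows]
for a in attrs:
    print(a["name"], a["raw"], "FAILING" if below_threshold(a) else "ok")
```

Note that the pending-sector row does not trip its threshold (it is 000), so a raw count like 36 still has to be judged by eye.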

December 18, 2008

Joe could not have said it better, or plainer! If there's any good news, it's your parity drive, so you don't have to worry about saving data from it. Just replace that drive as fast as possible, and it will rebuild the parity info on the new drive. This SMART report should be all the evidence you need to get it replaced, if you still have the receipt or invoice.

Oh, and don't waste any time with further tests. It's failed!

December 18, 2008

Author

Thanks!

Why mess around, right? After all it is under warranty. WD will even ship a drive to you first before you ship the bad one back! Cool. The RMA is on the way (first time I've ever had to do this). Good news is the files were all movies, so if one of them is corrupt I can just re-upload it.

So I should be good to go if I replace the parity drive and then just rebuild the parity (minus the few files that are likely corrupt on my Disk 1)?

December 18, 2008

As best as we can tell, yes.

December 18, 2008

This is the poster child of a bad disk! I'd consider myself lucky that it was able to rebuild the disk with only 500 errors! My guess is that the scope of the error would be way less than a gig, and likely confined to 1 or 2 files. It is likely right in the middle of the most exciting/juicy part of your favorite movie - the one that is out of print and you lent to your brother in law last xmas and he lost.

I cringe every time someone does what you did (a disk gets a red ball and you rebuild over top of it). (I realize it is the recommended procedure, but there is now a MUCH better alternative.) This is likely THE most exposed you will ever be with unRAID. If the rebuild fails, you have NO opportunity to recover your data in another way. And if you get even a tiny blip in the process, you have NO IDEA what file is affected. I do not have this concern when upsizing a disk, as you still have the original disk if the upsize fails. Even rebuilding parity is not as bad - because if something does go wrong at least you would be able to figure out what files are corrupted. I would INFINITELY recommend using the trust my array technique. Reserve rebuilding data drives as a technique of last resort (i.e., the data drive actually failed).

I created a plugin for unmenu called myMain. One of the features is a "smart" view that shows you all (or most) of the smart data for all of your drives in an easy to use, color coded, display. I would recommend that everyone use this from time to time (especially before attempting to rebuild a disk).

I took your smart report and loaded it into the tool, and here is how that drive would have displayed.

The red "X" in "Overall" means that the drive failed the overall test. With an X here you will have no problems getting the drive replaced through RMA if it is under warranty.

The reallocated sector count (904) and the current pending sector (36) are the issues that Joe L. reported. These are very serious.

Reallocated sectors mean that the drive detected a bad spot on the disk and was able to correct the problem by mapping a "spare" sector. A reallocated sector count of 1 or 2 is pretty common and not much to be concerned about. Higher than that is likely okay as long as the number stays constant and is not getting larger after each parity check. But once you go over 10 I'd start to get worried. Certainly over 100 and your drive is not to be trusted.

The current pending sector indicates sectors that are bad but have not been remapped. unRAID tends to force the issue and cause the remaps to occur (a good thing). If you have pending sectors it might mean that there are no more spare sectors available for remapping purposes, or other serious problems. If you have current pending sectors that don't go away after a parity check, I'd be concerned.
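
The rules of thumb in the last two paragraphs can be written down as a small sketch. The cut-offs (2, 10, 100) are taken straight from the text above and are one poster's comfort levels, not a manufacturer specification; the function names are hypothetical.

```python
def reallocated_rating(count):
    """Map a raw Reallocated_Sector_Ct to the rough comfort levels above."""
    if count <= 2:
        return "common, not much of a concern"
    if count <= 10:
        return "likely okay if it stays constant between parity checks"
    if count <= 100:
        return "start to worry"
    return "not to be trusted"

def pending_is_concerning(pending_after_parity_check):
    """Pending sectors that survive a parity check are the worrying kind."""
    return pending_after_parity_check > 0

print(reallocated_rating(904))       # the parity drive from this thread
print(pending_is_concerning(36))
```

The more useful signal is the trend: record the raw counts before and after each parity check and watch whether they grow.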

The ata_error_count is further articulated in the smart report. Details of the last 5 are presented. Although you would think that this would be the worst attribute, it is actually not. Quite often a cabling problem or other installation-time issue will cause the drive to get unrecognized commands which increment this count. If you look at the details of the last 5, it will tell you how old the drive was when the error occurred. If you had 74 errors but the date of the last error was 1000s of power-on hours ago, likely not a big deal. However, your most recent error was at 6176 "power on hours", and the report was taken at 6216 "power on hours". So 40 power on hours ago you got a series of "UNC" errors. A quick google gives a meaning ...

(UNC) Uncorrectable Data: An ECC error in the data field could not be corrected (a media error or read instability).

* * * * *

If you DO figure out what file(s) got corrupted, DO NOT DELETE THEM. Just post a note to the forum and we'll figure out how to get a copy of the file. It would be useful to look at them and see if there is some perceptible anomaly in the file (like huge blocks of binary zeros). If there is, it may be possible to create a tool for future users to run to try and come up with a list of possibly corrupted files.
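
As a starting point for the kind of tool mentioned above, here is a minimal sketch that scans a file for long runs of all-zero blocks. Whether a bad rebuild actually leaves zeros behind is only a guess at this point, and legitimate files can contain long zero runs too, so a hit is a hint and nothing more. The function name, block size and minimum run length are all arbitrary assumptions.

```python
import os
import tempfile

def zero_block_runs(path, block_size=4096, min_blocks=16):
    """Return (start_offset, length) for each long run of all-zero blocks."""
    runs = []
    run_start = None
    offset = 0
    zero = bytes(block_size)
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            if block == zero[:len(block)]:
                if run_start is None:
                    run_start = offset
            else:
                if run_start is not None and offset - run_start >= min_blocks * block_size:
                    runs.append((run_start, offset - run_start))
                run_start = None
            offset += len(block)
    # Close a run that reaches the end of the file.
    if run_start is not None and offset - run_start >= min_blocks * block_size:
        runs.append((run_start, offset - run_start))
    return runs

# Demo on a throwaway file: 8 KiB of data, 128 KiB of zeros, 8 KiB of data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 8192 + bytes(131072) + b"\x01" * 8192)
demo_runs = zero_block_runs(tmp.name)
os.remove(tmp.name)
print(demo_runs)
```

Run against a suspect movie file, any reported offsets would at least say where in the file to start looking.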

December 18, 2008

Author

I want to thank you all for your help. I've learned a lot.

The good news is there is probably one or two movies that are corrupt, but those are EASY to replace. (I'm just glad it wasn't the Home movies or Personal Photos etc.)

I'll definitely use the "trust my array" mode, that makes more sense. I honestly didn't know that it was going to rebuild my drive when I plugged it back in and reassigned it.

But all in all, not much damage here, so no worries!

Thanks again for your efforts!

December 19, 2008

I have a red dot next to a drive and am not sure how to proceed. In unMENU it is labeled as "DISK_DSBL" in bold red type. Does anybody please know how to proceed?

Here is what I've researched so far --

Using myMain (thanks bjp999) it would appear that nothing catastrophic has occurred. However, I do question what is up with the bottom part of the chart where there is a big red X next to Overall health of SAMSUNG SP2504C for disk "Failing now.txt". I don't have any Samsung drives. It seems as though there is a parsing error with my output. It would appear that my physical drives are not immediately failing as they all have a black check next to them but I'm not sure if this view is working correctly or not.

Looking at the SMART summary (attached smart_summary.txt) the disabled disk (serial # WD-WCAU40326374) shows a SMART overall-health self-assessment test result: PASSED. It does show UDMA_CRC_Error_Count of 97 which I read in these forums somewhere that a loose or bad cable could cause this. I have opened my server and wiggled around some of the wires but none seemed loose. I have Chenbro 3*5 backplanes and also ensured that the power leads were in tight. It didn't appear that any wires were loose. I have not gone through and replaced any of the wires because they are wire tied together (not too tight) for routing and cooling. It would be a pain, but totally possible, to re-wire the system.

The error in the logs from my various disks happened thousands of hours ago so I think I can discount those for now.

I have some Reallocated_Sector_Ct RAW_VALUEs but they are usually 1 or 2 and at the most 8 so I think I can discount those for now.

I have Current_Pending_Sector value of 1 on a disk (not the disabled one). I'm not sure if this is critical. I think I need to get to a point where I can do a parity check and see if this value changes to 0.

I suspect that everything is fine and I should do the "trust my array" process to re-enable the disabled drive. Then do a parity check and hope for the best. Is that the best course of action?

Thanks for any suggestions.

Syslog is attached and there didn't appear to be any red entries in unMENU.

Last parity check: Sun Nov 30 21:05:00 Local time zone must be set--see zic manual page 2008 finding 0 errors.

December 19, 2008

Great post. You've done everything right. Here are my comments and suggestions:

1. Do not worry about that "Failing now.txt" line. The smart view looks in a directory for text files containing smart reports. (If you see one posted you can copy it to that directory and see it loaded in the tool.) I included a few in the delivery so that folks could see what a failed drive would look like loaded in the tool. That drive is NOT in your system!

2. Your various and sundry smart attribute issues are worthy of study and consideration. I don't want to distract attention from resolving the more critical issue of getting your disabled disk back on line. One that surprises me is the single pending relocation. I'll be interested in seeing if that sector actually gets remapped after we're done.

3. Your current situation: The array is operational and is using the parity and other disks to simulate the failed disk. Unless you notice the red ball or the "DSBL" status in unMenu or myMain, you can operate in this mode for a very long time without even knowing. You can even write to the disabled disk. It will not actually write to the failed disk, but will update parity in such a way that the simulated disk will be "updated". If another disk were to fail, unRAID will not be able to operate in this manner and you would lose data.

4. The $64k question is when did this disabling occur and have you written anything to that disk since the disabling? If you can say with confidence that no data has been written to the disk, that is a good thing and will likely push you towards the "trust my parity" model. But if you have done a lot of updates, that is not the right choice. Note that it doesn't matter if you have written to OTHER disks in the array, the question is only about disk13.

NOTE: If you are not sure, there may be some way to mount the disk13 drive outside the array and see what is on it. I will leave to one of the Linux experts to see if that is a feasible thing to do.

5. The current state gives you an opportunity to backup any critical data off of the simulated disk13 which you can copy to another disk on the array, one of the non-array disks attached to the server, or to an attached workstation. If you have anything that you would consider of a critical nature on that disk, I would take the opportunity to back them up now. If you want to do the "trust" procedure but know you have written a limited amount of data to disk13, you could also backup just the new data and expect significant parity sync issues during the parity check.

6. The final question to ask is why disk13 was disabled. Disks are disabled when a write to that disk fails. This is often caused by a cabling issue on either the data cable or power cable. Since you are able to run smart reports and there is nothing too nasty in the smart log, I believe that the drive is being recognized and is responsive. I had a disk fail recently and attempts to get it to trust the drive were not successful. The disk would re-enable but then get disabled very soon after the parity check started. Like you, I was able to run smart reports and they looked clean. I wound up replacing it and did a drive rebuild and did not lose any data. (This was an old drive anyway so no huge loss).

7. My recommendation depends a lot on the answer to question #4:

a. If you have not written any data to disk13 since it was disabled, I would do the trust my parity procedure. (Backup critical data first!) It will re-enable the disk and start a parity check. Let the parity check run to completion. A few (maybe up to several hundred) sync errors early on can be expected, but if you continue to get lots of them that likely means that you wrote to the disk more than you had thought. This is not a huge problem. When it completes you will have lost any data written to the disk since the disabling occurred, but everything should be good from an array integrity perspective. I'd likely run several parity checks over the next week to convince myself that all was well.

b. If you have written data to disk13, I might suggest rebuilding onto a spare disk or replacement disk. You can then do the data REBUILD procedure onto the new disk. With the exception of that one sector with a current pending remap, the disk should rebuild. (You could leave it that way, or you could put your disk13 back into the array and attempt to rebuild back onto that original disk. If there were some problem, you'd have the spare disk and could put it back in and do the trust procedure and not lose any data.)

c. If you don't have a spare disk, you can rebuild directly on top of the existing disk13. I just don't like to do that because you'd have zero options if something goes wrong.

Make sense? Please feel free to post back with followup questions ...
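
The choice in item 7 boils down to two questions, sketched below. This is only a restatement of options a, b and c from this post in code form, with a hypothetical function name; it does not cover the "not sure whether anything was written" case, which still needs a human look at the disk.

```python
def suggest_recovery(written_since_disabled, have_spare_disk):
    """Return the option (7a/7b/7c above) that matches the situation."""
    if not written_since_disabled:
        return "7a: back up critical data, run the trust procedure, then let the parity check finish"
    if have_spare_disk:
        return "7b: rebuild onto a spare or replacement disk, keeping the original untouched"
    return "7c: rebuild on top of the existing disk (no fallback if something goes wrong)"

print(suggest_recovery(written_since_disabled=False, have_spare_disk=False))
print(suggest_recovery(written_since_disabled=True, have_spare_disk=True))
```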

UPDATE: I just took a look at the syslog. The following lines are from the most recent boot that occurred on Dec 12:

Dec 12 20:11:29 Tower kernel: ata15.00: status: { DRDY }
Dec 12 20:11:29 Tower kernel: ata15: hard resetting link
Dec 12 20:11:34 Tower kernel: ata15: link is slow to respond, please be patient (ready=0)
Dec 12 20:11:39 Tower kernel: ata15: COMRESET failed (errno=-16)
Dec 12 20:11:39 Tower kernel: ata15: hard resetting link
Dec 12 20:11:44 Tower kernel: ata15: link is slow to respond, please be patient (ready=0)
Dec 12 20:11:49 Tower kernel: ata15: COMRESET failed (errno=-16)
Dec 12 20:11:49 Tower kernel: ata15: hard resetting link
Dec 12 20:11:55 Tower kernel: ata15: link is slow to respond, please be patient (ready=0)
Dec 12 20:12:24 Tower kernel: ata15: COMRESET failed (errno=-16)
Dec 12 20:12:24 Tower kernel: ata15: limiting SATA link speed to 1.5 Gbps
Dec 12 20:12:24 Tower kernel: ata15: hard resetting link
Dec 12 20:12:30 Tower kernel: ata15: COMRESET failed (errno=-16)
Dec 12 20:12:30 Tower kernel: ata15: reset failed, giving up
Dec 12 20:12:30 Tower kernel: ata15.00: disabled

So the disk has been disabled for at least a week. Without older syslogs, I don't think there is any way to tell when the disabling occurred. (Joe L., would looking at the datetime stamp on the super.dat file tell us anything?)
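
If you have a long syslog and want to find lines like these without scrolling, a small sketch along these lines may help. The regular expression is written against the exact line format quoted above and is an assumption about how other kernels or syslog setups format the same message. It only shows when the kernel gave up on the link during that boot, not when unRAID first marked the disk as disabled.

```python
import re

# A few of the lines quoted above, copied verbatim.
syslog = """\
Dec 12 20:12:24 Tower kernel: ata15: limiting SATA link speed to 1.5 Gbps
Dec 12 20:12:24 Tower kernel: ata15: hard resetting link
Dec 12 20:12:30 Tower kernel: ata15: COMRESET failed (errno=-16)
Dec 12 20:12:30 Tower kernel: ata15: reset failed, giving up
Dec 12 20:12:30 Tower kernel: ata15.00: disabled
"""

DISABLED = re.compile(r"^(\w{3}\s+\d+ [\d:]{8}) \S+ kernel: (ata\d+)\.\d+: disabled$")

def disabled_ports(text):
    """Return (timestamp, port) for every 'ataN.NN: disabled' kernel line."""
    hits = []
    for line in text.splitlines():
        m = DISABLED.match(line)
        if m:
            hits.append((m.group(1), m.group(2)))
    return hits

print(disabled_ports(syslog))
```

To use it on a real log, read the saved syslog file into a string and pass that to `disabled_ports` instead of the sample text.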

December 19, 2008

I noticed my drive had failed while actually writing about 150GB to the disk (which it was actually writing to the parity since the drive had been disabled, but i didn't know if it failed between the write or before i started writing to it, so i let it finish writing to the parity)....it was fixed by replacing the cable and doing a rebuild, but now you say that might have been a bad idea...?

I don't understand why though, you say if the rebuild fails, you would have no opportunity to recover the data? The parity isn't affected when rebuilding (doesn't it treat the disk just as if you were inserting a new disk and rebuilds based on the information on the parity?...it doesn't rebuild the parity, but the failed drive).

The first time i did the rebuild (without changing the cable), the rebuild failed again with a red dot on the same drive. I then replaced the cable and started the rebuild again, and everything was fine, 0 errors on the rebuild, and all the data was fine...

On a side note, the failed disk (red dot) had a blue dot next to it both times:

After my first shutdown/restart, had a blue dot, started rebuild, rebuild failed with red dot failed disk.

After fixing the cable, restarted, had a blue dot, started rebuild, rebuild succeeded with 0 errors and valid parity.

I would think that a rebuild would cause loss of data/or loss of a whole disk if the failed drive appeared green after fixing the cable, as if nothing was wrong, and unraid thought the parity was invalid based on that data, and at that time of rebuilding the parity, the data disk failed again, so now you have no parity information to depend on. On the other side, if the rebuild went through fine, but you had written data to the failed disk (which was actually the parity), the written data would be lost b/c now the parity is based on the information on the drive.

So i guess i got lucky and the disk came back with a blue dot both times (since it was rebuilding the data disk and not the parity)?? Otherwise I would have lost all the newly written data to the parity or worse (if the data disk failed again) lost all my data!

December 19, 2008

I noticed my drive had failed while actually writing about 150GB to the disk (which it was actually writing to the parity since the drive had been disabled)....it was fixed by replacing the cable and doing a rebuild, but now you say that might have been a bad idea...?

I don't understand why though, you say if the rebuild fails, you would have no opportunity to recover the data? The parity isn't affected when rebuilding (doesn't it treat the disk just as if you were inserting a new disk and rebuilds based on the information on the parity?...it doesn't rebuild the parity, but the failed drive).

The first time i did the rebuild (without changing the cable), the rebuild failed again with a red dot on the same drive. I then replaced the cable and started the rebuild again, and everything was fine, 0 errors on the rebuild, and all the data was fine...

On a side note, the failed disk had a blue dot (assumed a new disk to replace failed drive) next to it. I would think that a rebuild of the parity drive would cause you to lose data if the failed drive appeared green after fixing the cable, and it thought the parity was invalid, but for me it treated the disk as a replacement disk and performed the rebuild on that disk, so my parity was not affected.

You are 100% right - that if a data rebuild fails you'd be able to try again with a different disk. No data loss there.

I guess what I was saying is that if you have a disk that gets disabled due to a loose cable, the disk is perfect. All of the files are intact. If you build over top of it, and there are disk errors (bad sectors) on any other disk in the array, your rebuild will not be perfect. This happened to the OP. If he had rebuilt onto a different disk, he'd still have his original and be able to figure out what files got corrupted.

The other issue is that if a drive happened to fail in the middle of the drive rebuild, you could find yourself in a very difficult position trying to recover data.

I agree with you that doing a drive rebuild over top of the existing disk is not terribly high risk, but it is riskier than the other available alternatives.

December 19, 2008

Thank you so much for the reply, bjp999. Having some direction from experienced users is a much less stressful way to get through these hiccups.

3. Your current situation: The array is operational and is using the parity and other disks to simulate the failed disk. Unless you notice the red ball or the "DSBL" status in unMenu or myMain, you can operate in this mode for a very long time without even knowing. You can even write to the disabled disk. It will not actually write to the failed disk, but will update parity in such a way that the simulated disk will be "updated". If another disk were to fail, unRAID will not be able to operate in this manner and you would lose data.
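To make the simulation described in point 3 concrete, here is a minimal Python sketch. It assumes single XOR parity over equal-sized blocks; the disk contents and the helper name `xor_blocks` are made up for illustration, and this is not unRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy "data disks" of four bytes each (made-up contents).
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x0F, 0x1E, 0x2D, 0x3C])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# disk3 gets disabled: its contents are simulated from parity plus the others.
simulated_disk3 = xor_blocks([parity, disk1, disk2])

# A write to the disabled disk touches only parity...
new_disk3 = bytes([0xDE, 0xAD, 0xBE, 0xEF])
parity_after_write = xor_blocks([disk1, disk2, new_disk3])

# ...yet the simulated disk now reads back the new data.
simulated_after_write = xor_blocks([parity_after_write, disk1, disk2])
```

The same arithmetic shows why a second failure is fatal in this model: with two blocks missing, one parity block is not enough to solve for both.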

I use the email notification script so that I won't go too long without knowing that something has failed. It works great. In fact, here is some more data.

In checking my gmail log -- this is the last email notification of "unRAID Status: unRaid is OK".

date Fri, Dec 12, 2008 at 11:43 AM
subject unRAID Status: unRaid is OK

mailed-by gmail.com

This message is a status update for unRAID Tower

-----------------------------------------------------------------

Server Name: Tower

Status: unRaid is OK

Date: Fri Dec 12 19:43:22 Local time zone must be set--see zic manual page 2008

Disk SMART Health Status

-----------------------------------------------------------------

Parity Disk PASSED (DiskId: ata-MAXTOR_STM310003_9QJ0G5RG)

Disk 1 Spun-Down (DiskId: ata-WDC_WD10EACS-00D_WD-WCAU40229133)

Disk 2 Spun-Down (DiskId: ata-WDC_WD10EACS-00D_WD-WCAU40196412)

Disk 3 PASSED (DiskId: ata-ST3750640AS_5QD03RVW)

Disk 4 Spun-Down (DiskId: ata-ST3750640AS_5QD03GC1)

Disk 5 Spun-Down (DiskId: ata-_)

Disk 6 Spun-Down (DiskId: ata-ST3750640AS_5QD02WEA)

Disk 7 Spun-Down (DiskId: ata-ST3750640AS_5QD02SPD)

Disk 8 Spun-Down (DiskId: ata-HDS725050KLA360_KRVN65ZAGLVN7F)

Disk 9 Spun-Down (DiskId: ata-WDC_WD10EACS-00D6B0_WD-WCAU40326317)

Disk 10 Spun-Down (DiskId: ata-HDS725050KLA360_KRVN65ZAGR2S7F)

Disk 11 Spun-Down (DiskId: ata-ST3500641AS_3PM0DJRK)

Disk 12 PASSED (DiskId: ata-WDC_WD10EACS-00D6B0_WD-WCAU40183630)

Disk 13 Spun-Down (DiskId: ata-WDC_WD10EACS-00D6B0_WD-WCAU40326374)

diskNumber.13=13

diskName.13=md13

diskState.13=7

diskSize.13=976762552

diskModel.13=WDC WD10EACS-00D6B0

diskSerial.13= WD-WCAU40326374

diskNumReads.13=3077549

diskNumWrites.13=3074938

diskNumErrors.13=0

diskId.13=ata-WDC_WD10EACS-00D6B0_WD-WCAU40326374

rdevNumber.13=13

rdevStatus.13=DISK_OK

rdevName.13=sdn

rdevSize.13=976762552

rdevModel.13=WDC WD10EACS-00D6B0

rdevSerial.13= WD-WCAU40326374

rdevId.13=ata-WDC_WD10EACS-00D6B0_WD-WCAU40326374

And here is the next email where "unRAID Status: Array fault".

date Fri, Dec 12, 2008 at 12:13 PM
subject unRAID Status: Array fault

mailed-by gmail.com

This message is a status update for unRAID Tower

-----------------------------------------------------------------

Server Name: Tower

Status: The unRaid array needs attention. One or more disks are disabled or invalid.

Date: Fri Dec 12 20:13:48 Local time zone must be set--see zic manual page 2008

Disk SMART Health Status

-----------------------------------------------------------------

Parity Disk PASSED (DiskId: ata-MAXTOR_STM310003_9QJ0G5RG)

Disk 1 PASSED (DiskId: ata-WDC_WD10EACS-00D_WD-WCAU40229133)

Disk 2 PASSED (DiskId: ata-WDC_WD10EACS-00D_WD-WCAU40196412)

Disk 3 PASSED (DiskId: ata-ST3750640AS_5QD03RVW)

Disk 4 PASSED (DiskId: ata-ST3750640AS_5QD03GC1)

Disk 5 PASSED (DiskId: ata-_)

Disk 6 PASSED (DiskId: ata-ST3750640AS_5QD02WEA)

Disk 7 PASSED (DiskId: ata-ST3750640AS_5QD02SPD)

Disk 8 PASSED (DiskId: ata-HDS725050KLA360_KRVN65ZAGLVN7F)

Disk 9 PASSED (DiskId: ata-WDC_WD10EACS-00D6B0_WD-WCAU40326317)

Disk 10 PASSED (DiskId: ata-HDS725050KLA360_KRVN65ZAGR2S7F)

Disk 11 PASSED (DiskId: ata-ST3500641AS_3PM0DJRK)

Disk 12 Not-Reported (DiskId: ata-WDC_WD10EACS-00D6B0_WD-WCAU40183630)

Disk 13 PASSED (DiskId: ata-WDC_WD10EACS-00D6B0_WD-WCAU40326374)

diskNumber.13=13

diskName.13=md13

diskState.13=4

diskSize.13=976762552

diskModel.13=WDC WD10EACS-00D6B0

diskSerial.13= WD-WCAU40326374

diskNumReads.13=4454497

diskNumWrites.13=4451647

diskNumErrors.13=240

diskId.13=ata-WDC_WD10EACS-00D6B0_WD-WCAU40326374

rdevNumber.13=13

rdevStatus.13=DISK_DSBL

rdevName.13=sdn

rdevSize.13=976762552

rdevModel.13=WDC WD10EACS-00D6B0

rdevSerial.13= WD-WCAU40326374

rdevId.13=ata-WDC_WD10EACS-00D6B0_WD-WCAU40326374
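For anyone comparing two of these notification emails by eye, a small Python sketch like the one below can list only the fields that changed. The function names are hypothetical and the sample text is a subset of the disk13 lines from the two emails above; it assumes the status block is plain `key=value` lines and ignores everything else.

```python
def parse_status(text):
    """Parse 'key=value' lines into a dict, ignoring any other lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields

def changed_fields(before, after):
    """Return {key: (old, new)} for every key whose value differs."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}

# Subset of the disk13 fields from the "unRaid is OK" email.
ok_report = """
diskState.13=7
diskNumReads.13=3077549
diskNumWrites.13=3074938
diskNumErrors.13=0
rdevStatus.13=DISK_OK
rdevName.13=sdn
"""

# The same fields from the "Array fault" email half an hour later.
fault_report = """
diskState.13=4
diskNumReads.13=4454497
diskNumWrites.13=4451647
diskNumErrors.13=240
rdevStatus.13=DISK_DSBL
rdevName.13=sdn
"""

changes = changed_fields(parse_status(ok_report), parse_status(fault_report))
for key, (old, new) in changes.items():
    print(f"{key}: {old} -> {new}")
```

On these two reports the differences are the state, the read/write counters, the error count (0 to 240) and `rdevStatus` (DISK_OK to DISK_DSBL), which places the failure between the 11:43 AM and 12:13 PM emails.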

4. The $64k question is when did this disabling occur and have you written anything to that disk since the disabling? If you can say with confidence that no data has been written to the disk, that is a good thing and will likely push you towards the "trust my parity" model. But if you have done a lot of updates, that is not the right choice. Note that it doesn't matter if you have written to OTHER disks in the array, the question is only about disk13.

I cannot confidently say that there were no writes to this disk since disabling. I added this disk because my array was so full and was in the process of moving lots of files to it around the time of the failure. I would probably feel better working off the assumption that writes were in process. I have however been very careful to not write anything to the array since the error.

NOTE: If you are not sure, there may be some way to mount the disk13 drive outside the array and see what is on it. I will leave to one of the Linux experts to see if that is a feasible thing to do.

I think I can do this if necessary. I'm pretty sure that since the drive is disabled, I can pull it out, attach it to my Windows workstation with a USB-SATA dock, and then load the Reiser driver on Windows and inspect the disk. Do I have to power off the array first, then pull the drive? Should I avoid powering on the array without the drive in place, or does this not matter since it's disabled? I think I want to ensure that the array does not "forget" about this drive until I'm ready.

5. The current state gives you an opportunity to backup any critical data off of the simulated disk13 which you can copy to another disk on the array, one of the non-array disks attached to the server, or to an attached workstation.

What is the best way to do this? Should I use MC and do a disk13 to diskX copy? Or is it better to pull the data from a Windows machine through the network? I'm leaning towards the network solution so I can keep array activity to a minimum until this is solved and I am fully protected.

b. If you have written data to disk13, I might suggest rebuilding onto a spare disk or replacement disk. You can then do the data REBUILD procedure onto the new disk. With the exception of that one sector with a current pending remap, the disk should rebuild. (You could leave it that way, or you could put your disk13 back into the array and attempt to rebuild back onto that original disk. If there were some problem, you'd have the spare disk and could put it back in and do the trust procedure and not lose any data.)

So this seems like the best bet for me.

So if I understand correctly, my safest course of action would be to --

1. Back up important files from the disk13 share over the network to my Windows workstation

2. Shut down the array

3. Replace disk13 with a new hard drive

4. Start the array

5. Stop the array

6. Assign my new drive to the array in the devices tab

7. Check the box next to Start the Array

Then the array should rebuild disk13's data onto the new drive from parity. If something happens, I still have the old failed drive as a backup.

8. Inspect new disk13 contents on the rebuilt new drive and replace any missing files from the backup I took in step 1
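For step 8, one way to check the rebuilt disk against the step 1 backup is to compare file checksums. The following is only a sketch under my own assumptions: both locations are reachable as ordinary directories from the workstation, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path

def file_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                # Read in 1 MiB chunks so large media files fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests

def compare_backup_to_rebuild(backup_root, rebuilt_root):
    """Return (missing, different): files absent from, or altered on, the rebuild."""
    backup = file_digests(backup_root)
    rebuilt = file_digests(rebuilt_root)
    missing = sorted(set(backup) - set(rebuilt))
    different = sorted(name for name in set(backup) & set(rebuilt)
                       if backup[name] != rebuilt[name])
    return missing, different
```

It could be called as `compare_backup_to_rebuild(r"D:\disk13_backup", r"\\tower\disk13")`, where both paths are placeholders for wherever the backup and the rebuilt share actually live. Files that exist only on the rebuilt disk are not reported, since only the backed-up files are being checked.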

Then I could use the old disk13 to replace an older smaller disk in the array at a later date -- after everything has been running smoothly for a while.

Would this be the best plan of attack to get my array back on solid ground?

Thanks again for all the insight.
