Error or no error?

November 20, 2008

Good morning,

after upgrading disk11 from 750GB to 1TB I found the following this morning.

In the command area 1880 errors were reported but there is no error in the error column. What does this mean? Can the array be trusted or not?

Please note that the additional writes did take place after the upgrade. When I first noticed the errors all values in the write column reported "8" (except parity and disk11).

Thanks

Harald

November 20, 2008

I can help you understand the mechanisms in unRAID, but your question about whether the array can be trusted can't be answered with the info we have to date.

There is a difference between the data in the array being correct and the array reporting no parity errors. The following examples will help explain ...

1A. Let's just say that an evil practical joker took one of the drives out of your array and wrote random garbage to 100 sectors on your Movies disk and put it back in your array without you knowing. The next day you just happen to run a parity check. Parity would show 100 "sync" errors. You then run another parity check - you get 0 sync errors. Why? What unRAID does when it gets a sync error is to trust the data disks, and update parity. So for each of the 100 sync errors, parity got updated. (In this sense, calling it a sync error is confusing - it should say something like "100 sync errors had parity recomputed"). Since the data was the thing that was wrong, this is not what you would want it to do. unRAID is going to update PARITY to make the parity calculation work. SO IN THIS EXAMPLE, PARITY IS CORRECT BUT THE DATA IS STILL WRONG. Now let's say that the evil prankster confesses his evil deed and REVERSES the damage he did. You then run a parity check. unRAID will again find 100 sync errors and again update parity for each one. But now these "sync errors" are your friends. They actually update your parity in a good way!

2A. Now let's change the example a bit. Instead of the evil joker updating 100 sectors, let's say he physically damaged 100 sectors. This time when you run parity check, unRAID will get 100 read errors (the error column will show 100) and 100 sync errors. Each of these will cause unRAID to do a good thing. It will read corresponding sectors from each of the other disks and recompute what it should have read from the damaged disk. It will then WRITE that sector back to the damaged sector. The disk does something very helpful here, it will "REMAP" the damaged sector (take the damaged sector out of service and bring in a spare sector in its place). In the end, all will be well. Your data is correct and your parity is correct. unRAID is designed for this!
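To make scenarios 1A and 2A concrete, here is a minimal Python sketch. It assumes single parity computed as a byte-wise XOR across the data disks; the function names and the toy 4-byte "sectors" are made up for illustration, and this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_check(data_disks, parity):
    """Scenario 1A: trust the data disks, count mismatches, rewrite parity."""
    sync_errors = 0
    for sector in range(len(parity)):
        expected = xor_blocks([disk[sector] for disk in data_disks])
        if parity[sector] != expected:
            sync_errors += 1
            parity[sector] = expected  # parity is updated, data is left alone
    return sync_errors


def reconstruct(data_disks, parity, bad_disk, sector):
    """Scenario 2A: rebuild one unreadable sector from parity plus the other disks."""
    others = [d[sector] for i, d in enumerate(data_disks) if i != bad_disk]
    return xor_blocks(others + [parity[sector]])


# Three toy data disks with four 4-byte sectors each (illustrative values only).
disks = [[bytes([d * 16 + s] * 4) for s in range(4)] for d in range(3)]
parity = [xor_blocks([disk[s] for disk in disks]) for s in range(4)]

# 2A: a sector that cannot be read is recomputed correctly from the rest.
original = disks[1][2]
rebuilt_ok = reconstruct(disks, parity, bad_disk=1, sector=2)

# 1A: the sector is silently overwritten with garbage instead.
disks[1][2] = b"\xde\xad\xbe\xef"
first_check = parity_check(disks, parity)   # finds the mismatch, rewrites parity
second_check = parity_check(disks, parity)  # parity now agrees with the bad data
rebuilt_after = reconstruct(disks, parity, bad_disk=1, sector=2)

print(first_check, second_check)
print(rebuilt_ok == original, rebuilt_after == original)
```

In this toy model the second check is clean, yet a later rebuild of that sector would return the garbage, which is the "parity is correct but the data is still wrong" situation from 1A.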

Now let's talk about these 2 examples, but instead of the joker doing his joke while the array is parity protected, let's say you built parity just after the damage.

1B. (sectors randomly updated, while no parity in place) You then build parity. unRAID will just build parity with the data it finds. No errors of any kind. In the end, the results will be the same as in scenario 1A above.

2B. (physically damaged sectors) This time unRAID will hit these bad sectors while trying to build parity. unRAID has no way to reconstruct the bad data, so it will just assume that the bad sector is full of binary zeros and compute parity that way. It will leave the bad sector in place and move on. On the next parity check, unRAID will again read the sectors, get the read error (updating the error column), recompute the data (all binary zeros), and write that binary zero back to the damaged sectors. The result - you lost the data in the 100 sectors.
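Scenario 2B as written above can be sketched the same way. This is only an illustration of the behaviour described in this post (unreadable sectors treated as zeros during the parity build), which is itself a guess about what unRAID does; the `None` marker for an unreadable sector and all names here are hypothetical.

```python
SECTOR_SIZE = 4
ZEROS = bytes(SECTOR_SIZE)


def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(SECTOR_SIZE)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def build_parity(data_disks):
    """Build parity, substituting zeros for unreadable (None) sectors.

    Assumption: this mirrors the behaviour guessed at in scenario 2B.
    """
    sectors = len(data_disks[0])
    return [
        xor_blocks([d[s] if d[s] is not None else ZEROS for d in data_disks])
        for s in range(sectors)
    ]


def recompute_sector(data_disks, parity, bad_disk, sector):
    """Recompute a sector from parity and the other disks, as a later check would."""
    others = [d[sector] for i, d in enumerate(data_disks) if i != bad_disk]
    return xor_blocks(others + [parity[sector]])


# Disk 1, sector 1 is physically damaged before parity exists.
disks = [
    [b"AAAA", b"BBBB"],
    [b"CCCC", None],
    [b"EEEE", b"FFFF"],
]
parity = build_parity(disks)
rebuilt = recompute_sector(disks, parity, bad_disk=1, sector=1)
print(rebuilt)
```

Under that assumption the "recovered" sector is all zeros: parity never saw the real contents, so there is nothing to recover them from.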

* * * * *

I do not know why you are getting sync errors, but the result sounds very much like scenario #1. Parity is getting updated to reflect the data it read, but the data it read might very well be wrong. unRAID does not protect you from this at all.

One guess I have is that the cabling to your new drive may be loose or faulty (this might include a bad backplane connection). A bad connection can cause data sent from the drive to get corrupted by the time it gets to unRAID. If this is the case, if you run parity check again, you will again get some parity errors (not exactly the same number but likely in the same neighborhood).

Another possibility is some incompatibility with your system and the drive or the controller.

Post a syslog covering the time period when these errors occurred, as well as a smartctl report of all of your disks. Refer to the "Troubleshooting" page (see link in my sig) for instructions.
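If it helps with gathering the reports, here is a rough Python sketch that builds one `smartctl -a` command per disk and, optionally, runs them. The device names are placeholders (substitute your own), and the helper names are invented for this example; some controllers may need extra smartctl options that are not shown here.

```python
import shutil
import subprocess


def smartctl_command(device):
    """Return the argument list for a full SMART report on one device."""
    return ["smartctl", "-a", device]


def collect_reports(devices):
    """Run smartctl for each device and return {device: report text}."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl not found on PATH")
    reports = {}
    for device in devices:
        result = subprocess.run(
            smartctl_command(device), capture_output=True, text=True
        )
        reports[device] = result.stdout
    return reports


# Placeholder device names; list the disks that are actually in your array.
devices = ["/dev/sda", "/dev/sdb"]
for device in devices:
    print(" ".join(smartctl_command(device)))
```

Calling `collect_reports(devices)` on the server itself would return the report text for each disk, ready to paste into a post.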

I would not feel warm and fuzzy with my array.

November 20, 2008

bjp999,

thanks for your lengthy answer. I was afraid that would be your answer. At the end of this post I will include the relevant part of my syslog. At around 3pm yesterday I started the upgrade process. It was finished this morning around 7:00am.

Again involved is sda - it's my disk15 sitting at the eSATA port. Yesterday I bought a new enclosure and new cables because I thought that the cables didn't fit properly. Now this enclosure has a drawback - the disk is becoming way too hot (43 at the moment - yesterday I saw 46). I'm waiting for a new machine. I will put this disk15 into the new machine then and leave the eSATA port unused.

What I see is that at some point (around 6:00am) things were getting worse. NCQ was disabled for this drive and I can only guess that at this time the 750GB drives were no longer involved and the few 1TBs were under heavy fire, bringing the temperature of disk15 even higher.

My question is now: These are all the messages I found in the syslog for the complete upgrade process. Did the hardware fix that problem or has a possible data read error been propagated to the parity disk?

Thanks

Harald

Nov 20 02:29:06 Tower kernel: ata1.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x2 frozen
Nov 20 02:29:06 Tower kernel: ata1.00: cmd 60/e0:00:7f:10:42/01:00:42:00:00/40 tag 0 ncq 245760 in
Nov 20 02:29:06 Tower kernel:          res 40/00:0c:9f:7d:a5/00:00:30:00:00/40 Emask 0x4 (timeout)
Nov 20 02:29:06 Tower kernel: ata1.00: status: { DRDY }
Nov 20 02:29:06 Tower kernel: ata1.00: cmd 60/e0:08:9f:0f:42/00:00:42:00:00/40 tag 1 ncq 114688 in
Nov 20 02:29:06 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 02:29:06 Tower kernel: ata1.00: status: { DRDY }
Nov 20 02:29:06 Tower kernel: ata1.00: cmd 60/40:10:5f:12:42/02:00:42:00:00/40 tag 2 ncq 294912 in
Nov 20 02:29:06 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 02:29:06 Tower kernel: ata1.00: status: { DRDY }
Nov 20 02:29:06 Tower kernel: ata1.00: cmd 60/00:18:9f:14:42/04:00:42:00:00/40 tag 3 ncq 524288 in
Nov 20 02:29:06 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 02:29:06 Tower kernel: ata1.00: status: { DRDY }
Nov 20 02:29:06 Tower kernel: ata1: hard resetting link
Nov 20 02:29:07 Tower kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Nov 20 02:29:07 Tower kernel: ata1.00: configured for UDMA/133
Nov 20 02:29:07 Tower kernel: ata1: EH complete
Nov 20 02:29:07 Tower kernel: sd 1:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB)
Nov 20 02:29:07 Tower kernel: sd 1:0:0:0: [sda] Write Protect is off
Nov 20 02:29:07 Tower kernel: sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
Nov 20 02:29:07 Tower kernel: sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 20 06:02:03 Tower kernel: ata1.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x2 frozen
Nov 20 06:02:03 Tower kernel: ata1.00: cmd 60/38:00:2f:42:c8/03:00:58:00:00/40 tag 0 ncq 421888 in
Nov 20 06:02:03 Tower kernel:          res 40/00:0c:9f:7d:a5/00:00:30:00:00/40 Emask 0x4 (timeout)
Nov 20 06:02:03 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:02:03 Tower kernel: ata1.00: cmd 60/20:08:0f:42:c8/00:00:58:00:00/40 tag 1 ncq 16384 in
Nov 20 06:02:03 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:02:03 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:02:03 Tower kernel: ata1.00: cmd 60/a8:10:67:45:c8/01:00:58:00:00/40 tag 2 ncq 217088 in
Nov 20 06:02:03 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:02:03 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:02:03 Tower kernel: ata1.00: cmd 60/00:18:0f:47:c8/04:00:58:00:00/40 tag 3 ncq 524288 in
Nov 20 06:02:03 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:02:03 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:02:03 Tower kernel: ata1: hard resetting link
Nov 20 06:02:04 Tower kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Nov 20 06:02:04 Tower kernel: ata1.00: configured for UDMA/133
Nov 20 06:02:04 Tower kernel: ata1: EH complete
Nov 20 06:02:04 Tower kernel: sd 1:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB)
Nov 20 06:02:04 Tower kernel: sd 1:0:0:0: [sda] Write Protect is off
Nov 20 06:02:04 Tower kernel: sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
Nov 20 06:02:04 Tower kernel: sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 20 06:04:23 Tower kernel: ata1.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x2 frozen
Nov 20 06:04:23 Tower kernel: ata1.00: cmd 60/a8:00:47:c0:79/01:00:59:00:00/40 tag 0 ncq 217088 in
Nov 20 06:04:23 Tower kernel:          res 40/00:0c:9f:7d:a5/00:00:30:00:00/40 Emask 0x4 (timeout)
Nov 20 06:04:23 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:04:23 Tower kernel: ata1.00: cmd 60/70:08:d7:bf:79/00:00:59:00:00/40 tag 1 ncq 57344 in
Nov 20 06:04:23 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:04:23 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:04:23 Tower kernel: ata1.00: cmd 60/e8:10:ef:c1:79/02:00:59:00:00/40 tag 2 ncq 380928 in
Nov 20 06:04:23 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:04:23 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:04:23 Tower kernel: ata1.00: cmd 60/00:18:d7:c4:79/04:00:59:00:00/40 tag 3 ncq 524288 in
Nov 20 06:04:23 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:04:23 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:04:23 Tower kernel: ata1: hard resetting link
Nov 20 06:04:23 Tower kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Nov 20 06:04:23 Tower kernel: ata1.00: configured for UDMA/133
Nov 20 06:04:23 Tower kernel: ata1: EH complete
Nov 20 06:04:23 Tower kernel: sd 1:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB)
Nov 20 06:04:23 Tower kernel: sd 1:0:0:0: [sda] Write Protect is off
Nov 20 06:04:23 Tower kernel: sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
Nov 20 06:04:23 Tower kernel: sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 20 06:05:19 Tower kernel: ata1.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x2 frozen
Nov 20 06:05:19 Tower kernel: ata1.00: cmd 60/28:00:a7:f8:a2/02:00:59:00:00/40 tag 0 ncq 282624 in
Nov 20 06:05:19 Tower kernel:          res 40/00:0c:9f:7d:a5/00:00:30:00:00/40 Emask 0x4 (timeout)
Nov 20 06:05:19 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:05:19 Tower kernel: ata1.00: cmd 60/68:08:3f:f8:a2/00:00:59:00:00/40 tag 1 ncq 53248 in
Nov 20 06:05:19 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:05:19 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:05:19 Tower kernel: ata1.00: cmd 60/70:10:cf:fa:a2/02:00:59:00:00/40 tag 2 ncq 319488 in
Nov 20 06:05:19 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:05:19 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:05:19 Tower kernel: ata1.00: cmd 60/00:18:3f:fd:a2/04:00:59:00:00/40 tag 3 ncq 524288 in
Nov 20 06:05:19 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:05:19 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:05:19 Tower kernel: ata1: hard resetting link
Nov 20 06:05:20 Tower kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Nov 20 06:05:20 Tower kernel: ata1.00: configured for UDMA/133
Nov 20 06:05:20 Tower kernel: ata1: EH complete
Nov 20 06:05:20 Tower kernel: sd 1:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB)
Nov 20 06:05:20 Tower kernel: sd 1:0:0:0: [sda] Write Protect is off
Nov 20 06:05:20 Tower kernel: sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
Nov 20 06:05:20 Tower kernel: sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 20 06:08:37 Tower kernel: ata1.00: NCQ disabled due to excessive errors
Nov 20 06:08:37 Tower kernel: ata1.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x2 frozen
Nov 20 06:08:37 Tower kernel: ata1.00: cmd 60/c0:00:ff:1d:b3/01:00:5a:00:00/40 tag 0 ncq 229376 in
Nov 20 06:08:37 Tower kernel:          res 40/00:0c:9f:7d:a5/00:00:30:00:00/40 Emask 0x4 (timeout)
Nov 20 06:08:37 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:08:37 Tower kernel: ata1.00: cmd 60/20:08:df:1d:b3/00:00:5a:00:00/40 tag 1 ncq 16384 in
Nov 20 06:08:37 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:08:37 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:08:37 Tower kernel: ata1.00: cmd 60/20:10:bf:1f:b3/03:00:5a:00:00/40 tag 2 ncq 409600 in
Nov 20 06:08:37 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:08:37 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:08:37 Tower kernel: ata1.00: cmd 60/00:18:df:22:b3/04:00:5a:00:00/40 tag 3 ncq 524288 in
Nov 20 06:08:37 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 20 06:08:37 Tower kernel: ata1.00: status: { DRDY }
Nov 20 06:08:37 Tower kernel: ata1: hard resetting link
Nov 20 06:08:38 Tower kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Nov 20 06:08:38 Tower kernel: ata1.00: configured for UDMA/133
Nov 20 06:08:38 Tower kernel: ata1: EH complete
Nov 20 06:08:38 Tower kernel: sd 1:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB)
Nov 20 06:08:38 Tower kernel: sd 1:0:0:0: [sda] Write Protect is off
Nov 20 06:08:38 Tower kernel: sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
Nov 20 06:08:38 Tower kernel: sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 20 07:26:29 Tower kernel: md: sync done. time=57332sec rate=17036K/sec
Nov 20 07:26:29 Tower kernel: md: recovery thread sync completion status: 0
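For anyone who wants to tally such messages rather than read them by eye, here is a small Python sketch. It is a rough summary tool, not an official one: the regular expressions are written against the line format shown in this excerpt, the function name is invented, and it assumes the "K" in the md rate means 1024 bytes. The sample text is a shortened copy of lines from the log above.

```python
import re
from collections import Counter

SAMPLE = """\
Nov 20 02:29:06 Tower kernel: ata1.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x2 frozen
Nov 20 02:29:06 Tower kernel:          res 40/00:0c:9f:7d:a5/00:00:30:00:00/40 Emask 0x4 (timeout)
Nov 20 02:29:06 Tower kernel: ata1: hard resetting link
Nov 20 06:08:37 Tower kernel: ata1.00: NCQ disabled due to excessive errors
Nov 20 06:08:37 Tower kernel: ata1: hard resetting link
Nov 20 07:26:29 Tower kernel: md: sync done. time=57332sec rate=17036K/sec
"""

LINE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (.*)$")
SYNC = re.compile(r"md: sync done\. time=(\d+)sec rate=(\d+)K/sec")


def summarize(text):
    """Count link resets, timeouts and NCQ events; pull out the md sync stats."""
    events = Counter()
    reset_times = []
    sync = None
    for raw in text.splitlines():
        match = LINE.match(raw)
        if not match:
            continue
        stamp, message = match.groups()
        if "hard resetting link" in message:
            events["link_resets"] += 1
            reset_times.append(stamp)
        elif "(timeout)" in message:
            events["timeouts"] += 1
        elif "NCQ disabled" in message:
            events["ncq_disabled"] += 1
        found = SYNC.search(message)
        if found:
            seconds, kib_per_sec = int(found.group(1)), int(found.group(2))
            sync = {
                "seconds": seconds,
                "hours": seconds / 3600,
                # Assumption: K = 1024 bytes; MB here means 10**6 bytes.
                "mb_per_sec": kib_per_sec * 1024 / 1e6,
            }
    return events, reset_times, sync


events, reset_times, sync = summarize(SAMPLE)
print(dict(events))
print(reset_times)
print(sync)
```

Fed the full excerpt instead of the shortened sample, it would count the five "hard resetting link" lines (02:29, 06:02, 06:04, 06:05 and 06:08). The md line works out to a little under 16 hours at roughly 17 MB/sec.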

November 20, 2008

I need the smartctl for all the disks in the array and the entire syslog.

I use my eSata port with a special cable that converts from the eSata connector to an internal Sata connector. I just run the cable back inside the case and connect to an internal drive. If you have room inside the case, I suggest you do this also.

November 21, 2008

In my limited experience, the errors you are seeing here (frozen / timeout), are perhaps the least serious, although the resets they force can be bad, as some drives *may* not recover correctly from the reset. I have not seen these errors indicate any errors on the drive or damage to its data, but they do indicate an occasional loss of contact. Until we understand them better, I think the best way to characterize them and what they mean, is that the link to the drive is unreliable (which I believe you may already be concluding).

The disabling of NCQ may have been a good thing, as there were no more errors reported (but can't make any conclusions). The parity sync or check still finished, taking 57332 seconds at an average rate of about 17MB/sec (rate=17036K/sec in the syslog).

Brian makes good points. The data on the drives is probably fine, but if there were parity errors recorded, AND you see any other errors (especially any kind of read error), then I would power off and on, make sure the array comes up clean (all drives present and no drive errors), and immediately do another parity check, to 'correctly' correct the parity.

I'm not sure that the scenario in 2B is correct. Somewhere, I believe Tom has said what actually happens in similar circumstances. I don't remember where or what he said, but it was completely reasonable to me. Making any assumption about the contents of a bad sector is not usually done, as it is customary to return an error only, not the contents. I *think* the parity build would actually abort. Also in this case, since you are building parity, it would be assumed you don't have parity, AND you can't read the sectors, so the data should be considered already lost. Only a tool like SpinRite could recover any of the data.

Special note to Brian: Thank you so much for your PM, I still am hoping to respond, but I'm actually not used to friendship (hope that does not sound too strange), and I've been highly overwhelmed by life in general lately (rather depressed, highly disorganized), and by numerous little projects/posts involving unRAID. I now have almost 20 Firefox tabs open to unRAID things I need/plan/want to get back to, for the FAQ, the Wiki, or various users. I have over 40 tabs open in all, to unfinished research, projects, etc. I do really want to get back to you, and to contribute some ideas and enhancements to MyMain. I can say I have no clue about anything to apologize for, and think your work here has been excellent, generous and helpful. My little UnMENU contributions are child's play in comparison.

November 21, 2008

I'm not sure that the scenario in 2B is correct. Somewhere, I believe Tom has said what actually happens in similar circumstances. I don't remember where or what he said, but it was completely reasonable to me. Making any assumption about the contents of a bad sector is not usually done, as it is customary to return an error only, not the contents. I *think* the parity build would actually abort. Also in this case, since you are building parity, it would be assumed you don't have parity, AND you can't read the sectors, so the data should be considered already lost. Only a tool like SpinRite could recover any of the data.

I wondered about what it would do if building parity hit a read error. I don't remember reading anything from Tom - may have happened before my time. Would be nice to know for sure. But whatever happens (unRAID gives up or does something more like I wrote) scenario 2B is not a good place to find oneself. You'd much rather be a 2A!

Special note to Brian: Thank you so much for your PM, I still am hoping to respond, but I'm actually not used to friendship (hope that does not sound too strange), and I've been highly overwhelmed by life in general lately (rather depressed, highly disorganized), and by numerous little projects/posts involving unRAID. I now have almost 20 Firefox tabs open to unRAID things I need/plan/want to get back to, for the FAQ, the Wiki, or various users. I have over 40 tabs open in all, to unfinished research, projects, etc. I do really want to get back to you, and to contribute some ideas and enhancements to MyMain.

Thanks for letting me know. Sorry to hear you are having life issues. If it makes you feel any better, so am I. This little community is a very welcome diversion for me. Take good care of yourself, and may God bless.

... and think your work here has been excellent, generous and helpful. My little UnMENU contributions are child's play in comparison.

Thanks for the compliment, but truthfully I feel you've got us reversed. I am a real Linux newbie. You help tons more people than I have been able to. Look at the hit counts on the FAQ and you'll see that your hard work has helped a lot of people! The programming stuff plays into my strength. Awk is pretty cool. It's definitely more fun than work.

November 21, 2008

My little UnMENU contributions are child's play in comparison.

RobJ, I wish I had half the knowledge you do in interpreting the system logs of people who come here for help. Everyone's contributions to this forum make it what it is. The work and time you have contributed here helping people with their systems, and editing the Wiki is a huge contribution. Writing skills and diagnostic skills are not child's play.

I can't tell you how many times I've learned something new from you. In this small online community everyone contributes in one way or another, at times it is experience to share (the FAQ), sometimes it is a failing system to diagnose, and a mystery to solve (a syslog to examine), sometimes it is a script or program to share (unMENU). In my way of thinking, all contributions are equally important.

Joe L.

November 21, 2008

My little UnMENU contributions are child's play in comparison.

RobJ, I wish I had half the knowledge you do in interpreting the system logs of people who come here for help. Everyone's contributions to this forum make it what it is. The work and time you have contributed here helping people with their systems, and editing the Wiki is a huge contribution. Writing skills and diagnostic skills are not child's play.

I can't tell you how many times I've learned something new from you. In this small online community everyone contributes in one way or another, at times it is experience to share (the FAQ), sometimes it is a failing system to diagnose, and a mystery to solve (a syslog to examine), sometimes it is a script or program to share (unMENU). In my way of thinking, all contributions are equally important.

Joe L.

I would have to agree here, you're a great asset to the community. Thank you for your efforts.
