[Closed] Please Help: 456 errors during parity check - now read only filesystem?



My system was working just fine yesterday. This morning it ran its monthly parity check. When it was complete I saw this:

Last checked on Thu May 1 10:12:13 2014 MDT, finding 456 errors.

 

I upgraded to 5.0.5 recently, though I don't think this is my first parity check. It may, however, be the first one kicked off by the monthly UnMenu package. I have never seen anything but "0 errors", so this is quite concerning. Then I noticed that while it says my parity is good, my array appears to be in read-only mode. I am presently unable to add any files to unRAID.

 

Looking through the syslog, two things concern me:

  • sdg appears to have some issues. This is my cache drive. Perhaps everything is read-only, though, and the log only mentions sdg because that's the disk it's trying to write to.

  • My syslog begins at 4:40am today. I believe this is right in the middle of the parity check, since it finished at 10am.

 

 

I would really appreciate any guidance you can provide me as I don't know where to turn. Should I run another parity check? I'm going to let it sit as-is for the moment so I don't make it any worse. The Web UI is not showing any redballed disks or any other issues aside from the parity check errors.

 

Please let me know what else I can provide that might be helpful.

 

Thanks in advance,

Brian

 

Edit: Realized that my syslog had gotten so large that it was rotated. So I grabbed syslog.1, which has everything back to the last reboot on 4/15. Almost every error I see seems to involve my cache drive. Is it possible that it's dying and unRAID is having issues reading from it? (I saw a lot of mover errors where it said it couldn't read the file.) It also seems that the mover was running during the parity check, so if the mover was having issues reading from cache while the parity check was going, maybe that's a bad combination. Attached syslog.1.txt. Thanks!
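For anyone wanting to check the "almost every error involves the cache drive" impression with numbers, here is a minimal sketch that tallies kernel I/O error lines per device. It assumes the errors look like the kernel messages quoted later in this thread ("end_request: I/O error, dev sdg, sector 0" and "Buffer I/O error on device sdg, logical block 0"). The function names are made up for illustration, and other error formats would need their own patterns.

```python
import re
from collections import Counter

# Matches the two kernel message shapes seen in this thread:
#   "end_request: I/O error, dev sdg, sector 0"
#   "Buffer I/O error on device sdg, logical block 0"
DEVICE_ERROR = re.compile(r"I/O error(?:, dev| on device) (sd[a-z]+)")


def count_io_errors(lines):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = DEVICE_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_io_errors_in_file(path):
    """Same tally, reading the lines from a syslog file on disk."""
    with open(path, errors="replace") as handle:
        return count_io_errors(handle)
```

Calling count_io_errors_in_file("syslog.1.txt") and printing the result's most_common() would list the noisiest devices first.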

 

Edit #2: I'm pretty confident my cache drive is the issue. I am able to write to a share which does not have cache enabled. I just need to know how to handle the parity check situation so I don't lose any data. I imagine if I just go in and disable my cache drive on each of my shares and run a parity check, I should be OK? Thanks.

 

Edit #3: I disabled my cache drive and things seem to be working just fine. I guess I should run a preclear on the cache disk and see if it breaks? Am I OK to run a parity check in the meantime?

syslog.zip

syslog.1.zip

Link to comment

Any ideas? Looking through the logs, everything was fine prior to the 1st, with no mover errors before then. However, last night I confirmed that some files from the cache drive had only been partially written to the array by the mover.

 

So I went and re-copied all the files that the log said the mover had problems with. All of those files play just fine now. I've used the array without issue now that the cache drive is disabled.

 

Am I risking anything by running another parity check? I'm hoping that without the misbehaving cache drive, it'll complete without issue.

Link to comment

Thanks for taking a look!

 

No errors on the unRAID Main page:

aab1d6324184381.jpg

 

I've attached the SMART report for disk 9. For my own edification, I'm curious why you singled out disk 9 (sde)? In my scan of the logs I didn't see anything calling out any drive other than sdg. Never mind - I see them.

 

 

Edit: I ran reiserfsck on my cache drive and it comes back with the following. Note - My cache drive is not presently enabled in Tower/ShareSettingsMenu. I didn't think it needed to be though.

root@Tower:~# reiserfsck --check /dev/sdg1   
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/sdg1
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

The problem has occurred looks like a hardware problem. If you have
bad blocks, we advise you to get a new hard drive, because once you
get one bad block  that the disk  drive internals  cannot hide from
your sight,the chances of getting more are generally said to become
much higher  (precise statistics are unknown to us), and  this disk
drive is probably not expensive enough  for you to you to risk your
time and  data on it.  If you don't want to follow that follow that
advice then  if you have just a few bad blocks,  try writing to the
bad blocks  and see if the drive remaps  the bad blocks (that means
it takes a block  it has  in reserve  and allocates  it for use for
of that block number).  If it cannot remap the block,  use badblock
option (-B) with  reiserfs utils to handle this block correctly.

bread: Cannot read the block (2): (Input/output error).

Aborted (core dumped)
root@Tower:~# 

 

Smart Report for Cache drive (sdg):

root@Tower:~# smartctl -a -A /dev/sdg
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               /1:0:3:0
Product:              
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
root@Tower:~# 

 

My cache drive is now showing as unformatted in the unRAID Main GUI.

1bb26e324189545.jpg

 

When I attempt to format it, it won't and I see the following in my log:

May 2 12:17:56 Tower emhttp: ckmbr: read: Input/output error
May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg] Unhandled error code
May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg]
May 2 12:17:56 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg] CDB:
May 2 12:17:56 Tower kernel: cdb[0]=0x28: 28 00 00 00 00 00 00 00 20 00
May 2 12:17:56 Tower kernel: end_request: I/O error, dev sdg, sector 0
May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 0
May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 1
May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 2
May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 3
May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg] Unhandled error code
May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg]
May 2 12:17:56 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00
May 2 12:17:56 Tower kernel: sd 1:0:3:0: [sdg] CDB:
May 2 12:17:56 Tower kernel: cdb[0]=0x28: 28 00 00 00 00 00 00 00 08 00
May 2 12:17:56 Tower kernel: end_request: I/O error, dev sdg, sector 0
May 2 12:17:56 Tower kernel: Buffer I/O error on device sdg, logical block 0

 

Seems like my cache drive is dead?
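As a side note on reading those kernel lines: the "cdb" bytes are the SCSI command that failed. Opcode 0x28 is READ(10), whose bytes 2-5 hold the starting block address and bytes 7-8 the number of blocks, both big-endian. The helper below is a small illustrative sketch (the function name is made up) that decodes the two command blocks from the log above.

```python
def decode_read10(cdb_hex):
    """Decode a 10-byte SCSI READ(10) command block given as hex bytes.

    Layout: byte 0 opcode (0x28), byte 1 flags, bytes 2-5 logical block
    address, byte 6 group number, bytes 7-8 transfer length, byte 9 control.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command block")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# The two command blocks from the syslog excerpt above.
first = decode_read10("28 00 00 00 00 00 00 00 20 00")
second = decode_read10("28 00 00 00 00 00 00 00 08 00")
print(first, second)
```

Both are plain reads starting at block 0 (32 blocks, then 8 blocks), which matches the "sector 0" errors: the drive is not answering even the very first read. In the Linux SCSI headers a hostbyte of 0x04 corresponds to DID_BAD_TARGET, which fits a device that has stopped responding rather than one reporting a bad sector.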

 

 

Thanks so much for your help.

-Brian

smart.txt

Link to comment

From the SMART report your drive shows:

 

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       274

 

So the drive has only one bad block, and it has already been remapped, so the surface looks good. It does, however, show 274 UDMA CRC errors, which usually point to a problem with the SATA or power cabling.

 

So time to check (reseat or perhaps replace) the SATA or power cables.  If you have a power splitter in the cable then look at it too.
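If you want to repeat this check on other drives, here is a rough sketch that pulls the raw values out of attribute rows like the ones above and reports any watched attribute with a non-zero count. The watch list and function names are my own choices for illustration, not anything smartctl provides, and it assumes the ten-column row layout shown above.

```python
# Attributes singled out in this post; the list itself is an assumption.
WATCHED = (
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_smart_rows(text):
    """Map attribute name -> raw value for smartctl-style attribute rows."""
    raw_values = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED,
        # WHEN_FAILED, RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                raw_values[fields[1]] = int(fields[9])
            except ValueError:
                pass  # skip raw values that are not plain integers
    return raw_values


def nonzero_watched(raw_values, watched=WATCHED):
    """Return the watched attributes whose raw value is above zero."""
    return {name: raw_values[name] for name in watched if raw_values.get(name, 0) > 0}
```

Fed the four rows above, nonzero_watched(parse_smart_rows(report)) returns the one reallocation event and the 274 CRC errors and leaves out the two zero counters.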

 

Regards,

 

Stephen

 

Link to comment


Thanks! These errors are cumulative over the life of the drive, right? So if these drives are 4-5 years old and have been through multiple previous machines, the errors could have happened anywhere along the way? I've confirmed all my power and SATA cables are seated properly.

 

 

Link to comment

Wanted to close out this request for help as I think I've got things under control at the moment... to summarize the events:

 

  • Saw monthly parity check on unRAID main indicating 456 errors.
  • Investigated logs and saw many drive related errors, mainly sdg (cache drive) and some to a data drive sde.
  • Found that I could not write to any cachedrive-enabled shares. Disabled cache drive. Could write to all shares again.
  • Saw that the mover happened to run in the middle of the monthly parity check the night before. Completely ignorant speculation, but perhaps the read errors on the cache drive introduced some bad data which the parity check detected? Noted which files the mover attempted but failed to move and wrote them to Tower again from backups.
  • Ran SMART tests. Drives are old but no major issues detected.
  • Not sure what to do next, so documented the sectors the parity check identified and kicked off a second parity check.
  • Second parity check finished with 380 errors. A comparison of the first 100 sectors showed they were exactly the same sectors as in the first parity check.
  • Third parity check completed with 0 errors. Array appears to be OK. No errors in syslog
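The sector comparison between the first two checks can be scripted. The sketch below is only an illustration: it assumes the sector numbers appear in syslog lines of the form "md: parity incorrect: <sector>", which you should confirm against your own log, and the function names are made up. The pattern is a parameter so it can be swapped if your log reads differently.

```python
import re

# ASSUMPTION: parity errors are logged as "... md: parity incorrect: <sector>".
PARITY_ERROR = re.compile(r"parity incorrect: (\d+)")


def parity_error_sectors(lines, pattern=PARITY_ERROR):
    """Collect the set of sector numbers reported in the given log lines."""
    sectors = set()
    for line in lines:
        match = pattern.search(line)
        if match:
            sectors.add(int(match.group(1)))
    return sectors


def compare_checks(first_lines, second_lines):
    """Split sectors into those seen in both checks and in only one of them."""
    first = parity_error_sectors(first_lines)
    second = parity_error_sectors(second_lines)
    return {
        "both": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }
```

If "only_first" and "only_second" both come back empty for the logged sectors, the two checks flagged the same places, which is what I saw by hand.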

 

 

So that's where it stands right now. My cache drive appears to have some serious issues. I tried running reiserfsck on it, but it couldn't read the drive, and neither could smartctl. Going to pull it out later and see if it's truly dead.

 

Thanks for everyone's help. I plan to continue to watch syslog closely for errors and am considering calculating md5 hashes of known good files to keep on hand for review if I encounter more issues.
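For the md5 idea, something along these lines is what I have in mind. It is a minimal sketch using Python's standard hashlib; the function names and the idea of keeping the result as a simple path-to-hash dictionary are my own, and md5 is used here only to spot accidental corruption, not for security.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its md5 hex digest."""
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def changed_files(old, new):
    """List files present in both manifests whose hashes differ."""
    return sorted(path for path in old if path in new and old[path] != new[path])
```

The plan would be to build a manifest while the files are known good, store it off the array, and compare it against a fresh manifest whenever a parity check reports errors.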

 

Edit: Pulled the cache drive and ran WD diagnostics on it. Found a couple bad sectors and reallocated them. Going to run 3-4 preclears before putting the cache drive back in service.

Link to comment
