[SOLVED] REISERFS error - help please - bad drive or something else?



Today I'm seeing many errors on my console.  If I understand them correctly, my SSD cache drive (sdh) is failing.  The errors look like this:

 

Jun 22 10:40:27 Tower kernel: REISERFS error (device sdh1): zam-7001 reiserfs_find_entry: io error

Jun 22 10:40:27 Tower kernel: sd 7:0:0:0: [sdh] Unhandled error code

Jun 22 10:40:27 Tower kernel: sd 7:0:0:0: [sdh] 

Jun 22 10:40:27 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00

Jun 22 10:40:27 Tower kernel: sd 7:0:0:0: [sdh] CDB:

Jun 22 10:40:27 Tower kernel: cdb[0]=0x28: 28 00 28 24 00 97 00 00 08 00

Jun 22 10:40:27 Tower kernel: end_request: I/O error, dev sdh, sector 673448087
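For anyone reading these lines later, the CDB and host byte in the log above can be decoded by hand. The Python sketch below is only an illustration and not something posted in the thread. It assumes the standard READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8) and the Linux kernel's host byte codes; the function and table names are made up for the example.

```python
# Values copied from the kernel log lines quoted above.
CDB = "28 00 28 24 00 97 00 00 08 00"
HOSTBYTE = 0x04

# Subset of the Linux SCSI host byte codes (assumed from the kernel headers).
HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x07: "DID_ERROR",
}


def decode_read10(cdb_hex):
    """Return (lba, block_count) from a READ(10) CDB given as hex bytes."""
    data = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(data) != 10 or data[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(data[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(data[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


lba, blocks = decode_read10(CDB)
print("LBA:", lba, "blocks:", blocks, "host byte:", HOST_BYTES.get(HOSTBYTE, "unknown"))
```

The decoded LBA is the same sector that the end_request line reports, and the request was for 8 blocks. The host byte is a status set on the host side (driver or controller), not an error code returned by the disk itself, which may be why I/O errors like these can show up while the drive's SMART data still looks clean. Treat that as a hint, not a diagnosis.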

 

I have Plex, nzbsab, cp, and sb writing data to the cache drive all the time and this drive is about 1 year old, so perhaps it's bad.

 

Please take a look at my syslog and confirm.  For now, I have shut down my server so I don't have any data loss.  To that point and based on the syslog, does it look like I've lost data already?  If yes, it will only be plugin stuff, so not too important, but I'll want to swap that drive out.

 

Or do I have some other I/O issue going on?  I have a green ball next to it, so WTF.

 

Thanks!

syslog_errors6_22_2014.zip

Link to comment

I'm not as good at Syslog reading as most on the forums.

It looks like you restarted the log, thus cutting off the startup sequence.

The startup is where drive assignments show up...so drive 'sdh' is clearly in trouble...but I can't tell if that's cache or not.

Later on, 'cache' has some open files and kicks out some errors, but they could be totally separate.

 

What I would do is run SMART on all your drives, and post results as well as full syslog.

Include your VERSION and SYSTEM LOG for support issues
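When going through SMART reports for several drives, it can help to pull out just the attributes that usually matter for a failing disk or a bad cable. The Python sketch below is a rough illustration only. It assumes the attribute table layout printed by smartctl -A (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); the sample rows are made up for the example and are not from the attached reports, and SSDs often use vendor-specific attribute names that this list will not catch.

```python
# Attributes commonly watched for failing disks or bad cabling (an assumption,
# not an exhaustive list; SSD vendors often use different names).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)

# Made-up sample rows in the smartctl -A table layout, for illustration only.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""


def nonzero_watched(smart_text):
    """Return {attribute: raw_value} for watched attributes with a nonzero raw value."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) != 0:
                found[fields[1]] = int(raw)
    return found


print(nonzero_watched(SAMPLE))
```

A nonzero CRC count generally points at the cable or link rather than the platters or flash, while reallocated or pending sectors point at the drive itself.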

 

If you're sure that it's the cache drive and SMART is clean, then run "reiserfsck --check" on it to see if you can clean up the directory.

Cache is a bit of a special case, so read the Wiki and mind the last sentence on the page:

 

http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

"If operating on the cache drive that is not protected by parity you would use /dev/sdX1 (note the trailing "1" indicating the first partition on the cache drive)"

 

Link to comment

Thanks for your help.  I'm confused though, because that is / was my entire syslog.  All logs start with Restart...  It is my cache drive, and the SMART report is attached.

 

At first I thought I saw errors, but the SMART report looks fine.  Same for all drives.

 

OK, so I want to run reiserfsck --check on my cache drive, but which line in the wiki are you referring to?  The last line just says Troubleshooting.

 

So do I need to stop my array or be in Maintenance mode before running reiserfsck --check?  What is special about running this command on the cache drive?

 

Update:  After rebooting, the errors haven't occurred (yet).  All SMART data shows no reallocated sectors and no errors.  I rebooted again, and still no errors.  The Mover is running right now getting the last few items off.

 

I'm not finding anything wrong now - weird.  So is running reiserfsck still the next step to investigate and ensure no data issues?

 

I hate not finding a smoking gun.

smart.txt

Link to comment

Sorry, I bobbled the quote of the last line from the Wiki. It says,

"If you get an error that says reiserfs_open: the reiserfs superblock cannot be found on /dev/sdX. Failed to open the filesystem. it usually indicates you attempted to run the file system check on the wrong device name.  For almost all repairs, you would use /dev/md1, /dev/md2, /dev/md3, /dev/md4, etc.  If operating on the cache drive that is not protected by parity you would use /dev/sdX1  (note the trailing "1" indicating the first partition on the cache drive) "

 

So (again from the Wiki) your command to check the cache drive would be:

reiserfsck --check /dev/sdh1

This assumes the cache drive is 'h'. (The confirmation that cache = sdh is what got cut off at the beginning of the log.)
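The device-name rule from the wiki quote above (parity-protected array disks go through /dev/mdN, an unprotected cache drive through the first partition, /dev/sdX1) can be written down as a tiny helper. This is only a Python sketch of that rule; the function names and the "array" / "cache" labels are invented for the example.

```python
def reiserfsck_target(kind, ident):
    """Pick the device node to hand to reiserfsck.

    kind  -- "array" for a parity-protected data disk, "cache" for the cache drive
    ident -- disk number for array disks (e.g. 2), drive letter(s) for cache (e.g. "h")
    """
    if kind == "array":
        # Array disks are checked through the md device so parity stays in sync.
        return "/dev/md%d" % int(ident)
    if kind == "cache":
        # Cache is not parity-protected: use the first partition, note the trailing 1.
        return "/dev/sd%s1" % ident
    raise ValueError("kind must be 'array' or 'cache'")


def check_command(kind, ident):
    """Build the read-only check command as an argument list (not executed here)."""
    return ["reiserfsck", "--check", reiserfsck_target(kind, ident)]


print(" ".join(check_command("cache", "h")))
```

Pointing reiserfsck at the whole disk instead of the partition is what produces the "superblock cannot be found" error mentioned in the wiki text.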

 

 

A complete log may require that you Telnet into the box, and copy and paste it out by hand. If you're picking it up from a web gui, it may be truncating it.

There should be about 1000 lines of 'setup', where unRAID is busy finding the drives, assigning them drive letters, loading your plugins, etc.

Link to comment

Ok, I rebooted and started a parity check, but I started getting more errors on the console when I tried to stop one of my plugins that writes to the cache drive, and so I had to shut it down again.  I saw this one first, but then had several other reiserfs errors,

 

REISERFS abort (device sdh1): Journal write error in flush_commit_list

 

I attached the syslog.  This log was taken while the parity check was running; I then had to shut down, so I don't have the log from when I tried to stop the plugin.

 

sdh is my cache drive.  Again I ran and confirmed no SMART errors exist.

 

Output from reiserfsck --check /dev/sdh1

 

Will read-only check consistency of the filesystem on /dev/sdh1

Will put log info to 'stdout'

 

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

###########

reiserfsck --check started at Mon Jun 23 11:30:08 2014

###########

Replaying journal: Done.

Reiserfs journal '/dev/sdh1' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. finished

Comparing bitmaps..finished

Checking Semantic tree:

finished

No corruptions found

There are on the filesystem:

        Leaves 27991

        Internal nodes 193

        Directories 200720

        Other files 146521

        Data block pointers 1583926 (5057 of them are zero)

        Safe links 0

###########

reiserfsck finished at Mon Jun 23 11:30:22 2014

###########

root@Tower:~#

 

I ordered another drive today to replace my cache drive.  I booted up the server, stopped the array and disabled the use of the cache drive.  I also stopped all plugins that write to the cache drive. 

 

I think this is just a bad drive, but I don't like the I/O error messages.  I hope this isn't a driver or motherboard issue and really is only a bad drive.

 

Given I don't want my server running until I have the bad drive out, I rebooted and I'm running memtest for a few days to check hardware.

update: memtest on pass 14 with zero errors.

syslog.zip

Link to comment

Can anyone confirm that my cache drive is in fact bad by looking at my above post and syslog? 

 

I got my new drive today and will swap it in tonight.  I'm planning to boot into maintenance mode and run preclear on it before adding it to my array.

 

I just want to be sure there are no signs of driver issues or perhaps an I/O issue with my controller.

Link to comment

You have several disks showing errors.

ata#15,16,17.

Are they all on an add-in SATA card? 

There are CRC errors, which can point to bad SATA cables.

 

The wiki ( http://lime-technology.com/wiki/index.php?title=The_Analysis_of_Drive_Issues ) lists the other possibilities:

"However, if a better SATA cable does not solve the issue, then it is probably a power problem (loose power cable or backplane connection, poor connectors, poor power splitter, overloaded power supply, too many drives on power rail, bad power supply, etc)."

 

Because you have THREE disks with errors, I'm wondering if it's an add-in card?

 

Link to comment

OK, I'm not sure how to tell which drives are ata 15, 16, etc., but most of the errors have the cache drive's name in them, sdh.  Most likely the Mover was having trouble reading files from the cache drive when it ran, but that's just a guess.  I have now removed the old drive and am running a preclear on the new drive that will become the cache drive.
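On the question of telling which drive sits behind a given ata number: on kernels where the resolved /sys/block/sdX path contains an ataN component (an assumption here, it depends on the kernel version), the mapping can be read straight out of sysfs. The Python sketch below shows the idea; the function names and the example path are hypothetical and not taken from this server.

```python
import glob
import os
import re

# Hypothetical resolved sysfs path, for illustration only.
EXAMPLE_PATH = (
    "/sys/devices/pci0000:00/0000:00:1f.2/ata3/host2/"
    "target2:0:0/2:0:0:0/block/sdb"
)


def ata_port_from_sysfs_path(path):
    """Return the ata port number embedded in a resolved sysfs path, or None."""
    match = re.search(r"/ata(\d+)/", path)
    return int(match.group(1)) if match else None


def map_ata_ports(sys_block="/sys/block"):
    """Return {ata_port: device_name} for every sd* disk found under sys_block."""
    mapping = {}
    for link in sorted(glob.glob(os.path.join(sys_block, "sd*"))):
        port = ata_port_from_sysfs_path(os.path.realpath(link))
        if port is not None:
            mapping[port] = os.path.basename(link)
    return mapping


if __name__ == "__main__":
    for port, name in sorted(map_ata_ports().items()):
        print("ata%d -> %s" % (port, name))
```

With a full syslog, the same pairing can usually also be read from the boot section, where each ata port reports the model of the drive attached to it.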

 

Syslog attached, and no CRC errors now.  I also went back and confirmed that the CRC errors, while only a few per syslog, only started after I started seeing the errors on the console complaining about the cache drive, sdh.  So I hope that by replacing the cache drive, things are good now.
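One way to double-check which device the kernel errors actually name is to tally the error lines per sdX device in the syslog. The following Python sketch is just an illustration; the function name is made up, and the sample lines are the ones quoted in the first post of this thread.

```python
import re
from collections import Counter

# Matches sdX or sdXN (e.g. sdh, sdh1) and captures the disk name without the partition.
DEVICE = re.compile(r"\b(sd[a-z]+)\d*\b")

# Log lines quoted in the first post of this thread.
SAMPLE = [
    "Jun 22 10:40:27 Tower kernel: REISERFS error (device sdh1): zam-7001 reiserfs_find_entry: io error",
    "Jun 22 10:40:27 Tower kernel: sd 7:0:0:0: [sdh] Unhandled error code",
    "Jun 22 10:40:27 Tower kernel: cdb[0]=0x28: 28 00 28 24 00 97 00 00 08 00",
    "Jun 22 10:40:27 Tower kernel: end_request: I/O error, dev sdh, sector 673448087",
]


def count_error_lines(lines):
    """Count lines containing 'error' (case-insensitive), keyed by the first sdX device named."""
    counts = Counter()
    for line in lines:
        if "error" not in line.lower():
            continue
        match = DEVICE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


print(dict(count_error_lines(SAMPLE)))
```

To run it against a real log, pass the open file instead of SAMPLE, for example count_error_lines(open("syslog")) with whatever path the syslog was saved to.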

 

I will run a parity check again once I have confirmation of a clean syslog.  That said, I just ran one 3 days ago, zero errors.

 

None of the drives show any SMART errors.

 

I have no add-on cards, as my motherboard has 3 onboard controllers.  It could be a driver issue, but I've been running for several months with no errors and no issues, with the same cables - haven't touched a thing.  I've been on 5.05 since it was released too, so no changes there either.  With that, I find it hard to believe it's the cables or the controller.  That said, changing cables is easy, I just need to order some.

syslog.zip

Link to comment
  • 1 month later...

I'm going to mark this thread as solved.

 

I had two issues going on.  One disk did have SMART errors, so I replaced that one.  The cache drive was the drive causing the CRC errors, replaced that one too.

 

The controller errors point to the onboard Gigabyte SATA 2 controller, or perhaps a driver.  I'm sorting that out in my other thread, http://lime-technology.com/forum/index.php?topic=33937.new;topicseen#new

 

Link to comment
