
[SOLVED] Need help with syslog errors.


levster


My 4.6 has been running without a single issue for about a year or more, until last night, when for no reason at all I started having difficulty accessing certain folders, primarily on disk 5. I tried to reboot the array, but it would not unmount the drives, so I had to hard-reset it. When the array was coming back online, disks 5, 6, and parity took a long time to mount. Now the array is running a parity check, as it should. I looked at the syslog and there are errors, but I do not know what they mean. Any help would be appreciated.

 

Thanks,

 

Lev

syslog-2011-09-25.txt.zip


This is the output of reiserfsck --check /dev/md5:

 

 

Replaying journal: Done.

Reiserfs journal '/dev/md5' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. \/  1 (of  17|bad_internal: vpf-10320: block 153550992, items 0 and 1: The wrong order of items: [3 1740 0x2400dc82001 IND (1)], [3 1219 0x924e001 IND (1)]

the problem in the internal node occured (153550992), whole subtree is skipped

finished   

Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Bad nodes were found, Semantic pass skipped

1 found corruptions can be fixed only when running with --rebuild-tree

 

I started the --rebuild-tree command, but the laptop I had telnetted in from "fell asleep", and I have no way of knowing how far along I am. I think the original time to completion was something like 25,000 minutes. How will I know when the operation has finished, and what should I do next?


The reiserfsck probably died when the laptop fell asleep.

 

Telnet into the unRAID machine and type:

ps -ef

If there is a reiserfsck running then it is still going, but I've got $20 that says it is not.

 

You will have to run the check again, and then the --rebuild-tree if it tells you to.
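That `ps` test can be wrapped in a tiny script — a minimal sketch, assuming a standard procps-style `ps`; the bracketed pattern keeps grep from matching its own command line:

```shell
#!/bin/sh
# Report whether a reiserfsck process is still alive.
# The [r] trick stops this grep from matching itself in the ps output.
if ps -ef | grep '[r]eiserfsck' > /dev/null; then
    echo "reiserfsck is still running"
else
    echo "no reiserfsck process found"
fi
```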

 

You were correct. Nothing running. I am re-running --rebuild-tree directly from the console to avoid any timeouts. It looks like it will take a long time - 6+ hours. Is that expected?



To rebuild the tree, reiserfsck has to read the entire drive.

Typically, drives can be read at about 80 to 100 MB/s, but they are often a lot slower on the inner cylinders.

 

At the fastest, that is 10 seconds per Gigabyte, or 6 GB per minute.

 

You have 2000 Gigabytes on a 2TB drive.  2000 / 6 = 333.3 minutes (or about 5.5 hours).

And that is just reading as fast as it can, not interpreting what it sees. Yes, 6+ hours is not unusual.
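The arithmetic above, as a quick shell sanity check (numbers taken from the post: a 2TB drive read at ~100 MB/s, i.e. 6 GB per minute):

```shell
# 100 MB/s ~= 10 s/GB ~= 6 GB/min sequential read, best case.
size_gb=2000          # capacity of a 2TB drive, in GB
rate_gb_per_min=6     # ~100 MB/s expressed as GB per minute
minutes=$(( size_gb / rate_gb_per_min ))
echo "$minutes minutes, or about $(( minutes / 60 )) hours"   # 333 minutes, or about 5 hours
```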



 

Thanks. That makes me feel a little more at ease. At the end, will I have to remount the drive, or should I run:

 

reiserfsck --check /dev/md5

 

 

 

Lev
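For reference, a typical follow-up once --rebuild-tree completes - a sketch only, not run on this system; it assumes the array is stopped so /dev/md5 is unmounted, and that the disk normally mounts at /mnt/disk5:

```shell
# 1. Re-run the read-only check; after a successful rebuild it should
#    report no remaining corruptions.
reiserfsck --check /dev/md5

# 2. If the check is clean, start the array normally, then look in
#    lost+found at the root of the disk for anything the rebuild
#    could not put back in place.
ls /mnt/disk5/lost+found
```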

 


Thank you everyone who helped.  ;D

 

After running the recommended --rebuild-tree, 2 folders were added to the lost+found directory, both duplicates of what was already on disk 5. Subsequent checks found no additional problems. I checked disk 6 just to be sure, and no problems were detected. I've upgraded to 4.7, and attached is the latest syslog. There are still errors, but these were present in the earlier syslogs. Are they anything to worry about? I want to add an additional data drive, and before that I want to make sure that all is kosher.

 

Sep 27 13:34:28 Tower kernel:  [<f83d7a36>] scst_build_proc_target_dir_entries+0x55/0xdc [scst] (Routine)

Sep 27 13:34:28 Tower kernel:  [<f83beca9>] __scst_register_target_template+0x16c/0x3af [scst] (Routine)

Sep 27 13:34:28 Tower kernel:  [<f843cd5d>] mvst_init+0x3b/0x5b [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8441895>] mvs_pci_init+0xaa5/0xaf7 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c10062d9>] ? dma_generic_alloc_coherent+0x0/0xdb (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1142050>] local_pci_probe+0xe/0x10 (Errors)

Sep 27 13:34:28 Tower kernel:  [<c11426ad>] pci_device_probe+0x48/0x66 (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1194956>] driver_probe_device+0x79/0xed (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1194a0d>] __driver_attach+0x43/0x5f (Errors)

Sep 27 13:34:28 Tower kernel:  [<c11940a7>] bus_for_each_dev+0x39/0x5a (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449000>] ? mvs_init+0x0/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c119482f>] driver_attach+0x14/0x16 (Errors)

Sep 27 13:34:28 Tower kernel:  [<c11949ca>] ? __driver_attach+0x0/0x5f (Errors)

Sep 27 13:34:28 Tower kernel:  [<c119451c>] bus_add_driver+0x9f/0x1c5 (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449000>] ? mvs_init+0x0/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1194ccf>] driver_register+0x7b/0xd7 (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449000>] ? mvs_init+0x0/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1142882>] __pci_register_driver+0x39/0x8c (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449000>] ? mvs_init+0x0/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449030>] mvs_init+0x30/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1001139>] do_one_initcall+0x4c/0x131 (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1042e6e>] sys_init_module+0xa7/0x1dd (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1002935>] syscall_call+0x7/0xb (Errors)

Sep 27 13:34:28 Tower kernel: ---[ end trace d0f101247926fe9d ]---

 

Also, should I be concerned with disk 5 as it had errors on it to begin with? Should I replace it?

 

Thanks,

 

Lev

syslog-2011-09-27.txt.zip


Please post a SMART report for sde. The previous SMART report attachment is corrupt. Please paste the results.
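A SMART report can usually be pulled straight from the unRAID console with smartctl - a sketch, assuming the drive really is sde (confirm the letter on the Devices page) and that smartmontools is present; some older controllers need the device type spelled out with `-d ata`:

```shell
# -a prints everything: identity, attributes, and the error log.
# Writing to /boot puts the file on the flash drive, so it survives a
# reboot and can be attached to a forum post.
smartctl -a /dev/sde > /boot/smart_sde.txt
# If the report comes back empty, try forcing the device type:
# smartctl -a -d ata /dev/sde > /boot/smart_sde.txt
```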

 

Sorry.  :-\

 

I decided to try a different disk of the same manufacturer and size, and the preclear is now running. I guess it must be a bad disk. I'll see if I can plug it into my other PC and run a SMART report on it, or just send it out for RMA.

 

 

Lev

 


Well if it's not one thing it's another...

 

The preclear script has been running for almost a day now on a 2TB Seagate drive installed in a Norco 4220 case. This AM, while routinely checking the server, I noticed that disk 6 has red-balled, and I have been getting these errors in the syslog since, I believe, 5 AM:

 

Sep 28 12:56:24 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:56:24 Tower kernel: mdcmd (7449): spindown 6 (Routine)

Sep 28 12:56:24 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:56:34 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:56:34 Tower kernel: mdcmd (7450): spindown 6 (Routine)

Sep 28 12:56:34 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:56:44 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:56:44 Tower kernel: mdcmd (7451): spindown 6 (Routine)

Sep 28 12:56:44 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:56:54 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:56:54 Tower kernel: mdcmd (7452): spindown 6 (Routine)

Sep 28 12:56:54 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:01 Tower unmenu[1660]: bad method -       69     1208     2201    13400       13        6      152      130        0    13490    1353069-^M

Sep 28 12:57:04 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:04 Tower kernel: mdcmd (7453): spindown 6 (Routine)

Sep 28 12:57:04 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:14 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:14 Tower kernel: mdcmd (7454): spindown 6 (Routine)

Sep 28 12:57:14 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:24 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:24 Tower kernel: mdcmd (7455): spindown 6 (Routine)

Sep 28 12:57:24 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:34 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:34 Tower kernel: mdcmd (7456): spindown 6 (Routine)

Sep 28 12:57:34 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:44 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:44 Tower kernel: mdcmd (7457): spindown 6 (Routine)

Sep 28 12:57:44 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:54 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:54 Tower kernel: mdcmd (7458): spindown 6 (Routine)

Sep 28 12:57:54 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:58:04 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:58:04 Tower kernel: mdcmd (7459): spindown 6 (Routine)

Sep 28 12:58:04 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:58:14 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:58:14 Tower kernel: mdcmd (7460): spindown 6 (Routine)

Sep 28 12:58:14 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

 

I ran reiserfsck --check /dev/md6 yesterday, prior to upgrading from 4.6 to 4.7, and there were no problems. I rebooted the system with the new drive installed, started the preclear, and there were no problems at all until this AM.

 

I do not want to mess around with the server until the preclear is finished, but I think that will take another 5 to 8 hours, judging by how "fast" it is going. Attached is a report pertaining to disk 6, generated by unMENU.

 

Any thoughts?

 

Thanks,

 

Lev

disk_6_errors.zip


It indicates a "write" to the drive failed and it was taken off-line (disabled).

 

There can be many causes - anything from the drive itself failing, to a cable failing, to a loose connector.

 

See here for a starting point:

http://lime-technology.com/wiki/index.php?title=FAQ#What_does_the_Red_Ball_mean.3F

 

 

Thank you. I will try swapping the drive to a different location after the preclear script is finished to see if that makes a difference.

 

Lev


In the end, I am still not sure what could have happened. I tried a second 2TB drive and, while preclearing it, noticed a lot of errors in the SMART report, all looking identical to this:

 

 

Error 91 occurred at disk power-on lifetime: 29 hours (1 days + 5 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00      00:35:36.910  READ DMA EXT

  27 00 00 00 00 00 e0 00      00:35:36.908  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00      00:35:36.868  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00      00:35:36.850  SET FEATURES [set transfer mode]

  27 00 00 00 00 00 e0 00      00:35:36.808  READ NATIVE MAX ADDRESS EXT

 

Error 90 occurred at disk power-on lifetime: 29 hours (1 days + 5 hours)

  When the command that caused the error occurred, the device was active or idle.

 

Both drives were brand new, and I tried installing them in various slots on my Norco 4220 case with 2 SAS-LP-MV8 cards and a SUPERMICRO MBD-C2SEE-O. The last drive, the one whose SMART report is attached, has been cleared and incorporated into the array. So far no issues, but there are persistent errors in the syslog, which are unchanged from the beginning as far as I can tell.

 

On the first Seagate drive, which I thought was bad, I ran extensive tests while it was connected to a Win 7 machine, and found no problems.

 

I may take the server apart and check to make sure that all connections are still good, but I somehow doubt that anything could be loose.

 

Thanks for the help,

 

Lev

syslog-2011-10-03.txt.zip


Archived

This topic is now archived and is closed to further replies.
