
[SOLVED] Need help with syslog errors.


levster


My 4.6 has been running without a single issue for about a year or more, until last night, when for no reason at all I started having difficulty accessing certain folders, primarily on disk 5. I tried to reboot the array, but it would not unmount the drives, so I had to hard-reset it. When the array was coming back online, disks 5, 6, and parity took a long time to mount. Now the array is running a parity check, as it should. I looked at the syslog and there are errors, but I do not know what they mean. Any help would be appreciated.

 

Thanks,

 

Lev

syslog-2011-09-25.txt.zip


This is the output of reiserfsck --check /dev/md5:

 

 

Replaying journal: Done.

Reiserfs journal '/dev/md5' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. \/  1 (of  17|bad_internal: vpf-10320: block 153550992, items 0 and 1: The wrong order of items: [3 1740 0x2400dc82001 IND (1)], [3 1219 0x924e001 IND (1)]

the problem in the internal node occured (153550992), whole subtree is skipped

finished   

Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Bad nodes were found, Semantic pass skipped

1 found corruptions can be fixed only when running with --rebuild-tree

 

I started the --rebuild-tree command, but the laptop I had telnetted in from "fell asleep", and I have no way of knowing how far along I am. I think the original time to completion was something like 25,000 minutes. How will I know when the operation has finished, and what should I do next?


The reiserfsck probably died when the laptop fell asleep.

 

Telnet into the unRAID machine and type:

ps -ef

If there is a reiserfsck running then it is still going, but I've got $20 that says it is not.

 

You will have to run the check again, and then the --rebuild-tree if it tells you to.
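That `ps` test can be wrapped in a tiny script — a minimal sketch, assuming a standard procps-style `ps`; the bracketed pattern keeps grep from matching its own command line:

```shell
#!/bin/sh
# Report whether a reiserfsck process is still alive.
# The [r] trick stops this grep from matching itself in the ps output.
if ps -ef | grep '[r]eiserfsck' > /dev/null; then
    echo "reiserfsck is still running"
else
    echo "no reiserfsck process found"
fi
```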

 

You were correct. Nothing running. I am re-running --rebuild-tree directly from the console to avoid any timeouts. It looks like it will take a long time - 6+ hours. Is that expected?



To rebuild the tree, reiserfsck has to read the entire drive.

Typically, drives can be read at about 80 to 100 MB/s, but they are often a lot slower on the inner cylinders.

 

At the fastest, that is 10 seconds per Gigabyte, or 6 GB per minute.

 

You have 2000 Gigabytes on a 2TB drive.  2000 / 6 = 333.3 minutes (or about 5.5 hours).

And that is just reading as fast as it can, not interpreting what it sees. Yes, 6+ hours is not unusual.
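The arithmetic above, as a quick shell sanity check (numbers taken from the post: a 2TB drive read at ~100 MB/s, i.e. 6 GB per minute):

```shell
# 100 MB/s ~= 10 s/GB ~= 6 GB/min sequential read, best case.
size_gb=2000          # capacity of a 2TB drive, in GB
rate_gb_per_min=6     # ~100 MB/s expressed as GB per minute
minutes=$(( size_gb / rate_gb_per_min ))
echo "$minutes minutes, or about $(( minutes / 60 )) hours"   # 333 minutes, or about 5 hours
```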



 

Thanks. That makes me feel a little more at ease. At the end, will I have to remount the drive, or should I run:

 

reiserfsck --check /dev/md5

 

 

 

Lev
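For reference, a typical follow-up once --rebuild-tree completes - a sketch only, not run on this system; it assumes the array is stopped so /dev/md5 is unmounted, and that the disk normally mounts at /mnt/disk5:

```shell
# 1. Re-run the read-only check; after a successful rebuild it should
#    report no remaining corruptions.
reiserfsck --check /dev/md5

# 2. If the check is clean, start the array normally, then look in
#    lost+found at the root of the disk for anything the rebuild
#    could not put back in place.
ls /mnt/disk5/lost+found
```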

 


Thank you everyone who helped.  ;D

 

After running the recommended --rebuild-tree, 2 folders were added to the lost+found directory, both duplicates of what was already on disk 5. Subsequent checks found no additional problems. I checked disk 6 just to be sure, and no problems were detected. I've upgraded to 4.7, and attached is the latest syslog. There are still errors, but these were present in the earlier syslogs. Are they anything to worry about? I want to add an additional data drive, and before that I want to make sure that all is kosher.

 

Sep 27 13:34:28 Tower kernel:  [<f83d7a36>] scst_build_proc_target_dir_entries+0x55/0xdc [scst] (Routine)

Sep 27 13:34:28 Tower kernel:  [<f83beca9>] __scst_register_target_template+0x16c/0x3af [scst] (Routine)

Sep 27 13:34:28 Tower kernel:  [<f843cd5d>] mvst_init+0x3b/0x5b [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8441895>] mvs_pci_init+0xaa5/0xaf7 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c10062d9>] ? dma_generic_alloc_coherent+0x0/0xdb (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1142050>] local_pci_probe+0xe/0x10 (Errors)

Sep 27 13:34:28 Tower kernel:  [<c11426ad>] pci_device_probe+0x48/0x66 (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1194956>] driver_probe_device+0x79/0xed (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1194a0d>] __driver_attach+0x43/0x5f (Errors)

Sep 27 13:34:28 Tower kernel:  [<c11940a7>] bus_for_each_dev+0x39/0x5a (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449000>] ? mvs_init+0x0/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c119482f>] driver_attach+0x14/0x16 (Errors)

Sep 27 13:34:28 Tower kernel:  [<c11949ca>] ? __driver_attach+0x0/0x5f (Errors)

Sep 27 13:34:28 Tower kernel:  [<c119451c>] bus_add_driver+0x9f/0x1c5 (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449000>] ? mvs_init+0x0/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1194ccf>] driver_register+0x7b/0xd7 (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449000>] ? mvs_init+0x0/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1142882>] __pci_register_driver+0x39/0x8c (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449000>] ? mvs_init+0x0/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<f8449030>] mvs_init+0x30/0x45 [mvsas] (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1001139>] do_one_initcall+0x4c/0x131 (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1042e6e>] sys_init_module+0xa7/0x1dd (Errors)

Sep 27 13:34:28 Tower kernel:  [<c1002935>] syscall_call+0x7/0xb (Errors)

Sep 27 13:34:28 Tower kernel: ---[ end trace d0f101247926fe9d ]---

 

Also, should I be concerned with disk 5 as it had errors on it to begin with? Should I replace it?

 

Thanks,

 

Lev

syslog-2011-09-27.txt.zip


Please post a SMART report for sde. The previous SMART report attachment is corrupt. Please paste the results.
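A SMART report can usually be pulled straight from the unRAID console with smartctl - a sketch, assuming the drive really is sde (confirm the letter on the Devices page) and that smartmontools is present; some older controllers need the device type spelled out with `-d ata`:

```shell
# -a prints everything: identity, attributes, and the error log.
# Writing to /boot puts the file on the flash drive, so it survives a
# reboot and can be attached to a forum post.
smartctl -a /dev/sde > /boot/smart_sde.txt
# If the report comes back empty, try forcing the device type:
# smartctl -a -d ata /dev/sde > /boot/smart_sde.txt
```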

 

Sorry.  :-\

 

I decided to try a different disk of the same manufacturer and size, and the preclear is now running. I guess it must be a bad disk. I'll see if I can plug it into my other PC and run a SMART report on it, or just send it out for RMA.

 

 

Lev

 


Well if it's not one thing it's another...

 

The preclear script has been running for almost a day now on a 2TB Seagate drive installed in a Norco 4220 case. This AM, while routinely checking the server, I noticed that disk 6 has red-balled, and I have been getting these errors in the syslog since, I believe, 5 AM:

 

Sep 28 12:56:24 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:56:24 Tower kernel: mdcmd (7449): spindown 6 (Routine)

Sep 28 12:56:24 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:56:34 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:56:34 Tower kernel: mdcmd (7450): spindown 6 (Routine)

Sep 28 12:56:34 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:56:44 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:56:44 Tower kernel: mdcmd (7451): spindown 6 (Routine)

Sep 28 12:56:44 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:56:54 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:56:54 Tower kernel: mdcmd (7452): spindown 6 (Routine)

Sep 28 12:56:54 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:01 Tower unmenu[1660]: bad method -       69     1208     2201    13400       13        6      152      130        0    13490    1353069-^M

Sep 28 12:57:04 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:04 Tower kernel: mdcmd (7453): spindown 6 (Routine)

Sep 28 12:57:04 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:14 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:14 Tower kernel: mdcmd (7454): spindown 6 (Routine)

Sep 28 12:57:14 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:24 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:24 Tower kernel: mdcmd (7455): spindown 6 (Routine)

Sep 28 12:57:24 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:34 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:34 Tower kernel: mdcmd (7456): spindown 6 (Routine)

Sep 28 12:57:34 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:44 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:44 Tower kernel: mdcmd (7457): spindown 6 (Routine)

Sep 28 12:57:44 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:57:54 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:57:54 Tower kernel: mdcmd (7458): spindown 6 (Routine)

Sep 28 12:57:54 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:58:04 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:58:04 Tower kernel: mdcmd (7459): spindown 6 (Routine)

Sep 28 12:58:04 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

Sep 28 12:58:14 Tower emhttp: mdcmd: write: Input/output error (Errors)

Sep 28 12:58:14 Tower kernel: mdcmd (7460): spindown 6 (Routine)

Sep 28 12:58:14 Tower kernel: md: disk6: ATA_OP_STANDBYNOW1 ioctl error: -5 (Errors)

 

I ran reiserfsck --check /dev/md6 yesterday, prior to upgrading from 4.6 to 4.7, and there were no problems. I rebooted the system with the new drive installed, started the preclear, and there were no problems at all until this AM.

 

I do not want to mess around with the server until the preclear is finished, but I think that will take another 5 to 8 hours, judging by how "fast" it is going. Attached is a report pertaining to disk 6, generated by unMENU.

 

Any thoughts?

 

Thanks,

 

Lev

disk_6_errors.zip


It indicates a "write" to the drive failed and it was taken off-line (disabled).

 

There can be many causes - anything from the drive itself failing, to a cable failing, to a loose connector.

 

See here for a starting point:

http://lime-technology.com/wiki/index.php?title=FAQ#What_does_the_Red_Ball_mean.3F

 

 

Thank you. I will try swapping the drive to a different location after the preclear script is finished to see if that makes a difference.

 

Lev


In the end, I am still not sure what could have happened. I tried a second 2TB drive and, while preclearing it, noticed a lot of errors in the SMART report, all looking identical to this:

 

 

Error 91 occurred at disk power-on lifetime: 29 hours (1 days + 5 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00      00:35:36.910  READ DMA EXT

  27 00 00 00 00 00 e0 00      00:35:36.908  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00      00:35:36.868  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00      00:35:36.850  SET FEATURES [set transfer mode]

  27 00 00 00 00 00 e0 00      00:35:36.808  READ NATIVE MAX ADDRESS EXT

 

Error 90 occurred at disk power-on lifetime: 29 hours (1 days + 5 hours)

  When the command that caused the error occurred, the device was active or idle.

 

Both drives were brand new, and I tried installing them in various slots on my Norco 4220 case with 2 SAS-LP-MV8 cards and a SUPERMICRO MBD-C2SEE-O. The last drive, the one whose SMART report is attached, has been cleared and incorporated into the array. So far no issues, but there are persistent errors in the syslog, which are unchanged from the beginning as far as I can tell.

 

On the first Seagate drive, which I thought was bad, I ran extensive tests while it was connected to a Win 7 machine, and found no problems.

 

I may take the server apart and check to make sure that all connections are still good, but I somehow doubt that anything could be loose.

 

Thanks for the help,

 

Lev

syslog-2011-10-03.txt.zip


Archived

This topic is now archived and is closed to further replies.
