Red error messages in my syslog


OK, herewith. I tried it yesterday and it consistently gave the same information... I do however want to stress that the preclear script did NOT fail because of errors with the drive... It causes a large number of errors in the system; the new disk even became totally unavailable while the script was running.

 

When clearing with unRAID itself, everything went without a problem (it took a good amount of time, but there were no errors whatsoever).

 

I am happy with the result I have right now, and I have made a link to this post in the preclear thread for people who think some more investigation is needed.

 

=== START OF INFORMATION SECTION ===

Device Model:    WDC WD20EARX-00PASB0

Serial Number:    WD-WMAZA5761786

Firmware Version: 51.0AB51

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Mon Mar 12 18:11:00 2012 CET

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      ( 113) The previous self-test completed having

the read element of the test failed.

Total time to complete Offline

data collection: (39900) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  2) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  5) minutes.

SCT capabilities:       (0x3035) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0

  3 Spin_Up_Time            0x0027  208  171  021    Pre-fail  Always      -      4575

  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      62

  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0

  9 Power_On_Hours          0x0032  093  093  000    Old_age  Always      -      5127

10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0

11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      52

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      38

193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      2040

194 Temperature_Celsius    0x0022  123  107  000    Old_age  Always      -      27

196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      26

199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0

200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      171

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed: read failure      10%      5127        168700144

# 2  Extended offline    Completed: read failure      90%      5080        168542561

# 3  Short offline      Completed: read failure      80%      5080        168542563

# 4  Short offline      Completed: read failure      90%      5080        168542560

# 5  Short offline      Completed: read failure      90%      5080        168542560

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Link to comment

The pending sectors have been resolved and Offline_Uncorrectable has increased to 26. This indicates media problems on the HDD. These SMART values need to be monitored: Reallocated_Sector_Ct, Offline_Uncorrectable, Reallocated_Event_Count, and Current_Pending_Sector.
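As a rough illustration of how those four values could be pulled out of a saved report for monitoring, here is a minimal Python sketch. It assumes the attribute table has the column layout shown in the report above; the function name `watched_raw_values` and the `SAMPLE` text (four rows copied from that report) are only for illustration.

```python
import re

# Attributes named in the post above as worth monitoring.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Offline_Uncorrectable",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
)

# One attribute row: ID, name, flag, value, worst, thresh,
# type, updated, when_failed, raw value (leading integer only).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def watched_raw_values(report):
    """Return {attribute_name: raw_value} for the watched attributes in report."""
    found = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match and match.group(2) in WATCHED:
            found[match.group(2)] = int(match.group(6))
    return found


# Rows copied from the SMART report earlier in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      26
"""

values = watched_raw_values(SAMPLE)
nonzero = {name: raw for name, raw in values.items() if raw > 0}
print(nonzero)  # {'Offline_Uncorrectable': 26}
```

Saving the dictionary after each test run and comparing it with the previous one would show whether any of the counts are still climbing.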

 

Run pre-clear on this disk and observe these values. Then post a new SMART report.

 

Pre-clear was reporting the HDD errors and the unRAID clearing was not (ignorance is bliss). Using this drive without further testing is a mistake.

Link to comment

Thanks to everyone for assisting, and please do not take the following in a negative way: as I stated earlier, I CANNOT run preclear on this drive since it will crash my system... People seem to consistently think that my problem is that preclear finds errors on the drive; it is not... Preclear causes errors in the unRAID system to such an extent that it crashes my system.

 

I have to admit, however, that I am getting a crash course in SMART values here, and indeed the link between Current_Pending_Sector and Offline_Uncorrectable points to possible issues with the disk surface. So the disk does seem to have issues. I have set up the array in such a way that only a temporary drive will get data on it, and I will do a bit of experimenting to see how the values change (not because I do not recognise the issue, but because I want to learn along the way).

Link to comment

We understand you are trying to learn.  The unRAID "clearing" process only writes to the disk.  The preclear_disk.sh process reads and writes the disk.  There is a difference.

 

If you only intend to write to your array, and never read data from it, the disk might be perfectly fine.  If you intend to read from it, it may (or may not) show more errors.  It is perfectly possible that once written, the sectors will be readable.  I suspect that will be the case, since I see no re-allocated sectors.  (All the bad sectors, so far, have been re-written in place.)

 

Now that the disk is in your array, I might suggest one or more non-correcting parity checks.  You can perform that through the button in unMENU, or, lacking that add-on, log on via telnet or the system console and type:

/root/mdcmd check NOCORRECT

That will initiate a non-correcting parity check.  If it is successful, you should be fine.

 

Do be aware that none of the "long" or "short" internal tests of that disk has ever completed successfully.

They have all aborted on a read failure.

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed: read failure      10%      5127        168700144

# 2  Extended offline    Completed: read failure      90%      5080        168542561

# 3  Short offline      Completed: read failure      80%      5080        168542563

# 4  Short offline      Completed: read failure      90%      5080        168542560

# 5  Short offline      Completed: read failure      90%      5080        168542560
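To see where on the disk those read failures occurred, the log rows can be parsed and the first-error LBAs collected. This is only a sketch: it assumes the columns are separated by two or more spaces, as in the log quoted above, and the name `parse_selftest_log` is made up for this example.

```python
import re

# A self-test log row:
# "# N  <description>  <status>  <remaining>%  <lifetime hours>  <LBA of first error>"
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Parse smartctl self-test log rows into a list of dictionaries."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header or unrelated line
        num, desc, status, remaining, hours, lba = match.groups()
        entries.append({
            "num": int(num),
            "description": desc,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            # smartctl prints "-" when no error LBA was recorded
            "first_error_lba": int(lba) if lba.isdigit() else None,
        })
    return entries


# Rows copied from the self-test log quoted above.
LOG = """\
# 1  Short offline      Completed: read failure      10%      5127        168700144
# 2  Extended offline    Completed: read failure      90%      5080        168542561
# 3  Short offline      Completed: read failure      80%      5080        168542563
# 4  Short offline      Completed: read failure      90%      5080        168542560
# 5  Short offline      Completed: read failure      90%      5080        168542560
"""

entries = parse_selftest_log(LOG)
failed = [e for e in entries if "read failure" in e["status"]]
first_error_lbas = sorted({e["first_error_lba"] for e in failed})
print(len(failed), "of", len(entries), "tests ended in a read failure")
print("distinct first-error LBAs:", first_error_lbas)
```

In this log, four of the five failures sit within a few sectors of LBA 168542560, and the most recent one is at 168700144, so more than one spot on the disk appears to be affected.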

 

Link to comment

Thanks again. Several of the disks in my array have come from my WHS system (that I now retired) and have been running for some time. Several disks get red flags from smarthistory (just installed it) because of number of hours run.

 

I will wait for the next unRAID version before I start replacing them with brand new 3TB drives; the 2TB drives I free up will be used for off-site backup storage.

Link to comment

Well... that was a golden tip... I started the non-correcting parity check and the disk got flagged and taken out of the array within 30 minutes. I just removed it and replaced it with another one. I had that one lying around and have run several short and extensive tests on it in my PC, with no errors.

 

I am now running preclear on the new drive and it is again giving me a lot of red syslog errors. I'll do a SMART check.

Link to comment

The following red errors ended up in my syslog. The syslog has remained stable and quiet since, and preclear is still running. Can anyone here determine what these errors mean?

 

Mar 15 09:11:42 Tower kernel: ------------[ cut here ]------------

Mar 15 09:11:42 Tower kernel: WARNING: at drivers/ata/libata-core.c:5186 ata_qc_issue+0x10b/0x308() (Minor Issues)

Mar 15 09:11:42 Tower kernel: Hardware name: System Product Name

Mar 15 09:11:42 Tower kernel: Modules linked in: ntfs md_mod xor mvsas libsas scst scsi_transport_sas forcedeth sata_nv amd74xx (Drive related)

Mar 15 09:11:42 Tower kernel: Pid: 7683, comm: hdparm Not tainted 2.6.32.9-unRAID #8 (Errors)

Mar 15 09:11:42 Tower kernel: Call Trace: (Errors)

Mar 15 09:11:42 Tower kernel:  [<c102449e>] warn_slowpath_common+0x60/0x77 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c10244c2>] warn_slowpath_null+0xd/0x10 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11b624d>] ata_qc_issue+0x10b/0x308 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11ba260>] ata_scsi_translate+0xd1/0xff (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11a816c>] ? scsi_done+0x0/0xd (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11a816c>] ? scsi_done+0x0/0xd (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11baa40>] ata_sas_queuecmd+0x120/0x1d7 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11bc6df>] ? ata_scsi_pass_thru+0x0/0x21d (Errors)

Mar 15 09:11:42 Tower kernel:  [<f845769a>] sas_queuecommand+0x65/0x20d [libsas] (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11a816c>] ? scsi_done+0x0/0xd (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11a82c0>] scsi_dispatch_cmd+0x147/0x181 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11ace4d>] scsi_request_fn+0x351/0x376 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c1126798>] __blk_run_queue+0x78/0x10c (Errors)

Mar 15 09:11:42 Tower kernel:  [<c1124446>] elv_insert+0x67/0x153 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11245b8>] __elv_add_request+0x86/0x8b (Errors)

Mar 15 09:11:42 Tower kernel:  [<c1129343>] blk_execute_rq_nowait+0x4f/0x73 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11293dc>] blk_execute_rq+0x75/0x91 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11292cc>] ? blk_end_sync_rq+0x0/0x28 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c112636f>] ? get_request+0x204/0x28d (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11269d6>] ? get_request_wait+0x2b/0xd9 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c112c2bf>] sg_io+0x22d/0x30a (Errors)

Mar 15 09:11:42 Tower kernel:  [<c112c5a8>] scsi_cmd_ioctl+0x20c/0x3bc (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11b3257>] sd_ioctl+0x6a/0x8c (Errors)

Mar 15 09:11:42 Tower kernel:  [<c112a420>] __blkdev_driver_ioctl+0x50/0x62 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c112ad1c>] blkdev_ioctl+0x8b0/0x8dc (Errors)

Mar 15 09:11:42 Tower kernel:  [<c1131e2d>] ? kobject_get+0x12/0x17 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c112b0f8>] ? get_disk+0x4a/0x61 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c101b028>] ? kmap_atomic+0x14/0x16 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c11334a5>] ? radix_tree_lookup_slot+0xd/0xf (Errors)

Mar 15 09:11:42 Tower kernel:  [<c104a179>] ? filemap_fault+0xb8/0x305 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c1048c43>] ? unlock_page+0x18/0x1b (Errors)

Mar 15 09:11:42 Tower kernel:  [<c1057c63>] ? __do_fault+0x3a7/0x3da (Errors)

Mar 15 09:11:42 Tower kernel:  [<c105985f>] ? handle_mm_fault+0x42d/0x8f1 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c108b6c6>] block_ioctl+0x2a/0x32 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c108b69c>] ? block_ioctl+0x0/0x32 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c10769d5>] vfs_ioctl+0x22/0x67 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c1076f33>] do_vfs_ioctl+0x478/0x4ac (Errors)

Mar 15 09:11:42 Tower kernel:  [<c105dcdd>] ? do_mmap_pgoff+0x232/0x294 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c1076f93>] sys_ioctl+0x2c/0x45 (Errors)

Mar 15 09:11:42 Tower kernel:  [<c1002935>] syscall_call+0x7/0xb (Errors)

Mar 15 09:11:42 Tower kernel: ---[ end trace 80e02952ab951772 ]---

 

The following thread describes the same problem with another user:

 

http://lime-technology.com/forum/index.php?topic=14946.0

 

The suggested solution is to run preclears from a motherboard SATA connector, which would mean there is some kind of incompatibility between preclear and the expansion card.
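For anyone trying to read a trace like the one above, a small Python sketch can pull out the three details that matter most: where in the kernel the warning fired, which process triggered it, and which loadable modules show up in the call path. The function name `summarize_warning` is made up for this example, and the sample lines are copied from the syslog excerpt in this post.

```python
import re


def summarize_warning(lines):
    """Collect the warning location, triggering command and modules in the trace."""
    summary = {"location": None, "command": None, "trace_modules": set()}
    for line in lines:
        match = re.search(r"WARNING: at (\S+)", line)
        if match:
            summary["location"] = match.group(1)
        match = re.search(r"comm: (\S+)", line)
        if match:
            summary["command"] = match.group(1)
        # Trace frames that live in a loadable module end with "[module]".
        match = re.search(r"\[<[0-9a-f]+>\].*\[(\w+)\]", line)
        if match:
            summary["trace_modules"].add(match.group(1))
    return summary


# Lines copied from the syslog excerpt above.
SYSLOG = """\
Mar 15 09:11:42 Tower kernel: WARNING: at drivers/ata/libata-core.c:5186 ata_qc_issue+0x10b/0x308() (Minor Issues)
Mar 15 09:11:42 Tower kernel: Pid: 7683, comm: hdparm Not tainted 2.6.32.9-unRAID #8 (Errors)
Mar 15 09:11:42 Tower kernel:  [<c11baa40>] ata_sas_queuecmd+0x120/0x1d7 (Errors)
Mar 15 09:11:42 Tower kernel:  [<f845769a>] sas_queuecommand+0x65/0x20d [libsas] (Errors)
Mar 15 09:11:42 Tower kernel:  [<c1076f93>] sys_ioctl+0x2c/0x45 (Errors)
""".splitlines()

summary = summarize_warning(SYSLOG)
print(summary["location"])       # drivers/ata/libata-core.c:5186
print(summary["command"])        # hdparm
print(summary["trace_modules"])  # {'libsas'}
```

Read this way, the warning was raised while `hdparm` issued an ioctl to the drive, and the request travelled through `libsas`. That is consistent with the drive sitting on the SAS/SATA expansion card rather than on a motherboard port, although the trace alone does not prove where the fault lies.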

Link to comment

A SMART check on the new drive fails to start with the following output:

 

=== START OF INFORMATION SECTION ===

Device Model:    WDC WD20EARS-00MVWB0

Serial Number:    WD-WCAZA4913598

Firmware Version: 51.0AB51

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Thu Mar 15 13:56:51 2012 CET

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

Error SMART Status command failed

Please get assistance from http://smartmontools.sourceforge.net/

Register values returned from SMART Status command are:

ST =0x40

ERR=0x00

NS =0x04

SC =0xe0

CL =0x6c

CH =0x28

SEL=0x40

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Link to comment

From the console I was able to get output using the command:

 

smartctl -a -A -T permissive /dev/sdg

 

Output is as follows:

 

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0

  3 Spin_Up_Time            0x0027  191  172  021    Pre-fail  Always      -      5441

  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      49

  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x002e  100  253  000    Old_age  Always      -      0

  9 Power_On_Hours          0x0032  096  096  000    Old_age  Always      -      3384

10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0

11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      21

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      18

193 Load_Cycle_Count        0x0032  172  172  000    Old_age  Always      -      84663

194 Temperature_Celsius    0x0022  120  108  000    Old_age  Always      -      30

196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0

200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      0

 

So the disk looks fine as far as I am concerned.

 

Two questions remain:

 

1) Any idea what the errors I got in the syslog mean?

2) Is it possible to configure SMARTHISTORY and unRAID in such a way that they work from unMENU?

Link to comment
