Array Yellow, Parity Check Missing, but everything else seems fine. 6 Beta-14


guyonphone

Syslog does not show any drive or array issues either (unless I missed it!).  I don't understand what you mean by a 'yellow light', could you create a screen pic and attach it?

 

By the way, the preferred way to provide syslogs is described in this post: Need help? Read me first!

 

Also, you have IDE emulation turned on for 2 of your SATA drives (sdh and sdi, Disk 4 and Disk 5).  When you next boot, go into the BIOS settings, look for the SATA mode, and change it to a native SATA mode, preferably AHCI if available; anything but IDE emulation mode.  It should be a little faster, and a little safer.

Link to comment

Thanks for the tip on IDE emulation; I will disable that.

 

qZIzBPj.png

 

Smart Report for Drive 13:

 

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.18.5-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1CH166
Serial Number:    Z1F2HEZY
LU WWN Device Id: 5 000c50 05002695c
Firmware Version: CC24
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue May 26 22:13:53 2015 PDT

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(  584) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				No Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 340) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       85858456
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1865
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   066   055   030    Pre-fail  Always       -       30098780254
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16652
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       177
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       12 12 12
189 High_Fly_Writes         0x003a   097   097   000    Old_age   Always       -       3
190 Airflow_Temperature_Cel 0x0022   065   052   045    Old_age   Always       -       35 (Min/Max 34/37)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       124
193 Load_Cycle_Count        0x0032   061   061   000    Old_age   Always       -       78861
194 Temperature_Celsius     0x0022   035   048   000    Old_age   Always       -       35 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       4188h+49m+33.031s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       17989534392
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       253824366303

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     16629         -
# 2  Short offline       Completed without error       00%      5023         -
# 3  Short offline       Completed without error       00%      5019         -
# 4  Short offline       Completed without error       00%       168         -
# 5  Short offline       Interrupted (host reset)      00%       168         -
# 6  Short offline       Completed without error       00%       167         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Link to comment

I got impatient, wrote down my drive order, and then ran the "New Config" command. The array is now back up and running. Everything looks good, and parity is rebuilding. I know this was a risky move, but I've had similar issues in the past, and when it comes down to it, I don't keep anything one of a kind on my array. It's a secondary backup location and media server for me, so even if I did lose data, there isn't anything that couldn't be replaced. Thanks for your help.

Link to comment

I looked back through your syslog to see if I had missed anything on Disk 13, but found no clues at all, EXCEPT that there were no spinups or spindowns for Disk 13!  That's somewhat aggravating for support people, although it is the way the unRAID system is supposed to work: it emulates a drive so well that it looks like it is still working, even though it has been dropped from the config.  I have to assume that it was red balled in a previous session and dropped.  There was also an unclean shutdown, so the system may have crashed?

 

This is a perfect example of where the Diagnostics dump would have helped, giving us detail behind the scenes as to the true condition of the array and its drives.  I'm hoping the feature is added to v6 before it goes final; it's in testing now.  Another feature that would have helped you, requested but not yet coded, is an enhancement to New Config: an option to start with the old config, so you don't have to remember and re-select all of the drives again, which is both time consuming and error prone.

 

There were 2 ways to correct the situation.  You did one, and that's fine, rebuilding parity.  The other would have been safer, to remove and re-add Disk 13, and rebuild it in place.  If there were any writes to Disk 13, they have been lost now, since the writes would have been to the simulated Disk 13, not the real Disk 13.  It is possible that there is data loss now.

Link to comment

Hi again,

 

My parity failed to rebuild and now I have a write error on Disk 11. I was able to get the syslog this time, before any reboot. I tried to get a SMART report, but it just gave me this error.

 

root@Unraid:~# smartctl -a -A /dev/sdw
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.18.5-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

sFH8nIN.png

 

Thanks

Unraid_issue.zip

Link to comment

That looked very familiar, and then I realized you are still running 6.0-beta14b.  We used to see a fair number of cases of this, where a whole SAS card seems to drop, and whatever drive is accessed next is often dropped from the array, because all or most of the drives on that card can no longer respond.  PLEASE upgrade, at least to 6.0-beta15, or the latest -rc.  Since beta15 was released, I don't think I've seen a case of this.  SAS card support does not yet seem to be completely stable, and if I had one, I would constantly check for newer firmware for the card, and newer unRAID releases, even if they are beta, in order to always have the latest SAS support modules, drivers, and firmware.  Support for them does seem to be improving, with fewer problems now.

 

The SMART error is only because once the drive is dropped, it cannot respond, until after a reboot (sometimes needs full power off).  I expect that after a reboot, it will look fine.

 

By the way, Disk 13 also looked like it was about to be dropped, but Disk 11 beat it.  Subsequently, there was file system trouble on Disk 8 and Disk 14.  You will want to run reiserfsck on all of them, just in case; see Check Disk File systems.
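
As a rough sketch of the check suggested above, the helper below only prints one read-only reiserfsck command per disk, so each can be reviewed and run by hand. The function name `check_cmds` is made up for illustration, and the mapping of Disk N to /dev/mdN is an assumption based on the /dev/md1 and md11 names that appear in this thread.

```shell
#!/bin/sh
# Print (do not run) a read-only reiserfsck command for each disk number given.
# Assumption: unRAID Disk N is exposed as /dev/mdN, as with /dev/md1 in this thread.
check_cmds() {
    for n in "$@"; do
        printf 'reiserfsck --check /dev/md%s\n' "$n"
    done
}

# Disks mentioned in this post: 8, 11, 13 and 14.
check_cmds 8 11 13 14
```

Each printed command still asks for an interactive "Yes" when it is run, so nothing here is checked unattended.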

Link to comment

Hello RobJ,

 

This makes total sense!

 

I kept getting this in my logs

May 27 18:42:04 Unraid avahi-daemon[24573]: server.c: Packet too short or invalid while reading response record. (Maybe a UTF-8 problem?)

 

and when I googled it, I found forum postings for other versions of Linux indicating it was a bug that was fixed in RHEL, so I assume it's something similar for Unraid. I want to upgrade to the newest version, but I was unsure if I should do so with my parity not built. I just created a new config again and got everything back up and running. I will upgrade to the newest version ASAP!

 

I just finished running reiserfsck; here are the results:

 

root@Unraid:~# reiserfsck --check /dev/md1
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/md1
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Wed May 27 18:02:49 2015
###########
Replaying journal: Done.
Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 722202
        Internal nodes 4430
        Directories 3223
        Other files 42938
        Data block pointers 725379565 (37 of them are zero)
        Safe links 0
###########
reiserfsck finished at Wed May 27 18:30:20 2015
###########

 

Thanks for confirming the issue, you guys rock!

Link to comment

I updated to 6-RC3 and then tried to rebuild parity, and basically the same thing happened: Disk 11 has a ton of errors on it, and the parity rebuild has been suspended. I tried to grab the syslog, but when I try to copy it by running the command:

 

cp /var/log/syslog /boot/syslog.txt

 

I get a zero KB file with no log in it. I then tried to just print the log to the screen and saw nothing, as if it's blank or the logging service has crashed.

 

At the console I see the following:

 

tar: domain.img: Cannot change ownership to uid 0, gid 999: operation not permitted
tar: Exiting with failure status due to previous errors
REISERFS abort (device md11): Journal write error in flush_commit_list 

 

I can type at the console, but have no prompt, so it's not taking commands.

 

Any suggestions on how to proceed? Should I replace disk 11?
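
The empty syslog copy described in this post can be caught right away with a small wrapper around the same cp command. This is only a sketch: `copy_log` is a hypothetical helper name, and the paths in the usage comment are the ones from the command above.

```shell
#!/bin/sh
# Copy a log file and report whether the copy actually contains data.
# Returns 0 on success, 1 if cp fails, 2 if the copy is empty.
copy_log() {
    src=$1
    dest=$2
    cp "$src" "$dest" || return 1
    if [ -s "$dest" ]; then
        echo "copied: $dest"
    else
        echo "warning: $dest is empty" >&2
        return 2
    fi
}

# Usage, with the paths from this post:
#   copy_log /var/log/syslog /boot/syslog.txt
```

An empty result only shows that nothing was written to the source file; it does not say why logging stopped.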

ta6XqHi.png

Link to comment

OK, I got a real log of the failure, I'm pretty sure. Please see attached.

 

Actually, I messed up a bit, because the parity check was still going even with this error (I canceled it 'cause I'm dumb). The only btrfs filesystem I have is the docker image on my xfs cache drive, so I think it's a bit odd that I'm receiving a btrfs file system error. I'm pretty much at a loss as to what is happening; please set me straight.

 

Thanks :-)

syslog.zip

Link to comment

It does look like the other user's errors, except you have already upgraded to -rc3, so no solution there.  You might reconnect the Cache drive to one of the unused ports on the second 2-port card.

Link to comment

Hello RobJ,

 

I will do that, but I am away from the server right this second (at work). However, I did just run a few scrubs without -r, so that it corrects whatever it finds. I get a printout that looks like the following:

 

WARNING: errors detected during scrubbing, corrected.
scrub device /dev/loop0 (id 1) done
scrub started at Fri May 29 14:08:08 2015 and finished after 842 seconds
data_extents_scrubbed: 240178
tree_extents_scrubbed: 67046
data_bytes_scrubbed: 6628896768
tree_bytes_scrubbed: 1098481664
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 496
csum_discards: 158009
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 167
corrected_errors: 0
last_physical: 10737418240

 

One thing I don't understand is "unverified_errors": every time I run a scrub, that number changes. What does that mean?

Link to comment

I can't help you with that, practically no experience with BTRFS.  I've much to learn here, much to research, some day.  Hopefully, someone else with more relevant experience will jump in.

Link to comment

Hello RobJ,

 

If you have much to learn, then I have mountains to learn :-). In any case, I had reached out to Alexandro from the other thread, and he informed me that running a few scrubs helped him. I have done the same: I re-ran the scrub several times without the "-r" until the unverified_errors count was 0 a few times in a row (it fluctuated up and down for a bit, then went consistently down to 0). I am now 75.4% through rebuilding parity. I hope it finishes; I'll update this ticket as solved once parity is rebuilt. I am also tailing the log, just in case something happens, so I don't miss it.
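
To watch that counter from run to run without reading the whole printout, a filter along these lines could pull it out of the scrub statistics. It is a minimal sketch: `unverified_errors` as a function name is made up here, and the btrfs invocation and mount path in the usage comment are assumptions that are not taken from this thread.

```shell
#!/bin/sh
# Read scrub statistics on stdin and print the value of unverified_errors.
# Lines are expected in the "name: value" form shown in the printout earlier;
# leading whitespace before the name is tolerated.
unverified_errors() {
    awk -F': *' '{ sub(/^[ \t]+/, "", $1) } $1 == "unverified_errors" { print $2 }'
}

# Usage (assumed command and path, adjust to your own setup):
#   btrfs scrub start -B -R /var/lib/docker | unverified_errors
```

Fed the printout posted earlier in this thread, the filter prints 167.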

 

Thanks

Link to comment

WOOHOO! Parity is fully rebuilt and I have a healthy array again!

 

There must have been some sort of bug in 6-Beta14 that upgrading to 6-RC3 resolved, but then I needed to scrub the BTRFS filesystem of my docker image file. Anyways, the issue is now resolved and everything is right in the world. :D

 

Thanks guys!

Link to comment

Archived

This topic is now archived and is closed to further replies.