unable to build parity with new config option after multiple drive failure


MrLondon


ok I have now started the new configuration with 3 drives removed and 3 new drives inserted, and the rebuild has begun. But am I meant to see the contents of my shares? I can see the data under /diskxx but not under //tower/sharename... is that normal?
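A sketch of how one might confirm where a share's files actually live; the helper name and the "Movies" share are made up, and the /mnt/diskN vs /mnt/user layout is the usual unRAID convention:

```shell
# Sketch: report where a share's directory exists on the per-disk
# mounts versus the merged user share. "Movies" is a hypothetical
# share name.
check_share() {
  share="$1"
  root="${2:-/mnt}"   # overridable mount root, mainly for testing
  for d in "$root"/disk*/"$share"; do
    [ -d "$d" ] && echo "on-disk: $d"
  done
  if [ -d "$root/user/$share" ]; then
    echo "user share: ok"
  else
    echo "user share NOT visible"
  fi
}
check_share Movies
```

If the per-disk copies show up but the user share does not, that points at the user-share layer (shfs) rather than the data itself.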

 

is this normally caused by cable issues?

 

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [7175 10213 0x0 SD]

Sep  8 12:47:59 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 14247 does not match to the expected one 3

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 373216999. Fsck?

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [9874 9875 0x0 SD]

Sep  8 12:47:59 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 25565 does not match to the expected one 2

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 353699823. Fsck?

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [7175 10213 0x0 SD]

Sep  8 12:47:59 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 14247 does not match to the expected one 3

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 373216999. Fsck?

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [9874 9878 0x0 SD]

 

should I be stopping parity sync

 

Sep  8 12:31:51 Tower shfs/user: shfs_readdir: fstatat: Boss (13) Permission denied (Drive related)

Sep  8 12:31:51 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk14/series (13) Permission denied

Sep  8 12:31:51 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 31632 does not match to the expected one 1 (Minor Issues)

Sep  8 12:31:51 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 301802752. Fsck? (Errors)

Sep  8 12:31:51 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [723 1108 0x0 SD] (Errors)

Sep  8 12:33:01 Tower shfs/user: shfs_readdir: fstatat: Boss (13) Permission denied (Drive related)

Sep  8 12:33:01 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk14/series (13) Permission denied

Sep  8 12:33:01 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 31632 does not match to the expected one 1 (Minor Issues)

Sep  8 12:33:01 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 301802752. Fsck? (Errors)

Sep  8 12:33:01 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [723 1108 0x0 SD] (Errors)

Sep  8 12:33:54 Tower sshd[4302]: Accepted password for root from 192.168.0.7 port 55712 ssh2

Sep  8 12:34:35 Tower shfs/user: shfs_readdir: fstatat: Boss (13) Permission denied (Drive related)

Sep  8 12:34:35 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk14/series (13) Permission denied

 

how can it get permission denied?
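For what it's worth, the "(13)" in those lines is errno 13, EACCES, so the shfs process is being refused by the directory's mode or ownership, or is tripping over the underlying corruption. A sketch of how one might inspect a flagged path (GNU stat assumed; the helper name is made up):

```shell
# Sketch: show mode and ownership of a directory flagged with
# "Permission denied" (errno 13 = EACCES).
inspect_dir() {
  ls -ld "$1"                  # symbolic mode, owner, group
  stat -c '%a %U:%G %n' "$1"   # octal mode plus ownership (GNU stat)
}
inspect_dir /mnt/disk14/series 2>/dev/null || true  # path from the syslog above
```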

 

I have now stopped the array, taken that drive out of the 5x3 cage (which already has one suspect slot), and connected a SATA cable directly to the drive instead.

syslog-2014-09-08.zip


ok I finished the rebuild-tree and it found some files. However, when I now try to run the resync, the 3TB parity drive drops from the system. I have already reseated the drive, used a different port, and even moved it out of my Norco 5x3 cage, but still the same problem. Any suggestions?

 

Done a new configuration once again and used a different 3TB drive as the parity drive; let's see if that works.

syslog-2014-09-10.zip


Well it's good news and bad news!  The good news is there's nothing wrong with any of the drives, and probably never was!  The bad news is there's something seriously wrong with the SAS system, and I don't know if it's the SAS software (one of the drivers or other SAS support software) or something wrong with one of the SAS cards.

 

Syslog 2 and syslog 3 are roughly the same, an attempt to rebuild the parity drive, except syslog 2 is in safe mode and syslog 3 loads XEN and plugins.  Syslog 3 loads 18 drives and syslog 2 loads 17, missing Disk 13.  They both fail the same way, in one hour in syslog 2 and almost 4 hours in syslog 3.  At the SAS fail moment, sas_scsi_recover_host is entered and reports 18 failed (17 for syslog 2), then aborts all 18 tasks (17 in syslog 2), after which all writes to the parity drive are failed.  Shortly after, the parity drive is disabled by the kernel (no response at all).  You can then ignore ALL of the subsequent errors related to the Parity drive.  It's not the Parity drive's fault, it just happened to be the drive that was being written, and I suspect other drives might have failed likewise if they had had any I/O attempted.  What is really odd here (at least to me) is that the SAS subsystem is reporting 17 and 18 tasks, but there are 6 drives on the motherboard, using AHCI, not the SAS system!  So it seems as if the SAS error system should only have reported managing 11 or 12 drives/tasks.
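For anyone following along, the recovery events described above can be pulled out of a saved syslog with a grep along these lines; the helper name is made up and the patterns are guesses based on the messages quoted in this thread:

```shell
# Sketch: list the SAS error-recovery lines (with line numbers) from a
# saved syslog, to see the busy/failed task counts at each crash.
sas_events() {
  grep -nE 'sas_scsi_recover_host|sas: Enter|task abort' "$1"
}
# e.g.  sas_events /boot/logs/syslog-2014-09-08.txt
```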

 

In your first syslog, the start of the array showed transactions being replayed on almost all of the drives plus file system corruption on Disk 14.  I suspect you had previously had a similar SAS crash, during a write to Disk 14.  In other words, a similar cause, but I don't know that for sure.

 

I don't know how to advise, I haven't seen this before.  This seems to be a more major problem than just replacing drives.


Yeah, especially as I already moved off the SAS card onto the motherboard and still get the same errors. The md14 drive was a rebuild onto a replacement for a failed drive, and I only noticed the problem after the failed drive had been shipped back. I have sent Tom an email asking for help, as I am fresh out of ideas.

 

seems to be getting worse now... the monitor connected to it is being flooded with these messages

 

Sep 10 16:15:42 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 40958 does not match to the expected one 1

Sep 10 16:15:42 Tower kernel: REISERFS error (device md15): vs-5150 search_by_key: invalid format found in block 472002343. Fsck?

Sep 10 16:15:42 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 40958 does not match to the expected one 1

Sep 10 16:15:42 Tower kernel: REISERFS error (device md15): vs-5150 search_by_key: invalid format found in block 472002343. Fsck?

Sep 10 16:15:42 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 40958 does not match to the expected one 1

Sep 10 16:15:42 Tower kernel: REISERFS error (device md15): vs-5150 search_by_key: inva

 

I have decided to shut down the array now... before I lose more data. I used the powerdown script, however that did not even work, and I had to hard power down the server... something very strange is going on.


The log messages indicate there is file system corruption on disk15 (I assume this is the same drive that was previously showing as disk14 in the log that RobJ commented on, and that the drive number has changed for some reason?). I would have thought it was worth resolving this issue before trying to rebuild parity.

 

Having said that, such corruption should not stop parity being rebuilt - it would just mean that the parity disk would also reflect the corrupt file system. It depends on whether you have decided that it is more important to rebuild parity than to correct the problem on disk15. If you really want to give rebuilding parity priority, then since this drive is reporting errors I think you would be better off temporarily doing a 'new config' and defining the array with this drive omitted. That may help with rebuilding parity without getting the syslog flooded with the reiserfs errors, and if not, there is more chance of spotting any other error being reported while the rebuild takes place. You can then treat recovering data from the problem disk as a separate issue.

 

I cannot think of a reason not to use beta 9 now. If anything it is probably the best one to go with, as you want to know if beta 9 has any issues with your system that are specific to that release. However, this problem is probably not affected by which release of unRAID you are using - it feels more like something at the physical level.


I have now mounted the md15 drive in another machine and am running reiserfsck --check /dev/md1 against it, to see if I get anything from that drive.

 

 

###########

reiserfsck --check started at Fri Sep 12 04:13:27 2014

###########

Replaying journal: Done.

Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. finished

Comparing bitmaps..finished

Checking Semantic tree:

finished

No corruptions found

There are on the filesystem:

        Leaves 683156

        Internal nodes 4412

        Directories 1478

        Other files 19289

        Data block pointers 687587240 (0 of them are zero)

        Safe links 0

###########

reiserfsck finished at Fri Sep 12 04:32:06 2014

 

seems all ok

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0

  3 Spin_Up_Time            0x0027  178  176  021    Pre-fail  Always      -      6075

  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      834

  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x002e  100  253  000    Old_age  Always      -      0

  9 Power_On_Hours          0x0032  093  093  000    Old_age  Always      -      5784

10 Spin_Retry_Count        0x0032  100  100  000    Old_age  Always      -      0

11 Calibration_Retry_Count 0x0032  100  100  000    Old_age  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      144

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      25

193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      808

194 Temperature_Celsius    0x0022  115  106  000    Old_age  Always      -      35

196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0

200 Multi_Zone_Error_Rate  0x0008  100  253  000    Old_age  Offline      -      0

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%      5784        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing
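As a side note, the attributes worth watching in a report like this are 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector), 198 (Offline_Uncorrectable) and 199 (UDMA_CRC_Error_Count), all zero here. A small awk filter over 'smartctl -A' output can flag any nonzero ones (a sketch; the helper name is made up):

```shell
# Sketch: flag SMART attributes 5, 197, 198 and 199 when their raw
# value is nonzero. Feed it the attribute table from: smartctl -A /dev/sdX
flag_smart() {
  awk '($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) && $NF != 0 {
         print "WARN", $2, "raw =", $NF
       }'
}
# e.g.  smartctl -A /dev/sdk | flag_smart
```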


That is confusing - not sure why the reiserfsck issue was being reported against that drive in the first place. The SMART report also looks fine. I guess the good news is that the data appears to be intact.

 

If you bring up the array without that drive and start the parity sync, do you still get these messages in the syslog? If you do, that would suggest the removed drive was not the one with the reiserfs error.

 

If not, I can only suggest that you stop the parity sync, reset the array to include disk15 again, and restart the parity sync to see what happens. If the errors re-appear, that suggests the disk is not currently being handled correctly, as it worked OK in the other machine. This would point to some sort of hardware-related issue, or a problem at the driver level for the disk controller.


I am now running reiserfsck for all my 19 drives, just to make sure they all pass before attempting another parity build.

Good idea. 

 

It is something I tend to do periodically as a safety check, since I tend to always be running the latest beta releases. The problem is that as drives get bigger (my largest are now 6TB) it is getting harder to find long enough time slots to do this in. Ideally I would like to be able to run the check with the array still active, but I do not think this is possible.
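A sketch of that whole-array pass, printing the commands first so they can be reviewed before running; the --yes flag to skip reiserfsck's confirmation prompt is taken from reiserfsck(8), and the device count and /dev/mdN naming follow this thread:

```shell
# Sketch: emit a read-only reiserfsck check command for each array
# device. Review the list, then pipe it to sh to actually run it.
check_all() {
  for i in $(seq 1 "$1"); do
    echo "reiserfsck --check --yes /dev/md$i"
  done
}
check_all 19
# check_all 19 | sh    # run the checks for real (array must be started)
```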


Will read-only check consistency of the filesystem on /dev/md8

Will put log info to 'stdout'

 

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

###########

reiserfsck --check started at Fri Sep 12 13:32:21 2014

###########

Replaying journal: Trans replayed: mountid 212, transid 62314, desc 7440, len 1, commit 7442, next trans offset 7425

Trans replayed: mountid 212, transid 62315, desc 7443, len 1, commit 7445, next trans offset 7428

Replaying journal: Done.

Reiserfs journal '/dev/md8' in blocks [18..8211]: 2 transactions replayed

Zero bit found in on-disk bitmap after the last valid bit.

Checking internal tree.. \/  1 (of  22|/  9 (of  91// 31 (of  88|block 348880938: The level of the node (0) is not correct, (1) expected

the problem in the internal node occured (348880938), whole subtree/ 10 (of  91/block 322338874: The level of the node (0) is not correct, (2) expected

the problem in the internal node occured (322338874), whole subtree/  2 (of  22-block 305863291: The level of the node (0) is not correct, (3) expected

the problem in the internal node occured (305863291), whole subtree is skipped finished

Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Bad nodes were found, Semantic pass skipped

3 found corruptions can be fixed only when running with --rebuild-tree

###########

reiserfsck finished at Fri Sep 12 13:34:26 2014

###########

 

so that is already one drive where it claims there is corruption. Should I run the rebuild-tree?

 

I have tested all the others, md1-18, and all of them pass except this one... so it appears only md8 has this corruption...
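The usual sequence for a volume that fails --check, per reiserfsck(8), is check, rebuild-tree, re-check; --rebuild-tree rewrites the internal tree in place, so back the drive up first if at all possible. A sketch that just prints the plan for a given device (the helper name is made up):

```shell
# Sketch: print the repair sequence for one reiserfs device.
# --rebuild-tree is destructive if interrupted, so back up first.
rebuild_plan() {
  cat <<EOF
reiserfsck --check $1
reiserfsck --rebuild-tree $1
reiserfsck --check $1
EOF
}
rebuild_plan /dev/md8
```

Recovered but unattached files, if any, typically end up in lost+found on the volume.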


yeah, I tried to copy the files from that drive but it was constantly hanging, so I have started the rebuild-tree now and will see how many of the files are actually good and how many are corrupted. It appears this was one of the drives I had successfully rebuilt, but with this corruption it seems the rebuild might not have been so successful after all.


ok, still no better. I have now fixed the md8 drive issues; it was the only drive that failed the reiserfsck --check, all others passed successfully. I have also formatted 4 drives as XFS successfully, just to make sure they don't give me further grief. However, when I started the parity build it almost instantly failed with the same errors.

 

So the parity drive failed on a Supermicro AOC-SASLP-MV8 port (actually two different ports on two different cables), and now I have moved it to a 3Gb/s motherboard SATA port and it's actually failing faster than before. I highly doubt that all these ports could be faulty, but I don't know what else to try right now. I have shut down the server again, and again I am getting reiserfs errors, now from md16, which passed the check successfully.

syslog-2014-09-14.zip


As long as the SAS crashes keep happening (see my previous post), you will just be playing Whac-A-Mole with the drives.

 

Since the SAS subsystem appears to be involving itself somehow with ALL of the drives, including the motherboard drives, I think you have to stop and deal with the SAS problem first.  You have 2 different SAS cards.  The first thing to do is check their firmware versions, upgrade them if possible, and test.  I really hope that fixes the issue.  Otherwise, I'm afraid you may have to buy another SAS card, substitute it for one of the installed ones, and test.  If no improvement, then substitute it for the other, and test again.


ok, still no better. I have now fixed the md8 drive issues; it was the only drive that failed the reiserfsck --check, all others passed successfully. I have also formatted 4 drives as XFS successfully, just to make sure they don't give me further grief. However, when I started the parity build it almost instantly failed with the same errors.

 

So the parity drive failed on a Supermicro AOC-SASLP-MV8 port (actually two different ports on two different cables), and now I have moved it to a 3Gb/s motherboard SATA port and it's actually failing faster than before. I highly doubt that all these ports could be faulty, but I don't know what else to try right now. I have shut down the server again, and again I am getting reiserfs errors, now from md16, which passed the check successfully.

This syslog of 9-14 shows an upgrade to 6.0-beta9, with unraidsafemode set (like syslog #2, which was -beta6), but shows the same SAS subsystem failures.  I don't fully understand the SAS task counts yet: the previous -beta6 syslogs showed failing task counts equivalent to the number of array data drives (17 and 18), while this -beta9 syslog shows 20 tasks, equivalent to the total number of drives.

Sep 13 23:58:00 Tower kernel: sas: Enter sas_scsi_recover_host busy: 20 failed: 20

All 20 tasks are then aborted, then immediately the parity drive (sdk on a SAS controller port) is found to be completely unresponsive and quickly disabled, exactly like the previous syslogs.

 

The Reiser file system corruption occurs over 9 hours later, out of the blue.  This NEVER happens normally!  Because it has happened multiple times with your setup, I have to assume there is a connection with the previous SAS failures, but by unknown mechanism.  So far, there is no indication of any physical issues with the drives themselves.

 

Actually I have two SAS controllers plus the onboard controller, and the AOC took hours to fail while the onboard failed quickly, so I don't think it's one controller giving issues.

I'm not seeing how you conclude that, as so far it has always involved the SAS subsystem failing.  Can you post a syslog with the same failure where the parity drive is attached to the motherboard?

