unable to build parity with new config option after multiple drive failure


MrLondon


ok I have now started the new configuration with 3 drives removed and 3 new drives inserted, and the rebuild has begun. But am I meant to see the contents of my shares? I can see the data under /diskxx but not under //tower/sharename... is that normal?
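A sketch of how one might confirm where a share's files actually live; the helper name and the "Movies" share are made up, and the /mnt/diskN vs /mnt/user layout is the usual unRAID convention:

```shell
# Sketch: report where a share's directory exists on the per-disk
# mounts versus the merged user share. "Movies" is a hypothetical
# share name.
check_share() {
  share="$1"
  root="${2:-/mnt}"   # overridable mount root, mainly for testing
  for d in "$root"/disk*/"$share"; do
    [ -d "$d" ] && echo "on-disk: $d"
  done
  if [ -d "$root/user/$share" ]; then
    echo "user share: ok"
  else
    echo "user share NOT visible"
  fi
}
check_share Movies
```

If the per-disk copies show up but the user share does not, that points at the user-share layer (shfs) rather than the data itself.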

 

is this normally caused by cable issues?

 

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [7175 10213 0x0 SD]

Sep  8 12:47:59 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 14247 does not match to the expected one 3

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 373216999. Fsck?

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [9874 9875 0x0 SD]

Sep  8 12:47:59 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 25565 does not match to the expected one 2

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 353699823. Fsck?

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [7175 10213 0x0 SD]

Sep  8 12:47:59 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 14247 does not match to the expected one 3

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 373216999. Fsck?

Sep  8 12:47:59 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [9874 9878 0x0 SD]

 

should I be stopping parity sync

 

Sep  8 12:31:51 Tower shfs/user: shfs_readdir: fstatat: Boss (13) Permission denied (Drive related)

Sep  8 12:31:51 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk14/series (13) Permission denied

Sep  8 12:31:51 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 31632 does not match to the expected one 1 (Minor Issues)

Sep  8 12:31:51 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 301802752. Fsck? (Errors)

Sep  8 12:31:51 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [723 1108 0x0 SD] (Errors)

Sep  8 12:33:01 Tower shfs/user: shfs_readdir: fstatat: Boss (13) Permission denied (Drive related)

Sep  8 12:33:01 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk14/series (13) Permission denied

Sep  8 12:33:01 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 31632 does not match to the expected one 1 (Minor Issues)

Sep  8 12:33:01 Tower kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 301802752. Fsck? (Errors)

Sep  8 12:33:01 Tower kernel: REISERFS error (device md14): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [723 1108 0x0 SD] (Errors)

Sep  8 12:33:54 Tower sshd[4302]: Accepted password for root from 192.168.0.7 port 55712 ssh2

Sep  8 12:34:35 Tower shfs/user: shfs_readdir: fstatat: Boss (13) Permission denied (Drive related)

Sep  8 12:34:35 Tower shfs/user: shfs_readdir: readdir_r: /mnt/disk14/series (13) Permission denied

 

how can it get permission denied?
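For what it's worth, the "(13)" in those lines is errno 13, EACCES, so the shfs process is being refused by the directory's mode or ownership, or is tripping over the underlying corruption. A sketch of how one might inspect a flagged path (GNU stat assumed; the helper name is made up):

```shell
# Sketch: show mode and ownership of a directory flagged with
# "Permission denied" (errno 13 = EACCES).
inspect_dir() {
  ls -ld "$1"                  # symbolic mode, owner, group
  stat -c '%a %U:%G %n' "$1"   # octal mode plus ownership (GNU stat)
}
inspect_dir /mnt/disk14/series 2>/dev/null || true  # path from the syslog above
```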

 

I have now stopped the array, taken that drive out of the 5x3 cage (which already has one suspect slot), and connected a SATA cable directly to the drive instead.

syslog-2014-09-08.zip


ok I finished the rebuild-tree and it found some files. However, when I now try to run the resync, the 3TB parity drive drops from the system. I have already reseated the drive, used a different port, and even moved it out of my Norco 5x3 cage, but still the same problem. Any suggestions?

 

Done a new configuration once again and used a different 3TB drive as the parity drive; let's see if that works.

syslog-2014-09-10.zip


Well it's good news and bad news!  The good news is there's nothing wrong with any of the drives, and probably never was!  The bad news is there's something seriously wrong with the SAS system, and I don't know if it's the SAS software (one of the drivers or other SAS support software) or something wrong with one of the SAS cards.

 

Syslog 2 and syslog 3 are roughly the same, an attempt to rebuild the parity drive, except syslog 2 is in safe mode and syslog 3 loads XEN and plugins.  Syslog 3 loads 18 drives and syslog 2 loads 17, missing Disk 13.  They both fail the same way, in one hour in syslog 2 and almost 4 hours in syslog 3.  At the SAS fail moment, sas_scsi_recover_host is entered and reports 18 failed (17 for syslog 2), then aborts all 18 tasks (17 in syslog 2), after which all writes to the parity drive are failed.  Shortly after, the parity drive is disabled by the kernel (no response at all).  You can then ignore ALL of the subsequent errors related to the Parity drive.  It's not the Parity drive's fault, it just happened to be the drive that was being written, and I suspect other drives might have failed likewise if they had had any I/O attempted.  What is really odd here (at least to me) is that the SAS subsystem is reporting 17 and 18 tasks, but there are 6 drives on the motherboard, using AHCI, not the SAS system!  So it seems as if the SAS error system should only have reported managing 11 or 12 drives/tasks.
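For anyone following along, the recovery events described above can be pulled out of a saved syslog with a grep along these lines; the helper name is made up and the patterns are guesses based on the messages quoted in this thread:

```shell
# Sketch: list the SAS error-recovery lines (with line numbers) from a
# saved syslog, to see the busy/failed task counts at each crash.
sas_events() {
  grep -nE 'sas_scsi_recover_host|sas: Enter|task abort' "$1"
}
# e.g.  sas_events /boot/logs/syslog-2014-09-08.txt
```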

 

In your first syslog, the start of the array showed transactions being replayed on almost all of the drives plus file system corruption on Disk 14.  I suspect you had previously had a similar SAS crash, during a write to Disk 14.  In other words, a similar cause, but I don't know that for sure.

 

I don't know how to advise, I haven't seen this before.  This seems to be a more major problem than just replacing drives.


Yeah, especially as I already moved off the SAS card onto the motherboard and still get the same errors. The md14 drive was a rebuild onto a replacement for a failed drive, and I only noticed the problem after the failed drive had been shipped back. I have sent Tom an email asking for help, as I am fresh out of ideas.

 

seems to be getting worse now... the monitor connected to it is being flooded with these messages

 

Sep 10 16:15:42 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 40958 does not match to the expected one 1

Sep 10 16:15:42 Tower kernel: REISERFS error (device md15): vs-5150 search_by_key: invalid format found in block 472002343. Fsck?

Sep 10 16:15:42 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 40958 does not match to the expected one 1

Sep 10 16:15:42 Tower kernel: REISERFS error (device md15): vs-5150 search_by_key: invalid format found in block 472002343. Fsck?

Sep 10 16:15:42 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 40958 does not match to the expected one 1

Sep 10 16:15:42 Tower kernel: REISERFS error (device md15): vs-5150 search_by_key: inva

 

I have decided to shut down the array now... before I lose more data. I used the powerdown script, however that did not even work, and I had to hard power down the server... something very strange is going on.


The log messages indicate there is file system corruption on disk15 (I assume this is the same drive that was previously showing as disk14 in the log that RobJ commented on, and that the drive number has changed for some reason?). I would have thought it was worth resolving this issue before trying to rebuild parity.

 

Having said that, such corruption should not stop parity being rebuilt - it would just mean that the parity disk would also reflect the corrupt file system. It depends on whether you have decided that it is more important to rebuild parity than to correct the problem on disk15. If you really want to give rebuilding parity priority, then since this drive is reporting errors I think you would be better off temporarily doing a 'new config' and defining the array with this drive omitted. That may help with rebuilding parity without getting the syslog flooded with the reiserfs errors, and if not, there is more chance of spotting any other error being reported while the rebuild takes place. You can then treat recovering data from the problem disk as a separate issue.

 

I cannot think of a reason not to use beta 9 now. If anything it is probably the best one to go with, as you want to know if beta 9 has any issues with your system that are specific to that release. However, this problem is probably not affected by which release of unRAID you are using - it feels more like something at the physical level.


I have now mounted the md15 drive in another machine and am running reiserfsck --check /dev/md1 against it, to see if I get anything from that drive.

 

 

###########

reiserfsck --check started at Fri Sep 12 04:13:27 2014

###########

Replaying journal: Done.

Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed

Checking internal tree.. finished

Comparing bitmaps..finished

Checking Semantic tree:

finished

No corruptions found

There are on the filesystem:

        Leaves 683156

        Internal nodes 4412

        Directories 1478

        Other files 19289

        Data block pointers 687587240 (0 of them are zero)

        Safe links 0

###########

reiserfsck finished at Fri Sep 12 04:32:06 2014

 

seems all ok

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0

  3 Spin_Up_Time            0x0027  178  176  021    Pre-fail  Always      -      6075

  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      834

  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x002e  100  253  000    Old_age  Always      -      0

  9 Power_On_Hours          0x0032  093  093  000    Old_age  Always      -      5784

10 Spin_Retry_Count        0x0032  100  100  000    Old_age  Always      -      0

11 Calibration_Retry_Count 0x0032  100  100  000    Old_age  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      144

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      25

193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      808

194 Temperature_Celsius    0x0022  115  106  000    Old_age  Always      -      35

196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0

200 Multi_Zone_Error_Rate  0x0008  100  253  000    Old_age  Offline      -      0

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%      5784        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing
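As a side note, the attributes worth watching in a report like this are 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector), 198 (Offline_Uncorrectable) and 199 (UDMA_CRC_Error_Count), all zero here. A small awk filter over 'smartctl -A' output can flag any nonzero ones (a sketch; the helper name is made up):

```shell
# Sketch: flag SMART attributes 5, 197, 198 and 199 when their raw
# value is nonzero. Feed it the attribute table from: smartctl -A /dev/sdX
flag_smart() {
  awk '($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) && $NF != 0 {
         print "WARN", $2, "raw =", $NF
       }'
}
# e.g.  smartctl -A /dev/sdk | flag_smart
```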


That is confusing - not sure why the reiserfsck issue was being reported against that drive in the first place. The SMART report also looks fine. I guess the good news is that the data appears to be intact.

 

If you bring up the array without that drive and start the parity sync, do you still get these messages in the syslog? If you do, that would suggest the removed drive was not the one with the reiserfs error.

 

If not, I can only suggest that you stop the parity sync, reset the array to include disk15 again, and restart the parity sync to see what happens. If the errors re-appear, that suggests the disk is not currently being handled correctly, as it worked OK in the other machine. This would point to some sort of hardware-related issue, or a problem at the driver level for the disk controller.


I am now running reiserfsck for all my 19 drives, just to make sure they all pass before attempting another parity build.

Good idea. 

 

It is something I tend to do periodically as a safety check, since I tend to always be running the latest beta releases. The problem is that as drives get bigger (my largest are now 6TB) it is getting harder to find long enough time slots to do this in. Ideally I would like to be able to run the check with the array still active, but I do not think this is possible.
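A sketch of that whole-array pass, printing the commands first so they can be reviewed before running; the --yes flag to skip reiserfsck's confirmation prompt is taken from reiserfsck(8), and the device count and /dev/mdN naming follow this thread:

```shell
# Sketch: emit a read-only reiserfsck check command for each array
# device. Review the list, then pipe it to sh to actually run it.
check_all() {
  for i in $(seq 1 "$1"); do
    echo "reiserfsck --check --yes /dev/md$i"
  done
}
check_all 19
# check_all 19 | sh    # run the checks for real (array must be started)
```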


Will read-only check consistency of the filesystem on /dev/md8

Will put log info to 'stdout'

 

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes

###########

reiserfsck --check started at Fri Sep 12 13:32:21 2014

###########

Replaying journal: Trans replayed: mountid 212, transid 62314, desc 7440, len 1, commit 7442, next trans offset 7425

Trans replayed: mountid 212, transid 62315, desc 7443, len 1, commit 7445, next trans offset 7428

Replaying journal: Done.

Reiserfs journal '/dev/md8' in blocks [18..8211]: 2 transactions replayed

Zero bit found in on-disk bitmap after the last valid bit.

Checking internal tree.. \/  1 (of  22|/  9 (of  91// 31 (of  88|block 348880938: The level of the node (0) is not correct, (1) expected

the problem in the internal node occured (348880938), whole subtree/ 10 (of  91/block 322338874: The level of the node (0) is not correct, (2) expected

the problem in the internal node occured (322338874), whole subtree/  2 (of  22-block 305863291: The level of the node (0) is not correct, (3) expected

the problem in the internal node occured (305863291), whole subtree is skipped finished

Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.

Bad nodes were found, Semantic pass skipped

3 found corruptions can be fixed only when running with --rebuild-tree

###########

reiserfsck finished at Fri Sep 12 13:34:26 2014

###########

 

so that is already one drive where it claims there is corruption. Should I run the rebuild-tree?

 

I have tested all the others, md1-18, and all of them pass except this one... so it appears only md8 has this corruption...
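The usual sequence for a volume that fails --check, per reiserfsck(8), is check, rebuild-tree, re-check; --rebuild-tree rewrites the internal tree in place, so back the drive up first if at all possible. A sketch that just prints the plan for a given device (the helper name is made up):

```shell
# Sketch: print the repair sequence for one reiserfs device.
# --rebuild-tree is destructive if interrupted, so back up first.
rebuild_plan() {
  cat <<EOF
reiserfsck --check $1
reiserfsck --rebuild-tree $1
reiserfsck --check $1
EOF
}
rebuild_plan /dev/md8
```

Recovered but unattached files, if any, typically end up in lost+found on the volume.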


yeah, I tried to copy the files from that drive but it was constantly hanging, so I have started the rebuild-tree now and will see how many of the files are actually good and how many are corrupted. It appears this was one of the drives I had successfully rebuilt, but with this corruption it seems the rebuild might not have been so successful after all.


ok, still no better. I have now fixed the md8 drive issues; it was the only drive that failed the reiserfsck --check, all others passed successfully. I have also formatted 4 drives as XFS successfully, just to make sure they don't give me further grief. However, when I started the parity build it almost instantly failed with the same errors.

 

So the parity drive failed on a Supermicro AOC-SASLP-MV8 port (actually two different ports on two different cables), and now I have moved it to a 3Gb/s motherboard SATA port and it's actually failing faster than before. I highly doubt that all these ports could be faulty, but I don't know what else to try right now. I have shut down the server again, and again I am getting reiserfs errors, now from md16, which passed the check successfully.

syslog-2014-09-14.zip


As long as the SAS crashes keep happening (see my previous post), you will just be playing Whac-A-Mole with the drives.

 

Since the SAS subsystem appears to be involving itself somehow with ALL of the drives, including the motherboard drives, I think you have to stop and deal with the SAS problem first.  You have 2 different SAS cards.  The first thing to do is check their firmware versions, upgrade them if possible, and test.  I really hope that fixes the issue.  Otherwise, I'm afraid you may have to buy another SAS card, substitute it for one of the installed ones, and test.  If no improvement, then substitute it for the other, and test again.


ok, still no better. I have now fixed the md8 drive issues; it was the only drive that failed the reiserfsck --check, all others passed successfully. I have also formatted 4 drives as XFS successfully, just to make sure they don't give me further grief. However, when I started the parity build it almost instantly failed with the same errors.

 

So the parity drive failed on a Supermicro AOC-SASLP-MV8 port (actually two different ports on two different cables), and now I have moved it to a 3Gb/s motherboard SATA port and it's actually failing faster than before. I highly doubt that all these ports could be faulty, but I don't know what else to try right now. I have shut down the server again, and again I am getting reiserfs errors, now from md16, which passed the check successfully.

This syslog of 9-14 shows an upgrade to 6.0-beta9, with unraidsafemode set (like syslog #2, which was -beta6), but shows the same SAS subsystem failures.  I don't fully understand the SAS task counts yet: the previous -beta6 syslogs showed failing task counts equivalent to the number of array data drives (17 and 18), while this -beta9 syslog shows 20 tasks, equivalent to the total number of drives.

Sep 13 23:58:00 Tower kernel: sas: Enter sas_scsi_recover_host busy: 20 failed: 20

All 20 tasks are then aborted, then immediately the parity drive (sdk on a SAS controller port) is found to be completely unresponsive and quickly disabled, exactly like the previous syslogs.

 

The Reiser file system corruption occurs over 9 hours later, out of the blue.  This NEVER happens normally!  Because it has happened multiple times with your setup, I have to assume there is a connection with the previous SAS failures, but by unknown mechanism.  So far, there is no indication of any physical issues with the drives themselves.

 

Actually I have two SAS controllers plus the onboard controller, and the AOC took hours to fail while the onboard failed quickly, so I don't think it's one controller giving issues.

I'm not seeing how you conclude that, as so far it has always involved the SAS subsystem failing.  Can you post a syslog with the same failure where the parity drive is attached to the motherboard?

