Joseph

[Resolved] 5 Errors After Every Parity Check

166 posts in this topic


3 hours ago, EdgarWallace said:

What is making me nervous is this

 

You need to run xfs_repair on disk5 (md5)
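The usual sequence looks something like this (a sketch, not unRAID-specific instructions: md5 maps to disk5 per the thread, the array must be started in maintenance mode so the filesystem is unmounted, and the `-n` dry run is standard xfs_repair practice; the guard just lets the sketch run harmlessly off the server):

```shell
# Sketch: preview with -n before letting xfs_repair write anything.
DEV="${DEV:-/dev/md5}"
if command -v xfs_repair >/dev/null 2>&1 && [ -b "$DEV" ]; then
    xfs_repair -n "$DEV"   # dry run: report problems, change nothing
    xfs_repair "$DEV"      # actual repair once the dry run looks sane
else
    echo "xfs_repair or $DEV not available here; run this on the server"
fi
```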

20 hours ago, DoeBoye said:

Good call. It's not fun if the Marvell card dumps/corrupts a bunch of drives. I picked up a used H310 off eBay a few weeks ago. Once I flashed my Dell H310 with fireball3's script, it runs perfectly. No need to cover any pins with tape on mine.

Where is this fireball script? Do you think I can re-flash my card with this script so I don't need to cover pins? Seems to me the pins would still have to be covered if the card is in the slot where the GPU is normally installed.

2 hours ago, Joseph said:

Where is this fireball script? Do you think I can re-flash my card with this script so I don't need to cover pins? Seems to me the pins would still have to be covered if the card is in the slot where the GPU is normally installed.

 

Here you go:

In my opinion, that post really needs to be stickied, as it's an extremely useful tool and it's a bit buried in another thread :).

 

As far as covering pins or not, I'm not sure if it is slot-dependent or card-dependent. All I can tell you is that many people have installed a Dell H310 without needing to cover any pins, and I seem to have the same model that you purchased and did not need to...


Forgot to mention: right before shutdown this last time, I noticed a line that flashed on the screen quickly that I think said ACPI error. What effect would it have on unRAID? How can errors on shutdown be captured for analysis?

 

 

16 minutes ago, Joseph said:

Forgot to mention: right before shutdown this last time, I noticed a line that flashed on the screen quickly that I think said ACPI error. What effect would it have on unRAID?

That's above my pay grade. Someone else will need to answer that. A quick forum search pulled up this answer from Joe L. from 2012, but I'm not sure if it's the same thing:

 

On 10/25/2012 at 8:38 AM, Joe L. said:

It means they all have the string of letters "error" somewhere in them. (That's the criterion for coloring them "red" in the syslog viewer in unMENU.)

 

Other than that, the messages themselves usually indicate that ACPI has been disabled in the BIOS (or disabled with 'noacpi' in your syslinux.conf file),

or, ACPI is poorly implemented in the BIOS,

or, the BIOS is using ACPI features not yet implemented in the linux kernel.

 

Look first for a BIOS update for your MB, make sure you've not disabled ACPI in the BIOS, and other than that, ignore the messages if everything seems to be working.

 

Joe L.
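To see which lines that rule would flag, something like this works once you have a copy of the syslog; the sample lines below are made up for the demo, not from a real unRAID log:

```shell
# Demo: filter ACPI-related "error" lines from a syslog copy.
# sample-syslog.txt is a stand-in created here for illustration.
cat > sample-syslog.txt <<'EOF'
Apr 13 08:00:01 Tower kernel: ACPI Error: [DSSP] Namespace lookup failure
Apr 13 08:00:02 Tower kernel: usb 1-1: new high-speed USB device number 2
Apr 13 08:00:03 Tower kernel: ACPI: Power Button [PWRB]
EOF
grep -i 'acpi' sample-syslog.txt | grep -i 'error'
```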

 

 

16 minutes ago, Joseph said:

How can errors on shutdown be captured for analysis?

Not sure what would grab it for sure on shutdown. You could turn on "Troubleshooting Mode" in the Fix Common Problems Plugin (Install it if you don't already have it!) and then shutdown. It might catch it...
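One low-tech option is to save the syslog to flash right before shutting down, since unRAID keeps its logs in RAM. A sketch (the paths are typical unRAID locations, assumed here, and the guard lets it run harmlessly elsewhere):

```shell
# Sketch: copy the in-RAM syslog to the flash drive before shutdown,
# so messages that flash by on screen survive the reboot.
SRC="${SRC:-/var/log/syslog}"
DEST_DIR="${DEST_DIR:-/boot/logs}"
if [ -r "$SRC" ] && mkdir -p "$DEST_DIR" 2>/dev/null; then
    cp "$SRC" "$DEST_DIR/syslog-$(date +%Y%m%d-%H%M%S).txt" && sync
    echo "saved a copy under $DEST_DIR"
else
    echo "syslog not readable here; run on the server just before shutdown"
fi
```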

Edited by DoeBoye

11 minutes ago, DoeBoye said:

Not sure what would grab it for sure on shutdown. You could turn on "Troubleshooting Mode" in the Fix Common Problems Plugin (Install it if you don't already have it!) and then shutdown. It might catch it...

I think I'll use the camera on my phone to grab it... might not be useful, but you never know.


So my parity check just finished with no errors. 1 day, 2 hours and 16 minutes lol. Amazingly, one of the H310 cards arrived today from the orders I made on eBay yesterday; the other one is expected by Friday.

I also have a new 4-port NIC for the main server which I need to install. Hmm, I will probably install each and then make sure they are working, and after that, do another parity check. So this next check will also be after a machine shutdown, which could confirm whether there is an ACPI issue with the latest 6.3.

Fingers crossed all goes well :)

1 hour ago, crowdx42 said:

So my parity check just finished with no errors. 1 day, 2 hours and 16 minutes lol. Amazingly, one of the H310 cards arrived today from the orders I made on eBay yesterday; the other one is expected by Friday.

I also have a new 4-port NIC for the main server which I need to install. Hmm, I will probably install each and then make sure they are working, and after that, do another parity check. So this next check will also be after a machine shutdown, which could confirm whether there is an ACPI issue with the latest 6.3.

Fingers crossed all goes well :)

Well, as an update, I did not add any hardware, as the 4-port NIC needs a PCIe x2 slot or higher and the board only has x1 slots left. I am not sure what 4-port 1-gig NICs are out there that fit an x1 slot and work with unRAID. :(

So I am going to go ahead and run another parity check and see if that works without errors. It should complete by tomorrow night :(

On 4/12/2017 at 9:09 PM, crowdx42 said:

So is it safe to say that I can install a Dell 310 into a windows machine and then use the batch files from the zipped download to flash the card?

I ran the .bat files from a bootable USB stick OK... but have you considered fireball3's script approach? According to EdgarWallace at https://forums.lime-technology.com/topic/12114-lsi-controller-fw-updates-irit-modes/?page=51 he's somehow getting 168MB/s on parity checks... I only get about 125 with my Dell H310 (flashed from P10 to P20). I assume there must be some difference between the fireball3 script and the one I got from johnnie.black. I'm open to suggestions for achieving higher transfer rates.

Edited by Joseph
update


Parity check speed has nothing to do with the firmware used (though I linked the same tools); it mainly has to do with the disks used and whether or not there are other bottlenecks on your server. unRAID tunable settings also play a role.


@Joseph, this was at the start of the parity check, the average was about 90MB/s, which is about the same speed I was getting with my AOC-SAS2LP-MV8.

 

@johnnie.black thanks a lot, see the outcome below... looks pretty good, right?

Parity Check ended with 24 errors. I am going to run a Parity Check once again with "Write corrections to parity" option.

 

root@Tower2:~# xfs_repair /dev/md5
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
Metadata corruption detected at xfs_inode block 0x11c002b18/0x2000
        - agno = 3
bad CRC for inode 6811791504
bad magic number 0x0 on inode 6811791504
bad version number 0x0 on inode 6811791504
bad CRC for inode 6811791504, will rewrite
bad magic number 0x0 on inode 6811791504, resetting magic number
bad version number 0x0 on inode 6811791504, resetting version number
imap claims a free inode 6811791504 is in use, correcting imap and clearing inode
cleared inode 6811791504
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 1
        - agno = 2
entry "ja.js" at block 0 offset 152 in directory inode 6642449595 references free inode 6811791504
	clearing inode number in entry at offset 152...
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
bad hash table for directory inode 6642449595 (no data entry): rebuilding
rebuilding directory inode 6642449595
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

 

 

Edited by EdgarWallace


So a quick clarification: if the parity check returns errors, this just means that the data on the drives does not match what the parity drive holds. The data on the data drives is not at risk unless a drive fails, and then parity may not be able to rebuild that drive if the corresponding section of the parity drive had an error.

Correct? Or is it possible that data on the source drives could have issues?

15 minutes ago, crowdx42 said:

So a quick clarification: if the parity check returns errors, this just means that the data on the drives does not match what the parity drive holds. The data on the data drives is not at risk unless a drive fails, and then parity may not be able to rebuild that drive if the corresponding section of the parity drive had an error.

Correct? Or is it possible that data on the source drives could have issues?

 

I believe that's correct, i.e., data is safe unless there's a rebuild.
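For intuition, single parity is just a byte-by-byte XOR across the data drives, so any one missing drive can be rebuilt from the others plus parity. A toy sketch with one byte per "drive":

```shell
# Toy model: parity byte = XOR of the data bytes. If parity is wrong at
# some address, existing files still read fine, but a rebuild at that
# address would reconstruct the wrong byte.
d1=170 d2=85 d3=60                 # three one-byte "data drives"
parity=$(( d1 ^ d2 ^ d3 ))         # what the parity drive should hold (195)
rebuilt_d2=$(( d1 ^ d3 ^ parity )) # "lose" drive 2, rebuild it from the rest
echo "parity=$parity rebuilt_d2=$rebuilt_d2"
```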

7 minutes ago, johnnie.black said:

I believe that's correct, i.e., data is safe unless there's a rebuild.

How do you know that it's the parity drive? I'm unclear on which piece of data doesn't match, or even at what stage of data reading the parity mismatch occurs. I assume that one or more of the drives attached to the suspect controller would be returning unreliable data, but if it's a data drive...

However, it would seem that the errors only occur when ALL the drives on the controller are accessed, correct? Has anyone been able to actually catch the controller in the act and determine which port is suspect, or all of them randomly?


Something I have noticed is that since I disabled INT13 on my controllers, all the drives are not spun up at the same time, only the drives that the parity check is reading, hence a slower parity check rate. To be honest, this just stinks of the controller cards overheating when placed under heavy load. That makes sense, because a lot of server chassis have air directed over these SAS cards, so the card would be kept cooler than in the average desktop case. I put my finger on the heatsink of one of my cards when it was idle and it was very hot; I can only imagine how hot it would get under full load.

3 minutes ago, crowdx42 said:

 

Something I have noticed is since I disabled INT13 on my controllers all the drives are not spun up at the same time, only the drive that parity is checking, hence a slower parity check rate

 

???? Until you get past the size of the smallest drive, all drives are part of the parity check, and none should be spun down. You can't just check one drive at a time, that's not how parity works.

4 minutes ago, jonathanm said:

However, it would seem that the errors only occur when ALL the drives on the controller are accessed, correct?

 

Yes, I was not saying that parity is wrong; I believe wrong data is returned when all disks are being simultaneously accessed, be it a parity check or a disk rebuild. I *think* the data is correctly written and read during normal reads/writes, but I just remembered that if a user is using turbo write it will be using all disks, so maybe there's also a chance of data corruption.

 

If we had more users using btrfs it would be easier to answer this, as btrfs will detect and abort any file read that fails its checksum.

2 minutes ago, jonathanm said:

???? Until you get past the size of the smallest drive, all drives are part of the parity check, and none should be spun down. You can't just check one drive at a time, that's not how parity works.

Well, take a look at my screenshot: the grey drives are showing as spun down/inactive, but the parity check is still running.

 

Parity Screenshot.JPG

25 minutes ago, crowdx42 said:

Well take a look at my screenshot, the grey drives are showing as spun down/ inactive, but the parity check is still running.

 

That means it's past the 4TB mark; only your parity drives are larger, so they're the only ones still being checked.
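The logic can be sketched like this: at a given position in the check, only drives at least that large still participate. Sizes here are illustrative (4TB data drives per the thread; the 8TB parity size is an assumption):

```shell
# Sketch: past the size of the smaller drives, only larger drives are read.
pos_tb=5                        # current check position, in TB
for size_tb in 8 8 4 4 4; do    # two parity drives, three data drives
    if [ "$size_tb" -gt "$pos_tb" ]; then
        echo "${size_tb}TB drive: still being read"
    else
        echo "${size_tb}TB drive: can spin down"
    fi
done
```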


I could have sworn I saw drives spun down even at 25% through a parity check; now I will have to check the next time :) . I know when INT13 was enabled all drives were showing active and the speed was close to twice what it is with INT13 disabled.

1 minute ago, crowdx42 said:

I could have sworn I saw drives spun down even at 25% through a parity check

 

That's simply not possible.

47 minutes ago, jonathanm said:

How do you know that it's the parity drive? I'm unclear on which piece of data doesn't match, or even at what stage of data reading the parity mismatch occurs. I assume that one or more of the drives attached to the suspect controller would be returning unreliable data, but if it's a data drive...

However, it would seem that the errors only occur when ALL the drives on the controller are accessed, correct? Has anyone been able to actually catch the controller in the act and determine which port is suspect, or all of them randomly?

I'm fairly certain that if there is a problem reading a drive in the array (whether it's a physical problem or a controller problem), then unRAID will knock it out of the array and let you know it needs to be rebuilt.

