Joseph (Author) Posted April 12, 2017
So it turns out someone else had this issue last year on 6.1.9, with the exact same 5 sectors, using a Marvell controller: https://forums.lime-technology.com/topic/50698-monthly-5-parity-errors/
JorgeB Posted April 12, 2017
3 hours ago, EdgarWallace said: What is making me nervous is this
You need to run xfs_repair on disk5 (md5).
Joseph (Author) Posted April 12, 2017
20 hours ago, DoeBoye said: Good call. It's not fun if the Marvell card dumps/corrupts a bunch of drives. I picked up a used 310 off Ebay a few weeks ago. Once I flashed my Dell 310 with fireball3's script, it runs perfectly. No need to cover any pins with tape on mine.
Where is this fireball script? Do you think I can re-flash my card with this script so I don't need to cover pins? Seems to me the pins would still have to be covered if the card is in the slot where the GPU is normally installed.
DoeBoye Posted April 12, 2017
2 hours ago, Joseph said: Where is this fireball script? Do you think I can re-flash my card with this script so I don't need to cover pins? Seems to me the pins would still have to be covered if the card is in the slot where the GPU is normally installed.
Here you go:
In my opinion, that post really needs to be stickied, as it is an extremely useful tool and it's a bit buried in another thread :). As for covering pins or not, I'm not sure whether it is slot dependent or card dependent. All I can tell you is that many people have installed a Dell H310 without needing to cover any pins, and I seem to have the same model that you purchased and did not need to.
Joseph (Author) Posted April 12, 2017
Forgot to mention: right before shutdown this last time, I noticed a line that flashed on the screen quickly that I think said ACPI error. What effect on unRAID would it have? How can errors on shutdown be captured for analysis?
DoeBoye Posted April 12, 2017
16 minutes ago, Joseph said: forgot to mention right before shutdown this last time, I noticed a line that flashed on the screen quickly that I think said ACPI error. What effect on unRAID would it have?
That's above my pay grade. Someone else will need to answer that. A quick forum search pulled up this answer from Joe L. from 2012, but I'm not sure if it's the same thing:
On 10/25/2012 at 8:38 AM, Joe L. said: It means they all have the string of letters "error" somewhere in them. (the criteria for coloring them "red" in the syslog viewer in unMENU.) Other than that, the messages themselves usually indicate that ACPI has been disabled in the BIOS (or disabled with 'noacpi' in your syslinux.conf file), or, ACPI is poorly implemented in the BIOS, or, the BIOS is using ACPI features not yet implemented in the linux kernel. Look first for a BIOS update for your MB, make sure you've not disabled ACPI in the BIOS, and other than that, ignore the messages if everything seems to be working. Joe L.
16 minutes ago, Joseph said: How can errors on shutdown be captured for analysis?
Not sure what would reliably grab it on shutdown. You could turn on "Troubleshooting Mode" in the Fix Common Problems plugin (install it if you don't already have it!) and then shut down. It might catch it...
Edited April 12, 2017 by DoeBoye
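Editor's sketch: if a syslog does get saved before shutdown (for example by the troubleshooting mode mentioned above), the ACPI lines can be pulled out with a simple filter in the spirit of Joe L.'s "contains the string error" rule. This is only an illustration; the function name `find_acpi_errors` and the sample lines are made up, and the real log location and message wording are not known from this thread.

```python
import re

def find_acpi_errors(log_text):
    """Return log lines that mention both ACPI and 'error' (case-insensitive)."""
    pattern = re.compile(r"acpi.*error|error.*acpi", re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]

# Made-up sample lines for illustration only; not taken from any real syslog.
sample = """\
Apr 12 21:00:01 Tower kernel: ACPI Error: example message (hypothetical)
Apr 12 21:00:02 Tower kernel: some unrelated message (hypothetical)
Apr 12 21:00:03 Tower kernel: ACPI: another message that is not a failure (hypothetical)
"""

matches = find_acpi_errors(sample)
for line in matches:
    print(line)
```

Only the first sample line is printed, because it is the only one containing both strings. The same function could be fed the text of a saved log file read with `open(path).read()`, where `path` is wherever the log actually ended up.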
Joseph (Author) Posted April 12, 2017
11 minutes ago, DoeBoye said: Not sure what would grab it for sure on shutdown. You could turn on "Troubleshooting Mode" in the Fix Common Problems Plugin (Install it if you don't already have it!) and then shutdown. It might catch it...
I think I'll use the camera on my phone to grab it... might not be useful, but you never know.
crowdx42 Posted April 12, 2017
So my parity check just finished with no errors: 1 day, 2 hours and 16 minutes, lol. Amazingly, one of the H310 cards arrived today from the orders I made on eBay yesterday; the other one is expected by Friday. I also have a new 4 port NIC for the main server which I need to install. Hmm, I will probably install each, make sure they are working, and after that do another parity check. This next check will also be after a machine shutdown, so it could confirm whether there is an ACPI issue with the latest 6.3. Fingers crossed all goes well.
crowdx42 Posted April 12, 2017
1 hour ago, crowdx42 said: So my parity check just finished with no errors. 1 day 2 hours and 16 minutes lol. Amazingly one of the H310 cards arrived today from the orders I made on eBay yesterday, the other one is expected by Friday. I have a new 4 port NIC also for the main server which I need to install. Hum, I will probably install each and then make sure they are working and after that, do another parity check. So this next check will also be after a machine shutdown, could confirm if there is an ACPI issue with the latest 6.3. Fingers crossed all goes well
Well, as an update: I did not add any hardware, as the 4 port NIC needs a PCIe x2 or higher slot and the board only has single PCIe slots left. I am not sure what 4 port 1gig NICs are out there that are only PCIe and work with unRAID. So I am going to go ahead and run another parity check and see if that works without errors. It should complete by tomorrow night.
crowdx42 Posted April 13, 2017
On 4/3/2017 at 10:55 AM, johnnie.black said: http://lime-technology.com/wiki/index.php/Crossflashing_Controllers#LSI_SAS2008_chipset
So is it safe to say that I can install a Dell 310 into a Windows machine and then use the batch files from the zipped download to flash the card?
Joseph (Author) Posted April 13, 2017
On 4/12/2017 at 9:09 PM, crowdx42 said: So is it safe to say that I can install a Dell 310 into a windows machine and then use the batch files from the zipped download to flash the card?
I ran the .bat files from a bootable USB stick OK... but have you considered fireball3's script approach? According to EdgarWallace at https://forums.lime-technology.com/topic/12114-lsi-controller-fw-updates-irit-modes/?page=51 he's somehow getting 168MB/s on parity checks. I only get about 125 with my Dell H310 (flashed p10 P20). I assume there must be some difference between the fireball3 script and the one I got from johnnie.black. I'm open to suggestions for achieving higher transfer rates.
Edited April 17, 2017 by Joseph (update)
JorgeB Posted April 13, 2017
Parity check speed has nothing to do with the firmware used (though I linked the same tools). It depends mainly on the disks used and on whether there are other bottlenecks on your server; unRAID tunable settings also play a role.
EdgarWallace Posted April 13, 2017
@Joseph, this was at the start of the parity check; the average was about 90MB/s, which is about the same speed I was getting with my AOC-SAS2LP-MV8.
@johnnie.black thanks a lot, see the outcome below... looks pretty good, right? Parity check ended with 24 errors. I am going to run a parity check once again with the "Write corrections to parity" option.

root@Tower2:~# xfs_repair /dev/md5
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
Metadata corruption detected at xfs_inode block 0x11c002b18/0x2000
        - agno = 3
bad CRC for inode 6811791504
bad magic number 0x0 on inode 6811791504
bad version number 0x0 on inode 6811791504
bad CRC for inode 6811791504, will rewrite
bad magic number 0x0 on inode 6811791504, resetting magic number
bad version number 0x0 on inode 6811791504, resetting version number
imap claims a free inode 6811791504 is in use, correcting imap and clearing inode
cleared inode 6811791504
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 1
        - agno = 2
entry "ja.js" at block 0 offset 152 in directory inode 6642449595 references free inode 6811791504
        clearing inode number in entry at offset 152...
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
bad hash table for directory inode 6642449595 (no data entry): rebuilding
rebuilding directory inode 6642449595
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

Edited April 13, 2017 by EdgarWallace
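Editor's sketch: when a repair log like the one above is long, it can help to list the distinct inode numbers it touched before deciding what to check by hand. The snippet below is a minimal illustration using a few lines copied from the output above; the helper name `inodes_mentioned` is made up, and the regular expression only assumes that messages contain the word "inode" followed by a number.

```python
import re

def inodes_mentioned(repair_output):
    """Collect the distinct inode numbers that appear in xfs_repair messages."""
    return sorted({int(n) for n in re.findall(r"inode (\d+)", repair_output)})

# A few lines copied from the xfs_repair output posted above.
excerpt = """\
bad CRC for inode 6811791504
cleared inode 6811791504
rebuilding directory inode 6642449595
"""

affected = inodes_mentioned(excerpt)
print(affected)
```

In this log only two inodes are involved: the cleared one (the file behind the "ja.js" directory entry) and the directory that was rebuilt. Since the log also mentions moving disconnected inodes to lost+found, that folder on the disk is worth a look afterwards.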
crowdx42 Posted April 13, 2017
So a quick clarification: if the parity check returns errors, this just means that the data on the drives does not match what is stored on the parity drive. The data on the data drives is not at risk unless a drive fails, in which case parity may not be able to rebuild the drive correctly if that section of the parity drive had an error. Correct? Or is it possible that data on the source drives could have issues?
JorgeB Posted April 13, 2017
15 minutes ago, crowdx42 said: So a quick clarification, if the parity check returns errors, this just means that the data on the drives does not match the algorithm of the parity drive. The data on the data drives are not at risk unless a drive fails and then the parity may not be able to rebuild the drive if that section of the parity drive had an error. Correct? Or is it possible that data on the source drives could have issues?
I believe that's correct, i.e., data is safe unless there's a rebuild.
JonathanM Posted April 13, 2017
7 minutes ago, johnnie.black said: I believe that's correct, i.e., data is safe unless there's a rebuild.
How do you know that it's the parity drive? I'm unclear on which piece of data doesn't match, or even at what stage of data reading the parity mismatch occurs. I aSSume that one or more of the drives attached to the suspect controller would be returning unreliable data, but if it's a data drive... However, it would seem that the errors only occur when ALL the drives on the controller are accessed, correct? Has anyone been able to actually catch the controller in the act and determine which port is suspect, or is it all of them randomly?
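Editor's sketch: the question of which device is "wrong" can be illustrated with a toy single-parity model, where parity is the bytewise XOR of the data drives. This is a simplified assumption for illustration (it ignores a second parity drive and everything controller-specific), and the drive contents and the helper name `xor_parity` are made up.

```python
from functools import reduce

def xor_parity(blocks):
    """Bytewise XOR of equal-length blocks (toy single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data "drives" holding four bytes each.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(data)

# Case 1: drive 1 returns one wrong byte during the check.
bad_read = [data[0], b"\x10\x21\x30\x40", data[2]]
mismatch = xor_parity(bad_read) != parity

# Case 2: the data is fine but the stored parity byte is wrong.
bad_parity = bytes([parity[0], parity[1] ^ 0x01, parity[2], parity[3]])
same_symptom = xor_parity(data) != bad_parity

# Rebuilding drive 1 from the other drives plus parity only reproduces the
# original data if every other device reads back correctly.
rebuilt = xor_parity([data[0], data[2], parity])

print(mismatch, same_symptom, rebuilt == data[1])
```

Both cases produce the same symptom, a mismatch at that position, so in this model a check by itself cannot say whether a data drive, the parity drive, or the path through the controller supplied the bad byte. That matches the concern raised above: a correcting check would simply rewrite parity to agree with whatever the data drives returned.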
crowdx42 Posted April 13, 2017
Something I have noticed is that since I disabled INT13 on my controllers, the drives are not all spun up at the same time, only the drive that parity is checking, hence a slower parity check rate. To be honest, this just stinks of the controller cards overheating when placed under heavy load. That would make sense, because a lot of server chassis direct air over these SAS cards, so they would be kept cooler there than in the average case. I put my finger on the heatsink of one of my cards when it was idle and it was very hot; I can only imagine how hot it would get under a full load.
JonathanM Posted April 13, 2017
3 minutes ago, crowdx42 said: Something I have noticed is since I disabled INT13 on my controllers all the drives are not spun up at the same time, only the drive that parity is checking, hence a slower parity check rate
???? Until you get past the size of the smallest drive, all drives are part of the parity check, and none should be spun down. You can't just check one drive at a time; that's not how parity works.
JorgeB Posted April 13, 2017
4 minutes ago, jonathanm said: However, it would seem that the errors only occur when ALL the drives on the controller are accessed, correct?
Yes. I was not saying that parity is wrong; I believe that wrong data is returned when all disks are being accessed simultaneously, be it during a parity check or a disk rebuild. I *think* that data is correctly written and read during normal reads/writes, but I just remembered that if a user is using turbo write it will be using all disks, so maybe there's also a chance of data corruption there. If we had more users using btrfs it would be easier to answer this, as btrfs will detect and abort any file read that fails checksum.
crowdx42 Posted April 13, 2017
2 minutes ago, jonathanm said: ???? Until you get past the size of the smallest drive, all drives are part of the parity check, and none should be spun down. You can't just check one drive at a time, that's not how parity works.
Well, take a look at my screenshot: the grey drives are showing as spun down/inactive, but the parity check is still running.
JorgeB Posted April 13, 2017
25 minutes ago, crowdx42 said: Well take a look at my screenshot, the grey drives are showing as spun down/ inactive, but the parity check is still running.
That means it's past the 4TB mark. Only your parity drives are larger, so they're the only ones still being checked.
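Editor's sketch: the point about the 4TB mark can be pictured as a simple rule, where a drive takes part in the check only up to its own capacity. The drive sizes below are hypothetical (they are not read from the screenshot), and `drives_in_check` is a made-up helper name.

```python
def drives_in_check(sizes_tb, position_tb):
    """Names of drives that still hold data at a given offset into the check."""
    return [name for name, size in sizes_tb.items() if size > position_tb]

# Hypothetical array: two larger parity drives and smaller data drives.
sizes = {"parity": 8, "parity2": 8, "disk1": 4, "disk2": 4, "disk3": 3}

early = drives_in_check(sizes, 1)   # offset below the smallest drive's size
late = drives_in_check(sizes, 5)    # offset past the largest data drive

print(early)
print(late)
```

At the early offset every drive is listed, which is why none should be spun down at that stage. Past the largest data drive only the parity drives remain, which would explain data drives showing as inactive while the check is still running.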
crowdx42 Posted April 13, 2017
I could have sworn I saw drives spun down even at 25% of the parity check; now I will have to check next time. I know that when INT13 was enabled all drives were showing active, and the speed was close to twice what it is with INT13 disabled.
JorgeB Posted April 13, 2017
1 minute ago, crowdx42 said: I could have sworn seeing drives spun down even at 25% parity check
That's simply not possible.
Joseph (Author) Posted April 13, 2017
47 minutes ago, jonathanm said: How do you know that it's the parity drive? I'm unclear on which piece of data doesn't match, or even at what stage of data reading the parity mismatch occurs. I aSSume that one or more of the drives attached to the suspect controller would be returning unreliable data, but if it's a data drive... However, it would seem that the errors only occur when ALL the drives on the controller are accessed, correct? Has anyone been able to actually catch the controller in the act and determine which port is suspect, or all of them randomly?
I'm fairly certain that if there is a problem reading a drive in the array (whether it's a physical problem or a controller problem), unRAID will knock it out of the array and let you know it needs to be rebuilt.