Joseph Posted April 9, 2017

@johnnie.black, could the parity issue be caused by having some drives on the H310 and others on the mobo controllers? The cache and parity drives are on the mobo 6Gb/s ports; 2 of the older, slower HDDs are on the mobo 3Gb/s controller.

update: of the onboard 6Gb/s ports, 2 are Intel and 2 are Marvell.
Joseph Posted April 9, 2017

26 minutes ago, crowdx42 said:
Sounds like your issues may finally be resolved. I am not sure if you posted this, but what motherboard, CPU, memory etc. are you running? My main setup is all Intel, and so I am wondering if there is any connection. Also, for the SAS card update, did you have to check via command line?

If memory serves, it's done via command line from a bootable USB stick. There should be a utility included inside the zip file to set this up.

Here are my specs (they're in my signature; I thought I had it set so people could see them):

Tower: unRAID 6.3.3 Pro | Mobo: ASUS P8Z68 DELUXE/GEN3 | CPU: Intel Core i7-2600 @ 3.40GHz | HBA: Supermicro AOC-SAS2LP-MV8, Dell H310 PERC 0HV52W flashed to LSI SAS2008 (P10) | PSU: CoolerMaster G750M | RAM: 4GBx4 non-ECC | Case: Sharkoon T9 Value | Parity: 4TBx2 Seagate NAS | Cache: 500GBx2 Samsung 850 EVO | Network: dual gigabit, bonded | Storage: 25TB (4TBx3 Seagate NAS, 2TBx1 Seagate, 4TBx1 HGST, 3TBx1 Seagate, 2TBx2 Hitachi) | Dockers: BOINC for SETI
crowdx42 Posted April 9, 2017

Seems like the CPU, and I would assume the chipset, on your system are a little older than what I am running. I am on Haswell (mainly hoping for better power consumption). I am going to drop the 8TB second parity drive into my server and set it to rebuild. As stated earlier, I now have both parity drives on separate controllers (originally they were both on the same one). I too have some of my drives on the motherboard SATA controllers, but that has never caused an issue in the past.

Something I am thinking about, though: when a parity check is run, it has to be stressful on the whole system, so if there is any point of failure (particularly a failing drive), a parity check may push it over the edge. I am wondering if my initial issue was a failing drive which caused the first parity errors, and then it was that same drive I moved to become a parity drive (dumb luck), thinking the parity drive was failing.

A parity check validates against the drives in the system? If so, if a drive in the array is failing, it could cause parity errors?
Joseph Posted April 9, 2017

10 minutes ago, crowdx42 said:
A parity check validates against the drives in the system? If so, if a drive in the array is failing, it could cause parity errors?

I believe that's correct. It reads the data drives, recomputes the parity, and compares it to what is stored on the parity drive(s). If they match, all is OK; if there is a mismatch, a correcting check writes the recomputed parity to the parity drive(s) and updates the error counter and logs.

And now I'm wondering if my older drives, which did pass SMART, might be an issue.

Last SMART test result: Completed without error
#    Attribute Name         Flag    Value  Worst  Threshold  Type     Updated  Failed  Raw Value
199  UDMA CRC error count   0x000a  200    200    000        Old age  Always   Never   1   <~~~~ Shouldn't this be 0?
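That check-and-write-back cycle can be sketched in miniature. This is a deliberately simplified model using single XOR parity (unRAID's actual P+Q dual parity uses Reed-Solomon style math), but the check/correct flow is the same idea:

```python
from functools import reduce

def parity_of(stripe):
    """XOR all data blocks in a stripe to get the expected parity value."""
    return reduce(lambda a, b: a ^ b, stripe)

def check_and_correct(data_stripes, parity_blocks):
    """Compare stored parity against recomputed parity; rewrite on mismatch.

    Returns the number of corrections, like the error counter in the unRAID log.
    """
    errors = 0
    for i, stripe in enumerate(data_stripes):
        expected = parity_of(stripe)
        if parity_blocks[i] != expected:
            parity_blocks[i] = expected  # correcting check: write fixed parity
            errors += 1
    return errors
```

Note what this implies for the question above: the check trusts whatever the data drives return. If a failing array disk returns bad reads, the recomputed parity differs from the stored parity, so a dying data drive can surface as parity "errors" even when the parity drive itself is fine.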
crowdx42 Posted April 9, 2017

I know from my system, I got no SMART errors on the failing drive, only some bad sectors. So I popped it into an enclosure on my Windows machine, and the minute it was detected, StableBit Scanner (which is installed on my Windows machine) gave a SMART error indicating that the drive was not passing some of the checks it does. After a full scan, it also detected bad sectors. StableBit Scanner treats some SMART attributes as "pre-fail" indicators; those were the ones it flagged my dying drive on.

In general, older drives make me nervous. I also had a slew of issues with Seagate drives and moved to Western Digital and HGST drives (now owned by WD). The HGST drives run cooler than the WDs and overall have been a great drive. I have only a single WD Red in my system, which actually came from an enclosure I pulled the drive from. In general I buy external drives, run preclear on them, and then pull the drive to use in the array; this works out cheaper than buying internal drives. The new WD enclosures are also a lot easier to open than the older ones, so the drive can be put back into its original enclosure in the case of failure. The two 8TB drives I just got were $210 each, whereas the same internal drive is over $300. Quite the savings.
Joseph Posted April 9, 2017

I might need to go that route then when checking SMART. The 2 2TB drives are the last of the old batch. I spent a small fortune on hard drives last year when I built this box; I'm considering upgrading the mobo, CPU and memory as soon as the Mrs. gives me back my credit card. lol

Building a 32-Thread Xeon Monster PC for Less Than the Price of a Flagship Core i7
http://www.techspot.com/review/1155-affordable-dual-xeon-pc/
crowdx42 Posted April 9, 2017

I think my current bottleneck is the 1 gigabit connection. I have been looking at upgrading to 10 gigabit, but it all seems to be very expensive. I currently only have a single NIC in my server, and I am wondering whether that is enough when backing it up while also trying to stream a movie. I believe maybe even a 4-port NIC might help.
Joseph Posted April 9, 2017

49 minutes ago, crowdx42 said:
I think my current bottleneck is the 1 gigabit connection. I have been looking at upgrading to 10 gigabit, but it all seems to be very expensive. I currently only have a single NIC in my server, and I am wondering whether that is enough when backing it up while also trying to stream a movie. I believe maybe even a 4-port NIC might help.

Our house wasn't wired for internet. I discovered the phone lines used CAT5e, and I was able to get a few rooms connected that way. I have dual gigabit for a couple of workstations, but that's about all I can do without tearing into walls for the other locations. I have a couple of APs for wireless devices. If I could get 2 or 4Gb/s via bonding, or 10Gb/s ($$$), everywhere, that would be awesome. FWIW, Emby allows you to throttle back the streaming bandwidth if need be.
crowdx42 Posted April 9, 2017

Well, I ran some CAT6 outside and some in the attic so that I got wired throughout our house. The office where my main computers are is next door to where the unRAID server resides. But it does seem that when the main unRAID server is put under load, the GUI becomes unresponsive and very laggy; hence I wonder if adding a 4-port gigabit NIC would help. They are not very expensive, and there is a switch right beside the server, so it would not require large runs of additional cable.
Joseph Posted April 10, 2017

On 4/9/2017 at 11:45 AM, Joseph said:
With the new card installed, the first parity check corrected 5 errors (this was to be expected). However, the second parity check found the same 5 errors. Parity check 3 came back OK. I'm running check 4 and will post what I find. If it goes well, the plan is to reboot afterward and try it again.

Parity check 4 came back clean. Log posted. Next steps: reboot and try again.

Last check completed on Sun 09 Apr 2017 10:30:47 PM CDT (today), finding 0 errors. Duration: 11 hours, 30 minutes, 19 seconds. Average speed: 96.6 MB/sec
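As a sanity check (my own arithmetic, not from the log), the reported duration and average speed are consistent with one full pass over a 4TB parity disk:

```python
duration_s = 11 * 3600 + 30 * 60 + 19             # 11h 30m 19s
avg_mb_per_s = 96.6                               # reported average speed
total_tb = duration_s * avg_mb_per_s / 1_000_000  # MB -> TB (decimal units)
print(round(total_tb, 2))                         # -> 4.0, the size of the parity drives
```

So the check really did read the full 4TB, rather than aborting early.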
crowdx42 Posted April 10, 2017

It sounds like your issue may be resolved; hopefully the stress of all the parity checks does not break something else.
Joseph Posted April 10, 2017

On 4/10/2017 at 7:40 AM, crowdx42 said:
It sounds like your issue may be resolved; hopefully the stress of all the parity checks does not break something else.

NOPE...FAIL!

Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069768
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069776
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069784
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069792
Apr 10 04:19:16 Tower kernel: md: recovery thread: Q corrected, sector=3519069800

Same sectors each time. The only difference I see is that the system was rebooted. The check is still going (90% complete), but I'm posting the syslog.
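One way to confirm the same sectors repeat across runs is to tally the corrected-sector lines out of the syslog. This is a hypothetical helper, assuming the stock "md: recovery thread: Q corrected, sector=N" format shown above:

```python
import re
from collections import Counter

# Matches both P and Q parity corrections in the unRAID md driver log format
CORRECTED = re.compile(r"recovery thread: [PQ] corrected, sector=(\d+)")

def corrected_sectors(syslog_lines):
    """Tally corrected-parity sectors so repeats across runs stand out."""
    hits = Counter()
    for line in syslog_lines:
        m = CORRECTED.search(line)
        if m:
            hits[int(m.group(1))] += 1
    return hits
```

Feeding it the concatenated logs from two checks and looking for counts greater than 1 would show whether the five sectors are genuinely identical each time.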
crowdx42 Posted April 10, 2017

So maybe it is time to take a step back. Do you have any spare data drives? I am thinking, if you save your current configuration and then pull all the drives, you could add back a single parity and a single data drive and see if you can get a stable system with no errors that way. If that works, you could start slowly adding back the data drives, and from there you could pin down whether it is a dying drive or some other piece of hardware. One other thing I would consider is pulling any add-in card and just using onboard SATA, just to test. You could just be unlucky and have gotten a bad SAS card.

I did complete my parity rebuild on my system, and I now have both parity drives back up and running.
Joseph Posted April 10, 2017

I hear you... In all likelihood I may have to go that route, but it's gonna be a while before I can carve out some time to try. It's strange that it only seems to happen after a reboot. FWIW, I'm not getting unclean shutdown messages. Once complete, I will run another parity check without rebooting, and then reboot and run parity again to verify the pattern. If anyone has any thoughts or ideas, feel free to chime in.
crowdx42 Posted April 10, 2017

I did see some ACPI errors in your logs; I wonder if one of your drives is not disconnecting correctly at shutdown. Is it possible that unRAID has a timeout and shuts down anyway even if a drive is hanging?
JorgeB Posted April 10, 2017

24 minutes ago, Joseph said:
but I'm posting the syslog

Post your diagnostics instead of just the syslog; it has much more info. Tools -> Diagnostics
Joseph Posted April 10, 2017

On 4/10/2017 at 11:20 AM, johnnie.black said:
Post your diagnostics instead of just the syslog; it has much more info. Tools -> Diagnostics

Done... see attached.
crowdx42 Posted April 10, 2017

Are you using an external Buffalo drive in your array? I see one mentioned in one of the logs.
Joseph Posted April 10, 2017

6 minutes ago, crowdx42 said:
Are you using an external Buffalo drive in your array? I see one mentioned in one of the logs.

It's outside the array as an unassigned device.
JorgeB Posted April 10, 2017

So, you're having a lot of ATA errors on one of your cache SSDs:

Apr 10 06:04:01 Tower kernel: ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 10 06:04:01 Tower kernel: ata12.00: supports DRM functions and may not be fully accessible
Apr 10 06:04:01 Tower kernel: ata12.00: disabling queued TRIM support
Apr 10 06:04:01 Tower kernel: ata12.00: supports DRM functions and may not be fully accessible
Apr 10 06:04:01 Tower kernel: ata12.00: disabling queued TRIM support
Apr 10 06:04:01 Tower kernel: ata12.00: configured for UDMA/133
Apr 10 06:04:01 Tower kernel: ata12: EH complete
Apr 10 06:04:51 Tower kernel: ata12.00: NCQ disabled due to excessive errors
Apr 10 06:04:51 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x100 SErr 0x0 action 0x6 frozen
Apr 10 06:04:51 Tower kernel: ata12.00: failed command: READ FPDMA QUEUED
Apr 10 06:04:51 Tower kernel: ata12.00: cmd 60/20:40:80:dc:35/00:00:36:00:00/40 tag 8 ncq dma 16384 in
Apr 10 06:04:51 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 10 06:04:51 Tower kernel: ata12.00: status: { DRDY }
Apr 10 06:04:51 Tower kernel: ata12: hard resetting link
Apr 10 06:04:51 Tower kernel: ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 10 06:04:51 Tower kernel: ata12.00: supports DRM functions and may not be fully accessible
Apr 10 06:04:51 Tower kernel: ata12.00: disabling queued TRIM support
Apr 10 06:04:51 Tower kernel: ata12.00: supports DRM functions and may not be fully accessible
Apr 10 06:04:51 Tower kernel: ata12.00: disabling queued TRIM support
Apr 10 06:04:51 Tower kernel: ata12.00: configured for UDMA/133
Apr 10 06:04:51 Tower kernel: ata12: EH complete

This SSD is connected to a Marvell controller; these are known to be problematic (the SAS2LP also uses a Marvell chipset). The parity2 disk is connected to the same controller, and since you replaced the SAS2LP all errors are on parity2. Can you connect parity2 to the LSI? You are only using 7 ports there.
Joseph Posted April 10, 2017

10 minutes ago, johnnie.black said:
So, you're having a lot of ATA errors on one of your cache SSDs. This SSD is connected to a Marvell controller; these are known to be problematic (the SAS2LP also uses a Marvell chipset). The parity2 disk is connected to the same controller, and since you replaced the SAS2LP all errors are on parity2. Can you connect parity2 to the LSI? You are only using 7 ports there.

OK, I'll have to do it later when I have more time... thanks for looking into it. FWIW, the mobo has 2 6Gb/s and 4 3Gb/s Intel Z68 ports, plus 2 Marvell 6Gb/s ports.
JorgeB Posted April 10, 2017

9 minutes ago, Joseph said:
and 2 Marvell 6Gb/s ports.

Don't use those; also move the SSD.
crowdx42 Posted April 10, 2017

So a question on these controllers: is it that they have issues with certain motherboard chipsets, or is it simply an unRAID compatibility problem?
JorgeB Posted April 10, 2017

3 minutes ago, crowdx42 said:
So a question on these controllers: is it that they have issues with certain motherboard chipsets, or is it simply an unRAID compatibility problem?

It's an issue with Linux, mostly on the latest kernels, mostly with VT-d enabled, and it doesn't affect every user.
JonathanM Posted April 10, 2017

1 minute ago, crowdx42 said:
So a question on these controllers: is it that they have issues with certain motherboard chipsets, or is it simply an unRAID compatibility problem?

Some people have issues, some don't. I suspect it's a question of the specific implementation: whether or not the design specs were pushed to the ragged edge. I think I heard somewhere that heat may also be an issue; the controller chip itself may not tolerate higher temperatures (loads) like it should. Certain configurations may push the issue into failure, where others allow it to operate OK.