June 3, 2013: I am running 5.0-rc12a and finished setting up the unRAID system last week. I have a mix of WD Reds and Seagate 3TB drives. The Seagates all start reporting read errors on sector 564416, and while they don't Red Ball, they go non-responsive. Has anyone seen an issue like this? I attached the current syslog, which includes one of the disks going "bad". syslog.txt
June 3, 2013: Post a picture of the main screen from the Web GUI. Also, is there anything different between the way you're connecting the WD Reds and the Seagate drives? [e.g. a different SATA controller card; a port extender; etc.] The picture will answer this, but what's the mix and assignment of the drives?
June 3, 2013: "The Seagates all start reporting read errors on Sector 564416 ..." Since it is nearly impossible for the exact same sector to fail on all your disks, I'd 1. perform a memory check for at least one full cycle, preferably overnight; 2. suspect a bad disk controller chipset or something else common to all the failing drives. Joe L.
June 3, 2013: "Since it is nearly impossible for the exact same sector to fail on all your disks ..." Agree ... that's why I asked about a connection difference between the Seagates (which are having the problem) and the WD Reds (which aren't). A memory issue would seem likely to affect both, so I suspect a controller (or perhaps port expander) issue.
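One way to confirm the "same sector on every drive" pattern from a syslog is to group the reported sectors by device. The following is only a rough sketch: it assumes the log contains Linux kernel block-layer lines of the form `I/O error, dev sdX, sector N`, which may not match what the attached syslog actually shows. The function names and the sample lines are hypothetical and are not taken from the attachment.

```python
import re
from collections import defaultdict

# Assumed line shape; adjust the pattern to whatever the real syslog contains.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def sectors_by_device(lines):
    """Map each failing sector number to the set of devices that reported it."""
    hits = defaultdict(set)
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            device, sector = match.group(1), int(match.group(2))
            hits[sector].add(device)
    return hits

def shared_sectors(lines, min_devices=2):
    """Return only the sectors reported by at least `min_devices` devices."""
    return {
        sector: sorted(devices)
        for sector, devices in sectors_by_device(lines).items()
        if len(devices) >= min_devices
    }

# Hypothetical sample lines, for illustration only.
sample = [
    "Jun  3 10:01:02 Tower kernel: end_request: I/O error, dev sdc, sector 564416",
    "Jun  3 10:05:40 Tower kernel: end_request: I/O error, dev sdd, sector 564416",
    "Jun  3 10:09:13 Tower kernel: end_request: I/O error, dev sde, sector 1200",
]
print(shared_sectors(sample))  # {564416: ['sdc', 'sdd']}
```

A sector that shows up on several devices points away from genuine media defects and toward something the drives share, such as the controller, expander, backplane, or cabling.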
June 3, 2013 (Author): Thanks guys for the ideas. Picture attached. I am running all the drives on the same RES2SV240 SAS expander off the same IBM M1015. I did a 24-hour memory burn-in prior to the build. Note that unRAID is running inside a 4G VM on ESXi. All the disks passed a 3-cycle preclear.
June 3, 2013 (Author): By the way, ignore the throughput numbers on the 3 preclears at the bottom; I just checked and they are all running above 100MB/s.
June 3, 2013: Nothing "jumps out" here. I really anticipated that you'd say the drives were on a different controller, or some such. I can think of no reason the Seagates would all have errors on the SAME sector !! Were these drives all purchased together?
June 3, 2013 (Author): Oh, and thanks for taking the time to look into this. I'll see if I can narrow down a trigger or something.
June 3, 2013: Isolating this kind of error can be very time-consuming and require a lot of experimentation, as I'm sure you know. One thing that would be interesting to see -- but will take you several days to do if you choose to try it -- would be to simply replace (one-at-a-time) the Seagates that show errors with those 3 new WD Reds you're pre-clearing.
June 3, 2013 (Author): Yes, I will try that. It will also be interesting to see whether the other 2 also go "bad" if I leave the array alone. After that I'll try a different M1015 I have, as well as different backplanes (this is in a Norco 4224). Finally I'll try pure unRAID, no VM.
June 3, 2013: Sounds like a good testing plan. Conceptually very simple ... and for that matter, doesn't take much of "your" time => but clearly takes a LOT of "clock time" to do the rebuilds and then use the array a bit to see if it's still generating errors.
June 3, 2013: I don't run a VM, or have any real experience doing so, and thus it might color my comment. But ... I'd drop the VM first and run straight hardware unRAID before I'd try all the other steps.
June 3, 2013: Although it's unlikely that the pass-through of the controller is inducing these errors, I nevertheless agree that running "bare metal" for the testing is a good idea.
June 3, 2013 (Author): Quick update (for those that care). I wrote down all the info on the failed drives (physical location, logical drive map, serials, SAS/M1015 port connectivity, etc.) and decided that, since I wanted the preclears to finish, I would for now try stopping and starting the array before trying bare metal. That was spectacularly unsuccessful. The web page timed out after I did the stop from the web GUI, and attempts to log back in via the GUI fail because the name cannot be resolved. Attempts to use direct IP access also fail (timeout). I took a look at the process list and I don't recall it being this "busy" when I was running prior to the errors. I attached the list if anyone wants to look at it. I couldn't find the web server in the process list, unless it's named something non-obvious. For now my preclears seem to be going OK, so I will leave the array down and let them finish before I move on to more disruptive debugging. ShepStor1pslist.txt
June 3, 2013: I have the exact same drive and I thought I had a bad one. I quickly replaced it with a different drive, only to get the same error/symptom. Turns out the SAS/SATA breakout cable might be at fault. I wiggled the SATA cable on the drive side and it came back. The same issue happened again, so I unplugged the SAS side and replugged it, and I haven't had the problem for three weeks now. I have the Supermicro AOC-SASLP-MV8 with the included breakout cable(s). Also to be noted: it ran for a year without issue with a vanilla setup. The errors happened a few weeks ago, before I went to the ESXi world. Mike
June 4, 2013: Cables are certainly a possibility ... and with breakout cables you could easily have 4 drives on the same cable, thus generating the same errors !! When you get ready to continue debugging, be sure to test this possibility.
June 4, 2013 (Author): I agree, cables are a possibility, although the failing cards are on two different cables. Thanks for the ideas guys, keep 'em coming.
June 4, 2013: "... the failing cards are on two different cables." ... in that case it's not very likely the issue is your cables.
June 7, 2013 (Author): Here's an update on my issue. The preclears I had running finally finished last night. The array was still down from my earlier attempt to stop it, and the GUI was completely unresponsive, so I decided to see whether the issue with the Seagates would happen again as a VM before moving to bare metal. So I shut down the VM and started it up. The GUI was fine, and a parity check started running because of the hard shutdown and found thousands of errors (I think 23k of them). I was resigning myself to just erasing everything in the array and recopying the files when, about an hour later, unRAID became unresponsive and the parity speed went to 100k/s (not the 100MB/s it had been). Eventually the system became unresponsive again, so I shut down the VM, shut down the other VMs, and moved to bare metal.
Boot up, parity starts, drives are responsive, and an hour later the system grinds to a halt again. By now I am cursing. Next I started pulling Seagate drives to see if I could find out whether a bad drive was causing the problem. The system dies after an hour if I have ANY of the Seagate drives in the chassis. Since I didn't have enough precleared 3Ts to replace the 5 Seagates, I bit the biscuit, reformatted the USB, added Simple Features, and started from scratch.
So I put in the 9 Reds I had either already in the array or precleared, set it all up again, formatted the new drives, and let parity run overnight. No slowdowns; parity finished this morning and all disks are responsive in the array. I plan to leave it for the rest of the day, but based on what I am experiencing right now, those Seagates are NOT working with my hardware config and unRAID, even on bare metal.
Here is my config:
unRaid: unRAID Server 5.0-rc12a
Case: Norco 4224
CPU: Intel Xeon E3-1240 (Sandy Bridge)
MB: Supermicro X9SCM-IIF
RAM: 4 x Hynix DDR3-1333 8GB ECC CL9 (on memory list for the X9SCM-IIF). Note: I am using the 4G option on boot because my board has the slow write issue with larger memory in use.
PSU: SeaSonic X Series X-850 850W 80 PLUS GOLD
HBA1: IBM M1015 (Port 0 - cascade to RES2SV240; Port 1 - 4X SAS2 ports)
HBA2: Intel RES2SV240 SAS Expander (Ports 0&1 - cascade from M1015s; Ports 2-5 - 16X SAS2 ports)
HBA3: IBM M1015 (Port 0 - cascade to RES2SV240; Port 1 - 4X SAS2 ports, not used)
Next step is to introduce one of the Seagates into the known "working" config tonight and try to diagnose why the system grinds to a halt. I may just need to buy 5 more 3T Reds and stick with WD for my setup. Note that I am not saying everyone will have an issue with the Seagates; at least one other person in this thread has no issues. Unfortunately that is not the case for me.
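To put the parity speed drop from 100MB/s to 100k/s in perspective, here is a back-of-the-envelope calculation. It is a sketch under stated assumptions: a 3TB drive is taken as 3 x 10^12 bytes, "100k/s" is read as 100 x 10^3 bytes per second, and the rate is treated as constant for the whole check, which a real parity check will not be. The function name is made up for this example.

```python
def parity_check_hours(drive_bytes, bytes_per_second):
    """Hours needed to read a whole drive at a constant rate."""
    return drive_bytes / bytes_per_second / 3600

DRIVE_BYTES = 3 * 10**12   # assumption: 3TB in decimal units
FAST_RATE = 100 * 10**6    # 100 MB/s, the healthy speed from the post
SLOW_RATE = 100 * 10**3    # 100 KB/s, the degraded speed from the post

fast_hours = parity_check_hours(DRIVE_BYTES, FAST_RATE)
slow_hours = parity_check_hours(DRIVE_BYTES, SLOW_RATE)

print(f"{fast_hours:.1f} hours at 100 MB/s")     # 8.3 hours at 100 MB/s
print(f"{slow_hours / 24:.0f} days at 100 KB/s")  # 347 days at 100 KB/s
```

At the healthy rate the estimate is roughly in line with the overnight parity run described above; at the degraded rate the check would effectively never finish, which fits the "grinds to a halt" description.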
June 8, 2013: Run stock unRAID until you're done debugging the system. Simplefeatures may be causing the issue or confusing the diagnosis...
June 8, 2013: I'm in the process of RMAing a 4TB Seagate drive which won't even pass a short SMART test. It fails the read test. What do your Seagate drives report for smart test and diagnosis?
June 8, 2013 (Author): Quick update: I put in the first drive, parity completed, and I left the system for 12 hours; no issues. I have now added 2 more drives (all 3 drives are previous "problem" drives). Parity was going fine after 6 hours, so I am wondering whether actual data writes cause the issue over time. I am dumping 1.5G of data on each of the drives and will then leave the system alone and see what happens.
"Run stock unRAID until you're done debugging the system. Simplefeatures may be causing the issue or confusing the diagnosis..." Man, I really like SimpleFeatures. If I can get the issue to reoccur, yes, I'll go bare bones.
"I'm in the process of RMAing a 4TB Seagate drive which won't even pass a short SMART test. It fails the read test. What do your Seagate drives report for smart test and diagnosis?" My drives pass SMART tests and went through 3 preclears with no issue.
June 9, 2013: A bit late to this thread, but I had a similar problem in the same case and same hardware with the same Seagate 3TB drives. The drives would drop from the array, usually within a few hours after I had put them in. This would happen every time I tried to swap them in with the unRAID VM shut down... until I powered off the entire server and then loaded the drives in with it powered off. Since then I've not had a single problem with the drives. I chalked it up to some kind of firmware issue with hot swapping on the drives and/or the Norco backplanes.