
Consistent read errors and now disk problems


david11129


First off, my hardware: Asus P9D-WS motherboard, Xeon E3-1285L v3, 16GB Crucial ECC RAM, 3x HBAs, Norco RPC-4224 24-bay hot-swap rackmount chassis, and 4x 8TB drives, 4x 5TB, and 1x 2TB. The 8TB drives are less than 6 months old, and the 5TB drives are about a year old.

 

I put this build together about a month ago. I managed to pick up 2 of the 24-bay Norcos and 3 HBAs for $100. Way too good of a deal to pass up. My server used to be in a Node 804 case with an LSI 9211-8i HBA. I chose to use the HBAs that came with the cases rather than the 9211 I used to have because the ports on those cards are in a more convenient position.

 

About 2 weeks ago, I got an email saying the array had read errors. I checked, and 3 of my disks each had 4 or so read errors. I rebooted and figured I'd keep an eye on it. A few days later, my parity disk redballed and had ~300 read errors. The same disks had ~300 read errors each as well. They were all on one port of one of the HBAs, so I moved all four disks to another row on the backplane, figuring it was probably a bad cable. After I swapped them around and rebooted, I did a parity check. I don't remember if I had to do a new config and check that parity was already valid, or if rebooting brought the redballed drive back online automatically.

 

Once the parity check completed, the error counter showed over 1.5 million errors corrected. I went most of the week with no further errors. The monthly parity check on the 1st corrected another 4 errors. As if this weren't enough, this morning I awoke to an email showing that 2 of my 5TB drives all of a sudden have 32 reallocated sectors on them. One of the emails said one drive had 8 sectors reallocated, then 5 minutes later I got another email showing it had increased to 32. I find it strange that they each suddenly have 32 reallocated sectors when neither had any pending before, and I ran long SMART tests on each drive when the read errors first started cropping up and they all passed.
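
For reference, this is roughly how I've been keeping an eye on the counts between the Unraid emails - just a rough Python sketch, nothing official, that assumes smartctl from smartmontools is installed and that it runs as root (the device names are only examples, not my actual assignments):

#!/usr/bin/env python3
# Rough sketch: poll SMART attribute 5 (Reallocated_Sector_Ct) and report any increase.
import subprocess, time

DISKS = ["/dev/sdb", "/dev/sdc"]   # example device names, adjust to your system
last = {}

def realloc_count(dev):
    out = subprocess.run(["smartctl", "-A", dev], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])   # RAW_VALUE is the last column
    return None

while True:
    for dev in DISKS:
        count = realloc_count(dev)
        if count is None:
            continue
        if dev in last and count > last[dev]:
            print(f"{dev}: reallocated sectors went from {last[dev]} to {count}")
        last[dev] = count
    time.sleep(300)   # check again every 5 minutes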

 

At this point, I'm leaning towards either the backplane being flaky, the HBAs or cables being faulty, the motherboard being bad, or maybe I really do have 2-3 disks all failing at the same time, which doesn't seem too likely. I'm looking for guidance on how to determine which component is at fault, whether it really is my disks, and any new ideas.

 

I'm uploading the diagnostics from 6/22, when I had the triple-disk read errors, and 7/3, which is when I went to bed and saw the emails about the new reallocated sectors. Please take a look and help me out. I shut the server down after I downloaded the most recent diagnostics, figuring that if multiple disks are failing, keeping them powered off gives me a better chance to recover what I can. I do have backups of my important things; the rest is just media I can reacquire if need be.

tower-diagnostics-20180703-0024.zip

tower-diagnostics-20180622-1637.zip

 

 

Edit: Other info that may be important:

 

I switched from a Supermicro X10SL7-F-O to this Asus because the Supermicro didn't have enough PCIe slots for the VMs I wanted to run. The Asus motherboard has been in use for around 3 months. When it was in another case I have, it worked great. Shortly after putting it into this Norco, only the first 2 RAM slots work. I pulled the CPU out when I was moving into the Norco case and looked for bent pins, in case that was why 2 RAM slots quit working once I changed cases, but saw none.

 

The CPU is actually an engineering sample. I doubt it's the problem, because I've used it for several months in the Supermicro motherboard with no issues at all.

 

I also ran memtest for around 18 hours to test the RAM, and it passed with no errors.

Link to comment

Any suggestions for diagnosing? I'm going to swap the motherboard for the known-good Supermicro I have.

 

Once that's done, is there any way to speed up the time until a read error occurs or the reallocated sectors increase? I don't want to have to wait up to a week again for the errors to happen. 
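
The only thing I can think of to force the issue is to read every sector back and see what falls over. Something like this rough Python sketch is what I had in mind (run as root against the raw device, with the array stopped; the device path is only an example):

#!/usr/bin/env python3
# Rough sketch: sequentially read an entire device to surface read errors sooner.
import sys

CHUNK = 1024 * 1024   # read 1 MiB at a time

def scrub(dev):
    pos = 0
    with open(dev, "rb", buffering=0) as f:
        while True:
            try:
                data = f.read(CHUNK)
            except OSError as e:
                print(f"{dev}: read error near byte offset {pos}: {e}")
                pos += CHUNK
                f.seek(pos)       # skip past the bad spot and keep going
                continue
            if not data:
                break             # end of device
            pos += len(data)

if __name__ == "__main__":
    scrub(sys.argv[1])            # e.g. python3 readscrub.py /dev/sdb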

 

 

Link to comment

OK, so I got home and fired the server back up. I ran a short SMART test on each disk. One has remained at 32 reallocated sectors. The other went up to 40; 5 minutes after the test ended, it's up to 56. It's still under the 2-year warranty. I'm gonna RMA and replace it.

 

Could 2 disks going bad but not throwing SMART errors yet cause the read errors and the parity errors? I find it weird that the read errors in the syslog were all on the same sector on each disk.
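
For what it's worth, this is the kind of quick-and-dirty Python I used to convince myself the errors really were landing on the same sector across disks - it just scans the syslog for the md read error lines (the log path is only an example, adjust it to wherever your syslog lives):

#!/usr/bin/env python3
# Rough sketch: group "md: diskN read error, sector=..." syslog lines by sector
# to see whether several disks report errors on the same sector at the same time.
import re
from collections import defaultdict

PATTERN = re.compile(r"md: (disk\d+) read error, sector=(\d+)")
by_sector = defaultdict(set)

with open("/var/log/syslog") as log:   # example path
    for line in log:
        m = PATTERN.search(line)
        if m:
            by_sector[int(m.group(2))].add(m.group(1))

for sector, disks in sorted(by_sector.items()):
    if len(disks) > 1:   # same sector reported by more than one disk
        print(sector, sorted(disks))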

Link to comment
8 hours ago, david11129 said:

or if rebooting brought the redballed drive back online automatically. 

Once a disk is disabled it will never be automatically re-enabled.

 

4 hours ago, david11129 said:

Could 2 disks going bad but not throwing SMART errors yet cause the read errors and the parity errors?

Possible, but having errors on 3 different disks at the same time like these:

 

Jun 22 15:56:59 Tower kernel: md: disk6 read error, sector=4322791096
Jun 22 15:57:00 Tower kernel: md: disk3 read error, sector=4322791096
Jun 22 15:57:00 Tower kernel: md: disk4 read error, sector=4322791096

 

Suggests another problem, like a controller, cable or power issue.

 

The reallocated sectors are likely unrelated to these, though bad power can physically damage disks.

 

Also, any reason for FCP troubleshooting mode being enabled? That is more for when the server is crashing, and it makes the syslog much more difficult to read.

Link to comment
10 hours ago, johnnie.black said:

Once a disk is disabled it will never be automatically re-enabled.

 

Possible, but having errors on 3 different disks at the same time like these:

 


Jun 22 15:56:59 Tower kernel: md: disk6 read error, sector=4322791096
Jun 22 15:57:00 Tower kernel: md: disk3 read error, sector=4322791096
Jun 22 15:57:00 Tower kernel: md: disk4 read error, sector=4322791096

 

Suggests another problem, like a controller, cable or power issue.

 

The reallocated sectors are likely unrelated to these, though bad power can physically damage disks.

 

Also, any reason for FCP troubleshooting mode being enabled? That is more for when the server is crashing, and it makes the syslog much more difficult to read.

Not sure if I should be quoting or not; if not, sorry.

 

I'm not sure why I had troubleshooting mode on. I have it on now because when I was trying to run SMART tests, the server stopped responding and I had to power down. I ended up swapping the motherboard, CPU, RAM, and one of the HBAs. I'm now using one original HBA and the one built into the mobo.

 

Both disks have increasing reallocated sector counts. One is now at 48, and the other is almost to 200. I began transferring data off each disk onto others in the array, with the plan of doing a new config with fewer disks once the bad two have been pulled.

Now another disk has redballed. I was transferring off of disk6 to disk7 using rsync, and disk7 is throwing a ton of write errors and has been dropped. That disk never had SMART problems at all. I'm uploading the diagnostics from shortly after the drive got dropped.

tower-diagnostics-20180704-1059.zip

Link to comment

Now Disk1 is having errors as well. WTF? It hasn't redballed yet, so I stopped transferring to it and just picked ANOTHER disk to hopefully transfer to. Do you need a new syslog?

EDIT: Disk1 is still in the array, but also now shows up in Unassigned Devices, along with disk7. I am getting read errors on Disk3 as well. I am up to 3 disks with read errors.

EDIT2: Disk6 is now having read errors. I am up to 4 out of 9 disks in the array with read errors. One of the disks is making a chirping kind of noise. Hopefully I can at least get the biggest stuff off of the drives.

 

I am glad I have backups of the important things like photos. My wife would kill me if I lost those, so I have like 4 copies in various places, on and off site.

Link to comment

OK, rsync is crawling with various input/output errors. Is it likely the 2 failing disks are the reason the other two I was trying to recover to are throwing read errors, with one dropping completely? How often do dual simultaneous disk failures occur? Like both worked fine, both had SMART failures, and both started spitting back dirty data? At the same time?
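
Since one bad file seems to stall a whole rsync pass, I'm thinking about pulling the rest off file-by-file with something like this rough Python sketch instead, so unreadable files just get logged and skipped (the mount points and log path are only examples):

#!/usr/bin/env python3
# Rough sketch: copy a failing disk's contents file-by-file, logging and skipping
# anything that throws an I/O error instead of letting it stall the whole transfer.
import os, shutil

SRC = "/mnt/disk6"                      # example source disk mount
DST = "/mnt/disk8"                      # example destination disk mount
FAILED = "/boot/failed_files.txt"       # example log location

with open(FAILED, "a") as failed:
    for root, dirs, files in os.walk(SRC):
        for name in files:
            src_path = os.path.join(root, name)
            dst_path = os.path.join(DST, os.path.relpath(src_path, SRC))
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            try:
                shutil.copy2(src_path, dst_path)
            except OSError as e:
                failed.write(f"{src_path}: {e}\n")   # note it and move on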

 

Since they are all on a backplane, which provides power and data, that really only leaves the HBA cables, the HBAs, or the PSU as possibilities for failure. How would I narrow it down from here? I'm not sure it's the HBA cables, since I had to swap from an SFF-8087 to SFF-8087 cable to a 4x SATA to SFF-8087 reverse breakout to use the onboard LSI controller on the motherboard. So really, it's probably the backplane itself, correct?

Link to comment

Well, both disks haven't redballed, but they look like they have been soft-dropped or something. They are now in Unassigned Devices and are not showing any read activity, even though they should both still be being copied from.

 

Am I good to do a new config and just use the good disks? Parity can't save me now anyway; not even dual parity could have. I wanted to ask before I destroyed and rebuilt the array.

tower-diagnostics-20180704-1149.zip

Link to comment
53 minutes ago, johnnie.black said:

Like I mentioned, you have other problems besides the disks, e.g. controller, cable, PSU; you need to start ruling things out.

Well, I've replaced every controller except one, but it only has one disk on it, and not even a failing disk. The cables were replaced too, other than the two off of the old HBA, but those only run that single disk, and it hasn't had any failures. So that leaves the backplane and the PSU, correct?

 

I basically have enough hardware to replace almost everything. Where would you begin? The only thing I don't have is more SFF-8087 cables, because I've already swapped them. I have an empty 24-bay hot-swap case I could move into. That's a lot of work, so before that, what would you do to begin diagnosing?

Link to comment

Sorry I keep posting. I have other problems besides the disks, but those 2 disks are for sure failing, right? There's no way whatever hardware problem is occurring is the cause of the two disks' behavior? Or rather, it may be the cause of them failing, but they are definitely failing either way, correct?

Link to comment

I had some disks redballing on me - I would re-enable them, let the parity rebuild run, and it seemed fine for a while - then another disk would go, in the same cage of four drives.

Ended up figuring it was a power issue - moved the cage of four drives to a different Molex connector on my PSU and things have been fine ever since - I was occasionally getting machine checks in my logs and those seem gone as well.

So I would really consider putting the drives on another PSU (even temporarily) to see if your problems go away!?

Link to comment

FWIW: I haven't had any luck with those Toshiba drives in my Norco, they drop like dominoes, and the WD Greens I would replace at around the same number of hours you have on 'em. I've also had one of those WD 8TBs crib-die on me too, so it's not that unusual. Unlucky, sure.

 

I've felt your pain brother.

 

Link to comment
4 hours ago, Michael_P said:

FWIW: I haven't had any luck with those Toshiba drives in my Norco, they drop like dominoes, and the WD Greens I would replace at around the same number of hours you have on 'em. I've also had one of those WD 8TBs crib-die on me too, so it's not that unusual. Unlucky, sure.

 

I've felt your pain brother.

 

Well, crap. Did you have any luck RMA'ing the Toshibas? I haven't tried yet, and I'm worried it'll be denied, because at one point the drives went over the max operating temp by 2C. What is interesting is that now that I've replaced the PSU like @johnnie.black suggested, the reallocated sector count is no longer increasing on either drive, and I was able to pull all the data off of one of them; the second one ran overnight and is still being copied from. That one had 4TB on it. I also eliminated all the HBAs other than the internal one on the motherboard, and am running the two failing drives on SATA cables attached directly to the motherboard. Disk 7, the one that dropped out, is still redballed, with a note that it has no mountable filesystem. I'm planning on directly attaching it too once I get all the data off of the two Toshibas.

 

Is there any chance the increasing reallocated sector count was due to whatever problem I seem to have fixed? At least so far, everything is working with no errors.

Link to comment

Bad power can damage disks, including causing pending and/or reallocated sectors, but this is also not good at all:

1 minute ago, david11129 said:

because at one point the drives went over the max operating temp by 2C

For better reliability, disks should always run below 40C, 45C tops.

Link to comment

I realize that. I try to keep them below 45C, with a goal of under 40C like you said. The case doesn't have the best cooling, but with the bigger fans I've added, they rarely hit 44C. I've had the Toshibas for a little over a year, so I'm not sure when or in what case they hit 57C. The Reds all show a max temp of 44C, IIRC.
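
To keep myself honest on temps, I've got a little Python sketch that nags me whenever anything creeps past 40C - again it assumes smartctl is installed and runs as root, and the device list is just an example:

#!/usr/bin/env python3
# Rough sketch: warn when any drive's SMART temperature (attribute 194) creeps past 40C.
import subprocess, time

DISKS = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]   # example device names

def temperature(dev):
    out = subprocess.run(["smartctl", "-A", dev], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Temperature_Celsius" in line:
            return int(line.split()[9])   # RAW_VALUE column; first number is the temp
    return None

while True:
    for dev in DISKS:
        temp = temperature(dev)
        if temp is not None and temp >= 40:
            print(f"{dev} is at {temp}C - check the fans")
    time.sleep(600)   # every 10 minutes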

Link to comment

@Michael_P, I see you have a Norco 4224 as well. Have you shucked any of the white-label 8TBs? In mine, they don't work unless I tape over the 3.3V reset pin. The tape seems to be working, but I do wonder if it may be the reason I had a few Reds drop out on occasion. For insurance, I redid all the tape on them last night, but I'm wondering if you have a better solution? I was under the impression the backplanes on these don't have a 3.3V feed, yet mine won't work without the tape. It's weird too, since the backplane is run off of 6 Molex connectors, and Molex doesn't even have a 3.3V wire. The backplane must somehow be feeding the 3.3V reset pin from the Molex connector, I guess.

Link to comment

@johnnie.black, I'm getting read errors again now, on 3 different disks than before, it seems. I've swapped everything I can think of. Perhaps my reverse breakout cables are bad as well? The only thing that hasn't changed is the backplane, but the "backplane" on the RPC-4224 is actually 6 mini backplanes, each powering 4 drives. The drives are in different slots than before, and therefore are on a different mini backplane. Any more ideas? Since Tuesday I've swapped out the motherboard, CPU, RAM, power supply, and cables, and removed all my HBAs other than the internal motherboard one. Disk problems? I doubt it, because once again all the failures are on the same sectors on each disk. I'm attaching new diagnostics again, without troubleshooting mode enabled.

tower-diagnostics-20180706-2239.zip

Link to comment
17 hours ago, david11129 said:

Well, crap. Did you have any luck RMA'ing the Toshibas? I haven't tried yet, and I'm worried it'll be denied, because at one point the drives went over the max operating temp by 2C. What is interesting is that now that I've replaced the PSU like @johnnie.black suggested, the reallocated sector count is no longer increasing on either drive, and I was able to pull all the data off of one of them; the second one ran overnight and is still being copied from. That one had 4TB on it. I also eliminated all the HBAs other than the internal one on the motherboard, and am running the two failing drives on SATA cables attached directly to the motherboard. Disk 7, the one that dropped out, is still redballed, with a note that it has no mountable filesystem. I'm planning on directly attaching it too once I get all the data off of the two Toshibas.

 

Is there any chance the increasing reallocated sector count was due to whatever problem I seem to have fixed? At least so far, everything is working with no errors.

 

Yeah, all of them were accepted - I doubt they even checked, though mine did not overtemp.

 

17 hours ago, david11129 said:

@Michael_P, I see you have a Norco 4224 as well. Have you shucked any of the white-label 8TBs? In mine, they don't work unless I tape over the 3.3V reset pin. The tape seems to be working, but I do wonder if it may be the reason I had a few Reds drop out on occasion. For insurance, I redid all the tape on them last night, but I'm wondering if you have a better solution? I was under the impression the backplanes on these don't have a 3.3V feed, yet mine won't work without the tape. It's weird too, since the backplane is run off of 6 Molex connectors, and Molex doesn't even have a 3.3V wire. The backplane must somehow be feeding the 3.3V reset pin from the Molex connector, I guess.

 

All of my WDs save for the 4TB drive are from shucked enclosures, a mix of white labels and Reds - I didn't have to do anything funky with them. Just popped them in a sled and loaded them into the bay.

 

Link to comment
On 7/4/2018 at 8:58 PM, david11129 said:

So that leaves the backplane and the PSU, correct?

 

Problems with power or vibration can produce reallocated and pending sectors. Disks need a stable environment to work as expected.

Link to comment
On 7/7/2018 at 12:11 AM, johnnie.black said:

I would say it's almost certainly not the disks; start by checking what parity and disks 1 and 2 share, like cable, backplane, etc.

OK, @johnnie.black, I've been stable for a couple of weeks now. I was able to get all the data off of those two disks. Both of the "failing" disks are no longer showing increasing SMART values. I removed them from the array and ran preclear on both. They both passed, still without increasing SMART values. Is it safe to assume they are probably trustworthy now, and their reallocated sector count was caused by some other problem that is now rectified? In addition to the things I swapped out a few weeks ago, I also moved the drives around so they aren't as close together. Now none of them even hit 40C during a parity check. I wonder if vibration caused the X300s to start reallocating sectors?
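
Before I fully trust them again I'll keep comparing SMART snapshots for a while. A rough Python sketch of what I mean (assuming smartctl is available; IDs 5, 197 and 198 are the reallocated, pending and offline-uncorrectable counts):

#!/usr/bin/env python3
# Rough sketch: snapshot the SMART attributes I care about and compare two runs,
# so I can tell at a glance whether a "recovered" disk is still deteriorating.
import subprocess, sys, time

WATCH = {"5": "Reallocated_Sector_Ct",
         "197": "Current_Pending_Sector",
         "198": "Offline_Uncorrectable"}

def snapshot(dev):
    out = subprocess.run(["smartctl", "-A", dev], capture_output=True, text=True).stdout
    values = {}
    for line in out.splitlines():
        parts = line.split()
        if parts and parts[0] in WATCH:
            values[WATCH[parts[0]]] = int(parts[-1])   # RAW_VALUE column
    return values

if __name__ == "__main__":
    dev = sys.argv[1]                   # e.g. python3 smartwatch.py /dev/sdc
    before = snapshot(dev)
    time.sleep(24 * 3600)               # check again a day later
    after = snapshot(dev)
    for attr, value in before.items():
        if after.get(attr, value) != value:
            print(f"{dev}: {attr} changed from {value} to {after[attr]}")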

Link to comment
