December 30, 2017 So, a week or so ago, I had a failed/disabled disk. I ran extended SMART tests, which came back with no issues. I removed the drive and replaced it with a brand new 3TB drive, zeroed it, and ran extended SMART tests. Everything seemed fine. A few days later, it was time for a parity check, and the same 'slot' (disk 4) showed errors and was disabled. At this point, I started to suspect a bad SATA cable. Since I use an LSI SAS controller, I swapped out the SFF-8087 to SATA cable and re-enabled the drive. Everything seemed fine again, until today. Once again, the same 'slot' has a failed drive in it. Should I be suspecting the power supply, or the LSI card, or something else?
December 30, 2017 Community Expert Post the diagnostics instead, ideally after the error and before rebooting: Tools -> Diagnostics
December 30, 2017 Author I will do that if it happens again. What I've done for the time being is swap which power cable goes to which drive, to see if a different drive fails. Will keep you posted.
January 10, 2018 Author Ok, so pretty much every drive passes SMART extended tests without issue. I can write files to the array on each disk without issue, but twice in a row now, as soon as I start a parity check, a disk gets upset, spits errors and gets disabled. I can't explain what is causing it. I'll try to trigger it again soon and get Diagnostics straight away, unless anyone thinks there is a better course of action?
January 10, 2018 Notice that the power consumption and CPU load are higher when you run a parity check. So you will run the PSU much hotter. And there will be bigger ripples on the power lines. And potentially the PSU can't support the biggest current spikes. And at the same time, you run RAM and CPU hotter. Hotter means closer to the stability limit, because the chips get slower the hotter they get. What types of burn-in tests have you done?
January 10, 2018 Author It's been the same machine for many years, and it has always been stable. The only change is that I recently swapped out 2 x 2TB drives (1 parity, 1 data) for 2 x 3TB drives. So perhaps power usage is the issue? The power supply struggling under load?
January 10, 2018 Author Quick specs:
Gigabyte F2A68HM-DS2 motherboard
AMD A4-7300 CPU
8GB DDR3 RAM
LSI 9211-8i (LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]) SAS controller
Power supply is 500W (Thermaltake TT-500NL2NK-A) - I've had this for a few years now, I believe.
Hard disks are as follows:
Cache: 128GB Kingston SSD (SV300)
Parity: 3TB Seagate NAS (ST3000VN000)
Disk 1: 2TB Seagate (ST2000DM001)
Disk 2: 2TB Western Digital Purple (WDC_WD20PURX)
Disk 3: 2TB Western Digital Purple (WDC_WD20PURZ)
Disk 4: 3TB Seagate NAS (ST3000VN000)
Disk 5: 2TB Seagate (ST2000DM001)
Disk 6: 2TB Western Digital Purple (WDC_WD20PURX)
January 10, 2018 Maybe start by using compressed air to blow the dust out of the PSU, CPU heatsink etc., and look at how much dust you have on the fan blades and any dust filters, and see if that makes a difference.
January 10, 2018 Author The case is very clean, and there are dust filters on almost every entrance point. It's all housed in a Fractal Design Define R5. I'm now wondering if it's the power supply ageing, or perhaps those cheap Molex -> 2 x SATA power cable splitter things.
January 10, 2018 If you have dust filters, then the case would normally be quite clean. But how clean are the dust filters? It's more likely to be the PSU than the cable splitters if your machine works well for single-disk writes but starts to misbehave when you start a full parity sync.
January 10, 2018 Author I'll set it to not write corrections to parity, and try a sync shortly. If it fails almost instantly, I'm happy to assume it's the PSU and replace it.
January 10, 2018 Author Ok, so, within 1-2 seconds after pressing "Check" to start a parity sync, it aborted with a 'drive failure'. Diagnostics are attached. I suspect power issues.
wishie-diagnostics-20180111-0101.zip
January 10, 2018 1-2 seconds is too short a time for it to be a heat issue, but more than enough time for a disk to reset because of issues with the power. I would definitely recommend testing with a different PSU.
January 10, 2018 Author ...on start of the parity check:
[451898.927735] mdcmd (45): check correct
[451898.927760] md: recovery thread: check P ...
[451898.931876] md: using 1536k window, over a total of 2930266532 blocks.
[451899.163482] md: recovery thread: P corrected, sector=0
[451903.311373] mpt2sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
[451903.311382] mpt2sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
[451903.319258] sd 7:0:6:0: [sdi] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[451903.319271] sd 7:0:6:0: [sdi] tag#0 Sense Key : 0x2 [current]
[451903.319274] sd 7:0:6:0: [sdi] tag#0 ASC=0x4 ASCQ=0x0
[451903.319277] sd 7:0:6:0: [sdi] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 00 01 0c 40 00 00 04 00 00 00
[451903.319280] print_req_error: I/O error, dev sdi, sector 68672
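For anyone reading along, the two raw values in that log can be unpacked by hand. The following is a minimal Python sketch, not part of any Unraid or LSI tool. It assumes the CDB is a standard 16-byte READ(16) (opcode 0x88), and that `log_info` is laid out as bus type, originator, code and sub-code from the top bits down, which is how the Linux mpt2sas/mpt3sas driver appears to split it. The function names and the small originator table are made up for illustration.

```python
import struct

# Raw values copied from the kernel log in the post above.
CDB_HEX = "88 00 00 00 00 00 00 01 0c 40 00 00 04 00 00 00"
LOG_INFO = 0x31110D01

# Assumption: originator codes as printed by the Linux mpt2sas/mpt3sas driver.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_read16_cdb(cdb_hex: str) -> dict:
    """Split a READ(16) CDB into opcode, starting LBA and block count."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("expected a 16-byte READ(16) CDB (opcode 0x88)")
    # Big-endian: opcode, flags, 8-byte LBA, 4-byte length, group, control.
    opcode, _flags, lba, blocks, _group, _control = struct.unpack(">BBQIBB", cdb)
    return {"opcode": opcode, "lba": lba, "blocks": blocks}


def decode_log_info(value: int) -> dict:
    """Split a 32-bit SAS log_info word into its four fields (assumed layout)."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    print(decode_read16_cdb(CDB_HEX))
    print(decode_log_info(LOG_INFO))
```

The decoded LBA comes out as 68672, the same sector that `print_req_error` reports, and the `log_info` fields match what the driver prints itself (PL, 0x11, 0x0d01). So the failed command was simply the read at the very start of the check, not anything unusual.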
January 11, 2018 Author From a little reading, I've determined that the Sense Key, ASC, and ASCQ point to the error "Not Ready - Logical unit not ready, cause not reportable". Sadly, this doesn't shed any more light on the situation.
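The lookup described in that post can be sketched in a few lines of Python. This is only an illustration: the two tables below are a deliberately partial, hand-copied subset of the standard SCSI sense key and ASC/ASCQ assignments, and `describe_sense` is a hypothetical helper name, not something from the kernel or from Unraid.

```python
# Partial table of SCSI sense keys (subset only, for illustration).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

# Partial table of (ASC, ASCQ) pairs; only a few "not ready" entries.
ASC_ASCQ = {
    (0x04, 0x00): "LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE",
    (0x04, 0x01): "LOGICAL UNIT IS IN PROCESS OF BECOMING READY",
    (0x04, 0x02): "LOGICAL UNIT NOT READY, INITIALIZING COMMAND REQUIRED",
}


def describe_sense(key: int, asc: int, ascq: int) -> str:
    """Turn the three numbers from the kernel log into readable text."""
    key_text = SENSE_KEYS.get(key, f"unknown sense key 0x{key:x}")
    detail = ASC_ASCQ.get(
        (asc, ascq), f"ASC=0x{asc:02x} ASCQ=0x{ascq:02x} not in this partial table"
    )
    return f"{key_text}: {detail}"


if __name__ == "__main__":
    # Values from the log: Sense Key 0x2, ASC=0x4, ASCQ=0x0
    print(describe_sense(0x2, 0x4, 0x0))
```

With the values from the log, this gives "NOT READY: LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE". That tells you the drive stopped responding as ready but not why, which is consistent with (though not proof of) a drive resetting on a power dip.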
January 13, 2018 Author I've ordered a 700W Thermaltake PSU, so I hope that will fix the problems.
January 14, 2018 Author My new power supply hasn't arrived yet, but the disk just got disabled overnight (2:21am). I've just woken up (8:30am) and grabbed Diagnostics. Can someone see if they can spot anything, please?
wishie-diagnostics-20180115-0846.zip
January 17, 2018 Author Ok, the new power supply is in, the parity check has started, and so far so good. It's been running a few minutes and hasn't disabled the disk yet. I'll let it do this pass without corrections (I unticked 'write corrections to disk') and if it succeeds, I'll run a check again with it ticked.