March 15, 2017
Had a drive fail. I replaced the failed drive, and a day later another drive gave an error. I'm wondering whether this might be a controller issue, but I'm not sure. I changed out the cables and re-enabled the drive; it is rebuilding now. I downloaded the SMART report and started an extended test. I'm not sure whether this is another bad drive, and could use some help understanding the report. tower-smart-20170314-2013.zip
March 15, 2017
An interesting SMART report! It looks fine at first, like a brand new drive, with only 74 Power On Hours. But there are 2 big discrepancies in the report! One is that the test section lists 2 vendor tests at 21639 hours! Both completed without errors, and there's no indication there of any other testing. Yet the General Values section near the top says there's a test in progress, with 90% yet to complete. So one part says no test is in progress, and another part says there is. One part says the drive is new with only 74 hours, and another part says there was testing at 21639 hours! Is this drive possibly a refurb? It looks like they may have reset most but not all of the SMART attributes and values.
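The age discrepancy described above can be checked mechanically. The following is a minimal sketch, assuming the report follows the usual `smartctl -a` text layout; the sample text is an illustrative excerpt that only echoes the numbers from this post (it is not the attached report), and the helper names `attribute_raw`, `self_test_hours` and `age_discrepancy` are made up for this example.

```python
import re

# Illustrative excerpt in the general layout of `smartctl -a` output.
# The values echo the numbers discussed in the post; this is NOT the attached report.
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       74

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor offline      Completed without error       00%     21639         -
# 2  Vendor offline      Completed without error       00%     21639         -
"""


def attribute_raw(report, name):
    """Return the raw value of a SMART attribute, or None if it is absent.

    Assumes the raw value is a plain integer in the 10th column; some
    drives append extra text there, which this sketch does not handle.
    """
    for line in report.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == name:
            return int(fields[9])
    return None


def self_test_hours(report):
    """Collect the LifeTime(hours) column from self-test log entries."""
    hours = []
    for line in report.splitlines():
        if line.startswith("#"):
            # Entries look like "... 00%     21639 ...": percent remaining, then hours.
            match = re.search(r"(\d+)%\s+(\d+)", line)
            if match:
                hours.append(int(match.group(2)))
    return hours


def age_discrepancy(report):
    """True if a logged self-test claims more hours than Power_On_Hours."""
    power_on = attribute_raw(report, "Power_On_Hours")
    tests = self_test_hours(report)
    if power_on is None or not tests:
        return False
    return max(tests) > power_on


print(age_discrepancy(SAMPLE_REPORT))
```

A result of `True` only says that the two counters disagree. It does not by itself prove that a drive was refurbished.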
March 15, 2017
But SMART only tells you about the drive's health. Many drive errors have to do with other components, and for that you would need to check the syslog. Please see "Need help? Read me first!", and attach the diagnostics zip.
March 15, 2017 Author
Thanks for your help, Rob. Attached are the most recent completed SMART report and syslog, plus the syslog from before the last reboot. tower-smart-20170314-2028.zip tower-diagnostics-20170314-2210.zip tower-diagnostics-20170305-1658.zip
March 16, 2017
The testing section in the SMART report for that new Hitachi (Disk 2, SN=...R7R) has improved, now showing the test that was in progress before, plus a completed test. No issues on the tests, but it still has the age discrepancy. Was this drive a refurbished one?

You have another Hitachi (Disk 4, SN=...YKP) with an odd age discrepancy too. Possibly another refurb?

You have another Hitachi (Disk 1, SN=...5TV) with 2 Pending sectors. It has had them for a while, and you've acknowledged the warning so that they are ignored. That's a very bad idea! There are many SMART attribute changes you can safely ignore, but not this one. This one will block drive rebuilds, and parity checks and builds. If you can't use parity for drive rebuilds, then having parity is useless. (And if you don't care about parity, then you don't really need unRAID!)

You have IDE emulation turned on for some of your onboard SATA drives. When you next boot, go into the BIOS settings, look for the SATA mode, and change it to a native SATA mode, preferably AHCI if available; anything but IDE emulation mode. AHCI should be faster and safer (and as you'll see below, IDE proved disastrous!). Worse, the syslog reported "limited to UDMA/33 due to 40-wire cable" for them, so performance is likely to be terrible for those 2 drives. In the earlier syslog they were the Seagate ST1000 (Disk 2, sdc, ...M5Y) and the SanDisk SSD (Cache, sdb, ...343). In the current syslog, they were Disk 5 (Hitachi ...5EA) and Parity (WD30EZRX ...492). None of those are drives you want on the ports configured to be the slowest!

Your tunables are not optimal, which will result in much slower parity check performance. md_sync_thresh should be raised to either 512 or about 990, depending on whether you have certain SATA controllers (I don't remember which). Test for yourself.

The syslog is filling and rolling over with BTRFS errors, indicating a corrupt file system for the Cache pool.
On March 5 at 14:55:15, after days of BTRFS errors related to the corrupted BTRFS file system on the Cache drives, the SSD holding the Cache drive stopped responding. Multiple attempts were made to recover it, all unsuccessful, so it was dropped from the system. Because it was on an emulated IDE channel instead of a SATA channel, the whole channel was dropped, which included Disk 2, the Seagate ST1000! There was no fault with it; it just happened to be tied to a drive that was dropped, so it was dropped too. You removed it, but I believe you will find that it's completely fine. Check it to be sure.

After the SSD was dropped, there were several Call Traces, but they were from BTRFS because the Cache drive was gone, and therefore loop0 was gone! So you can ignore them.

Now for the new syslog! Drives have moved and changed. You started then stopped the array, assigned Hitachi ...R7R to Disk 2, then restarted, and Disk 2 is about 2 hours into rebuilding. I'm not optimistic it will finish because of the bad sectors on Disk 1. You still have the BTRFS corruption in the Cache pool. The CrashPlan container complains that the path "/mnt/disks" does not exist.

So ...
- The BTRFS corruption in the Cache pool *has* to be fixed. I'm not sure you can completely trust anything stored on it. I suspect you will have to stop Dockers and VMs, then copy whatever you can off of it, and reformat it. I believe there are FAQ entries about this. It might be best to disable Dockers, then delete and recreate docker.img.
- AHCI needs to be set in the BIOS.
- The Pending sectors on Disk 1 *have* to be fixed, probably by rebuilding Disk 1 onto itself.
- And you are trying to rebuild Disk 2 now. If it does not go through, you might try putting back the original Disk 2. You would have to do a New Config with "Retain all" and set "Parity is already valid" in order to put the smaller 1TB drive back into Disk 2. Once the array is working fine again, you can then try again to add the 3TB drive.
I may have forgotten something, forgive me.
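For anyone wanting to check their own syslog for the "limited to UDMA/33 due to 40-wire cable" message quoted above, here is a minimal sketch. The sample lines are hypothetical (the host name, timestamps and port numbers are invented, not taken from the attached diagnostics), and the function names are made up for this example. It also groups the affected devices by port, on the assumption that two drives sharing an emulated IDE channel show up as `.00` and `.01` on the same `ataN` port.

```python
import re
from collections import defaultdict

# Hypothetical syslog lines for illustration only; host name, timestamps
# and port numbers are invented.
SAMPLE_SYSLOG = """\
Mar  5 10:00:01 Tower kernel: ata3.00: limited to UDMA/33 due to 40-wire cable
Mar  5 10:00:01 Tower kernel: ata3.01: limited to UDMA/33 due to 40-wire cable
Mar  5 10:00:02 Tower kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
"""

# Captures the device (e.g. "ata3.00") and the transfer mode it was limited to.
LIMITED = re.compile(r"(ata\d+\.\d+): limited to (UDMA/\d+) due to 40-wire cable")


def limited_devices(syslog_text):
    """Return (device, mode) pairs for every speed-limited device found."""
    found = []
    for line in syslog_text.splitlines():
        match = LIMITED.search(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def devices_by_channel(devices):
    """Group limited devices by their ataN port to spot shared channels."""
    channels = defaultdict(list)
    for device, _mode in devices:
        channels[device.split(".")[0]].append(device)
    return dict(channels)


devices = limited_devices(SAMPLE_SYSLOG)
for port, members in devices_by_channel(devices).items():
    print(port, members)
```

If a port lists two devices, they probably sit on the same emulated channel, which is the situation where one drive dropping took the other with it.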