October 16, 2017

I recently had some trouble with several drives on my Unraid server. I had the issues solved (or at least I thought they were solved). I had support in this thread:

Just today I installed the unBalance plugin and looked around in the program. All disks spun up and unBalance gathered its information on where the files are stored. I did not use unBalance to scatter or gather any files; I just looked around. When I returned to the Unraid WebGUI I had read errors on two of my drives. Disk 2: 4 errors. Disk 5: 1 error.

Both drives are fairly new (disk 2 is about 6 months old, disk 5 about 2 months). I had a successful parity rebuild without errors only 2-3 weeks ago. I have not used the array much since then.

I looked at the SMART reports.

Disk 2 now has: 197 Current pending sector: Raw Value 1.

Disk 5: I cannot see what the error on disk 5 is. I compared the new SMART data with an old SMART report and there was no change in the values that are visible in Unraid (Disk 5 had 521 UDMA CRC errors, attribute 199, but these were from before because of bad cabling).

At the moment, managing Unraid is giving me nothing but headaches. I am really considering moving to a simpler device, such as a Synology NAS. Is data storage really this unreliable? I never knew.

So, what do I do next?

Edit: Attachments removed.

Edited November 3, 2017 by kno
October 16, 2017 Community Expert

Disk2 has a pending sector and needs to be replaced. Disk5 SMART looks fine. There are reports that this specific model (WD60EFRX-68L0BN1) has firmware issues; the older model (WD60EFRX-68MYMN1) is fine.
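For readers who want to check the same attributes on their own drives, here is a minimal sketch that pulls the raw values of a few health-related SMART attributes out of text in the `smartctl -A` table layout. The sample table is made up for illustration (only the pending-sector raw value of 1 mirrors disk 2 in this thread), and the choice of watched attribute IDs and the function name `nonzero_watched` are assumptions, not anything Unraid itself uses.

```python
# Attribute IDs commonly watched for disk health; this selection is an assumption.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

# Made-up sample in the smartctl -A table layout (NOT the poster's real report).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def nonzero_watched(text):
    """Return {attribute_id: raw_value} for watched attributes with a raw value above zero."""
    flagged = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have at least 10 columns and start with a numeric attribute ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]  # first token of the RAW_VALUE column
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[attr_id] = int(raw)
    return flagged


if __name__ == "__main__":
    for attr_id, raw in nonzero_watched(SAMPLE).items():
        print(f"{attr_id} {WATCHED[attr_id]}: raw value {raw}")
```

A nonzero CRC count (199) that is not increasing usually points at past cabling trouble rather than the disk, which is why comparing against an older report, as done in the first post, is more telling than a single snapshot.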
October 16, 2017 Author

How can I explain the error on disk 5 if there is no trace of it in the SMART reports? I would like to know what went wrong and how to prevent it in the future. Do random read errors just happen sometimes?

Is it possible that the HDD generates random read errors because of a firmware error? If this is the case, how can I possibly rebuild a failed drive? I purchased three WD 6TB drives a few weeks ago. I think that they are all WD60EFRX-68L0BN1. I have searched for WD60EFRX-68L0BN1 firmware error but found nothing specific. If WD hasn’t said anything about it, it will be difficult to RMA the drive.

If I need to replace and rebuild disk 2, other read errors will cause problems.

Do you think one pending sector is reason enough for an RMA, or do I need to preclear the drive until it fails?
October 16, 2017 Community Expert

2 minutes ago, kno said: Do random read errors just happen sometimes?

They shouldn't if everything is working as it should. It can happen once in a while, but if they happen frequently there's a problem.

4 minutes ago, kno said: Is it possible that the HDD generates random read errors because of a firmware error? If this is the case, how can I possibly rebuild a failed drive? I purchased three WD 6TB drives a few weeks ago. I think that they are all WD60EFRX-68L0BN1. I have searched for WD60EFRX-68L0BN1 firmware error but found nothing specific. If WD hasn’t said anything about it, it will be difficult to RMA the drive.

I've been reading about unexplained errors with these disks on various systems, including FreeNAS, QNAP, openmediavault and unRAID, but it's still not confirmed that the disks are the actual problem; it may be a specific hardware combination. That said, the FreeNAS case was only solved by replacing all those disks, after the user had tried replacing everything else.

8 minutes ago, kno said: Do you think one pending sector is reason enough for an RMA, or do I need to preclear the drive until it fails?

It's enough to RMA. It's also enough to fail a SMART test: the extended test for sure, but even the short test should fail with a read failure.

The problem can also be caused by cables, controller (you're using a Promise controller, which I don't see much with unRAID, so no idea if it's reliable), power supply, etc.
October 16, 2017 Author

27 minutes ago, johnnie.black said: The problem can also be caused by cables, controller (you're using a Promise controller, which I don't see much with unRAID, so no idea if it's reliable), power supply, etc.

This is referring to the random read errors, right?

Am I correct in assuming that the current pending sector is always a drive error, caused by a bad sector (or failing sector) on one of the platters? Once a sector is bad, it has to be remapped and "closed off" so that no additional data is written to this sector? I understand the drive can remap some number of bad sectors and still be usable. I read that some users continue to use their drives and force remapping by zeroing the drive, but if I can RMA the disk that would be preferable.

I do run my Unraid server on older hardware, but I am thinking that all drives on the same controller or power line would have errors if it were another hardware issue. Still, I am considering getting new hardware.

Also, I have a problem with keeping some of the data. If I have to replace 4 x 6TB disks I have no place to copy the data!

One thing is certain. I am finished with WD disks! I have had nothing but bad experiences with them now.
October 16, 2017 Community Expert

14 minutes ago, kno said: Am I correct in assuming that the current pending sector is always a drive error, caused by a bad sector (or failing sector) on one of the platters?

Yes.

14 minutes ago, kno said: Once a sector is bad, it has to be remapped and "closed off" so that no additional data is written to this sector?

A new write to that sector will remap it, or re-enable it if the drive considers it good.
October 16, 2017 Author

So, what do you think? Should I replace disk 2 and try to rebuild (and hope disk 5 does not give many errors)? I have two more WD60EFRX-68L0BN1 drives installed in the array.
October 16, 2017 Author

I tried running a SMART short self-test on disk 2. I got the following message:

Interrupted (host reset)

What does that mean?

The syslog contains the following:

Oct 16 22:53:20 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 16 22:53:20 Tower kernel: ata10.00: failed command: IDENTIFY DEVICE
Oct 16 22:53:20 Tower kernel: ata10.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
Oct 16 22:53:20 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Oct 16 22:53:20 Tower kernel: ata10.00: status: { DRDY }
Oct 16 22:53:20 Tower kernel: ata10: hard resetting link
Oct 16 22:53:22 Tower kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Oct 16 22:53:22 Tower kernel: ata10.00: configured for UDMA/100
Oct 16 22:53:22 Tower kernel: ata10: EH complete

Can something be read from the log above?
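To make a log like the one above easier to scan, here is a rough sketch that groups libata kernel messages by ATA port and counts failed commands, link resets, and timeouts. The sample lines are copied from the post; the function name `summarize_ata_events` and the choice of which messages to count are assumptions for illustration, and real logs contain many more message variants than this handles.

```python
import re
from collections import defaultdict

# Lines copied from the syslog excerpt in the post above.
SAMPLE = """\
Oct 16 22:53:20 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 16 22:53:20 Tower kernel: ata10.00: failed command: IDENTIFY DEVICE
Oct 16 22:53:20 Tower kernel: ata10.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
Oct 16 22:53:20 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Oct 16 22:53:20 Tower kernel: ata10.00: status: { DRDY }
Oct 16 22:53:20 Tower kernel: ata10: hard resetting link
Oct 16 22:53:22 Tower kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Oct 16 22:53:22 Tower kernel: ata10.00: configured for UDMA/100
Oct 16 22:53:22 Tower kernel: ata10: EH complete
"""

# Matches "kernel: ata10: ..." and "kernel: ata10.00: ..." and captures port and message.
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_events(text):
    """Count failed commands, hard resets and timeouts per ATA port."""
    summary = defaultdict(lambda: {"failed_commands": [], "resets": 0, "timeouts": 0})
    last_port = None
    for line in text.splitlines():
        match = PORT_RE.search(line)
        if match:
            last_port, message = match.groups()
            if message.startswith("failed command:"):
                command = message.split(":", 1)[1].strip()
                summary[last_port]["failed_commands"].append(command)
            elif "hard resetting link" in message:
                summary[last_port]["resets"] += 1
        elif "(timeout)" in line and last_port is not None:
            # The "res ..." continuation line carries no port prefix,
            # so it is attributed to the most recently seen port.
            summary[last_port]["timeouts"] += 1
    return dict(summary)


if __name__ == "__main__":
    for port, events in summarize_ata_events(SAMPLE).items():
        print(port, events)
```

If one port keeps showing up with resets and timeouts while others stay quiet, that narrows the search to the disk, cable, or controller port behind it.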
October 16, 2017 Community Expert

Host reset means the test was interrupted, maybe by the controller. Try again, and if the result is the same, connect the disk to a different controller.
October 17, 2017 Author

I ran SMART tests on disks 2 and 5.

Disk 2 failed the short test (and there was mechanical noise from the server chassis). This disk is definitely bad. It is now marked with a red X in the web GUI.

Disk 5 passed both the short and the extended SMART test. I cannot explain the read error I had on it. Hopefully it won’t happen again.

I will now exchange disk 2 for a spare disk and hope the rebuild is successful.

What do you do with your disks before you send them in for RMA? There is still data on this one, so I would prefer to wipe it if possible. Any suggestions?
October 17, 2017 Community Expert

25 minutes ago, kno said: What do you do with your disks before you send them in for RMA? There is still data on this one, so I would prefer to wipe it if possible. Any suggestions?

I don't really care about that, but you can try running preclear on it.
October 18, 2017 Author

I need some more advice.

I replaced disk 2 with a new drive and started the rebuild. Now I cannot reach the web GUI, nor can I access the server with PuTTY. It has been about 13 hours since I started the rebuild and lost contact. With PuTTY I get the error message “Unable to open connection to TOWER. Host does not exist.”

So I plugged my monitor into the Unraid server to see if something could be seen that way. There was a lot of text scrolling on the screen, but it was flickering so fast that it was impossible to make out what it said. I grabbed a camera and tried to capture it. I can make out some of it. I think it might be the same message repeating over and over, but I am not 100% sure. I am not sure what it means and I am not sure what I am looking for.

I can make out the following:

Fixing recursive fault but reboot is needed!
BUG: unable to handle kernel paging request at 0000002b00000008
IP: [<ffffffff81385377>] blk_flush_plug_list+0x59/0x219
PGD 0
Oops: 0002 [#4287628] PREEMPT SMP
Modules linked in: md_mod coretemp kvm_intel kvm i2c_i801 i2c_smbus i2c_core sata_promise sata_sil24 ahci intel_agp libahci pata_jmicron intel_gtt atl1 agpgart mii asus_atk0110 acpi_cpufreq [last unloaded:md_mod]
CPU: 3 PID 1216 Comm: kworker/u8:15 Tainted: G D W 4.9.30-unRAID #1
Hardware name System manufacturer System Product Name/P5B-E, BIOS 1807 04/15/2009
[then there is some more code]
Call Trace:
[…] dump_stack+0x61/0x7e
[…] __warn+0xb8/0xd3
[…] warn_slowpath_null+0x18/0x1a
[…] do_exit+0x5c/0x86a
[…] rewind_stack_do_exit+0x17/0x19
---[ end trace 7672d1c585583b99 ]---

I think there is some other info in there as well that I might have overlooked. Please have a look at the attached screenshots taken with the camera.

So, I guess something is wrong? What could it be?

Is it safe to just switch off/kill power to the system? Should I do it?

BTW, I ordered a used LSI 9201-16i and some SAS to SATA cables yesterday. Hopefully they will arrive within 2 weeks or so.
October 18, 2017 Community Expert

17 minutes ago, kno said: Is it safe to just switch off/kill power to the system? Should I do it?

It's the only option. The rebuild should restart from the beginning.
October 18, 2017 Author

Any thoughts on what the above error is about?

I killed power to the server and rebooted. The disk rebuild started automatically; however, both disk 1 and disk 5 reported errors, so I stopped it at once. I have 1 read error on disk 1 and 126 read errors on disk 5, just from the first minute of rebuilding. Disk 5 passed both the short and the extended SMART test yesterday without any errors, so this is strange.

There is something seriously wrong somewhere, but what is it?!? Both drives are new WD60EFRX-68L0BN1 disks. Can this make/model really be the problem? If they are so troublesome there must be some more info about it somewhere? If that is the culprit, how can I ever rebuild my failed disk?

I am thinking of doing the following:

1. Do nothing at the moment. Power off and leave it alone (wait for the LSI card, cables and second-hand Matrox PCI GPU to arrive).
2. Order a new PSU, something like 600-700 watts with removable cables.
3. Remove the controllers, SATA cables and old PSU. Replace with the new PSU and LSI card, and connect all HDDs to the LSI (including those that are on the MB today).
4. Change the MB battery and reset the MB BIOS to default values.
5. Run memtest86 for a couple of hours.
6. Boot unRAID and try to rebuild the failed drive.

What do you think? Have I missed anything?

Is there a good way to test motherboard and CPU stability in addition to memtest86 for the memory?

Perhaps I should test system stability with only CPU, GPU, and memory installed (no controllers or HDDs).
October 18, 2017 Author

BTW, here are the diagnostics.

Edit: Diagnostics removed.

Edited November 3, 2017 by kno
October 18, 2017 Community Expert

36 minutes ago, kno said: Any thoughts on what the above error is about?

Difficult to say.

37 minutes ago, kno said: I am thinking of doing the following:

1. Do nothing at the moment. Power off and leave it alone (wait for the LSI card, cables and second-hand Matrox PCI GPU to arrive).
2. Order a new PSU, something like 600-700 watts with removable cables.
3. Remove the controllers, SATA cables and old PSU. Replace with the new PSU and LSI card, and connect all HDDs to the LSI (including those that are on the MB today).
4. Change the MB battery and reset the MB BIOS to default values.
5. Run memtest86 for a couple of hours.
6. Boot unRAID and try to rebuild the failed drive.

What do you think? Have I missed anything?

Sounds like a good plan. If the LSI arrives first, I'd try again after changing just the controller.
October 19, 2017 Author

I pulled all the power cables and SATA cables to the HDDs and wrote down the connections in a matrix to see if there is a common denominator for the drives that have random errors.

Both drives are at the end of one of the molex power cables. I have a theory that this might be a contributing factor that could explain some of the problems I am experiencing.

This could explain why I do not get any errors during a SMART test but do get errors during a rebuild. During a SMART test, one or two drives are running. The other drives are idling or in standby. During a rebuild, all drives are working at full load and probably drawing maximum power. Perhaps the PSU is struggling to deliver stable power from the 12V rail to all drives? Perhaps it becomes unstable, or there are small fluctuations that cause random read errors?

Could this also explain why I have had several drive failures? Can bad/low quality power cause pending sectors or bad sectors?

What do you think of this theory?

At least this little exercise has made it clear to me that I should get a new power supply just to be certain.

Also, I plan to run some days of stability tests while waiting for the LSI card and cables. I think I will run the memory and CPU tests on the Ultimate Boot CD. This way I can be assured that the rest of the system is stable.
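The "connection matrix" idea in the post above can be sketched in a few lines: record each drive's power cable, position on that cable, and controller, then look for attributes that all erroring drives share. Everything in the `WIRING` table below is a hypothetical placeholder, not the poster's actual layout, and `common_denominators` is an illustrative name rather than an existing tool.

```python
# Hypothetical wiring matrix: every value is a placeholder, NOT the real server layout.
WIRING = {
    "disk1": {"power_cable": "molex_A", "cable_position": "end", "controller": "promise"},
    "disk3": {"power_cable": "molex_A", "cable_position": "middle", "controller": "promise"},
    "disk4": {"power_cable": "molex_B", "cable_position": "start", "controller": "onboard"},
    "disk5": {"power_cable": "molex_B", "cable_position": "end", "controller": "onboard"},
}

# Drives that reported random read errors (disk 1 and disk 5 in this thread).
ERRORING = {"disk1", "disk5"}


def common_denominators(wiring, erroring):
    """Find attributes on which all erroring drives agree.

    Returns {attribute: (shared_value, healthy_drives_with_same_value)}.
    An empty healthy list makes the attribute a stronger suspect.
    """
    attributes = set().union(*(info.keys() for info in wiring.values()))
    result = {}
    for attribute in sorted(attributes):
        values = {wiring[drive].get(attribute) for drive in erroring}
        if len(values) != 1:
            continue  # erroring drives differ here, so this is not a common factor
        value = values.pop()
        healthy = sorted(
            drive
            for drive, info in wiring.items()
            if drive not in erroring and info.get(attribute) == value
        )
        result[attribute] = (value, healthy)
    return result


if __name__ == "__main__":
    for attribute, (value, healthy) in common_denominators(WIRING, ERRORING).items():
        print(f"{attribute} = {value}; healthy drives sharing it: {healthy or 'none'}")
```

With only two erroring drives a shared attribute can easily be coincidence, so a match is a hint about what to swap first, not proof of the cause.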
October 19, 2017 Community Expert

2 minutes ago, kno said: I have a theory that this might be a contributing factor that could explain some of the problems I am experiencing.

Definitely possible.

5 minutes ago, kno said: Could this also explain why I have had several drive failures? Can bad/low quality power cause pending sectors or bad sectors?

Not so likely, but also possible. In certain situations a power failure/cut can leave a bad sector. It was very common with Windows XP after a power failure: I must have repaired more than 100 PCs with a disk getting a single bad sector after that happened. Windows XP would crash and restart on boot, and the first question I asked was: did it happen after a power failure/unclean shutdown?
October 23, 2017 Author

On 19.10.2017 at 10:45 PM, johnnie.black said: Did it happen after a power failure/unclean shutdown?

This is hard to say. There was one power failure recently, but I cannot say for sure if it is connected.

I ordered a new 850W EVGA PSU that, based on the reviews, should be able to deliver ample stable power. I am still waiting for the new parts to arrive.
October 23, 2017 Author

BTW: Would there be any advantage for me in starting with a new configuration, once I have successfully rebuilt my failed drive?

My Unraid has been upgraded since version 4 (unsure of the exact version). I am thinking that starting with a new config might have some advantages, such as a new file system format? There might be other advantages as well?
October 23, 2017 Community Expert

A new config won't change anything on the existing data disks, so I don't see any advantage in your case, but you may want to convert all disks to xfs after this problem is resolved.
October 27, 2017 Author

My LSI card has arrived, but I am still waiting for the break-out cables. They will take another couple of days.

Can I boot my existing Unraid installation without having any HDDs attached? My drives are completely disconnected. Is it OK to boot, or will it just mess things up? I just want to see if it boots up OK with the card installed.
October 27, 2017 Author

Perhaps there is a simple answer to this:

I booted up the system. I have a screen attached to the Unraid server, along with a wireless mouse and keyboard. On the screen I see:

Tower login: _

However, I am not able to access the Unraid web GUI from another computer. I have tried with two different ones.

I think I have noticed this earlier, but I might be wrong. Is there something that causes the web GUI to be deactivated if a monitor/keyboard is attached to the Unraid server?
October 27, 2017 Community Expert

32 minutes ago, kno said: Is there something that causes the web GUI to be deactivated if a monitor/keyboard is attached to the Unraid server?

No. Did you try the GUI boot mode?