thespooler Posted February 17, 2017

So I rebooted to finish installing 6.3.1 3 days ago. I did some Docker updates before the reboot. Today, I logged into the web GUI to update a Docker and it told me there was a read error on a drive. Sure enough, I see one of my hard drives has an X on it, and 3 read errors. Okay.

I acknowledged the notifications, but the GUI is clearly struggling. Every page wants me to install the Preclear Tor crap, which is annoying. The plugin page isn't even showing Preclear so that I can uninstall the thing. It only shows unRAID and the web GUI installed. The Main tab isn't doing anything. There are no Dockers, and no Docker tab. Checking the syslog makes the browser unresponsive, but I can see this in gray:

Feb 14 04:40:01 Tera liblogging-stdlog: [origin software=rsyslogd" swVersion="8.23.0" x-pid="1426" x-info="http://www.rsyslog.com] rsyslogd was HUPed
Feb 14 04:40:01 Tera root:
Feb 14 04:40:01 Tera root: Warning: file_put_contents(/boot/config/docker.cfg): failed to open stream: No such file or directory in /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/dockerconfig.php on line 35
Feb 14 04:40:01 Tera kernel: fat__get_entry: 182 callbacks suppressed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8192) failed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8193) failed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8194) failed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8195) failed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8196) failed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8197) failed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8198) failed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8199) failed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8200) failed
Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8201) failed
Feb 14 04:40:02 Tera root:
Feb 14 04:40:02 Tera root: Warning: file_put_contents(/boot/config/domain.cfg): failed to open stream: No such file or directory in /usr/local/emhttp/plugins/dynamix.vm.manager/scripts/libvirtconfig.php on line 39

I'm assuming sda1 is my USB flash drive, as the /boot directory is empty in MC. How did the system even boot?! It must have died after it booted?! If so, this would be the 3rd USB flash drive I will have burned through since I started using unRAID last year. These issues are always discovered after an upgrade, probably because the server gets rebooted so infrequently.

I plugged the USB flash drive into my Windows machine and it lights up and seems fine. I was able to copy all the files, and Windows sees nothing wrong with the FAT. Windows can open /config/docker.cfg no problem. But when I plug it back into my unRAID box, it doesn't light up in any USB port. I'm reluctant to reboot, as my files are being served and my Dockers actually are running right now.

One thing I did think was strange with 6.3.1 was that after it booted, I could see my USB flash drive listed under Boot Device as sda (which confirms my suspicion that it is my boot drive that is no longer readable), but also listed under Unassigned Devices as sdg.

Boot Device
Device Identification Temp. Reads Writes Errors FS Size Used Free View
Flash USB_Flash_Drive - 16.0 GB (sda) * 443,036 411,842 0 vfat 16.0 GB 304 MB 15.7 GB

Unassigned Devices
Device Identification Temp FS Size Open files Used Free Auto mount Share Script Log Script
sdg USB_Flash_Drive_AA437O23EKVFSJJYFSVA Mount * vfat 16.0 GB - - -

The main screen seems to cache the Boot Device, as even if I unplug it, it's always there. System Devices shows it missing, and when I plug it in, it appears as sdg. I tried to mount it with Unassigned Devices to see if the files were there now, but the page just refreshes and shows the Mount option again, never getting past it.
RobJ Posted February 17, 2017

Without your diagnostics, I'm just guessing, but it clearly looks like the USB drive is being dropped and reconnected. When it reconnects and is recognized later, it's given the device name sdg, which will never be associated with /boot. The only way to fix that is to reboot. All of the unRAID configuration persistence depends on /boot, so yes, the system will misbehave badly.

The critical thing is to find out why the USB connection is so unreliable. You might try another port.

Please see Need help? Read me first!, and attach the diagnostics zip.
thespooler Posted March 10, 2017

Thanks for your reply. I've been through a lot since I originally posted this in an effort to resolve it.

In any case, I have my replacement drive. It's precleared. I popped it in hoping it would be picked up through hot plug-in support, but it wasn't.

To sum up where I was: the /boot drive had dropped, and disk 2 was being emulated and listed a few write errors, but the array itself was still serving files and Dockers were running. I ultimately remounted /boot from sdh1, which was where the flash drive had shown up after I checked it on a Win10 PC. I was hoping that would allow me to save the config and rebuild, but since the hot plug-in didn't work, it didn't matter.

I restarted today and the /boot drive was back. Previously, drive 2 couldn't show SMART info while it was being emulated. I thought this was normal, but I see now that when unRAID came back up, the array was not started but SMART tests were available. So I started to wonder if disk 2 had dropped in the same way the /boot drive had. Maybe there wasn't anything wrong with it at all.

I left it running a SMART extended test and it seemed stuck at 10% for a long time. I clicked Main, and was a bit shocked to find all the drives now sitting in Unassigned Devices. I'm just using the motherboard SATA connections.

I didn't previously post the diagnostics because there wasn't much in them once /boot dropped days earlier. This one catches the drop at:

Mar 10 15:55:41 Tera kernel: usb 4-4: USB disconnect, device number 2

I'll remove and reseat all the cables next.

tera-diagnostics-20170310-1602.zip
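Drops like the one in that post can be located in a saved syslog without loading it in the browser. The following is a minimal Python sketch, not part of unRAID: the two patterns are copied from the kernel lines quoted in this thread, and the function name summarize_drops is made up for illustration. Other kernels or drivers may word these messages differently.

```python
import re

# Patterns modeled on the kernel lines quoted in this thread (an assumption:
# other kernel versions may phrase these messages differently).
USB_DISCONNECT = re.compile(
    r"kernel: usb (?P<port>[\d.-]+): USB disconnect, device number (?P<num>\d+)"
)
FAT_READ_ERROR = re.compile(
    r"kernel: FAT-fs \((?P<dev>\w+)\): Directory bread\(block (?P<block>\d+)\) failed"
)


def summarize_drops(log_text):
    """Return (USB ports that disconnected, FAT read-failure count per device)."""
    disconnects = []
    fat_errors = {}
    for line in log_text.splitlines():
        match = USB_DISCONNECT.search(line)
        if match:
            disconnects.append(match.group("port"))
            continue
        match = FAT_READ_ERROR.search(line)
        if match:
            dev = match.group("dev")
            fat_errors[dev] = fat_errors.get(dev, 0) + 1
    return disconnects, fat_errors


if __name__ == "__main__":
    # Sample lines taken from the posts above.
    sample = (
        "Mar 10 15:55:41 Tera kernel: usb 4-4: USB disconnect, device number 2\n"
        "Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8192) failed\n"
        "Feb 14 04:40:01 Tera kernel: FAT-fs (sda1): Directory bread(block 8193) failed\n"
    )
    print(summarize_drops(sample))
```

To use it on a real server, pass in the text of the syslog file from the diagnostics zip instead of the sample. A USB disconnect followed by FAT-fs read failures on the boot partition would be consistent with the flash drive dropping rather than wearing out, though it does not prove it.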
trurl Posted March 10, 2017

7 minutes ago, thespooler said:
"I clicked Main, and was a bit shocked to find all the drives now sitting in Unassigned Devices."

Beginning to think you have power issues. Is everything connected properly? What is the exact model of your power supply?
thespooler Posted March 10, 2017

Corsair CX430M. This is an i3 setup. Nothing major. The server ran in its original state for many months, with I think 2 flash drives needing to be replaced during that time. Adding this new drive today was the only other physical change.
RobJ Posted March 10, 2017

Once you lost the initial access to the boot drive, and therefore lost /boot, there was no way to save any reconfiguration or drive assignments. Drive assignments are in super.dat in the config folder of the boot drive. /boot is tied at boot time to a USB device with a FAT file system containing a volume label of UNRAID, in your case /dev/sda. Before the array was started, the drive was dropped, and all assignments were lost at that time. It was quickly found again and assigned to /dev/sdh, but was no longer seen at /boot.

I *think* the port you used was a USB 3.0 port, and we have seen a number of reports of flaky behavior with those under Linux. Try a USB 2.0 port; that has worked for many. I suspect that the USB 3.0 Linux driver support still needs maturing.
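The situation RobJ describes, where /boot is still tied to the old device name while the stick has come back under a new one, can be checked from the kernel's own tables. This is a rough Python sketch under stated assumptions: it takes the text of /proc/mounts and /proc/partitions as input, the helper names device_mounted_at and kernel_still_sees are invented for this example, and it assumes the stale mount entry is still listed after the device drops.

```python
def device_mounted_at(mounts_text, mount_point="/boot"):
    """Return the device mounted at mount_point per /proc/mounts text, or None."""
    device = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mount_point:
            device = fields[0]  # a later entry for the same path overrides an earlier one
    return device


def kernel_still_sees(device, partitions_text):
    """True if the device's short name (e.g. 'sda1') is listed in /proc/partitions text."""
    name = device.rsplit("/", 1)[-1]
    names = set()
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data rows look like: major minor #blocks name
        if len(fields) == 4 and fields[0].isdigit():
            names.add(fields[3])
    return name in names


if __name__ == "__main__":
    # Illustrative input only: the block counts are made up, the device names
    # mirror the ones mentioned in this thread.
    mounts = "rootfs / rootfs rw 0 0\n/dev/sda1 /boot vfat rw 0 0\n"
    partitions = (
        "major minor  #blocks  name\n\n"
        "   8       96   15630336 sdg\n"
        "   8       97   15630320 sdg1\n"
    )
    boot_dev = device_mounted_at(mounts)
    print(boot_dev, kernel_still_sees(boot_dev, partitions))
```

On a live server you would read the two files and pass their contents in. If /boot points at a device the kernel no longer lists, configuration writes to /boot will fail until a reboot re-establishes the mount, which matches the file_put_contents warnings in the first post.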
thespooler Posted March 10, 2017

23 minutes ago, RobJ said:
"Once you lost the initial access to the boot drive, and therefore lost /boot, there was no way to save any reconfiguration or drive assignments. Drive assignments are in super.dat in the config folder of the boot drive. /boot is tied at boot time to a USB device with a FAT file system containing a volume label of UNRAID, in your case /dev/sda. Before the array was started, the drive was dropped, and all assignments lost at that time. It was quickly found again, assigned to /dev/sdh, but not seen any longer at /boot."

Sorry for the confusion. These are two unrelated events.

Pre-reboot, I remounted /boot to sdh1 in case I modified the drives. The mounting worked and the errors went away, but I ultimately didn't do anything since the new SATA device wasn't detected.

After the reboot, everything was detected and in its place. All drives were assigned, and disk 2 was still red x'ed, but now SMART was working. Later, if you look at the system log, when the USB was dropped, so were all the drives. That's when everything showed up under Unassigned Devices.

In the original scenario, only the flash and disk 2 dropped. This time, after adding a new drive to the system, they all did. I think the configuration is good, but I haven't rebooted yet. Going to redo the cables first.
thespooler Posted March 11, 2017

I've disconnected and reconnected the cables and moved the flash drive to a USB2 port. I also plugged my spare precleared drive into a PCIe SATA card I wasn't using, just to see if it persists if the others drop again.

unRAID has booted and everything was assigned correctly without me doing anything, but the array is stopped and drive 2 is still red x'd. Since drives are dropping, I'm not sure how concerned I should be with the array.

How do I get disk 2 back? Remove it, start the array, stop the array, and reassign the original disk 2 back to disk 2? Will that cause a rebuild of that drive? I'm a little nervous about what happens if that drive or other drives start dropping during that process.

Edited March 11, 2017 by thespooler (Clarity)
RobJ Posted March 11, 2017

The diagnostics you posted only showed the drop of the boot drive, a USB connection. No other drive was dropped. None of the other drives were assigned because there was no super.dat available, so they would have appeared as unassigned drives.

We would need to see the errors that occur when Disk 2 is dropped in order to know what is happening to it. It would not be the same thing as the boot drive dropping, because the array drives are SATA connected, which is very different from being USB connected.

In general, without knowing why Disk 2 was dropped, your steps look correct: unassign Disk 2, start and stop the array, reassign Disk 2, and restart the array to rebuild it.
thespooler Posted March 12, 2017

21 hours ago, RobJ said:
"The diagnostics you posted only showed the drop of the boot drive, a USB connection. No other drive was dropped. None of the other drives were assigned because there was no super.dat available, so they would have appeared as unassigned drives."

Ah, interesting! Previously, when the flash dropped, all the other drives stayed assigned. But I guess that's the difference between the flash dropping after the array has started and the flash dropping before the array is started.

I've rebuilt drive 2 and so far so good. I didn't realize dropping of drives was an issue. I would have saved myself $500 in new hardware had I known. I just assumed it was a definite hard drive failure.

Thanks for your assistance.
RobJ Posted March 12, 2017

That's been a common problem here. We often have a very simplistic view of what reading and writing to a drive involves, and tend to immediately blame the drive if any errors occur. But there are many more components involved, any of which can cause issues or fail. Drives and drive controllers are computers with their own software, typically called firmware, that can crash, can be buggy, can need updates, and can have power issues. Then there are the other components like cables (data and power), cable connectors, power splitters, ports (both on drives and on controllers), plus the actual software in the server, like the app involved, the file system software, the buses and their management, and the various drivers involved at higher and lower levels.

That's why many of us don't recommend hot swapping, because only a percentage of what appears to be a drive failure actually *is* a drive failure. We would rather check the logs and SMART reports first, then proceed accordingly. I don't know what the percentage is, but it's my opinion that it's more often NOT the fault of the drive than it is. Just wildly guessing, maybe 30% of the time it's a true drive fault? We have seen many cases where a drive suddenly disappeared, but analysis determined the drive was perfect and the controller had crashed or malfunctioned, or a loose power or data cable had slipped off, or there was a power issue, etc.
trurl Posted March 12, 2017

1 hour ago, RobJ said:
"Just wildly guessing, maybe 30% of the time it's a true drive fault?"

My guess would be less than that. We often get reports on the forum when somebody has actually been inside the case for some reason. And more experienced users don't even bother to report since they know what happens and know what to do, such as on my backup server recently.