Fatal bug in v 5.0 RC 5

July 18, 2012

I've discovered a pretty serious bug here in v 5.0 RC 5 that is totally crashing unRAID. By 'crashing', I mean the console becomes entirely unresponsive, telnet ceases to work, the web page stops working, and the 'shares' disappear from the network. The system just completely freezes and the only way to get unRAID back is to reboot by cycling the power switch. After reboot, the system comes back on-line and appears to operate fine as long as I'm only READING from the array. However, any attempt to WRITE to the array will recreate the problem. I'm going to try and be very thorough in documenting this because I THINK this bug I've discovered MAY be related to the "slow down" problem I keep seeing people discussing in this forum.

First, a little about my baseline test configuration. I'm running a P45 Neo3 (MS-7514) motherboard with 4 GB RAM and using the 8 onboard SATA II ports. The first 6 onboard SATA ports use the Intel ICH10/ICH10R controller and the remaining 2 use the JMicron JMB363 controller. I'm currently only using the first 6 SATA ports (the Intel controller), connected to six 3 TB drives. Two of the HDs are WD30EZRXs and the other four are ST3000DM001s. The BIOS is configured for AHCI. The power supply is a Tagan ITZ Series 750 watt with a single 12 volt rail.

For testing purposes, I kept the software install as "clean" as I possibly could. Only unRAID v 5.0 RC 5 (Pro version) and the preclear script are installed. (To avoid introducing yet another variable, I did NOT even install unmenu, opting to do everything from the console.) During the initial install, I allowed the RAM test to run for 24 hours straight with no problems noted. Also during the initial install, I ran the preclear script with the "-A" option three times on each of five of the 3 TB drives. No problems were noted with the HDs during preclear. I used these five HDs (one WD30EZRX and four ST3000DM001s) to build my initial array. All five drives formatted quickly, the parity disk did its initial sync, and the array came right up looking absolutely rock solid. I then set up two shares, one for music and one for video, and verified that I could see the shares over the network. Again, no problems were noted. (However, as a side note, unlike v 4.7, v 5.0 RC 5 did not appear to let me turn off the Disk1 through Disk4 shares... Not a problem.)

At this point, I mapped the two unRAID shares on a standalone Windows 7 machine and began to load the array with data over my internal gigabit network. I observed the initial transfer rate between the Windows 7 machine and the unRAID system to be a very consistent 90 MBPs. Over the next several days, mostly out of curiosity, I used the web management interface to occasionally watch how the disks were being populated with data. I watched the first disk get filled to the high water mark (halfway) and then cascade over to the next disk to begin filling it to the high water mark, and so forth for all four data disks. It's important to note that throughout this entire time (several days), while the disks filled to the halfway point, the transfer rate over the network remained around 90 MBPs.

However, when the last disk hit the high water mark (halfway) and unRAID cascaded back over to the first disk to resume filling it, I noted a significant reduction in the transfer rate between the two machines. The transfer rate over the network suddenly dropped to about half of what I'd previously seen - around 45 MBPs. Over the next several days, while data continued to transfer, I watched with the web interface as the disks continued to fill to the new high water mark (halfway of the remaining half) and cascade over to each successive disk. Throughout this phase, the network transfer rate remained consistently around 45 MBPs. When the last disk filled to the new high water mark and unRAID cascaded over to resume filling the first, the network transfer rate again suddenly dropped to about half of what it had previously been (around 20 MBPs). With great regularity over the next several days, I observed this pattern of the network transfer rate dropping by half each time the new high water mark was reached on all four disks, until the system actually got into a fractional BPs transfer rate!

Finally, the Windows 7 machine popped up a message saying that "There was a problem transferring data...". At this point I discovered that the unRAID system had frozen up and disappeared off the face of the earth. I couldn't telnet into it, bring up the web page, or even ping the tower. When I took a look at the console on the tower, it had puked up all sorts of trash that looked like snippets of procedure or function calls. (WebGUI() this and that and similar looking stuff...) The console was entirely unresponsive and the only thing I could do was cycle the power. The system rebooted and everything looked fine - as long as I was only READING from the array. Playback of movies in native uncompressed BluRay format with sustained data rates of 50 MBPs over the network works fine, but any attempt to WRITE to the array initially shows extremely SLOW data transfer rates and then freezes up - the ONLY way to get the system back is to cycle the BRS (big red switch).
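
For anyone trying to reproduce the fill pattern described above, here is a minimal Python sketch of a simplified high-water allocation rule: write to the lowest-numbered disk whose free space is still above the mark, and halve the mark once no disk qualifies. The function names, the 3000/100 sizes, and the fixed chunk size are illustrative assumptions, not unRAID's actual code, and the sketch says nothing about why the transfer rate dropped.

```python
def pick_disk(free, mark):
    """Return (disk index, mark) for the next write, or (None, 0) if nothing fits.

    Simplified rule (an assumption for illustration): use the lowest-numbered
    disk with free space above the mark; if none qualifies, halve the mark.
    """
    while mark > 0:
        for index, free_space in enumerate(free):
            if free_space > mark:
                return index, mark
        mark //= 2  # every disk is at or below the mark, so lower it
    return None, 0


def simulate(disk_size, n_disks, chunk):
    """Write fixed-size chunks and record each (disk, mark) switch in order."""
    free = [disk_size] * n_disks
    mark = disk_size // 2  # first high water mark: half of a disk
    switches = []
    while True:
        disk, mark = pick_disk(free, mark)
        if disk is None:
            break
        free[disk] -= chunk
        if not switches or switches[-1] != (disk, mark):
            switches.append((disk, mark))
    return switches


# Four equal data disks, sizes in arbitrary units (hypothetical numbers).
for disk, mark in simulate(3000, 4, 100):
    print(f"writing to disk{disk + 1} while free space > {mark}")
```

Each pass over the four disks corresponds to one "cascade" in the description above, and each halving of the mark corresponds to the point where the transfer rate was seen to drop.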

Just to see if this had something to do with unhandled error exceptions regarding the amount of disk space, I prepped another 3 TB drive (running preclear 3 times) and added it to the existing array. The new disk came right up, formatted and it now shows 3 TB open in the array. Same as before - The system works fine for playback but freezes up if any attempt is made to write data to the array.

SysLog after last reboot is attached but I'm not sure it'll be much help. I guess I could start a SysLog tail on the console and HOPE it doesn't get wiped out when the system bombs, but short of that I don't know where to go from here. Any suggestions or comments on what I can do to help get to the bottom of this would be greatly appreciated.

Steve

syslog.txt

July 18, 2012

Please clarify the units. Megabytes is written MB. Megabits is written Mb. Upper case B is bytes and lowercase b is bits.

July 18, 2012

Author

Sorry about the mix-up on units - What I was using was the speed shown in the Windows 7 transfer window, and it's indicated in "MB/Second".
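
To put those numbers next to the link speed, here is a small Python sketch of the byte/bit conversion (8 bits per byte, decimal prefixes). The function names are made up for illustration, and the 1000 Mb/s figure is the nominal rate of a gigabit link, ignoring protocol overhead.

```python
BITS_PER_BYTE = 8


def mbytes_to_mbits(mbytes_per_second):
    """Convert MB/s (megabytes) to Mb/s (megabits)."""
    return mbytes_per_second * BITS_PER_BYTE


def mbits_to_mbytes(mbits_per_second):
    """Convert Mb/s (megabits) to MB/s (megabytes)."""
    return mbits_per_second / BITS_PER_BYTE


# Rates reported in the first post, as shown by the Windows 7 transfer window.
for rate in (90, 45, 20):
    print(f"{rate} MB/s = {mbytes_to_mbits(rate)} Mb/s")

# Nominal ceiling of a gigabit link, before any protocol overhead.
print(f"1000 Mb/s = {mbits_to_mbytes(1000)} MB/s")
```

So 90 MB/s is 720 Mb/s, which is plausible on a gigabit link whose nominal ceiling is 125 MB/s.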

July 18, 2012

If you are going to do a console tail, use PuTTY so you don't lose the log.

It may not give you enough information anyway. Your testing is quite thorough, though, and I would be curious to see if anyone else can replicate this fault.

July 18, 2012

While I haven't spent the time to track the issue, my server too has become locked in a state as described by the OP. The only way to recover is a series of reboots.

Reboot #1 - Once the array is manually started, it scans all the journal entries for the disks, mounts them, and initiates a parity check. Only the first share appears. The other shares never show up.

Reboot #2 - Since shares are missing, another reboot. Again, the array refuses to automatically start, and I have to start it manually again. This time, all shares appear.

At first I assumed it was the syslog overload problem of consuming too much memory. But this last time I checked the syslog size, and that is not the case. I am using RC 3, and this has only started happening since I migrated from beta14 to RC 3. I haven't started a topic because I haven't taken the time to actually dump syslog information and do thorough testing without plugins.

July 18, 2012

We cannot support old release candidates. Please update to the latest RC.

July 18, 2012

Lol... wasn't asking for support. I was simply indicating that what the OP was discussing exists in other versions. Unless specifically addressed in release notes, I tend to assume it's still in the wild.

July 18, 2012

... Please update to the latest RC.

This is actually a very valid point. There are some significant changes between RC5 & RC6, including reverting back to a newer kernel.

July 19, 2012

... Please update to the latest RC.

This is actually a very valid point. There are some significant changes between RC5 & RC6, including reverting back to a newer kernel.

Unfortunately some users cannot upgrade to RC6, as it breaks LSI cards.

July 20, 2012

Author

Well, I tried running a tail on SysLog to the console and then writing to the array but when v 5.0 RC 5 bombs it barfs all over the screen so nothing usable. After rebooting, I tried opening a tail on SysLog with redirection to a file on the flash drive and then writing to the array - Captured nothing in the file but the ethernet exchanges related to the attempt to write to the array - No errors were noted in SysLog leading up to the crash.
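
For what it's worth, one reason a redirected tail can come up short after a hard freeze is that lines may still be sitting in a write buffer when the box locks up. Below is a minimal Python sketch of a follower that forces every batch of new lines to disk before waiting for more. It assumes Python is available on the machine doing the logging, which is not a given on a stock install, and the function names are hypothetical.

```python
import os
import time


def copy_new_lines(src, dst):
    """Copy lines not yet read from src to dst, then force them to disk.

    Returns the number of lines copied.
    """
    copied = 0
    while True:
        line = src.readline()
        if not line:  # reached the current end of the log
            break
        dst.write(line)
        copied += 1
    if copied:
        dst.flush()
        os.fsync(dst.fileno())  # ask the OS to commit the data now
    return copied


def follow(src_path, dst_path, poll_seconds=0.5):
    """Follow src_path forever, appending new lines to dst_path."""
    with open(src_path, "r", errors="replace") as src, open(dst_path, "a") as dst:
        while True:
            if copy_new_lines(src, dst) == 0:
                time.sleep(poll_seconds)


# Example call (both paths are assumptions about where the log and the
# flash drive live on the system):
# follow("/var/log/syslog", "/boot/syslog-follow.txt")
```

If the kernel itself stops logging before the freeze, this will not capture anything more than the plain tail did. It only narrows the window in which already-logged lines can be lost.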

Unable to WRITE data to the array, I went back to revisit what seemed to work - READING data from the array. (No science here - just attempting to play back movies that were on the array...) This time I discovered something that I had missed earlier. Some movies worked and some didn't. In fact, a lot didn't work. I ran a few spot checks doing a binary compare of the data that ended up on the array against the original source files. Wow. Not one problem was noted when I had initially uploaded the data to the array, and yet the binary compare showed that in many cases what had ended up on the array was corrupted. The directory and file names on the array all seemed OK, but the actual data in the files (especially the larger files, i.e. > 1 GB) was in many cases just total trash. The smaller files (< 1 GB) all seemed OK. I brought up the web interface to see if unRAID had reported any parity errors or showed anything abnormal. Everything looked fine and not even one parity error had been reported by unRAID. I spent several hours looking through the data trying to discover a pattern that might explain why some files came across fine and others didn't. Nothing jumped out at me as being obvious, but from my preliminary examination of the data, it does appear that the corruption possibly started to occur after I'd hit the first high water mark on each hard drive. However, I would have to run another test under more controlled circumstances to verify this.
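
In case anyone wants to run the same kind of spot check, a byte-for-byte compare can be sketched in a few lines of Python. This is not the tool used for the checks above, just a minimal illustration. The function name and the 1 MiB chunk size are arbitrary choices.

```python
def first_difference(path_a, path_b, chunk_size=1024 * 1024):
    """Return the byte offset of the first difference, or None if identical.

    If one file is a strict prefix of the other, the offset returned is the
    length of the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                # Locate the exact differing byte inside this chunk.
                for i, (byte_a, byte_b) in enumerate(zip(chunk_a, chunk_b)):
                    if byte_a != byte_b:
                        return offset + i
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:  # both files ended at the same point
                return None
            offset += len(chunk_a)
```

Knowing the offset of the first bad byte in each corrupted file is one way to look for a pattern, for example whether the damage always starts at a similar position or only in files written after a certain point.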

Not getting anywhere with v 5.0 RC 5, so I think it's time to punt.

I agree with you, aiden - Looking back through the forum threads, I think this gremlin has been around a while. (I just got "lucky" in that my very first test was to try and push unRAID's array to saturation to see how it behaved.) I did notice, as chickensoup has also pointed out, that Tom released an RC 6 r8168 version of the code that also contains an earlier version of the kernel. However, I also noticed that Tom said he was keeping RC 6 r8168 out of the regular beta release channels because it was intended to test a specific driver. Since I didn't use that specific driver, it made little sense for me to install RC 6 r8168 because I couldn't contribute anything to the testing of that driver. But at this point, I'm dead in the water with RC 5, so it does make sense to at least try RC 6 r8168, since this gremlin appears to have started rearing its head around the same time the kernel was updated to the newer version.

Yesterday, I copied bzroot and bzimage from the RC 6 r8168 distro to my flash drive (no security so that's all I need) and rebooted unRAID. RC 6 r8168 came right up and indicated that a parity check had to be run because an abnormal shutdown had been detected. (Duh!) I fully expected the worst with all of the abnormal shutdowns (at least 20) and the massive amounts of corrupted data I'd previously discovered in the array. But five hours later unRAID came up indicating not even one parity error had been discovered!

I opened a Windows Explorer session on the Windows 7 machine and grabbed the same file folder I'd been using to religiously make RC 5 crash before and dropped it on the unRAID share. No problem! Using the Web interface to examine which HD unRAID had written the folder and files to I discovered they had been written to the new 3 TB drive that I could not access under RC 5. I did a binary compare of the source and destination files and noted no corruption.

Rather than wipe the 15 TB array and start from scratch to reload the data files, I decided to try and just let unRAID overwrite the corrupted files. Also, this time, instead of using Windows Explorer to transfer the files, I decided to use TeraCopy because it will transfer the files and then do a full read back verification of the data in the array. That 3 TB transfer with read back verification was started last night, and I'm about halfway through the data transfer portion with absolutely no problems. I expect the read back verification to start sometime this evening and to take about 12 hours.

I will keep everyone posted on the outcome.
