August 14, 2013

So in addition to the "transport endpoint not connected" error from last week (which, by the way, I thought would have garnered >0 interest, but hey!), I have also experienced a complete freeze of every interface (telnet, IPMI console, unmenu, emhttp) during a rebuild and preclear, which needed a hard reboot. I finally had that sorted last night, and I woke up today to a red ball on disk 8. It may be just a cable problem or something, but now I can't get the swapfile to disable so I can stop the array.

I've tried unmenu's user script, the rc.unraid_swapfile option, swapoff -a, and swapoff -av. I have reached the limit of my knowledge. Is kswapd the daemon for the swapfile? Can I just kill that? *frustrated*

Syslog NOT attached, because I can't get to the box to get the syslog off. No samba, no ftp. I can't even ftp backward from unraid to my macbook.

Perhaps someone here can assist? Thanks!
Pengrus
August 14, 2013

Copy the syslog to the flash using the attached console. Power down and move the flash to your PC. Attach the syslog to a post. See my sig to disable all add-ons. Then put the flash back in the server and turn it on.
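For anyone landing here later, the first step above can be sketched as a small shell function to run from the console. This is only a sketch under assumptions: the default paths (/var/log/syslog for the live log and /boot for the flash drive) are the usual unRAID locations but are not stated in this thread, and the name save_syslog is made up for illustration.

```shell
#!/bin/sh
# save_syslog [SRC] [DEST_DIR]
# Copy a log file into DEST_DIR under a timestamped name and print the new path.
# Hypothetical helper: both defaults are assumed locations, adjust as needed.
save_syslog() {
    src=${1:-/var/log/syslog}
    dest_dir=${2:-/boot}
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "no such directory: $dest_dir" >&2; return 1; }
    out="$dest_dir/syslog-$(date +%Y%m%d-%H%M%S).txt"
    # sync flushes pending writes so the copy is on the flash before it is pulled
    cp "$src" "$out" && sync && echo "$out"
}
```

Calling `save_syslog` with no arguments uses the assumed defaults; the timestamp in the name keeps an earlier copy from being overwritten if you grab the log more than once.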
August 14, 2013 (Author)

I did copy the syslog to flash, but I couldn't power down cleanly without the swapfile unmounting. I discovered, largely by trial and error, that the "umount -l" option would let me unmount the cache drive without actually disabling the swapfile. Or something very close to that.

I'm attaching two syslogs here. It seems like disk8 and the cache drive had a problem at the same time (?), and the cache was remounted read-only, which is why I couldn't remove the swapfile. I powered down and am rebuilding disk8 now, but when I tried to reestablish my cache apps, it threw another error and mounted the cache read-only again. So I've disabled the trinity from the cache drive and will wait until disk8 is rebuilt and parity checked before resuming testing with the cache drive. It's been a workhorse for me, 22k+ power-on hours, so it might just be time to take her back behind the shed, so to speak. But I would like to know what's wrong with the drive, hence the second syslog.

I'm familiar with the no-addons process for the betas/RCs. I never upgrade without using my stock go file and verifying nothing breaks as I add everything back in piecemeal. The upgrade to RC16c was no exception. But as I understand it, the transport endpoint problem is an error that happens with very little warning, and it hit me even though I don't use Plex. I'm not going to run without sab/sick/cp for a month just to verify I'm not going to have the issue, especially since that wouldn't even guarantee I won't!

I have no syslog from the freeze period, as the entire interface was frozen. From a stress-testing point of view, I was rebuilding parity, preclearing a 3TB, and moving a significant amount of files to load-balance my drives when everything went sideways. If I had to guess, I ran completely out of memory, but I have no way to tell.

That said, please take a look at the logs. I'm certain others will see things I can't, or certainly don't understand. Specifically, these "Tower kernel: REISERFS error (device sdl1): vs-5150 search_by_key: invalid format found in block 3054983. Fsck? (Errors)" messages are puzzling...

Thanks much for your help!
P

p.s. From a stress-testing point of view, should the system be able to rebuild a disk and preclear another at the same time?

Archive.zip
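The author's theory is that swapoff failed because the drive holding the swapfile had been remounted read-only. A quick way to check both halves of that is to look at which swap areas the kernel still has active and whether the mount point carries the ro option. The sketch below reads /proc/swaps and /proc/mounts; the function names are made up for illustration, and /mnt/cache in the usage note is an assumed mount point for the cache drive.

```shell
#!/bin/sh
# active_swaps [FILE]
# Print the path of every active swap area (first column, header skipped).
active_swaps() {
    awk 'NR > 1 { print $1 }' "${1:-/proc/swaps}"
}

# is_readonly MOUNTPOINT [FILE]
# Succeed (exit 0) if the most recent mount at MOUNTPOINT has the "ro" option.
is_readonly() {
    awk -v mp="$1" '
        $2 == mp {
            ro = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
        }
        END { exit (ro ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}
```

For example, `active_swaps` lists what `swapoff` would still need to release, and `is_readonly /mnt/cache && echo "cache is read-only"` reports the mount state without touching the drive.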
August 14, 2013

Could it be that this is not an unRAID RC16c problem, and that the problem is in fact caused by your hardware/disk(s)? You leave us guessing without a system configuration or a proper syslog.

What swapfile are you talking about? I never used/configured one. Maybe the problem lies there?
August 14, 2013

> Specifically, these "Tower kernel: REISERFS error (device sdl1): vs-5150 search_by_key: invalid format found in block 3054983. Fsck? (Errors)" are puzzling...

See Check Disk Filesystems in my sig.

> p.s. From a stress-testing point-of-view, should the system be able to rebuild a disk and preclear another at the same time?

Yes, a stock system should have no problem doing this. What model PSU?
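Before running a filesystem check, it helps to know which devices the kernel actually complained about. A minimal sketch, assuming the syslog lines look like the REISERFS error quoted above and that the log was saved one message per line; the function name and the default log path are assumptions, not something from this thread.

```shell
#!/bin/sh
# reiserfs_error_devices [LOGFILE]
# Print each device named in a "REISERFS error (device X)" line, once.
reiserfs_error_devices() {
    sed -n 's/.*REISERFS error (device \([^)]*\)).*/\1/p' "${1:-/var/log/syslog}" | sort -u
}
```

Each name printed is a candidate for a read-only `reiserfsck --check` on the matching device node, run while that filesystem is not mounted.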
August 14, 2013 (Author)

Good point on verifying the filesystem. I definitely should have done that last night, it would have saved me some time! The short SMART test took significantly longer than normal, about 8-10 minutes, and I started a long test right before turning in. I'll check that when I get off work and run the fsck(s).

Good news on the stress test issue. It's a Corsair AX-750. IIRC, it has a 61 or 62A rail, which I would think is sufficient for 16 drives. It certainly draws more wattage in its new case though, so maybe. I think, from some forensics last night, that the power cable may have come loose for that cage (I use norco ss-500s). Interestingly, the other four drives had no issues, but the failed drive never remounted, as if there had been an intermittent power loss during a write or something. I'm not sure if that was a mechanical issue, or whether the kernel disabled the drive.

@dikkiedirk - As to the system configuration, I just changed my signature, hopefully that populates properly. As far as proper syslogs are concerned, you have what I have. I am well aware that this may be a hardware or disk problem, and not an RC16c problem, but I have *never* seen the transport endpoint error before, including after 189 days of uptime on RC12a. It appeared to me to be an RC16c problem, and was certainly a previous RC problem, as was indicated in the release notes, and all of these other fun issues did not happen until after that upgrade.

The swapfile is, in my case, an 8GB file on my cache drive that allows the kernel to flex in case it runs out of normal RAM. It's automatically generated by an unraid script, but as far as I know, it's a configuration present in the underlying slackware/filesystem itself. I'll check on that.

I'm sure my original post was not clear, so to clarify: I was originally (in this thread anyway) asking for help to disable the swapfile and unmount the cache drive so I could get a syslog. The other info was just that, info. I don't expect anyone to diagnose a problem without at least a syslog!

As always unraiders, thanks for your help!
P
August 16, 2013 (Author)

OK, so I ran the fsck, and it told me to run with the rebuild-tree option. I have done that, and it appeared to complete successfully. But when I started the array again to check, virtually my entire cache drive was deleted. And there was no lost+found directory. That kinda sucks! Is that a normal consequence of this, or have I made a mistake somewhere?
P

Relevant messages follow:

Will rebuild the filesystem (/dev/sdl1) tree
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal: Done.
Reiserfs journal '/dev/sdl1' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Thu Aug 15 16:45:23 2013
###########
Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 28940074 blocks marked used
Skipping 9403 blocks (super block, journal, bitmaps) 28930671 blocks will be read
0%....20%....40%....60%....80%....100% left 0, 12979 /sec
27653 directory entries were hashed with "r5" hash.
"r5" hash is selected
Flushing..finished
Read blocks (but not data blocks) 28930671
Leaves among those 32877
Objectids found 27692
Pass 1 (will try to insert 32877 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....100% left 0, 117 /sec
Flushing..finished
32877 leaves read
32814 inserted
63 not inserted
####### Pass 2 #######
Pass 2:
0%....20%....40%....60%....80%....100% left 0, 0 /sec
Flushing..finished
Leaves inserted item by item 63
Pass 3 (semantic):
####### Pass 3 #########
...
S.cp(tt1753813)/farewell.my.queen.2012.limited.1080p.bluray.x264-geckos.mkvvpf-10680: The file [269977 270017] has the wrong block count in the StatData (16033648) - corrected to (1171952)
Flushing..finished
Files found: 26375
Directories found: 1280
Broken (of files/symlinks/others): 1
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Flushing..finishede 1, 0 /sec
Pass 4 - finished done 17613, 320 /sec
Deleted unreachable items 1835
Flushing..finished
Syncing..finished
###########
reiserfsck finished at Thu Aug 15 17:29:00 2013
###########
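The counters worth comparing in a run like this are buried in a lot of progress output. As a rough sketch, assuming the reiserfsck output was saved to a file with one message per line, the function below pulls out the handful of totals that appear in the log above. The function name and the key names it prints are made up for illustration.

```shell
#!/bin/sh
# fsck_summary [FILE...]
# Print selected totals from saved reiserfsck --rebuild-tree output
# (reads standard input when no file is given).
fsck_summary() {
    awk '
        /not inserted[ \t]*$/          { print "leaves_not_inserted=" $1 }
        /^Files found:/                { print "files=" $NF }
        /^Directories found:/          { print "dirs=" $NF }
        /^Deleted unreachable items/   { print "deleted_unreachable=" $NF }
    ' "$@"
}
```

Run against the log above, it would report the 63 leaves that were not inserted in Pass 1 along with the file, directory, and deleted-item totals, which makes it easier to compare two runs or to paste a short summary into a post.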
August 16, 2013

When I mistakenly tried to rebuild onto a full cache drive, I had to run --rebuild-tree on the cache drive as well. I copied it to a new drive of the same size I bought at BestBuy, using a dd command off the forums, then ran the --rebuild-tree command against that copy on the new drive. Worked great for me. I got back all but at most 100GB of files and only had 3 metadata files and 2 (mpg or ts) files in lost+found. The rest were in their original directories and were accessible.

Sorry to hear it didn't work for you.
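The reply above does not give the dd command that was used, so the following is only a guess at the general shape: a small wrapper that copies one device (or image file) to another block by block and refuses the obvious mistake of using the same path twice. The function name is made up, and `bs=1M conv=noerror,sync` is one common choice rather than anything confirmed in this thread (with it, a block that fails to read is padded with zeros instead of aborting the copy).

```shell
#!/bin/sh
# clone_device SRC DEST
# Raw copy of SRC onto DEST with dd. DEST IS OVERWRITTEN.
# Hypothetical helper; double-check both paths before running it on real disks.
clone_device() {
    [ "$#" -eq 2 ] || { echo "usage: clone_device SRC DEST" >&2; return 2; }
    [ "$1" != "$2" ] || { echo "source and destination are the same" >&2; return 1; }
    [ -e "$1" ] || { echo "no such source: $1" >&2; return 1; }
    # noerror: keep going after a read error; sync: pad short blocks with zeros
    dd if="$1" of="$2" bs=1M conv=noerror,sync
}
```

The point of working this way is that `reiserfsck --rebuild-tree` then runs against the copy, so the original drive is left untouched if the rebuild goes badly.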
August 16, 2013 (Author)

IT'S ALIVE!!! I spoke too soon, it seems. A reboot remounted the cache drive the proper way, and all my files seem to be in place. The lost+found directory is present, though nothing is in it at the moment.

I think I have reconstructed what happened: I had to manually unmount the cache drive to stop the array, as it was read-only with a swapfile. I then restarted in maintenance mode and ran the fsck sequence. But when I went to remount the drive, it did not mount properly, maybe due to the "-l" unmount option, or something else.

Hope if someone else has this problem, they find this thread early. Thanks to all for your help!
P
August 16, 2013

By any chance are you running Transmission and using the cache drive as the storage location? I had issues with Transmission when using the cache drive as my download location: after a day or two it would kick my cache drive into read-only mode, and none of my apps could download.

I have everything installed on my cache drive and everything uses the cache drive for downloads, EXCEPT Transmission. For that I had to use /mnt/disk1, which in fact was the drive that failed on me lately. I wonder if Transmission puts too much stress on the drive? Could there maybe be an update to Transmission to store small packets in memory until there's enough to write a big section to the drive? (An IDEA for the Transmission author...)