January 22, 2011
I upgraded from 4.5 to 4.6 a few weeks ago and had no problems, but last weekend I changed the spin-down delay from never to 15 minutes. After that, once the mover had run, I was unable to save files to the cache drive until I rebooted. I also just noticed that the mover did not actually move my files, as the cache drive still had 83 GB on it. I have attached part of the syslog; the forum would not accept the entire file because it was too large, so I trimmed it.
TowerSyslog.txt
January 22, 2011
Your cache drive has file-system corruption.

Jan 22 03:55:34 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jan 22 03:55:34 Tower kernel: REISERFS error (device sde1): vs-5150 search_by_key: invalid format found in block 91122236. Fsck?
Jan 22 03:55:34 Tower kernel: REISERFS (device sde1): Remounting filesystem read-only
Jan 22 03:55:34 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jan 22 03:55:34 Tower kernel: REISERFS error (device sde1): vs-5150 search_by_key: invalid format found in block 91122236. Fsck?
Jan 22 03:55:34 Tower kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jan 22 03:55:34 Tower kernel: REISERFS error (device sde1): vs-5150 search_by_key: invalid format found in block 91122236. Fsck?

It is being mounted "read only" (no files can be written to it or deleted from it).
You need to run reiserfsck on it:

umount /mnt/cache
reiserfsck /dev/sde1
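
A minimal sketch for spotting this kind of corruption in a saved syslog, assuming the log is available as a plain-text file. The function name `reiserfs_error_devices` is made up for illustration; it only matches the `REISERFS error (device ...)` lines quoted above and counts them per device.

```shell
# List each device the kernel reported ReiserFS errors on, with a count.
# $1 is the path to a syslog text file (hypothetical path, supplied by the caller).
reiserfs_error_devices() {
    grep 'REISERFS error (device ' "$1" \
        | sed 's/.*REISERFS error (device \([^)]*\)).*/\1/' \
        | sort | uniq -c \
        | awk '{ print $2 ": " $1 " error(s)" }'
}
```

Usage would look something like `reiserfs_error_devices /var/log/syslog` (the log location is an assumption). Any device it lists is a candidate for the unmount and reiserfsck steps described above.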
January 24, 2011 Author
Shortly after posting the question, I manually checked each folder on the cache drive to make sure the same files and folders existed on one of the drives in the array, and deleted them from the cache drive as I found them. When I got to one specific folder, I started getting the error messages again when trying to delete its files. I was able to find all the files from the cache drive on drives in the array, so I removed the cache drive and ran CrystalDiskInfo on it (WD 500GB) to check the S.M.A.R.T. data, and it showed the drive was having issues. The green lights next to the cache drive never indicated a problem. I am assuming that unRAID does not check the S.M.A.R.T. monitoring features of the drive. Could this be a future option?
January 24, 2011
You still probably need to run the file-system check on the drive as I described in my prior post.
January 24, 2011 Author
I was planning on throwing the drive in the trash. I replaced it last night, thinking it is likely a hardware failure. Do you suspect otherwise? What does that command do?
January 24, 2011
What Joe said is correct. All we can see is some file-system corruption, which is similar to the issues Scan Disk or Check Disk deal with in Windows. There is no evidence here of anything at all wrong with the drive itself.

Your syslog does show other oddities with this drive. This drive is your Cache drive, and it mounted slowly because 398 transactions had to be replayed, an indication of a system crash when the drive was last used. What is odd is that no other drive had any transactions replayed, indicating this Cache drive must have just been installed, or ... ? Perhaps it could not be unmounted previously because some other process(es) kept it mounted? But that barely explains having 398 transactions to replay; that is quite a lot.

Another oddity is that the syslog reports a number of duplicate files on Disk 1! A duplicate file means multiple copies across multiple disks, but the first copy of a file practically by definition cannot be a duplicate. The second copy, found on a higher-numbered disk, is the first duplicate copy, so duplicates should normally be reported from Disk 2 or greater. Either Tom has made a change in how these are reported (something I'm not aware of), or the check looks at the Cache drive first and finds the same files there before finding them on Disk 1. Just an oddity, not a serious issue.

You mentioned that the SMART info showed issues (and I too like CrystalDiskInfo). I have absolutely no reason to doubt you, and I don't know you, but I'll just make a couple of comments. It's been my experience that most users do not know how to correctly interpret the SMART attributes and numbers. In particular, they very often become needlessly alarmed at some of the attributes and the very large numbers showing, which actually don't matter AT ALL.

In general, from a failure perspective, the only 2 things that matter are whether any attribute is marked FAILING_NOW, and whether the reallocated sector count is growing over time. Technically, the only thing that matters is whether the WORST value (only for attributes marked Pre-fail) is decreasing over time, and especially whether it is approaching the THRESHOLD value for that attribute. Of course, you may have more experience than most, and I don't wish to impugn that in any way. Do you mind indicating what SMART issues you saw?
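
As a rough sketch of the rule of thumb above, the following filters the attribute table printed by `smartctl -A` and reports only attributes that are marked as failed, or Pre-fail attributes whose WORST value is close to THRESH. The function name `smart_flag_attributes` and the default margin of 10 are assumptions for illustration, not anything unRAID or smartctl defines; the column order is the standard one in the smartctl attribute table.

```shell
# Read `smartctl -A` output on stdin and print only the worrying attributes.
# $1 (optional) is how close WORST may get to THRESH before it is reported;
# the default of 10 is an arbitrary, illustrative margin.
smart_flag_attributes() {
    awk -v margin="${1:-10}" '
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            name = $2; worst = $5 + 0; thresh = $6 + 0; type = $7; failed = $9
            if (failed != "-")
                print name ": marked " failed
            else if (type == "Pre-fail" && thresh > 0 && worst - thresh <= margin)
                print name ": WORST " worst " is within " margin " of THRESH " thresh
        }
    '
}
```

It would be used as `smartctl -A /dev/sde | smart_flag_attributes`. No output means nothing matched these two checks; it does not track whether the reallocated sector count is growing, which needs readings taken at different times.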
January 24, 2011
Oops, I forgot that, as Joe said, the system made the drive read-only upon discovering file-system corruption. That explains the duplication: the Mover had previously copied the files from the Cache drive to Disk 1, but was unable to delete them from the locked Cache drive. It may also explain the transactions, since the Reiser file system may actually go around the read-only flag for certain things such as journaling. Just follow Joe's instructions to correct the file system. The reiserfsck command is similar to ScanDisk and Check Disk.
January 24, 2011 Author
###########
reiserfsck --check started at Sun Jan 23 22:06:43 2011
###########
Replaying journal: Done.
Reiserfs journal '/dev/sde1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. \/ 13 (of 72\/ 94 (of 170/block 57286470: The level of the node (0) is not correct, (1) expected
the problem in the internal node occured (57286470), / 24 (of 72\/ 42 (of 155/ block 91122236: The level of the node (0) is not correct, (1) expected
the problem in the internal node occured (91122236)finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
2 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Sun Jan 23 22:08:08 2011
###########
January 24, 2011 Author
FYI, my previous build was as follows:

3 x 2TB drives - parity, disk1, disk2 - both 75% utilized
1 x 1TB drive - disk3 - 100GB on it
1 x 750GB drive - disk4 - empty
1 x 500GB drive - cache - corrupted

I replaced the 750GB with another 1TB and then used the 750GB as the new cache drive, so the new config was as follows:

3 x 2TB drives - parity, disk1, disk2 - both 75% utilized
2 x 1TB drives - disk3, disk4 - about 100GB on disk3, disk4 is empty
1 x 750GB drive - cache drive

However, I then tested the mover by copying about 4GB to the cache drive and doing a Move Now. It left about 2GB on the cache drive and the syslog showed errors. I then removed the 750GB, put the 500GB back in, ran the commands you suggested, and posted the result in the previous comment. I have also attached another syslog.

Because it did not copy the files as it should with the 750GB drive either, does this suggest a possible hardware problem with the SATA slot or board? I am using 2 Icy Dock bays (5 drives each), and the cache drive is the only drive in the second one. The first has the parity and 4 storage drives.

I guess the reason I was quick to blame the drive is that it was a Christmas gift several years ago, as 1 of 2 500GB drives in a WD My Book. Somehow it used 2 x 500GB drives to create 1 x 1TB drive. Anyway, one of the drives failed and I lost about half my stuff and several hours of time attempting to save everything I could. I threw the failed drive away, but this one seemed to be fine and I have continued to use it and hang on to it. Several months back, I decided to use it as a cache drive, and a few weeks after beginning to use it, it did this same thing: it would not allow me to save files. I rebooted the server and all was good for several months, until just last weekend; since then it has happened frequently. I've been somewhat nervous about that drive for a while...lol
TowerSyslog2.txt
January 24, 2011
My respect for CrystalDiskInfo has dropped some. I thought I remembered it having tabs to check the rest of the SMART detail, such as the ATA error log, the previous SMART test results, and the ability to initiate SMART tests. I guess I confused it with the great SMART tool on the Parted Magic live CD I've been using more frequently lately. I wanted to see whether those 3 problematic sectors are recent; since I believe Uncorrectable sectors cause error log entries, you could have checked whether they occurred recently or well in the past.

Aside from those 3 questionable sectors, the SMART report looks very good, with nothing reallocated yet. Current Pending sectors are ones that were detected as possibly faulty, but they aren't remapped or recovered until they are written to, which forces the drive to re-analyze them and either remap a spare sector in their place or accept them as good and write fresh data correctly to the sector. There are 2 kinds of bad sectors: hard read errors are those with defective magnetic media under them (which results in them being replaced with a spare), and soft read errors are those with good media but scrambled or corrupted data. If, after testing the sector, the drive decides the media is fine, then the sector is written to and considered recovered. Current Pending is decreased for each sector that is tested this way. Reallocated sectors only increases each time a sector is replaced.

The fact that you have exactly the same number of Uncorrectable sectors probably means they are the same 3 sectors, and probably just scrambled data. This can often occur because of a strong electrical spike while the drive was being written to, which could have happened with a nearby lightning strike or a sudden power outage. So chances are good that these are soft errors and the sectors can be recovered. But even if they are not, it is only 3 out of thousands of available spare sectors, for me a very minor issue. The important thing is to monitor it to see if more show up. A few once in a while is not a serious thing, unless it is the start of a bad trend.

Hmmm... that is a sad reiserfsck report. The Reiser file system is damaged enough that it requires the --rebuild-tree option, and that usually ends up being a lot of hassle. It usually repairs the file system, but often makes a mess of the files it recovers, the directory structure, and the file names, which then results in a very time-consuming process of manually deciding what each file is, renaming it to the correct name, and moving it where it belongs. You don't want to do that if you don't need to.

But you did say that you were ready to trash the drive? Which means there is nothing of value on it? If there is, copy it off first if you can; then your best option by far is to just Pre-clear the drive! No messing with rebuild-tree either! Plus it forces the drive to test those 3 sectors and recover or replace them. And the additional SMART reports Pre-clear provides should help you decide whether to trash or trust the drive. And if you are still not sure, run Pre-clear twice more on it!

As to the 750GB drive, I'd have to see the syslog from when it ran to figure out what went wrong. Or you can try it again and post the syslog afterward. There are NO hardware issues AT ALL so far in the 2 syslogs I have seen.

Heading off to bed now... hope my tiredness did not result in any mistakes...
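
For the "monitor it to see if more show up" advice, one hedged way to do it is to pull just the three sector counters out of `smartctl -A` output and save them with a date, so later readings can be compared. The function name `smart_sector_counts` is hypothetical; attribute IDs 5, 197 and 198 are the usual Reallocated, Current Pending and Offline Uncorrectable entries, though not every drive reports all three.

```shell
# Read `smartctl -A` output on stdin and print "NAME RAW_VALUE" for the
# reallocated (5), current pending (197) and offline uncorrectable (198) rows.
smart_sector_counts() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2, $10 }'
}
```

Running `smartctl -A /dev/sde | smart_sector_counts` every so often and keeping the output alongside the date gives a simple history. Per the explanation above, a pending count that drops back to 0 after the sectors are rewritten suggests soft errors, while a rising reallocated count is the trend to watch.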
January 26, 2011
Still doing it with another drive. I'll try another slot.
It's generally better to include the entire syslog, zipped if necessary. I don't know if this is the same drive or not, but this syslog piece is reporting a new problem, media errors, so it would be good to get a SMART report for this drive.
January 26, 2011 Author
Well, crap!!...lol I powered down the server, moved the drive to another slot, and powered it up, and it wouldn't boot, even after moving things back the way they were. It's not even requesting an IP address from my router. I replaced the Ethernet cable and tried different USB ports; now I'm trying to format the flash drive and re-install.
January 26, 2011
When you moved the drive, the BIOS on your motherboard might have assigned it as the boot device instead of the flash drive. You might check whether the flash drive is still assigned as the boot device before you pull out all your hair.
January 26, 2011 Author
It did. I formatted the flash drive, installed the latest version, 4.7, and then noticed it was trying to boot from a hard drive instead of the flash. Now my unRAID is only allowing 3 drives, shows as not registered, and is the Basic version. I used the same flash drive, so my flash GUID should not have changed, right? Shouldn't this be loading up with the full version?
January 26, 2011 Author
Dang, it's been a while since I installed, and I didn't think about saving the key file...lol. I had it saved in the array, so I brought the array online with only the parity and 2 disks. Luckily most of the files were on the first 2 storage drives, as they are the largest. The key was there as well.
January 26, 2011 Author
I got the key back on the flash drive and was able to bring all the drives back online. While it was rebuilding the parity drive, I ran CrystalDiskInfo on the 750GB cache drive, and it is showing Caution as well. I'm curious whether this is actual hardware failure or just corruption. Since I'm more familiar with Windows hardware tests, I started an NTFS format on it last night and plan to run a few scans that will test the hardware of the drive.
January 27, 2011 Author
OK, both the 750GB and the 500GB showed "Caution" in CrystalDiskInfo. I replaced the cache drive with a brand new 320GB drive and everything is perfect again. I then decided to shut down the server and remove each drive one by one, running CrystalDiskInfo on each of them. The parity drive is showing "Caution" as well. It's less than 1 year old and still under warranty, so I have RMA'd it.

Now for the million dollar question: can I perform this same type of S.M.A.R.T. test without having to remove the drives and plug them into my desktop PC? Is there an add-on that I can install? A future update maybe?

You guys have been a lot of help and I really like the software.
Thanks,
Steve
January 27, 2011
Yes. unRAID includes 'smartctl', which has the ability to run SMART tests. You do this from the Linux command line. Replace /dev/sd# with the drive you want to run it on, such as /dev/sda (1st drive) or /dev/sde (5th drive) if they're SATA using AHCI, or /dev/hda (1st drive) or /dev/hde (5th drive) if they're SATA not using AHCI or they're IDE.

smartctl -t short /dev/sd#

Wait 3 minutes until the short test is finished, then run:

smartctl -a /dev/sd#

You will then see the output of the SMART report and have to examine the details to determine if it's good or not.

There is an easier way to do this by using unMenu and the myMain add-on(s). They present a nice user interface for running SMART tests, and I think they know how to present the SMART reports to the user in a friendlier manner.
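
The two smartctl steps above can be wrapped into one small helper, sketched here under the assumption that a fixed wait is good enough. The name `smart_short_test` and the optional wait argument are made up for this example; the 180-second default just mirrors the "wait 3 minutes" advice.

```shell
# Start a short SMART self-test on a drive, wait, then print the full report.
# $1 is the device (e.g. /dev/sde); $2 is the wait in seconds (default 180).
smart_short_test() {
    dev="$1"
    wait_secs="${2:-180}"
    smartctl -t short "$dev" || return 1   # stop if the test could not be started
    sleep "$wait_secs"
    smartctl -a "$dev"
}
```

For example, `smart_short_test /dev/sde` would test the drive at /dev/sde. The report still has to be read by hand to decide whether the drive looks healthy.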