Everything posted by kno

  1. Success! New cables arrived and the rebuild of the failed drive was successful. I think it was a power supply issue, but I cannot be sure because I changed cables and controller at the same time. The new controller was also much faster than my old solution. The rebuild averaged 97 MB/s, completing in 17 hours. My old solution would have taken about 50-60 hours. Thanks for the help.
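As a rough check on the rebuild figures in post 1, the time follows directly from drive capacity divided by average throughput. The sketch below assumes a 6 TB (decimal) drive, which post 1 does not state; the function names are made up for illustration.

```python
def rebuild_hours(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to read or write the whole drive at a constant average rate."""
    return capacity_bytes / throughput_bytes_per_s / 3600


def implied_throughput_mb_s(capacity_bytes: float, hours: float) -> float:
    """Average rate in MB/s implied by a given total rebuild time."""
    return capacity_bytes / (hours * 3600) / 1e6


CAPACITY = 6e12  # assumption: 6 TB drive, decimal terabytes

if __name__ == "__main__":
    print(f"At 97 MB/s: {rebuild_hours(CAPACITY, 97e6):.1f} h")
    for hours in (50, 60):
        print(f"{hours} h implies {implied_throughput_mb_s(CAPACITY, hours):.1f} MB/s")
```

Under that assumption, 97 MB/s works out to roughly 17 hours, and a 50-60 hour rebuild corresponds to an average of roughly 28-33 MB/s.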
  2. In my System Log I have observed the following error message: ntpd[1495]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized Does anyone know what this is?
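One way to read the message in post 2: assuming the 0x41 is the kernel's clock status word, it can be decoded against the STA_* bit flags from the Linux header `<linux/timex.h>`. The helper name `decode_status` is made up for this sketch.

```python
# Status bits as defined in the Linux kernel header <linux/timex.h>.
STA_FLAGS = {
    0x0001: "STA_PLL",        # PLL updates enabled
    0x0002: "STA_PPSFREQ",
    0x0004: "STA_PPSTIME",
    0x0008: "STA_FLL",
    0x0010: "STA_INS",
    0x0020: "STA_DEL",
    0x0040: "STA_UNSYNC",     # clock unsynchronized
    0x0080: "STA_FREQHOLD",
    0x0100: "STA_PPSSIGNAL",
    0x0200: "STA_PPSJITTER",
    0x0400: "STA_PPSWANDER",
    0x0800: "STA_PPSERROR",
    0x1000: "STA_CLOCKERR",
    0x2000: "STA_NANO",
    0x4000: "STA_MODE",
    0x8000: "STA_CLK",
}


def decode_status(status: int) -> list[str]:
    """Return the names of all status bits set in the given status word."""
    return [name for bit, name in sorted(STA_FLAGS.items()) if status & bit]


if __name__ == "__main__":
    print(decode_status(0x41))
```

Under that assumption, 0x41 is STA_PLL plus STA_UNSYNC, which matches the "Clock Unsynchronized" text in the message: the kernel clock has not (yet) been marked as synchronized to a time source.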
  3. Perhaps the Sharkoon REX8 value? This one has 7 external bays: https://www.sharkoon.com/product/1678/Rex8Val#desc
  4. Your bottom disk will probably run very hot. The disk above it might get hot as well, because it will absorb heat from the disk below. If you want to place them like this in the bottom of the case, you are much better off putting them on their sides with some air around them. You can also strap a fan somewhere, just to create some airflow along the drives. Any fan you have lying around might do the job. At least your drives won’t cook. Also, make some kind of support so they don't tip over if you move the case.
  5. Your drives don’t look very healthy. You will probably not be able to rebuild one of the disks when you have two disks with these errors. I would try to back up the data from the disks to another place as soon as possible. I have learned a lot about SMART data in the last couple of months. I have three disks with exclamation marks on them. I did not take it too seriously because I did not understand the meaning of the SMART reports. When I get my Unraid back up and running I will start following the SMART reports on a monthly basis and look for trends. One interesting read was an article from Backblaze: https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/ It is about the statistical correlation between various SMART values and drive failures. The conclusion is that some values need to be taken very seriously with regard to drive failures. If they start to change, it is time to swap out the drive. It might still be usable, but your data is not safe. The most problematic ones according to the article are these: Reallocated Sector Count, Reported Uncorrectable Errors, Command Timeout, Current Pending Sector Count, and Uncorrectable Sector Count. Your drives have high Current Pending Sector Count and Reallocated Sector Count.
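A minimal sketch of how the five attributes named in post 5 could be checked automatically. It assumes the attribute table layout printed by `smartctl -A` (ID in the first column, raw value in the tenth) and uses the attribute IDs from the Backblaze article (5, 187, 188, 197, 198). The sample text and function names are invented for illustration and are not taken from any real drive.

```python
# Attribute IDs for the five values the Backblaze article singles out.
WATCHED = {
    5: "Reallocated Sector Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}


def parse_raw_values(smartctl_output: str) -> dict[int, int]:
    """Map attribute ID to raw value for each row of a `smartctl -A` style table."""
    raw = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            raw[int(fields[0])] = int(fields[9])
        except ValueError:
            continue  # raw value is not a plain integer; skip it
    return raw


def nonzero_watched(raw: dict[int, int]) -> dict[str, int]:
    """Return the watched attributes whose raw value is above zero."""
    return {WATCHED[i]: v for i, v in raw.items() if i in WATCHED and v > 0}


# Hypothetical sample, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       521
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_raw_values(SAMPLE)))
```

Saving the parsed values once a month and comparing them with the previous run would give the trend view described in the post.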
  6. I have done some searching and reading today. I have found that several users with this card (and some other models) have problems booting from USB when the LSI card is updated to the newer versions. I guess there might be some compatibility issues with some motherboards and boot sequences. However, the information in the threads was not conclusive. I think I will leave the LSI card alone for the moment. When I get the cable I will try the card, and if it works out OK I will not do any updating until it is required. Either way, I would still like some input on my questions above if anyone has any comments.
  7. A third thing that is a bit confusing: on the LSI webpage there is both BIOS and firmware for the card: https://www.broadcom.com/support/download-search/?pg=Legacy+Products&pf=Legacy+Products&pn=SAS+9201-16i+Host+Bus+Adapter&po=&pa=&dk= Do I need to flash both, or does the firmware contain the BIOS? I think this is really confusing stuff. I am not very comfortable with it. I am just a computer layman, not an IT professional. I don't think my MB supports UEFI either. Looks like that complicates the matter. I see that LSI has a utility called LSI Pre-Boot USB, but this looks like it is for newer cards?
  8. I just got an LSI 9201-16i card that I picked up from eBay. My first surprise when looking for information on the LSI support website is that the documentation for the card is severely lacking. It only contains a useless installation guide. Almost no other information is provided. There is no info on what the LEDs signal, no info on the configuration utility and no info on how to do updates. Pretty bad, I must say. I expected better documentation. Guess I need some forum help instead. Hopefully there are some LSI owners who can shed some light on my questions. I plugged the card into my computer and booted up. I have no HDDs attached at the moment (waiting for cables). My Unraid runs on an older Asus P5B-E motherboard. The LSI posts the following during boot: LSI Corporation MPT SAS2 BIOS MPT2BIOS-7.05.01.00 (2010.02.09) Copyright 2000-2010 LSI Corporation Looks to me like the card is running an old BIOS version? The newest seems to be 7.27.01.01 on the LSI/Broadcom website. So, the first question is: Do I need to update the card, or will it run just fine without updating? I would prefer not to update if it is fine as it is. I also tried to access the LSI Corp Configuration Utility (CTRL+C during boot). When I do, nothing happens. The card posts that it will access the utility after initialisation. However, I only get a blank screen and nothing happens. So, the second question is: How can I access the configuration utility? Do I need to have HDDs attached to be able to load it? Do I need the configuration utility at all?
  9. My bad, I figured it out. Looks like the network cable was loose/not properly attached. Now on to the next problem. I will make a new thread for the LSI card and keep this one for the data rebuild.
  10. Perhaps there is a simple answer to this: I booted up the system. I have a screen attached to the Unraid server, and also a wireless mouse and keyboard. On the screen I see: Tower login: _ However, I am not able to access the Unraid web GUI from another computer. I have tried with two different ones. I think I have noticed this earlier, but I might be wrong. Is there something that causes the web GUI to be deactivated if a monitor/keyboard is attached to the Unraid server?
  11. My LSI card has arrived, but I am still waiting for the break-out cables. They will take another couple of days. Can I boot my existing Unraid installation without having any HDDs attached? My drives are completely disconnected. Is it OK to boot, or will it just mess things up? I just want to see if it boots up OK with the card installed.
  12. BTW: Would there be any advantage for me in starting with a new configuration once I have successfully rebuilt my failed drive? My Unraid has been upgraded since version 4 (unsure of the exact version). I am thinking that starting with a new config might have some advantages, such as a new file format? There might be other advantages as well?
  13. This is hard to say. There was one power failure recently, but I cannot say for sure if it is connected. I ordered a new 850W EVGA PSU, that based on the reviews should be able to deliver ample stable power. I am still waiting for the new parts to arrive.
  14. I pulled all the power cables and SATA cables to the HDDs and wrote down the connections in a matrix to see if there is a common denominator for the drives that have random errors. Both drives are at the end of one of the molex power cables. I have a theory that this might be a contributing factor that could explain some of the problems I am experiencing. This could explain why I do not get any errors during SMART tests but do get errors during rebuild. During a SMART test one or two drives are running. The other drives are idling or in standby. During a rebuild all drives are working at full load and probably drawing maximum power. Perhaps the PSU is struggling to deliver stable power from the 12V rail to all drives? Perhaps it becomes unstable, or there are small fluctuations that cause random read errors? Could this also explain why I have had several drive failures? Can bad/low quality power cause pending sectors or bad sectors? What do you think of this theory? At least this little exercise has made it clear to me that I should get a new power supply just to be certain. Also, I plan to run a few days of stability tests while waiting for the LSI card and cables. I think I will run the memory and CPU tests on the Ultimate Boot CD. This way I am assured that the rest of the system is stable.
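The connection matrix described in post 14 can be sketched as a small lookup that reports which properties all failing drives share. The layout below is entirely hypothetical (the real matrix was not posted); only the detail that both failing drives sit at the end of a molex power cable comes from the post.

```python
def common_denominators(connections: dict, failing: set) -> dict:
    """For each property on which all failing drives agree, return the shared
    value and the list of healthy drives that also have that value."""
    if not failing:
        return {}
    shared = {}
    attrs = set.intersection(*(set(connections[d]) for d in failing))
    for attr in sorted(attrs):
        values = {connections[d][attr] for d in failing}
        if len(values) == 1:
            value = values.pop()
            healthy = sorted(
                d for d, props in connections.items()
                if d not in failing and props.get(attr) == value
            )
            shared[attr] = (value, healthy)
    return shared


# Hypothetical wiring, for illustration only.
CONNECTIONS = {
    "disk1": {"power_cable": "molex A", "position": "end", "controller": "onboard"},
    "disk3": {"power_cable": "molex A", "position": "middle", "controller": "onboard"},
    "disk4": {"power_cable": "molex B", "position": "first", "controller": "promise"},
    "disk5": {"power_cable": "molex B", "position": "end", "controller": "sil24"},
}

if __name__ == "__main__":
    print(common_denominators(CONNECTIONS, {"disk1", "disk5"}))
```

A property that all failing drives share and no healthy drive shares (an empty list in the result) is the strongest candidate, though with only two failing drives it remains a hint rather than proof.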
  15. BTW, here is diagnostics. Edit: Diagnostics removed.
  16. Any thoughts on what the above error is about? I killed power to the server and rebooted. Disk rebuild started automatically, however both disk 1 and disk 5 are reporting errors, so I stopped it at once. I have 1 read error on disk 1 and 126 read errors on disk 5, just from the first minute of rebuilding. Disk 5 passed both short and extended SMART tests yesterday without any errors, so this is strange. There is something seriously wrong somewhere, but what is it?!? Both drives are new WD60EFRX-68L0BN1 disks. Can this make/model really be the problem? If they are so troublesome, there must be some more info about it somewhere? If that is the culprit, how can I ever rebuild my failed disk? I am thinking of doing the following:
      1. Do nothing at the moment. Power off and leave it alone (wait for the LSI card, cables and second-hand Matrox PCI GPU to arrive).
      2. Order a new PSU, something like 600-700 watts with removable cables.
      3. Remove the controllers, SATA cables and old PSU. Replace with the new PSU and LSI card, and connect all HDDs to the LSI (including those that are on the MB today).
      4. Change the MB battery and reset the MB BIOS to default values.
      5. Run Memtest86 for a couple of hours.
      6. Boot Unraid and try to rebuild the failed drive.
      What do you think? Have I missed anything? Is there a good way to test motherboard and CPU stability in addition to Memtest86 for the memory? Perhaps I should test system stability with only CPU, GPU, and memory installed (no controllers or HDDs).
  17. I need some more advice. I replaced disk 2 with a new drive and started the rebuild. Now I cannot reach the web GUI, and I cannot access the server with PuTTY either. It is about 13 hours since I started the rebuild and lost contact. With PuTTY I get the error message “Unable to open connection to TOWER. Host does not exist.” So I plugged my monitor into the Unraid server to see if something could be seen that way. There was a lot of text scrolling on the screen, but it went by so fast that it was impossible to make out what it said. I grabbed a camera and tried to capture it. I can make out some of it. I think it might be the same message repeating over and over, but I am not 100% sure. I am not sure what it means, and I am not sure what I am looking for. I can make out the following:
      Fixing recursive fault but reboot is needed!
      BUG: unable to handle kernel paging request at 0000002b00000008
      IP: [<ffffffff81385377>] blk_flush_plug_list+0x59/0x219
      PGD 0
      Oops: 0002 [#4287628] PREEMPT SMP
      Modules linked in: md_mod coretemp kvm_intel kvm i2c_i801 i2c_smbus i2c_core sata_promise sata_sil24 ahci intel_agp libahci pata_jmicron intel_gtt atl1 agpgart mii asus_atk0110 acpi_cpufreq [last unloaded: md_mod]
      CPU: 3 PID: 1216 Comm: kworker/u8:15 Tainted: G D W 4.9.30-unRAID #1
      Hardware name: System manufacturer System Product Name/P5B-E, BIOS 1807 04/15/2009
      [then there is some more code]
      Call Trace:
      […] dump_stack+0x61/0x7e
      […] __warn+0xb8/0xd3
      […] warn_slowpath_null+0x18/0x1a
      […] do_exit+0x5c/0x86a
      […] rewind_stack_do_exit+0x17/0x19
      ---[ end trace 7672d1c585583b99 ]---
      I think there is some other info in there as well that I might have overlooked. Please have a look at the attached screenshots taken with the camera. So, I guess something is wrong? What could it be? Is it safe to just switch off/kill power to the system? Should I do it? BTW, I ordered a used LSI 9201-16i and some SAS to SATA cables yesterday. Hopefully they will arrive within 2 weeks or so.
  18. Ran SMART tests on disks 2 and 5. Disk 2 failed the short test (and there was mechanical noise from the server chassis). This disk is definitely bad. It is now marked with a red X in the web GUI. Disk 5 passed both the short and extended SMART tests. I cannot explain the read error I had on it. Hopefully it won’t happen again. I will now exchange disk 2 for a spare disk and hope the rebuild is successful. What do you do with your disks before you send them in for RMA? There is still data on this one, so I would prefer to wipe it if possible. Any suggestions?
  19. I tried running a SMART short self-test on disk 2. I got the following message: Interrupted (host reset). What does that mean? The syslog contains the following:
      Oct 16 22:53:20 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
      Oct 16 22:53:20 Tower kernel: ata10.00: failed command: IDENTIFY DEVICE
      Oct 16 22:53:20 Tower kernel: ata10.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
      Oct 16 22:53:20 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
      Oct 16 22:53:20 Tower kernel: ata10.00: status: { DRDY }
      Oct 16 22:53:20 Tower kernel: ata10: hard resetting link
      Oct 16 22:53:22 Tower kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
      Oct 16 22:53:22 Tower kernel: ata10.00: configured for UDMA/100
      Oct 16 22:53:22 Tower kernel: ata10: EH complete
      Can something be read from the log above?
  20. So, what do you think? Should I replace disk 2 and try to rebuild (and hope disk 5 does not give many errors)? I have two more WD60EFRX-68L0BN1 drives installed in the array.
  21. This is referring to the random read errors, right? Am I correct in assuming that a current pending sector is always a drive error, caused by a bad sector (or failing sector) on one of the platters? Once a sector is bad it has to be remapped and "closed off" so that no additional data is written to this sector? I understand the drive can remap some number of bad sectors and still be usable. I read that some users continue to use their drives and force remapping by zeroing the drive, but if I can RMA the disk that would be preferable. I do run my Unraid server on older hardware, but I am thinking that all drives on the same controller or power line should have errors if it was another HW issue. Still, I am considering getting new hardware. Also, I have a problem with keeping some of the data. If I have to replace 4 x 6TB disks I have no place to copy the data! One thing is certain. I am finished with WD disks! I have had nothing but bad experiences with them now.
  22. So, even if I use user shares I can still move data with disk-to-disk shares. Won’t this affect parity and make problems for the user shares? Let’s say I want to move data from one disk to another disk. Both are included in the user share, however the disk I want to move the data to does not contain any data from the user share at the moment (so the right directories do not exist). Can I still move the files if I make sure that the file-paths/directories are the same?
  23. How can I explain the error on disk 5 if there is no trace of it in the SMART reports? I would like to know what went wrong and how to prevent it in the future. Do random read errors just happen sometimes? Is it possible that the HDD generates random read errors because of a firmware error? If this is the case, how can I possibly rebuild a failed drive? I purchased three WD 6TB drives a few weeks ago. I think that they are all WD60EFRX-68L0BN1. I have searched for WD60EFRX-68L0BN1 firmware errors but found nothing specific. If WD hasn’t said anything about it, it will be difficult to RMA the drive. If I need to replace and rebuild disk 2, other read errors will cause problems. Do you think one pending sector is reason enough for an RMA, or do I need to preclear the drive until it fails?
  24. Ok, thanks. I will look into it. Do you know if MC will work as suggested above?
  25. I recently had some trouble with several drives on my Unraid server. I had the issues solved (or at least I thought they were solved). I had support in this thread: Just today I installed the unBalance plugin and looked around in the program. All disks spun up and unBalance got its information on where the files are stored. I did not use unBalance to scatter or gather any files, I just looked around. When I returned to the Unraid WebGUI I had read errors on two of my drives. Disk 2: 4 errors. Disk 5: 1 error. Both drives are fairly new (disk 2 is about 6 months old, disk 5 about 2 months). I had a successful parity rebuild without errors only 2-3 weeks ago. I have not used the array much after that. I looked at the SMART reports. Disk 2 now has: 197 Current pending sector: Raw Value 1. Disk 5: I cannot see what the error on disk 5 is. I compared the new SMART data with an old SMART report and there was no change in the values that are visible in Unraid (disk 5 had 521 UDMA CRC errors, attribute 199, but these were from before because of bad cabling). At the moment, management of Unraid is giving me nothing but headaches. I am really considering moving to a simpler device, such as a Synology NAS. Is data storage really this unreliable? I never knew. So, what do I do next? Edit: Attachments removed.