PFT

Everything posted by PFT

  1. I moved the Parity 1 disk to an unused port on an AOC-SAS2LP-MV8 controller card and the disk is now present. The array starts without problem, as originally expected. I conclude there is something in the motherboard's configuration that makes its onboard controller ports incompatible. Next time I power up the system I'll take a look, especially to see if they are configured for RAID. But the bottom line is everything works as anticipated, so thanks again for your help.
  2. Both parity disks are connected to controller ports on the Supermicro X10 mobo. Originally (prior to switching the cable to P1) P1 was connected to Sata0 and P2 to Sata1. After the switch, P1 is connected to Sata2 (which was previously connected to an unused slot, slot 2).
  3. OK thanks, I'll give it another go. Presumably getting diagnostics without starting the array will be OK?
  4. I have a dual parity setup running 6.9.2. Parity 1 is in slot 0 and Parity 2 is in the adjacent slot, slot 1. Both disks are 16TB. Parity 1 has been reporting incrementing UDMA_CRC errors for some time now, which usually seem to occur when the disk is under heavy use e.g. at parity check time. I’ve done the recommended physical checks (cabling, connector seating etc) but the issue is recurring. I thought an easy fix would be to disconnect the Sata cable to the mobo port and connect a different cable to an unused controller port, in the belief that Unraid doesn’t care which controller port a disk is connected to. I did this but on restarting the system the Parity 1 disk is shown as missing. Being unsure how to proceed from here, I powered down the system and restored the cabling as before. Any advice will be appreciated.
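For anyone watching a similar counter, something like the following might help track whether the CRC count is still climbing between checks (a sketch only: the device path is a placeholder, and the parsing assumes the usual `smartctl -A` column layout):

```shell
#!/bin/sh
# Sketch: read the raw value of SMART attribute 199 (UDMA_CRC_Error_Count).
# The counter only ever increases, so noting it before and after a parity
# check shows when errors accumulate. DEV is a hypothetical device path.
DEV=${1:-/dev/sdb}
smartctl -A "$DEV" | awk '$2 == "UDMA_CRC_Error_Count" { print $NF }'
```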
  5. OK thanks for the response. I had been pressing done and not apply. Now it works and you get a message saying 'array reset'. The old disk slot is now unassigned and I can start parity. Thanks!
  6. OK thank you. I have finally got round to upgrading the array and am following the 'Faster Method' as given in the article linked above. All the files have been copied to the new 16T disk and the old 4T disk removed. After powering up I am sitting at #8, 'Use New Config to unassign removed disks and assign parity'. The parity assignment worked OK but I am unable to unassign the old disk, which Unraid says is 'missing'. Consequently I am unable to start the parity check due to 'Too many wrong and/or missing disks'. What am I doing wrong? Thanks in advance.
  7. I’ve been an Unraid user for over ten years and have had very few issues that I haven’t been able to resolve using the docs and the fantastic resource provided by the forum experts. The current scenario concerns a 24-disk array in a Supermicro enclosure running 6.9.2. I would like to replace around half of the disks, which are 4TB with power-on hours around the 60,000 mark, with a smaller number (3 or 4) of much larger disks. I’ve already replaced the 2 parity disks with 16TB Seagate Exos, as well as one of the old data disks which Unraid had previously disabled. At this point I would like to start moving data files from some of the oldest 4TB disks to the new 16TB disk, then remove the now-empty 4TB(s) from the array, so I’m looking for an idiot-proof guide on how to do this, especially for a two-parity configuration running 6.9.x. The closest I’ve come across so far is this: https://wiki.unraid.net/Replacing_Multiple_Data_Drives_with_a_Single_Larger_Drive Is this the latest ‘official’ procedure? Is it suitable for a 2-parity setup? I don’t see this wiki document in the current (6.9.2) pdf version of the Unraid manual. Thanks in advance for any advice.
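The copy step in a procedure like this is usually done disk-to-disk at the console; a rough sketch, with hypothetical disk numbers rather than anything from the actual array, might look like:

```shell
#!/bin/sh
# Sketch: mirror one old data disk onto the new larger one before pulling
# the old disk from the array. The mount points are assumptions for a
# typical Unraid layout; run while the relevant shares are idle so nothing
# changes mid-copy.
SRC=/mnt/disk5    # hypothetical old 4TB disk
DST=/mnt/disk23   # hypothetical new 16TB disk
rsync -avX "$SRC/" "$DST/"   # -a preserves perms/times, -X extended attrs
```

The trailing slash on `$SRC/` copies the disk's contents rather than nesting a `disk5` directory inside the destination.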
  8. I need to upgrade 4 of the older 2TB disks in my 24-disk server running 5.0. The replacements are 4TB, all already pre-cleared. I understand that a rebuild must be done after swapping out each individual disk, but is it necessary to run a parity check after each rebuild, or can I wait until all four have been rebuilt? What, if anything, bad might happen if I defer the parity check until the end of the process? Thanks.
  9. Ok thanks. Once the parity check completes I'll go through the cable/port routine again to make sure I didn't miss anything.
  10. Syslog showing errors with new disk (at end). syslog5-swapped_disks.zip
  11. Well, the mystery deepens. The disk passed the long SMART test, so I replaced it with a newly pre-cleared one and let the system start the rebuild. Everything seemed to be OK at first, but then several hours into the process there was a flurry of errors, including some '10B8B LinkSeq' that I hadn't seen before - all again pointing to ata1 and slot 3. Right now I'm planning to let the rebuild finish and then move the disk to the slot currently occupied by the cache, leaving slot 3 empty.
  12. Yes, I tried that in #3 above. The motherboard has 14 SATA ports, so that's an easy one. I also put the backplane (which supports 4 disks) on a separate power rail, but no change. Presently running another long SMART test, since the only constant in all this appears to be the drive in slot 3. Strangely, though, the system is telling me it will take 520 minutes; the previous one took 255.
  13. PFT

    Supermicro support?

    It turns out that there was nothing wrong with the chassis and I now have a spare power module. The problem was a faulty CPU.
  14. Got an issue with a new-build server into which I've moved an existing 22-disk array plus cache. On start-up, and intermittently thereafter, I'm getting the following posted in the syslog:

    Tower2 kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen (Errors)
    Tower2 kernel: ata1.00: irq_stat 0x09000000, interface fatal error (Errors)
    Tower2 kernel: ata1: SError: { UnrecovData 10B8B BadCRC } (Errors)

    Ata1 (and it's always ata1 that has the error) is associated with the disk in slot 3 of the chassis - that is, the rightmost slot on the top row. All the forum info I have seen so far indicates that Bad CRC errors are caused by hardware issues (power, cables etc), so this is what I've done so far:

    1. Put half the backplanes (3) on a second 12V power rail
    2. Replaced the SATA cable with a known good one from another chassis (twice)
    3. Moved the individual cable for slot 3 to another SATA socket on the motherboard
    4. Swapped over the top two backplanes in the chassis

    No matter what, the errors return and always point to slot 3. My next thought was the disk itself, so I ran a short SMART test, which it passed, and have just started a long one which I'll post later. There are no problems reported on the MyMain Smart page. In the meantime, does anyone have an insight into what might be causing this? Or even whether this is anything to be concerned about? The array itself seems to be working fine and IPMI confirms that the CPU temperature is remaining below threshold.

    Configuration: Norco 4224, Supermicro X10SL7-F, 2x AOC-SASLP-MV8, Core i7, 16GB and Corsair 850W psu. Running Unraid 5.0.5.

    syslog4_-_changed_backplanes.zip smart-short2.txt
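For anyone chasing similar errors, a quick way to confirm they really are confined to a single link is to tally the libata error lines per ata port; a sketch (the syslog path is an assumption, adjust for where your log lives):

```shell
#!/bin/sh
# Sketch: count libata exception/irq_stat/SError lines per ata port so a
# single bad link (e.g. ata1) stands out. /var/log/syslog is an assumed path.
grep -oE 'ata[0-9]+(\.[0-9]+)?: (exception|irq_stat|SError)' /var/log/syslog \
  | cut -d. -f1 | cut -d: -f1 | sort | uniq -c | sort -rn
```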
  15. PFT

    Supermicro support?

    Yes thanks for the phone number, I didn't see that as a tech support option. I called and spoke to a pleasant and knowledgeable support rep. It seems that the power distributor module is faulty. I've ordered a new one and will report back so other SM chassis will be aware.
  16. Has anyone had any success getting a response from Supermicro tech support? My SC846 chassis has gone belly-up and I'd like to get an expert opinion before scrapping it. I tried emailing twice but no response. The unit was bought from Newegg, is there any value in signing up for their paid support or any other ideas?
  17. I'd like to report an experience I'm having with the current beta software (both b12a and b13) which appears to indicate an issue with NFS. This relates to a test server based on the Supermicro X8-SIL motherboard, with 12 data drives, cache and parity, all Hitachi 3TB drives. The parity, cache and four data drives are connected to SATA ports on the motherboard, with the other eight drives connected to a Supermicro AOC-SASLP-MV8 SATA controller. When user shares are accessed from a Duneplayer via NFS, the Duneplayer does not display all the icons for the titles within the selected folder. For example, the following screen shot shows a sample page which should include 18 icons displayed in a 6x3 matrix. As can be seen, there are multiple 'holes' where icons should appear but do not. Additionally, it is not possible to invoke the associated video folders by clicking on the area where the missing icons should be. Effectively, the affected video files are inaccessible via the Duneplayer. This behavior is replicable across multiple Duneplayers connected to the same UnRaid server running 5b12a or 5b13. Switching to SMB (at the Duneplayer end), everything appears normal: all icons are displayed properly and all video files are accessible - see the following screenshot with the 'holes' filled in. I'd also note that this issue does not occur with my main server, which has similar hardware except for the drives, which are all 2TB; that server is running version 4.7, and all shares are accessed via NFS from the same Duneplayers without any problem.
  18. Hmm. Now that's interesting. I did have a second (empty) SASLP in the chassis while the first (failed) batch of pre-clears were being processed, but not for the second (successful) batch. I thought I'd been paying fairly close attention to the dedicated beta thread but I must have missed the part where the 'two SASLP' problem was discussed. Can you point me to it? The second SASLP is on its way back to Supermicro to be re-flashed with the 'Non-RAID' firmware version. When I do the third batch of three I'll try it with the array running. BTW, it's not a major issue for me that the array is 'parked' while this is going on, since all the files on Server 2 are still available on Server 1, or somewhere else on my network.
  19. Update: Well that's curious. Disk 1 of 2 has successfully pre-cleared after 37 hours; Disk 2 is still going strong about an hour behind the first and is looking good too. So, it is not true that 3TB disks can't be pre-cleared successfully when attached to an AOC-SASLP-MV8. I'm still wondering, though, what caused the first batch to fail, and am coming round to the view that the system was simply overtaxed with five concurrent pre-clears and an active array. But then the other (4.7) server has 22 disks and will happily pre-clear a single 3TB with the array running. Both servers are identical except for the chassis (Supermicro 846 running 5.0b12a and Norco 4224 running 4.7). Both have the X8SIL-F-O motherboard, AOC-SASLP-MV8, Intel i3, 4 GB RAM and 850W Corsair psu. I'm now going to attempt again to pre-clear a pair of the original failed disks, again with the array offline.
  20. Update: After 19-20 hours both disks have finished zeroing and are now in the pre-read phase. No errors reported in the log. Once again, the array is not started.
  21. That's the first time I've heard that; I wonder if Joe L might comment. I've already started a new pre-clear with two different disks I was holding in reserve. So far it's been running about 9 hours and is just about to finish the pre-read. So far, so good. The difference this time is that the array has not been started. It would be good if an experienced member could look at the log and see if they can determine what actually went wrong at around the 08:55 mark, which was when the first two original disks seem to have given up the ghost.
  22. I was incorrect in assuming that the syslog did not include the events leading up to the failures. They are still in the very front of the log here: syslog.2011.10.11.zip
  23. This is a new server with a fresh 5.0b12a install. It has a seven-disk array (incl. parity and cache) - all 3TB Hitachi 5K3000. The existing disks were all successfully pre-cleared in an older chassis with 4.7 and the 1.12beta pre-clear script. The present failures involve a further five disks of the same type and size that I attempted to pre-clear using the 1.13 script in the new (5.0b12a) chassis. All five failed to pre-clear successfully. Two of the five gave up relatively quickly at 10h36: one was 97% through the pre-read; the other, at around the same time, was into the zeroing phase. Unfortunately this happened late in the evening so I was not around to see it happen, and the log subsequently filled with thousands of lines of 'garbage' (more on that below). The other three went on to complete the cycle in the expected 42 hours or so but all reported an unsuccessful pre-clear, due to "Post-read detected un-expected non-zero bytes on disk". The other two produced identical reports. The SMART reports do not (as far as I can see) indicate anything untoward - no reallocated sectors or such. This is a screen capture of the final status for one of the disks:

    ========================================================================1.13
    = unRAID server Pre-Clear disk /dev/sdm
    = cycle 1 of 1, partition start on sector 1
    = Disk Pre-Clear-Read completed                                 DONE
    = Step 1 of 10 - Copying zeros to first 2048k bytes             DONE
    = Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE
    = Step 3 of 10 - Disk is now cleared from MBR onward.           DONE
    = Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4       DONE
    = Step 5 of 10 - Clearing MBR code area                         DONE
    = Step 6 of 10 - Setting MBR signature bytes                    DONE
    = Step 7 of 10 - Setting partition 1 to precleared state        DONE
    = Step 8 of 10 - Notifying kernel we changed the partitioning   DONE
    = Step 9 of 10 - Creating the /dev/disk/by* entries             DONE
    = Step 10 of 10 - Verifying if the MBR is cleared.              DONE
    = Disk Post-Clear-Read completed                                DONE
    Disk Temperature: 35C, Elapsed Time: 40:59:04
    ========================================================================1.13
    == Hitachi HDS5C3030ALA630   MJ1311YNG1AVEA
    == Disk /dev/sdm has NOT been precleared successfully
    == skip=332600 count=200 bs=8225280 returned instead of 00000
    ============================================================================
    ** Changed attributes in files: /tmp/smart_start_sdm /tmp/smart_finish_sdm
    ATTRIBUTE             NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS RAW_VALUE
    Temperature_Celsius =   171     157        0              ok      35
    No SMART attributes are FAILING_NOW
    0 sectors were pending re-allocation before the start of the preclear.
    0 sectors were pending re-allocation after pre-read in cycle 1 of 1.
    0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
    0 sectors are pending re-allocation at the end of the preclear,
      the number of sectors pending re-allocation did not change.
    0 sectors had been re-allocated before the start of the preclear.
    0 sectors are re-allocated at the end of the preclear,
      the number of sectors re-allocated did not change.
    root@Tower2:/boot#
    =============================================================================

    Here is the preclear report for the same disk, as posted in /boot/preclear_reports:

    ========================================================================1.13
    == invoked as: ./preclear_disk.sh /dev/sdm
    ==
    == Disk /dev/sdm has NOT been successfully precleared
    == Postread detected un-expected non-zero bytes on disk==
    == Ran 1 cycle
    ==
    == Using :Read block size = 8225280 Bytes
    == Last Cycle's Pre Read Time  : 10:06:48 (82 MB/s)
    == Last Cycle's Zeroing time   : 10:43:47 (77 MB/s)
    == Last Cycle's Post Read Time : 20:07:15 (41 MB/s)
    == Last Cycle's Total Time     : 40:59:04
    ==
    == Total Elapsed Time 40:59:04
    ==
    == Disk Start Temperature: 38C
    ==
    == Current Disk Temperature: 35C,
    ==
    ============================================================================
    ** Changed attributes in files: /tmp/smart_start_sdm /tmp/smart_finish_sdm
    ATTRIBUTE             NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS RAW_VALUE
    Temperature_Celsius =   171     157        0              ok      35
    No SMART attributes are FAILING_NOW
    0 sectors were pending re-allocation before the start of the preclear.
    0 sectors were pending re-allocation after pre-read in cycle 1 of 1.
    0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
    0 sectors are pending re-allocation at the end of the preclear,
      the number of sectors pending re-allocation did not change.
    0 sectors had been re-allocated before the start of the preclear.
    0 sectors are re-allocated at the end of the preclear,
      the number of sectors re-allocated did not change.
    ============================================================================

    And here is the SMART report for the same disk:

    ===========================================================================
    Disk: /dev/sdm
    smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
    Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Device Model:     Hitachi HDS5C3030ALA630
    Serial Number:    MJ1311YNG1AVEA
    Firmware Version: MEAOA580
    User Capacity:    3,000,592,982,016 bytes
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Wed Oct 12 15:22:07 2011 PDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x84) Offline data collection activity
                                            was suspended by an interrupting command from host.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                 (36667) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 255) minutes.
    SCT capabilities:              (0x003d) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b 100   100   016    Pre-fail Always  -           0
      2 Throughput_Performance  0x0005 100   100   054    Pre-fail Offline -           0
      3 Spin_Up_Time            0x0007 100   100   024    Pre-fail Always  -           0
      4 Start_Stop_Count        0x0012 100   100   000    Old_age  Always  -           4
      5 Reallocated_Sector_Ct   0x0033 100   100   005    Pre-fail Always  -           0
      7 Seek_Error_Rate         0x000b 100   100   067    Pre-fail Always  -           0
      8 Seek_Time_Performance   0x0005 100   100   020    Pre-fail Offline -           0
      9 Power_On_Hours          0x0012 100   100   000    Old_age  Always  -           44
     10 Spin_Retry_Count        0x0013 100   100   060    Pre-fail Always  -           0
     12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           4
    192 Power-Off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           5
    193 Load_Cycle_Count        0x0012 100   100   000    Old_age  Always  -           5
    194 Temperature_Celsius     0x0002 171   171   000    Old_age  Always  -           35 (Min/Max 25/41)
    196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           0
    197 Current_Pending_Sector  0x0022 100   100   000    Old_age  Always  -           0
    198 Offline_Uncorrectable   0x0008 100   100   000    Old_age  Offline -           0
    199 UDMA_CRC_Error_Count    0x000a 200   200   000    Old_age  Always  -           0

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    ============================================================================

    When I finally came to look at the log, after realising that two of the disks had prematurely stopped their pre-clear activity, it was filled with thousands of repetitions of the following pair of messages, repeating every thirty seconds or so. The log is still filling now.

    Oct 12 17:59:02 Tower2 kernel: sas: command 0xef34c240, task 0xc0800280, not at initiator: BLK_EH_RESET_TIMER
    Oct 12 17:59:02 Tower2 kernel: sas: command 0xefa2ab40, task 0xebbef000, not at initiator: BLK_EH_RESET_TIMER
    Oct 12 17:59:33 Tower2 kernel: sas: command 0xefa2ab40, task 0xebbef000, not at initiator: BLK_EH_RESET_TIMER
    Oct 12 17:59:33 Tower2 kernel: sas: command 0xef34c240, task 0xc0800280, not at initiator: BLK_EH_RESET_TIMER

    It may or may not be relevant, but all five of these disks are connected to the same AOC-SASLP-MV8 controller. Any suggestions and thoughts about where to go from here will be much appreciated.
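For an independent sanity check that a disk really reads back as zeros, something like the following might help (a sketch, not part of the preclear script; the device path and the 64 MB sample size are placeholders):

```shell
#!/bin/sh
# Sketch: dump a sample from the start of the device and report the first
# line containing a non-zero byte. Prints nothing if the sampled span is
# all zeros. DEV is a hypothetical device path; run with the array stopped.
DEV=${1:-/dev/sdm}
dd if="$DEV" bs=1M count=64 2>/dev/null | od -An -v -tx1 | grep -vm1 '^[ 0]*$'
```

This only spot-checks the sampled region, of course; dropping `count=64` scans the whole disk at the cost of a full read pass.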