spasszeit

Members
  • Posts

    62

Posts posted by spasszeit

  1. I seem to have been having problems with 4.7 which I did not have before. I run two servers

    and ever since upgrading to 4.7 I get occasional drive drop-offs from the array. The first time

    it happened I thought the drive was bad. I re-seated the cables, but that did not help: during the

    system POST I see one port not being detected. When the array starts, the disk is missing. I replaced

    the drive with a new one and the array rebuilt itself just fine. I put the bad drive into the other

    tower and ran pre-clear on it - no smart errors. Last night I had the same problem on the other

    server. The problem exhibited itself by inability to access web interface, though access to shares

    via windows and via Telnet worked fine. I captured the syslog - see attached. There is an awful

    lot of messages related to, I think, my SM AOC-SASLP-MV8 and a disk. I run two AOC-SASLP-MV8

    cards in each server.

     

    Now, my friend built a server recently as well. He is running one AOC-SASLP-MV8 card, and he mentioned

    to me that ever since upgrading to 4.7 he has also experienced a problem like mine.

     

    I hope Tom can look into this and advise on what's going on here and how we can address it.

     

    Syslog_Apr_14.zip

  2.  

    Disk1 is a "real" disk.  You can try to copy its contents to another drive on a workstation (I would not recommend writing anything to the array right now).

     

    You could remove disk1 from the server, mount in an external USB, and attach to a workstation computer.  There is a reiserfs program you can run that will give you read-only access to the disk.  If you fear your unRAID computer is failing, this would be a way to eliminate the unRAID server.  If the copy works smoothly you may be justified in doubting the server hardware.  But if it is slow and problematic, you have confirmed the disk really is bad.

     

    Disk7 is impossible to rebuild without disk1 in the array.  If you were able to get disk1 working well in the step above, it would mean that your unRAID server is suspect and you should upgrade it.  Once it is rebuilt, you can retry the rebuild.

     

    But if disk1 is bad, there is not much I know to do.  Perhaps you could send it to some recovery service that could create an image copy of the disk.

     

    Is the physical disk that used to be in slot7 still around?  If so, you might have more luck recovering from that than recovering from the array with the failing disk1.

     

    I have a very strong suspicion that, like you said before, it may be PSU related. The disks are dropping like flies. Must be something with the hardware that is causing this epidemic of failures. I did not mention this before, but during one of the reboots the other night another two drives went missing. I powered down the server, re-seated power and SATA cables and powered it up. The drives, thank God, came back up. It finally dawned on me that I should not take any more chances so I stopped the rebuild and powered down the server. I don't rule out other hardware problems either, especially since I know I've had memory/mobo issues before. So, I decided to rebuild the entire machine. It's been in my plans anyway. The new hardware is on the way so I should have the new build next week.

     

    As of right now, my plan is as follows:

     

    1. Take out disks 1 and 9 and copy files off of them to my second unraid machine.

    I am looking to find a way to do it via some kind of gui as my command line skills are lacking

    and I don't trust them 100%.

    Your suggestion to copy files to a Windows machine sounds like something I'd like to try,

    but I am not sure I want to do that as I have two empty 2TB drives in my second unRaid.

     

    2. Once the new machine is ready I will put disks 1 and 9 back in, and try to rebuild disk 7,

    or maybe I can extract files from the 'virtual' disk 7. Yes, I still have the physical disk 7, but it is bricked.

     

    3. If step 2 fails, I will back up files off of another 1TB disk that is the same model and same

    firmware and swap controller boards with the dead drive. Perhaps I can revive it.

     

    4. If that fails as well... well, I hope I can somehow figure out what was on it and

    get that content back. Good thing is all of my critical files are backed up on the second server.

     

    Question, suppose I can copy files off Disks 1 and 9, is there an efficient way to tell if all files are there

    and if there are any corruptions? Or should I just compare the 'used' space in the original config with the

    one on the new drive, and then go check each individual file manually?
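On the question of checking that everything made it across: one hedged option is to compare checksums of the source and the copy instead of comparing used space. The sketch below assumes `md5sum`, `find`, `sort`, `diff` and `mktemp` are available on the machine doing the check; `checksum_tree` and `compare_trees` are hypothetical helper names, not unRAID tools. It can only confirm that the copy matches what was read from the source, not that a file was healthy on the original disk.

```shell
#!/bin/sh
# Sketch: compare two directory trees by file list and MD5 checksum.

# Print "checksum  ./relative/path" for every file under DIR, sorted by path.
checksum_tree() {
    # Run inside the tree so the paths are relative and comparable.
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# compare_trees SRC DST
# Prints the lines that differ; no output and exit status 0 means the trees match.
compare_trees() {
    src_list=$(mktemp) && dst_list=$(mktemp) || return 2
    checksum_tree "$1" > "$src_list"
    checksum_tree "$2" > "$dst_list"
    diff "$src_list" "$dst_list"
    status=$?
    rm -f "$src_list" "$dst_list"
    return $status
}

# Example call (both paths are placeholders, adjust to your mounts):
#   compare_trees /mnt/disk1 /path/to/copy/of/disk1
```

A file missing from the copy, or one that could not be read, shows up as a line present on only one side of the diff.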

     

    Thanks for your support bjp999.

  3. Would it maybe make sense to stop the rebuild, upgrade the hardware and then simply copy

    the files off Disk 1 and Disk 7 to another server? I suppose at this point I don't care

    how tedious or time consuming the process will be, the only thing I want is to not lose

    any or much of my data. Desperately need some expert advice.

  4.  

    Can you access disk7 (I'm not sure, when doing this type of operation, if the disk being reconstructed can be accessed, but I think it probably can.)  If so, you might want to look at some of the earliest written files to the disk and see if they look good.  The new disk7 seems pretty happy.

     

     

     

    Yes, last night I was able to access Disk 7. Though it was extremely slow. After clicking on the drive it took several minutes

    before the directory under the drive appeared.

  5. The errors next to disk 1 are increasing. Initially the count went to 27000

    and then stopped. Later, when I came home, the count was 47K. It was about the same

    this morning, and then it increased. Currently the count is at 49,600. So,

    I would say it is sporadic.

     

    Here is the smart test report on Disk 1:

     

     

     

    Statistics for /dev/sdl 00R_WD-WCAVY0252674

     

    smartctl -a -d ata /dev/sdl

    smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen

    Home page is http://smartmontools.sourceforge.net/

     

    === START OF INFORMATION SECTION ===

    Device Model:     WDC WD20EADS-00R6B0

    Firmware Version: 01.00A01

    User Capacity:    2,000,398,934,016 bytes

    Device is:        Not in smartctl database [for details use: -P showall]

    ATA Version is:   8

    ATA Standard is:  Exact ATA specification draft version not indicated

    Local Time is:    Wed Feb  9 09:57:47 2011 EST

    SMART support is: Available - device has SMART capability.

    SMART support is: Enabled

     

    === START OF READ SMART DATA SECTION ===

    SMART overall-health self-assessment test result: PASSED

     

    General SMART Values:

    Offline data collection status:  (0x85) Offline data collection activity

    was aborted by an interrupting command from host.

    Auto Offline Data Collection: Enabled.

    Self-test execution status:      (  41) The self-test routine was interrupted

    by the host with a hard or soft reset.

    Total time to complete Offline

    data collection: (40800) seconds.

    Offline data collection

    capabilities: (0x7b) SMART execute Offline immediate.

    Auto Offline data collection on/off support.

    Suspend Offline collection upon new

    command.

    Offline surface scan supported.

    Self-test supported.

    Conveyance Self-test supported.

    Selective Self-test supported.

    SMART capabilities:            (0x0003) Saves SMART data before entering

    power-saving mode.

    Supports SMART auto save timer.

    Error logging capability:        (0x01) Error logging supported.

    General Purpose Logging supported.

    Short self-test routine

    recommended polling time: (   2) minutes.

    Extended self-test routine

    recommended polling time: ( 255) minutes.

    Conveyance self-test routine

    recommended polling time: (   5) minutes.

    SCT capabilities:       (0x303f) SCT Status supported.

    SCT Feature Control supported.

    SCT Data Table supported.

     

    SMART Attributes Data Structure revision number: 16

    Vendor Specific SMART Attributes with Thresholds:

    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

     1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       204446

     3 Spin_Up_Time            0x0027   149   148   021    Pre-fail  Always       -       9541

     4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       75

     5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

     7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0

     9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12088

    10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0

    11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0

    12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       53

    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7

    193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       67

    194 Temperature_Celsius     0x0022   127   114   000    Old_age   Always       -       25

    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

    197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1374

    198 Offline_Uncorrectable   0x0030   199   196   000    Old_age   Offline      -       394

    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

    200 Multi_Zone_Error_Rate   0x0008   025   001   000    Old_age   Offline      -       35182

     

    SMART Error Log Version: 1

    No Errors Logged

     

    SMART Self-test log structure revision number 1

    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

    # 1  Short offline       Interrupted (host reset)      90%     12088         -

    # 2  Short offline       Interrupted (host reset)      90%     12085         -

    # 3  Short offline       Completed: read failure       10%     12023         461689921

    # 4  Short offline       Completed: read failure       10%     12023         551849266

    # 5  Short offline       Completed without error       00%      6976         -

    # 6  Short offline       Completed without error       00%      5520         -

    # 7  Short offline       Completed without error       00%      4787         -

    # 8  Short offline       Completed without error       00%      4761         -

    # 9  Short offline       Completed without error       00%      4711         -

    #10  Short offline       Completed without error       00%      4710         -

     

    SMART Selective self-test log data structure revision number 1

    SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

       1        0        0  Not_testing

       2        0        0  Not_testing

       3        0        0  Not_testing

       4        0        0  Not_testing

       5        0        0  Not_testing

    Selective self-test flags (0x0):

     After scanning selected spans, do NOT read-scan remainder of disk.

    If Selective self-test is pending on power-up, resume after 0 minute delay.

     

     

  6. The rebuild has not been going too well guys. Seems like Disk 1 is also having problems.

    I captured current syslog and it's pretty much all red. Can someone please take a look

    and give me a general idea what seems to be the problem?

     

     

    So, I have new hardware on the way. Should I just stop now and perform the rebuild

    on the new hardware or let this one finish? I am afraid though that something else

    will break by the time it's done.

     

    The speed goes up to 14000-23000KB/s briefly and stays below 500KB/s most of the time.

    Overnight the progress went from 32.3% to 39%. At this rate it will take it

    a month to complete...
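For a rough feel of the remaining time, the two progress readings can be extrapolated linearly. This is only a sketch: `eta_hours` is a hypothetical helper, and the 12 hours between readings is an assumption, since the post only says "overnight". With a drive that keeps timing out the real rate may well be worse.

```shell
#!/bin/sh
# eta_hours START_PCT END_PCT ELAPSED_HOURS
# Linear estimate of the hours left, based on two progress readings.
eta_hours() {
    LC_ALL=C awk -v s="$1" -v e="$2" -v h="$3" 'BEGIN {
        # No progress (or a bad interval) means no meaningful estimate.
        if (e <= s || h <= 0) { print "n/a"; exit 1 }
        rate = (e - s) / h              # percent per hour
        printf "%.1f\n", (100 - e) / rate
    }'
}

# Example with the readings from this post and an assumed 12-hour gap:
#   eta_hours 32.3 39 12
```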

     

     

    Feb 9 07:14:52 Tower kernel: ata11: hard resetting link

    Feb 9 07:14:53 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 07:14:53 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 07:14:53 Tower kernel: ata11.00: device reported invalid CHS sector 0

    Feb 9 07:14:53 Tower kernel: ata11: EH complete

    Feb 9 07:15:23 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 07:15:23 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 07:15:23 Tower kernel: ata11.00: cmd 25/00:00:27:d9:0c/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 07:15:23 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

    Feb 9 07:15:23 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 07:15:23 Tower kernel: ata11: hard resetting link

    Feb 9 07:15:25 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 07:15:25 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 07:15:25 Tower kernel: ata11.00: device reported invalid CHS sector 0

    Feb 9 07:15:25 Tower kernel: ata11: EH complete

    Feb 9 07:15:55 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 07:15:55 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 07:15:55 Tower kernel: ata11.00: cmd 25/00:00:27:d9:0c/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 07:15:55 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

    Feb 9 07:15:55 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 07:15:55 Tower kernel: ata11: hard resetting link

    Feb 9 07:15:56 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 07:15:56 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 07:15:56 Tower kernel: ata11.00: device reported invalid CHS sector 0

    Feb 9 07:15:56 Tower kernel: ata11: EH complete

    Feb 9 07:16:27 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 07:16:27 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 07:16:27 Tower kernel: ata11.00: cmd 25/00:00:27:d9:0c/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 07:16:27 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

    Feb 9 07:16:27 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 07:16:27 Tower kernel: ata11: hard resetting link

     

     

    ..........................

     

    Feb 9 08:01:16 Tower kernel: handle_stripe read error: 1529839464/1, count: 1

    Feb 9 08:01:16 Tower kernel: md: disk1 read error

    Feb 9 08:01:16 Tower kernel: handle_stripe read error: 1529839472/1, count: 1

    Feb 9 08:02:02 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 08:02:02 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:02:02 Tower kernel: ata11.00: cmd 25/00:00:b7:9e:2f/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:02:02 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

    Feb 9 08:02:02 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 08:02:02 Tower kernel: ata11: hard resetting link

    Feb 9 08:02:02 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 08:02:02 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:02:02 Tower kernel: ata11.00: device reported invalid CHS sector 0

    Feb 9 08:02:02 Tower kernel: ata11: EH complete

    Feb 9 08:04:26 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

    Feb 9 08:04:26 Tower kernel: ata11.00: irq_stat 0x40000001

    Feb 9 08:04:26 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:04:26 Tower kernel: ata11.00: cmd 25/00:00:57:02:31/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:04:26 Tower kernel: res 51/40:2f:21:03:31/00:03:5b:00:00/e0 Emask 0x9 (media error)

    Feb 9 08:04:26 Tower kernel: ata11.00: status: { DRDY ERR }

    Feb 9 08:04:26 Tower kernel: ata11.00: error: { UNC }

    Feb 9 08:04:26 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:04:26 Tower kernel: ata11: EH complete

    Feb 9 08:10:53 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

    Feb 9 08:10:53 Tower kernel: ata11.00: irq_stat 0x40000001

    Feb 9 08:10:53 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:10:53 Tower kernel: ata11.00: cmd 25/00:00:af:15:79/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:10:53 Tower kernel: res 51/40:7f:26:19:79/00:00:5b:00:00/e0 Emask 0x9 (media error)

    Feb 9 08:10:53 Tower kernel: ata11.00: status: { DRDY ERR }

    Feb 9 08:10:53 Tower kernel: ata11.00: error: { UNC }

    Feb 9 08:10:53 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:10:53 Tower kernel: ata11: EH complete

    Feb 9 08:19:04 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 08:19:04 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:19:04 Tower kernel: ata11.00: cmd 25/00:00:27:04:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:19:04 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

    Feb 9 08:19:04 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 08:19:04 Tower kernel: ata11: hard resetting link

    Feb 9 08:19:05 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 08:19:05 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:19:05 Tower kernel: ata11.00: device reported invalid CHS sector 0

    Feb 9 08:19:05 Tower kernel: ata11: EH complete

    Feb 9 08:19:36 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 08:19:36 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:19:36 Tower kernel: ata11.00: cmd 25/00:00:27:04:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:19:36 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

    Feb 9 08:19:36 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 08:19:36 Tower kernel: ata11: hard resetting link

    Feb 9 08:19:36 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 08:19:36 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:19:36 Tower kernel: ata11.00: device reported invalid CHS sector 0

    Feb 9 08:19:36 Tower kernel: ata11: EH complete

    Feb 9 08:20:52 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 08:20:52 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:20:52 Tower kernel: ata11.00: cmd 25/00:00:27:0d:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:20:52 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

    Feb 9 08:20:52 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 08:20:52 Tower kernel: ata11: hard resetting link

    Feb 9 08:20:53 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 08:20:53 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:20:53 Tower kernel: ata11.00: device reported invalid CHS sector 0

    Feb 9 08:20:53 Tower kernel: ata11: EH complete

    Feb 9 08:21:23 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 08:21:23 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:21:23 Tower kernel: ata11.00: cmd 25/00:00:27:0d:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:21:23 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

    Feb 9 08:21:23 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 08:21:23 Tower kernel: ata11: hard resetting link

    Feb 9 08:21:24 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 08:21:24 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:21:24 Tower kernel: ata11.00: device reported invalid CHS sector 0

    Feb 9 08:21:24 Tower kernel: ata11: EH complete

    Feb 9 08:23:58 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 08:23:58 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:23:58 Tower kernel: ata11.00: cmd 25/00:00:27:02:bf/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:23:58 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

    Feb 9 08:23:58 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 08:23:58 Tower kernel: ata11: hard resetting link

    Feb 9 08:23:59 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 08:23:59 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:23:59 Tower kernel: ata11.00: device reported invalid CHS sector 0

    Feb 9 08:23:59 Tower kernel: ata11: EH complete

    Feb 9 08:25:03 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

    Feb 9 08:25:03 Tower kernel: ata11.00: irq_stat 0x40000001

    Feb 9 08:25:03 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:25:03 Tower kernel: ata11.00: cmd 25/00:00:27:22:bf/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:25:03 Tower kernel: res 51/40:ff:26:23:bf/00:02:5b:00:00/e0 Emask 0x9 (media error)

    Feb 9 08:25:03 Tower kernel: ata11.00: status: { DRDY ERR }

    Feb 9 08:25:03 Tower kernel: ata11.00: error: { UNC }

    Feb 9 08:25:03 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:25:03 Tower kernel: ata11: EH complete

    Feb 9 08:25:20 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

    Feb 9 08:25:20 Tower kernel: ata11.00: irq_stat 0x40000001

    Feb 9 08:25:20 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:25:20 Tower kernel: ata11.00: cmd 25/00:00:27:22:bf/00:04:5b:00:00/e0 tag 0 dma 524288 in

    Feb 9 08:25:20 Tower kernel: res 51/40:9f:78:22:bf/00:03:5b:00:00/e0 Emask 0x9 (media error)

    Feb 9 08:25:20 Tower kernel: ata11.00: status: { DRDY ERR }

    Feb 9 08:25:20 Tower kernel: ata11.00: error: { UNC }

    Feb 9 08:25:20 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:25:20 Tower kernel: ata11: EH complete

    Feb 9 08:32:12 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

    Feb 9 08:32:12 Tower kernel: ata11.00: irq_stat 0x40000001

    Feb 9 08:32:12 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:32:12 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

    Feb 9 08:32:12 Tower kernel: res 51/40:27:7f:cf:2d/00:01:5c:00:00/e0 Emask 0x9 (media error)

    Feb 9 08:32:12 Tower kernel: ata11.00: status: { DRDY ERR }

    Feb 9 08:32:12 Tower kernel: ata11.00: error: { UNC }

    Feb 9 08:32:12 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:32:12 Tower kernel: ata11: EH complete

    Feb 9 08:32:35 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

    Feb 9 08:32:35 Tower kernel: ata11.00: irq_stat 0x40000001

    Feb 9 08:32:35 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:32:35 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

    Feb 9 08:32:35 Tower kernel: res 51/40:37:75:cf:2d/00:01:5c:00:00/e0 Emask 0x9 (media error)

    Feb 9 08:32:35 Tower kernel: ata11.00: status: { DRDY ERR }

    Feb 9 08:32:35 Tower kernel: ata11.00: error: { UNC }

    Feb 9 08:32:35 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:32:35 Tower kernel: ata11: EH complete

    Feb 9 08:33:05 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 08:33:05 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:33:05 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

    Feb 9 08:33:05 Tower kernel: res 40/00:37:75:cf:2d/00:01:5c:00:00/e0 Emask 0x4 (timeout)

    Feb 9 08:33:05 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 08:33:05 Tower kernel: ata11: hard resetting link

    Feb 9 08:33:06 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 08:33:06 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:33:06 Tower kernel: ata11: EH complete

    Feb 9 08:33:36 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

    Feb 9 08:33:36 Tower kernel: ata11.00: failed command: READ DMA EXT

    Feb 9 08:33:36 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

    Feb 9 08:33:36 Tower kernel: res 40/00:37:75:cf:2d/00:01:5c:00:00/e0 Emask 0x4 (timeout)

    Feb 9 08:33:36 Tower kernel: ata11.00: status: { DRDY }

    Feb 9 08:33:36 Tower kernel: ata11: hard resetting link

    Feb 9 08:33:37 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

    Feb 9 08:33:37 Tower kernel: ata11.00: configured for UDMA/33

    Feb 9 08:33:37 Tower kernel: ata11: EH complete

     

    Total lines: 3000
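With 3000 lines of this, a count per error type is easier to read than the raw log. The sketch below is a hypothetical helper (`summarize_ata_errors`) that matches only the message strings visible in the excerpt above; the syslog path in the usage comment is an assumption. As a general rule, `UNC` media errors are uncorrectable reads reported by the drive itself, while timeouts followed by link resets can also come from cabling, power or the controller.

```shell
#!/bin/sh
# Read syslog text on stdin and count the error patterns seen in the excerpt.
summarize_ata_errors() {
    awk '
        /Emask 0x4 \(timeout\)/     { timeouts++ }
        /Emask 0x9 \(media error\)/ { media++ }
        /hard resetting link/       { resets++ }
        /md: disk[0-9]+ read error/ { md_errors++ }
        END {
            printf "timeouts=%d media_errors=%d link_resets=%d md_read_errors=%d\n",
                   timeouts, media, resets, md_errors
        }'
}

# Example (path is an assumption):
#   summarize_ata_errors < /var/log/syslog
```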

  7. Good to know that I can repeat the process again with the new hardware.

    As far as the reallocated sectors, I checked every drive and I see 5 drives with reallocated sectors,

    ranging from 1 to 5. The drive that failed first, which I checked later, had 1700 sectors pending reallocation,

    but no reallocated sectors.
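Checking every drive by hand gets tedious, so here is a small sketch for pulling one raw value out of the attribute table that `smartctl -a` prints. `smart_raw` is a hypothetical helper; it assumes the ten-column table layout shown in the report earlier in this thread, and the device glob in the usage comment is an assumption.

```shell
#!/bin/sh
# smart_raw ID
# Read smartctl output on stdin and print the RAW_VALUE column of attribute ID.
# Requiring at least 10 fields skips the selective self-test rows, which also
# start with a small number.
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10 }'
}

# Example loop (5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector):
#   for dev in /dev/sd?; do
#       report=$(smartctl -a -d ata "$dev")
#       printf '%s reallocated=%s pending=%s\n' "$dev" \
#           "$(printf '%s\n' "$report" | smart_raw 5)" \
#           "$(printf '%s\n' "$report" | smart_raw 197)"
#   done
```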

  8. I bet you are right about the hardware. But I have no one but myself to blame. I have been having bouts of problems with this server from time to time, like once a year, but then it stabilizes and I keep delaying the hardware upgrade. A while ago, after some troubleshooting, I discovered that my mobo (P5BVM-DO) all of a sudden refused to work with the RAM that was installed. Eventually I replaced the RAM but left only one stick, as with two sticks the system would not boot. As soon as I recover the data I will upgrade the hardware. Perhaps I should have done that before following the steps that you outlined.

  9. bjp999, thank you so much for the detailed response. I followed the steps and so far so good... I think.

    The writes on drive 7 are increasing, and the reads on all other drives are increasing. Though, I do see

    some minor write increases on other drives as well. The speed started off at 16,000KB/s then went down

    to 200KB/s and then ramped back up to 19,500KB/s... I am letting it run and let's see what happens.

  10. Need some help here. I am running version 4.5.6.

    Two days ago a 2TB drive (1736 in slot 9) got spun down by itself

    and had a green blinking ball next to it. I rebooted the server and after the reboot the drive

    had a red ball next to it. I got a new drive, precleared it on a different server then replaced

    the bad drive and started the rebuild process. One day into rebuild I noticed the speed was

    14KB/s, barely crawling. The drive in slot 2 had numerous errors. I refreshed the page and

    the server became unresponsive. So I hard rebooted it.

     

    This time another drive (PHBV in slot 7) became red and was reported as missing. I powered the

    server down, re-seated the cables and powered the server on. No change.

    I took both drives out and put them into my other server. The PHBV is dead completely, perhaps

    the controller board went dead. The 1736 I can mount, short smart test shows 1700 pending reallocations.

    I decided to put the 1736 back into the original server and try to rebuild the PHBV drive, but now

    Slot 9 wants the 1007 drive, which is the replacement that I bought for the 1736 in slot 9.

    I attached the Disk Status page. Any advice on how I can rebuild the PHBV drive now and then

    rebuild the 1736 drive? Probably wishful thinking. In that case, how can I copy the data from this

    drive? If I am able to mount, I should be able to copy. Just can't figure out how. As for the other

    drive, perhaps the controller got fried and I may be able to revive it...

     

    Did not capture the system log originally, unfortunately.

  11. spasszeit, you have been running this for a while now, any idea what your typical CPU temperatures are?  I put it all together last night (not started the array yet, the 4 in 3 fan is a 3 pin and there are only 4 pin headers on the motherboard) and used the stock HSF that came with the 3430. I just want to know what I should be expecting, so I know when to hit the panic button if I need to, lol.

     

    Congrats on finishing off your build. Enjoy it.

    Since SM's measurement of CPU temps is strange - all it says is 'low' under the PC health tab in the IPMI dashboard,

    the only other way I can check the temps is in Unmenu. Here you go:

    coretemp-isa-0000

    Adapter: ISA adapter

    Core 0:      +35.0 C  (high = +84.0 C, crit = +100.0 C)  

     

    coretemp-isa-0001

    Adapter: ISA adapter

    Core 1:      +33.0 C  (high = +84.0 C, crit = +100.0 C)  

     

    coretemp-isa-0002

    Adapter: ISA adapter

    Core 2:      +35.0 C  (high = +84.0 C, crit = +100.0 C)  

     

    coretemp-isa-0003

    Adapter: ISA adapter

    Core 3:      +32.0 C  (high = +84.0 C, crit = +100.0 C)

     

    The ambient temps are about 22-24 C I would guess. I am using stock HSF as well. Front of case has 2x 120mm intake fans,

    and rear has 2x 80mm exhaust fans.

  12. Now that I got my second server up and running, I would like to set up a scheduled back up of certain shares on Tower1 to Tower2.

    I think I have the basics of the rsyncd.conf syntax down, and I am able to sync the Photos share (for now) manually, but when it comes to

    automating all this I am quite in over my head, so I'd really appreciate some guidance on this.

     

    Here is what I am doing and questions I have:

     

    1. Following JoeL's examples, I set up rsyncd.conf file on Tower2:

    uid            = root

    gid            = root

    use chroot      = no

    max connections = 4

    pid file        = /var/run/rsyncd.pid

    timeout        = 600

    log file        = /var/log/rsyncd.log

     

    [Photos]

        path = /mnt/user/media/Backups/Photos

        comment = /mnt files

        read only = FALSE

     

     

    2. Automatically invoke rsync daemon process on Tower2 every time the server is rebooted.

    So, manually the daemon is invoked with this command:

    rsync --daemon --config=/boot/config/rsyncd.conf

     

    Should it be added to the go script? It would make sense, but I am curious

    why I don't see this command in the 'go' script in the example from this thread - http://lime-technology.com/forum/index.php?topic=3417.0

     

    3. To start the rsync process based on some schedule, I understand I need to

    add something similar to this cron job to 'go' script on Tower1:

     

    #set up rsync between the two servers every other day at 3 am - will be commented out for Server2 go script

    echo "0 3 2-6,8-13,15-20,21-31 * * /usr/bin/rsync rsync://Server2/disk1/*" >>/tmp/crontab

    echo "0 3 2-6,8-13,15-20,21-31 * * /usr/bin/rsync rsync://Server2/disk2/*" >>/tmp/crontab

    echo "0 3 2-6,8-13,15-20,21-31 * * /usr/bin/rsync rsync://Server2/disk3/*" >>/tmp/crontab

     

    Say if in my case I want to build upon this manual command to do daily backups:

    cd /mnt

    rsync -avrH user/media/Photos tower2::Photos

     

    What should my entry be? I am trying to make sense of the example above but I am not sure I get all the syntax yet.

     

    Anything else I am missing?
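For questions 2 and 3, here is one hedged way it could be wired up, building only on the commands already quoted above. `add_line_once` and `photos_cron_line` are hypothetical helpers, `/boot/config/go` is assumed to be the location of the 'go' script, and the log file name is made up for illustration. Since `-a` already implies `-r`, `-avH` behaves the same as the `-avrH` used manually. The five schedule fields `0 3 * * *` mean 03:00 every day.

```shell
#!/bin/sh
# Append LINE to FILE unless an identical line is already present,
# so re-running the setup does not duplicate the entry.
add_line_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Print a cron entry that runs the Photos backup daily at 03:00.
# The log path is a placeholder.
photos_cron_line() {
    printf '%s\n' '0 3 * * * cd /mnt && /usr/bin/rsync -avH user/media/Photos tower2::Photos >> /var/log/rsync-photos.log 2>&1'
}

# Possible usage (paths are assumptions, review before running):
#   On Tower2, start the daemon at boot:
#     add_line_once /boot/config/go 'rsync --daemon --config=/boot/config/rsyncd.conf'
#   On Tower1, queue the cron entry the same way the quoted example does:
#     photos_cron_line >> /tmp/crontab
```

Keep in mind that loading a file with `crontab /tmp/crontab` replaces the whole current table, so it is safer to start that file from the output of `crontab -l`.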

     

     

     

  13. As for deciding if ECC functionality is worth it, that all comes down to personal preferences. I haven't had issues with memory glitches that I am aware of, but then my server doesn't get the workout that enterprise production servers do.

     

    I, on the other hand, have had my share of problems with standard RAM sticks on my first unRaid build on P5BVM-DO.

    Still not sure what went wrong there. All of a sudden I started seeing numerous errors in the log, system freezes, etc.

    Eventually I narrowed the problem down to RAM and ended up exchanging it, but I run only one stick, because with two sticks

    of new RAM the system wouldn't boot. It took me two weeks to get the server stable and problem free. But from what I

    see on the forums the issues I experienced are very uncommon.

  14. Thanks very much for testing that, that's put my mind at ease :)

     

    I'm still unsure what CPU to use, originally I was going to go with a Core i3 530, but you have to use ECC memory with this motherboard anyway, and the ECC only works with Xeon processors apparently, but the X3430 is nearly double the price (£156 for the X3430 vs £83 for the i3 530).

     

    Whatever I get is going to have more processing power than I have now (an old AMD FX55) so really it's just a case of trying to justify getting a Xeon, is the ECC worth the extra?  I believe the Xeons also open up some options for VM hypervisor or something, but not sure I'd have any need for this.

     

    Any opinions?

     

    My basic reason for going with Xeon was that since I am buying a server grade mobo with ECC memory, I might as well buy a server grade CPU and take advantage of ECC. I've read somewhere that ECC memory provides greater stability and reliability, hence it is a must for mission critical applications. My unRaid has become pretty mission critical for the members of my family:-) Whenever it is down for maintenance or break-fixing, I get bombarded by complaints.

  15. I could have sworn I saw a response in this thread (maybe it was a different one) saying the i3 530 will work and is compatible with ECC memory, just that the ECC functionality won't be used. It included several links, one to a rather nice tested review too.

     

    LOL... that was me... I misread Kode's post and thought he'd said he wasn't sure if the i3 530 would work with ECC memory.

     

    Since you mentioned it, here is the link to that review. It really is very thoughtful and nicely written:

     

    http://www.servethehome.com/supermicro-x8silf-motherboard-v102-review-whs-v2-diy-server/

  16. Thanks very much for testing that, that's put my mind at ease :)

     

    I'm still unsure what CPU to use, originally I was going to go with a Core i3 530, but you have to use ECC memory with this motherboard anyway, and the ECC only works with Xeon processors apparently, but the X3430 is nearly double the price (£156 for the X3430 vs £83 for the i3 530).

     

    Whatever I get is going to have more processing power than I have now (an old AMD FX55) so really it's just a case of trying to justify getting a Xeon, is the ECC worth the extra?  I believe the Xeons also open up some options for VM hypervisor or something, but not sure I'd have any need for this.

     

    Any opinions?

     

    Another alternative for you would be L3406. Albeit it is also priced much higher than the i3.

  17. Thanks adelias for the info on the 1 stick of memory, I don't really want to get 2 sticks straight off. spasszeit, any chance of testing with 1 stick again? :P

     

    No problem. Channel 1 (blue) slots don't work with one stick; all I get is long beeps and no POST.

    Channel 2 (black) slots each work with 1 stick. From the get-go I put sticks (both, and 1 at a time)

    into channel 1 slots and assumed the same behavior for channel 2 slots... Now, looking at the manual,

    I see a reference to one channel taking 2 populated slots and one channel also taking 1...

    but I am a typical guy, I hate reading manuals:-)

  18. Also, I tried booting the board with only one stick and it won't boot. Gives me a long beep, which means memory problem according to the manual. My other boards work with one stick just fine. As a matter of fact, my P5BVM-DO doesn't like two sticks, so I have been running it with just one 2GB stick.

     

    I have this board and am running it with a single 2GB UDIMM in slot DIMM1A.

    Brand?  Model?  Link?

     

    Micron MT18JSF25672AY from eBay. It was on Supermicro's tested memory list.

     

    Interesting. It refused one stick of my Crucial memory. Did you get the long beep at all?

     

     

    No long beep. Did you use slot DIMM1A as it states in the manual? Also what revision is your board?

    Mine is 1.02.

     

    Not sure. It's quite possible I stuck it in DIMM2A. I did not reference the manual for that.

  19. Also, I tried booting the board with only one stick and it won't boot. Gives me a long beep, which means memory problem according to the manual. My other boards work with one stick just fine. As a matter of fact, my P5BVM-DO doesn't like two sticks, so I have been running it with just one 2GB stick.

     

    I have this board and am running it with a single 2GB UDIMM in slot DIMM1A.

    Brand?  Model?  Link?

     

    Micron MT18JSF25672AY from eBay. It was on Supermicro's tested memory list.

     

    Interesting. It refused one stick of my Crucial memory. Did you get the long beep at all?

     

  20. Added WD20EARS as parity and recalculated the parity.

     

     

    Aug 25 20:09:59 Tower2 kernel: mdcmd (379): spinup 0

    Aug 25 20:09:59 Tower2 kernel:

    Aug 25 20:10:00 Tower2 kernel: mdcmd (383): spinup 0

    Aug 25 20:10:00 Tower2 kernel:

    Aug 25 20:10:26 Tower2 kernel: mdcmd (388): spinup 0

    Aug 25 20:10:26 Tower2 kernel:

    Aug 26 03:00:46 Tower2 kernel: md: sync done. time=28083sec rate=69562K/sec

    Aug 26 03:00:46 Tower2 kernel: md: recovery thread sync completion status: 0

     

     

    A bit of improvement, vs the original sync rate using Seagate 500GB for parity:

     

    Aug 24 22:41:27 Tower2 kernel: md: sync done. time=9469sec rate=51577K/sec

     

     

    I'm assuming these are onboard SATA rates.  I wonder if there is any performance boost going through the SASLP-MV8.  My Atom averages about 55000K/sec on a parity check with 7200 rpm Hitachis, so I'm betting you could see 80 - 90 M/sec with non-green drives.

     

    Actually, 2 are on board, and 4 connected to the SASLP card. Wanted to test the card

    and left it like that afterward.
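
    As a sanity check on the two "sync done" lines above, multiplying time by rate gives the amount of data each sync covered. The sketch below assumes the K in K/sec means 1024 bytes; the variable names are made up for illustration, and the numbers are copied straight from the log lines.

```shell
#!/bin/bash
# Values copied from the two "md: sync done" log lines in this post.
new_time=28083; new_rate=69562   # WD20EARS as parity
old_time=9469;  old_rate=51577   # Seagate 500GB as parity

# time (sec) * rate (K/sec) = amount covered by the sync, in K.
new_kib=$(( new_time * new_rate ))
old_kib=$(( old_time * old_rate ))

echo "new sync covered ${new_kib} K, old sync covered ${old_kib} K"

# Rate improvement in whole percent (integer arithmetic, rounds down).
echo "rate change: $(( (new_rate - old_rate) * 100 / old_rate ))%"
```

    The totals come out at roughly 2 TB and 500 GB, which matches the sizes of the two parity drives, so the longer run time with the WD20EARS reflects the larger drive and not a slower sync.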

  21. Added WD20EARS as parity and recalculated the parity.

     

     

    Aug 25 20:09:59 Tower2 kernel: mdcmd (379): spinup 0

    Aug 25 20:09:59 Tower2 kernel:

    Aug 25 20:10:00 Tower2 kernel: mdcmd (383): spinup 0

    Aug 25 20:10:00 Tower2 kernel:

    Aug 25 20:10:26 Tower2 kernel: mdcmd (388): spinup 0

    Aug 25 20:10:26 Tower2 kernel:

    Aug 26 03:00:46 Tower2 kernel: md: sync done. time=28083sec rate=69562K/sec

    Aug 26 03:00:46 Tower2 kernel: md: recovery thread sync completion status: 0

     

     

    A bit of improvement, vs the original sync rate using Seagate 500GB for parity:

     

    Aug 24 22:41:27 Tower2 kernel: md: sync done. time=9469sec rate=51577K/sec

     

  22. I am not going to repeat pros and cons that were already mentioned,

    nor am I going to describe my very painless experience of 1 failed disk recovery,

    or how easy it is to expand the array or replace a drive with a larger one.

     

    I'll just say that I did recently seriously contemplate running a different home server software than unRaid.

    The reason was that I had a hard time booting unRaid from a flash drive on my

    new hardware purchased for a second unRaid system. I was so frustrated that I started looking at

    alternatives for a while. I considered WHS, FlexRaid, FreeNAS, Openfiler and ZFS.

    I have to tell you, to me absolutely nothing came even close to unRaid which I have been

    using since 2007. It totally satisfies my needs and is very simple to setup and maintain

    for a non-Linux-savvy user like myself.

     

    My only concern is that if I do lose more than one drive, the data on those drives will be lost.

    I have some data that I cannot lose, like family videos in HD. Until recently I had it backed up elsewhere, but since

    the size is growing fast, I need another solution. I decided to build a second server which will be located

    in a different location and where I will keep duplicates of critical data. In a way, it is similar to

    the WHS's duplication, but I don't have to duplicate everything to have some kind of protection. To me unRaid is a much more elegant solution, more stable, feature rich

    and over the past three years it did not let me down. I will not go another way unless I absolutely have to.

     

    Finally, I just don't think that duplication on WHS is worth much. In a properly protected system (non-fail surge protector,

    UPS and good ventilation) the chance of hard drive failure is very small. I had only one failed hard drive in unRaid over the past

    3 years, out of 20 drives I am currently running, from which I easily recovered. This tells me that duplication would have been

    a total waste of a lot of money. Now, if lightning struck the house or there was a fire or some other disastrous event, the

    whole system would have been destroyed and, again, that duplication would have been worth nothing in the end.

    A more prudent way to use duplication is to set up different servers and keep them as far away from each other as reasonably

    possible. Two unRaid servers would do the trick.