spasszeit

April 15, 2011

I seem to have been having problems with 4.7 which I did not have before. I run two servers

and ever since upgrading to 4.7 I get an occasional drive dropping off the array. The first time

it happened I thought the drive was bad. Re-seated the cables, that did not help - during the

system POST I see one port not being detected. When the array starts the disk is missing. I replaced

the drive with a new one and the array rebuilt itself just fine. I put the bad drive into the other

tower and ran pre-clear on it - no smart errors. Last night I had the same problem on the other

server. The problem showed up as an inability to access the web interface, though access to shares

via windows and via Telnet worked fine. I captured the syslog - see attached. There is an awful

lot of messages related to, I think, my SM AOC-SASLP-MV8 and a disk. I run two AOC-SASLP-MV8

cards in each server.

Now, my friend built a server recently as well. He is running one AOC-SASLP-MV8 card, and he mentioned

to me that ever since upgrading to 4.7 he has also experienced a problem like mine.

I hope Tom can look into this and advise on what's going on here and how we can address it.

Syslog_Apr_14.zip

February 22, 2011

Yes, I did two builds this year based on X8SIL-F motherboards and ordered most of the parts from them.

No complaints. Very quick shipping. Packaged well.

February 22, 2011

Also noticed that you could save substantially on the Supermicro card and 3Ware cables by buying from

Provantage... at least $45 in aggregate. Did not check your other hardware.

February 10, 2011

Disk1 is a "real" disk. You can try to copy its contents to another drive on a workstation (I would not recommend writing anything to the array right now).

You could remove disk1 from the server, mount in an external USB, and attach to a workstation computer. There is a reiserfs program you can run that will give you read-only access to the disk. If you fear your unRAID computer is failing, this would be a way to eliminate the unRAID server. If the copy works smoothly you may be justified in doubting the server hardware. But if it is slow and problematic, you have confirmed the disk really is bad.

Disk7 is impossible to rebuild without disk1 in the array. If you were able to get disk1 working well in the step above, it would mean that your unRAID server is suspect and you should upgrade it. Once it is rebuilt, you can retry the rebuild.

But if disk1 is bad, there is not much I know to do. Perhaps you could send it to some recovery service that could create an image copy of the disk.

Is the physical disk that used to be in slot7 still around? If so, you might have more luck recovering from that than recovering from the array with the failing disk1.

I have a very strong suspicion that, like you said before, it may be PSU related. The disks are dropping like flies. There must be something with the hardware that is causing this epidemic of failures. I did not mention this before, but during one of the reboots the other night another two drives went missing. I powered down the server, re-seated the power and SATA cables and powered it up. The drives, thank God, came back up. It finally dawned on me that I should not take any more chances, so I stopped the rebuild and powered down the server. I don't rule out other hardware problems either, especially since I know I've had memory/mobo issues before. So, I decided to rebuild the entire machine. It's been in my plans anyway. The new hardware is on the way so I should have the new build next week.

As of right now, my plan is as follows:

1. Take out disks 1 and 9 and copy files off of them to my second unraid machine.

I am looking to find a way to do it via some kind of gui as my command line skills are lacking

and I don't trust them 100%.

Your suggestion to copy files to a Windows machine sounds like something I'd like to try,

but I am not sure I want to do that as I have two empty 2TB drives in my second unRaid.

2. Once the new machine is ready I will put disks 1 and 9 back in, and try to rebuild disk 7,

or maybe I can extract files from the 'virtual' disk 7. Yes, I still have the physical disk 7, but it is bricked.

3. If step 2 fails, I will back up files off of another 1TB disk that is the same model and same

firmware and swap controller boards with the dead drive. Perhaps I can revive it.

4. If that fails as well... well, I hope I can somehow figure out what was on it and

get that content back. Good thing is all of my critical files are backed up on the second server.

Question, suppose I can copy files off Disks 1 and 9, is there an efficient way to tell if all files are there

and if there are any corruptions? Or should I just compare the 'used' space in the original config with the

one on the new drive, and then go check each individual file manually?
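One way to check a copy without opening every file is to compare checksums of the source and the destination. The sketch below is only an illustration, not something from this thread: the function names `make_manifest` and `compare_trees` are made up, and it assumes `find`, `md5sum`, `sort`, `mktemp` and `diff` are available on the machine doing the comparison.

```shell
# Hypothetical helper: print "checksum  ./relative/path" for every file
# under the given directory, sorted by path so two listings line up.
make_manifest() {
    # Subshell keeps the caller's working directory unchanged.
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# Hypothetical helper: compare two directory trees by checksum.
# Prints the differing manifest lines; returns 0 only if they match.
compare_trees() {
    [ -d "$1" ] && [ -d "$2" ] || return 2
    src_list=$(mktemp) && dst_list=$(mktemp) || return 2
    make_manifest "$1" > "$src_list"
    make_manifest "$2" > "$dst_list"
    diff "$src_list" "$dst_list"
    status=$?
    rm -f "$src_list" "$dst_list"
    return "$status"
}
```

It would be called as `compare_trees /path/to/original /path/to/copy` (both paths are placeholders). Any file that is missing, or whose contents differ, shows up as a line in the diff output. Note that files the failing disk cannot read will produce read errors on the source side, which is itself a useful list of what was lost.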

Thanks for your support bjp999.

February 9, 2011

Would it maybe make sense to stop the rebuild, upgrade the hardware and then simply copy

the files off Disk 1 and Disk 7 to another server? I suppose at this point I don't care

how retarded or time consuming the process will be, the only thing I want is to not lose

any or much of my data. Desperately need some expert advice.

February 9, 2011

Can you access disk7? (I'm not sure, when doing this type of operation, if the disk being reconstructed can be accessed, but I think it probably can.) If so, you might want to look at some of the earliest written files on the disk and see if they look good. The new disk7 seems pretty happy.

Yes, last night I was able to access Disk 7. Though it was extremely slow. After clicking on the drive it took several minutes

before the directory under the drive appeared.

February 9, 2011

The errors next to disk 1 are increasing. Initially the count went to 27000

and then stopped. Later, when I came home, the count was 47K. It was about the same

this morning, and then it increased. Currently the count is at 49,600. So,

I would say it is sporadic.

Here is the smart test report on Disk 1:

Statistics for /dev/sdl 00R_WD-WCAVY0252674

smartctl -a -d ata /dev/sdl

Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===

Device Model: WDC WD20EADS-00R6B0

Firmware Version: 01.00A01

User Capacity: 2,000,398,934,016 bytes

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: 8

ATA Standard is: Exact ATA specification draft version not indicated

Local Time is: Wed Feb 9 09:57:47 2011 EST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x85) Offline data collection activity

was aborted by an interrupting command from host.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 41) The self-test routine was interrupted

by the host with a hard or soft reset.

Total time to complete Offline

data collection: (40800) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 2) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: ( 5) minutes.

SCT capabilities: (0x303f) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x002f 199 199 051 Pre-fail Always - 204446

3 Spin_Up_Time 0x0027 149 148 021 Pre-fail Always - 9541

4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 75

5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0

7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0

9 Power_On_Hours 0x0032 084 084 000 Old_age Always - 12088

10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0

11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0

12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 53

192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 7

193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 67

194 Temperature_Celsius 0x0022 127 114 000 Old_age Always - 25

196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

197 Current_Pending_Sector 0x0032 196 196 000 Old_age Always - 1374

198 Offline_Uncorrectable 0x0030 199 196 000 Old_age Offline - 394

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

200 Multi_Zone_Error_Rate 0x0008 025 001 000 Old_age Offline - 35182

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Short offline Interrupted (host reset) 90% 12088 -

# 2 Short offline Interrupted (host reset) 90% 12085 -

# 3 Short offline Completed: read failure 10% 12023 461689921

# 4 Short offline Completed: read failure 10% 12023 551849266

# 5 Short offline Completed without error 00% 6976 -

# 6 Short offline Completed without error 00% 5520 -

# 7 Short offline Completed without error 00% 4787 -

# 8 Short offline Completed without error 00% 4761 -

# 9 Short offline Completed without error 00% 4711 -

#10 Short offline Completed without error 00% 4710 -

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
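For a report this long, the three attributes most often tied to failing media (5, 197 and 198) are easy to miss. A minimal sketch for pulling just those raw values out of `smartctl` output follows; the function name `smart_sector_summary` is made up for illustration, and it assumes the ten-column attribute table layout shown above.

```shell
# Hypothetical helper: read `smartctl -A` (or -a) output on stdin and print
# NAME=RAW_VALUE for attributes 5, 197 and 198.
smart_sector_summary() {
    # Attribute rows have at least 10 fields and a hex FLAG in column 3;
    # that check skips unrelated rows such as the selective self-test spans.
    awk 'NF >= 10 && $3 ~ /^0x[0-9a-fA-F]+$/ &&
         ($1 == 5 || $1 == 197 || $1 == 198) { print $2 "=" $10 }'
}
```

Used as `smartctl -A -d ata /dev/sdl | smart_sector_summary`, the report above would reduce to Reallocated_Sector_Ct=0, Current_Pending_Sector=1374 and Offline_Uncorrectable=394. Non-zero pending and uncorrectable counts mean the drive itself has sectors it could not read, even though the overall health line still says PASSED.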

February 9, 2011

The rebuild has not been going too well guys. Seems like Disk 1 is also having problems.

I captured current syslog and it's pretty much all red. Can someone please take a look

and give me a general idea what seems to be the problem?

So, I have new hardware on the way. Should I just stop now and perform the rebuild

on the new hardware or let this one finish? I am afraid though that something else

will break by the time it's done.

The speed goes up to 14,000-23,000KB/s briefly and stays below 500KB/s most of the time.

Over last night the progress was from 32.3% to 39% today. At this rate it will take it

a month to complete...

Feb 9 07:14:52 Tower kernel: ata11: hard resetting link

Feb 9 07:14:53 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 07:14:53 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 07:14:53 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 07:14:53 Tower kernel: ata11: EH complete

Feb 9 07:15:23 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 07:15:23 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 07:15:23 Tower kernel: ata11.00: cmd 25/00:00:27:d9:0c/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 07:15:23 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 07:15:23 Tower kernel: ata11.00: status: { DRDY }

Feb 9 07:15:23 Tower kernel: ata11: hard resetting link

Feb 9 07:15:25 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 07:15:25 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 07:15:25 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 07:15:25 Tower kernel: ata11: EH complete

Feb 9 07:15:55 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 07:15:55 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 07:15:55 Tower kernel: ata11.00: cmd 25/00:00:27:d9:0c/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 07:15:55 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 07:15:55 Tower kernel: ata11.00: status: { DRDY }

Feb 9 07:15:55 Tower kernel: ata11: hard resetting link

Feb 9 07:15:56 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 07:15:56 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 07:15:56 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 07:15:56 Tower kernel: ata11: EH complete

Feb 9 07:16:27 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 07:16:27 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 07:16:27 Tower kernel: ata11.00: cmd 25/00:00:27:d9:0c/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 07:16:27 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 07:16:27 Tower kernel: ata11.00: status: { DRDY }

Feb 9 07:16:27 Tower kernel: ata11: hard resetting link

..........................

Feb 9 08:01:16 Tower kernel: handle_stripe read error: 1529839464/1, count: 1

Feb 9 08:01:16 Tower kernel: md: disk1 read error

Feb 9 08:01:16 Tower kernel: handle_stripe read error: 1529839472/1, count: 1

Feb 9 08:02:02 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:02:02 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:02:02 Tower kernel: ata11.00: cmd 25/00:00:b7:9e:2f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:02:02 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:02:02 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:02:02 Tower kernel: ata11: hard resetting link

Feb 9 08:02:02 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:02:02 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:02:02 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:02:02 Tower kernel: ata11: EH complete

Feb 9 08:04:26 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:04:26 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:04:26 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:04:26 Tower kernel: ata11.00: cmd 25/00:00:57:02:31/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:04:26 Tower kernel: res 51/40:2f:21:03:31/00:03:5b:00:00/e0 Emask 0x9 (media error)

Feb 9 08:04:26 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:04:26 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:04:26 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:04:26 Tower kernel: ata11: EH complete

Feb 9 08:10:53 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:10:53 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:10:53 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:10:53 Tower kernel: ata11.00: cmd 25/00:00:af:15:79/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:10:53 Tower kernel: res 51/40:7f:26:19:79/00:00:5b:00:00/e0 Emask 0x9 (media error)

Feb 9 08:10:53 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:10:53 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:10:53 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:10:53 Tower kernel: ata11: EH complete

Feb 9 08:19:04 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:19:04 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:19:04 Tower kernel: ata11.00: cmd 25/00:00:27:04:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:19:04 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:19:04 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:19:04 Tower kernel: ata11: hard resetting link

Feb 9 08:19:05 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:19:05 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:19:05 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:19:05 Tower kernel: ata11: EH complete

Feb 9 08:19:36 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:19:36 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:19:36 Tower kernel: ata11.00: cmd 25/00:00:27:04:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:19:36 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:19:36 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:19:36 Tower kernel: ata11: hard resetting link

Feb 9 08:19:36 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:19:36 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:19:36 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:19:36 Tower kernel: ata11: EH complete

Feb 9 08:20:52 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:20:52 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:20:52 Tower kernel: ata11.00: cmd 25/00:00:27:0d:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:20:52 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:20:52 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:20:52 Tower kernel: ata11: hard resetting link

Feb 9 08:20:53 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:20:53 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:20:53 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:20:53 Tower kernel: ata11: EH complete

Feb 9 08:21:23 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:21:23 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:21:23 Tower kernel: ata11.00: cmd 25/00:00:27:0d:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:21:23 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:21:23 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:21:23 Tower kernel: ata11: hard resetting link

Feb 9 08:21:24 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:21:24 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:21:24 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:21:24 Tower kernel: ata11: EH complete

Feb 9 08:23:58 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:23:58 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:23:58 Tower kernel: ata11.00: cmd 25/00:00:27:02:bf/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:23:58 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:23:58 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:23:58 Tower kernel: ata11: hard resetting link

Feb 9 08:23:59 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:23:59 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:23:59 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:23:59 Tower kernel: ata11: EH complete

Feb 9 08:25:03 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:25:03 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:25:03 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:25:03 Tower kernel: ata11.00: cmd 25/00:00:27:22:bf/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:25:03 Tower kernel: res 51/40:ff:26:23:bf/00:02:5b:00:00/e0 Emask 0x9 (media error)

Feb 9 08:25:03 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:25:03 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:25:03 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:25:03 Tower kernel: ata11: EH complete

Feb 9 08:25:20 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:25:20 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:25:20 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:25:20 Tower kernel: ata11.00: cmd 25/00:00:27:22:bf/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:25:20 Tower kernel: res 51/40:9f:78:22:bf/00:03:5b:00:00/e0 Emask 0x9 (media error)

Feb 9 08:25:20 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:25:20 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:25:20 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:25:20 Tower kernel: ata11: EH complete

Feb 9 08:32:12 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:32:12 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:32:12 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:32:12 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

Feb 9 08:32:12 Tower kernel: res 51/40:27:7f:cf:2d/00:01:5c:00:00/e0 Emask 0x9 (media error)

Feb 9 08:32:12 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:32:12 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:32:12 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:32:12 Tower kernel: ata11: EH complete

Feb 9 08:32:35 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:32:35 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:32:35 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:32:35 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

Feb 9 08:32:35 Tower kernel: res 51/40:37:75:cf:2d/00:01:5c:00:00/e0 Emask 0x9 (media error)

Feb 9 08:32:35 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:32:35 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:32:35 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:32:35 Tower kernel: ata11: EH complete

Feb 9 08:33:05 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:33:05 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:33:05 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

Feb 9 08:33:05 Tower kernel: res 40/00:37:75:cf:2d/00:01:5c:00:00/e0 Emask 0x4 (timeout)

Feb 9 08:33:05 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:33:05 Tower kernel: ata11: hard resetting link

Feb 9 08:33:06 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:33:06 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:33:06 Tower kernel: ata11: EH complete

Feb 9 08:33:36 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:33:36 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:33:36 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

Feb 9 08:33:36 Tower kernel: res 40/00:37:75:cf:2d/00:01:5c:00:00/e0 Emask 0x4 (timeout)

Feb 9 08:33:36 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:33:36 Tower kernel: ata11: hard resetting link

Feb 9 08:33:37 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:33:37 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:33:37 Tower kernel: ata11: EH complete

Total lines: 3000
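With thousands of lines like these, a count per error type is easier to read than the raw log. The sketch below tallies the `res ... Emask ... (label)` lines per ATA device. The function name `ata_error_summary` is made up, and it assumes the log format shown above, where each `res` line follows a `failed command` line for the same device.

```shell
# Hypothetical helper: read syslog text on stdin and count libata result
# lines per device, grouped by the label in parentheses (timeout, media error).
ata_error_summary() {
    awk '
        BEGIN { port = "unknown" }
        # Remember the device named on the most recent "failed command" line.
        / kernel: ata[0-9]+\.[0-9]+: failed command:/ {
            port = $6
            sub(/:$/, "", port)
        }
        # Result line: take the text between the parentheses as the label.
        / kernel: +res / && match($0, /\([^()]*\)/) {
            label = substr($0, RSTART + 1, RLENGTH - 2)
            count[port " " label]++
        }
        END { for (k in count) print k, count[k] }
    ' | sort
}
```

It could be run as `ata_error_summary < /var/log/syslog`, assuming that is where the log lives. The split matters for the PSU-versus-disk question: `media error` with `UNC` is the drive reporting an unreadable sector, while a `timeout` followed by a hard link reset can come from the drive, the cabling, the power or the controller.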

February 8, 2011

Good to know that I can repeat the process again with the new hardware.

As far as the reallocated sectors, I checked every drive and I see 5 drives with reallocated sectors,

ranging from 1 to 5. The drive that failed first, which I checked later, had 1700 sectors pending reallocation,

but no reallocated sectors.

February 8, 2011

I bet you are right about the hardware. But I have no one but myself to blame. I have been having spates of problems with this server from time to time, like once a year, but then it stabilizes and I keep delaying the hardware upgrade. A while ago, after some troubleshooting, I discovered that my mobo (P5BVM-DO) all of a sudden refused to work with the RAM that was installed. Eventually I replaced the RAM but left only one stick installed, as with two sticks the system would not boot. As soon as I recover the data I will upgrade the hardware. Perhaps I should have done that before following the steps that you outlined.

February 8, 2011

The speed dropped to 2,00KB/s... then back up to 6,000KB/s... keeps jumping up and down every time I refresh the

browser. There is one error on disk 1.

February 8, 2011

bjp999, thank you so much for the detailed response. I followed the steps and so far so good... I think.

The writes on drive 7 are increasing, and the reads on all other drives are increasing. Though, I do see

some minor write increases on other drives as well. The speed started off at 16,000KB/s then went down

to 200KB/s and then ramped back up to 19,500KB/s... I am letting it run and let's see what happens.

February 8, 2011

Need some help here. I am running version 4.5.6.

Two days ago a 2TB drive (1736 in slot 9) got spun down by itself

and had a green blinking ball next to it. I rebooted the server and after the reboot the drive

had a red ball next to it. I got a new drive, precleared it on a different server then replaced

the bad drive and started the rebuild process. One day into rebuild I noticed the speed was

14KB/s, barely crawling. The drive in slot 2 had numerous errors. I refreshed the page and

the server became unresponsive. So I hard rebooted it.

This time another drive (PHBV in slot 7) became red and was reported as missing. I powered the

server down, re-seated the cables and powered the server on. No change.

I took both drives out and put them into my other server. The PHBV is dead completely, perhaps

the controller board went dead. The 1736 I can mount, short smart test shows 1700 pending reallocations.

I decided to put the 1736 back into the original server and try to rebuild the PHBV drive, but now

Slot 9 wants the 1007 drive, which is the replacement that I bought for the 1736 in slot 9.

I attached the Disk Status page. Any advice on how I can rebuild the PHBV drive now and then

rebuild the 1736 drive? Probably wishful thinking. In that case, how can I copy the data from this

drive? If I am able to mount, I should be able to copy. Just can't figure out how. As for the other

drive, perhaps the controller got fried and I may be able to revive it...

Did not capture the system log originally, unfortunately.

September 11, 2010

spasszeit, you have been running this for a while now, any idea what your typical CPU temperatures are? I put it all together last night (have not started the array yet; the 4-in-3 fan is a 3-pin and there are only 4-pin headers on the motherboard) and used the stock HSF that came with the 3430. I just want to know what I should be expecting, so I know when to hit the panic button if I need to, lol.

Congrats on finishing off your build. Enjoy it.

Since SM's measurement of CPU temps is strange - all it says is 'low' under the PC Health tab in the IPMI dashboard -

the only other way I can check the temps is in Unmenu. Here you go:

coretemp-isa-0000

Adapter: ISA adapter

Core 0: +35.0 C (high = +84.0 C, crit = +100.0 C)

coretemp-isa-0001

Adapter: ISA adapter

Core 1: +33.0 C (high = +84.0 C, crit = +100.0 C)

coretemp-isa-0002

Adapter: ISA adapter

Core 2: +35.0 C (high = +84.0 C, crit = +100.0 C)

coretemp-isa-0003

Adapter: ISA adapter

Core 3: +32.0 C (high = +84.0 C, crit = +100.0 C)

The ambient temps are about 22-24 C I would guess. I am using stock HSF as well. Front of case has 2x 120mm intake fans,

and rear has 2x 80mm exhaust fans.

August 28, 2010

Now that I got my second server up and running, I would like to set up a scheduled back up of certain shares on Tower1 to Tower2.

I think I got down the basics of the syntax for the rsyncd.conf file, and am able to sync the Photos share (for now) manually, but when it comes to

automating all this I am in quite over my head, so I'd really appreciate some guidance on this.

Here is what I am doing and questions I have:

1. Following JoeL's examples, I set up rsyncd.conf file on Tower2:

uid = root

gid = root

use chroot = no

max connections = 4

pid file = /var/run/rsyncd.pid

timeout = 600

log file = /var/log/rsyncd.log

[Photos]

path = /mnt/user/media/Backups/Photos

comment = /mnt files

read only = FALSE

2. Automatically invoke rsync daemon process on Tower2 every time the server is rebooted.

So, manually the daemon is invoked with this command:

rsync --daemon --config=/boot/config/rsyncd.conf

Should it be added to the go script? It would make sense, but I am curious

why I don't see this command in the 'go' script in the example from this thread - http://lime-technology.com/forum/index.php?topic=3417.0

3. To start the rsync process based on some schedule, I understand I need to

add something similar to this cron job to 'go' script on Tower1:

#set up rsync between the two servers every other day at 3 am - will be commented out for Server2 go script

echo "0 3 2-6,8-13,15-20,21-31 * * /usr/bin/rsync rsync://Server2/disk1/*" >>/tmp/crontab

echo "0 3 2-6,8-13,15-20,21-31 * * /usr/bin/rsync rsync://Server2/disk2/*" >>/tmp/crontab

echo "0 3 2-6,8-13,15-20,21-31 * * /usr/bin/rsync rsync://Server2/disk3/*" >>/tmp/crontab

Say if in my case I want to build upon this manual command to do daily backups:

cd /mnt

rsync -avrH user/media/Photos tower2::Photos

What should my entry be? I am trying to make sense of the example above but I am not sure I get all the syntax yet.

Anything else I am missing?
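For the daily case, the five schedule fields in a crontab entry are minute, hour, day of month, month and day of week, so `0 3 * * *` means 3 am every day. A minimal sketch of building such an entry from the manual command above follows. The function name `daily_rsync_entry` is made up, the `/usr/bin/rsync` path is taken from the quoted example, and `-r` is dropped only because `-a` already implies it.

```shell
# Hypothetical helper: print a crontab entry that runs rsync once a day.
#   $1 = hour (0-23), $2 = source path, $3 = destination (host::module)
daily_rsync_entry() {
    hour=$1
    src=$2
    dest=$3
    # Fields: minute hour day-of-month month day-of-week command
    printf '0 %s * * * /usr/bin/rsync -avH %s %s\n' "$hour" "$src" "$dest"
}
```

In the 'go' script on Tower1 this might be used as `daily_rsync_entry 3 /mnt/user/media/Photos tower2::Photos >> /tmp/crontab`, which appends the line `0 3 * * * /usr/bin/rsync -avH /mnt/user/media/Photos tower2::Photos`. This assumes, as the quoted example seems to, that `/tmp/crontab` is installed afterwards with something like `crontab /tmp/crontab`, and that the rsync daemon is already running on Tower2 when the job fires.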

August 28, 2010

As for deciding if ECC functionality is worth it, that all comes down to personal preferences. I haven't had issues with memory glitches that I am aware of, but then my server doesn't get the workout that enterprise production servers do.

I, on the other hand, have had my share of problems with standard RAM sticks on my first unRaid built on P5BVM-DO.

Still not sure what went wrong there. All of a sudden I started seeing numerous errors in the log, system freezes, etc.

Eventually I narrowed the problem down to RAM, and ended up exchanging it, but running only one stick as with two sticks

of new RAM the system wouldn't boot. Took me two weeks to get the server stable and problem free. But from what I

see on the forums the issues I experienced are very uncommon.

August 28, 2010

Thanks very much for testing that, that's put my mind at ease

I'm still unsure what CPU to use, originally I was going to go with a Core i3 530, but you have to use ECC memory with this motherboard anyway, and the ECC only works with Xeon processors apparently, but the X3430 is nearly double the price (£156 for the X3430 vs £83 for the i3 530).

Whatever I get is going to have more processing power than I have now (an old AMD FX55), so really it's just a case of trying to justify getting a Xeon: is the ECC worth the extra? I believe the Xeons also open up some options for a VM hypervisor or something, but I'm not sure I'd have any need for this.

Any opinions?

My basic reason for going with Xeon was that since I am buying a server grade mobo with ECC memory, I might as well buy a server grade CPU and take advantage of ECC. I've read somewhere that ECC memory provides greater stability and reliability, hence it is a must for mission critical applications. My unRaid has become pretty mission critical for the members of my family:-) Whenever it is down for maintenance or break-fixing, I get bombarded by complaints.

August 28, 2010

I could have sworn I saw a response in this thread (maybe it was a different one) saying the i3 530 will work and is compatible with ECC memory, just that the ECC functionality won't be used. It included several links, one to a rather nice tested review too.

LOL... that was me... I misread Kode's post and thought he'd said he wasn't sure if the i3 530 would work with ECC memory.

Since you mentioned it, here is the link to that review. It really is very thoughtful and nicely written:

http://www.servethehome.com/supermicro-x8silf-motherboard-v102-review-whs-v2-diy-server/

August 28, 2010

Thanks very much for testing that, that's put my mind at ease

I'm still unsure what CPU to use. Originally I was going to go with a Core i3 530, but you have to use ECC memory with this motherboard anyway, and ECC apparently only works with Xeon processors. The X3430 is nearly double the price, though (£156 for the X3430 vs £83 for the i3 530).

Whatever I get is going to have more processing power than I have now (an old AMD FX55), so really it's just a case of trying to justify getting a Xeon: is the ECC worth the extra? I believe the Xeons also open up some options for a VM hypervisor or something, but I'm not sure I'd have any need for this.

Any opinions?

Another alternative for you would be the L3406, although it is also priced much higher than the i3.

August 27, 2010

Thanks adelias for the info on the 1 stick of memory. I don't really want to get 2 sticks straight off. spasszeit, any chance of testing with 1 stick again?

No problem. Channel 1 (blue) slots don't work with one stick; all I get is long beeps and no POST.

Channel 2 (black) slots each work with 1 stick. From the get-go I put sticks (both, and 1 at a time) into channel 1 slots and assumed the same behavior for channel 2 slots... Now, looking at the manual, I see a reference to one channel taking 2 populated slots and one channel also taking 1... but I am a typical guy, I hate reading manuals :-)

August 27, 2010

Also, I tried booting the board with only one stick and it won't boot. Gives me a long beep, which means memory problem according to the manual. My other boards work with one stick just fine. As a matter of fact, my P5BVM-DO doesn't like two sticks, so I have been running it with just one 2GB stick.

I have this board and am running it with a single 2GB UDIMM in slot DIMM1A.

Brand? Model? Link?

Micron MT18JSF25672AY from eBay. It was on Supermicro's tested memory list.

Interesting. It refused one stick of my Crucial memory. Did you get the long beep at all?

No long beep. Did you use slot DIMM1A as it states in the manual? Also what revision is your board?

Mine is 1.02.

Not sure. It's quite possible I stuck it in DIMM2A. I did not reference the manual for that.

August 27, 2010

Added WD20EARS as parity and recalculated the parity.

Aug 25 20:09:59 Tower2 kernel: mdcmd (379): spinup 0

Aug 25 20:09:59 Tower2 kernel:

Aug 25 20:10:00 Tower2 kernel: mdcmd (383): spinup 0

Aug 25 20:10:00 Tower2 kernel:

Aug 25 20:10:26 Tower2 kernel: mdcmd (388): spinup 0

Aug 25 20:10:26 Tower2 kernel:

Aug 26 03:00:46 Tower2 kernel: md: sync done. time=28083sec rate=69562K/sec

Aug 26 03:00:46 Tower2 kernel: md: recovery thread sync completion status: 0

A bit of improvement, vs the original sync rate using Seagate 500GB for parity:

Aug 24 22:41:27 Tower2 kernel: md: sync done. time=9469sec rate=51577K/sec

I'm assuming these are onboard SATA rates. I wonder if there is any performance boost going through the SASLP-MV8. My Atom averages about 55000K/sec on a parity check with 7200 rpm Hitachis, so I'm betting you could see 80-90 MB/sec with non-green drives.

Actually, 2 are onboard and 4 are connected to the SASLP card. I wanted to test the card and left it like that afterward.
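For anyone who wants to compare the two "sync done" lines quoted above without doing the arithmetic by hand, here is a rough Python sketch. The regular expression is based only on the two log lines shown in this post, and the function names (`parse_sync`, `summarize`) are made up for illustration; I am also assuming that "K" in the rate means kilobytes per second.

```python
import re

# Pattern written against the two "sync done" lines quoted in this post.
# Other unRAID versions may format the line differently (assumption).
SYNC_RE = re.compile(r"md: sync done\. time=(\d+)sec rate=(\d+)K/sec")


def parse_sync(line):
    """Return (seconds, rate_in_K_per_sec) from a 'sync done' line, or None."""
    match = SYNC_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def summarize(line):
    """Turn one 'sync done' line into hours, rate, and total K processed."""
    seconds, rate_k = parse_sync(line)
    return {
        "hours": seconds / 3600,
        "rate_k": rate_k,
        "total_k": seconds * rate_k,  # roughly the size of the parity drive
    }


new = summarize(
    "Aug 26 03:00:46 Tower2 kernel: md: sync done. time=28083sec rate=69562K/sec"
)
old = summarize(
    "Aug 24 22:41:27 Tower2 kernel: md: sync done. time=9469sec rate=51577K/sec"
)

# Relative change in average sync rate, in percent.
speedup_pct = (new["rate_k"] / old["rate_k"] - 1) * 100

print(f"old parity: {old['hours']:.1f} h at {old['rate_k']} K/sec")
print(f"new parity: {new['hours']:.1f} h at {new['rate_k']} K/sec")
print(f"rate change: {speedup_pct:.0f}%")
```

Note that the new sync took longer in total simply because the 2TB parity drive is about four times the size of the old 500GB one; the rate is the number worth comparing.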

August 26, 2010

I am not going to repeat the pros and cons that were already mentioned, nor am I going to describe my very painless experience of recovering from 1 failed disk, or how easy it is to expand the array or replace a drive with a larger one. I'll just say that I did recently seriously contemplate running different home server software than unRaid. The reason was that I had a hard time booting unRaid from a flash drive on the new hardware I purchased for a second unRaid system. I was so frustrated that I started looking at alternatives for a while. I considered WHS, FlexRaid, FreeNAS, Openfiler and ZFS. I have to tell you, to me absolutely nothing came even close to unRaid, which I have been using since 2007. It totally satisfies my needs and is very simple to set up and maintain for a non-Linux-savvy user like myself.

My only concern is that if I lose more than one drive, the data on those drives will be lost. I have some data that I cannot lose, like family videos in HD. Until recently I had it backed up elsewhere, but since the size is growing fast, I need another solution. I decided to build a second server, which will be located in a different location and where I will keep duplicates of critical data. In a way, it is similar to WHS's duplication, but I don't have to duplicate everything to have some kind of protection. To me unRaid is a much more elegant solution, more stable and feature rich, and over the past three years it has not let me down. I will not go any other way unless I absolutely have to.

Finally, I just don't think that duplication on WHS is worth much. In a properly protected system (non-fail surge protector, UPS and good ventilation) the chance of a hard drive failure is very small. I have had only one failed hard drive in unRaid over the past 3 years, out of the 20 drives I am currently running, and I easily recovered from it. This tells me that duplication would have been a total waste of a lot of money. And if lightning struck the house, or there was a fire or some other disastrous event, the whole system would have been destroyed, so again that duplication would have been worth nothing in the end.

A more prudent way to use duplication is to set up different servers and keep them as far away from each other as reasonably possible. Two unRaid servers would do the trick.

