SMB behavior - Parity-check stalls - unRAID Server 4.5 [No new topics]

September 2, 2010

My SMB shares have been acting strange the last few days. I have been able to see and mount them, but they will disappear within a few minutes. Sometimes I'm able to mount them and browse the content, and sometimes not.

I have rebooted the machine a couple times and let the parity-check run and it gets stuck at 65%.

Here is a link to my last log:

http://pastebin.com/J03wpswH

Any help would be great. Thanks!

September 4, 2010

Author

It's completely unusable now. I see the shares, mount one, then try to open and browse the files on that share... it locks the Finder on every machine I try.

It's going through another parity check now, but my guess is that it will stall out at 65% again.

Is there anything I can do to diagnose this on the system? Can I rebuild the parity and hope for the best? Should I upgrade from 4.5.1 to 4.5.6 or 5.0b2?

I've had the system running fine for quite a while now, and I have tons of data on here that I would hate to lose.

Help? Please?

September 4, 2010

Step 1. Perform a memory test.

Step 2. Do NOT rebuild parity just for the heck of it. You are more likely to damage things than to fix things.

Step 3. Attach a syslog to your next post.

Joe L.

September 4, 2010

Author

Thanks much! Running a memtest now.

September 7, 2010

Author

Well, I was running a memtest for a couple days (not sure if it's supposed to take that long) and the power went out for an extended period of time ... I did notice that there were 219 errors before the power went out though. Is that typical? Should I just go get new RAM or should I wait for another test to run?

Thanks!

September 7, 2010

There should NEVER be ANY errors in memtest. If so, then the mobo or RAM is bad (or the RAM is not configured right or is incompatible with that mobo).

September 7, 2010

Memtest runs until you quit from it. If you had not had that power failure, it would still be running.

As already stated, most likely you have either bad memory, or the timing, voltage or clock speed are not set properly for the memory you do have. It is also possible for the errors to be caused by a bad motherboard and/or power supply.

You should NEVER see any errors on a memory test. If you do, your server will randomly crash or data will get corrupted.

September 7, 2010

Author

The server had been running for a year or so with no issues, so should I assume that it isn't a timing/voltage/clock speed issue? If so, I guess I should go pick up some new RAM and give that a shot.

September 7, 2010

The server had been running for a year or so with no issues, so should I assume that it isn't a timing/voltage/clock speed issue?

No, you should not just assume your motherboard gets it correct. It must be set for your specific memory make and model. Some BIOSes get it right; many do not. The problem might show up only under certain conditions of temperature, voltage, or server activity, but if the settings are wrong now and you replace the RAM with identical RAM, they will still be wrong.

If so, I guess I should go pick up some new RAM and give that a shot.

Only after checking the current settings for timing, voltage, and clock speed... and run memtest on the new memory too. Remember, it could be a power supply issue or a motherboard issue causing the errors.

September 7, 2010

First step should be to unplug memory and other expansion cards, and re-seat them in their sockets.

If memtest is reporting faults at specific fixed locations and not at random locations, I would suspect the memory. Do you have two memory modules? If so, swap their physical locations and see whether the reported errors change (i.e., the faulty address moves by the size of the modules, or moves from odd to even addresses, or vice versa).

Otherwise, if the system has been running without problem for a year, and there have been no significant changes to hardware or software, I would suspect the power supply ... even if the voltages appear to be correct.

September 11, 2010

Author

Thanks again for your suggestions!

I pulled the RAM and PCI cards... put them back in and checked my BIOS settings on reboot. The timings all look correct (mostly set to Auto, but reading the values correctly), but I did change one performance setting for the memory (I forget what it was called) from "Turbo" to "Standard". It said something about making a machine more stable after overclocking, so I thought why not? Ran another memtest and had no errors. I just rebooted, and I have attached my syslog. Should I be able to capture the memtest results somehow?

It is running at the moment, but not long enough to tell if anything has truly been fixed. Should I run more tests of some sort before risking data corruption letting this thing run?

I should also mention that I moved the server to a different location in the house for testing, so my battery backup is not connected.

syslog-2010-09-11.zip

September 11, 2010

Author

Well, back to the same issues :-\

Another thing... I almost always have to telnet in and run /sbin/powerdown to reboot. I try to stop the array in the main UI, but it sits there forever "stopping..." Could that be somehow related?

Here is another syslog... anything else I can do?

syslog-20100911-113715.zip

September 11, 2010

The server will not shut down if it is unable to un-mount the disks. Disks cannot be un-mounted if they are busy.

Disks are busy if they are in use, either by having a file on them open, or by being the "current" directory for a process.

For example, if you log in and change directory to /mnt/disk1, then disk1 will be busy since it is your current working directory.

If you were to start a process while there, that directory will be the current working directory for that process. That too would prevent the disk from being un-mounted until the process is terminated.

So... stop any add-on programs you might have started, log off any telnet sessions, and then you should be able to shut down the server.
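To see which processes are keeping a disk busy, one option is to walk /proc and look at each process's current directory and open files. The following is only a minimal sketch, assuming a Linux /proc layout and that Python is available on the machine where you run it; the function name `find_busy_pids` and the `proc_root` parameter are made up for illustration. Standard tools such as lsof or fuser report similar information, if they are installed.

```python
import os


def find_busy_pids(mount_point, proc_root="/proc"):
    """Map PID -> list of paths (cwd or open files) located under mount_point.

    Hypothetical helper for illustration; it only reads symlinks in /proc.
    """
    mount_point = os.path.abspath(mount_point)
    # Trailing separator so that /mnt/disk1 does not match /mnt/disk10.
    prefix = mount_point.rstrip(os.sep) + os.sep
    busy = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        pid_dir = os.path.join(proc_root, entry)
        links = [os.path.join(pid_dir, "cwd")]
        fd_dir = os.path.join(pid_dir, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited, or we lack permission to list its files
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if target == mount_point or target.startswith(prefix):
                busy.setdefault(int(entry), []).append(target)
    return busy
```

Calling `find_busy_pids("/mnt/disk1")` as root would list a telnet shell sitting in /mnt/disk1 under its PID, which is exactly the case described above. Run as a normal user, it will silently miss processes whose /proc entries it cannot read.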

Joe L.

September 11, 2010

Author

Did the syslog point to anything that might be my issues?

September 11, 2010

Yes.

At one point one of your user shares was in use:

Sep 11 11:36:00 MediaServer emhttp: Retry unmounting user share(s)...

Sep 11 11:36:05 MediaServer emhttp: shcmd (104): umount /mnt/user >/dev/null 2>&1

Sep 11 11:36:05 MediaServer emhttp: _shcmd: shcmd (104): exit status: 1

Sep 11 11:36:05 MediaServer emhttp: shcmd (105): rmdir /mnt/user >/dev/null 2>&1

Sep 11 11:36:05 MediaServer emhttp: _shcmd: shcmd (105): exit status: 1

Sep 11 11:36:05 MediaServer emhttp: Retry unmounting user share(s)...

Sep 11 11:36:10 MediaServer emhttp: shcmd (106): umount /mnt/user >/dev/null 2>&1

Sep 11 11:36:10 MediaServer emhttp: _shcmd: shcmd (106): exit status: 1

Sep 11 11:36:10 MediaServer emhttp: shcmd (107): rmdir /mnt/user >/dev/null 2>&1

Sep 11 11:36:10 MediaServer emhttp: _shcmd: shcmd (107): exit status: 1

Sep 11 11:36:10 MediaServer emhttp: Retry unmounting user share(s)...

Then, more specifically, disk9:

Sep 11 11:37:15 MediaServer rc.unRAID[4471]: umount: /mnt/disk9: device is busy

My guess, based on this in your log, is that you are forgetting to stop the AirVideo server before attempting to stop the array:

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: # Generated settings:^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareExport=e^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareExportNfs=^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareComment=^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareValidUsers=^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareInvalidUsers=^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareExceptions=^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareInclude=^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareExclude=^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareAllocator=highwater^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareSplitLevel=2^M

Sep 11 11:37:11 MediaServer AirVideo.cfg[4340]: shareFloor=1,000,000^M
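If you want to pull these lines out of a long syslog yourself, a small script can do the searching. This is a rough sketch, assuming the message formats shown in the excerpts above ("shcmd (N): ...", "_shcmd: shcmd (N): exit status: S", and "umount: PATH: device is busy"); the function name `find_unmount_failures` is hypothetical, and other unRAID versions may log differently.

```python
import re

# Patterns based only on the log excerpts quoted in this thread.
BUSY_RE = re.compile(r"umount: (\S+): device is busy")
EXIT_RE = re.compile(r"_shcmd: shcmd \((\d+)\): exit status: (\d+)")
SHCMD_RE = re.compile(r"shcmd \((\d+)\): (.+)$")

SAMPLE = """
Sep 11 11:36:05 MediaServer emhttp: shcmd (104): umount /mnt/user >/dev/null 2>&1
Sep 11 11:36:05 MediaServer emhttp: _shcmd: shcmd (104): exit status: 1
Sep 11 11:36:05 MediaServer emhttp: shcmd (105): rmdir /mnt/user >/dev/null 2>&1
Sep 11 11:36:05 MediaServer emhttp: _shcmd: shcmd (105): exit status: 1
Sep 11 11:37:15 MediaServer rc.unRAID[4471]: umount: /mnt/disk9: device is busy
"""


def find_unmount_failures(lines):
    """Return (busy mount points, failed commands as (number, command, status))."""
    commands = {}
    busy, failed = [], []
    for raw in lines:
        line = raw.rstrip()
        match = BUSY_RE.search(line)
        if match:
            busy.append(match.group(1))
            continue
        # Check the exit-status form first: it also contains "shcmd (N):".
        match = EXIT_RE.search(line)
        if match:
            number, status = int(match.group(1)), int(match.group(2))
            if status != 0 and number in commands:
                failed.append((number, commands[number], status))
            continue
        match = SHCMD_RE.search(line)
        if match:
            commands[int(match.group(1))] = match.group(2)
    return busy, failed


busy, failed = find_unmount_failures(SAMPLE.splitlines())
print("busy mount points:", busy)
for number, command, status in failed:
    print("shcmd", number, "failed with status", status, ":", command)
```

On a real log you would pass the open file instead of `SAMPLE.splitlines()`. The script only tells you what failed to unmount, not which process was responsible.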

September 12, 2010

It would be helpful if people posted their hardware configuration when there is a problem, as this is hard (almost impossible) to get from the syslog.

Possible issues:

1. You have a SIL3xxx-based card (not sure whether it is on-board or an add-on), and there were recent posts here regarding possible data corruption and a possibly high failure rate of such cards (there are probably tens of manufacturers, and this may affect just a certain brand).

Check this for more info - http://lime-technology.com/forum/index.php?topic=7742.0

as your case certainly fits the bill.

Sep 11 10:24:12 MediaServer kernel: sata_sil 0000:05:00.0: version 2.4

Sep 11 10:24:12 MediaServer kernel: sata_sil 0000:05:00.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20

Sep 11 10:24:12 MediaServer kernel: sata_sil 0000:05:00.0: Applying R_ERR on DMA activate FIS errata fix

Sep 11 10:24:12 MediaServer kernel: scsi1 : sata_sil

Sep 11 10:24:12 MediaServer kernel: scsi2 : sata_sil

Sep 11 10:24:12 MediaServer kernel: scsi3 : sata_sil

Sep 11 10:24:12 MediaServer kernel: scsi4 : sata_sil

Also here you have this:

Sep 11 10:24:12 MediaServer kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Sep 11 10:24:12 MediaServer kernel: ata1.00: ATA-8: ST31500341AS, SD17, max UDMA/133

Sep 11 10:24:12 MediaServer kernel: ata1.00: 2930277168 sectors, multi 16: LBA48 NCQ (not used)

Sep 11 10:24:12 MediaServer kernel: ata1.00: WARNING: device requires firmware update to be fully functional.

Sep 11 10:24:12 MediaServer kernel: ata1.00: contact the vendor or visit http://ata.wiki.kernel.org.

Sep 11 10:24:12 MediaServer kernel: ata1.00: configured for UDMA/100

Sep 11 10:24:12 MediaServer kernel: scsi 1:0:0:0: Direct-Access ATA ST31500341AS SD17 PQ: 0 ANSI: 5

2. You have HPA on at least one disk:

Sep 11 10:24:12 MediaServer kernel: ata2.00: HPA detected: current 2930275055, native 2930277168

Sep 11 10:24:12 MediaServer kernel: ata2.00: ATA-8: ST31500341AS, CC1H, max UDMA/133

It also looks like the disk in use is, coincidentally, the one with HPA (note the smaller size of disk9):

Sep 11 10:24:12 MediaServer kernel: md: import disk0: [8,64] (sde) Hitachi HDS72202 JK1130YAGSEJWT offset: 63 size: 1953514552

Sep 11 10:24:12 MediaServer kernel: md: import disk1: [8,96] (sdg) ST31500541AS 6XW0GRH1 offset: 63 size: 1465138552

Sep 11 10:24:12 MediaServer kernel: md: import disk2: [8,80] (sdf) ST3500630AS 5QG044WP offset: 63 size: 488386552

Sep 11 10:24:12 MediaServer kernel: md: import disk3: [8,112] (sdh) ST3500630AS 9QG0TJ4H offset: 63 size: 488386552

Sep 11 10:24:12 MediaServer kernel: md: import disk4: [8,128] (sdi) ST31500541AS 5XW00XL3 offset: 63 size: 1465138552

Sep 11 10:24:12 MediaServer kernel: md: import disk5: [8,144] (sdj) WDC WD20EADS-00S WD-WCAVY2751289 offset: 63 size: 1953514552

Sep 11 10:24:12 MediaServer kernel: md: import disk6: [8,32] (sdc) ST31500341AS 9VS2X15H offset: 63 size: 1465138552

Sep 11 10:24:12 MediaServer kernel: md: import disk7: [8,48] (sdd) ST31500341AS 9VS1TJRW offset: 63 size: 1465138552

Sep 11 10:24:12 MediaServer kernel: md: import disk8: [8,0] (sda) ST31500341AS 9VS09WWJ offset: 63 size: 1465138552

Sep 11 10:24:12 MediaServer emhttp: Spinning up all drives...

Sep 11 10:24:12 MediaServer kernel: md: import disk9: [8,16] (sdb) ST31500341AS 9VS28GN1 offset: 63 size: 1465137496

I am not sure how to handle this - perhaps Joe will take it from here.
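For reference, the size of the hidden area can be read straight off the HPA line in the log. A minimal sketch, assuming the kernel message format shown above and 512-byte sectors (an assumption, not something the log states); the function name `hpa_hidden_sectors` is made up for illustration.

```python
import re

# Matches the kernel message format quoted above.
HPA_RE = re.compile(r"(ata\d+\.\d+): HPA detected: current (\d+), native (\d+)")

SECTOR_BYTES = 512  # assumption: traditional 512-byte sectors

SAMPLE_LINE = (
    "Sep 11 10:24:12 MediaServer kernel: ata2.00: "
    "HPA detected: current 2930275055, native 2930277168"
)


def hpa_hidden_sectors(line):
    """Return (device, hidden sector count) for an HPA log line, else None."""
    match = HPA_RE.search(line)
    if match is None:
        return None
    current, native = int(match.group(2)), int(match.group(3))
    return match.group(1), native - current


device, hidden = hpa_hidden_sectors(SAMPLE_LINE)
print(device, "hides", hidden, "sectors =", hidden * SECTOR_BYTES, "bytes")
```

For the line in this syslog that works out to 2113 hidden sectors, or roughly 1 MB under the 512-byte assumption.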

September 12, 2010

Author

Thanks for the suggestions. I'll look at replacing that controller and figure out what the HPA thing is all about and check back in with my results. Do those things have a good chance of being the answer to my original issues, too? I still can't seem to have these shares stay mounted. I can mount them, and sometimes get to the data, but they go away pretty quickly.

Thanks again!

BTW, here is my config:

Gigabyte MB (GA-EP45-UD3L)

4GB Corsair RAM (2x2GB DDR2 800MHz 1.8V vers.4.3)

Corsair Power Supply (TX750W)

SYBA SATA Card (SD-SATA-4P)

October 12, 2010

Author

I found my issue and thought I would share in case anyone else comes across it.

I had a bad/corrupt file.

When I turned off my media server (Plex9), everything was back to normal for a long while. So, I thought I would give it another shot to see if it would start freaking out again. This time, I watched the Media Server loading/scanning content. After a few minutes, I noticed that it was stuck scanning one file... then *poof*... the share goes away. I quit MS and tried to mount the share again with no luck, so I restarted and was able to mount the share fine. When I went to get info on the file that it was stuck on, the Finder hung for a while, then *poof*. Restarted, tried to delete the file through the Finder... *poof*... so, I manually deleted the file on the unRAID box. Restart, mount share, start MS... scans all content, and everything is working!

Thanks for everyone's suggestions. It prompted me to fix a couple issues that I didn't know I had, so it was a great learning experience!

October 12, 2010

Deleting the file may have fixed it, but to be absolutely certain I'd perform a file system check on the disk that had the file you deleted.

Instructions are here in the wiki: http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems
