[SOLVED-WORK-AROUND] ReiserFS - kernel panic and stall on CPU



I too am having the exact same issue. I have been running Unraid version 5 under ESXi with no issues for the past year.

 

I upgraded to Unraid version 6 when it entered public beta, again with no issues. Past beta 6, however, after an average of two days the web GUI would no longer be accessible, nor would the user shares. I still had SSH access and direct console access to the virtual machine in ESXi, but the Unraid server would not respond to shutdown or restart commands from the console.

 

I thought it might have something to do with the mover script. Once the mover finished running, access to the web GUI became noticeably slower, if it was accessible at all, and running the mover twice pretty much guaranteed a problem. I figured I could do without a cache drive, but turning off all access to the cache drive did not 'fix' the issue; it only bought a little more time before the GUI stopped responding.
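(Side note for anyone trying to reproduce this: the mover can also be kicked off by hand from an SSH/console session. On recent v6 builds the script should live at the path below, though that may differ between versions:)

/usr/local/sbin/mover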

 

After downgrading back to Unraid 5, I once again have not had any issues.

 

 

I was so excited to see beta 12 released, but I am disappointed to see this bug still exists. I wish I could put beta 12 up and grab some log files from my server to help diagnose this issue, but having the media server go down randomly is not something I am willing to risk.

 

What kind of hardware do you have? And what version of ESXi?

 

I want to see if we have anything in common.

Link to comment

Decided to boot into Arch and start Unraid-6.0b12 under Xen 4.4.1 and try the following:

  • Ran the Mover script which moved 10GB of data
  • Ran the bitrot script "bitrot.sh -a -p /mnt/disk4/TV/ -m *.mkv"
  • Updated a Docker image which lives on the cache drive

 

And so far so good. I know that when I ran "bitrot.sh -a -p /mnt/user/TV" it gave me a kernel panic the minute it started putting the SHA256 into the file itself.
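(For context, "putting the SHA256 into the file" means writing the hash as an extended attribute on the file, which is why the stack trace further down ends in setxattr/reiserfs_setxattr calls. Roughly what the script does per file looks like the following; the attribute name and path here are just made-up examples:)

setfattr -n user.sha256 -v "<sha256-of-file>" "/mnt/user/TV/Some Show/episode.mkv"
getfattr -n user.sha256 "/mnt/user/TV/Some Show/episode.mkv"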

 

I am monitoring the syslog both via the browser and "tail -f /var/log/syslog", and so far so good. This is the longest Unraid and the bitrot script have ever lasted.

 

The bitrot script was able to process 94GB successfully under /mnt/disk4/TV and another 50GB under /mnt/disk12/Anime.

 

Here comes the real test: "bitrot.sh -a -p /mnt/user/TV/ -m *.mkv". Let's see if Unraid has a (CPU) panic with this.

 

So, right when bitrot started accessing /mnt/user/TV while I was watching XBMC at the same time, it decided to panic:

 

Nov 30 22:45:24 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 1} (t=6000 jiffies g=107510 c=107509 q=91452)
Nov 30 22:45:24 Tower kernel: Task dump for CPU 1:
Nov 30 22:45:24 Tower kernel: shfs R running task 0 14976 1 0x00000008
Nov 30 22:45:24 Tower kernel: 0000000000000000 ffff88012ae83c88 ffffffff8105cc09 0000000000000001
Nov 30 22:45:24 Tower kernel: 0000000000000001 ffff88012ae83ca0 ffffffff8105f2c4 ffffffff81822d00
Nov 30 22:45:24 Tower kernel: ffff88012ae83cd0 ffffffff810766a5 ffffffff81822d00 ffff88012ae8e0c0
Nov 30 22:45:24 Tower kernel: Call Trace:
Nov 30 22:45:24 Tower kernel: [] sched_show_task+0xbe/0xc3
Nov 30 22:45:24 Tower kernel: [] dump_cpu_task+0x34/0x38
Nov 30 22:45:24 Tower kernel: [] rcu_dump_cpu_stacks+0x6a/0x8c
Nov 30 22:45:24 Tower kernel: [] rcu_check_callbacks+0x1e1/0x4ff
Nov 30 22:45:24 Tower kernel: [] ? tick_sched_handle+0x34/0x34
Nov 30 22:45:24 Tower kernel: [] update_process_times+0x38/0x60
Nov 30 22:45:24 Tower kernel: [] tick_sched_handle+0x32/0x34
Nov 30 22:45:24 Tower kernel: [] tick_sched_timer+0x35/0x53
Nov 30 22:45:24 Tower kernel: [] __run_hrtimer.isra.29+0x57/0xb0
Nov 30 22:45:24 Tower kernel: [] hrtimer_interrupt+0xd9/0x1c0
Nov 30 22:45:24 Tower kernel: [] xen_timer_interrupt+0x2b/0x108
Nov 30 22:45:24 Tower kernel: [] handle_irq_event_percpu+0x26/0xec
Nov 30 22:45:24 Tower kernel: [] handle_percpu_irq+0x39/0x4d
Nov 30 22:45:24 Tower kernel: [] generic_handle_irq+0x19/0x25
Nov 30 22:45:24 Tower kernel: [] evtchn_fifo_handle_events+0x12d/0x156
Nov 30 22:45:24 Tower kernel: [] __xen_evtchn_do_upcall+0x48/0x75
Nov 30 22:45:24 Tower kernel: [] xen_evtchn_do_upcall+0x2e/0x3f
Nov 30 22:45:24 Tower kernel: [] xen_do_hypervisor_callback+0x1e/0x30
Nov 30 22:45:24 Tower kernel: [] ? __discard_prealloc+0x71/0xb1
Nov 30 22:45:24 Tower kernel: [] ? reiserfs_discard_all_prealloc+0x43/0x4c
Nov 30 22:45:24 Tower kernel: [] ? do_journal_end+0x4e1/0xc57
Nov 30 22:45:24 Tower kernel: [] ? journal_end+0xad/0xb4
Nov 30 22:45:24 Tower kernel: [] ? reiserfs_xattr_set+0xd2/0x114
Nov 30 22:45:24 Tower kernel: [] ? user_set+0x3f/0x4d
Nov 30 22:45:24 Tower kernel: [] ? reiserfs_setxattr+0x9b/0xa9
Nov 30 22:45:24 Tower kernel: [] ? __vfs_setxattr_noperm+0x69/0xd5
Nov 30 22:45:24 Tower kernel: [] ? vfs_setxattr+0x7c/0x99
Nov 30 22:45:24 Tower kernel: [] ? setxattr+0x118/0x162
Nov 30 22:45:24 Tower kernel: [] ? final_putname+0x2f/0x32
Nov 30 22:45:24 Tower kernel: [] ? user_path_at_empty+0x60/0x87
Nov 30 22:45:24 Tower kernel: [] ? __sb_start_write+0x9a/0xce
Nov 30 22:45:24 Tower kernel: [] ? __mnt_want_write+0x43/0x4a
Nov 30 22:45:24 Tower kernel: [] ? SyS_lsetxattr+0x66/0xa8
Nov 30 22:45:24 Tower kernel: [] ? system_call_fastpath+0x16/0x1b

 

If I browse \\tower I see all of the Unraid shares, but I am unable to browse to \\tower\TV; it freezes Explorer.

Browsing via SSH has the same issue: I can see /mnt/user/*, but going to /mnt/user/TV freezes PuTTY.

XBMC is unable to find the files to play.

 

I notice that it "stalled" on shfs. Is there any way to debug or troubleshoot whether it's shfs causing this issue? I'm guessing that's why it didn't freeze when moving files between /mnt/disk(1-13), since that bypasses shfs.
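(One quick way to confirm it is shfs each time, while the box is still reachable over SSH, is to pull the stall out of the syslog and look up the PID from the "Task dump" line. For the trace above that would be:)

grep -B1 -A3 "self-detected stall" /var/log/syslog
ps -fp 14976    # 14976 is the PID on the "Task dump" line; here it resolves to shfs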

Link to comment

I just went through the changelogs and noted only the changes to "shfs".

 

Since the past few stalls have been pointing towards shfs, and it only freezes when the Mover (which uses /mnt/user and /mnt/user0) or the bitrot script (when pointed at /mnt/user/TV) is running, it has to be something with shfs.

 

When I manually copy hundreds of GB of data via MC or run the bitrot script against /mnt/disk(1-13), it doesn't have any issues; hell, I can even watch shows via XBMC, copy data, and run bitrot at the same time without issues.

 

Please correct me if I'm wrong, as I'm trying not to jump to conclusions, but I believe I've narrowed it down to shfs...

 

/etc/mtab:

shfs /mnt/user0 fuse.shfs rw,nosuid,nodev,noatime,allow_other 0 0

shfs /mnt/user fuse.shfs rw,nosuid,nodev,noatime,allow_other 0 0

 

Latest crash:

Nov 30 22:45:24 Tower kernel: Task dump for CPU 1:

Nov 30 22:45:24 Tower kernel: shfs R running task 0 14976 1 0x00000008

 

The only one that sticks out is the change from beta 6 to beta 7, since beta 7 is where we started having the issues.

 

Summary of changes from 6.0-beta10a to 6.0-beta10b

--------------------------------------------------

- shfs: fix chown() operation on symlink

 

Summary of changes from 6.0-beta10 to 6.0-beta10a

-------------------------------------------------

- shfs: fix issue preventing new object creation on use_cache=yes shares when cache not present

 

Summary of changes from 6.0-beta8 to 6.0-beta9

----------------------------------------------

- shfs: fixed improper handling of global share cache floor

 

Summary of changes from 6.0-beta7 to 6.0-beta8

----------------------------------------------

- shfs: honor cache floor settings for 'unsplittable' paths

 

Summary of changes from 6.0-beta6 to 6.0-beta7

--------------------------------------------------

- shfs: fixed improper handling of cache floor

Link to comment

 

 

What kind of hardware do you have? And what version of ESXi?

 

I want to see if we have anything in common.

 

Running ESXi 5.5.0, build 2068190

 

Intel S2400C Motherboard, 2x Intel Xeon E5-2407 2.20GHz, 32GB RAM

 

I have the unRAID server set up as a VM with 8GB of RAM allocated and 2 CPUs

2x SAS cards in passthrough mode (IBM ServeRAID M1015)

Link to comment

 

 


 

We have a few things in common: ESXi 5.5.0 build 2068190 and 2 CPUs, but I only have 4GB of RAM and a single M1015 flashed to the LSI 2008 IT firmware.

 

Can you test something for me? Would it be possible for you to move 50-100GB+ between the /mnt/disk* mounts using MC via SSH/Telnet? I want to see if bypassing shfs works for you.
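(If MC isn't convenient, a plain rsync between the disk mounts does the same test and never touches /mnt/user; the disk numbers and share name below are just placeholders:)

rsync -avP /mnt/disk4/TV/SomeShow/ /mnt/disk5/TV/SomeShow/    # disk-to-disk copy, bypassing the shfs /mnt/user mount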

 

If it doesn't lock up and works just fine we might be onto something.

Link to comment

@razorslinky

 

I think I am having the same issue you're having, but I am not running under ESXi; I'm on bare metal, in fact.

The problem started once I reached beta 9, but I think that's just a coincidence. You can see my thread here: http://lime-technology.com/forum/index.php?topic=35357.0 but it's probably not necessary.

 

My theory is that since I write a lot of data to my server, probably 100-200GB a day, I have been caught by the RFS bug that betas 7-8 had. For the last few months my server has frozen roughly twice a week: the web UI cannot be accessed, nor can the shares, but SSH still works unless I try to list something under /mnt, meaning I have to do a hard reboot.

Currently I am moving all my drives to XFS, which I have calculated will take about 15 days.

 

Link to comment


 

So perhaps there is some internal RFS corruption causing the issue.

I wonder if running a scrub on the filesystem itself will reveal it.

I did this recently on a live filesystem (it fills all empty space with data) to reallocate some pending sectors. It worked for that use case.

In this case, I wonder if it will exercise the metadata enough to reveal a problem.

Link to comment


 

If the scrub is unlikely to cause data loss, I am happy to run it and report back on the results, though I don't know how to go about it.

Link to comment

Have you run the test under bare metal with the user share / shfs?

 

That's actually my next troubleshooting step tonight.

 

So perhaps there is some internal RFS corruption causing the issue.

I wonder if running a scrub on the filesystem itself will reveal it.

I did this recently on a live filesystem (it fills all empty space with data) to reallocate some pending sectors. It worked for that use case.

In this case, I wonder if it will exercise the metadata enough to reveal a problem.

 

Well, that sounds very interesting. Do you happen to have the command lines, by chance? I'll try this right after I try booting bare-metal Unraid tonight.

Link to comment

It has to be compiled for 64-bit unRAID, and I haven't created that environment yet.

 

It's pretty simple to run:

http://linux.die.net/man/1/scrub

 

root@unRAID1:/mnt/disk3# ./scrub
Usage: scrub [OPTIONS] file [file...]
  -v, --version           display scrub version and exit
  -p, --pattern pat       select scrub pattern sequence
  -b, --blocksize size    set I/O buffer size (default 4m)
  -s, --device-size size  set device size manually
  -X, --freespace dir     create dir+files, fill until ENOSPC, then scrub
  -D, --dirent newname    after scrubbing file, scrub dir entry, rename
  -f, --force             scrub despite signature from previous scrub
  -S, --no-signature      do not write scrub signature after scrub
  -r, --remove            remove file after scrub
  -L, --no-link           do not scrub link target
  -R, --no-hwrand         do not use a hardware random number generator
  -t, --no-threads        do not compute random data in a parallel thread
  -n, --dry-run           verify file arguments, without writing
  -h, --help              display this help message
Available patterns are:
  nnsa          3-pass   NNSA NAP-14.1-C
  dod           3-pass   DoD 5220.22-M
  bsi           9-pass   BSI
  usarmy        3-pass   US Army AR380-19
  random        1-pass   One Random Pass
  random2       2-pass   Two Random Passes
  schneier      7-pass   Bruce Schneier Algorithm
  pfitzner7     7-pass   Roy Pfitzner 7-random-pass method
  pfitzner33   33-pass   Roy Pfitzner 33-random-pass method
  gutmann      35-pass   Gutmann
  fastold       4-pass   pre v1.7 scrub (skip random)
  old           5-pass   pre v1.7 scrub
  dirent        6-pass   dirent
  fillzero      1-pass   Quick Fill with 0x00
  fillff        1-pass   Quick Fill with 0xff
  verify        1-pass   Quick Fill with 0x00 and verify
  custom        1-pass   custom="str" 16 chr max, C esc like \r, \xFF, \377, \\

 

I used the -X /mnt/disk#/tmp parameter along with -p verify, which writes NULs (0x00) and then reads them back to verify.

scrub -X /mnt/disk#/tmp -p verify

 

It created a huge number of files in the directory then removed them.

 

** DISCLAIMER **  I don't know if it will aggravate corruption that may already exist in the metadata. So use it at your own risk.

For me, I had pending sectors on the parity drive, so I turned on turbo write mode and elected to create these files to force sector writes.
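(For reference, I believe turbo write is toggled through the md driver tunable below; the exact name and path may differ between unRAID versions, so treat this as from memory:)

/root/mdcmd set md_write_method 1    # reconstruct write ("turbo write") on
/root/mdcmd set md_write_method 0    # back to the default read/modify/write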

YMMV.

Link to comment

Does this happen on unRAID 5 on that specific drive?

What about outside of ESXi?

 

It seems like there is some corruption at the reiserfs level.

 

FWIW, we know for sure beta 7 & 8 had the potential for silent corruption at the reiserfs level.

If you ran those for an extended time and wrote to the filesystems, issues could have grown from there.

 

Do you have md5sums of each drive to verify integrity?
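(For anyone who wants to start keeping them, a per-disk checksum file only takes a couple of commands; the /boot/md5sums location below is just an example spot on the flash drive:)

mkdir -p /boot/md5sums
find /mnt/disk3 -type f -exec md5sum {} + > /boot/md5sums/disk3.md5    # build the baseline for disk3
md5sum -c /boot/md5sums/disk3.md5 | grep -v ': OK$'                    # later: list anything that no longer matches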

Link to comment


 

Well, not-so-funny story... My parity drive died and I haven't had the money or time (newborn) to replace it, so I'm running on hope that nothing else dies or gets corrupted until I get a replacement. I don't have md5sums of any of the drives to verify integrity, and I haven't tested this under Unraid 5 or Unraid 6 beta 6.

 

Every time I start working on rebooting Unraid to bare metal, something comes up and I don't have the time to do it... but I am hoping to get to that tonight. I'm also going to try running the scrub tool under Arch Linux and see what happens.

 

If I find out which drives are corrupted by running the scrub command, should I try to convert those over to XFS? Or is there another way to fix the silent corruption issue?

 

Also I REALLY appreciate all of the help that you are providing. Thank you!

Link to comment

If I find out which drives are corrupted by running the scrub command, should I try to convert those over to XFS? Or is there another way to fix the silent corruption issue?

 

I suppose you could try running reiserfsck on the drive in question, but no one really knows how bad the corruption is.
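(The safe first step is a read-only check. On unRAID the array device for disk3 should be /dev/md3, and the array needs to be started in maintenance mode, or the disk otherwise unmounted, before running it; double-check the device number against your own setup:)

reiserfsck --check /dev/md3    # read-only; it reports whether a --fix-fixable or --rebuild-tree pass would be needed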

Before I did anything, I would try to copy/rsync at least 1 disk to a newer XFS disk.

 

At that point you can migrate the other drives to XFS, or attempt to fix the problem disk using reiserfsck.

That's a learning experience in itself. It's like: do I **** or get off the potty?

 

Given the state of maintenance and the future of reiserfs, this may be the impetus for you to get off reiserfs completely.

 

Frankly, we don't know where the corruption came from: beta 7/8, or other ESXi-related crashes causing issues on the filesystem in question.

I.e., did this problem start with reiserfs corruption?

Or did ESXi cause unRAID to fail abnormally, thus causing the reiserfs corruption?

 

I've seen all kinds of power failures cause corruption on filesystems. Usually they fix what they can during an fsck.

 

The only way to 'safely' determine this is to copy what you can to another drive (do not write anything if you do not have to).

Then go bare metal, or go back to unRAID 5, and test the questionable filesystem.

Link to comment


 

With all of the recent reiserfs issues, I've been wanting to move to XFS, and this is my push to do so. I've been baffled by the choice of ReiserFS given that it is no longer "officially" supported. No number of people can really replace the guy (and company) who created it and knew it best.

 

In regards to the corruption, I never had any major issues until after beta 6; this I know for a fact, because I never had a single lockup or freeze until after beta 6. My ESXi box has an APC UPS attached and has only lost power two or three times in the past 5-6 months, due to the power going out in my apartment for 3+ hours and my damn script not shutting Unraid down properly. I have three other VMs on my ESXi box (RDP, media server, and an Arch Linux box) that have never had any corruption issues. ESXi has been rock solid, and I've set it up pretty much just like I've done in the enterprise world (minus SAN/iSCSI... that's coming much later in 2015).

 

I will be rsyncing disk3 over to another drive, formatting it as XFS, and rsyncing the data back. I will then do another scrub test just to verify that it's working as it should.

 

I'm happy with the results from this forum and thank everyone for reading/following along. I also don't want to come across like I'm blaming Unraid for the data corruption, or suggesting nothing was done to resolve it, seeing as it was a widespread issue that was fixed months ago: https://bugzilla.kernel.org/show_bug.cgi?id=83121

Link to comment

Did you use beta 7 or 8? It's known there were reiserfs bugs with silent corruption in those versions.

This could all be stemming from that.

 

I will be rsyncing disk3 over to another drive, formatting it as XFS, and rsyncing the data back.

I will then do another scrub test just to verify that it's working as it should.

 

In the same array? I would not rsync it to another reiserfs disk.

I would acquire a new empty disk, format it as XFS, and rsync from reiserfs to XFS, not back and forth.

Who knows how deep the corruption lies.
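(Something along these lines, where disk14 stands in for whatever slot the new XFS disk ends up in; -X carries the extended attributes, e.g. the bitrot hashes, across and -H keeps any hard links:)

rsync -avPHX /mnt/disk3/ /mnt/disk14/    # reiserfs source to the new XFS target, one direction only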

Link to comment


 

Oh yes! I used beta 7/beta 8, probably downloaded 300GB+ a week, and of course used the mover on a daily basis.

 

The drives that I am using are outside the array, in an actual physical computer not attached to Unraid. I'm not going to risk anything by syncing inside the Unraid array for now.

Link to comment

I created a new unRAID 6b12 VM under ESXi, passed the AOC-SASLP-MV8 through (with its patch added to the VM), rebooted the host, and then fired the unRAID guest up. I'm running on drives that I've run reiserfsck against under 6b6 (so I know they're fine), and starting bitrot (first clearing any extended attributes, then creating new ones) on a user share results in the system hanging after 30 or so minutes (yes, I can connect via SSH, but that's not really useful if I can't access anything or even safely shut down).

Link to comment


 

Not sure if you want to test this, but what happens if you run bitrot against the actual disk, i.e. /mnt/disk1/$SHARE? This bypasses the shfs mount and writes straight to the disk itself.
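(That is, the same invocation as earlier in the thread, just pointed at the disk mount; add the -m filter if you only want certain extensions:)

bitrot.sh -a -p /mnt/disk1/$SHARE/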

 

Link to comment

Alright... Well here's my progress so far:

 

I moved everything off disk3 onto another computer and formatted the disk as XFS. After formatting it, I ran the scrub command for ~30 minutes just to see if it would crash, and it kept going; when running scrub against ReiserFS, Unraid would crash within 2-3 minutes. So I rsynced another drive back onto disk3, and so far it has written 1.45 TB across 2,801,338 writes and is going strong, averaging 50-90 MB/s.

 

I think all of my crashing has to do with the ReiserFS corruption issue, so I'm now in the process of converting every drive I have to XFS. I'm very tempted to go with BTRFS instead, but I'm a little hesitant since it's still considered "experimental." I love living on the bleeding edge, but I don't think I want to deal with any BTRFS issues anytime soon.

 

Anyone here want to chime in about XFS vs BTRFS?

Link to comment
