binhex Posted June 28, 2013 Shouldn't Frankie actually be called Frankie's monster? After all, Frankenstein created the monster :-) just a small point hehe
WeeboTech Posted June 28, 2013 My last test of writing to the array in parallel to the areas that are being parity-synced was successful. This test consisted of: clearing all file systems of data; creating a 4GB file (so it would be on the outermost part of the drive); unassigning parity; assigning parity; starting a parity check; and immediately starting two cp's of the 4GB file in parallel to the other two disks. When the parity sync was almost at 4GB, I fired off my other crashtestdummy parallel-write program on the drive that had the first 4GB file. This ensured I wrote at least 20GB to the array during a parity sync, to the same areas that were being parity-synced. This morning all MD5s check out good.
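The test above can be sketched roughly as follows. This is an illustrative stand-in, not WeeboTech's actual commands: the real run used a 4GB file on /mnt/disk* drives during a live parity sync, while here the file is small and everything lives in a temp directory so the sketch is self-contained.

```shell
#!/bin/sh
# Sketch of the write-during-parity-sync verification pattern:
# checksum a source file, copy it in parallel, then confirm every
# copy still matches the original checksum.
set -e
work=$(mktemp -d)

# 1. create the source file (stand-in for the 4GB file on the outer tracks)
dd if=/dev/urandom of="$work/src.bin" bs=1M count=16 2>/dev/null

# 2. record its checksum before any parallel activity
ref=$(md5sum "$work/src.bin" | awk '{print $1}')

# 3. fire off two cp's in parallel (stand-ins for the copies to disk2/disk3)
cp "$work/src.bin" "$work/copy1.bin" &
cp "$work/src.bin" "$work/copy2.bin" &
wait

# 4. verify every copy still matches the original checksum
for f in "$work"/copy*.bin; do
    sum=$(md5sum "$f" | awk '{print $1}')
    [ "$sum" = "$ref" ] && echo "$f OK" || echo "$f MISMATCH"
done
rm -rf "$work"
```

The point of checksumming before and after is that it catches silent corruption, not just outright I/O errors.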
archedraft Posted June 28, 2013 This ensured I wrote at least 20GB to the array during a parity sync, to the same areas that were being parity-synced. This morning all MD5s check out good. Wow, that's impressive. I have always been scared to do anything during a parity sync, but I might be less hesitant in the future.
WeeboTech Posted June 28, 2013 Wow, that's impressive. I have always been scared to do anything during a parity sync, but I might be less hesitant in the future. It took a very long time. My parity sync speeds dropped to 15MB/s while all the writes were being done in parallel. At the very least, everything checked out!
nars Posted June 28, 2013 Nice stress tests of the whole thing, not only of reiserfs but of Tom's modified md driver... really good to know all went fine. There is one more test you could eventually do: repeat it on /mnt/user to test such stress on shfs.
kizer Posted June 28, 2013 It's not been said, but I'm sure a lot of us are thinking it. Thank you all for taking all this time to squash these bugs, using your own equipment and attempting to wrap your minds around these issues.
binhex Posted June 28, 2013 It's not been said, but I'm sure a lot of us are thinking it. Thank you all for taking all this time to squash these bugs, using your own equipment and attempting to wrap your minds around these issues. +1
WeeboTech Posted June 28, 2013 There is one more test you could eventually do: repeat it on /mnt/user to test such stress on shfs I thought mejutty tested using the user share filesystem. I was going to do it last night, but my drives are not equal in size: all the files would go to the first 4TB drive and none to the second and third 3TB drives. Although, his test wasn't during a parity sync.
mejutty Posted June 28, 2013 There is one more test you could eventually do: repeat it on /mnt/user to test such stress on shfs All of my testing has been done on a user share so I could write to all 20 disks at the same time. Let me tell you, the system becomes pretty unresponsive for a while when it is writing out; even just an ls command run on the directory that is being written to locks up and takes a good 30-45 seconds to start responding. It does appear to respond normally after the initial lag of the first ls command. It does not seem to like 512 simultaneous copies being done to it over 20 data disks.
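For context, the "512 simultaneous copies" pattern looks roughly like this. A minimal sketch only: the real test targeted /mnt/user across 20 disks with 512 copies, while here N is scaled down and the target is a temp directory so the snippet is safe to run anywhere.

```shell
#!/bin/sh
# Parallel-copy stress pattern: copy one source file to N targets at
# once by backgrounding each cp, then wait for all of them to finish.
N=32                                # real test used 512
src=$(mktemp)
dst=$(mktemp -d)                    # stand-in for a /mnt/user share path
dd if=/dev/urandom of="$src" bs=1M count=1 2>/dev/null

i=1
while [ "$i" -le "$N" ]; do
    cp "$src" "$dst/file.$i" &      # each copy runs as its own background job
    i=$((i + 1))
done
wait                                # block until all N copies complete

ls "$dst" | wc -l                   # prints N once all copies landed
rm -rf "$src" "$dst"
```

Backgrounding every cp means all N copies contend for the filesystem at once, which is exactly what produces the 30-45 second ls stalls described above.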
WeeboTech Posted June 28, 2013 It does not seem to like 512 simultaneous copies being done to it over 20 data disks. There's a lot going on! On my system, it's responsive until I try to do something with the browser interface or on the directory being written to. Other than that it's been responsive. However, I am writing 512 files to one fast disk: 4TB with a 4TB parity. When the parity sync is in operation, it's a lot slower; there are delayed responses, but it does respond. I only have 4 drives in my system. On my torrent server I usually set some kernel options to cache/buffer more, and the system is more responsive. Yes, there's a potential for file loss, but with torrents I can double-check them.
mejutty Posted June 28, 2013 Ok, it's much worse now that I have parity (not the bug, just responsiveness). I can browse the disk shares and see the folder and files being written, but if I try to browse the share itself it's not responding, either from the terminal window or the Windows machine. I'm not surprised, really: copying 1 single file from the flash drive to 512 different files spread across 21 data disks and parity. The initial writing of the files has been going for 15 minutes now and the share has been unresponsive the whole time.
WeeboTech Posted June 28, 2013 The initial writing of the files has been going for 15 minutes now and the share has been unresponsive the whole time. On my system it took over an hour to write the 512 files to one disk. I would expect the user share to be bogged down; this is unlike any normal test. What are your CPU and memory? I bet if you run top you will see some high %wa (I/O wait) times.
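One way to read that I/O-wait figure without watching top interactively is straight from /proc/stat. Note the caveat in the comments: this is the since-boot average, whereas top's %wa is an instantaneous sample, so during a stress run top will show much higher numbers.

```shell
#!/bin/sh
# The "cpu" line of /proc/stat holds cumulative jiffy counters; after
# the label, field 6 is iowait. Printing it as a share of the total
# gives a coarse since-boot I/O-wait percentage (Linux-specific).
awk '/^cpu /{
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    printf "iowait since boot: %.1f%%\n", 100 * $6 / total
}' /proc/stat
```

Sampling the counters twice a second apart and diffing them is how top arrives at its live %wa value.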
limetech Posted June 28, 2013 Author The initial writing of the files has been going for 15 minutes now and the share has been unresponsive the whole time. Responsiveness of I/O while a parity sync/check is in progress is something I can improve post-5.0-final.
mejutty Posted June 28, 2013 I do not have a parity check going atm; this is just the 512 parallel copies running. I've actually never had many issues while running a parity check. I'm going to test the command next out of RAM instead of from the flash disk as well. 95.6%wa, load average: 405.22, 406.78, 331.38. It was much faster yesterday doing the command when there was no parity (of course); it only took 5 minutes to run, but then again my md5 checks only took 10-15 minutes.
WeeboTech Posted June 28, 2013 I doubt the flash is what's slowing this down. Once the file is read, it's in memory; you could copy it to /tmp and it will be in memory also. Parity is what is slowing it all down. 95.6%wa, load average: 405.22, 406.78, 331.38: not unexpected. LOL! Those are some high loads! I didn't even think to check mine.
mejutty Posted June 28, 2013 Took 45 minutes just re-running from a folder I created, /usr/test. The shared folder was completely unresponsive for the whole test. Disk shares remained fine and responsive; just the user share was affected.
nars Posted June 28, 2013 All of my testing has been done on a user share so I could write to all 20 disks at the same time. Sorry, I missed that your test was on a user share. Very good to know, and thank you both mejutty and WeeboTech for your extensive tests; you've gone a lot deeper than me with them, and I would really be unable to run such extensive tests as I have no spare bare-metal system available for testing.
mr-hexen Posted June 28, 2013 So what version should a user be on right now? Is 16a stable enough? Ver 16a has only been distributed to a limited number of testers; I do not think it has been released to the general public (or even if it will be). From all reports so far, it is working very well with respect to the bug identified in this thread. I also know from reading the other threads that Tom has just found the "cache drive will not spin down" bug, so he might release one more "rc" version before releasing 5.0. Or, he might just decide there is little enough risk to just release 5.0. We'll just have to wait and see. Joe L. So what version should a non-rc16 user stick with until release?
mejutty Posted June 28, 2013 All good: md5 checks all done in 15 minutes, passed without error. I can't think of any other ways to torture test it. As far as I can see the bug is gone; the array will start and stop at will (I have sat there and just put it online and offline 20 times as fast as it will allow), transfer speeds are what I would expect, and parity checks with these slow drives are as fast as I would expect. Not that I have tested for anything else, but it all looks the goods.
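Batch MD5 checks like the ones reported through this thread are usually done with a manifest: checksum everything before the stress run, then let `md5sum -c` compare the whole set afterwards. A hedged sketch (the directory and file names are illustrative, not the testers' actual paths):

```shell
#!/bin/sh
# Manifest-style verification: record checksums before the run,
# re-verify the entire set with one command after it.
set -e
dir=$(mktemp -d)                    # stand-in for the test data directory
for i in 1 2 3; do printf 'payload %s' "$i" > "$dir/file.$i"; done

( cd "$dir" && md5sum file.* > manifest.md5 )   # before the stress run
# ... stress test runs here ...
( cd "$dir" && md5sum -c manifest.md5 )         # after: prints "file.N: OK" per file
rm -rf "$dir"
```

`md5sum -c` exits non-zero if any file fails, so the check also scripts cleanly into a pass/fail result.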
garycase Posted June 28, 2013 The unresponsiveness during that massive write is understandable -- writing to all those disks, with parity, means you have something like 88 pending disk I/Os for each write!! The important thing is there were no errors!! I'd say that test confirms that 16a is ready for "final" status with just a couple of tweaks that Tom has mentioned in other threads.
JustinChase Posted June 28, 2013 It's not been said, but I'm sure a lot of us are thinking it. Thank you all for taking all this time to squash these bugs, using your own equipment and attempting to wrap your minds around these issues. +1 The important thing is there were no errors!! I'd say that test confirms that 16a is ready for "final" status with just a couple of tweaks that Tom has mentioned in other threads.
mejutty Posted June 30, 2013 I cannot reproduce the test, as I have dissected Frankenstein to build him into my real server.
RockDawg Posted July 2, 2013 So what's the status of all this? Are we to expect rc17 soon, 5.0 final soon, or are we in a holding pattern?
garycase Posted July 2, 2013 So what's the status of all this? Are we to expect rc17 soon, 5.0 final soon, or are we in a holding pattern? My question as well. Several folks who are doing new builds have asked what version they should use. They can't use v4.7, as they've bought 3 or 4 TB disks; but they've also read the thread about the potential data loss issues with RC15a, so they're skittish. Tom: you really should release an RC17 with the fix that was in RC16a and the other couple of "tweaks" you were planning for v5, as a "placeholder" for v5 final (which I assume you're holding up until the ReiserFS maintainer "blesses" your patch or provides one of his own). The data loss potential in RC15a is causing a lot of concern.
mejutty Posted July 2, 2013 I personally have tested the 2 unreleased builds as well as the rest of the RCs, and on my "Live" server I currently run RC10.