copy files, wait 30min, hard reboot = data loss [thread -rc16a]



Just a little update: I have Frank up and have run a few small tests to get the script right and make sure no drives are giving errors. (This build was specifically to test my new hardware; I had access to 26 500GB drives, so why not. My real server will use my existing 3TB and 4TB drives.) I have now run a first pass to generate the 512 files and am just waiting for it to generate the md5 hashes. I'll check the files to make sure all the hashes are the same/correct, then delete the test files but leave the md5 file behind. I am going to take the array offline and put it back online just to sync and clear everything. Then I will run the script, wait 30 seconds after it finishes, reset the server, and run the hash checks once it is back up. I am writing to a share with split level set to 99 and most-free-space allocation, which ensures I am writing to all the drives. My testing shows some of the drives end up with a heap more files than others, but it writes to all of them.
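For reference, the core of the write/verify test boils down to something like this. It's only a rough sketch, not my exact script; the source file, paths and count below are placeholders.

#!/bin/bash
# Write COUNT copies of a source file into a user share, then record their md5s
# so they can be re-checked after the hard reset.
SRC=/boot/bzroot              # any reasonably sized file to copy repeatedly (placeholder)
DEST=/mnt/user/crashtest      # user share set to split level 99, most-free allocation (placeholder)
COUNT=512

mkdir -p "$DEST"

# First pass: write the test files.
for i in $(seq 1 $COUNT); do
    cp "$SRC" "$DEST/testfile.$i"
done

# Second pass: record the hashes somewhere that survives the reset (the flash drive).
( cd "$DEST" && md5sum testfile.* > /boot/crashtest.md5 )

# After the server comes back up:
#   cd /mnt/user/crashtest && md5sum -c /boot/crashtest.md5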

 

Stay tuned.

Link to comment

For comparison, how long would a similar test using ext3/4 take to clear the buffers?  Or put another way, how quickly could you pull the plug on an ext3/4 system and observe, or not observe, corruption?  I ask because surely we can't expect better resiliency than that without specifically designing for it (read: instant commits).

 

Or is that not an appropriate benchmark?

 

Since we are trying to resolve the reiserfs problem, doing the ext3/4 test within this context is probably not the best addition to the suite of tests. However, if you repost that question in another thread, I'm sure there will be people who would help test it.

 

 

Well, all I was saying, in line with your post about fair tests, is that if we want a benchmark for what is reasonable to expect from ReiserFS, then maybe that benchmark can be found by looking at another FS.  But I acknowledge that I only know enough to be dangerous, so maybe it is not a valid comparison.

 

Hmmm, OK, I get the idea. I think we were just looking for valid data to be present on reiserfs after an unclean shutdown.  According to the kernel values, and what I observed by monitoring the dirty buffer values, it seems to be around 30 seconds.
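For anyone who wants to check the numbers on their own box, the 30-second figure lines up with the kernel's writeback tunables, which you can read straight out of /proc (values are in centiseconds, so 3000 = 30 seconds; the defaults noted in the comments are the typical stock values):

# Current writeback settings for this system
cat /proc/sys/vm/dirty_expire_centisecs      # typically 3000: dirty data older than 30s becomes eligible for writeback
cat /proc/sys/vm/dirty_writeback_centisecs   # typically 500: flusher threads wake up every 5s
cat /proc/sys/vm/dirty_background_ratio      # % of memory where background writeback kicks in
cat /proc/sys/vm/dirty_ratio                 # % of memory where writers are forced to flush synchronously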

 

nars had tested EXT3 and did not report any corruption.

 

In our other tests, we found that with reiserfs you could wait 3 minutes, 5 minutes, 10 minutes, 30 minutes or even 2 hours and still find corruption and/or missing files on the reiserfs volume.

 

Ok so 30 seconds is the realistic time after which a plug-pull should not result in data loss. 
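(For a deliberate pull, I suppose you can also force the flush first and watch the counters drain; just a sanity check, not a guarantee:)

sync                                          # ask the kernel to flush dirty buffers now
grep -e Dirty: -e Writeback: /proc/meminfo    # both should drop to (near) zero before pulling the plug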

 

Back into lurker-mode :)

Link to comment

Just waiting for the md5 checks to finish running; halfway through with no failures.

 

While you wait, here's Frankie.

 

3vvf.jpg

 

 

That picture makes me cringe. The location of those USB drives is just waiting for someone to make a wrong move and break a plug off in the socket.

Link to comment

Just waiting for the md5 checks to finish running; halfway through with no failures.

 

While you wait, here's Frankie.

 

3vvf.jpg

 

 

That picture makes me cringe. The location of those USB drives is just waiting for someone to make a wrong move and break a plug off in the socket.

 

Yeah, the position of the USB drives is the biggest issue with this server...

Link to comment

Just waiting for the md5 checks to finish running; halfway through with no failures.

 

While you wait, here's Frankie.

 

 

That picture makes me cringe. The location of those USB drives is just waiting for someone to make a wrong move and break a plug off in the socket.

 

Yeah, the position of the USB drives is the biggest issue with this server...

 

 

I thought it was a la-boar-a-tory!  He's gotta spark electric somehow!

Link to comment

I would be more worried about the 'naked' HDDs stacked on top of each other than the USB sticks... don't they have the PCB on the bottom, with a chance of it touching the metallic top of the drive below?

 

Btw, we are spamming, but the topic seems dead anyway with no problems found ;)

Link to comment

I would be more worried about the 'naked' HDDs stacked on top of each other than the USB sticks... don't they have the PCB on the bottom, with a chance of it touching the metallic top of the drive below?

 

Btw, we are spamming, but the topic seems dead anyway with no problems found ;)

 

I double- and triple-checked the HDDs on top, and the PCB sits just below the surface, but they did heat up to 66 degrees C, and when I was taking it apart they were too hot to hold while pulling the cables out.

Link to comment

I would be more worried about the 'naked' HDDs stacked on top of each other than the USB sticks... don't they have the PCB on the bottom, with a chance of it touching the metallic top of the drive below?

 

Btw, we are spamming, but the topic seems dead anyway with no problems found ;)

 

I double- and triple-checked the HDDs on top, and the PCB sits just below the surface, but they did heat up to 66 degrees C, and when I was taking it apart they were too hot to hold while pulling the cables out.

 

Place a small fan to blow on the drives.

Link to comment

The Release Notes wiki page contains a section of MD5s for almost all known UnRAID releases, both official and unofficial, to help users know for sure that they are using an unmodified, untampered-with distribution.  In keeping with that, I am requesting the MD5 hashes for v5.0-rc16 and v5.0-rc16a from any user with a copy of either.  I'd like at least 2 responses for confirmation (either 2 users providing the same MD5, or an MD5 confirmed by a second user) for each of the 2 releases.  Thank you ahead of time.
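If you'd like to reply, just point md5sum at the release zip you actually received; the filenames below are only placeholders, not the real download names:

md5sum unRAID-5.0-rc16.zip     # placeholder name; use your actual rc16 download
md5sum unRAID-5.0-rc16a.zip    # placeholder name; use your actual rc16a download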

Link to comment

My parallel test to one drive on an idle system came back good.

I've passed the nars test and my own parallel test.

This test takes a long time on my system, but it does sort of simulate a fast torrent coming in and scribbling all over the drive.

 

I've attached some snippets from the log.

 

An interesting thing about this is that it took almost 35 seconds for the dirty buffers to reach 0.

On this particular system, across a number of write tests, it was about the same number every time.

It was pretty interesting to watch HighFree, LowFree, Dirty and Writeback through all of this.

 

That command is

watch -n1 grep -e Dirty: -e Writeback: -e HighFree: -e LowFree: /proc/meminfo

 

I have another test I'm planning to do that involves discarding the current array config.

I'll start over and do the write test while the parity generation is active.

 

I think this release is good for the rest of the community to test with.

rc15 & rc15a had me very concerned; this one is solid, with all of our tests confirming it.

 


`/mnt/disk2/crashtestdummy/md5sum' -> `/mnt/disk3/crashtestdummy/md5sum'
Jun 27 20:17:50 /boot/crashtestdummy[1405]: Waiting 1 of 40 seconds. Dirty: 50652
Jun 27 20:17:51 /boot/crashtestdummy[1405]: Waiting 2 of 40 seconds. Dirty: 50652
Jun 27 20:17:52 /boot/crashtestdummy[1405]: Waiting 3 of 40 seconds. Dirty: 50652
Jun 27 20:17:53 /boot/crashtestdummy[1405]: Waiting 4 of 40 seconds. Dirty: 50652
Jun 27 20:17:54 /boot/crashtestdummy[1405]: Waiting 5 of 40 seconds. Dirty: 51084
Jun 27 20:17:55 /boot/crashtestdummy[1405]: Waiting 6 of 40 seconds. Dirty: 892
Jun 27 20:17:56 /boot/crashtestdummy[1405]: Waiting 7 of 40 seconds. Dirty: 892
Jun 27 20:17:57 /boot/crashtestdummy[1405]: Waiting 8 of 40 seconds. Dirty: 892
Jun 27 20:17:58 /boot/crashtestdummy[1405]: Waiting 9 of 40 seconds. Dirty: 892
Jun 27 20:17:59 /boot/crashtestdummy[1405]: Waiting 10 of 40 seconds. Dirty: 892
Jun 27 20:18:00 /boot/crashtestdummy[1405]: Waiting 11 of 40 seconds. Dirty: 892
Jun 27 20:18:01 /boot/crashtestdummy[1405]: Waiting 12 of 40 seconds. Dirty: 892
Jun 27 20:18:02 /boot/crashtestdummy[1405]: Waiting 13 of 40 seconds. Dirty: 892
Jun 27 20:18:03 /boot/crashtestdummy[1405]: Waiting 14 of 40 seconds. Dirty: 892
Jun 27 20:18:04 /boot/crashtestdummy[1405]: Waiting 15 of 40 seconds. Dirty: 892
Jun 27 20:18:05 /boot/crashtestdummy[1405]: Waiting 16 of 40 seconds. Dirty: 892
Jun 27 20:18:06 /boot/crashtestdummy[1405]: Waiting 17 of 40 seconds. Dirty: 892
Jun 27 20:18:07 /boot/crashtestdummy[1405]: Waiting 18 of 40 seconds. Dirty: 892
Jun 27 20:18:08 /boot/crashtestdummy[1405]: Waiting 19 of 40 seconds. Dirty: 892
Jun 27 20:18:09 /boot/crashtestdummy[1405]: Waiting 20 of 40 seconds. Dirty: 892
Jun 27 20:18:10 /boot/crashtestdummy[1405]: Waiting 21 of 40 seconds. Dirty: 892
Jun 27 20:18:11 /boot/crashtestdummy[1405]: Waiting 22 of 40 seconds. Dirty: 892
Jun 27 20:18:12 /boot/crashtestdummy[1405]: Waiting 23 of 40 seconds. Dirty: 892
Jun 27 20:18:13 /boot/crashtestdummy[1405]: Waiting 24 of 40 seconds. Dirty: 892
Jun 27 20:18:14 /boot/crashtestdummy[1405]: Waiting 25 of 40 seconds. Dirty: 892
Jun 27 20:18:15 /boot/crashtestdummy[1405]: Waiting 26 of 40 seconds. Dirty: 892
Jun 27 20:18:16 /boot/crashtestdummy[1405]: Waiting 27 of 40 seconds. Dirty: 892
Jun 27 20:18:17 /boot/crashtestdummy[1405]: Waiting 28 of 40 seconds. Dirty: 892
Jun 27 20:18:18 /boot/crashtestdummy[1405]: Waiting 29 of 40 seconds. Dirty: 892
Jun 27 20:18:19 /boot/crashtestdummy[1405]: Waiting 30 of 40 seconds. Dirty: 892
Jun 27 20:18:20 /boot/crashtestdummy[1405]: Waiting 31 of 40 seconds. Dirty: 864
Jun 27 20:18:21 /boot/crashtestdummy[1405]: Waiting 32 of 40 seconds. Dirty: 864
Jun 27 20:18:22 /boot/crashtestdummy[1405]: Waiting 33 of 40 seconds. Dirty: 864
Jun 27 20:18:24 /boot/crashtestdummy[1405]: Waiting 34 of 40 seconds. Dirty: 864
Jun 27 20:18:25 /boot/crashtestdummy[1405]: Waiting 35 of 40 seconds. Dirty: 864
Jun 27 20:18:26 /boot/crashtestdummy[1405]: Waiting 36 of 40 seconds. Dirty: 0
Jun 27 20:18:27 /boot/crashtestdummy[1405]: Waiting 37 of 40 seconds. Dirty: 0
Jun 27 20:18:28 /boot/crashtestdummy[1405]: Waiting 38 of 40 seconds. Dirty: 0
Jun 27 20:18:29 /boot/crashtestdummy[1405]: Waiting 39 of 40 seconds. Dirty: 0
Jun 27 20:18:30 /boot/crashtestdummy[1405]: Waiting 40 of 40 seconds. Dirty: 0
Jun 27 20:18:31 /boot/crashtestdummy[1405]: ok crash the system now




root@unRAID2:/boot # cd /mnt/disk1/crashtestdummy && md5sum -c md5sum
...
bzroot.516.511: OK
bzroot.516.512: OK


root@unRAID2:/boot# cd /mnt/disk2/crashtestdummy && md5sum -c md5sum
...
bzroot.516.512: OK


root@unRAID2:/boot# cd /mnt/disk3/crashtestdummy && md5sum -c md5sum
bzroot.516.1: OK
...
bzroot.516.511: OK
bzroot.516.512: OK

 

 

If anyone is interested, here's the script snippet showing how I monitor the dirty buffers at the end of my script.


let i=0
let m=40
# Poll once per second for up to $m seconds, or until a key is pressed
# (read -n1 -t1 times out after 1 second unless a key is hit).
while ! read -n1 -t1
do
   [[ $(( i+=1 )) -gt $m ]] && break
   echo -e "`date '+%b %d %T'` $0[$$]: Waiting $i of $m seconds. Dirty: \c"
   awk ' /Dirty:/ { printf("%s\n",$2); } ' /proc/meminfo    # current dirty pages, in kB
done

echo "`date '+%b %d %T'` $0[$$]: ok crash the system now"

Link to comment

The Release Notes wiki page contains a section of MD5s for almost all known UnRAID releases, both official and unofficial, to help users know for sure that they are using an unmodified, untampered-with distribution.  In keeping with that, I am requesting the MD5 hashes for v5.0-rc16 and v5.0-rc16a from any user with a copy of either.  I'd like at least 2 responses for confirmation (either 2 users providing the same MD5, or an MD5 confirmed by a second user) for each of the 2 releases.  Thank you ahead of time.

 

From the e-mail from Tom (he supplied the MD5s when he forwarded the links to the pre-releases):

 

rc16  = md5: aa1ef29ca4fb2068a069591ab313c421

rc16a = md5: 4289035a041fe2dfcbb0c3bb39628589

 

Joe L.

 

Link to comment

I think this release is good for the rest of the community to test with.

rc15 & rc15a had me very concerned; this one is solid, with all of our tests confirming it.

I could be wrong, but I don't think Tom wants to release these to the general public with his code patches until either the code maintainers accept his patches or issue comparable patches. Certain people ... can get huffy about source code issues.
Link to comment

I think this release is good for the rest of the community to test with.

rc15 & rc15a had me very concerned; this one is solid, with all of our tests confirming it.

I could be wrong, but I don't think Tom wants to release these to the general public with his code patches until either the code maintainers accept his patches or issue comparable patches. Certain people ... can get huffy about source code issues.

I think I'd wait for Tom to deal with the issue.  He's always played fair with the GPL source for the "md" driver.  His prior comments are just that the "official" source for the reiserfs patch should come from the "official" reiserfs maintainer.  (He would not have gotten the kind of cooperation he has if he did not have the SUSE developer's respect for following the guidelines of the GPL.)  Actually, I'm very impressed at the rapid response and cooperation he received from the SUSE developer.

 

Joe L.

Link to comment

So what ver should a user be on right now? Is 16a stable enough?

Version 16a has only been distributed to a limited number of testers.  I do not think it has been released to the general public (or know whether it even will be).

From all reports so far, it is working very well with respect to the bug identified in this thread.

 

I also know from reading the other threads Tom has just found the "cache drive will not spin down" bug, so he might release one more "rc" version before releasing 5.0.

Or he might just decide there is little enough at risk to release 5.0 as-is.  We'll just have to wait and see.

 

Joe L.

Link to comment

I think this release is good for the rest of the community to test with.

rc15 & rc15a had me very concerned; this one is solid, with all of our tests confirming it.

I could be wrong, but I don't think Tom wants to release these to the general public with his code patches until either the code maintainers accept his patches or issue comparable patches. Certain people ... can get huffy about source code issues.

I think I'd wait for Tom to deal with the issue.  He's always played fair with the GPL source for the "md" driver.  His prior comments are just that the "official" source for the reiserfs patch should come from the "official" reiserfs maintainer.  (He would not have gotten the kind of cooperation he has if he did not have the SUSE developer's respect for following the guidelines of the GPL.)  Actually, I'm very impressed at the rapid response and cooperation he received from the SUSE developer.

 

Joe L.

 

Let me clarify a little. I apologize if my comment seems a little out of line.

I have no push or pull on whether this release goes out or not.

My comment is based on giving a nod to Tom if he wants to widen the release candidate testing.

I would have protested very strongly about rc15a and rc16 going out as a general release without a strong warning.

I was very concerned, even though comments were made that it's not a big issue if you shut down gracefully.

Things happen, and if we can do something to help prevent data loss, we should.

 

So I tip my hat to this release and "this bug".

I have not been involved in other bugs yet.

Normally I don't get so involved in the betas and release candidates.

As you may have noticed I felt very strongly about this one.

Had I found an ounce of corruption or unreliability that I could not explain, I would have protested strongly.

 

I'm still doing one further test where I am beating the hell out of my array.

The drives are thrashing like crazy, but I have to be sure my data is safe once it's placed on the server.

 

I suppose a good thing that came out of all this is a new wave of regression testing.

I have ideas for a new way to write out the data and compute the md5 in parallel (rough sketch below).

Test the array really hard.
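Something along these lines is what I have in mind. It's only a sketch (the filenames and paths are made up), and I still need to prove it out against the array:

# Write the data and compute its md5 in a single pass:
# tee copies stdin to the destination file while the pipe feeds md5sum.
SRC=/boot/bzroot                               # placeholder source file
DEST=/mnt/disk1/crashtestdummy/bzroot.copy     # placeholder destination on the array

tee "$DEST" < "$SRC" | md5sum | awk -v f="$DEST" '{ print $1 "  " f }' >> /boot/parallel.md5

# Later, verify against what actually landed on disk:
#   md5sum -c /boot/parallel.md5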

There's a reason they give me things to test at work. I usually break them!

Link to comment

 

Just waiting for the md5 checks to finish running; halfway through with no failures.

 

While you wait, here's Frankie.

 

3vvf.jpg

 

 

That picture makes me cringe. The location of those USB drives is just waiting for someone to make a wrong move and break a plug off in the socket.

 

Yeah, the position of the USB drives is the biggest issue with this server...

 

 

Haha, touché.

Link to comment

I would be more worried about the 'naked' HDDs stacked on top of each other than the USB sticks... don't they have the PCB on the bottom, with a chance of it touching the metallic top of the drive below?

 

Btw, we are spamming, but the topic seems dead anyway with no problems found ;)

 

I've blown up two hard drives (in a row) by accidentally shorting out the PCB when I rested them on top of a metal drive cage. Won't make that mistake again.

Link to comment