alg_diff_standard MMX patch
Introduction
This patch adds MMX assembler code to the alg_diff_standard function. The benefit is a major speed improvement, as can be seen below. The old non-MMX code remains in the function and handles the remaining pixels when the resolution is not divisible by 8.
Note: This patch is experimental for now, please apply with care!
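The split between the fast path and the old scalar path can be pictured roughly as follows. This is only a sketch in plain C, not code from the patch: the function name count_changed_pixels, its parameters, and the inner 8-pixel loop (which merely stands in for the MMX instructions) are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the patch: counts the pixels whose
 * absolute difference between ref and cur exceeds the noise level. */
static int count_changed_pixels(const unsigned char *ref,
                                const unsigned char *cur,
                                size_t n, int noise)
{
    int diffs = 0;
    size_t i = 0;

    /* Main loop: whole blocks of 8 pixels. An MMX register holds 8 bytes,
     * so this is the part an MMX version would process in one go; here a
     * plain inner loop stands in for the assembler. */
    for (; i + 8 <= n; i += 8) {
        for (size_t j = 0; j < 8; j++) {
            int d = ref[i + j] - cur[i + j];
            if (d < 0)
                d = -d;
            if (d > noise)
                diffs++;
        }
    }

    /* Tail loop: the 0 to 7 pixels left over when n is not divisible by 8
     * are handled one at a time, like the old non-MMX code. */
    for (; i < n; i++) {
        int d = ref[i] - cur[i];
        if (d < 0)
            d = -d;
        if (d > noise)
            diffs++;
    }
    return diffs;
}
```

For an 11-pixel buffer, the first 8 pixels go through the block loop and the last 3 through the tail loop, so odd image sizes still give a complete result.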
Bugs
There are two bugs in the old alg_diff_standard that are fixed in this patch:
- The cast to char in unsigned register char curdiff=(int)(abs((char)(*ref-*new))); truncates the result from the subtraction.
- The cast to char in curdiff=((int)((char)curdiff * *mask++)/255); truncates curdiff.
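To see why these casts matter, here is a small sketch in C. It is not code from the patch: the function names are made up for illustration, and signed char is used in place of char to make the signedness explicit (plain char being signed is an assumption about the compiler, although it holds on typical x86 builds).

```c
#include <stdlib.h>

/* Old-style difference: the cast narrows the subtraction result to 8 bits
 * before abs() sees it. Converting an out-of-range value to signed char is
 * implementation-defined; common compilers wrap it modulo 256. */
static int diff_with_char_cast(unsigned char ref, unsigned char cur)
{
    return abs((signed char)(ref - cur));
}

/* Without the cast the subtraction is done in int, so the full range of
 * differences 0..255 survives. */
static int diff_without_cast(unsigned char ref, unsigned char cur)
{
    return abs(ref - cur);
}

/* Old-style mask scaling: the cast turns a large curdiff into a negative
 * number before it is multiplied by the mask value. */
static int mask_with_char_cast(unsigned char curdiff, unsigned char mask)
{
    return ((int)((signed char)curdiff * mask)) / 255;
}

/* Mask scaling without the cast: curdiff keeps its value 0..255. */
static int mask_without_cast(unsigned char curdiff, unsigned char mask)
{
    return (curdiff * mask) / 255;
}
```

With ref = 200 and new = 10, the cast version yields 66 instead of 190 on a typical two's complement compiler. With curdiff = 200 and a mask value of 128, the cast version yields -28 instead of 100. Small differences (below 128) come out the same either way, which may explain why the bugs were not obvious.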
Description of Patch
The patch contains a new header file, mmx.h (copied from the ffmpeg distribution), and some extra MMX code in alg_diff_standard. The MMX code is thoroughly commented and is hopefully not too hard to understand :-).
Let's look at some numbers.
Default case
In the default case, there is no mask, and the smartmask feature is not active. This is the output from my test program:
5000 iterations of old_alg_diff_standard w/ img size 320x240: 43614 ms => 8.72 ms/iter
5000 iterations of new_alg_diff_standard w/ img size 320x240: 8774 ms => 1.75 ms/iter
Testing accuracy of new_alg_diff_standard compared to old_alg_diff_standard; 1000 iterations with image size 320x240:
differed on avg in 0.00% of the pixels
Note that this is the output from a single run of the test program. Because of caching and system load, the numbers may differ slightly in a second run. Still, they show that the MMX version needs about 80% less time per iteration than the non-MMX version (1.75 ms versus 8.72 ms, roughly 5 times faster). The two versions also produce identical results (0.00% differing pixels).
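The test program itself is not shown on this page. As a rough idea of how the reported figures can be derived, here is a minimal sketch in C; all three helper names are hypothetical and the real test program may compute things differently.

```c
#include <stddef.h>

/* Hypothetical: average time per iteration, as in "=> 8.72 ms/iter". */
static double ms_per_iter(double total_ms, int iterations)
{
    return total_ms / iterations;
}

/* Hypothetical: percentage of run time saved by the new version. */
static double time_saved_percent(double old_ms, double new_ms)
{
    return 100.0 * (1.0 - new_ms / old_ms);
}

/* Hypothetical: percentage of pixels at which two output images differ. */
static double percent_differing(const unsigned char *a,
                                const unsigned char *b, size_t n)
{
    size_t mismatches = 0;

    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            mismatches++;
    }
    return 100.0 * (double)mismatches / (double)n;
}
```

For example, ms_per_iter(43614, 5000) gives about 8.72, and time_saved_percent(8.72, 1.75) gives just under 80.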
With mask
Adding the use of a (static) mask, the performance numbers are as follows:
3500 iterations of old_alg_diff_standard w/ img size 320x240: 41268 ms => 11.79 ms/iter
3500 iterations of new_alg_diff_standard w/ img size 320x240: 7791 ms => 2.23 ms/iter
Testing accuracy of new_alg_diff_standard compared to old_alg_diff_standard; 1000 iterations with image size 320x240:
differed on avg in 0.00% of the pixels
In other words, the MMX version also needs about 80% less time when using a static mask (2.23 ms versus 11.79 ms per iteration). Here too, the two versions produce identical results.
With mask and smartmask
My test program does not emulate the smartmask. Thus, I had to use profiling instead. Here is an excerpt from using the non-MMX version (when using both mask and smartmask, i.e. the "worst" case):
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
37.47     33.24     33.24      280   118.72   118.72  alg_diff_standard
Here is an excerpt from using the MMX version:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
...
17.21     24.80     11.55      251    46.02    46.02  alg_diff_standard
These numbers show about 60% less time per call (46.02 ms versus 118.72 ms) when using both mask and smartmask. Note that these figures cannot be compared directly to the ones from my test program: because of profiling overhead, they are much higher than when not profiling.
I have tested the accuracy of the smartmask code by tweaking Motion to run both the non-MMX version and the MMX version of alg_diff_standard in parallel. In all my tests, the two versions produced the same contents in both imgs.out and imgs.smartmask_buffer.
Installation of Patch
Note: This patch is experimental for now, please apply with care!
The installation is very straightforward:
- tar xzf motion-3.1.18_snap8.tar.gz
- cd motion-3.1.18
- zcat ../motion-3.1.18_snap8-algdiffstd_v2.patch.gz | patch -p1
- ./configure and make
I should add this: The patch is experimental because I haven't really tested the smartmask part (I normally don't use smartmask, so I don't know how to test it).
I have tested the non-mask case and using an ordinary mask, though. Both cases seem to work fine.
--
PerJonsson - 31 Dec 2004
I have installed your patch and it has been running for 2 days without problems. Smartmask is also running fine as far as I can see.
Thank you for this major performance improvement!
--
JoergWeber - 02 Jan 2005
Sounds great, thanks for testing!
Update: I found a bug in my test program - it reported too high precision loss when using a mask. The real figure is 0.05% mismatching pixels instead of 0.30%.
Just for the fun of it, though, I'm working on a version of the patch without precision loss in the mask application.
--
PerJonsson - 02 Jan 2005
Don't waste too much time and CPU on it. It is already much more precise than necessary. If the result differs 1/255... who cares:-)
BTW: smartmask only uses set or not set for a pixel.
--
JoergWeber - 02 Jan 2005
Too late, I already did it
I found a bug as well. Haven't fixed it yet, but I'm working on it!
--
PerJonsson - 03 Jan 2005
I have uploaded a new version of the patch. I fixed the two bugs listed above and also some bugs in the MMX code. Moreover, there is no longer a loss in precision (compared to a bugfixed version of the old code) when running with a static mask.
Joerg, the smartmask code works in my tests, but I would appreciate if you could test it as well.
--
PerJonsson - 03 Jan 2005
Even though the testing is limited, I included the patch in 3.1.18_snap9 and changed status. Otherwise I lose track of what is included and what is not.
Again. Great job.
--
KennethLavrsen - 03 Jan 2005
Per, I have had it running for a few hours and cannot see any bad side effect on smartmask. It's working well.
--
JoergWeber - 04 Jan 2005
I'm trying to reduce motion's load on my CPU and this seems like the right step forward. Has this been added to motion already? Is it still being looked at?
--
BobSaggeth - 30 Sep 2007
It was added in 3.1.18.
You can see this in the form at the bottom
--
KennethLavrsen - 30 Sep 2007