x264 - OpenSource AVC/H.264 Video Codec

Schotenhüter

r2183

Quote

r2183
Sliced-threads: do hpel and deblock after returning
Lowers encoding latency around 14% in sliced threads mode with preset superfast.
Additionally, even if there is no waiting time between frames, this improves parallelism, because hpel+deblock are done during the (singlethreaded) lookahead.
For ease of debugging, dump-yuv forces all of the threads to wait and finish instead of setting b_full_recon.

r2182
Add full-recon API option
Fully reconstruct frames even without dump-yuv.

r2181
x86inc: switch to amdnops
Recent AMD CPUs' instruction decoders choke horribly on extremely long nops (i.e. with 4 prefixes).
Won't affect much, since we don't use ALIGN much.

r2180
BMI1 decimate functions
Intel was nice enough to make tzcnt equal to "rep bsf", which is backwards-compatible.
This means we don't actually have to add new functions to make it work.

r2179
Minor asm changes

r2178
Add row-reencoding support to VBV for improved accuracy
Extremely accurate, possibly 100% so (I can't get it to fail even with difficult VBVs).
Does not yet support rows split on slice boundaries (occurs often with slice-max-size/mbs).
Still inaccurate with sliced threads, but better than before.

r2177
Abstract bitstream backup/restore functions
Required for row re-encoding.

r2176
Add a small per-MB cost penalty for lowres
Helps avoid VBV predictors going nuts with very low-cost MBs.
One particular case this fixes is zero-cost MBs: adaptive quantization decreases the QP a lot, but (before this patch), no cost penalty gets factored in for this, because anything times zero is zero.
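
The arithmetic behind the zero-cost case can be sketched as follows. This is only an illustration, not x264's code: the function name, the 8.8 fixed-point scale factor and the penalty value are all assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical model: a lowres macroblock cost is multiplied by a factor
 * derived from the AQ offset (scale_q8 is 8.8 fixed point, 256 == 1.0).
 * A small constant penalty is added before scaling. */
int scaled_mb_cost(int cost, int scale_q8, int penalty)
{
    assert(cost >= 0 && scale_q8 >= 0 && penalty >= 0);
    /* Without a penalty, cost == 0 stays 0 whatever the scale is. */
    return ((cost + penalty) * scale_q8 + 128) >> 8;
}
```

With a penalty of 0, a zero-cost macroblock stays at zero regardless of the scale; with any non-zero penalty, the scale factor is reflected in the result again.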

r2175
Remove explicit run calculation from coeff_level_run
Not necessary with the CAVLC lookup table for zero run codes.

r2174
Export PSNR/SSIM in x264 API

r2173
x86inc: support yasm -f win64
Not necessary for x264, as -m amd64 already does the right thing, but used by external users of x86inc.

r2172
Fix incorrect zero-extension assumptions in x86_64 asm
Some x264 asm assumed that the high 32 bits of registers containing "int" values would be zero.
This is almost always the case, and it seems to work with gcc, but it is *not* guaranteed by the ABI.
As a result, it breaks with some other compilers, like Clang, that take advantage of this in optimizations.
Accordingly, fix all x86 code by using intptr_t instead of int or using movsxd where necessary.
Also add checkasm hack to detect when assembly functions incorrectly assume that 32-bit integers are zero-extended to 64-bit.
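
A small C model of why the assumption matters, using a negative stride as the example. The function names are made up for illustration. The real problem is broader, since the upper 32 bits may hold arbitrary garbage rather than just zeros.

```c
#include <assert.h>
#include <stdint.h>

/* What the asm effectively computes if it assumes the upper 32 bits of a
 * register holding an int are zero. */
int64_t assume_zero_extended(int32_t v)
{
    return (int64_t)(uint32_t)v;
}

/* What movsxd, or an intptr_t parameter, provides: a sign-extended value. */
int64_t sign_extended(int32_t v)
{
    return (int64_t)v;
}

/* Byte offset of a row, as used in pointer arithmetic on a plane. */
int64_t row_offset(int64_t stride, int32_t row)
{
    assert(row >= 0);
    return stride * row;
}
```

With a stride of -32 and row 2, the sign-extended path gives -64, while the zero-extended path gives 8589934528, which would point far outside the buffer.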

r2171
Fix possible alignment crash when linking from MSVC
x264_cavlc_init needs to be stack-aligned now.

r2170
Fix rare overflow in 10-bit intra_satd_x3_16x16 asm

r2169
ICL: fix out of tree building and resource file usage on Windows

r2168
Add error handling for out-of-tree build

r2167
Fix RGB colorspace input
BGR/BGRA input was correct.

r2166
Fix interlaced + extremal slice-max-size
Broke if the first macroblock in the slice exceeded the set slice-max-size.

r2165
Fix regression in r2141
Broke register preservation in x264_cpu_cpuid and x264_cpu_xgetbv.
Did not cause any problems.

sneaker2

r2184

Quote

Fix clobbering of mutex/cvs
Regression in r2183.
Bizarrely seemed to work on many platforms, but crashed on win64 and may have been slower.
Only affected sliced threads during encoding, but could cause crashes on x264 encoder close even without sliced threads.

Schotenhüter

2197

Quote

r2197
Add mb_info API for signalling constant macroblocks
Some use-cases of x264 involve encoding video with large constant areas of the frame.
Sometimes, the caller knows which areas these are, and can tell x264.
This API lets the caller do this and adds internal tracking of modifications to macroblocks to avoid problems.
This is really only suitable without B-frames.
An example use-case would be using x264 for VNC.
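
On the caller side, building such a map might look roughly like the sketch below. It compares only the luma plane of two frames in 16x16 blocks. MBINFO_CONSTANT_FLAG and the function name are stand-ins invented for this example; the actual flag name and the way the array is passed to the encoder are defined in x264.h and are not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MB_SIZE 16
#define MBINFO_CONSTANT_FLAG 1 /* hypothetical stand-in for the library's flag */

/* Sets one byte per macroblock: flag set if the 16x16 luma block is identical
 * in prev and cur. Returns the number of constant macroblocks. */
int mark_constant_mbs(const uint8_t *prev, const uint8_t *cur,
                      int width, int height, int stride, uint8_t *mb_info)
{
    assert(width % MB_SIZE == 0 && height % MB_SIZE == 0 && stride >= width);
    int mb_w = width / MB_SIZE, mb_h = height / MB_SIZE, count = 0;
    for (int my = 0; my < mb_h; my++)
        for (int mx = 0; mx < mb_w; mx++) {
            int constant = 1;
            /* Compare the block row by row and stop at the first difference. */
            for (int y = 0; y < MB_SIZE && constant; y++) {
                size_t off = (size_t)(my * MB_SIZE + y) * stride + mx * MB_SIZE;
                constant = !memcmp(prev + off, cur + off, MB_SIZE);
            }
            mb_info[my * mb_w + mx] = constant ? MBINFO_CONSTANT_FLAG : 0;
            count += constant;
        }
    return count;
}
```

A real caller would also have to compare the chroma planes, and a VNC-style application would more likely take the changed regions from its own damage tracking than from a pixel comparison.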

r2196
Faster chroma weight cost calculation

New assembly function with SSE2, SSSE3 and XOP implementations for calculating absolute sum of differences.

r2195
Add Level 5.2 support

r2194
Eradicate all mention of Extended Profile
x264 never supported it and never will because nobody uses it.

r2193
Fix disabling of mbtree when using 2pass encoding and zones

r2192
configure: force select -mXX gcc option for i386/x86-64
Makes multilib compilation more convenient.

r2191
Update config.guess and config.sub
Adds support for a bunch of targets, including:
aarch64 (armv8)
arm-linux-androideabi

r2190
configure: correct use of RC variable and add --extra-rcflags

r2189
ICL/MSVS: Fix shared library generation and usage
MSVS requires exported variables to be declared with the DATA keyword, and requires that imported variables be declared with dllimport.
This does not fix x264 cli being unable to use a shared library built by ICL however.

r2188
Fix intra-refresh + hrd

r2187
Fix frame input colorspace check

r2186
Fix comment in deblock.c
The code does, in fact, handle CAVLC+8x8dct correctly already.

r2185
Fix sliced-threads ratecontrol bug
Was using qp instead of qscale; could cause NANs (not to mention less accurate results).

may24

2200

Quote

r2198
Fix some bugs in mb_info code

r2199
Add support for RGB formats in bit-depth conversion filter

r2200
Threaded lookahead

Split each lookahead frame analysis call into multiple threads.
Has a small impact on quality, but does not seem to be consistently any worse.

This helps alleviate bottlenecks with many cores and frame threads. In many cases, this massively increases performance on many-core systems.
For example, over 100% faster 1080p encoding with --preset veryfast on a 12-core i7 system.
Realtime 1080p30 at --preset slow should now be feasible on real systems.

For sliced-threads, this patch should be faster regardless of settings (~10%).

By default, lookahead threads are 1/6 of regular threads. This isn't exacting, but it seems to work well for all presets on real systems.
With sliced-threads, it's the same as the number of encoding threads.
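
The default described above amounts to something like this sketch. The function name is invented, and the clamp to at least one thread is an assumption; x264's actual selection logic may differ in detail.

```c
#include <assert.h>

/* Default number of lookahead threads for a given encoder thread count. */
int default_lookahead_threads(int threads, int sliced_threads)
{
    assert(threads >= 1);
    if (sliced_threads)
        return threads;       /* same as the number of encoding threads */
    int n = threads / 6;      /* 1/6 of the regular threads */
    return n < 1 ? 1 : n;     /* assumed lower bound of one thread */
}
```

For example, 12 frame threads would give 2 lookahead threads, while 8 sliced threads would give 8.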

Schotenhüter

2207

Quote

r2207
Faster predictor checking with subme<3
Fix a typo that made an early-skip less effective.
Avoid a relatively unpredictable branch.
Slightly changed output due to the typo-fix.
~50 cycles faster on Core i7.

r2206
Try 8x8 transform analysis even when sub8x8 partitions are present
Turn off the sub8x8 partitions, try it, and turn them back on if it didn't help.
Small compression improvement with p4x4 on (~0.1-0.5%).
Also update related comments.

r2205
Support changing resolutions between passes with macroblock-tree
Implement a basic separable bilinear filter to rescale the quantizer offsets.
Structure inspired by swscale, but floating-point instead of fixed-point.
Not as optimized as it could be, but it's quite fast already.

Example compression penalties on a 720p video game recording:
First pass with 720p and second as 480p: ~-1.5% (vs. same res)
First pass with 480p and second as 720p: ~-3% (vs. same res)
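
A separable floating-point bilinear rescale of a plane of per-macroblock offsets can be sketched like this: one horizontal pass into a temporary buffer, then one vertical pass. This is a generic illustration with made-up function names and a simple centre-aligned sample mapping, not the filter x264 actually implements.

```c
#include <assert.h>
#include <stdlib.h>

/* Linearly resample src_len samples (src_step apart) into dst_len samples. */
void resample_line(const float *src, int src_len, int src_step,
                   float *dst, int dst_len, int dst_step)
{
    for (int i = 0; i < dst_len; i++) {
        /* Map the destination sample centre into source coordinates. */
        float pos = (i + 0.5f) * src_len / dst_len - 0.5f;
        if (pos < 0)
            pos = 0;
        if (pos > src_len - 1)
            pos = (float)(src_len - 1);
        int i0 = (int)pos;
        int i1 = i0 + 1 < src_len ? i0 + 1 : i0;
        float frac = pos - i0;
        dst[i * dst_step] = src[i0 * src_step] * (1 - frac) + src[i1 * src_step] * frac;
    }
}

/* Rescale an sw x sh plane into a dw x dh plane. Returns 0 on success. */
int rescale_plane(const float *src, int sw, int sh, float *dst, int dw, int dh)
{
    assert(sw > 0 && sh > 0 && dw > 0 && dh > 0);
    float *tmp = malloc(sizeof(float) * (size_t)dw * sh);
    if (!tmp)
        return -1;
    for (int y = 0; y < sh; y++)   /* horizontal pass, row by row */
        resample_line(src + y * sw, sw, 1, tmp + y * dw, dw, 1);
    for (int x = 0; x < dw; x++)   /* vertical pass, column by column */
        resample_line(tmp + x, sh, dw, dst + x, dh, dw);
    free(tmp);
    return 0;
}
```

Because the filter is separable, the cost per output sample stays at two taps per pass instead of four taps in a combined 2D kernel.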

r2204
Print elapsed time in encoding progress indicator

r2203
Cap ratecontrol predictor parameters
Limits VBV mispredictions after long periods of relatively constant video.

r2202
x86inc: import patches from libav
Allow manual invocation of WIN64_SPILL_XMM even under INIT_MMX
SSE version of mova is movaps rather than movdqa.
YMM version of movnta.
Add mp size for named arguments.
Fix DEFINE_ARGS when used outside of a cglobal.
Define a few more cpuflags.
3-argument wrappers for a few more instructions.

r2201
Fix crash with --fps 0
Fix some integer overflows and check input parameters better.
Also fix incorrect type specifiers for demuxer info printing.

sneaker2

r2208

Quote

r2208
Revert r2204
People don't seem to like this so I'm just going to get rid of it.

Schotenhüter

2216

Quote

r2216
Enhance mb_info: add mb_info_update
This feature lets the callee know which decoded macroblocks have changed.

r2215
Fix mb_info_free with sliced threads
x264 would free mb_info before it was completely done using it.

r2214
Enhance nalu_process
Add the input frame opaque pointer to the arguments.
This makes it easier to use with multiple simultaneous x264 encodes.

r2213
Improve mb_info constant mb optimization
Allow fast skipping even if the pskip MV isn't zero.

r2212
Export the average effective CRF of each frame
Useful to judge the resulting quality of a frame when VBV is enabled.

r2211
Remove special-casing for OpenBSD pthread handling
Previously it was policy to use -pthread, but OpenBSD now recommends -lpthread.
It's been libpthread anyway and policy has changed to stop using -pthread.

r2210
x86inc: automatically insert vzeroupper for YMM functions
Backported from libav.

r2209
Free user supplied data when deleting a frame
This eliminates a memory leak when calling x264_encoder_close.

Schotenhüter

2245

Quote

r2245
Bump dates to 2013

r2244
x86inc: Drop tzcnt workaround

It is no longer needed now that we've bumped the version requirement of yasm to 1.2.0.

r2243
AVX2/FMA3 version of mbtree_propagate
First AVX2 function for testing.
Bump yasm version to 1.2.0 for AVX2 support.

r2242
x86inc: Use VEX-encoded instructions in AVX functions
Automatically use VEX-encoding in AVX/AVX2/XOP/FMA3/FMA4 functions for all instructions that exist in a VEX-encoded version.
This change makes it easier to extend existing code to use AVX2.
Also add support for AVX emulation of a few instructions that were missing before.

r2241
x86inc: activate REP_RET automatically
Now RET checks whether it immediately follows a branch, so the programmer doesn't have to keep track of that condition.
REP_RET is still needed manually when it's a branch target, but that's much rarer.
The implementation involves lots of spurious labels, but that's ok because we strip them.

r2240
x86inc: support stack mem allocation and re-alignment in PROLOGUE
Use this in 8-bit loopfilter functions so they can be used if
there is no aligned stack (e.g. x86-32 MSVC or ICC 10.x).

r2239
Update config.guess and config.sub

r2238
Fix crash if the first frame is forced to a non-keyframe
This is obviously bad user input, but x264 shouldn't crash if it happens.

r2237
Fix build on ARM with binutils >= 2.23.51.0.6
GAS doesn't seem to like spaces in vld1 anymore, so remove those.

r2236
Fix pthread_join emulation on win32 and BeOS
Doesn't actually affect x264, but it's more correct.

r2235
Fix typo in r2222
Slightly wrong numbers in level table.

r2234
configure: fix gpac detection with -Wp,-D_FORTIFY_SOURCE=2

r2233
Solaris: use sysconf to get processor count
Solaris responds correctly to the same value as Cygwin, so let's use that.

r2232
lavf input: allocate AVFrame correctly
Allocate AVFrames correctly with avcodec_alloc_frame().
This caused crashes with newer libavcodecs that try to free frame extradata.

r2231
Fix crash when using libx264.dll compiled with ICL for X86_64

r2230
Fix possible issues with out-of-spec QP values
Fixes a possible regression in r2228.

r2229
Attempt to optimize PPS pic_init_qp in 2-pass mode
Small compression improvement; up to ~0.5% in extreme cases.
Helps more with small slice sizes (tiny resolutions or slice-max-size).
Note that this changes the 2-pass stats file format.

r2228
Improve slice header QP selection
Use the first macroblock of each slice instead of the last of the previous.
Lets us pick a reasonable initial QP for the first slice too.
Slightly improved compression.

r2227
Update level dpb size calculation to match newer H.264 spec
Doesn't actually change encoding behavior, but makes it more correct.
Warning messages should now be accurate at higher bit depths and non-4:2:0.
Technically, since it redefines x264_level_t, this is an API version increment.

r2226
Add support for the ffmpeg/vapoursynth high bit depth y4m extensions

r2225
x86inc: Rename 3dnow2 to 3dnowext
The name "3dnowext" is more common than "3dnow2". Doesn't affect x264.

r2224
x86inc: only define program_name if the macro is unset.
This allows overriding the value from outside the file.
This can be useful if x86inc.asm is used outside of x264.

r2223
Disable ARM NEON MRC CPU test for Apple devices
The Apple A6 CPU doesn't support performance counters, so this test caused a crash.

r2222
Fix crash with no-scenecut + mbtree

r2221
Fix reconfiguring to crf=0
Lossless mode can't currently be enabled mid-stream.

r2220
Fix ALIGNED_ARRAY_EMU macros on ICL
ICL's preprocessor doesn't handle it correctly.
This fix is similar to libav's fix in 0db2d9.

r2219
Fix use of deprecated av_close_input_file call

r2218
Fix pkg-config for dynamic vs static linking

r2217
Set libm in the configure script if the OS has libm
Prerequisite for another configure patch after this.
Idea copied from libpthread.

Schotenhüter

2273

Quote

r2273
ARM: update NEON mc_chroma to work with NV12 and re-enable it

Up to 10-15% faster overall.

r2272
CABAC/CAVLC: use the new bit-iterating macro here too

r2271
quant_4x4x4: quant one 8x8 block at a time

This reduces overhead and lets us use less branchy code for zigzag, dequant,
decimate, and so on.
Reorganize and optimize a lot of macroblock_encode using this new function.
~1-2% faster overall.

Includes NEON and x86 versions of the new function.
Using larger merged functions like this will also make wider SIMD, like
AVX2, more effective.

r2270
Add AvxSynth support to the AviSynth input module.

Uses dlopen to load AvxSynth on Linux and OS X.

Allows the use of --demuxer avs for AvxSynth, though the only source filter it
can currently use is FFMS2.

Add a local copy of avxsynth_c.h and its dependent headers in extras/ so that
users don't need to actually have AvxSynth development headers installed to
enable support for it (mirroring the AviSynth behavior).

Based on a patch by 0x09 (tab@lavabit.com)

r2269
Eliminate some branchiness in ME/analysis

Faster, fewer branch mispredictions.

r2268
Fix some store forwarding stalls
There's quite a few others, but most of them don't help to fix or there's no
easy way to avoid them.

r2267
x86: faster AVX satd/sa8d/sa8d_satd/hadamard_ac

Use Conroe-style movddup in AVX transforms; both Sandy Bridge and Bulldozer
do movddup in the load unit, so it's totally free this way.

On Sandy Bridge:
~6% faster sa8d_satd
~5% faster hadamard_ac
~9% faster 32-bit satd
~2% faster sa8d

r2266
x86: detect Bobcat, improve Atom optimizations, reorganize flags

The Bobcat has a 64-bit SIMD unit reminiscent of the Athlon 64; detect this
and apply the appropriate flags.

It also has an extremely slow palignr instruction; create a flag for this to
avoid massive penalties on palignr-heavy functions.

Improve Atom function selection and document exactly what the SLOW_ATOM flag
covers.

Add Atom-optimized SATD/SA8D/hadamard_ac functions: simply combine the ssse3
optimizations with the sse2 algorithm to avoid pmaddubsw, which is slow on
Atom along with other SIMD multiplies.

Drop TBM detection; it'll probably never be useful for x264.

Invert FastShuffle to SlowShuffle; it only ever applied to one CPU (Conroe).

Detect CMOV, to fail more gracefully when run on a chip with MMX2 but no CMOV.

r2265
x86: combined SA8D/SATD dsp function

Speedup is most apparent for 8-bit (~30%), but gives some improvements
for 10-bit too (~12%).
64-bit only for now.

r2264
x86: port SSE2+ SATD functions to high bit depth

Makes SATD 20-50% faster across all partition sizes but 4x4.

r2263
x86: faster high bit depth ssd

About 15% faster on average.

r2262
x86: optimize and clean up predictor checking
Branchlessly handle elimination of candidates in MMX roundclip asm.
Add a new asm function, similar to roundclip, except without the round part.
Optimize and organize the C code, and make both subme>=3 and subme<3 consistent.
Add lots of explanatory comments and try to make things a little more understandable.
~5-10% faster with subme>=3, ~15-20% faster with subme<3.

r2261
Fix two bugs in predictor checking
pmv wasn't checked properly in some cases, as well as zero vector.
Output-changing portion of the following patch.

r2260
Improve lookahead-threads auto selection
Smarter decision to improve fast-first-pass performance in 2-pass encodes.
Dramatically improves CPU utilization on multi-core systems.

Tested on a quad-core Ivy Bridge (12 threads, 1080p):
Fast first pass:
veryfast: ~7% faster
faster: ~11% faster
fast/medium: ~15% faster
slow/slower: ~42% faster
veryslow: ~55% faster
CRF/1-pass:
veryfast: ~9% faster
(all others remained the same)

r2259
x86: Use SSE instead of SSE2 for copying data

Reduces code size because movaps/movups is one byte shorter than movdqa/movdqu.
Also merge MMX and SSE versions of memcpy_aligned into a single macro.

r2258
64-bit cabac optimizations

~4% faster PIC

WIN64:
~3% faster and 16 byte shorter cabac_encode_bypass
~8% faster cabac_encode_terminal
Benchmarked on Ivy Bridge

UNIX64:
One instruction less in cabac_encode_bypass

r2257
configure: add QNX support

r2256
Windows: Enable DEP and ASLR

r2255
x86inc: Set ELF hidden visibility for global constants

r2254
x86inc: Add cvisible macro for C functions with public prefix

This allows defining externally visible library symbols.

Signed-off-by: Diego Biurrun <diego@biurrun.de>

r2253
x86inc: rename program_name to private_prefix
Synced from libav.
The new name is more descriptive and will allow defining a separate public
prefix for externally visible library symbols.

r2252
x264.h: improve x264_encoder_reconfig documentation

r2251
Cosmetics: stricter definition of parameterless functions

r2250
Update "Install and compile x264" in doc/regression_test.txt

r2249
Fix possible non-determinism with mbtree + open-gop + sync-lookahead

Code assumed keyframe analysis would only pull one frame off the list; this
isn't true with open-gop.

r2248
x86: don't use the red zone on win64

r2247
x86-64: fix trellis asm with interlacing

Regression in r2145.
Assembly assumed array was [2][64] when it was actually [2][63].
Tiny (~0.1%) compression improvement.

r2246
x86-32: use simple nop codes for <= sse

The "CentaurHauls family 6 model 9 stepping 8" family of CPUs (flags:
fpu vme de pse tsc msr cx8 sep mtrr pge mov pat mmx fxsr sse up rng
rng_en ace ace_en) SIGILLs on long nop codes.

Schotenhüter

2309

Quote

r2309
x86: SSSE3 LUT-based faster coeff_level_run

~2x faster coeff_level_run.
Faster CAVLC encoding: {1%,2%,7%} overall with {superfast,medium,slower}.
Uses the same pshufb LUT abuse trick as in the previous ads_mvs patch.

r2308
x86-64: BMI2 cabac_residual functions

r2307
x86: SSSE3 ads_mvs

~55% faster ads in benchasm, ~15-30% in real encoding.
~4% faster "placebo" preset overall.

r2306
x86: AVX2 pixel_ssd_nv12_core

r2305
x86: AVX2 high bit-depth pixel_ssd

r2304
x86: AVX2 high bit-depth pixel_sad_x3/pixel_sad_x4

Also reduce the number of xmm registers used by sse2/ssse3 pixel_sad_x3.

r2303
x86: AVX2 high bit-depth vsad

r2302
x86: AVX2 high bit-depth pixel_sad

Also use loops instead of duplicating code; reduces code size by ~10kB with
negligible effect on performance.

r2301
x86: AVX2 high_bit_depth pixel_avg2, get_ref, mc_copy_w16, mc_luma

Also reduce the number of xmm registers used by mc_copy_* to avoid
saving and restoring xmm6 and xmm7 on 64-bit Windows.

r2300
x86: AVX2 nal_escape

Also rewrite the entire function to be faster and drop the AVX version which is no longer useful.

r2299
x86: AVX memzero_aligned

r2298
x86: AVX2 predict_16x16_dc

r2297
x86: AVX2 predict_8x8c_p/predict_8x16c_p

r2296
x86: AVX2 predict_16x16_p

Also fix the AVX implementation to correctly use the SSSE3 inline asm
instead of SSE2.

r2295
x86: AVX high bit-depth predict_16x16_v

Also restructure some code to reduce code size of various functions,
especially in high bit-depth.

r2294
x86: AVX2 high bit-depth predict_4x4_h

r2293
x86: AVX2 high bit-depth predict_16x16_h

r2292
x86: AVX2 high bit-depth predict_8x8c_h/predict_8x16c_h

r2291
x86util: Support ymm registers in HADD macros

r2290
x86: more AVX2 framework, AVX2 functions, plus some existing asm tweaks

AVX2 functions:
mc_chroma
intra_sad_x3_16x16
last64
ads
hpel
dct4
idct4
sub16x16_dct8
quant_4x4x4
quant_4x4
quant_4x4_dc
quant_8x8
SAD_X3/X4
SATD
var
var2
SSD
zigzag interleave
weightp
weightb
intra_sad_8x8_x9
decimate
integral
hadamard_ac
sa8d_satd
sa8d
lowres_init
denoise

r2289
x86inc: create xm# and ym#, analogous to m#

For when we want to mix simd sizes within one function.

r2288
x86inc: fix AVX emulation of cmp(p|s)(s|d)

r2287
x86-64: cabac_block_residual assembly

RDO: ~20% faster than C
Bitstream: ~50% faster than C
1-2% faster overall, highest on preset superfast/fast/medium.

r2286
OpenCL lookahead

OpenCL support is compiled in by default, but must be enabled at runtime by an
--opencl command line flag. Compiling OpenCL support requires perl. To avoid
the perl requirement use: configure --disable-opencl.

When enabled, the lookahead thread is mostly off-loaded to an OpenCL capable GPU
device. Lowres intra cost prediction, lowres motion search (including subpel)
and bidir cost predictions are all done on the GPU. MB-tree and final slice
decisions are still done by the CPU. Presets which do not use a threaded
lookahead will not use OpenCL at all (superfast, ultrafast).

Because of data dependencies, the GPU must use an iterative motion search which
performs more total work than the CPU would do, so this is not work efficient
or power efficient. But if there are GPU cycles to spare, it can often
speed up the encode. Output quality with OpenCL lookahead enabled is often
very slightly worse than with the CPU lookahead (because of the same data
dependencies).

x264 must compile its OpenCL kernels for your device before running them, and in
order to avoid doing this every run it caches the compiled kernel binary in a
file named x264_lookahead.clbin (--opencl-clbin FNAME to override). The cache
file will be ignored if the device, driver, or OpenCL source are changed.

x264 will use the first GPU device which supports the cl_image
features required by its kernels. Most modern discrete GPUs and all AMD
integrated GPUs will work. Intel integrated GPUs (up to IvyBridge) do not
support those necessary features. Use --opencl-device N to specify a number of
capable GPUs to skip during device detection.

Switchable graphics environments (e.g. AMD Enduro) are currently not supported,
as some have bugs in their OpenCL drivers that cause output to be silently
incorrect.

Developed by MulticoreWare with support from AMD and Telestream.

r2285
weightp: improve scale/offset search, chroma

Rescale the scale factor if the offset clips. This makes weightp more effective
in fades to/from white (and any other situation that requires big offsets).

Search more than 1 scale factor and more than 1 offset, depending on --subme.

Try to find the optimal chroma denominator instead of hardcoding it.

Overall improvement: a few percent in fade-heavy clips, such as a sample from
Avatar: TLA.

r2284
Add slices-max feature

The H.264 spec technically has limits on the number of slices per frame. x264
normally ignores this, since most use-cases that require large numbers of
slices prefer that it does. However, certain decoders may break with extremely large
numbers of slices, as can occur with some slice-max-size/mbs settings.

When set, x264 will refuse to create any slices beyond the maximum number,
even if slice-max-size/mbs requires otherwise.

r2283
Add slice-min-mbs feature

Works in conjunction with slice-max-mbs and/or slice-max-size to avoid overly
small slices.
Useful with certain decoders that barf on extremely small slices.

If slice-min-mbs would be violated as a result of slice-max-size, x264 will
exceed slice-max-size and print a warning.
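
One way the macroblock-count limits could interact is sketched below. It models only slice-max-mbs, slice-min-mbs and slices-max (not the byte-based slice-max-size), and the function and its splitting strategy are assumptions for illustration rather than x264's actual algorithm.

```c
#include <assert.h>

/* Split total_mbs macroblocks into slices. sizes[] must have room for
 * max_slices entries. Returns the number of slices produced. */
int plan_slices(int total_mbs, int max_mbs, int min_mbs, int max_slices, int *sizes)
{
    assert(total_mbs > 0 && max_mbs > 0 && max_slices > 0);
    assert(min_mbs >= 0 && min_mbs <= max_mbs);
    int n = 0, left = total_mbs;
    while (left > 0) {
        int len;
        if (n == max_slices - 1) {
            /* slices-max: the last permitted slice takes everything that is
             * left, even if that exceeds max_mbs. */
            len = left;
        } else {
            len = left < max_mbs ? left : max_mbs;
            /* slice-min-mbs: avoid leaving a tail shorter than min_mbs by
             * shortening this slice, as long as it stays long enough itself. */
            int tail = left - len;
            if (tail > 0 && tail < min_mbs && len - (min_mbs - tail) >= min_mbs)
                len -= min_mbs - tail;
        }
        sizes[n++] = len;
        left -= len;
    }
    return n;
}
```

With 100 macroblocks, a maximum of 30 and a minimum of 15 per slice, this yields slices of 30, 30, 25 and 15 macroblocks; capping the slice count at 2 yields 30 and 70.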

r2282
Disable mbtree asm with cpu-independent option

Results vary between versions because of different rounding results.

r2281
Show "avs: no" --disable-avs option instead of empty string

r2280
lavf input: don't use deprecated AVStream fields

Fixes building against newer libavcodecs from the Libav project.

r2279
Fix y4m input with C420paldv colorspace

r2278
x86: correctly check stack alignment for Atom hadamard_ac

Regression in r2265 (only affected compilers with broken stack alignment,
like ICL on win32).

r2277
x86inc: fix some corner cases of SWAP

SWAP with >=3 named (rather than numbered) args
PERMUTE followed by SWAP with 2 named args
used to produce the wrong permutation

r2276
Fix array overreads that caused miscompilation in gcc 4.8

r2275
Fix undefined behavior in x264_ratecontrol_mb

r2274
ARM: Fix bug in x264_quant_4x4x4_neon

sneaker2

2310

Quote

Fix two bugs in slice-min-mbs and slices-max

Slices-max broke slice-max-size when slices-max wasn't used.
Slice-min-mbs broke in rare cases near the end of a threadslice.

Schotenhüter

2334

Quote

r2334
OpenCL support improvement/refactoring

Autoload the OpenCL library so that it's not required to run an OpenCL-enabled
build of x264.

Update X264_BUILD, which should have been changed with the first patch.

r2333
x86: shave a few instructions off AVX deblock

r2332
x86: AVX2 dequant_4x4_dc

r2331
x86: AVX2 high bit-depth dequant

r2330
x86-64: 64-bit variant of AVX2 hpel_filter

~5% faster than 32-bit.

r2329
x86: AVX2 high bit-depth denoise_dct

28->15 cycles

Also reorder instructions to use fewer registers, 3 cycles faster on Ivy Bridge with 64-bit Windows.

r2328
x86: AVX2 high bit-depth quant

quant_4x4: 13->6 cycles
quant_4x4_dc: 14->8 cycles
quant_8x8: 47->24 cycles
quant_4x4x4: 48->25 cycles
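
To read the "before->after" cycle counts as speedup factors, a quick calculation helps. The numbers below are copied from the list above; the helper name is just for illustration.

```python
def speedup(cycles_before, cycles_after):
    """Return how many times faster the new routine is than the old one."""
    return cycles_before / cycles_after

# Cycle counts (before, after) taken from the r2328 entry.
quant_cycles = {
    "quant_4x4": (13, 6),
    "quant_4x4_dc": (14, 8),
    "quant_8x8": (47, 24),
    "quant_4x4x4": (48, 25),
}

for name, (before, after) in quant_cycles.items():
    print(f"{name}: {speedup(before, after):.2f}x faster")
```

All four functions come out at roughly twice as fast as before.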

r2327
x86: AVX2 add16x16_idct_dc

27 -> 19 cycles

r2326
x86: faster AVX2 quant_4x4x4

10->9 cycles

r2325
x86: AVX2 intra_sad_x3_8x8c

30->22 cycles

r2324
x86: AVX2 high bit-depth intra_sad_x3_8x8

43->24 cycles

r2323
x86: AVX2 deblock strength

30->18 cycles

r2322
x86: Faster high bit-depth intra_sad_x3_4x4

20->16 cycles on Ivy Bridge

r2321
x86: faster SSSE3 hpel

~7% faster using the pmulhrsw trick from mc_chroma.
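
The entry does not spell out the trick. One common use of pmulhrsw, assumed here, is as a rounded right shift: per signed 16-bit lane the instruction computes ((a*b >> 14) + 1) >> 1, so multiplying by 2^(15-n) gives (x + 2^(n-1)) >> n in a single instruction. The sketch below models one lane in Python; the shift amount of 5 is only an illustrative choice and the function names are made up.

```python
def pmulhrsw_lane(a, b):
    """Model one signed 16-bit lane of the SSSE3 pmulhrsw instruction.

    Ignores the single overflow case a == b == -32768.
    """
    return (((a * b) >> 14) + 1) >> 1

def rounded_shift(x, n):
    """Reference: arithmetic right shift by n with rounding to nearest."""
    return (x + (1 << (n - 1))) >> n

def shift_via_pmulhrsw(x, n):
    """Rounded shift by n (1..15) expressed as a pmulhrsw by 2^(15-n)."""
    return pmulhrsw_lane(x, 1 << (15 - n))
```

The saving is that the separate add of the rounding constant and the shift collapse into one multiply.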

r2320
x86-64: faster SSSE3 trellis

~2% faster trellis.

r2319
x86: 32-byte align the stack if possible

Avoids the need for manual 32 byte array alignment on compilers that support
-mpreferred-stack-boundary.

r2318
x86inc: Utilize the shadow space on 64-bit Windows

Store XMM6 and XMM7 in the shadow space in functions that clobber them.
This way we don't have to adjust the stack pointer as often,
reducing the number of instructions as well as code size.

r2317
x86: Don't use explicitly aligned versions of SAD on AVX CPUs

On modern CPUs movdqu isn't slower than movdqa when used on aligned data and using the same code in both cases saves cache.

This was already done for the high bit-depth AVX2 implementation, but the aligned version still exists as dead code, so remove that.

r2316
x86: Add missing initializations for high bit-depth sad_aligned

r2315
x86: add Jaguar CPU detection

r2314
x86inc: Remove .rodata kludges

The Mach-O bug was fixed in yasm 0.8.0 and we don't support versions that old.

a.out was superseded by ELF on sane systems a few decades ago.

r2313
checkasm: Use 64-bit cycle counters

Prevents overflows that can occur in some cases.
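
The entry does not say where the overflow happened. As a rough illustration, assuming cycle counts are summed in an unsigned counter of fixed width, the following simulates 32-bit and 64-bit wraparound with bit masks. The sample values are made up.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def accumulate(samples, mask):
    """Sum cycle counts in an unsigned counter that wraps at the mask width."""
    total = 0
    for cycles in samples:
        total = (total + cycles) & mask
    return total

# Made-up measurements whose sum does not fit into 32 bits.
samples = [3_000_000_000, 2_000_000_000]

wrapped = accumulate(samples, MASK32)   # silently wraps around
correct = accumulate(samples, MASK64)   # holds the full sum
```

With a 32-bit counter the total wraps and the benchmark reports a far too small number; the 64-bit counter keeps the real sum.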

r2312
checkasm: Fix stack alignment bug

r2311
Fix invalid memcpy in sliced-threads

Likely didn't actually break in practice, but memcpy with src==dst
is incorrect.


sneaker2

2345 (Download)

Quote

rev2345
Tweak i16x16-delta-quant-avoidance code

Don't omit the delta quant if it'd raise the quantizer to do so; this fixes
a rare flickering issue caused by deblocking.

rev2344
x86: faster AVX2 iDCT, AVX deblock_luma_h, deblock_luma_h_intra

rev2343
Add new color primaries, transfer characteristics, matrix coefficients

rev2342
Add "--stitchable" option for segmented encoding

Stops x264 from attempting to optimize global stream headers, ensuring that
different segments of a video will have identical headers when used with
identical encoding settings.

rev2341
Interface: if vbv-maxrate < bitrate, set bitrate = vbv-maxrate

This probably makes more sense to the user than setting vbv-maxrate = bitrate,
as before.
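
The new rule can be written down in a few lines. This is a sketch of the described behaviour, not the code from the patch: the function name is hypothetical, rates are taken to be in kbit/s, and a vbv-maxrate of 0 is assumed to mean that no VBV limit is set.

```python
def resolve_rates(bitrate, vbv_maxrate):
    """Return (bitrate, vbv_maxrate) after the adjustment described in rev2341.

    Assumption: vbv_maxrate == 0 means no VBV limit was requested.
    """
    if vbv_maxrate > 0 and vbv_maxrate < bitrate:
        # New behaviour: lower the target bitrate to the VBV limit
        # (previously vbv-maxrate was raised to the bitrate instead).
        bitrate = vbv_maxrate
    return bitrate, vbv_maxrate
```

So a request for bitrate 5000 with vbv-maxrate 3000 now encodes at 3000 instead of silently loosening the VBV limit.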

rev2340
OpenCL cosmetics

rev2339
Fix possible crash when writing very large filler NALUs

Bitstream-reallocation function didn't handle the case of filler.

rev2338
Fix build with PIC on some systems

rev2337
Fix potential misalignment crash in AVX2 denoise_dct

rev2336
Fix building with compilers without inline asm support

Also fix crash in high bit depth builds compiled with unaligned stack.

rev2335
Fix compilation with OpenCL on MacOS X

Also fix crash in the case of OpenCL error during encoding.


Schotenhüter

2348

Quote

rev2348
x86: SSSE3 implementation of pixel_sad_x3 and pixel_sad_x4

rev2347
x86: Faster AVX2 pixel_sad_x3 and pixel_sad_x4

rev2346
x86: Remove X264_CPU_SSE_MISALIGN functions

Prevents a crash if the misaligned exception mask bit is cleared for some reason.

Misaligned SSE functions are only used on AMD Phenom CPUs and the benefit is minuscule.
They also require modifying the MXCSR control register and by removing those functions
we can get rid of that complexity altogether.

VEX-encoded instructions also support unaligned memory operands. I tried adding AVX
implementations of all removed functions but there were no performance improvements on
Ivy Bridge. pixel_sad_x3 and pixel_sad_x4 had significant code size reductions though
so I kept them and added some minor cosmetic fixes and tweaks.


Schotenhüter

2358

Quote

Anton Mitrofanov [Mon, 26 Aug 2013 19:20:31 +0200 (21:20 +0400)]
Fix masked access violation in KERNEL32

Caused crashes under gdb in Windows and might cause other unknown problems.

Hiroki Taniura [Sat, 24 Aug 2013 18:18:57 +0200 (01:18 +0900)]
Fix GPAC support on Windows

Henrik Gramner [Sun, 11 Aug 2013 19:50:42 +0200 (19:50 +0200)]
Windows Unicode support

Windows, unlike most other operating systems, uses UTF-16 for Unicode strings while x264 is designed for UTF-8.

This patch does the following in order to handle things like Unicode filenames:
* Keep strings internally as UTF-8.
* Retrieve the CLI command line as UTF-16 and convert it to UTF-8.
* Always use Unicode versions of Windows API functions and convert strings to UTF-16 when calling them.
* Attempt to use legacy 8.3 short filenames for external libraries without Unicode support.
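
The round trip in the list above (UTF-16 in from the OS, UTF-8 internally, UTF-16 out again for API calls) can be illustrated in a few lines. This is only a sketch of the encoding conversions with hypothetical function names; in C on Windows such conversions are typically done with Win32 functions like WideCharToMultiByte and MultiByteToWideChar, and the patch itself may do it differently.

```python
def utf16_args_to_utf8(utf16_args):
    """Convert UTF-16LE byte strings (e.g. command-line arguments) to UTF-8."""
    return [arg.decode("utf-16-le").encode("utf-8") for arg in utf16_args]

def utf8_to_utf16(utf8_path):
    """Convert an internal UTF-8 path back to UTF-16LE for a wide-char API."""
    return utf8_path.decode("utf-8").encode("utf-16-le")

# Example: a filename with a non-ASCII character survives the round trip.
name = "caf\u00e9.mkv"
raw_from_os = name.encode("utf-16-le")
internal = utf16_args_to_utf8([raw_from_os])[0]
back_to_os = utf8_to_utf16(internal)
```

Keeping a single internal encoding means only the code at the OS boundary has to know about UTF-16.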

Kieran Kunhya [Sat, 20 Jul 2013 19:47:59 +0200 (18:47 +0100)]
AVC-Intra support

This format has been reverse engineered and x264's output has almost exactly
the same bitstream as Panasonic cameras and encoders produce. It therefore does
not comply with SMPTE RP2027 since Panasonic themselves do not comply with
their own specification. It has been tested in Avid, Premiere, Edius and
Quantel.

Parts of this patch were written by Jason Garrett-Glaser and some reverse
engineering was done by Joseph Artsimovich.

Henrik Gramner [Mon, 8 Jul 2013 21:06:42 +0200 (12:06 -0700)]
Transparent hugepage support

Combine frame and mb data mallocs into a single large malloc.
Additionally, on Linux systems with hugepage support, ask for hugepages on
large mallocs.

This gives a small performance improvement (~0.2-0.9%) on systems without
hugepage support, as well as a small memory footprint reduction.

On recent Linux kernels with hugepage support enabled (set to madvise or
always), it improves performance up to 4% at the cost of about 7-12% more
memory usage on typical settings.

It may help even more on Haswell and other recent CPUs with improved 2MB page
support in hardware.
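
On Linux, "asking for hugepages" when transparent hugepages are set to madvise is done through the madvise() system call with MADV_HUGEPAGE. The sketch below shows that mechanism from Python on one large anonymous mapping; it is an illustration of the kernel interface, not x264's allocator, and the function name is made up. On systems without the constant or the call, it simply skips the hint.

```python
import mmap

def alloc_with_hugepage_hint(size):
    """Allocate anonymous memory and ask the kernel for transparent hugepages.

    Returns (buffer, hinted); hinted is False when the hint is unavailable
    or rejected by the kernel.
    """
    buf = mmap.mmap(-1, size)                      # one large anonymous mapping
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # only defined on Linux
    hinted = False
    if advice is not None and hasattr(buf, "madvise"):
        try:
            buf.madvise(advice)                    # hint covers the whole mapping
            hinted = True
        except OSError:
            pass                                   # kernel built without THP support
    return buf, hinted
```

A single large allocation, as described above, gives the kernel a better chance to back the region with 2MB pages than many small ones would.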


may24

Um, where can I get the Windows builds? x264.nl and its mirrors are only at 2345 ...

Goldwingfahrer

http://komisar.gin.by/

akapuma

By now we are already at r2409. Changelog

Regards

akapuma

Selur

Adaptive Quantizer 3 has (after years) made it into the 'stable' branch!
