>The open-source AV1 decoder dav1d was updated yesterday to version 0.3.0. In this third release, new assembly code provides some serious performance gains on both PC and mobile platforms.
>On the x86 side, this release mostly improves the SSSE3 performance of dav1d. Xuefeng Jiang contributed chroma-from-luma prediction and Paeth intra prediction functions, improving global performance by 0.8% and 0.4% respectively.
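
For readers unfamiliar with the Paeth predictor mentioned above, here is a minimal scalar sketch of the per-pixel rule in Python. It follows the usual Paeth definition (pick whichever of the left, top, or top-left neighbour is closest to `left + top - top_left`); the function name and the tie-breaking order shown are illustrative assumptions, and this is not dav1d's actual code, which is hand-written SIMD assembly working on many pixels at once.

```python
def paeth_predict(left: int, top: int, top_left: int) -> int:
    """Predict one pixel from three neighbours using the Paeth rule.

    Hypothetical helper for illustration only; not part of dav1d's API.
    """
    # Linear estimate of the pixel from its neighbours.
    base = left + top - top_left

    # Distance of each neighbour from that estimate.
    d_left = abs(base - left)
    d_top = abs(base - top)
    d_top_left = abs(base - top_left)

    # Return the neighbour closest to the estimate
    # (ties resolved in the order left, top, top-left).
    if d_left <= d_top and d_left <= d_top_left:
        return left
    if d_top <= d_top_left:
        return top
    return top_left


if __name__ == "__main__":
    print(paeth_predict(10, 20, 10))
```

The rule is branchy in scalar form, which is part of why a SIMD version that evaluates all three distances and selects with masks can pay off.
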
>Liwei Wang continued his work on the inverse transform, covering larger block sizes such as 8x32, 32x16 and 32x32, up to 64x64. This provides the largest speedup of this release, well over 10% on some videos.
>dav1d 0.3.0 also introduces the first SSE4.1 assembly. In most cases the instructions added in SSE4.1 aren't useful on top of SSSE3, but Victorien Le Couviour--Tuffet found a use case where they are. He optimized the CDEF filter, resulting in a 1.15x speedup at the module level and around 1.5% overall.
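
As a rough sanity check on how a 1.15x module-level speedup turns into about 1.5% overall, the sketch below applies Amdahl's law. It assumes that "1.5% overall" means an overall speedup factor of 1.015 and that nothing else changed between the two measurements; both function names are made up for this illustration.

```python
def overall_speedup(fraction: float, module_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of the runtime
    is accelerated by `module_speedup` and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / module_speedup)


def implied_fraction(module_speedup: float, overall: float) -> float:
    """Invert Amdahl's law: the share of runtime the module must have
    had for `module_speedup` to yield the given `overall` speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / module_speedup)


if __name__ == "__main__":
    # Figures from the quoted text, under the assumptions stated above.
    share = implied_fraction(module_speedup=1.15, overall=1.015)
    print(f"implied CDEF share of decode time: {share:.3f}")
```

Under those assumptions the numbers imply that CDEF accounted for roughly 11% of decode time in the measured SSE4.1 configuration, which is an inference from the quoted figures and not a number reported by the authors.
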
>Meanwhile, Henrik Gramner wrote some very clever SSE2 code to speed up entropy decoding/bitstream reading, which had started to take up a large proportion of decode time, especially on AVX2. The assembly code resulted in a speedup on all 64-bit x86 platforms, measured at around 4% for AVX2 and 2% for SSSE3 and SSE4.1.
>Overall, these commits make dav1d 0.3.0 around 24% faster on SSSE3, 26% faster on SSE4.1 and 4% faster on AVX2 CPUs.
>While single-threaded aomdec is still quite strong, with multiple threads dav1d 0.3.0 is making libaom an even smaller spot in the rear-view mirror.
>Martin Storsjö delivered two very nice commits speeding up the loop filter and self-guided loop restoration with NEON assembly code for ARM. Both functions were sped up by about 3x, resulting in performance gains anywhere from 7% to 36%. This not only allows for higher resolutions, frame rates and bitrates, but also brings down power consumption on identical content.
medium.com