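This routine applies a 5x5 Gaussian filter to smooth a grayscale bitmap, and the implementation is optimized for speed.
At present, a 2560x960-pixel grayscale bitmap is processed in 1.4 ms.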
MyDetectFilterG5 PROC
    mov qword ptr [rsp + 08h], rbx ; backupReg
    mov qword ptr [rsp + 10h], rsi ; backupReg
    mov qword ptr [rsp + 18h], rdi ; backupReg
    mov rbx, qword ptr [rsp + 28h] ; currPixel
    mov rsi, qword ptr [rsp + 30h] ; lineCache
    mov rdi, qword ptr [rsp + 38h] ; destPixel
    mov qword ptr [rsp + 20h], r12 ; backupReg
    mov qword ptr [rsp + 28h], r13 ; backupReg
    vinserti128 ymm8, ymm8, xmm6, 01h ; backupReg
    vinserti128 ymm9, ymm9, xmm7, 01h ; backupReg
    movzx rax, byte ptr [r9 + 00h] ; factor_00
    vmovq xmm0, rax ; factor_00
    vpbroadcastw ymm3, xmm0 ; factor_00
    movzx rax, byte ptr [r9 + 01h] ; factor_01
    vmovq xmm0, rax ; factor_01
    vpbroadcastw ymm4, xmm0 ; factor_01
    movzx rax, byte ptr [r9 + 02h] ; factor_02
    vmovq xmm0, rax ; factor_02
    vpbroadcastw ymm5, xmm0 ; factor_02
    movzx rax, byte ptr [r9 + 03h] ; factor_03
    vmovq xmm0, rax ; factor_03
    vpbroadcastw ymm6, xmm0 ; factor_03
    movzx rax, byte ptr [r9 + 04h] ; factor_04
    vmovq xmm0, rax ; factor_04
    vpbroadcastw ymm7, xmm0 ; factor_04
    mov r9, rcx ; colNumber
    add r9, 0Fh ; colNumber + 15
    shr r9, 04h ; colNumber / 16
    mov r11, rbx ; currPixel
    sub r11, r8 ; currPixel - stride * 1
    mov r10, r11 ; currPixel - stride * 1
    sub r10, r8 ; currPixel - stride * 2
    mov r12, rbx ; currPixel
    add r12, r8 ; currPixel + stride * 1
    mov r13, r12 ; currPixel + stride * 1
    add r13, r8 ; currPixel + stride * 2
S0:
    xor rax, rax
    mov rcx, r9
S1:
    vpmovzxbw ymm0, xmmword ptr [r10 + rax] ; currPixel - stride * 2
    vpmullw ymm0, ymm0, ymm3
    vmovdqa ymm1, ymm0
    vpmovzxbw ymm0, xmmword ptr [r11 + rax] ; currPixel - stride * 1
    vpmullw ymm0, ymm0, ymm4
    vpaddusw ymm1, ymm1, ymm0
    vpmovzxbw ymm0, xmmword ptr [rbx + rax] ; currPixel
    vpmullw ymm0, ymm0, ymm5
    vpaddusw ymm1, ymm1, ymm0
    vpmovzxbw ymm0, xmmword ptr [r12 + rax] ; currPixel + stride * 1
    vpmullw ymm0, ymm0, ymm6
    vpaddusw ymm1, ymm1, ymm0
    vpmovzxbw ymm0, xmmword ptr [r13 + rax] ; currPixel + stride * 2
    vpmullw ymm0, ymm0, ymm7
    vpaddusw ymm1, ymm1, ymm0
    vpsrlw ymm1, ymm1, 08h
    vmovdqu ymmword ptr [rsi + rax * 02h], ymm1 ; lineCache
    add rax, 10h ; nextBlock
    loop S1 ; colRepeat
    xor rax, rax
    mov rcx, r9
S2:
    vpmullw ymm0, ymm3, ymmword ptr [rsi + rax * 02h - 04h]
    vmovdqa ymm1, ymm0
    vpmullw ymm0, ymm4, ymmword ptr [rsi + rax * 02h - 02h]
    vpaddusw ymm1, ymm1, ymm0
    vpmullw ymm0, ymm5, ymmword ptr [rsi + rax * 02h]
    vpaddusw ymm1, ymm1, ymm0
    vpmullw ymm0, ymm6, ymmword ptr [rsi + rax * 02h + 02h]
    vpaddusw ymm1, ymm1, ymm0
    vpmullw ymm0, ymm7, ymmword ptr [rsi + rax * 02h + 04h]
    vpaddusw ymm1, ymm1, ymm0
    vpsrlw ymm1, ymm1, 08h
    vextracti128 xmm0, ymm1, 01h
    vpackuswb xmm0, xmm1, xmm0
    vmovdqu xmmword ptr [rdi + rax], xmm0 ; destPixel
    add rax, 10h ; nextBlock
    loop S2 ; colRepeat
    mov r10, r11
    mov r11, rbx
    mov rbx, r12
    mov r12, r13
    add r13, r8
    add rdi, r8
    dec rdx ; rowNumber
    ja S0 ; rowRepeat
    vextracti128 xmm6, ymm8, 01h ; resumeReg
    vextracti128 xmm7, ymm9, 01h ; resumeReg
    mov rbx, qword ptr [rsp + 08h] ; resumeReg
    mov rsi, qword ptr [rsp + 10h] ; resumeReg
    mov rdi, qword ptr [rsp + 18h] ; resumeReg
    mov r12, qword ptr [rsp + 20h] ; resumeReg
    mov r13, qword ptr [rsp + 28h] ; resumeReg
    ret
MyDetectFilterG5 ENDP
This code targets 64-bit Windows only and can only be assembled with MASM.
The C declaration of the function is:
extern "C" void MyDetectFilterG5(DWORD colNumber, DWORD rowNumber, DWORD colStride, PVOID currTable, PVOID currPixel, PVOID lineCache, PVOID destPixel);
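rowNumber: bitmap height, in pixels (one byte per pixel)
colNumber: bitmap width, in pixels (one byte per pixel)
colStride: length of one stored bitmap row, in bytes
currTable: pointer to an array of 5 bytes holding the filter coefficients; the 5 values must sum to 256
lineCache: temporary line buffer; a pointer to byte offset 4 of the buffer (the code reads up to 4 bytes before it); buffer capacity is 8 + colNumber * 2 bytes
currPixel: input grayscale bitmap; a pointer to the top-left pixel
destPixel: output grayscale bitmap; a pointer to the top-left pixel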
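
As a cross-check for the parameters above, here is a minimal scalar sketch in C of what the assembly appears to compute: a vertical 5-tap pass into a 16-bit line cache, then a horizontal 5-tap pass, each followed by a shift right by 8. It is not the author's code. The names `g5_table_is_valid`, `g5_filter_ref` and `g5_demo_constant`, and the coefficient table {16, 64, 96, 64, 16}, are assumptions made for illustration. The sketch assumes the table sums to 256, so no 16-bit sum can exceed 255 * 256 = 65280, and it assumes the four padding entries of the line cache are zero. Like the assembly, it reads two rows above and two rows below the rows it filters, so the source buffer must contain them. Unlike the assembly, it does not round the width up to blocks of 16 columns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when the five coefficients sum to exactly 256. */
int g5_table_is_valid(const uint8_t table[5])
{
    unsigned sum = 0;
    for (int k = 0; k < 5; ++k)
        sum += table[k];
    return sum == 256;
}

/* Hypothetical scalar reference. `src` points at the first row to filter and
   must have two readable rows above and below the filtered rows. `cache`
   points at element 2 (byte offset 4) of a uint16_t buffer holding cols + 4
   elements, i.e. 8 + cols * 2 bytes. */
void g5_filter_ref(uint32_t cols, uint32_t rows, uint32_t stride,
                   const uint8_t table[5], const uint8_t *src,
                   uint16_t *cache, uint8_t *dst)
{
    assert(g5_table_is_valid(table));
    for (uint32_t y = 0; y < rows; ++y) {
        const uint8_t *row = src + (size_t)y * stride;
        /* Vertical pass: weight rows y-2 .. y+2, store into the line cache. */
        for (uint32_t x = 0; x < cols; ++x) {
            uint32_t sum = 0;
            for (int k = 0; k < 5; ++k)
                sum += (uint32_t)table[k] *
                       row[((ptrdiff_t)k - 2) * (ptrdiff_t)stride + (ptrdiff_t)x];
            cache[x] = (uint16_t)(sum >> 8);
        }
        /* Horizontal pass: weight cache entries x-2 .. x+2, store the pixel. */
        for (uint32_t x = 0; x < cols; ++x) {
            uint32_t sum = 0;
            for (int k = 0; k < 5; ++k)
                sum += (uint32_t)table[k] * cache[(ptrdiff_t)x + k - 2];
            dst[(size_t)y * stride + x] = (uint8_t)(sum >> 8);
        }
    }
}

/* Filters a small constant image. Returns one interior output pixel and
   stores the top-left output pixel in *edge_out. */
uint8_t g5_demo_constant(uint8_t value, uint8_t *edge_out)
{
    enum { COLS = 32, ROWS = 8, STRIDE = 32, PAD = 2 };
    static const uint8_t table[5] = { 16, 64, 96, 64, 16 }; /* 16 * (1 4 6 4 1) */
    uint8_t src[(ROWS + 2 * PAD) * STRIDE];  /* two padding rows on each side */
    uint8_t dst[ROWS * STRIDE];
    uint16_t cache_buf[COLS + 4] = { 0 };    /* 8 + COLS * 2 bytes, zeroed */

    memset(src, value, sizeof src);
    memset(dst, 0, sizeof dst);
    g5_filter_ref(COLS, ROWS, STRIDE, table,
                  src + PAD * STRIDE, cache_buf + 2, dst);
    *edge_out = dst[0];
    return dst[3 * STRIDE + 16];
}
```

With a constant input, interior pixels should come out unchanged because the weights sum to 256. Pixels in the first and last two columns come out darker in this sketch, because the horizontal pass mixes in the zeroed padding entries of the line cache.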