
2D Convolution Optimization: Gaussian Filtering as an Example

The 2D convolution formula

The one-dimensional convolution formula is:

$$(f * g)(n) = \int_{-\infty}^{+\infty}f(\tau)g(n-\tau)\,d\tau$$

The expression above is the convolution of f with g evaluated at n. From the formula, its meaning is the integral over $(-\infty, +\infty)$ of the product of $f(\tau)$ and $g(n - \tau)$; n is simply the point at which the convolution is evaluated. This is the continuous form of convolution; the discrete form is:

$$(f * g)(n) = \sum_{\tau=-\infty}^{+\infty}{f(\tau)g(n - \tau)}$$

Replacing the integral with a summation gives the discrete convolution; an integral is simply an unbounded accumulation over a continuous domain.
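As a quick check of the discrete formula, convolving $f = (1, 2, 3)$ with the kernel $g = (1, 1)$ (both indexed from 0) gives

$$(f*g)(0) = 1,\quad (f*g)(1) = 1\cdot 1 + 2\cdot 1 = 3,\quad (f*g)(2) = 2 + 3 = 5,\quad (f*g)(3) = 3$$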

The two-dimensional convolution formula is:

$$(I * K)(i,j) = \sum_{m}\sum_{n}{I(i - m, j - n)K(m, n)}$$

Here I is a two-dimensional matrix (the image) and K is the convolution kernel, of size m×n. The formula accumulates the products $I(i - m, j - n)K(m, n)$: the kernel (an m×n matrix) is rotated by 180 degrees, multiplied element-wise with the corresponding neighborhood of I, the products are summed, and the result is assigned to the output pixel $(i, j)$. A closely related operation is correlation:

$$(I * K)(i,j) = \sum_{m}\sum_{n}{I(i + m, j + n)K(m, n)}$$

The only difference from convolution is that the minus signs become plus signs; correspondingly, the kernel no longer has to be rotated by 180 degrees before the multiply-accumulate. Skipping the rotation simplifies the implementation and is friendlier to optimization. Since convolution here is used for image filtering, and a symmetric kernel such as the Gaussian is unchanged by a 180-degree rotation, the distinction rarely matters in practice, which is why many image-processing and deep-learning libraries implement "convolution" as correlation. The optimizations below also target the correlation form. A direct implementation of the formula is:

/**
 * @brief Naive convolution, implemented directly from the formula
 * 
 * @param image 
 * @param kernel_size 
 */
void Origin(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    cv::Mat gauss_kernel = cv::Mat(kernel_size, kernel_size, CV_32F); 
    
    // Build the 2D Gaussian kernel as the outer product of the 1D kernel
    for (size_t i = 0; i < kernel_size; i++)
    {
        for (size_t j = 0; j < kernel_size; j++)
        {
            float val = gauss_mat.at<float>(i, 0) * gauss_mat.at<float>(j, 0);
            gauss_kernel.at<float>(i, j) = val;
        }
    }
    
    size_t rows = image.rows;
    size_t cols = image.cols;
    size_t half_kernel_size = kernel_size / 2;

    // Convolve interior pixels only; a border of half_kernel_size pixels is left untouched
    for (size_t i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (size_t j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (size_t m = 0; m < kernel_size; m++)
            {
                for (size_t n = 0; n < kernel_size; n++)
                {
                    int r = i + m - half_kernel_size;
                    int c = j + n - half_kernel_size;
                    float val = image.at<uchar>(r, c);
                    conv += (val * gauss_kernel.at<float>(m,n));
                }
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        } 
    }
}

The above follows the formula directly, without any optimization.
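For reference, the function can be exercised like this (a minimal sketch; the file names are placeholders and Origin is assumed to be declared as above):

#include "opencv2/opencv.hpp"

int main()
{
    // "input.png" is a placeholder path; any 8-bit single-channel image works
    cv::Mat src = cv::imread("input.png", cv::IMREAD_GRAYSCALE);
    if (src.empty()) return 1;

    cv::Mat dst = src.clone();   // output buffer; border pixels keep their source values
    Origin(src, dst, 7);         // 7x7 Gaussian smoothing of the interior
    cv::imwrite("output.png", dst);
    return 0;
}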

Optimization techniques

Using Gaussian filtering as the example, the sections below optimize the filter step by step: kernel separation first, then SIMD, then OpenMP parallelization.
In image processing, a Gaussian filter convolves the input image with a Gaussian kernel $K(m,n)$, and the two-dimensional Gaussian kernel is

$$K(m,n) = {\frac{1}{2\pi\sigma_1\sigma_2}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}-\frac{(n - \mu_2)^2}{2\sigma_2^2}}}$$

Substituting it into the 2D convolution formula

$$(I * K)(i,j) = \sum_{m}\sum_{n}{I(i + m, j + n)K(m, n)}$$

gives

$$(I * K)(i,j) = \sum_{m}\sum_{n}{I(i + m, j + n)\,\frac{1}{2\pi\sigma_1\sigma_2}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}-\frac{(n - \mu_2)^2}{2\sigma_2^2}}}$$

Separating the convolution kernel

Expand the exponential term of the kernel:

$$(I * K)(i,j) = \sum_{m}\sum_{n}{I(i + m, j + n)\,\frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}}\cdot\frac{1}{\sqrt{2\pi}\sigma_2}e^{-\frac{(n -\mu_2)^2}{2\sigma_2^2}}}$$

Notice that in the inner sum (over n) the index m is fixed, so the first Gaussian factor, which depends only on m,

$${\frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}}}$$

can be pulled out of the inner sum:

$$(I * K)(i,j) = \sum_{m}\frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}}\cdot\left[\sum_{n}{I(i + m, j + n)\,\frac{1}{\sqrt{2\pi}\sigma_2}e^{-\frac{(n -\mu_2)^2}{2\sigma_2^2}}}\right]$$

Looking at this expression, the inner sum is simply a one-dimensional Gaussian convolution: within it m is fixed and n varies, and n indexes the matrix columns, so the inner sum is a horizontal convolution. The outer sum is likewise a one-dimensional Gaussian convolution whose input is the result of the inner convolution; there m varies and indexes the matrix rows, so it is a vertical convolution. A two-dimensional Gaussian convolution has thus been decomposed into two one-dimensional Gaussian convolutions: run the horizontal pass first, then the vertical pass.
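The separation is easy to confirm numerically by comparing one 2D filtering pass against a horizontal-then-vertical pass (a minimal sketch, assuming OpenCV is available; the image and kernel sizes are arbitrary):

#include "opencv2/opencv.hpp"
#include <iostream>

int main()
{
    cv::Mat img(480, 640, CV_32F);
    cv::randu(img, cv::Scalar(0), cv::Scalar(255));           // random test image

    const int ksize = 7;
    cv::Mat g1d = cv::getGaussianKernel(ksize, 0, CV_32F);    // 1D Gaussian, sigma derived from ksize
    cv::Mat g2d = g1d * g1d.t();                               // 2D kernel as the outer product g(m)g(n)

    cv::Mat full2d, separated;
    cv::filter2D(img, full2d, -1, g2d);                        // one 2D correlation
    cv::sepFilter2D(img, separated, -1, g1d, g1d);             // horizontal pass, then vertical pass

    std::cout << "max difference: "
              << cv::norm(full2d, separated, cv::NORM_INF)     // only float rounding error remains
              << std::endl;
    return 0;
}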

Analyzing the efficiency gain

Assume an image of size W×H and a Gaussian kernel of size m×n, and ignore the border pixels that do not take part in the convolution. Counting multiply-add operations, producing one output pixel costs one full pass of the kernel over the pixel's neighborhood, i.e. m×n operations. With W×H pixels, W×H such passes are needed, so the total work is

$$O = W \cdot H \cdot m \cdot n$$

After separating the kernel, the horizontal pass costs n operations per pixel, or W×H×n for the whole image, and the vertical pass costs m per pixel, or W×H×m, so the total work is

$$O = W \cdot H \cdot (m + n)$$

The theoretical speedup is therefore

$$ratio = \frac{mn}{m + n}$$
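For example, with the 7×7 kernel used in the benchmarks below, the theoretical speedup is

$$ratio = \frac{7 \times 7}{7 + 7} = 3.5$$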

Following this separation, the original convolution code can be rewritten as

/**
 * @brief Separable-kernel convolution (horizontal pass, then vertical pass)
 * 
 * @param image 
 * @param kernel_size 
 */
void SeparateKernel(const cv::Mat image,cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);   
    
    for (size_t i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss_mat.at<float>(i, 0);
    }

    int half_kernel_size = kernel_size / 2;
    int rows = image.rows;
    int cols = image.cols;

    // Horizontal pass
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n (same row, moving across columns)
                //conv += (image.at<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }

    // Vertical pass (note: dst is filtered in place, so rows above have already been vertically smoothed)
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m (same column, moving across rows)
                conv += (*dst.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
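To sanity-check the separated implementation, its output can be compared with OpenCV's own Gaussian filter (a sketch; SeparateKernel is assumed to be the function above, and exact agreement is not expected because of the 8-bit intermediate rounding, the in-place vertical pass and the untouched border):

#include "opencv2/opencv.hpp"
#include <iostream>

int main()
{
    cv::Mat src(480, 640, CV_8UC1);
    cv::randu(src, cv::Scalar(0), cv::Scalar(255));   // random 8-bit test image

    cv::Mat mine = src.clone(), ref;
    SeparateKernel(src, mine, 7);                     // hand-written separated version
    cv::GaussianBlur(src, ref, cv::Size(7, 7), 0.0);  // OpenCV reference

    // Compare the interior only; the hand-written version leaves a 3-pixel border untouched.
    cv::Rect roi(3, 3, src.cols - 6, src.rows - 6);
    std::cout << "max abs difference (interior): "
              << cv::norm(mine(roi), ref(roi), cv::NORM_INF) << std::endl;
    return 0;
}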

SIMD optimization

Building on the separated kernel, the next step is SIMD. Single instruction, multiple data lets one loop iteration convolve several pixels at once, using the CPU's vector instructions. Here I use OpenCV's built-in universal SIMD framework; for an introduction see my earlier article 《opencv统一simd框架使用(一)》. The implementation below targets the 128-bit width; a 256-bit or even 512-bit version could push performance further. The SIMD version of the separated convolution follows.
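The core of the implementation below is a widening chain: 16 packed 8-bit pixels are expanded to 16-bit, then 32-bit integers, and finally converted to four float32x4 registers so they can be multiplied by a broadcast Gaussian coefficient. As a stand-alone illustration of that pattern (a minimal sketch, assuming the universal-intrinsics header of the OpenCV 4.5 build used here; the 0.5 coefficient is arbitrary):

#include "opencv2/core.hpp"
#include "opencv2/core/simd_intrinsics.hpp"
#include <cstdio>

int main()
{
    unsigned char pixels[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
    float out[16];

    cv::v_uint8x16 v8 = cv::v_load(pixels);              // 16 x u8
    cv::v_uint16x8 lo16, hi16;
    cv::v_expand(v8, lo16, hi16);                        // widen to 2 x (8 x u16)
    cv::v_uint32x4 a, b, c, d;
    cv::v_expand(lo16, a, b);                            // widen to 4 x (4 x u32)
    cv::v_expand(hi16, c, d);

    cv::v_float32x4 w = cv::v_setall_f32(0.5f);          // one Gaussian coefficient, broadcast to all lanes
    cv::v_store(out + 0,  cv::v_cvt_f32(cv::v_reinterpret_as_s32(a)) * w);
    cv::v_store(out + 4,  cv::v_cvt_f32(cv::v_reinterpret_as_s32(b)) * w);
    cv::v_store(out + 8,  cv::v_cvt_f32(cv::v_reinterpret_as_s32(c)) * w);
    cv::v_store(out + 12, cv::v_cvt_f32(cv::v_reinterpret_as_s32(d)) * w);

    for (int i = 0; i < 16; i++) std::printf("%.1f ", out[i]);   // prints 0.0 0.5 1.0 ... 7.5
    std::printf("\n");
    return 0;
}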

/**
 * @brief Separable kernel + SIMD (128-bit universal intrinsics)
 * 
 * @param image 
 * @param kernel_size 
 */
void SeparateKernelSimd(const cv::Mat image, cv::Mat& dst,  const size_t kernel_size)
{
    cv::Mat gauss = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);   
    
    for (int i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss.at<float>(i, 0);
    }

    int rows = image.rows;
    int cols = image.cols;
    int half_kernel_size = kernel_size / 2;
    
    // Horizontal pass
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_ijn_arr{cv::v_setzero_u32(),cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n; 16 pixels of the row are processed per iteration of j
                cv::v_uint8x16 vijn = cv::v_load(image.ptr<uchar>(i, j + n));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(vijn, b1, b2);
                cv::v_expand(b1, v_ijn_arr[0], v_ijn_arr[1]);
                cv::v_expand(b2, v_ijn_arr[2], v_ijn_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[n + half_kernel_size]);
                v_conv[0] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[0])) * v_gauss_factor);
                v_conv[1] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[1])) * v_gauss_factor);
                v_conv[2] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[2])) * v_gauss_factor);
                v_conv[3] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[3])) * v_gauss_factor);
            }
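            // Round the four float accumulators to integers and pack them back down to 16 x u8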
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(),cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }

        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j ++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
    // Vertical pass
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {        
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{ cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32() };
            std::array<cv::v_uint32x4, 4> v_imj_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m; v_load reads 16 pixels of row i+m starting at column j
                cv::v_uint8x16 v_uint8 = cv::v_load(dst.ptr<uchar>(i + m, j));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(v_uint8, b1, b2);
                cv::v_expand(b1, v_imj_arr[0], v_imj_arr[1]);
                cv::v_expand(b2, v_imj_arr[2], v_imj_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[m + half_kernel_size]);
                v_conv[0] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[0])) * v_gauss_factor;
                v_conv[1] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[1])) * v_gauss_factor;
                v_conv[2] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[2])) * v_gauss_factor;
                v_conv[3] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[3])) * v_gauss_factor;
            }
            std::array<cv::v_uint16x8, 2> v_conv16{ cv::v_setzero_u16(),cv::v_setzero_u16() };
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m
                conv += (*dst.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i,j) = uchar(conv);
        }
    }
}
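Before moving on, it is worth validating the vectorized path against the scalar separated version on the same input (a sketch; both functions are assumed to be the ones defined above, and small differences are expected because v_round rounds to nearest while the scalar code truncates):

#include "opencv2/opencv.hpp"
#include <iostream>

int main()
{
    cv::Mat src(480, 640, CV_8UC1);
    cv::randu(src, cv::Scalar(0), cv::Scalar(255));   // random 8-bit test image

    cv::Mat scalar_out = src.clone();
    cv::Mat simd_out = src.clone();
    SeparateKernel(src, scalar_out, 7);               // scalar reference
    SeparateKernelSimd(src, simd_out, 7);             // vectorized version

    std::cout << "max abs difference: "
              << cv::norm(scalar_out, simd_out, cv::NORM_INF) << std::endl;
    return 0;
}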

OpenMP parallelization

OpenMP is another effective tool for heavy computations, and it is particularly convenient for parallelizing for loops in existing code. Below is the separated-kernel implementation with OpenMP added; compared with the original you will find exactly two extra lines.

/**
 * @brief Separable kernel + OpenMP parallelization
 *
 * @param image
 * @param kernel_size
 */
void SeparateKernelOpenMP(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);

    for (size_t i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss_mat.at<float>(i, 0);
    }

    int half_kernel_size = kernel_size / 2;
    int rows = image.rows;
    int cols = image.cols;

    // Horizontal pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n (same row, moving across columns)
                //conv += (image.at<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }

    // Vertical pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m (same column, moving across rows)
                conv += (*dst.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
}

By adding

#pragma omp parallel for

above a for loop, the existing code is parallelized across multiple threads. Multithreading is also compatible with SIMD, so OpenMP parallelization can be layered on top of the SIMD-optimized version, and the implementation is just as simple:

/**
 * @brief Separable kernel + SIMD + OpenMP
 *
 * @param image
 * @param kernel_size
 */
void SeparateKernelSimdOpenMP(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);

    for (int i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss.at<float>(i, 0);
    }

    int rows = image.rows;
    int cols = image.cols;
    int half_kernel_size = kernel_size / 2;

    // Horizontal pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{ cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32() };
            std::array<cv::v_uint32x4, 4> v_ijn_arr{ cv::v_setzero_u32(),cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32() };
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n; 16 pixels of the row are processed per iteration of j
                cv::v_uint8x16 vijn = cv::v_load(image.ptr<uchar>(i, j + n));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(vijn, b1, b2);
                cv::v_expand(b1, v_ijn_arr[0], v_ijn_arr[1]);
                cv::v_expand(b2, v_ijn_arr[2], v_ijn_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[n + half_kernel_size]);
                v_conv[0] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[0])) * v_gauss_factor);
                v_conv[1] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[1])) * v_gauss_factor);
                v_conv[2] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[2])) * v_gauss_factor);
                v_conv[3] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[3])) * v_gauss_factor);
            }
            std::array<cv::v_uint16x8, 2> v_conv16{ cv::v_setzero_u16(),cv::v_setzero_u16() };
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }

        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
    // Vertical pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{ cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32() };
            std::array<cv::v_uint32x4, 4> v_imj_arr{ cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32() };
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m; v_load reads 16 pixels of row i+m starting at column j
                cv::v_uint8x16 v_uint8 = cv::v_load(dst.ptr<uchar>(i + m, j));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(v_uint8, b1, b2);
                cv::v_expand(b1, v_imj_arr[0], v_imj_arr[1]);
                cv::v_expand(b2, v_imj_arr[2], v_imj_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[m + half_kernel_size]);
                v_conv[0] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[0])) * v_gauss_factor;
                v_conv[1] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[1])) * v_gauss_factor;
                v_conv[2] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[2])) * v_gauss_factor;
                v_conv[3] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[3])) * v_gauss_factor;
            }
            std::array<cv::v_uint16x8, 2> v_conv16{ cv::v_setzero_u16(),cv::v_setzero_u16() };
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m
                conv += (*dst.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
}
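Note that these pragmas only take effect when OpenMP is enabled at compile time (for example -fopenmp on gcc/clang or /openmp on MSVC); otherwise they are silently ignored and the loops run serially. A quick runtime check (a minimal sketch using the standard _OPENMP macro):

#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
    std::printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
#else
    std::printf("OpenMP not enabled; the parallel pragmas are ignored\n");
#endif
    return 0;
}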

Testing

Having walked through kernel separation, SIMD and OpenMP multithreading, let's benchmark how much each step improves on the original code. The results:

[Figure: convolution-optimization.png, runtime versus image size for each implementation]

As the figure shows, the runtime of the naive convolution grows very quickly with image size, while separating the kernel already brings a large reduction. Adding SIMD or OpenMP on top of the separated kernel raises efficiency further; in these tests OpenMP gives a more noticeable speedup than SIMD. Combining kernel separation, SIMD and OpenMP improves things again, a very substantial gain over the unoptimized baseline. Compared with OpenCV's own implementation, however, there is still room for improvement. The likely reasons: the test used a recent OpenCV build (4.5.4) compiled with TBB; OpenCV's convolution also relies on kernel separation and SIMD; my own SIMD code is not ideal (it does too many type conversions); and TBB gives OpenCV more multithreading headroom than OpenMP. I plan to keep investigating whether this implementation can be pushed past OpenCV's.

Complete test code

/**
 * @file main.cpp
 * @author your name (you@domain.com)
 * @brief 
 * @version 0.1
 * @date 2021-10-25
 * 
 * @copyright Copyright (c) 2021
 * 
 */

#include "opencv2/opencv.hpp"
#include "opencv2/core/simd_intrinsics.hpp"

#include <omp.h>

#include "matplotlibcpp.h"
namespace plt = matplotlibcpp;


/**
 * @brief Naive convolution, implemented directly from the formula
 * 
 * @param image 
 * @param kernel_size 
 */
void Origin(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    cv::Mat gauss_kernel = cv::Mat(kernel_size, kernel_size, CV_32F); 
    
    for (size_t i = 0; i < kernel_size; i++)
    {
        for (size_t j = 0; j < kernel_size; j++)
        {
            float val = gauss_mat.at<float>(i, 0) * gauss_mat.at<float>(j, 0);
            gauss_kernel.at<float>(i, j) = val;
        }
    }
    
    size_t rows = image.rows;
    size_t cols = image.cols;
    size_t half_kernel_size = kernel_size / 2;

    for (size_t i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (size_t j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (size_t m = 0; m < kernel_size; m++)
            {
                for (size_t n = 0; n < kernel_size; n++)
                {
                    int r = i + m - half_kernel_size;
                    int c = j + n - half_kernel_size;
                    float val = image.at<uchar>(r, c);
                    conv += (val * gauss_kernel.at<float>(m,n));
                }
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        } 
    }
}


/**
 * @brief Separable-kernel convolution (horizontal pass, then vertical pass)
 * 
 * @param image 
 * @param kernel_size 
 */
void SeparateKernel(const cv::Mat image,cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);   
    
    for (size_t i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss_mat.at<float>(i, 0);
    }

    int half_kernel_size = kernel_size / 2;
    int rows = image.rows;
    int cols = image.cols;

    // Horizontal pass
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n (same row, moving across columns)
                //conv += (image.at<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }

    // Vertical pass
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m (same column, moving across rows)
                conv += (*dst.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}

/**
 * @brief Separable kernel + OpenMP parallelization
 *
 * @param image
 * @param kernel_size
 */
void SeparateKernelOpenMP(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);

    for (size_t i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss_mat.at<float>(i, 0);
    }

    int half_kernel_size = kernel_size / 2;
    int rows = image.rows;
    int cols = image.cols;

    // Horizontal pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n (same row, moving across columns)
                //conv += (image.at<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }

    // Vertical pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m (same column, moving across rows)
                conv += (*dst.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
}

/**
 * @brief Separable kernel + SIMD (128-bit universal intrinsics)
 * 
 * @param image 
 * @param kernel_size 
 */
void SeparateKernelSimd(const cv::Mat image, cv::Mat& dst,  const size_t kernel_size)
{
    cv::Mat gauss = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);   
    
    for (int i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss.at<float>(i, 0);
    }

    int rows = image.rows;
    int cols = image.cols;
    int half_kernel_size = kernel_size / 2;
    
    // Horizontal pass
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_ijn_arr{cv::v_setzero_u32(),cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n; 16 pixels of the row are processed per iteration of j
                cv::v_uint8x16 vijn = cv::v_load(image.ptr<uchar>(i, j + n));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(vijn, b1, b2);
                cv::v_expand(b1, v_ijn_arr[0], v_ijn_arr[1]);
                cv::v_expand(b2, v_ijn_arr[2], v_ijn_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[n + half_kernel_size]);
                v_conv[0] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[0])) * v_gauss_factor);
                v_conv[1] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[1])) * v_gauss_factor);
                v_conv[2] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[2])) * v_gauss_factor);
                v_conv[3] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[3])) * v_gauss_factor);
            }
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(),cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }

        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j ++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
    // Vertical pass
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {        
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{ cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32() };
            std::array<cv::v_uint32x4, 4> v_imj_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m; v_load reads 16 pixels of row i+m starting at column j
                cv::v_uint8x16 v_uint8 = cv::v_load(dst.ptr<uchar>(i + m, j));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(v_uint8, b1, b2);
                cv::v_expand(b1, v_imj_arr[0], v_imj_arr[1]);
                cv::v_expand(b2, v_imj_arr[2], v_imj_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[m + half_kernel_size]);
                v_conv[0] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[0])) * v_gauss_factor;
                v_conv[1] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[1])) * v_gauss_factor;
                v_conv[2] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[2])) * v_gauss_factor;
                v_conv[3] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[3])) * v_gauss_factor;
            }
            std::array<cv::v_uint16x8, 2> v_conv16{ cv::v_setzero_u16(),cv::v_setzero_u16() };
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m
                conv += (*dst.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i,j) = uchar(conv);
        }
    }
}

/**
 * @brief Separable kernel + SIMD + OpenMP
 *
 * @param image
 * @param kernel_size
 */
void SeparateKernelSimdOpenMP(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);

    for (int i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss.at<float>(i, 0);
    }

    int rows = image.rows;
    int cols = image.cols;
    int half_kernel_size = kernel_size / 2;

    // Horizontal pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{ cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32() };
            std::array<cv::v_uint32x4, 4> v_ijn_arr{ cv::v_setzero_u32(),cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32() };
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n; 16 pixels of the row are processed per iteration of j
                cv::v_uint8x16 vijn = cv::v_load(image.ptr<uchar>(i, j + n));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(vijn, b1, b2);
                cv::v_expand(b1, v_ijn_arr[0], v_ijn_arr[1]);
                cv::v_expand(b2, v_ijn_arr[2], v_ijn_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[n + half_kernel_size]);
                v_conv[0] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[0])) * v_gauss_factor);
                v_conv[1] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[1])) * v_gauss_factor);
                v_conv[2] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[2])) * v_gauss_factor);
                v_conv[3] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[3])) * v_gauss_factor);
            }
            std::array<cv::v_uint16x8, 2> v_conv16{ cv::v_setzero_u16(),cv::v_setzero_u16() };
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }

        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
    // Vertical pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{ cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32() };
            std::array<cv::v_uint32x4, 4> v_imj_arr{ cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32() };
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m; v_load reads 16 pixels of row i+m starting at column j
                cv::v_uint8x16 v_uint8 = cv::v_load(dst.ptr<uchar>(i + m, j));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(v_uint8, b1, b2);
                cv::v_expand(b1, v_imj_arr[0], v_imj_arr[1]);
                cv::v_expand(b2, v_imj_arr[2], v_imj_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[m + half_kernel_size]);
                v_conv[0] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[0])) * v_gauss_factor;
                v_conv[1] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[1])) * v_gauss_factor;
                v_conv[2] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[2])) * v_gauss_factor;
                v_conv[3] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[3])) * v_gauss_factor;
            }
            std::array<cv::v_uint16x8, 2> v_conv16{ cv::v_setzero_u16(),cv::v_setzero_u16() };
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m
                conv += (*dst.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
}


/**
 * @brief Separable kernel + SIMD + OpenMP (placeholder for a future 256-bit version; currently identical to the 128-bit implementation and not called from main)
 *
 * @param image
 * @param kernel_size
 */
void SeparateKernelSimdOpenMP256(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);

    for (int i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss.at<float>(i, 0);
    }

    int rows = image.rows;
    int cols = image.cols;
    int half_kernel_size = kernel_size / 2;

    // Horizontal pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{ cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32() };
            std::array<cv::v_uint32x4, 4> v_ijn_arr{ cv::v_setzero_u32(),cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32() };
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n; 16 pixels of the row are processed per iteration of j
                cv::v_uint8x16 vijn = cv::v_load(image.ptr<uchar>(i, j + n));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(vijn, b1, b2);
                cv::v_expand(b1, v_ijn_arr[0], v_ijn_arr[1]);
                cv::v_expand(b2, v_ijn_arr[2], v_ijn_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[n + half_kernel_size]);
                v_conv[0] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[0])) * v_gauss_factor);
                v_conv[1] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[1])) * v_gauss_factor);
                v_conv[2] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[2])) * v_gauss_factor);
                v_conv[3] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[3])) * v_gauss_factor);
            }
            std::array<cv::v_uint16x8, 2> v_conv16{ cv::v_setzero_u16(),cv::v_setzero_u16() };
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }

        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): m fixed, iterate over n
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
    // Vertical pass
#pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        // Step 16 columns at a time; stop early enough that the 16-byte loads and stores stay inside the row
        for (int j = half_kernel_size; j + 16 <= cols - half_kernel_size; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{ cv::v_setzero_f32(),cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32() };
            std::array<cv::v_uint32x4, 4> v_imj_arr{ cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32() };
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m; v_load reads 16 pixels of row i+m starting at column j
                cv::v_uint8x16 v_uint8 = cv::v_load(dst.ptr<uchar>(i + m, j));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(v_uint8, b1, b2);
                cv::v_expand(b1, v_imj_arr[0], v_imj_arr[1]);
                cv::v_expand(b2, v_imj_arr[2], v_imj_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[m + half_kernel_size]);
                v_conv[0] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[0])) * v_gauss_factor;
                v_conv[1] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[1])) * v_gauss_factor;
                v_conv[2] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[2])) * v_gauss_factor;
                v_conv[3] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[3])) * v_gauss_factor;
            }
            std::array<cv::v_uint16x8, 2> v_conv16{ cv::v_setzero_u16(),cv::v_setzero_u16() };
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Handle the remaining columns not covered by the 16-wide vector loop
        int current_j = (cols - kernel_size) - (cols - kernel_size) % 16;
        for (int j = current_j; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): n fixed, iterate over m
                conv += (*dst.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = uchar(conv);
        }
    }
}



int main(int argc, char** argv)
{
    std::vector<double> x;
    std::vector<double> origin_y;
    std::vector<double> opencv_y;
    std::vector<double> separate_y;
    std::vector<double> separate_simd_y;
    std::vector<double> separate_openmp_y;
    std::vector<double> separate_simd_openmp_y;
    for (int h = 2; h < 20; h++)
    {
        size_t height = 100 * h;
        x.push_back(height);
        size_t width = height * (4.0 / 3.0);
        cv::Mat source = cv::Mat(cv::Size(width, height), CV_8UC1, cv::Scalar::all(100));
        cv::RNG rng;
        rng.fill(source, cv::RNG::NORMAL, 100, 20);

        cv::Mat separate_result = source.clone();
        cv::Mat orgin_result = source.clone();
        cv::Mat simd_result = source.clone();
        cv::Mat separate_omp_result = source.clone();
        cv::Mat separate_simd_omp_result = source.clone();
        cv::Mat opencv_result = source.clone();

        // Time each implementation on the same random image with a 7x7 kernel
        auto t0 = std::chrono::system_clock::now();
        cv::GaussianBlur(source, opencv_result, cv::Size(7, 7), 0.0, 0.0);
        auto t1 = std::chrono::system_clock::now();
        SeparateKernel(source, separate_result, 7);
        auto t2 = std::chrono::system_clock::now();
        SeparateKernelSimd(source, simd_result, 7);
        auto t3 = std::chrono::system_clock::now();
        SeparateKernelOpenMP(source, separate_omp_result, 7);
        auto t4 = std::chrono::system_clock::now();
        SeparateKernelSimdOpenMP(source, separate_simd_omp_result, 7);
        auto t5 = std::chrono::system_clock::now();
        Origin(source, orgin_result, 7);
        auto t6 = std::chrono::system_clock::now();

        opencv_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
        separate_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
        separate_simd_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count());
        separate_openmp_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t4 - t3).count());
        separate_simd_openmp_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t5 - t4).count());
        origin_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t6 - t5).count());
    }

    std::map<std::string, std::string> origin_result;
    std::map<std::string, std::string> opencv_result;
    std::map<std::string, std::string> separate_result;
    std::map<std::string, std::string> simd_result;
    std::map<std::string, std::string> openmp_result;
    std::map<std::string, std::string> simd_openmp_result;
    origin_result.insert(std::pair<std::string, std::string>("color", "red"));
    origin_result.insert(std::pair<std::string, std::string>("label", "none-optimization"));
    opencv_result.insert(std::pair<std::string, std::string>("color", "purple"));
    opencv_result.insert(std::pair<std::string, std::string>("label", "opencv(with tbb)"));
    separate_result.insert(std::pair<std::string, std::string>("color", "yellow"));
    separate_result.insert(std::pair<std::string, std::string>("label", "separate"));
    simd_result.insert(std::pair<std::string, std::string>("color", "green"));
    simd_result.insert(std::pair<std::string, std::string>("label", "separate+simd"));
    openmp_result.insert(std::pair<std::string, std::string>("color", "blue"));
    openmp_result.insert(std::pair<std::string, std::string>("label", "separate+openmp"));
    simd_openmp_result.insert(std::pair<std::string, std::string>("color", "black"));
    simd_openmp_result.insert(std::pair<std::string, std::string>("label", "separate+simd+openmp"));
    plt::plot(x, opencv_y, opencv_result);
    plt::plot(x, origin_y, origin_result);
    plt::plot(x, separate_y, separate_result);
    plt::plot(x, separate_simd_y, simd_result);
    plt::plot(x, separate_openmp_y, openmp_result);
    plt::plot(x, separate_simd_openmp_y, simd_openmp_result);
    plt::xlabel("image size");
    plt::ylabel("cost time:ms");
    plt::legend();
    plt::title("Gaussian blur performance");
    plt::show();
 
    return 0;
}



Published by 芒果浩明; please credit the source when reposting.
Original link: https://mangoroom.cn/opencv/convolution-optimization-taking-gaussian-filtering-as-an-example.html



Featured comments

  1. yumi

     The SIMD separable-kernel convolution looks questionable to me. How does the vertical pass load its data?

     1. 芒果 (author)

        The vertical convolution uses a separate for loop to iterate over the rows; the column index j stays fixed, so each v_load reads 16 consecutive pixels of row i + m starting at column j.