The 2D Convolution Formula
The 1D convolution is defined as:
$$(f * g)(n) = \int_{-\infty}^{+\infty}f(\tau)\,g(n-\tau)\,d\tau$$
This expresses the convolution of f with g: the product of $f(\tau)$ and $g(n - \tau)$, integrated over $(-\infty, +\infty)$, where $n$ is the point at which the convolution is evaluated. That is the continuous form; the discrete convolution is:
$$(f * g)(n) = \sum_{\tau=-\infty}^{+\infty}{f(\tau)g(n - \tau)}$$
Replacing the integral with a summation gives the discrete form; an integral is, after all, an infinite accumulation over a continuous domain.
The 2D convolution is:
$$(I * K)(i,j) = \sum_{m}\sum_{n}{I(i - m, j - n)K(m, n)}$$
Here $I$ is a 2D matrix and $K$ is an $m \times n$ convolution kernel. The formula accumulates the products of $I(i - m, j - n)$ and $K(m, n)$: rotate the kernel by 180 degrees, multiply it element-wise against the neighborhood of $I$, sum, and assign the result to position $(i, j)$. The closely related operation is correlation:
$$(I \star K)(i,j) = \sum_{m}\sum_{n}{I(i + m, j + n)K(m, n)}$$
The only difference is that the minus signs become plus signs, so the kernel no longer needs the 180-degree rotation before the multiply-accumulate. Skipping the rotation simplifies the code and helps optimization, and since convolution is most often applied to images with symmetric kernels, the rotation rarely changes the result; many image-processing and deep-learning libraries therefore implement "convolution" as correlation. The optimizations below also target the correlation form. A direct implementation of the formula:
/**
 * @brief Naive convolution (correlation form), directly from the formula.
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void Origin(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    // Build the 2D Gaussian kernel as the outer product of the 1D kernel.
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    cv::Mat gauss_kernel = cv::Mat(kernel_size, kernel_size, CV_32F);
    for (size_t i = 0; i < kernel_size; i++)
    {
        for (size_t j = 0; j < kernel_size; j++)
        {
            float val = gauss_mat.at<float>(i, 0) * gauss_mat.at<float>(j, 0);
            gauss_kernel.at<float>(i, j) = val;
        }
    }
    size_t rows = image.rows;
    size_t cols = image.cols;
    size_t half_kernel_size = kernel_size / 2;
    // Border pixels are skipped, as in the formulas above.
    for (size_t i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (size_t j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (size_t m = 0; m < kernel_size; m++)
            {
                for (size_t n = 0; n < kernel_size; n++)
                {
                    int r = i + m - half_kernel_size;
                    int c = j + n - half_kernel_size;
                    float val = image.at<uchar>(r, c);
                    conv += (val * gauss_kernel.at<float>(m, n));
                }
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
The above implements the formula verbatim, with no optimization at all.
Optimization Techniques
Using Gaussian filtering as the running example, the sections below apply kernel separation, SIMD, and OpenMP parallelization in turn.
First, Gaussian filtering in image processing means convolving the input image with a Gaussian kernel $K(m,n)$, where the 2D Gaussian kernel is
$$K(m,n) = {\frac{1}{2\pi\sigma_1\sigma_2}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}-\frac{(n - \mu_2)^2}{2\sigma_2^2}}}$$
Substituting into the 2D correlation formula
$$(I \star K)(i,j) = \sum_{m}\sum_{n}{I(i + m, j + n)K(m, n)}$$
gives
$$(I \star K)(i,j) = \sum_{m}\sum_{n}{I(i + m, j + n){\frac{1}{2\pi\sigma_1\sigma_2}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}-\frac{(n - \mu_2)^2}{2\sigma_2^2}}}}$$
Kernel Separation
Factoring the exponential in the kernel:
$$(I \star K)(i,j) = \sum_{m}\sum_{n}{I(i + m, j + n){\frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}}}\cdot{\frac{1}{\sqrt{2\pi}\sigma_2}e^{-\frac{(n -\mu_2)^2}{2\sigma_2^2}}}}$$
Inside the inner sum $m$ is fixed and only $n$ varies, so the factor that depends only on $m$,
$${\frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}}}$$
can be pulled out of the inner sum:
$$(I \star K)(i,j) = \sum_{m}{\frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{(m -\mu_1)^2}{2\sigma_1^2}}}\cdot\left[\sum_{n}{I(i + m, j + n){\frac{1}{\sqrt{2\pi}\sigma_2}e^{-\frac{(n -\mu_2)^2}{2\sigma_2^2}}}}\right]$$
The inner sum is simply a 1D Gaussian convolution: $m$ is fixed, $n$ varies, and $n$ indexes matrix columns, so the inner pass is a horizontal convolution. The outer sum is another 1D Gaussian convolution whose input is the result of the inner pass; there $m$ varies and indexes matrix rows, so the outer pass is a vertical convolution. The 2D Gaussian convolution has thus been decomposed into two 1D Gaussian convolutions: run the horizontal pass first, then the vertical pass.
Speedup Analysis
Assume the image is $W \times H$ and the Gaussian kernel is $m \times n$; ignore the border pixels that do not participate. Counting multiply-accumulates, convolving one pixel means visiting its whole neighborhood once, i.e. $m \times n$ operations. The image has $W \times H$ pixels, so the total cost is
$$O = W \cdot H \cdot m \cdot n$$
After separating the kernel, the horizontal pass costs $n$ operations per pixel, or $W \cdot H \cdot n$ for the whole image, and the vertical pass costs $m$ per pixel, or $W \cdot H \cdot m$, for a total of
$$O = W \cdot H \cdot (n + m)$$
The theoretical speedup is therefore
$$ratio = \frac{mn}{m + n}$$
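For example, with the $7 \times 7$ kernel used in the benchmark later in this post, this works out to
$$ratio = \frac{7 \times 7}{7 + 7} = \frac{49}{14} = 3.5$$
so separation alone should make the filter roughly 3.5x faster, before any SIMD or threading is applied.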
Following this separation, the original convolution code can be rewritten as
/**
 * @brief Separated kernel: one horizontal 1D pass, then one vertical 1D pass.
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void SeparateKernel(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);
    for (size_t i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss_mat.at<float>(i, 0);
    }
    int half_kernel_size = kernel_size / 2;
    int rows = image.rows;
    int cols = image.cols;
    // Horizontal pass goes into an intermediate buffer so the vertical pass
    // never reads pixels it has already overwritten.
    cv::Mat mid = image.clone();
    for (int i = 0; i < rows; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): row fixed, sliding along columns
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *mid.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
    // Vertical pass: read mid, write dst.
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): column fixed, sliding along rows
                conv += (*mid.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
SIMD Optimization
Building on the separated kernel, the next step is SIMD. With SIMD (single instruction, multiple data) support in the CPU, a single loop iteration can convolve several pixels at once. The code below uses OpenCV's built-in universal intrinsics framework (see my earlier article, opencv统一simd框架使用(一), for an introduction). This implementation targets 128-bit registers; a 256-bit or even 512-bit variant could push throughput further.
/**
 * @brief Separated kernel + SIMD (128-bit OpenCV universal intrinsics).
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void SeparateKernelSimd(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);
    for (int i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss.at<float>(i, 0);
    }
    int rows = image.rows;
    int cols = image.cols;
    int half_kernel_size = kernel_size / 2;
    // First column (exclusive) that a full 16-pixel vector may no longer start at.
    int simd_end = half_kernel_size + ((cols - 2 * half_kernel_size) / 16) * 16;
    // Horizontal pass into an intermediate buffer, 16 pixels per iteration.
    cv::Mat mid = image.clone();
    for (int i = 0; i < rows; i++)
    {
        for (int j = half_kernel_size; j < simd_end; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_ijn_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): row fixed, sliding along columns.
                // Load 16 uchar pixels, widen to 4x4 uint32 lanes, convert to
                // float, and multiply-accumulate against the broadcast tap.
                cv::v_uint8x16 vijn = cv::v_load(image.ptr<uchar>(i, j + n));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(vijn, b1, b2);
                cv::v_expand(b1, v_ijn_arr[0], v_ijn_arr[1]);
                cv::v_expand(b2, v_ijn_arr[2], v_ijn_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[n + half_kernel_size]);
                v_conv[0] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[0])) * v_gauss_factor);
                v_conv[1] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[1])) * v_gauss_factor);
                v_conv[2] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[2])) * v_gauss_factor);
                v_conv[3] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[3])) * v_gauss_factor);
            }
            // Round, narrow back to uint8 with saturation, store 16 results.
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(), cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(mid.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Scalar tail for the remaining pixels of the row.
        for (int j = simd_end; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *mid.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
    // Vertical pass: read the horizontal result from mid, write dst.
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < simd_end; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_imj_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): column fixed, sliding along rows.
                cv::v_uint8x16 v_uint8 = cv::v_load(mid.ptr<uchar>(i + m, j));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(v_uint8, b1, b2);
                cv::v_expand(b1, v_imj_arr[0], v_imj_arr[1]);
                cv::v_expand(b2, v_imj_arr[2], v_imj_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[m + half_kernel_size]);
                v_conv[0] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[0])) * v_gauss_factor;
                v_conv[1] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[1])) * v_gauss_factor;
                v_conv[2] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[2])) * v_gauss_factor;
                v_conv[3] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[3])) * v_gauss_factor;
            }
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(), cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Scalar tail.
        for (int j = simd_end; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                conv += (*mid.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
OpenMP Parallelization
OpenMP is another easy win for compute-heavy code, and it is especially convenient for parallelizing for loops in an existing program. Below is the separated-kernel implementation with OpenMP added; compared with the original code, only two lines are new.
/**
 * @brief Separated kernel + OpenMP parallelization.
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void SeparateKernelOpenMP(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);
    for (size_t i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss_mat.at<float>(i, 0);
    }
    int half_kernel_size = kernel_size / 2;
    int rows = image.rows;
    int cols = image.cols;
    // Horizontal pass; rows are independent, so the outer loop parallelizes.
    cv::Mat mid = image.clone();
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): row fixed, sliding along columns
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *mid.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
    // Vertical pass; reads mid and writes dst, so threads never conflict.
    #pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): column fixed, sliding along rows
                conv += (*mid.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
Simply adding
#pragma omp parallel for
above each outer for loop parallelizes the existing code across threads. OpenMP also composes with SIMD, so we can put multithreading on top of the SIMD version just as easily:
/**
 * @brief Separated kernel + SIMD + OpenMP.
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void SeparateKernelSimdOpenMP(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);
    for (int i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss.at<float>(i, 0);
    }
    int rows = image.rows;
    int cols = image.cols;
    int half_kernel_size = kernel_size / 2;
    // First column (exclusive) that a full 16-pixel vector may no longer start at.
    int simd_end = half_kernel_size + ((cols - 2 * half_kernel_size) / 16) * 16;
    // Horizontal pass into an intermediate buffer; rows are independent.
    cv::Mat mid = image.clone();
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
    {
        for (int j = half_kernel_size; j < simd_end; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_ijn_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): row fixed, sliding along columns.
                cv::v_uint8x16 vijn = cv::v_load(image.ptr<uchar>(i, j + n));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(vijn, b1, b2);
                cv::v_expand(b1, v_ijn_arr[0], v_ijn_arr[1]);
                cv::v_expand(b2, v_ijn_arr[2], v_ijn_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[n + half_kernel_size]);
                v_conv[0] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[0])) * v_gauss_factor);
                v_conv[1] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[1])) * v_gauss_factor);
                v_conv[2] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[2])) * v_gauss_factor);
                v_conv[3] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[3])) * v_gauss_factor);
            }
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(), cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(mid.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Scalar tail for the remaining pixels of the row.
        for (int j = simd_end; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *mid.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
    // Vertical pass: read mid, write dst.
    #pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < simd_end; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_imj_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): column fixed, sliding along rows.
                cv::v_uint8x16 v_uint8 = cv::v_load(mid.ptr<uchar>(i + m, j));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(v_uint8, b1, b2);
                cv::v_expand(b1, v_imj_arr[0], v_imj_arr[1]);
                cv::v_expand(b2, v_imj_arr[2], v_imj_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[m + half_kernel_size]);
                v_conv[0] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[0])) * v_gauss_factor;
                v_conv[1] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[1])) * v_gauss_factor;
                v_conv[2] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[2])) * v_gauss_factor;
                v_conv[3] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[3])) * v_gauss_factor;
            }
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(), cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Scalar tail.
        for (int j = simd_end; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                conv += (*mid.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
Benchmark
Having walked through kernel separation, SIMD, and OpenMP, we can now measure how much each step gains over the original code. The measured results (runtime versus image size for each variant):
The naive convolution's runtime grows very quickly with image size, and separating the kernel alone cuts it dramatically. On top of the separated kernel, SIMD and OpenMP each improve throughput further, with OpenMP giving the larger boost of the two. Combining separation, SIMD, and OpenMP improves things again, a very substantial speedup over the unoptimized code. OpenCV's implementation is still faster, though. This test used the then-latest OpenCV 4.5.4 built with TBB, and OpenCV's convolution likewise uses kernel separation and SIMD; my SIMD implementation is probably not optimal (it spends too much time on type conversions), and TBB has more room than OpenMP for multithreading tricks. Whether the gap to OpenCV can be closed is left for future work.
Complete Test Code
/**
* @file main.cpp
* @author your name (you@domain.com)
* @brief
* @version 0.1
* @date 2021-10-25
*
* @copyright Copyright (c) 2021
*
*/
#include "opencv2/opencv.hpp"
#include "opencv2/core/simd_intrinsics.hpp"
#include <omp.h>
#include "matplotlibcpp.h"
namespace plt = matplotlibcpp;
/**
 * @brief Naive convolution (correlation form), directly from the formula.
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void Origin(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    // Build the 2D Gaussian kernel as the outer product of the 1D kernel.
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    cv::Mat gauss_kernel = cv::Mat(kernel_size, kernel_size, CV_32F);
    for (size_t i = 0; i < kernel_size; i++)
    {
        for (size_t j = 0; j < kernel_size; j++)
        {
            float val = gauss_mat.at<float>(i, 0) * gauss_mat.at<float>(j, 0);
            gauss_kernel.at<float>(i, j) = val;
        }
    }
    size_t rows = image.rows;
    size_t cols = image.cols;
    size_t half_kernel_size = kernel_size / 2;
    // Border pixels are skipped, as in the formulas above.
    for (size_t i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (size_t j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (size_t m = 0; m < kernel_size; m++)
            {
                for (size_t n = 0; n < kernel_size; n++)
                {
                    int r = i + m - half_kernel_size;
                    int c = j + n - half_kernel_size;
                    float val = image.at<uchar>(r, c);
                    conv += (val * gauss_kernel.at<float>(m, n));
                }
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
/**
 * @brief Separated kernel: one horizontal 1D pass, then one vertical 1D pass.
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void SeparateKernel(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);
    for (size_t i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss_mat.at<float>(i, 0);
    }
    int half_kernel_size = kernel_size / 2;
    int rows = image.rows;
    int cols = image.cols;
    // Horizontal pass goes into an intermediate buffer so the vertical pass
    // never reads pixels it has already overwritten.
    cv::Mat mid = image.clone();
    for (int i = 0; i < rows; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): row fixed, sliding along columns
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *mid.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
    // Vertical pass: read mid, write dst.
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): column fixed, sliding along rows
                conv += (*mid.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
/**
 * @brief Separated kernel + OpenMP parallelization.
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void SeparateKernelOpenMP(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss_mat = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);
    for (size_t i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss_mat.at<float>(i, 0);
    }
    int half_kernel_size = kernel_size / 2;
    int rows = image.rows;
    int cols = image.cols;
    // Horizontal pass; rows are independent, so the outer loop parallelizes.
    cv::Mat mid = image.clone();
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): row fixed, sliding along columns
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *mid.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
    // Vertical pass; reads mid and writes dst, so threads never conflict.
    #pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): column fixed, sliding along rows
                conv += (*mid.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
/**
 * @brief Separated kernel + SIMD (128-bit OpenCV universal intrinsics).
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void SeparateKernelSimd(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);
    for (int i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss.at<float>(i, 0);
    }
    int rows = image.rows;
    int cols = image.cols;
    int half_kernel_size = kernel_size / 2;
    // First column (exclusive) that a full 16-pixel vector may no longer start at.
    int simd_end = half_kernel_size + ((cols - 2 * half_kernel_size) / 16) * 16;
    // Horizontal pass into an intermediate buffer, 16 pixels per iteration.
    cv::Mat mid = image.clone();
    for (int i = 0; i < rows; i++)
    {
        for (int j = half_kernel_size; j < simd_end; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_ijn_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): row fixed, sliding along columns.
                // Load 16 uchar pixels, widen to 4x4 uint32 lanes, convert to
                // float, and multiply-accumulate against the broadcast tap.
                cv::v_uint8x16 vijn = cv::v_load(image.ptr<uchar>(i, j + n));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(vijn, b1, b2);
                cv::v_expand(b1, v_ijn_arr[0], v_ijn_arr[1]);
                cv::v_expand(b2, v_ijn_arr[2], v_ijn_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[n + half_kernel_size]);
                v_conv[0] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[0])) * v_gauss_factor);
                v_conv[1] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[1])) * v_gauss_factor);
                v_conv[2] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[2])) * v_gauss_factor);
                v_conv[3] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[3])) * v_gauss_factor);
            }
            // Round, narrow back to uint8 with saturation, store 16 results.
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(), cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(mid.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Scalar tail for the remaining pixels of the row.
        for (int j = simd_end; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *mid.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
    // Vertical pass: read the horizontal result from mid, write dst.
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < simd_end; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_imj_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): column fixed, sliding along rows.
                cv::v_uint8x16 v_uint8 = cv::v_load(mid.ptr<uchar>(i + m, j));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(v_uint8, b1, b2);
                cv::v_expand(b1, v_imj_arr[0], v_imj_arr[1]);
                cv::v_expand(b2, v_imj_arr[2], v_imj_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[m + half_kernel_size]);
                v_conv[0] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[0])) * v_gauss_factor;
                v_conv[1] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[1])) * v_gauss_factor;
                v_conv[2] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[2])) * v_gauss_factor;
                v_conv[3] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[3])) * v_gauss_factor;
            }
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(), cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Scalar tail.
        for (int j = simd_end; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                conv += (*mid.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
/**
 * @brief Separated kernel + SIMD + OpenMP.
 *
 * @param image       input 8-bit single-channel image
 * @param dst         output image, same size as input
 * @param kernel_size Gaussian kernel size (odd)
 */
void SeparateKernelSimdOpenMP(const cv::Mat image, cv::Mat& dst, const size_t kernel_size)
{
    cv::Mat gauss = cv::getGaussianKernel(kernel_size, 0, CV_32F);
    std::vector<float> gauss_kernel(kernel_size);
    for (int i = 0; i < kernel_size; i++)
    {
        gauss_kernel[i] = gauss.at<float>(i, 0);
    }
    int rows = image.rows;
    int cols = image.cols;
    int half_kernel_size = kernel_size / 2;
    // First column (exclusive) that a full 16-pixel vector may no longer start at.
    int simd_end = half_kernel_size + ((cols - 2 * half_kernel_size) / 16) * 16;
    // Horizontal pass into an intermediate buffer; rows are independent.
    cv::Mat mid = image.clone();
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
    {
        for (int j = half_kernel_size; j < simd_end; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_ijn_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                // I(i, j + n) * g(n): row fixed, sliding along columns.
                cv::v_uint8x16 vijn = cv::v_load(image.ptr<uchar>(i, j + n));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(vijn, b1, b2);
                cv::v_expand(b1, v_ijn_arr[0], v_ijn_arr[1]);
                cv::v_expand(b2, v_ijn_arr[2], v_ijn_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[n + half_kernel_size]);
                v_conv[0] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[0])) * v_gauss_factor);
                v_conv[1] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[1])) * v_gauss_factor);
                v_conv[2] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[2])) * v_gauss_factor);
                v_conv[3] += (cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_ijn_arr[3])) * v_gauss_factor);
            }
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(), cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(mid.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Scalar tail for the remaining pixels of the row.
        for (int j = simd_end; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int n = -half_kernel_size; n <= half_kernel_size; n++)
            {
                conv += (*image.ptr<uchar>(i, j + n) * gauss_kernel[n + half_kernel_size]);
            }
            *mid.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
    // Vertical pass: read mid, write dst.
    #pragma omp parallel for
    for (int i = half_kernel_size; i < rows - half_kernel_size; i++)
    {
        for (int j = half_kernel_size; j < simd_end; j += 16)
        {
            std::array<cv::v_float32x4, 4> v_conv{cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32(), cv::v_setzero_f32()};
            std::array<cv::v_uint32x4, 4> v_imj_arr{cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32(), cv::v_setzero_u32()};
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                // I(i + m, j) * g(m): column fixed, sliding along rows.
                cv::v_uint8x16 v_uint8 = cv::v_load(mid.ptr<uchar>(i + m, j));
                cv::v_uint16x8 b1, b2;
                cv::v_expand(v_uint8, b1, b2);
                cv::v_expand(b1, v_imj_arr[0], v_imj_arr[1]);
                cv::v_expand(b2, v_imj_arr[2], v_imj_arr[3]);
                cv::v_float32x4 v_gauss_factor = cv::v_setall_f32(gauss_kernel[m + half_kernel_size]);
                v_conv[0] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[0])) * v_gauss_factor;
                v_conv[1] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[1])) * v_gauss_factor;
                v_conv[2] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[2])) * v_gauss_factor;
                v_conv[3] += cv::v_cvt_f32(cv::v_reinterpret_as_s32(v_imj_arr[3])) * v_gauss_factor;
            }
            std::array<cv::v_uint16x8, 2> v_conv16{cv::v_setzero_u16(), cv::v_setzero_u16()};
            v_conv16[0] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[0]), cv::v_round(v_conv[1])));
            v_conv16[1] = cv::v_abs(cv::v_pack(cv::v_round(v_conv[2]), cv::v_round(v_conv[3])));
            cv::v_uint8x16 v_conv_uint8 = cv::v_pack(v_conv16[0], v_conv16[1]);
            cv::v_store(dst.ptr<uchar>(i, j), v_conv_uint8);
        }
        // Scalar tail.
        for (int j = simd_end; j < cols - half_kernel_size; j++)
        {
            float conv = 0.0f;
            for (int m = -half_kernel_size; m <= half_kernel_size; m++)
            {
                conv += (*mid.ptr<uchar>(i + m, j) * gauss_kernel[m + half_kernel_size]);
            }
            *dst.ptr<uchar>(i, j) = static_cast<uchar>(conv);
        }
    }
}
int main(int argc, char** argv)
{
std::vector<double> x;
std::vector<double> origin_y;
std::vector<double> opencv_y;
std::vector<double> separate_y;
std::vector<double> separate_simd_y;
std::vector<double> separate_openmp_y;
std::vector<double> separate_simd_openmp_y;
for (int h = 2; h < 20; h++)
{
size_t height = 100 * h;
x.push_back(height);
size_t width = height * (4.0 / 3.0);
cv::Mat source = cv::Mat(cv::Size(width, height), CV_8UC1, cv::Scalar::all(100));
cv::RNG rng;
rng.fill(source, cv::RNG::NORMAL, 100, 20);
cv::Mat separate_result = source.clone();
cv::Mat orgin_result = source.clone();
cv::Mat simd_result = source.clone();
cv::Mat separate_omp_result = source.clone();
cv::Mat separate_simd_omp_result = source.clone();
cv::Mat opencv_result = source.clone();
auto t0 = std::chrono::system_clock::now();
cv::GaussianBlur(source, opencv_result, cv::Size(7, 7), 0.0, 0.0);
auto t1 = std::chrono::system_clock::now();
SeparateKernel(source, separate_result, 7);
auto t2 = std::chrono::system_clock::now();
SeparateKernelSimd(source, simd_result, 7);
auto t3 = std::chrono::system_clock::now();
SeparateKernelOpenMP(source, separate_omp_result, 7);
auto t4 = std::chrono::system_clock::now();
SeparateKernelSimdOpenMP(source, separate_simd_omp_result, 7);
auto t5 = std::chrono::system_clock::now();
Origin(source, orgin_result, 7);
auto t6 = std::chrono::system_clock::now();
opencv_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
separate_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
separate_simd_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count());
separate_openmp_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t4 - t3).count());
separate_simd_openmp_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t5 - t4).count());
origin_y.push_back(std::chrono::duration_cast<std::chrono::milliseconds>(t6 - t5).count());
}
std::map<std::string, std::string> origin_result;
std::map<std::string, std::string> opencv_result;
std::map<std::string, std::string> separate_result;
std::map<std::string, std::string> simd_result;
std::map<std::string, std::string> openmp_result;
std::map<std::string, std::string> simd_openmp_result;
origin_result.insert(std::pair<std::string, std::string>("color", "red"));
origin_result.insert(std::pair<std::string, std::string>("label", "none-optimization"));
opencv_result.insert(std::pair<std::string, std::string>("color", "purple"));
opencv_result.insert(std::pair<std::string, std::string>("label", "opencv(with tbb)"));
separate_result.insert(std::pair<std::string, std::string>("color", "yellow"));
separate_result.insert(std::pair<std::string, std::string>("label", "separate"));
simd_result.insert(std::pair<std::string, std::string>("color", "green"));
simd_result.insert(std::pair<std::string, std::string>("label", "separate+simd"));
openmp_result.insert(std::pair<std::string, std::string>("color", "blue"));
openmp_result.insert(std::pair<std::string, std::string>("label", "separate+openmp"));
simd_openmp_result.insert(std::pair<std::string, std::string>("color", "black"));
simd_openmp_result.insert(std::pair<std::string, std::string>("label", "separate+simd+openmp"));
plt::plot(x, opencv_y, opencv_result);
plt::plot(x, origin_y, origin_result);
plt::plot(x, separate_y, separate_result);
plt::plot(x, separate_simd_y, simd_result);
plt::plot(x, separate_openmp_y, openmp_result);
plt::plot(x, separate_simd_openmp_y, simd_openmp_result);
plt::xlabel("image size");
plt::ylabel("cost time:ms");
plt::legend();
plt::title("Gaussian blur performance");
plt::show();
return 0;
}
Published by 芒果浩明 (mangoroom); please credit the source when reposting.
Link: https://mangoroom.cn/opencv/convolution-optimization-taking-gaussian-filtering-as-an-example.html
Comment: Something in the SIMD separable convolution looks off to me. How does the vertical pass load its data?
Reply: The vertical convolution uses its own loop to walk over the rows; the column index j stays fixed while m moves across rows.