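[Armfly (安富莱) DSP Tutorial] Chapter 9: Using BasicMathFunctions (Part 2)
Special note: the complete 45-part digital signal processing tutorial and the original, fully open-source high-performance oscilloscope code are available at: link
This chapter covers five groups of basic math functions: negation, offset, shift, subtraction and scaling.
9.1 Negation (Vector Negate)
9.2 Offset (Vector Offset)
9.3 Shift (Vector Shift)
9.4 Subtraction (Vector Sub)
9.5 Scaling (Vector Scale)
9.6 Important notes on BasicMathFunctions
9.7 Summary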
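9.1 Negation (Vector Negate)
These functions negate every element of a vector:
pDst[n] = -pSrc[n], 0 <= n < blockSize.
Note in particular that these functions work in place: the destination pointer and the source pointer may refer to the same buffer.
9.1.1 arm_negate_f32
This function negates 32-bit floating-point data. The annotated source code is as follows: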
/**
* @brief Negates the elements of a floating-point vector.
* @param *pSrc points to the input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*/
void arm_negate_f32(
float32_t * pSrc,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4; /* temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read inputs from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
/* negate the input */ (1)
in1 = -in1;
in2 = -in2;
in3 = -in3;
in4 = -in4;
/* store the result to destination */
*pDst = in1;
*(pDst + 1) = in2;
*(pDst + 2) = in3;
*(pDst + 3) = in4;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the results in the destination buffer. */
*pDst++ = -*pSrc++;
/* Decrement the loop counter */
blkCnt--;
}
}
1. Negating a floating-point number is straightforward: apply the unary minus operator to the value.
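Almost every function in this chapter has the same structure: a main loop unrolled by four, followed by a tail loop. It can be sketched in portable C as follows. This is a simplified reference model, not the CMSIS-DSP source. The name negate_f32_ref is hypothetical, and the sketch only aims to show how blockSize >> 2 and blockSize % 4 divide the work, and why passing the same buffer as source and destination is safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model (not the CMSIS-DSP implementation):
 * negates blockSize floats, four per iteration, then handles the tail. */
void negate_f32_ref(const float *pSrc, float *pDst, uint32_t blockSize)
{
    uint32_t blkCnt;

    assert(pSrc != NULL && pDst != NULL);

    /* Main loop: blockSize / 4 iterations, four samples each. */
    for (blkCnt = blockSize >> 2u; blkCnt > 0u; blkCnt--) {
        /* All four inputs are read before any output is written,
         * so pDst may point to the same buffer as pSrc. */
        float in1 = pSrc[0], in2 = pSrc[1], in3 = pSrc[2], in4 = pSrc[3];

        pDst[0] = -in1;
        pDst[1] = -in2;
        pDst[2] = -in3;
        pDst[3] = -in4;
        pSrc += 4;
        pDst += 4;
    }

    /* Tail loop: the remaining blockSize % 4 samples (0 to 3). */
    for (blkCnt = blockSize % 4u; blkCnt > 0u; blkCnt--) {
        *pDst++ = -*pSrc++;
    }
}
```

Calling negate_f32_ref(buf, buf, 6) on a six-element buffer runs the unrolled loop once and the tail loop twice, negating the buffer in place.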
9.1.2 arm_negate_q31
This function negates 32-bit fixed-point (Q31) data. The annotated source code is as follows:
/**
* @brief Negates the elements of a Q31 vector.
* @param *pSrc points to the input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The function uses saturating arithmetic.
* The Q31 value -1 (0x80000000) will be saturated to the maximum allowable positive value 0x7FFFFFFF.
*/
void arm_negate_q31(
q31_t * pSrc,
q31_t * pDst,
uint32_t blockSize)
{
q31_t in; /* Temporary variable */
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the results in the destination buffer. */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;
*pDst++ = __QSUB(0, in1); (2)
*pDst++ = __QSUB(0, in2);
*pDst++ = __QSUB(0, in3);
*pDst++ = __QSUB(0, in4);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the result in the destination buffer. */
in = *pSrc++;
*pDst++ = (in == INT32_MIN) ? INT32_MAX : -in;
/* Decrement the loop counter */
blkCnt--;
}
}
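1. This function uses saturating arithmetic.
The only input whose negation overflows is 0x80000000; its result is saturated to 0x7FFFFFFF.
2. The saturating intrinsic __QSUB was covered in detail in the previous chapter. Here it subtracts the input from 0.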
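The saturation described in the notes above can be reproduced on a PC with a small portable model. This is a sketch under stated assumptions: negate_sat_q31 and negate_q31_ref are hypothetical names, and the code mirrors the Cortex-M0 branch of the listing rather than calling the __QSUB intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable model of __QSUB(0, x) for one Q31 value. */
int32_t negate_sat_q31(int32_t x)
{
    /* -INT32_MIN is not representable in 32 bits, so clamp that one case. */
    return (x == INT32_MIN) ? INT32_MAX : -x;
}

/* Hypothetical reference loop: saturating negation of a Q31 vector. */
void negate_q31_ref(const int32_t *pSrc, int32_t *pDst, uint32_t blockSize)
{
    uint32_t i;

    assert(pSrc != NULL && pDst != NULL);

    for (i = 0u; i < blockSize; i++) {
        pDst[i] = negate_sat_q31(pSrc[i]);
    }
}
```

Every input other than INT32_MIN (0x80000000) negates exactly. Only that value is clamped to INT32_MAX (0x7FFFFFFF).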
9.1.3 arm_negate_q15
This function negates 16-bit fixed-point (Q15) data. The annotated source code is as follows:
/**
* @brief Negates the elements of a Q15 vector.
* @param *pSrc points to the input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* \par Conditions for optimum performance
* Input and output buffers should be aligned by 32-bit
*
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The function uses saturating arithmetic.
* The Q15 value -1 (0x8000) will be saturated to the maximum allowable positive value 0x7FFF.
*/
void arm_negate_q15(
q15_t * pSrc,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q15_t in;
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2; /* Temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = -A */
/* Read two inputs at a time */ (2)
in1 = _SIMD32_OFFSET(pSrc);
in2 = _SIMD32_OFFSET(pSrc + 2);
/* negate two samples at a time */ (3)
in1 = __QSUB16(0, in1);
/* negate two samples at a time */
in2 = __QSUB16(0, in2);
/* store the result to destination 2 samples at a time */ (4)
_SIMD32_OFFSET(pDst) = in1;
/* store the result to destination 2 samples at a time */
_SIMD32_OFFSET(pDst + 2) = in2;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the result in the destination buffer. */
in = *pSrc++;
*pDst++ = (in == (q15_t) 0x8000) ? 0x7fff : -in;
/* Decrement the loop counter */
blkCnt--;
}
}
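1. This function uses saturating arithmetic.
The input 0x8000 is saturated to 0x7FFF.
2. Two Q15 values are read at a time as one 32-bit word.
3. Because __QSUB16 is a SIMD instruction, a single call negates two Q15 values.
4. Two Q15 results are stored with a single 32-bit write.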
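To make the "two Q15 values per call" behaviour concrete, the following sketch models __QSUB16 in portable C as two independent saturating 16-bit subtractions on the low and high halves of a 32-bit word. The function names are hypothetical and the model is only an approximation for illustration. It does not reproduce processor status flags or any other side effects of the real instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Extracts 16-bit lane 0 (low half) or lane 1 (high half) as a signed value. */
int32_t lane16(uint32_t word, unsigned lane)
{
    uint32_t bits;

    assert(lane < 2u);
    bits = (word >> (16u * lane)) & 0xFFFFu;
    return (bits & 0x8000u) ? (int32_t)bits - 65536 : (int32_t)bits;
}

/* Clamps v to the Q15 range [-32768, 32767]. */
int32_t sat16(int32_t v)
{
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

/* Hypothetical model of __QSUB16(a, b): lane-wise saturating a - b. */
uint32_t qsub16_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (uint32_t)sat16(lane16(a, 0u) - lane16(b, 0u)) & 0xFFFFu;
    uint32_t hi = (uint32_t)sat16(lane16(a, 1u) - lane16(b, 1u)) & 0xFFFFu;

    return (hi << 16) | lo;
}
```

With a = 0 this is the negation used in the listing. For the packed input 0x80000001 the high lane (0x8000) saturates to 0x7FFF, and the low lane (1) becomes 0xFFFF, that is, -1.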
9.1.4 arm_negate_q7
This function negates 8-bit fixed-point (Q7) data. The annotated source code is as follows:
/**
* @brief Negates the elements of a Q7 vector.
* @param *pSrc points to the input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The function uses saturating arithmetic.
* The Q7 value -1 (0x80) will be saturated to the maximum allowable positive value 0x7F.
*/
void arm_negate_q7(
q7_t * pSrc,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q7_t in;
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t input; /* Input values1-4 */
q31_t zero = 0x00000000; (2)
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = -A */
/* Read four inputs */
input = *__SIMD32(pSrc)++; (3)
/* Store the Negated results in the destination buffer in a single cycle by packing the results */
*__SIMD32(pDst)++ = __QSUB8(zero, input); (4)
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the results in the destination buffer. */
in = *pSrc++;
*pDst++ = (in == (q7_t) 0x80) ? 0x7f : -in;
/* Decrement the loop counter */
blkCnt--;
}
}
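1. This function uses saturating arithmetic.
The input 0x80 is saturated to 0x7F.
2. The local variable is given an explicit initial value. An uninitialized local variable is not guaranteed to be 0, so in that sense initializing variables is necessary.
3. Four Q7 values are read into input at once.
4. __QSUB8 negates four Q7 values in a single operation.
9.1.5 Example
Purpose:
1. Negate data of all four types.
Procedure:
1. Press the K1 key; the results are printed over the serial port.
Result:
The printed output can be viewed with the serial terminal software SecureCRT (included on the V5 CD).
Program design: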
/*
*********************************************************************************************************
* Function name: DSP_Negate
* Description: Computes negations
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Negate(void)
{
static float32_t pSrc;
static float32_t pDst;
static q31_t pSrc1;
static q31_t pDst1;
static q15_t pSrc2;
static q15_t pDst2;
static q7_t pSrc3 = 127; /* For demonstration, start at 127 so that the increment below yields 0x80; then check that 0x80 saturates to 0x7F */
static q7_t pDst3;
pSrc -= 1.23f;
arm_negate_f32(&pSrc, &pDst, 1);
printf("arm_negate_f32 = %f\r\n", pDst);
pSrc1 -= 1;
arm_negate_q31(&pSrc1, &pDst1, 1);
printf("arm_negate_q31 = %d\r\n", pDst1);
pSrc2 -= 1;
arm_negate_q15(&pSrc2, &pDst2, 1);
printf("arm_negate_q15 = %d\r\n", pDst2);
pSrc3 += 1;
arm_negate_q7(&pSrc3, &pDst3, 1);
printf("arm_negate_q7 = %d\r\n", pDst3);
printf("***********************************\r\n");
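}
9.2 Offset (Vector Offset)
These functions add a constant offset to every element of a vector:
pDst[n] = pSrc[n] + offset, 0 <= n < blockSize.
Note that these functions work in place: the destination pointer and the source pointer may refer to the same buffer.
9.2.1 arm_offset_f32
This function adds an offset to 32-bit floating-point data. The annotated source code is as follows: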
/**
* @brief Adds a constant offset to a floating-point vector.
* @param *pSrc points to the input vector
* @param offset is the offset to be added
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*/
void arm_offset_f32(
float32_t * pSrc,
float32_t offset,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */ (1)
/* Add offset and then store the results in the destination buffer. */
/* read samples from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
/* add offset to input */
in1 = in1 + offset;
/* read samples from source */
in3 = *(pSrc + 2);
/* add offset to input */
in2 = in2 + offset;
/* read samples from source */
in4 = *(pSrc + 3);
/* add offset to input */
in3 = in3 + offset;
/* store result to destination */
*pDst = in1;
/* add offset to input */
in4 = in4 + offset;
/* store result to destination */
*(pDst + 1) = in2;
/* store result to destination */
*(pDst + 2) = in3;
/* store result to destination */
*(pDst + 3) = in4;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (*pSrc++) + offset;
/* Decrement the loop counter */
blkCnt--;
}
}
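1. Adding an offset to floating-point data is straightforward: add the offset to each input and store the result in the destination.
9.2.2 arm_offset_q31
This function adds an offset to 32-bit fixed-point (Q31) data. The annotated source code is as follows: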
/**
* @brief Adds a constant offset to a Q31 vector.
* @param *pSrc points to the input vector
* @param offset is the offset to be added
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range are saturated.
*/
void arm_offset_q31(
q31_t * pSrc,
q31_t offset,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer. */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;
*pDst++ = __QADD(in1, offset); (2)
*pDst++ = __QADD(in2, offset);
*pDst++ = __QADD(in3, offset);
*pDst++ = __QADD(in4, offset);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = __QADD(*pSrc++, offset);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrc++ + offset);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
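1. This function uses saturating arithmetic.
Results outside the Q31 range are saturated to 0x80000000 (most negative) or 0x7FFFFFFF (most positive).
2. The __QADD intrinsic was explained in the previous chapter. Here it performs a saturating addition of its two arguments.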
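The Cortex-M0 branch of the listing avoids overflow by widening to 64 bits before clamping. A minimal portable sketch of that idea follows. The names clip_to_q31 and offset_q31_ref are hypothetical, and the clamp is written in the spirit of clip_q63_to_q31 rather than copied from the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamps a 64-bit intermediate result to the Q31 range. */
int32_t clip_to_q31(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Hypothetical reference loop: saturating offset of a Q31 vector. */
void offset_q31_ref(const int32_t *pSrc, int32_t offset,
                    int32_t *pDst, uint32_t blockSize)
{
    uint32_t i;

    assert(pSrc != NULL && pDst != NULL);

    for (i = 0u; i < blockSize; i++) {
        /* Widen first so that the sum itself cannot overflow. */
        pDst[i] = clip_to_q31((int64_t)pSrc[i] + offset);
    }
}
```

Adding 1 to INT32_MAX therefore stays at INT32_MAX, and adding -1 to INT32_MIN stays at INT32_MIN, instead of wrapping around.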
9.2.3 arm_offset_q15
This function adds an offset to 16-bit fixed-point (Q15) data. The annotated source code is as follows:
/**
* @brief Adds a constant offset to a Q15 vector.
* @param *pSrc points to the input vector
* @param offset is the offset to be added
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range are saturated.
*/
void arm_offset_q15(
q15_t * pSrc,
q15_t offset,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t offset_packed; /* Offset packed to 32 bit */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* Offset is packed to 32 bit in order to use SIMD32 for addition */
offset_packed = __PKHBT(offset, offset, 16); (2)
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer, 2 samples at a time. */
*__SIMD32(pDst)++ = __QADD16(*__SIMD32(pSrc)++, offset_packed); (3)
*__SIMD32(pDst)++ = __QADD16(*__SIMD32(pSrc)++, offset_packed);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer. */
*pDst++ = (q15_t) __QADD16(*pSrc++, offset);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer. */
*pDst++ = (q15_t) __SSAT(((q31_t) * pSrc++ + offset), 16);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
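1. This function uses saturating arithmetic.
Results outside the Q15 range are saturated to 0x8000 (most negative) or 0x7FFF (most positive).
2. __PKHBT packs two copies of the Q15 offset into one 32-bit word so that it can be passed to __QADD16.
3. Because __QADD16 is a SIMD instruction, a single call processes two Q15 values.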
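The packing step and the two-lane addition can be modelled in portable C as shown below. This is an illustrative sketch: pkhbt_model, pack_offset_q15 and qadd16_model are hypothetical names that approximate the documented behaviour of __PKHBT and __QADD16, without the processor flags or side effects of the real instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Model of __PKHBT(bottom, top, shift): keep the low half of `bottom`
 * and the high half of `top << shift`. */
uint32_t pkhbt_model(uint32_t bottom, uint32_t top, unsigned shift)
{
    assert(shift < 32u);
    return (bottom & 0x0000FFFFu) | ((top << shift) & 0xFFFF0000u);
}

/* Duplicates one Q15 offset into both halves of a 32-bit word. */
uint32_t pack_offset_q15(int16_t offset)
{
    uint32_t u = (uint32_t)(int32_t)offset;

    return pkhbt_model(u, u, 16u);
}

/* Interprets the low 16 bits of `bits` as a signed Q15 value. */
int32_t to_signed16(uint32_t bits)
{
    bits &= 0xFFFFu;
    return (bits & 0x8000u) ? (int32_t)bits - 65536 : (int32_t)bits;
}

/* Clamps v to the Q15 range [-32768, 32767]. */
int32_t clamp_q15(int32_t v)
{
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

/* Hypothetical model of __QADD16(a, b): lane-wise saturating a + b. */
uint32_t qadd16_model(uint32_t a, uint32_t b)
{
    int32_t lo = clamp_q15(to_signed16(a) + to_signed16(b));
    int32_t hi = clamp_q15(to_signed16(a >> 16) + to_signed16(b >> 16));

    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}
```

For the packed input 0x7FFF8000 and an offset of 1, the high lane saturates at 0x7FFF while the low lane moves from 0x8000 to 0x8001. Each lane is handled independently.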
9.2.4 arm_offset_q7
This function adds an offset to 8-bit fixed-point (Q7) data. The annotated source code is as follows:
/**
* @brief Adds a constant offset to a Q7 vector.
* @param *pSrc points to the input vector
* @param offset is the offset to be added
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range are saturated.
*/
void arm_offset_q7(
q7_t * pSrc,
q7_t offset,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t offset_packed; /* Offset packed to 32 bit */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* Offset is packed to 32 bit in order to use SIMD32 for addition */ (2)
offset_packed = __PACKq7(offset, offset, offset, offset);
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer for 4 samples at a time. */
*__SIMD32(pDst)++ = __QADD8(*__SIMD32(pSrc)++, offset_packed); (3)
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT(*pSrc++ + offset, 8);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT((q15_t) * pSrc++ + offset, 8);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
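1. This function uses saturating arithmetic.
Results outside the Q7 range are saturated to 0x80 (most negative) or 0x7F (most positive).
2. __PACKq7 packs four Q7 values into one 32-bit word.
3. Because __QADD8 is a SIMD instruction, a single call processes four Q7 values.
9.2.5 Example
Purpose:
1. Add an offset to data of all four types.
Procedure:
1. Press the K2 key; the results are printed over the serial port.
Result:
The printed output can be viewed with the serial terminal software SecureCRT (included on the V5 CD).
Program design: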
/*
*********************************************************************************************************
* Function name: DSP_Offset
* Description: Adds an offset
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Offset(void)
{
static float32_t pSrcA;
static float32_t Offset = 0.0f;
static float32_t pDst;
static q31_t pSrcA1;
static q31_t Offset1 = 0;
static q31_t pDst1;
static q15_t pSrcA2;
static q15_t Offset2 = 0;
static q15_t pDst2;
static q7_t pSrcA3;
static q7_t Offset3 = 0;
static q7_t pDst3;
Offset--;
arm_offset_f32(&pSrcA, Offset, &pDst, 1);
printf("arm_offset_f32 = %f\r\n", pDst);
Offset1--;
arm_offset_q31(&pSrcA1, Offset1, &pDst1, 1);
printf("arm_offset_q31 = %d\r\n", pDst1);
Offset2--;
arm_offset_q15(&pSrcA2, Offset2, &pDst2, 1);
printf("arm_offset_q15 = %d\r\n", pDst2);
Offset3--;
arm_offset_q7(&pSrcA3, Offset3, &pDst3, 1);
printf("arm_offset_q7 = %d\r\n", pDst3);
printf("***********************************\r\n");
}
9.3 Shift (Vector Shift)
These functions shift every element of a vector by a given number of bits:
pDst[n] = pSrc[n] << shift, 0 <= n < blockSize.
Note that these functions work in place: the destination pointer and the source pointer may refer to the same buffer.
9.3.1 arm_shift_q31
This function shifts 32-bit fixed-point (Q31) data. The annotated source code is as follows:
/**
* @brief Shifts the elements of a Q31 vector a specified number of bits.
* @param *pSrc points to the input vector
* @param shiftBits number of bits to shift.
* A positive value shifts left; a negative value shifts right. (1)
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
*
* <b>Scaling and Overflow Behavior:</b> (2)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range will be saturated.
*/
void arm_shift_q31(
q31_t * pSrc,
int8_t shiftBits,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
uint8_t sign = (shiftBits & 0x80); /* Sign of shiftBits */ (3)
#ifndef ARM_MATH_CM0_FAMILY
q31_t in1, in2, in3, in4; /* Temporary input variables */
q31_t out1, out2, out3, out4; /* Temporary output variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
if(sign == 0u) (4)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A<< shiftBits */
/* Shift the input and then store the results in the destination buffer. */
in1 = *pSrc;
in2 = *(pSrc + 1);
out1 = in1 << shiftBits;
in3 = *(pSrc + 2);
out2 = in2 << shiftBits;
in4 = *(pSrc + 3);
if(in1 != (out1 >> shiftBits)) (5)
out1 = 0x7FFFFFFF ^ (in1 >> 31);
if(in2 != (out2 >> shiftBits))
out2 = 0x7FFFFFFF ^ (in2 >> 31);
*pDst = out1;
out3 = in3 << shiftBits;
*(pDst + 1) = out2;
out4 = in4 << shiftBits;
if(in3 != (out3 >> shiftBits))
out3 = 0x7FFFFFFF ^ (in3 >> 31);
if(in4 != (out4 >> shiftBits))
out4 = 0x7FFFFFFF ^ (in4 >> 31);
*(pDst + 2) = out3;
*(pDst + 3) = out4;
/* Update destination pointer to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
}
else (6)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A >>shiftBits */
/* Shift the input and then store the results in the destination buffer. */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
*pDst = (in1 >> -shiftBits); (7)
*(pDst + 1) = (in2 >> -shiftBits);
*(pDst + 2) = (in3 >> -shiftBits);
*(pDst + 3) = (in4 >> -shiftBits);
pSrc += 4u;
pDst += 4u;
blkCnt--;
}
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A (>> or <<) shiftBits */
/* Shift the input and then store the result in the destination buffer. */ (8)
*pDst++ = (sign == 0u) ? clip_q63_to_q31((q63_t) * pSrc++ << shiftBits) :
(*pSrc++ >> -shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}
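1. A positive shiftBits shifts left; a negative shiftBits shifts right.
2. This function uses saturating arithmetic.
Left-shift results outside the Q31 range are saturated to 0x80000000 or 0x7FFFFFFF.
3. Extract the sign bit of shiftBits to determine whether it is positive or negative.
4. If the shift value is positive, shift left.
5. A left shift is accepted only if shifting the result back to the right by the same number of bits recovers the original value. Otherwise the output is saturated to one of two values, selected by the sign of the input (in >> 31 is 0xFFFFFFFF for a negative input and 0x00000000 otherwise):
out = 0x7FFFFFFF ^ 0xFFFFFFFF = 0x80000000
out = 0x7FFFFFFF ^ 0x00000000 = 0x7FFFFFFF
6. If the shift value is negative, shift right.
7. Negate the shift value and shift right by that amount.
8. Processes the remaining samples.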
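The saturation rule in note 5 can be checked with a portable model that widens to 64 bits rather than relying on the shift-back comparison. This is a sketch under stated assumptions: shift_q31_ref is a hypothetical name, and the right-shift branch is written so that it does not depend on how a particular compiler shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-sample model of arm_shift_q31:
 * positive shiftBits shifts left with saturation, negative shifts right. */
int32_t shift_q31_ref(int32_t in, int8_t shiftBits)
{
    if (shiftBits >= 0) {
        int64_t wide;

        assert(shiftBits < 32);
        /* Multiply in 64 bits so that the shifted value cannot overflow. */
        wide = (int64_t)in * ((int64_t)1 << shiftBits);

        /* Same outcome as 0x7FFFFFFF ^ (in >> 31): the saturation value
         * follows the sign of the input. */
        if (wide > INT32_MAX) return INT32_MAX;
        if (wide < INT32_MIN) return INT32_MIN;
        return (int32_t)wide;
    } else {
        int n = -shiftBits;

        assert(n < 32);
        /* Arithmetic right shift, expressed via the complement so that
         * only non-negative values are ever shifted. */
        return (in >= 0) ? (in >> n) : ~((~in) >> n);
    }
}
```

For example, shifting 0x40000000 left by one bit saturates to 0x7FFFFFFF, while shifting -8 right by one bit gives -4.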
9.3.2 arm_shift_q15
This function shifts 16-bit fixed-point (Q15) data. The annotated source code is as follows:
/**
* @brief Shifts the elements of a Q15 vector a specified number of bits.
* @param *pSrc points to the input vector
* @param shiftBits number of bits to shift.
* A positive value shifts left; a negative value shifts right. (1)
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (2)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range will be saturated.
*/
void arm_shift_q15(
q15_t * pSrc,
int8_t shiftBits,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
uint8_t sign; /* Sign of shiftBits */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q15_t in1, in2; /* Temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80); (3)
/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Read 2 inputs */
in1 = *pSrc++;
in2 = *pSrc++;
/* C = A << shiftBits */
/* Shift the inputs and then store the results in the destination buffer. */
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT(__SSAT((in1 << shiftBits), 16),
__SSAT((in2 << shiftBits), 16), 16);
#else
*__SIMD32(pDst)++ = __PKHBT(__SSAT((in2 << shiftBits), 16), (4)
__SSAT((in1 << shiftBits), 16), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
in1 = *pSrc++;
in2 = *pSrc++;
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT(__SSAT((in1 << shiftBits), 16),
__SSAT((in2 << shiftBits), 16), 16);
#else
*__SIMD32(pDst)++ = __PKHBT(__SSAT((in2 << shiftBits), 16),
__SSAT((in1 << shiftBits), 16), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A << shiftBits */
/* Shift and then store the results in the destination buffer. */
*pDst++ = __SSAT((*pSrc++ << shiftBits), 16); (5)
/* Decrement the loop counter */
blkCnt--;
}
}
else (6)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Read 2 inputs */
in1 = *pSrc++;
in2 = *pSrc++;
/* C = A >> shiftBits */
/* Shift the inputs and then store the results in the destination buffer. */
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT((in1 >> -shiftBits),
(in2 >> -shiftBits), 16);
#else
*__SIMD32(pDst)++ = __PKHBT((in2 >> -shiftBits), (7)
(in1 >> -shiftBits), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
in1 = *pSrc++;
in2 = *pSrc++;
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT((in1 >> -shiftBits),
(in2 >> -shiftBits), 16);
#else
*__SIMD32(pDst)++ = __PKHBT((in2 >> -shiftBits),
(in1 >> -shiftBits), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Shift the inputs and then store the results in the destination buffer. */
*pDst++ = (*pSrc++ >> -shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}
#else
/* Run the below code for Cortex-M0 */
/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80);
/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A << shiftBits */
/* Shift and then store the results in the destination buffer. */
*pDst++ = __SSAT(((q31_t) * pSrc++ << shiftBits), 16);
/* Decrement the loop counter */
blkCnt--;
}
}
else
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Shift the inputs and then store the results in the destination buffer. */
*pDst++ = (*pSrc++ >> -shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
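1. A positive shiftBits shifts left; a negative shiftBits shifts right.
2. This function uses saturating arithmetic.
Left-shift results outside the Q15 range are saturated to 0x8000 or 0x7FFF.
3. Determine whether the shift value is positive or negative.
4. __SSAT saturates each shifted value to 16 bits, and a single call to __PKHBT packs the two Q15 results into one 32-bit word so that they are stored with one write.
5. Processes the remaining samples.
6. If the shift value is negative, shift right.
7. After the shift value is negated, each input is shifted right, and a single call to __PKHBT packs the two Q15 results for one 32-bit store.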
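The combination of the sign test on shiftBits and the 16-bit saturation can be sketched for a single sample as follows. The names ssat_model and shift_q15_ref are hypothetical. The first approximates the clamping behaviour of __SSAT, and the model restricts the shift amount to fewer than 16 bits, an assumption made here to keep the intermediate within 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Model of __SSAT(value, bits): clamp to the signed range of `bits` bits. */
int32_t ssat_model(int32_t value, unsigned bits)
{
    int32_t max, min;

    assert(bits >= 1u && bits <= 32u);
    max = (int32_t)(((uint32_t)1 << (bits - 1u)) - 1u);
    min = -max - 1;

    if (value > max) return max;
    if (value < min) return min;
    return value;
}

/* Hypothetical single-sample model of arm_shift_q15. */
int16_t shift_q15_ref(int16_t in, int8_t shiftBits)
{
    if ((shiftBits & 0x80) == 0) {
        /* Sign bit clear: shift left, then saturate to 16 bits. */
        assert(shiftBits < 16);
        return (int16_t)ssat_model((int32_t)in * ((int32_t)1 << shiftBits), 16u);
    } else {
        /* Sign bit set: shift right by the negated amount. */
        int n = -shiftBits;
        int32_t v = in;

        assert(n < 16);
        return (int16_t)((v >= 0) ? (v >> n) : ~((~v) >> n));
    }
}
```

As a worked value, -30618 (the bit pattern 0x8866 read as a signed 16-bit number) shifted right by 3 gives -3828, whose 16-bit pattern is 0xF10C.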
9.3.3 arm_shift_q7
This function shifts 8-bit fixed-point (Q7) data. The annotated source code is as follows:
/**
* @brief Shifts the elements of a Q7 vector a specified number of bits.
* @param *pSrc points to the input vector
* @param shiftBits number of bits to shift.
* A positive value shifts left; a negative value shifts right. (1)
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* \par Conditions for optimum performance
* Input and output buffers should be aligned by 32-bit
*
*
* <b>Scaling and Overflow Behavior:</b> (2)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range will be saturated.
*/
void arm_shift_q7(
q7_t * pSrc,
int8_t shiftBits,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
uint8_t sign; /* Sign of shiftBits */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q7_t in1; /* Input value1 */
q7_t in2; /* Input value2 */
q7_t in3; /* Input value3 */
q7_t in4; /* Input value4 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80); (3)
/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A << shiftBits */
/* Read 4 inputs */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
(4)
/* Store the Shifted result in the destination buffer in single cycle by packing the outputs */
*__SIMD32(pDst)++ = __PACKq7(__SSAT((in1 << shiftBits), 8),
__SSAT((in2 << shiftBits), 8),
__SSAT((in3 << shiftBits), 8),
__SSAT((in4 << shiftBits), 8));
/* Update source pointer to process next samples */
pSrc += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A << shiftBits */ (5)
/* Shift the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT((*pSrc++ << shiftBits), 8);
/* Decrement the loop counter */
blkCnt--;
}
}
else (6)
{
shiftBits = -shiftBits;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Read 4 inputs */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
/* Store the Shifted result in the destination buffer in single cycle by packing the outputs */
*__SIMD32(pDst)++ = __PACKq7((in1 >> shiftBits), (in2 >> shiftBits),
(in3 >> shiftBits), (in4 >> shiftBits));
pSrc += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Shift the input and then store the result in the destination buffer. */
in1 = *pSrc++;
*pDst++ = (in1 >> shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}
#else
/* Run the below code for Cortex-M0 */
/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80);
/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A << shiftBits */
/* Shift the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT(((q15_t) * pSrc++ << shiftBits), 8);
/* Decrement the loop counter */
blkCnt--;
}
}
else
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Shift the input and then store the result in the destination buffer. */
*pDst++ = (*pSrc++ >> -shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
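1. A positive shiftBits shifts left; a negative shiftBits shifts right.
2. This function uses saturating arithmetic.
Left-shift results outside the Q7 range are saturated to 0x80 or 0x7F.
3. Determine whether the shift value is positive or negative.
4. A single call to __PACKq7 packs the four shifted and saturated Q7 results into one 32-bit word so that they are stored with one write.
5. Shifts the remaining samples (fewer than four).
6. If the shift value is negative, shift right.
9.3.4 Example
Purpose:
1. Shift data of the three fixed-point types.
Procedure:
1. Press the K3 key; the results are printed over the serial port.
Result:
The printed output can be viewed with the serial terminal software SecureCRT (included on the V5 CD).
Program design: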
/*
*********************************************************************************************************
* Function name: DSP_Shift
* Description: Shifts data
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Shift(void)
{
static q31_t pSrcA1 = 0x88886666;
static q31_t pDst1;
static q15_t pSrcA2 = 0x8866;
static q15_t pDst2;
static q7_t pSrcA3 = 0x86;
static q7_t pDst3;
arm_shift_q31(&pSrcA1, 3, &pDst1, 1);
printf("arm_shift_q31 = %8x\r\n", pDst1);
arm_shift_q15(&pSrcA2, -3, &pDst2, 1);
printf("arm_shift_q15 = %4x\r\n", pDst2);
arm_shift_q7(&pSrcA3, 3, &pDst3, 1);
printf("arm_shift_q7 = %2x\r\n", pDst3);
printf("***********************************\r\n");
}
9.4 Subtraction (Vector Sub)
These functions subtract one vector from another, element by element:
pDst[n] = pSrcA[n] - pSrcB[n], 0 <= n < blockSize.
9.4.1 arm_sub_f32
This function subtracts 32-bit floating-point data. The annotated source code is as follows:
/**
* @brief Floating-point vector subtraction.
* @param *pSrcA points to the first input vector
* @param *pSrcB points to the second input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in each vector
* @return none.
*/
void arm_sub_f32(
float32_t * pSrcA,
float32_t * pSrcB,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t inA1, inA2, inA3, inA4; /* temporary variables */
float32_t inB1, inB2, inB3, inB4; /* temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer. */
/* Read 4 input samples from sourceA and sourceB */
inA1 = *pSrcA;
inB1 = *pSrcB;
inA2 = *(pSrcA + 1);
inB2 = *(pSrcB + 1);
inA3 = *(pSrcA + 2);
inB3 = *(pSrcB + 2);
inA4 = *(pSrcA + 3);
inB4 = *(pSrcB + 3);
/* dst = srcA - srcB */
/* subtract and store the result */ (1)
*pDst = inA1 - inB1;
*(pDst + 1) = inA2 - inB2;
*(pDst + 2) = inA3 - inB3;
*(pDst + 3) = inA4 - inB4;
/* Update pointers to process next samples */
pSrcA += 4u;
pSrcB += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer. */
*pDst++ = (*pSrcA++) - (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
}
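1. Floating-point subtraction is simple: just subtract one value from the other.
9.4.2 arm_sub_q31
This function subtracts 32-bit fixed-point values. The annotated source is: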
/**
* @brief Q31 vector subtraction.
* @param *pSrcA points to the first input vector
* @param *pSrcB points to the second input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in each vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range will be saturated.
*/
void arm_sub_q31(
q31_t * pSrcA,
q31_t * pSrcB,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4;
q31_t inB1, inB2, inB3, inB4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
*pDst++ = __QSUB(inA1, inB1); (2)
*pDst++ = __QSUB(inA2, inB2);
*pDst++ = __QSUB(inA3, inB3);
*pDst++ = __QSUB(inA4, inB4);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = __QSUB(*pSrcA++, *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrcA++ - *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
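1. This function uses saturating arithmetic.
Results outside the Q31 range [0x80000000, 0x7FFFFFFF] are saturated.
2. __QSUB is a saturating-arithmetic instruction rather than a SIMD one, since it operates on a single pair of 32-bit values. Here it performs the saturating subtraction of two Q31 values.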
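To make the saturation concrete, the sketch below models what a saturating Q31 subtraction returns, using the same widen-then-clip idea as the Cortex-M0 branch of the listing. The name qsub_q31_ref is hypothetical; this is a portable stand-in for experimenting on a PC, not the intrinsic itself.

```c
#include <stdint.h>

/* Hypothetical portable model of a saturating Q31 subtraction.
 * The difference is computed in 64 bits so it cannot overflow,
 * then clipped back into the 32-bit range. */
static int32_t qsub_q31_ref(int32_t a, int32_t b)
{
    int64_t diff = (int64_t)a - (int64_t)b;

    if (diff > INT32_MAX) return INT32_MAX;   /* 0x7FFFFFFF */
    if (diff < INT32_MIN) return INT32_MIN;   /* 0x80000000 */
    return (int32_t)diff;
}
```

For example, subtracting 1 from 0x80000000 stays at 0x80000000 instead of wrapping around to 0x7FFFFFFF.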
9.4.3 arm_sub_q15
This function subtracts 16-bit fixed-point values. The annotated source is:
/**
* @brief Q15 vector subtraction.
* @param *pSrcA points to the first input vector
* @param *pSrcB points to the second input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in each vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b>
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range will be saturated.
*/
void arm_sub_q15(
q15_t * pSrcA,
q15_t * pSrcB,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2;
q31_t inB1, inB2;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer two samples at a time. */
inA1 = *__SIMD32(pSrcA)++; (1)
inA2 = *__SIMD32(pSrcA)++;
inB1 = *__SIMD32(pSrcB)++;
inB2 = *__SIMD32(pSrcB)++;
*__SIMD32(pDst)++ = __QSUB16(inA1, inB1); (2)
*__SIMD32(pDst)++ = __QSUB16(inA2, inB2);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q15_t) __QSUB16(*pSrcA++, *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q15_t) __SSAT(((q31_t) * pSrcA++ - *pSrcB++), 16);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
1. Each 32-bit read fetches two Q15 values at once.
2. Because __QSUB16 is a SIMD instruction, a single call performs two subtractions.
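The following sketch models the "two subtractions per call" behaviour on a 32-bit word holding two Q15 lanes. The names qsub16_ref and sat_q15 are hypothetical, and the code assumes the usual two's-complement conversion when narrowing to int16_t (as on GCC and Clang); it is meant for checking values on a PC, not as a replacement for the intrinsic.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 32-bit value to the Q15 range. */
static int16_t sat_q15(int32_t v)
{
    if (v > 32767)  return 32767;    /* 0x7FFF */
    if (v < -32768) return -32768;   /* 0x8000 */
    return (int16_t)v;
}

/* Hypothetical model of a dual 16-bit saturating subtract:
 * the low and high halfwords are subtracted independently. */
static uint32_t qsub16_ref(uint32_t a, uint32_t b)
{
    int16_t lo = sat_q15((int32_t)(int16_t)(a & 0xFFFFu) - (int32_t)(int16_t)(b & 0xFFFFu));
    int16_t hi = sat_q15((int32_t)(int16_t)(a >> 16) - (int32_t)(int16_t)(b >> 16));

    /* Repack the two lanes into one 32-bit word. */
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}
```

Each lane saturates on its own: an overflow in the high halfword does not disturb the low halfword.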
9.4.4 arm_sub_q7
This function subtracts 8-bit fixed-point values. The annotated source is:
/**
* @brief Q7 vector subtraction.
* @param *pSrcA points to the first input vector
* @param *pSrcB points to the second input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in each vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b>
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range will be saturated.
*/
void arm_sub_q7(
q7_t * pSrcA,
q7_t * pSrcB,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer 4 samples at a time. */
*__SIMD32(pDst)++ = __QSUB8(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++); (1)
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = __SSAT(*pSrcA++ - *pSrcB++, 8);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT((q15_t) * pSrcA++ - *pSrcB++, 8);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
1. __QSUB8 is also a SIMD instruction; a single call subtracts four Q7 values.
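As a rough illustration of the four-lane behaviour, the sketch below walks over the four bytes of a 32-bit word and subtracts each pair with saturation. The name qsub8_ref is hypothetical, and the narrowing conversion to int8_t is assumed to be two's-complement (as on GCC and Clang).

```c
#include <stdint.h>

/* Hypothetical model of a quad 8-bit saturating subtract:
 * each byte lane of a and b is treated as an independent Q7 value. */
static uint32_t qsub8_ref(uint32_t a, uint32_t b)
{
    uint32_t out = 0;

    for (int lane = 0; lane < 4; lane++) {
        int32_t x = (int8_t)((a >> (8 * lane)) & 0xFFu);
        int32_t y = (int8_t)((b >> (8 * lane)) & 0xFFu);
        int32_t d = x - y;

        if (d > 127)  d = 127;     /* saturate to 0x7F */
        if (d < -128) d = -128;    /* saturate to 0x80 */

        out |= (uint32_t)(uint8_t)d << (8 * lane);
    }
    return out;
}
```

With 0x71 and 0x7F in the top lane the difference is -14 (0xF2), well inside the Q7 range, so no saturation takes place in that lane.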
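9.4.5 Worked example
Purpose:
1. Subtract data of four types.
Procedure:
1. Press the UP key; the result is printed over the serial port.
Result:
The output, viewed on the PC in the serial terminal program SecureCRT (included on the V5 CD), is as follows:
Program design: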
/*
*********************************************************************************************************
* Function   : DSP_Sub
* Description: Subtraction
* Parameters : none
* Returns    : none
*********************************************************************************************************
*/
static void DSP_Sub(void)
{
    static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
    static float32_t pSrcB[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
    static float32_t pDst[5];
    static q31_t pSrcA1[5] = {1,1,1,1,1};
    static q31_t pSrcB1[5] = {1,1,1,1,1};
    static q31_t pDst1[5];
    static q15_t pSrcA2[5] = {1,1,1,1,1};
    static q15_t pSrcB2[5] = {1,1,1,1,1};
    static q15_t pDst2[5];
    static q7_t pSrcA3[5] = {0x70,1,1,1,1};
    static q7_t pSrcB3[5] = {0x7f,1,1,1,1};
    static q7_t pDst3[5];
    /* Only the first element is modified and printed in each case. */
    pSrcA[0] += 1.1f;
    arm_sub_f32(pSrcA, pSrcB, pDst, 5);
    printf("arm_sub_f32 = %f\r\n", pDst[0]);
    pSrcA1[0] += 1;
    arm_sub_q31(pSrcA1, pSrcB1, pDst1, 5);
    printf("arm_sub_q31 = %d\r\n", pDst1[0]);
    pSrcA2[0] += 1;
    arm_sub_q15(pSrcA2, pSrcB2, pDst2, 5);
    printf("arm_sub_q15 = %d\r\n", pDst2[0]);
    pSrcA3[0] += 1;
    arm_sub_q7(pSrcA3, pSrcB3, pDst3, 5);
    printf("arm_sub_q7 = %d\r\n", pDst3[0]);
    printf("***********************************\r\n");
}
9.5 Scale factor (Vector Scale)
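These functions scale data up or down. For floating-point data the formula is:
pDst[n] = pSrc[n] * scale, 0 <= n < blockSize.
For Q31, Q15, and Q7 data the formula is:
pDst[n] = (pSrc[n] * scaleFract) << shift, 0 <= n < blockSize.
In this case the overall scale factor is:
scale = scaleFract * 2^shift.
Note that these functions support the destination and source pointers referring to the same buffer.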
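As a worked example of scale = scaleFract * 2^shift, a scale of 2.5 cannot be stored directly in Q31 (which only covers values below 1.0), but it can be split into scaleFract = 0.625 and shift = 2. The sketch below does this split on a PC. The type q31_scale_t and the function split_scale_q31 are hypothetical names introduced only for illustration, and the code assumes a positive scale of moderate size.

```c
#include <stdint.h>

/* Hypothetical result type: a Q31 fraction plus a left-shift count. */
typedef struct {
    int32_t scale_fract;   /* fractional part in Q31 */
    int8_t  shift;         /* number of bits to shift left afterwards */
} q31_scale_t;

/* Hypothetical helper: split a positive scale into scaleFract * 2^shift
 * with scaleFract < 1.0 so that it fits in Q31. Assumes scale > 0. */
static q31_scale_t split_scale_q31(double scale)
{
    q31_scale_t r = {0, 0};
    double frac = scale;

    /* Halve the value until it drops below 1.0, counting the halvings. */
    while (frac >= 1.0) {
        frac *= 0.5;
        r.shift++;
    }

    /* Convert the fraction to Q31 by multiplying by 2^31. */
    r.scale_fract = (int32_t)(frac * 2147483648.0);
    return r;
}
```

For scale = 2.5 this yields scale_fract = 0x50000000 (0.625 in Q31) and shift = 2. Values already below 1.0 are left with shift = 0.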
9.5.1 arm_scale_f32
This function scales 32-bit floating-point values. The annotated source is:
/**
* @brief Multiplies a floating-point vector by a scalar.
* @param *pSrc points to the input vector
* @param scale scale factor to be applied
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*/
void arm_scale_f32(
float32_t * pSrc,
float32_t scale,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4; /* temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the results in the destination buffer. */
/* read input samples from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
/* multiply with scaling factor */ (1)
in1 = in1 * scale;
/* read input sample from source */
in3 = *(pSrc + 2);
/* multiply with scaling factor */
in2 = in2 * scale;
/* read input sample from source */
in4 = *(pSrc + 3);
/* multiply with scaling factor */
in3 = in3 * scale;
in4 = in4 * scale;
/* store the result to destination */
*pDst = in1;
*(pDst + 1) = in2;
*(pDst + 2) = in3;
*(pDst + 3) = in4;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (*pSrc++) * scale;
/* Decrement the loop counter */
blkCnt--;
}
}
1. Scaling floating-point data is simple: multiply each source value by the scale factor.
9.5.2 arm_scale_q31
This function scales 32-bit fixed-point values. The annotated source is:
/**
* @brief Multiplies a Q31 vector by a scalar.
* @param *pSrc points to the input vector
* @param scaleFract fractional portion of the scale value
* @param shift number of bits to shift the result by
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The input data <code>*pSrc</code> and <code>scaleFract</code> are in 1.31 format.
* These are multiplied to yield a 2.62 intermediate result and this is shifted with saturation to 1.31 format.
*/
void arm_scale_q31(
q31_t * pSrc,
q31_t scaleFract,
int8_t shift,
q31_t * pDst,
uint32_t blockSize)
{
int8_t kShift = shift + 1; /* Shift to apply after scaling */ (2)
int8_t sign = (kShift & 0x80);
uint32_t blkCnt; /* loop counter */
q31_t in, out;
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4; /* temporary input variables */
q31_t out1, out2, out3, out4; /* temporary output variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
if(sign == 0u) (3)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read four inputs from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
/* multiply input with scalar value */ (4)
in1 = ((q63_t) in1 * scaleFract) >> 32;
in2 = ((q63_t) in2 * scaleFract) >> 32;
in3 = ((q63_t) in3 * scaleFract) >> 32;
in4 = ((q63_t) in4 * scaleFract) >> 32;
/* apply shifting */
out1 = in1 << kShift;
out2 = in2 << kShift;
/* saturate the results. */
if(in1 != (out1 >> kShift)) (5)
out1 = 0x7FFFFFFF ^ (in1 >> 31);
if(in2 != (out2 >> kShift))
out2 = 0x7FFFFFFF ^ (in2 >> 31);
out3 = in3 << kShift;
out4 = in4 << kShift;
*pDst = out1;
*(pDst + 1) = out2;
if(in3 != (out3 >> kShift))
out3 = 0x7FFFFFFF ^ (in3 >> 31);
if(in4 != (out4 >> kShift))
out4 = 0x7FFFFFFF ^ (in4 >> 31);
/* Store result destination */
*(pDst + 2) = out3;
*(pDst + 3) = out4;
/* Update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
}
else {
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read four inputs from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
/* multiply input with scalar value */
in1 = ((q63_t) in1 * scaleFract) >> 32;
in2 = ((q63_t) in2 * scaleFract) >> 32;
in3 = ((q63_t) in3 * scaleFract) >> 32;
in4 = ((q63_t) in4 * scaleFract) >> 32;
/* apply shifting */ (6)
out1 = in1 >> -kShift;
out2 = in2 >> -kShift;
out3 = in3 >> -kShift;
out4 = in4 >> -kShift;
/* Store result destination */
*pDst = out1;
*(pDst + 1) = out2;
*(pDst + 2) = out3;
*(pDst + 3) = out4;
/* Update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
if(sign == 0)
{
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
in = *pSrc++;
in = ((q63_t) in * scaleFract) >> 32;
out = in << kShift;
if(in != (out >> kShift))
out = 0x7FFFFFFF ^ (in >> 31);
*pDst++ = out;
/* Decrement the loop counter */
blkCnt--;
}
}
else
{
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
in = *pSrc++;
in = ((q63_t) in * scaleFract) >> 32;
out = in >> -kShift;
*pDst++ = out;
/* Decrement the loop counter */
blkCnt--;
}
}
}
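1. The source data and the scale factor are both in Q31 format, so their product is in 1.31 * 1.31 = 2.62 format. Because the output is also Q31, the product is shifted right by 32 bits and the result is saturated.
2. The extra 1 compensates for the right shift by 32: a Q31 * Q31 product has 62 fractional bits, shifting right by 32 leaves 30, and a Q31 result needs 31, so one additional left shift is folded into kShift.
3. If kShift is non-negative (sign bit clear), the result is shifted left; otherwise it is shifted right.
4. The product of the source value and the scale factor is shifted right by 32 bits, keeping the upper 32 bits of the 64-bit product; the shift by kShift that follows brings the result to Q31 format.
5. This is the saturation handling: if shifting the result back does not reproduce the input, the left shift overflowed and the output is forced to 0x7FFFFFFF or 0x80000000 according to the sign.
6. A right shift cannot overflow, so no saturation is needed; the value is simply shifted right by -kShift.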
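The steps in the notes above can be followed on a PC with a single-sample reference model. This is a sketch under stated assumptions, not the library code: scale_q31_ref is a hypothetical name, shift is assumed to stay within roughly -31..30, and right shifts of negative values are assumed to be arithmetic (as on GCC and Clang). Saturation is done by widening to 64 bits rather than by the shift-back comparison used in the listing.

```c
#include <stdint.h>

/* Hypothetical single-sample model of a Q31 scale operation. */
static int32_t scale_q31_ref(int32_t in, int32_t scale_fract, int8_t shift)
{
    /* +1 converts the 2.30 intermediate back to 1.31. */
    int k_shift = shift + 1;

    /* 1.31 * 1.31 = 2.62; keeping the upper 32 bits gives 2.30. */
    int32_t prod = (int32_t)(((int64_t)in * scale_fract) >> 32);

    if (k_shift >= 0) {
        /* Left shift in 64 bits, then saturate to the Q31 range. */
        int64_t wide = (int64_t)prod * ((int64_t)1 << k_shift);
        if (wide > INT32_MAX) return INT32_MAX;
        if (wide < INT32_MIN) return INT32_MIN;
        return (int32_t)wide;
    }

    /* A right shift cannot overflow, so no saturation is required. */
    return prod >> -k_shift;
}
```

For instance, 0.5 * 0.5 with shift = 0 gives 0x20000000 (0.25), while the same inputs with shift = 2 would represent 1.0, which does not fit in Q31 and saturates to 0x7FFFFFFF.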
9.5.3 arm_scale_q15
This function scales 16-bit fixed-point values. The annotated source is:
/**
* @brief Multiplies a Q15 vector by a scalar.
* @param *pSrc points to the input vector
* @param scaleFract fractional portion of the scale value
* @param shift number of bits to shift the result by
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The input data <code>*pSrc</code> and <code>scaleFract</code> are in 1.15 format.
* These are multiplied to yield a 2.30 intermediate result and this is shifted with saturation to 1.15 format.
*/
void arm_scale_q15(
q15_t * pSrc,
q15_t scaleFract,
int8_t shift,
q15_t * pDst,
uint32_t blockSize)
{
int8_t kShift = 15 - shift; /* shift to apply after scaling */ (2)
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q15_t in1, in2, in3, in4;
q31_t inA1, inA2; /* Temporary variables */
q31_t out1, out2, out3, out4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Reading 2 inputs from memory */
inA1 = *__SIMD32(pSrc)++; (3)
inA2 = *__SIMD32(pSrc)++;
/* C = A * scale */
/* Scale the inputs and then store the 2 results in the destination buffer
* in single cycle by packing the outputs */
out1 = (q31_t) ((q15_t) (inA1 >> 16) * scaleFract); (4)
out2 = (q31_t) ((q15_t) inA1 * scaleFract);
out3 = (q31_t) ((q15_t) (inA2 >> 16) * scaleFract);
out4 = (q31_t) ((q15_t) inA2 * scaleFract);
/* apply shifting */
out1 = out1 >> kShift;
out2 = out2 >> kShift;
out3 = out3 >> kShift;
out4 = out4 >> kShift;
/* saturate the output */
in1 = (q15_t) (__SSAT(out1, 16)); (5)
in2 = (q15_t) (__SSAT(out2, 16));
in3 = (q15_t) (__SSAT(out3, 16));
in4 = (q15_t) (__SSAT(out4, 16));
/* store the result to destination */ (6)
*__SIMD32(pDst)++ = __PKHBT(in2, in1, 16);
*__SIMD32(pDst)++ = __PKHBT(in4, in3, 16);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q15_t) (__SSAT(((*pSrc++) * scaleFract) >> kShift, 16));
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q15_t) (__SSAT(((q31_t) * pSrc++ * scaleFract) >> kShift, 16));
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
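1. The source data and the scale factor are both in Q15 format, so their product is in 1.15 * 1.15 = 2.30 format. Because the output is also Q15, the result must be shifted and saturated.
2. This variable is a neat trick: with kShift = 15 - shift, a positive shift (left) and a negative shift (right) are both handled by one right shift, as long as shift does not exceed 15.
3. Each 32-bit read fetches two Q15 values.
4. Each source value is multiplied by the scale factor and the product is stored in a Q31 variable.
5. Saturate the output.
6. A single __PKHBT call packs two Q15 results into one 32-bit word, which is then written to the destination.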
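To see why one right shift by kShift = 15 - shift covers both directions, the sketch below models a single Q15 sample. scale_q15_ref is a hypothetical name; the code assumes shift <= 15 and an arithmetic right shift for negative values (as on GCC and Clang).

```c
#include <stdint.h>

/* Hypothetical single-sample model of a Q15 scale operation. */
static int16_t scale_q15_ref(int16_t in, int16_t scale_fract, int8_t shift)
{
    /* Converting 2.30 back to 1.15 needs >> 15; a left shift by
     * 'shift' simply reduces that amount. Assumes shift <= 15. */
    int k_shift = 15 - shift;

    /* 1.15 * 1.15 = 2.30, held in 32 bits. */
    int32_t prod = (int32_t)in * scale_fract;
    int32_t out  = prod >> k_shift;

    /* Saturate to the Q15 range. */
    if (out > 32767)  return 32767;
    if (out < -32768) return -32768;
    return (int16_t)out;
}
```

With in = 0x4000 and scaleFract = 0x4000 (both 0.5), shift = 0 yields 0x2000 (0.25), shift = 1 yields 0x4000, and shift = 2 saturates to 0x7FFF.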
9.5.4 arm_scale_q7
This function scales 8-bit fixed-point values. The annotated source is:
/**
* @brief Multiplies a Q7 vector by a scalar.
* @param *pSrc points to the input vector
* @param scaleFract fractional portion of the scale value
* @param shift number of bits to shift the result by
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* <b>Scaling and Overflow Behavior:</b> (1)
* \par
* The input data <code>*pSrc</code> and <code>scaleFract</code> are in 1.7 format.
* These are multiplied to yield a 2.14 intermediate result and this is shifted with saturation to 1.7 format.
*/
void arm_scale_q7(
q7_t * pSrc,
q7_t scaleFract,
int8_t shift,
q7_t * pDst,
uint32_t blockSize)
{
int8_t kShift = 7 - shift; /* shift to apply after scaling */ (2)
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q7_t in1, in2, in3, in4, out1, out2, out3, out4; /* Temporary variables to store input & output */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Reading 4 inputs from memory */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;
/* C = A * scale */
/* Scale the inputs and then store the results in the temporary variables. */
out1 = (q7_t) (__SSAT(((in1) * scaleFract) >> kShift, 8)); (3)
out2 = (q7_t) (__SSAT(((in2) * scaleFract) >> kShift, 8));
out3 = (q7_t) (__SSAT(((in3) * scaleFract) >> kShift, 8));
out4 = (q7_t) (__SSAT(((in4) * scaleFract) >> kShift, 8));
/* Packing the individual outputs into 32bit and storing in
* destination buffer in single write */
*__SIMD32(pDst)++ = __PACKq7(out1, out2, out3, out4); (4)
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) (__SSAT(((*pSrc++) * scaleFract) >> kShift, 8));
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) (__SSAT((((q15_t) * pSrc++ * scaleFract) >> kShift), 8));
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
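1. The source data and the scale factor are both in Q7 format, so their product is in 1.7 * 1.7 = 2.14 format. Because the output is also Q7, the result must be shifted and saturated.
2. This variable is a neat trick: with kShift = 7 - shift, a positive shift (left) and a negative shift (right) are both handled by one right shift.
3. The product of the source value and the scale factor is shifted and then saturated to 8 bits.
4. __PACKq7 packs the four Q7 results into one 32-bit word so that they are stored with a single write.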
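The packing step can be pictured with the sketch below, which combines four Q7 values into one 32-bit word. pack_q7x4 is a hypothetical name, and the byte order shown (first value in the lowest byte) is an assumption that matches a little-endian target; it is not copied from the library macro.

```c
#include <stdint.h>

/* Hypothetical model of packing four Q7 values into a 32-bit word.
 * Assumption: little-endian layout, so v0 ends up at the lowest address. */
static uint32_t pack_q7x4(int8_t v0, int8_t v1, int8_t v2, int8_t v3)
{
    return  (uint32_t)(uint8_t)v0
         | ((uint32_t)(uint8_t)v1 << 8)
         | ((uint32_t)(uint8_t)v2 << 16)
         | ((uint32_t)(uint8_t)v3 << 24);
}
```

Storing the packed word once replaces four separate byte writes, which is the point of the unrolled loop in the listing.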
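9.5.5 Worked example
Purpose:
1. Scale data of four types.
Procedure:
1. Press the DOWN key; the result is printed over the serial port.
Result:
The output, viewed on the PC in the serial terminal program SecureCRT (included on the V5 CD), is as follows:
Program design: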
/*
*********************************************************************************************************
* Function   : DSP_Scale
* Description: Scale factor
* Parameters : none
* Returns    : none
*********************************************************************************************************
*/
static void DSP_Scale(void)
{
    static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
    static float32_t scale = 0.0f;
    static float32_t pDst[5];
    static q31_t pSrcA1[5] = {0x6fffffff,1,1,1,1};
    static q31_t scale1 = 0x6fffffff;
    static q31_t pDst1[5];
    static q15_t pSrcA2[5] = {0x6fff,1,1,1,1};
    static q15_t scale2 = 0x6fff;
    static q15_t pDst2[5];
    static q7_t pSrcA3[5] = {0x70,1,1,1,1};
    static q7_t scale3 = 0x6f;
    static q7_t pDst3[5];
    /* Only the first element of each result is printed. */
    scale += 0.1f;
    arm_scale_f32(pSrcA, scale, pDst, 5);
    printf("arm_scale_f32 = %f\r\n", pDst[0]);
    scale1 += 1;
    arm_scale_q31(pSrcA1, scale1, 0, pDst1, 5);
    printf("arm_scale_q31 = %x\r\n", pDst1[0]);
    scale2 += 1;
    arm_scale_q15(pSrcA2, scale2, 0, pDst2, 5);
    printf("arm_scale_q15 = %x\r\n", pDst2[0]);
    scale3 += 1;
    arm_scale_q7(pSrcA3, scale3, 0, pDst3, 5);
    printf("arm_scale_q7 = %x\r\n", pDst3[0]);
    printf("***********************************\r\n");
}
9.6 Important notes on BasicMathFunctions
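This completes the walkthrough of the BasicMathFunctions. You may have noticed some features these functions share. We touched on them briefly in Chapter 8 and expand on them here:
- Nearly all of these functions are reentrant.
- Nearly every function comes in four data types: F32, Q31, Q15, and Q7.
- Most functions process samples in groups of four. This lets the F32, Q31, Q15, and Q7 versions share one program structure, which simplifies maintenance. More importantly, it lets the Q15 and Q7 versions make good use of SIMD instructions, since a group of four maps onto SIMD operations that handle two Q15 values or four Q7 values at a time.
- Some functions support the destination and source pointers referring to the same buffer.
The earlier sections gave no example of this last point, so here is a simple one based on the scale functions from section 9.5: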
static void DSP_Scale(void)
{
    static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
    static float32_t scale = 0.0f;
    static q31_t pSrcA1[5] = {0x6fffffff,1,1,1,1};
    static q31_t scale1 = 0x6fffffff;
    static q15_t pSrcA2[5] = {0x6fff,1,1,1,1};
    static q15_t scale2 = 0x6fff;
    static q7_t pSrcA3[5] = {0x70,1,1,1,1};
    static q7_t scale3 = 0x6f;
    scale += 0.1f;
    arm_scale_f32(pSrcA, scale, pSrcA, 5); (1)
    printf("arm_scale_f32 = %f\r\n", pSrcA[0]);
    scale1 += 1;
    arm_scale_q31(pSrcA1, scale1, 0, pSrcA1, 5); (2)
    printf("arm_scale_q31 = %x\r\n", pSrcA1[0]);
    scale2 += 1;
    arm_scale_q15(pSrcA2, scale2, 0, pSrcA2, 5); (3)
    printf("arm_scale_q15 = %x\r\n", pSrcA2[0]);
    scale3 += 1;
    arm_scale_q7(pSrcA3, scale3, 0, pSrcA3, 5); (4)
    printf("arm_scale_q7 = %x\r\n", pSrcA3[0]);
    printf("***********************************\r\n");
}
In calls (1) to (4) above, the destination pointer and the source pointer refer to the same buffer.
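In-place operation is safe for this kind of function because each output depends only on the input at the same index, and that input is read before the output is written. The sketch below shows the idea with a plain C loop that can be run on a PC. scale_f32_ref and scale_in_place_demo are hypothetical names used only for illustration; they are not the CMSIS-DSP implementation.

```c
#include <stddef.h>

/* Hypothetical reference model of an element-wise scale routine. */
static void scale_f32_ref(const float *src, float scale, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* src[i] is read before dst[i] is written, so dst may alias src. */
        dst[i] = src[i] * scale;
    }
}

/* Scales a small buffer in place and returns the sum of the results. */
static float scale_in_place_demo(void)
{
    float buf[4] = {1.0f, 2.0f, 3.0f, 4.0f};

    scale_f32_ref(buf, 2.0f, buf, 4);   /* destination == source */

    return buf[0] + buf[1] + buf[2] + buf[3];
}
```

The same reasoning would not hold for a routine whose output at index n depends on inputs at other indices, which is why only some functions support identical source and destination buffers.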
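9.7 Summary
That concludes the BasicMathFunctions. Beginners are encouraged to practice with them and to use them in their own projects, where they can save considerable effort.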