[安富莱 DSP Tutorial] Chapter 9: Using BasicMathFunctions (Part 2)

席萌0209 posted on 2015-3-19 10:30:13

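Chapter 9: Using BasicMathFunctions (Part 2)

This installment covers five groups of basic math functions: negate, offset, shift, subtraction, and scale.
9.1 Negate (Vector Negate)
9.2 Offset (Vector Offset)
9.3 Shift (Vector Shift)
9.4 Subtraction (Vector Sub)
9.5 Scale (Vector Scale)
9.6 Important notes on BasicMathFunctions
9.7 Summary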

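9.1 Negate (Vector Negate)

These functions negate each element of a vector:
 pDst[n] = -pSrc[n], 0 <= n < blockSize.
Note that these functions work in place: the destination pointer and the source pointer may refer to the same buffer.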
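
The formula and the in-place property can be illustrated with a plain reference loop. This is only a minimal sketch for a host compiler; the name ref_negate_f32 is hypothetical and is not part of CMSIS-DSP.

```c
#include <stdint.h>

/* Hypothetical reference model of pDst[n] = -pSrc[n]; not the CMSIS-DSP source. */
static void ref_negate_f32(const float *pSrc, float *pDst, uint32_t blockSize)
{
    for (uint32_t n = 0; n < blockSize; n++) {
        /* Each output depends only on the input at the same index,
           so pSrc and pDst may point to the same buffer. */
        pDst[n] = -pSrc[n];
    }
}
```

Calling ref_negate_f32(buf, buf, 3) on a three-element array negates it in place, which mirrors the aliasing allowed by the library functions.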

9.1.1 arm_negate_f32

This function negates 32-bit floating-point values. The annotated source is as follows:
/**
* @brief Negates the elements of a floating-point vector.
* @param *pSrc points to the input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*/

void arm_negate_f32(
float32_t * pSrc,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4; /* temporary variables */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read inputs from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);

/* negate the input */ (1)
in1 = -in1;
in2 = -in2;
in3 = -in3;
in4 = -in4;

/* store the result to destination */
*pDst = in1;
*(pDst + 1) = in2;
*(pDst + 2) = in3;
*(pDst + 3) = in4;

/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the results in the destination buffer. */
*pDst++ = -*pSrc++;

/* Decrement the loop counter */
blkCnt--;
}
}
1. Negating a floating-point value is straightforward: the unary minus operator is applied to each variable.

9.1.2 arm_negate_q31

This function negates 32-bit fixed-point (Q31) values. The annotated source is as follows:
/**
* @brief Negates the elements of a Q31 vector.
* @param *pSrc points to the input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* The Q31 value -1 (0x80000000) will be saturated to the maximum allowable positive value 0x7FFFFFFF.
*/

void arm_negate_q31(
q31_t * pSrc,
q31_t * pDst,
uint32_t blockSize)
{
q31_t in; /* Temporary variable */
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4;

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the results in the destination buffer. */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;

*pDst++ = __QSUB(0, in1); (2)
*pDst++ = __QSUB(0, in2);
*pDst++ = __QSUB(0, in3);
*pDst++ = __QSUB(0, in4);

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the result in the destination buffer. */
in = *pSrc++;
*pDst++ = (in == INT32_MIN) ? INT32_MAX : -in;

/* Decrement the loop counter */
blkCnt--;
}
}
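1. This function uses saturating arithmetic.
The value 0x80000000 (the most negative Q31 value) saturates to 0x7FFFFFFF, because its true negation is not representable.
2. The saturating intrinsic __QSUB was covered in detail in the previous chapter; here it computes 0 minus the argument.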
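
The saturating behavior described in the notes can be checked on a host machine with a small model. This is a sketch only; sat_negate_q31 is a hypothetical helper that follows the same rule as the tail loop in the listing, not a CMSIS-DSP function.

```c
#include <stdint.h>

/* Hypothetical model of the saturating negate applied to each Q31 sample. */
static int32_t sat_negate_q31(int32_t in)
{
    /* -INT32_MIN does not fit in 32 bits, so it is clamped to INT32_MAX. */
    return (in == INT32_MIN) ? INT32_MAX : -in;
}
```

Every input other than INT32_MIN is negated exactly; only that single value is clamped.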

9.1.3 arm_negate_q15

This function negates 16-bit fixed-point (Q15) values. The annotated source is as follows:
/**
* @brief Negates the elements of a Q15 vector.
* @param *pSrc points to the input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* \par Conditions for optimum performance
*Input and output buffers should be aligned by 32-bit
*
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* The Q15 value -1 (0x8000) will be saturated to the maximum allowable positive value 0x7FFF.
*/

void arm_negate_q15(
q15_t * pSrc,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q15_t in;

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */

q31_t in1, in2; /* Temporary variables */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = -A */
/* Read two inputs at a time */ (2)
in1 = _SIMD32_OFFSET(pSrc);
in2 = _SIMD32_OFFSET(pSrc + 2);

/* negate two samples at a time */ (3)
in1 = __QSUB16(0, in1);

/* negate two samples at a time */
in2 = __QSUB16(0, in2);

/* store the result to destination 2 samples at a time */ (4)
_SIMD32_OFFSET(pDst) = in1;
/* store the result to destination 2 samples at a time */
_SIMD32_OFFSET(pDst + 2) = in2;

/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the result in the destination buffer. */
in = *pSrc++;
*pDst++ = (in == (q15_t) 0x8000) ? 0x7fff : -in;

/* Decrement the loop counter */
blkCnt--;
}
}
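1. This function uses saturating arithmetic.
The value 0x8000 saturates to 0x7FFF.
2. Two Q15 values are read at once as a single 32-bit word.
3. Because __QSUB16 is a SIMD instruction, one call negates two Q15 values.
4. Two Q15 values are stored at once.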
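
To make the two-lane behavior concrete, the following sketch models a dual 16-bit saturating subtract in portable C. The names clip_q15, half_to_signed and qsub16_model are hypothetical; the model only illustrates the lane-wise semantics and is not the intrinsic itself.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate to the Q15 range. */
static int16_t clip_q15(int32_t x)
{
    if (x > 32767)  return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* Interpret the low 16 bits of a word as a signed value without
   relying on implementation-defined narrowing conversions. */
static int32_t half_to_signed(uint32_t h)
{
    h &= 0xFFFFu;
    return (h & 0x8000u) ? (int32_t)h - 65536 : (int32_t)h;
}

/* Hypothetical host-side model of a dual 16-bit saturating subtract (a - b per lane). */
static uint32_t qsub16_model(uint32_t a, uint32_t b)
{
    int16_t lo = clip_q15(half_to_signed(a) - half_to_signed(b));
    int16_t hi = clip_q15(half_to_signed(a >> 16) - half_to_signed(b >> 16));
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

With a first argument of 0, each lane of the second argument is negated independently, and a lane holding 0x8000 saturates to 0x7FFF without affecting its neighbor.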

9.1.4 arm_negate_q7

This function negates 8-bit fixed-point (Q7) values. The annotated source is as follows:
/**
* @brief Negates the elements of a Q7 vector.
* @param *pSrc points to the input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* The Q7 value -1 (0x80) will be saturated to the maximum allowable positive value 0x7F.
*/

void arm_negate_q7(
q7_t * pSrc,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q7_t in;

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t input; /* Input values1-4 */
q31_t zero = 0x00000000; (2)

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = -A */
/* Read four inputs */
input = *__SIMD32(pSrc)++; (3)

/* Store the Negated results in the destination buffer in a single cycle by packing the results */
*__SIMD32(pDst)++ = __QSUB8(zero, input); (4)

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the results in the destination buffer. */
in = *pSrc++;
*pDst++ = (in == (q7_t) 0x80) ? 0x7f : -in;

/* Decrement the loop counter */
blkCnt--;
}
}
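1. This function uses saturating arithmetic.
The value 0x80 saturates to 0x7F.
2. The local variable zero is initialized explicitly. A local variable in C holds an indeterminate value until it is assigned, so the initializer is required here.
3. Four Q7 values are read into input at once.
4. __QSUB8 negates four Q7 values in one call.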

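9.1.5 Worked example

Objective:
1. Negate values of all four data types.
Procedure:
1. Press key K1; the results are printed over the serial port.
Observed result:
View the printed output with the SecureCRT serial terminal software (included on the V5 CD).
Program: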
/*
*********************************************************************************************************
* Function: DSP_Negate
* Description: Computes negations
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Negate(void)
{
static float32_t pSrc;
static float32_t pDst;
static q31_t pSrc1;
static q31_t pDst1;
static q15_t pSrc2;
static q15_t pDst2;
static q7_t pSrc3 = 127; /* Starts at 127 so that the first increment produces 0x80, which shows 0x80 saturating to 0x7F */
static q7_t pDst3;
pSrc -= 1.23f;
arm_negate_f32(&pSrc, &pDst, 1);
printf("arm_negate_f32 = %f\r\n", pDst);

pSrc1 -= 1;
arm_negate_q31(&pSrc1, &pDst1, 1);
printf("arm_negate_q31 = %d\r\n", pDst1);

pSrc2 -= 1;
arm_negate_q15(&pSrc2, &pDst2, 1);
printf("arm_negate_q15 = %d\r\n", pDst2);

pSrc3 += 1;
arm_negate_q7(&pSrc3, &pDst3, 1);
printf("arm_negate_q7 = %d\r\n", pDst3);
printf("***********************************\r\n");
}
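
The Q7 case in the example is the interesting one: pSrc3 starts at 127, and on the ARM toolchains typically used with this code the first increment wraps it to -128 (0x80), which the negate then saturates to 127. A minimal host-side sketch of that per-sample rule follows; sat_negate_q7 is a hypothetical helper, not a CMSIS-DSP function.

```c
#include <stdint.h>

/* Hypothetical model of the per-sample Q7 saturating negate. */
static int8_t sat_negate_q7(int8_t in)
{
    /* -(-128) = 128 does not fit in 8 bits, so it is clamped to 127. */
    return (in == INT8_MIN) ? INT8_MAX : (int8_t)(-in);
}
```

Note that both -128 and -127 map to 127, so the saturated result cannot be distinguished from the exact one afterwards.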

席萌0209 posted on 2015-3-19 10:37:10

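9.2 Offset (Vector Offset)

These functions add a constant offset to each element of a vector:
pDst[n] = pSrc[n] + offset, 0 <= n < blockSize.
Note that these functions work in place: the destination pointer and the source pointer may refer to the same buffer.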

9.2.1 arm_offset_f32

This function adds an offset to 32-bit floating-point values. The annotated source is as follows:
/**
* @brief Adds a constant offset to a floating-point vector.
* @param *pSrc points to the input vector
* @param offset is the offset to be added
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*/
void arm_offset_f32(
float32_t * pSrc,
float32_t offset,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4;

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */ (1)
/* Add offset and then store the results in the destination buffer. */
/* read samples from source */
in1 = *pSrc;
in2 = *(pSrc + 1);

/* add offset to input */
in1 = in1 + offset;

/* read samples from source */
in3 = *(pSrc + 2);

/* add offset to input */
in2 = in2 + offset;

/* read samples from source */
in4 = *(pSrc + 3);

/* add offset to input */
in3 = in3 + offset;

/* store result to destination */
*pDst = in1;

/* add offset to input */
in4 = in4 + offset;

/* store result to destination */
*(pDst + 1) = in2;

/* store result to destination */
*(pDst + 2) = in3;

/* store result to destination */
*(pDst + 3) = in4;

/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (*pSrc++) + offset;

/* Decrement the loop counter */
blkCnt--;
}
}
1. Applying an offset to floating-point values is straightforward: the offset is added and the sum is written to the destination.

9.2.2 arm_offset_q31

This function adds an offset to 32-bit fixed-point (Q31) values. The annotated source is as follows:
/**
* @brief Adds a constant offset to a Q31 vector.
* @param *pSrc points to the input vector
* @param offset is the offset to be added
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range are saturated.
*/

void arm_offset_q31(
q31_t * pSrc,
q31_t offset,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4;

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer. */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;

*pDst++ = __QADD(in1, offset); (2)
*pDst++ = __QADD(in2, offset);
*pDst++ = __QADD(in3, offset);
*pDst++ = __QADD(in4, offset);

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = __QADD(*pSrc++, offset);

/* Decrement the loop counter */
blkCnt--;
}

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrc++ + offset);

/* Decrement the loop counter */
blkCnt--;
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

}
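1. This function uses saturating arithmetic.
Results outside the Q31 range [0x80000000, 0x7FFFFFFF] are clamped to the nearest limit.
2. The __QADD intrinsic was covered in the previous chapter; here it adds the two arguments with saturation.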
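
A 32-bit saturating add can be modeled in portable C by widening to 64 bits and clamping, which is the same idea as the clip_q63_to_q31 call in the Cortex-M0 branch of the listing. The sketch below is an illustration under that assumption; qadd_model is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical host-side model of a 32-bit saturating add. */
static int32_t qadd_model(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* exact: cannot overflow in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```

Sums that fit in 32 bits pass through unchanged; only out-of-range sums are clamped to the nearest limit.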

9.2.3 arm_offset_q15

This function adds an offset to 16-bit fixed-point (Q15) values. The annotated source is as follows:
/**
* @brief Adds a constant offset to a Q15 vector.
* @param *pSrc points to the input vector
* @param offset is the offset to be added
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range are saturated.
*/

void arm_offset_q15(
q15_t * pSrc,
q15_t offset,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t offset_packed; /* Offset packed to 32 bit */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* Offset is packed to 32 bit in order to use SIMD32 for addition */
offset_packed = __PKHBT(offset, offset, 16); (2)

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer, 2 samples at a time. */
*__SIMD32(pDst)++ = __QADD16(*__SIMD32(pSrc)++, offset_packed); (3)
*__SIMD32(pDst)++ = __QADD16(*__SIMD32(pSrc)++, offset_packed);

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer. */
*pDst++ = (q15_t) __QADD16(*pSrc++, offset);

/* Decrement the loop counter */
blkCnt--;
}

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer. */
*pDst++ = (q15_t) __SSAT(((q31_t) * pSrc++ + offset), 16);

/* Decrement the loop counter */
blkCnt--;
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

}
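1. This function uses saturating arithmetic.
Results outside the Q15 range [0x8000, 0x7FFF] are clamped to the nearest limit.
2. The Q15 offset is duplicated into both halves of a 32-bit word so that it can be passed to __QADD16.
3. Because __QADD16 is a SIMD instruction, one call processes two Q15 values.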
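
The packing step in note 2 can be sketched in portable C. The helper pkhbt_model below is a hypothetical stand-in that follows the usual definition of a pack-halfword operation (bottom half from the first argument, top half from the shifted second argument); pack_offset_q15 is likewise a name invented for this illustration.

```c
#include <stdint.h>

/* Hypothetical model of a pack-halfword operation: the bottom halfword
   comes from a, the top halfword from (b << shift). Assumes shift < 32. */
static uint32_t pkhbt_model(uint32_t a, uint32_t b, unsigned shift)
{
    return (a & 0x0000FFFFu) | ((b << shift) & 0xFFFF0000u);
}

/* Duplicate one Q15 offset into both 16-bit lanes of a 32-bit word. */
static uint32_t pack_offset_q15(int16_t offset)
{
    uint32_t bits = (uint16_t)offset;   /* raw 16-bit pattern, zero-extended */
    return pkhbt_model(bits, bits, 16);
}
```

Once the offset occupies both lanes, a single dual 16-bit saturating add applies it to two samples at a time.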

9.2.4 arm_offset_q7

This function adds an offset to 8-bit fixed-point (Q7) values. The annotated source is as follows:
/**
* @brief Adds a constant offset to a Q7 vector.
* @param *pSrc points to the input vector
* @param offset is the offset to be added
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range are saturated.
*/

void arm_offset_q7(
q7_t * pSrc,
q7_t offset,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t offset_packed; /* Offset packed to 32 bit */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* Offset is packed to 32 bit in order to use SIMD32 for addition */ (2)
offset_packed = __PACKq7(offset, offset, offset, offset);

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer for 4 samples at a time. */
*__SIMD32(pDst)++ = __QADD8(*__SIMD32(pSrc)++, offset_packed); (3)

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT(*pSrc++ + offset, 8);

/* Decrement the loop counter */
blkCnt--;
}

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT((q15_t) * pSrc++ + offset, 8);

/* Decrement the loop counter */
blkCnt--;
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

}
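1. This function uses saturating arithmetic.
Results outside the Q7 range [0x80, 0x7F] are clamped to the nearest limit.
2. __PACKq7 packs four Q7 values into one 32-bit word.
3. Because __QADD8 is a SIMD instruction, one call processes four Q7 values.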
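
The two ideas in these notes, packing four Q7 values into one word and saturating each 8-bit lane, can be sketched on a host machine. Both helpers are hypothetical, and the byte order (first value in the lowest byte) is an assumption of this sketch that corresponds to a little-endian build.

```c
#include <stdint.h>

/* Hypothetical model of packing four Q7 values into one word,
   with v0 in the lowest byte (little-endian layout assumed). */
static uint32_t pack_q7_model(int8_t v0, int8_t v1, int8_t v2, int8_t v3)
{
    return  (uint32_t)(uint8_t)v0
         | ((uint32_t)(uint8_t)v1 << 8)
         | ((uint32_t)(uint8_t)v2 << 16)
         | ((uint32_t)(uint8_t)v3 << 24);
}

/* Saturating add of a single Q7 lane. */
static int8_t qadd_q7_lane(int8_t a, int8_t b)
{
    int sum = a + b;                 /* promoted to int, so no overflow here */
    if (sum > 127)  return 127;
    if (sum < -128) return -128;
    return (int8_t)sum;
}
```

Casting through uint8_t before shifting keeps the bit pattern of negative values while avoiding a left shift of a negative integer.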

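9.2.5 Worked example

Objective:
1. Apply an offset to values of all four data types.
Procedure:
1. Press key K2; the results are printed over the serial port.
Observed result:
View the printed output with the SecureCRT serial terminal software (included on the V5 CD).
Program: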
/*
*********************************************************************************************************
* Function: DSP_Offset
* Description: Applies an offset
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Offset(void)
{
static float32_t pSrcA;
static float32_t Offset = 0.0f;
static float32_t pDst;
static q31_t pSrcA1;
static q31_t Offset1 = 0;
static q31_t pDst1;

static q15_t pSrcA2;
static q15_t Offset2 = 0;
static q15_t pDst2;

static q7_t pSrcA3;
static q7_t Offset3 = 0;
static q7_t pDst3;

Offset--;
arm_offset_f32(&pSrcA, Offset, &pDst, 1);
printf("arm_offset_f32 = %f\r\n", pDst);

Offset1--;
arm_offset_q31(&pSrcA1, Offset1, &pDst1, 1);
printf("arm_offset_q31 = %d\r\n", pDst1);

Offset2--;
arm_offset_q15(&pSrcA2, Offset2, &pDst2, 1);
printf("arm_offset_q15 = %d\r\n", pDst2);

Offset3--;
arm_offset_q7(&pSrcA3, Offset3, &pDst3, 1);
printf("arm_offset_q7 = %d\r\n", pDst3);
printf("***********************************\r\n");
}

席萌0209 posted on 2015-3-19 10:42:31

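9.3 Shift (Vector Shift)

These functions shift each element of a vector by a given number of bits:
pDst[n] = pSrc[n] << shift, 0 <= n < blockSize.
Note that these functions work in place: the destination pointer and the source pointer may refer to the same buffer.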

9.3.1 arm_shift_q31

This function shifts 32-bit fixed-point (Q31) values. The annotated source is as follows:
/**
* @brief Shifts the elements of a Q31 vector a specified number of bits.
* @param *pSrc points to the input vector
* @param shiftBits number of bits to shift.
* A positive value shifts left; a negative value shifts right. (1)
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
*
* Scaling and Overflow Behavior: (2)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range will be saturated.
*/

void arm_shift_q31(
q31_t * pSrc,
int8_t shiftBits,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
uint8_t sign = (shiftBits & 0x80); /* Sign of shiftBits */ (3)

#ifndef ARM_MATH_CM0_FAMILY

q31_t in1, in2, in3, in4; /* Temporary input variables */
q31_t out1, out2, out3, out4; /* Temporary output variables */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

if(sign == 0u) (4)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
 /* C = A<< shiftBits */
 /* Shift the input and then store the results in the destination buffer. */
 in1 = *pSrc;
 in2 = *(pSrc + 1);
 out1 = in1 << shiftBits;
 in3 = *(pSrc + 2);
 out2 = in2 << shiftBits;
 in4 = *(pSrc + 3);
 if(in1 != (out1 >> shiftBits)) (5)
 out1 = 0x7FFFFFFF ^ (in1 >> 31);

 if(in2 != (out2 >> shiftBits))
 out2 = 0x7FFFFFFF ^ (in2 >> 31);

 *pDst = out1;
 out3 = in3 << shiftBits;
 *(pDst + 1) = out2;
 out4 = in4 << shiftBits;

 if(in3 != (out3 >> shiftBits))
 out3 = 0x7FFFFFFF ^ (in3 >> 31);

 if(in4 != (out4 >> shiftBits))
 out4 = 0x7FFFFFFF ^ (in4 >> 31);

 *(pDst + 2) = out3;
 *(pDst + 3) = out4;

 /* Update destination pointer to process next samples */
 pSrc += 4u;
 pDst += 4u;

 /* Decrement the loop counter */
 blkCnt--;
}
}
else (6)
{

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
 /* C = A >>shiftBits */
 /* Shift the input and then store the results in the destination buffer. */
 in1 = *pSrc;
 in2 = *(pSrc + 1);
 in3 = *(pSrc + 2);
 in4 = *(pSrc + 3);

 *pDst = (in1 >> -shiftBits); (7)
 *(pDst + 1) = (in2 >> -shiftBits);
 *(pDst + 2) = (in3 >> -shiftBits);
 *(pDst + 3) = (in4 >> -shiftBits);

 pSrc += 4u;
 pDst += 4u;

 blkCnt--;
}

}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

while(blkCnt > 0u)
{
/* C = A (>> or <<) shiftBits */
/* Shift the input and then store the result in the destination buffer. */ (8)
*pDst++ = (sign == 0u) ? clip_q63_to_q31((q63_t) * pSrc++ << shiftBits) :
 (*pSrc++ >> -shiftBits);

/* Decrement the loop counter */
blkCnt--;
}

}
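1. A positive shiftBits shifts left; a negative shiftBits shifts right.
2. This function uses saturating arithmetic.
Left-shift results outside the Q31 range are clamped to 0x7FFFFFFF or 0x80000000.
3. Extracts the sign of shiftBits.
4. A non-negative shift value means a left shift.
5. A left-shifted result is kept only if shifting it back right by the same amount recovers the input. Otherwise the result has overflowed and is replaced by one of two saturated values, chosen by the sign of the input:
out = 0x7FFFFFFF ^ 0xFFFFFFFF = 0x80000000 (negative input)
out = 0x7FFFFFFF ^ 0x00000000 = 0x7FFFFFFF (non-negative input)
6. A negative shift value means a right shift.
7. The shift value is negated and the input is shifted right by that amount.
8. Handles the samples left over after the unrolled loop.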
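
The overflow check and the XOR-based choice of the saturated value can be reproduced in a host-side sketch. To stay within well-defined C, this version widens to 64 bits and multiplies instead of left-shifting a possibly negative value; sat_shl_q31 is a hypothetical name and the sketch assumes 0 <= shift <= 31.

```c
#include <stdint.h>

/* Hypothetical model of the Q31 saturating left shift (0 <= shift <= 31). */
static int32_t sat_shl_q31(int32_t in, unsigned shift)
{
    /* Multiply in 64 bits instead of shifting a possibly negative value. */
    int64_t wide = (int64_t)in * ((int64_t)1 << shift);

    if (wide > INT32_MAX || wide < INT32_MIN) {
        /* Same values as 0x7FFFFFFF ^ (in >> 31) with an arithmetic shift:
           the mask is all ones for negative inputs and zero otherwise. */
        uint32_t mask = (in < 0) ? 0xFFFFFFFFu : 0u;
        return ((0x7FFFFFFFu ^ mask) == 0x80000000u) ? INT32_MIN : INT32_MAX;
    }
    return (int32_t)wide;
}
```

For example, the bit pattern 0x88886666 is the negative value -2004326810, so shifting it left by 3 overflows and the model returns the most negative Q31 value.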

9.3.2 arm_shift_q15

This function shifts 16-bit fixed-point (Q15) values. The annotated source is as follows:
/**
* @brief Shifts the elements of a Q15 vector a specified number of bits.
* @param *pSrc points to the input vector
* @param shiftBits number of bits to shift.
* A positive value shifts left; a negative value shifts right. (1)
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (2)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range will be saturated.
*/

void arm_shift_q15(
q15_t * pSrc,
int8_t shiftBits,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
uint8_t sign; /* Sign of shiftBits */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */

q15_t in1, in2; /* Temporary variables */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80); (3)

/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
 /* Read 2 inputs */
 in1 = *pSrc++;
 in2 = *pSrc++;
 /* C = A << shiftBits */
 /* Shift the inputs and then store the results in the destination buffer. */
#ifndef ARM_MATH_BIG_ENDIAN

 *__SIMD32(pDst)++ = __PKHBT(__SSAT((in1 << shiftBits), 16),
 __SSAT((in2 << shiftBits), 16), 16);

#else

 *__SIMD32(pDst)++ = __PKHBT(__SSAT((in2 << shiftBits), 16), (4)
 __SSAT((in1 << shiftBits), 16), 16);

#endif /* #ifndef ARM_MATH_BIG_ENDIAN */

 in1 = *pSrc++;
 in2 = *pSrc++;

#ifndef ARM_MATH_BIG_ENDIAN

 *__SIMD32(pDst)++ = __PKHBT(__SSAT((in1 << shiftBits), 16),
 __SSAT((in2 << shiftBits), 16), 16);

#else

 *__SIMD32(pDst)++ = __PKHBT(__SSAT((in2 << shiftBits), 16),
 __SSAT((in1 << shiftBits), 16), 16);

#endif /* #ifndef ARM_MATH_BIG_ENDIAN */

 /* Decrement the loop counter */
 blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
 /* C = A << shiftBits */
 /* Shift and then store the results in the destination buffer. */
 *pDst++ = __SSAT((*pSrc++ << shiftBits), 16); (5)

 /* Decrement the loop counter */
 blkCnt--;
}
}
else (6)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
 /* Read 2 inputs */
 in1 = *pSrc++;
 in2 = *pSrc++;

 /* C = A >> shiftBits */
 /* Shift the inputs and then store the results in the destination buffer. */
#ifndef ARM_MATH_BIG_ENDIAN

 *__SIMD32(pDst)++ = __PKHBT((in1 >> -shiftBits),
 (in2 >> -shiftBits), 16);

#else

 *__SIMD32(pDst)++ = __PKHBT((in2 >> -shiftBits), (7)
 (in1 >> -shiftBits), 16);

#endif /* #ifndef ARM_MATH_BIG_ENDIAN */

 in1 = *pSrc++;
 in2 = *pSrc++;

#ifndef ARM_MATH_BIG_ENDIAN

 *__SIMD32(pDst)++ = __PKHBT((in1 >> -shiftBits),
 (in2 >> -shiftBits), 16);

#else

 *__SIMD32(pDst)++ = __PKHBT((in2 >> -shiftBits),
 (in1 >> -shiftBits), 16);

#endif /* #ifndef ARM_MATH_BIG_ENDIAN */

 /* Decrement the loop counter */
 blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
 /* C = A >> shiftBits */
 /* Shift the inputs and then store the results in the destination buffer. */
 *pDst++ = (*pSrc++ >> -shiftBits);

 /* Decrement the loop counter */
 blkCnt--;
}
}

#else

/* Run the below code for Cortex-M0 */

/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80);

/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
 /* C = A << shiftBits */
 /* Shift and then store the results in the destination buffer. */
 *pDst++ = __SSAT(((q31_t) * pSrc++ << shiftBits), 16);

 /* Decrement the loop counter */
 blkCnt--;
}
}
else
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
 /* C = A >> shiftBits */
 /* Shift the inputs and then store the results in the destination buffer. */
 *pDst++ = (*pSrc++ >> -shiftBits);

 /* Decrement the loop counter */
 blkCnt--;
}
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

}
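1. A positive shiftBits shifts left; a negative shiftBits shifts right.
2. This function uses saturating arithmetic.
Left-shift results outside the Q15 range are clamped to 0x7FFF or 0x8000.
3. Extracts the sign of the shift value.
4. One __PKHBT call packs two shifted, saturated Q15 results into a single 32-bit word for the store.
5. Handles the remaining samples.
6. A negative shift value means a right shift.
7. After the shift value is negated, one __PKHBT call packs two right-shifted Q15 results into a single 32-bit word.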
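
The sign-based dispatch can be summarized for a single sample as follows. This is a hedged host-side sketch: shift_q15_model is a hypothetical name, and the sketch assumes that shiftBits lies between -15 and 15.

```c
#include <stdint.h>

/* Hypothetical model of one Q15 shift: a positive shiftBits shifts left with
   saturation, a negative shiftBits shifts right arithmetically.
   Assumes -15 <= shiftBits <= 15. */
static int16_t shift_q15_model(int16_t in, int8_t shiftBits)
{
    if (shiftBits >= 0) {
        /* Left shift done as a multiply so negative inputs are well defined. */
        int32_t wide = (int32_t)in * ((int32_t)1 << shiftBits);
        if (wide > INT16_MAX) return INT16_MAX;
        if (wide < INT16_MIN) return INT16_MIN;
        return (int16_t)wide;
    } else {
        int s = -shiftBits;
        int32_t x = in;
        /* Portable arithmetic right shift: complement, shift, complement. */
        int32_t r = (x < 0) ? ~(~x >> s) : (x >> s);
        return (int16_t)r;
    }
}
```

Only the left-shift path can overflow, which is why saturation appears in that branch alone; a right shift always moves the value toward zero or -1.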

9.3.3 arm_shift_q7

This function shifts 8-bit fixed-point (Q7) values. The annotated source is as follows:
/**
* @brief Shifts the elements of a Q7 vector a specified number of bits.
* @param *pSrc points to the input vector
* @param shiftBits number of bits to shift.
* A positive value shifts left; a negative value shifts right. (1)
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* \par Conditions for optimum performance
*Input and output buffers should be aligned by 32-bit
*
*
* Scaling and Overflow Behavior: (2)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range will be saturated.
*/

void arm_shift_q7(
q7_t * pSrc,
int8_t shiftBits,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
uint8_t sign; /* Sign of shiftBits */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q7_t in1; /* Input value1 */
q7_t in2; /* Input value2 */
q7_t in3; /* Input value3 */
q7_t in4; /* Input value4 */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80); (3)

/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
 /* C = A << shiftBits */
 /* Read 4 inputs */
 in1 = *pSrc;
 in2 = *(pSrc + 1);
 in3 = *(pSrc + 2);
 in4 = *(pSrc + 3);
 (4)
 /* Store the Shifted result in the destination buffer in single cycle by packing the outputs */
 *__SIMD32(pDst)++ = __PACKq7(__SSAT((in1 << shiftBits), 8),
 __SSAT((in2 << shiftBits), 8),
 __SSAT((in3 << shiftBits), 8),
 __SSAT((in4 << shiftBits), 8));
 /* Update source pointer to process next samples */
 pSrc += 4u;

 /* Decrement the loop counter */
 blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
 /* C = A << shiftBits */ (5)
 /* Shift the input and then store the result in the destination buffer. */
 *pDst++ = (q7_t) __SSAT((*pSrc++ << shiftBits), 8);

 /* Decrement the loop counter */
 blkCnt--;
}
}
else (6)
{
shiftBits = -shiftBits;
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
 /* C = A >> shiftBits */
 /* Read 4 inputs */
 in1 = *pSrc;
 in2 = *(pSrc + 1);
 in3 = *(pSrc + 2);
 in4 = *(pSrc + 3);

 /* Store the Shifted result in the destination buffer in single cycle by packing the outputs */
 *__SIMD32(pDst)++ = __PACKq7((in1 >> shiftBits), (in2 >> shiftBits),
 (in3 >> shiftBits), (in4 >> shiftBits));

 pSrc += 4u;

 /* Decrement the loop counter */
 blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
 /* C = A >> shiftBits */
 /* Shift the input and then store the result in the destination buffer. */
 in1 = *pSrc++;
 *pDst++ = (in1 >> shiftBits);

 /* Decrement the loop counter */
 blkCnt--;
}
}

#else

/* Run the below code for Cortex-M0 */

/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80);

/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
 /* C = A << shiftBits */
 /* Shift the input and then store the result in the destination buffer. */
 *pDst++ = (q7_t) __SSAT(((q15_t) * pSrc++ << shiftBits), 8);

 /* Decrement the loop counter */
 blkCnt--;
}
}
else
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
 /* C = A >> shiftBits */
 /* Shift the input and then store the result in the destination buffer. */
 *pDst++ = (*pSrc++ >> -shiftBits);

 /* Decrement the loop counter */
 blkCnt--;
}
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}
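1. A positive shiftBits shifts left; a negative shiftBits shifts right.
2. This function uses saturating arithmetic.
Left-shift results outside the Q7 range are clamped to 0x7F or 0x80.
3. Extracts the sign of the shift value.
4. One __PACKq7 call packs four shifted Q7 results into a single 32-bit word.
5. Handles the remaining samples (fewer than four).
6. A negative shift value means a right shift.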

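9.3.4 Worked example

Objective:
1. Shift values of the three fixed-point data types.
Procedure:
1. Press key K3; the results are printed over the serial port.
Observed result:
View the printed output with the SecureCRT serial terminal software (included on the V5 CD).
Program: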
/*
*********************************************************************************************************
* Function: DSP_Shift
* Description: Applies a shift
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Shift(void)
{
static q31_t pSrcA1 = 0x88886666;
static q31_t pDst1;

static q15_t pSrcA2 = 0x8866;
static q15_t pDst2;

static q7_t pSrcA3 = 0x86;
static q7_t pDst3;

arm_shift_q31(&pSrcA1, 3, &pDst1, 1);
printf("arm_shift_q31 = %8x\r\n", pDst1);

arm_shift_q15(&pSrcA2, -3, &pDst2, 1);
printf("arm_shift_q15 = %4x\r\n", pDst2);

arm_shift_q7(&pSrcA3, 3, &pDst3, 1);
printf("arm_shift_q7 = %2x\r\n", pDst3);
printf("***********************************\r\n");
}

席萌0209 posted on 2015-3-19 10:47:29

9.4 Subtraction (Vector Sub)

These functions subtract one vector from another, element by element:
pDst[n] = pSrcA[n] - pSrcB[n], 0 <= n < blockSize.

9.4.1 arm_sub_f32

This function subtracts 32-bit floating-point vectors. The annotated source is as follows:
/**
* @brief Floating-point vector subtraction.
* @param *pSrcA points to the first input vector
* @param *pSrcB points to the second input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in each vector
* @return none.
*/

void arm_sub_f32(
float32_t * pSrcA,
float32_t * pSrcB,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t inA1, inA2, inA3, inA4; /* temporary variables */
float32_t inB1, inB2, inB3, inB4; /* temporary variables */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer. */
/* Read 4 input samples from sourceA and sourceB */
inA1 = *pSrcA;
inB1 = *pSrcB;
inA2 = *(pSrcA + 1);
inB2 = *(pSrcB + 1);
inA3 = *(pSrcA + 2);
inB3 = *(pSrcB + 2);
inA4 = *(pSrcA + 3);
inB4 = *(pSrcB + 3);

/* dst = srcA - srcB */
/* subtract and store the result */ (1)
*pDst = inA1 - inB1;
*(pDst + 1) = inA2 - inB2;
*(pDst + 2) = inA3 - inB3;
*(pDst + 3) = inA4 - inB4;

/* Update pointers to process next samples */
pSrcA += 4u;
pSrcB += 4u;
pDst += 4u;

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer. */
*pDst++ = (*pSrcA++) - (*pSrcB++);

/* Decrement the loop counter */
blkCnt--;
}
}
1. Floating-point subtraction is straightforward: the two values are simply subtracted.
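The structure of the listing (four samples per pass, then a tail loop for the 0 to 3 leftovers) can be reproduced in portable C. This is a minimal sketch for experimenting on a PC; the name sub_f32_unrolled is hypothetical and the code is not the library implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise a - b, unrolled by four with a tail loop. */
static void sub_f32_unrolled(const float *a, const float *b, float *dst, uint32_t n)
{
    uint32_t blk = n >> 2;              /* number of complete groups of four */

    assert(a != NULL && b != NULL && dst != NULL);

    while (blk > 0u) {
        dst[0] = a[0] - b[0];
        dst[1] = a[1] - b[1];
        dst[2] = a[2] - b[2];
        dst[3] = a[3] - b[3];
        a += 4; b += 4; dst += 4;
        blk--;
    }

    blk = n % 4u;                       /* 0 to 3 samples are left over */
    while (blk > 0u) {
        *dst++ = *a++ - *b++;
        blk--;
    }
}
```

With a block size of 5, as used in the example of section 9.4.5, the first loop runs once and the tail loop handles the fifth sample.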

9.4.2 arm_sub_q31

This function subtracts 32-bit fixed-point (Q31) vectors. The annotated source is:
/**
* @brief Q31 vector subtraction.
* @param *pSrcA points to the first input vector
* @param *pSrcB points to the second input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range will be saturated.
*/

void arm_sub_q31(
q31_t * pSrcA,
q31_t * pSrcB,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4;
q31_t inB1, inB2, inB3, inB4;

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;

inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;

*pDst++ = __QSUB(inA1, inB1); (2)
*pDst++ = __QSUB(inA2, inB2);
*pDst++ = __QSUB(inA3, inB3);
*pDst++ = __QSUB(inA4, inB4);

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = __QSUB(*pSrcA++, *pSrcB++);

/* Decrement the loop counter */
blkCnt--;
}

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrcA++ - *pSrcB++);

/* Decrement the loop counter */
blkCnt--;
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

}
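1. This function uses saturating arithmetic.
Results outside the Q31 range are clamped to 0x7FFFFFFF (positive overflow) or 0x80000000 (negative overflow).
2. __QSUB maps to the saturating subtract instruction. It handles one 32-bit value per call (it is not a packed SIMD operation) and is used here for the saturating subtraction of two Q31 values.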
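What saturating Q31 subtraction means can be written out in portable C using a 64-bit intermediate, much like the Cortex-M0 branch does with clip_q63_to_q31. The names sat_sub_q31 and sub_q31_model below are hypothetical; this is a sketch for checking values on a PC, not the library code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating 32-bit subtraction: the difference is clamped to the Q31 range. */
static int32_t sat_sub_q31(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a - (int64_t)b;   /* cannot overflow in 64 bits */

    if (wide > INT32_MAX) return INT32_MAX;   /* 0x7FFFFFFF */
    if (wide < INT32_MIN) return INT32_MIN;   /* 0x80000000 */
    return (int32_t)wide;
}

/* Vector form: dst[i] = saturate(a[i] - b[i]). */
static void sub_q31_model(const int32_t *a, const int32_t *b, int32_t *dst, uint32_t n)
{
    assert(a != NULL && b != NULL && dst != NULL);
    for (uint32_t i = 0u; i < n; i++) {
        dst[i] = sat_sub_q31(a[i], b[i]);
    }
}
```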

9.4.3 arm_sub_q15

This function subtracts 16-bit fixed-point (Q15) vectors. The annotated source is:
/**
* @brief Q15 vector subtraction.
* @param *pSrcA points to the first input vector
* @param *pSrcB points to the second input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior:
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range will be saturated.
*/

void arm_sub_q15(
q15_t * pSrcA,
q15_t * pSrcB,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2;
q31_t inB1, inB2;

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer two samples at a time. */
inA1 = *__SIMD32(pSrcA)++; (1)
inA2 = *__SIMD32(pSrcA)++;
inB1 = *__SIMD32(pSrcB)++;
inB2 = *__SIMD32(pSrcB)++;

*__SIMD32(pDst)++ = __QSUB16(inA1, inB1); (2)
*__SIMD32(pDst)++ = __QSUB16(inA2, inB2);

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q15_t) __QSUB16(*pSrcA++, *pSrcB++);

/* Decrement the loop counter */
blkCnt--;
}

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q15_t) __SSAT(((q31_t) * pSrcA++ - *pSrcB++), 16);

/* Decrement the loop counter */
blkCnt--;
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

}
1. Each 32-bit read here fetches two Q15 values at once.
2. Because __QSUB16 is a SIMD instruction, a single call performs two subtractions.
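To see what "two subtractions in one call" means, the effect of __QSUB16 on a packed 32-bit word can be modelled lane by lane. The sketch below uses hypothetical names (lane16, qsub16_model) and assumes that converting an out-of-range value to int16_t wraps in two's complement, which holds for the usual compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Extract signed 16-bit lane 0 (low half) or 1 (high half) from a packed word. */
static int16_t lane16(uint32_t w, unsigned lane)
{
    assert(lane < 2u);
    return (int16_t)(uint16_t)(w >> (16u * lane));
}

/* Model of a dual 16-bit saturating subtract on packed operands. */
static uint32_t qsub16_model(uint32_t a, uint32_t b)
{
    uint32_t out = 0u;

    for (unsigned lane = 0u; lane < 2u; lane++) {
        int32_t d = (int32_t)lane16(a, lane) - (int32_t)lane16(b, lane);
        if (d > INT16_MAX) d = INT16_MAX;     /* clamp each lane separately */
        if (d < INT16_MIN) d = INT16_MIN;
        out |= (uint32_t)(uint16_t)d << (16u * lane);
    }
    return out;
}
```

Each lane saturates on its own: an overflow in the high half does not disturb the low half.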

9.4.4 arm_sub_q7

This function subtracts 8-bit fixed-point (Q7) vectors. The annotated source is:
/**
* @brief Q7 vector subtraction.
* @param *pSrcA points to the first input vector
* @param *pSrcB points to the second input vector
* @param *pDst points to the output vector
* @param blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior:
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range will be saturated.
*/

void arm_sub_q7(
q7_t * pSrcA,
q7_t * pSrcB,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer 4 samples at a time. */
*__SIMD32(pDst)++ = __QSUB8(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++); (1)

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = __SSAT(*pSrcA++ - *pSrcB++, 8);

/* Decrement the loop counter */
blkCnt--;
}

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT((q15_t) * pSrcA++ - *pSrcB++, 8);

/* Decrement the loop counter */
blkCnt--;
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

}
1. __QSUB8 is also a SIMD instruction; a single call subtracts four Q7 values.
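The four-lane behaviour of __QSUB8 can be modelled in the same way as the 16-bit case. Again this is a sketch with hypothetical names (lane8, qsub8_model), assuming two's complement wrap-around when converting to int8_t.

```c
#include <assert.h>
#include <stdint.h>

/* Extract signed 8-bit lane 0..3 from a packed word (lane 0 is the lowest byte). */
static int8_t lane8(uint32_t w, unsigned lane)
{
    assert(lane < 4u);
    return (int8_t)(uint8_t)(w >> (8u * lane));
}

/* Model of a quad 8-bit saturating subtract on packed operands. */
static uint32_t qsub8_model(uint32_t a, uint32_t b)
{
    uint32_t out = 0u;

    for (unsigned lane = 0u; lane < 4u; lane++) {
        int32_t d = (int32_t)lane8(a, lane) - (int32_t)lane8(b, lane);
        if (d > INT8_MAX) d = INT8_MAX;       /* clamp each lane separately */
        if (d < INT8_MIN) d = INT8_MIN;
        out |= (uint32_t)(uint8_t)d << (8u * lane);
    }
    return out;
}
```

The lowest lane of the first check below uses 0x71 - 0x7F, the first Q7 pair of the example in section 9.4.5, which gives -14 (0xF2) without saturation.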

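9.4.5 Worked example

Purpose:
1. Subtract data of four types.
Procedure:
1. Press the UP key; the results are printed over the serial port.
Result:
The printed output can be viewed with the serial terminal program SecureCRT (included on the V5 CD).
Program: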
/*
*********************************************************************************************************
* Function: DSP_Sub
* Description: subtraction
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Sub(void)
{
/* The array sizes and the [0] indices were lost in the forum formatting. */
/* They are restored here on the assumption that five-element arrays and element 0 were intended. */
static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
static float32_t pSrcB[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
static float32_t pDst[5];
static q31_t pSrcA1[5] = {1,1,1,1,1};
static q31_t pSrcB1[5] = {1,1,1,1,1};
static q31_t pDst1[5];

static q15_t pSrcA2[5] = {1,1,1,1,1};
static q15_t pSrcB2[5] = {1,1,1,1,1};
static q15_t pDst2[5];

static q7_t pSrcA3[5] = {0x70,1,1,1,1};
static q7_t pSrcB3[5] = {0x7f,1,1,1,1};
static q7_t pDst3[5];

/* The first element changes on every call so that each key press prints a new result */
pSrcA[0] += 1.1f;
arm_sub_f32(pSrcA, pSrcB, pDst, 5);
printf("arm_sub_f32 = %f\r\n", pDst[0]);
pSrcA1[0] += 1;
arm_sub_q31(pSrcA1, pSrcB1, pDst1, 5);
printf("arm_sub_q31 = %d\r\n", pDst1[0]);

pSrcA2[0] += 1;
arm_sub_q15(pSrcA2, pSrcB2, pDst2, 5);
printf("arm_sub_q15 = %d\r\n", pDst2[0]);

pSrcA3[0] += 1;
arm_sub_q7(pSrcA3, pSrcB3, pDst3, 5);
printf("arm_sub_q7 = %d\r\n", pDst3[0]);
printf("***********************************\r\n");
}

席萌0209 posted on 2015-3-19 10:54:22

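9.5 Scaling (Vector Scale)

These functions scale data up or down. For floating-point data the operation is:
 pDst[n] = pSrc[n] * scale, 0 <= n < blockSize.
For data in Q31, Q15 or Q7 format the operation is:
 pDst[n] = (pSrc[n] * scaleFract) << shift, 0 <= n < blockSize.
In this case the overall scale factor is:
 scale = scaleFract * 2^shift.
 Note that these functions allow the destination pointer and the source pointer to refer to the same buffer.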
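The relation scale = scaleFract * 2^shift means that a gain of 1.0 or more, which does not fit in a fractional format, has to be split into a fraction and a shift. The helper below is a sketch of one possible split for Q15; split_scale_q15 is a hypothetical name (it is not part of CMSIS-DSP) and the scale is assumed to be positive and finite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a positive scale into a Q15 fraction in [0.5, 1) and a shift so that
 * scale is approximately (fract / 32768) * 2^shift. */
static int16_t split_scale_q15(float scale, int8_t *shift)
{
    int s = 0;
    int32_t fract;

    assert(scale > 0.0f && shift != NULL);

    while (scale >= 1.0f) { scale *= 0.5f; s++; }   /* too large: halve, shift up  */
    while (scale < 0.5f)  { scale *= 2.0f; s--; }   /* too small: double, shift down */

    fract = (int32_t)(scale * 32768.0f + 0.5f);     /* round to nearest Q15 value */
    if (fract > INT16_MAX) fract = INT16_MAX;

    *shift = (int8_t)s;
    return (int16_t)fract;
}
```

For a gain of 3.0 this gives a fraction of 0.75 (0x6000 in Q15) and a shift of 2, since 0.75 * 2^2 = 3.0. The pair could then be passed as scaleFract and shift to arm_scale_q15.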

9.5.1 arm_scale_f32

This function scales a 32-bit floating-point vector. The annotated source is:
/**
* @brief Multiplies a floating-point vector by a scalar.
* @param *pSrc points to the input vector
* @param scale scale factor to be applied
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*/

void arm_scale_f32(
float32_t * pSrc,
float32_t scale,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4; /* temporary variables */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the results in the destination buffer. */
/* read input samples from source */
in1 = *pSrc;
in2 = *(pSrc + 1);

/* multiply with scaling factor */ (1)
in1 = in1 * scale;

/* read input sample from source */
in3 = *(pSrc + 2);

/* multiply with scaling factor */
in2 = in2 * scale;

/* read input sample from source */
in4 = *(pSrc + 3);

/* multiply with scaling factor */
in3 = in3 * scale;
in4 = in4 * scale;
/* store the result to destination */
*pDst = in1;
*(pDst + 1) = in2;
*(pDst + 2) = in3;
*(pDst + 3) = in4;

/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (*pSrc++) * scale;

/* Decrement the loop counter */
blkCnt--;
}
}
1. Scaling floating-point data is straightforward: each source value is multiplied by the scale factor.

9.5.2 arm_scale_q31

This function scales a 32-bit fixed-point (Q31) vector. The annotated source is:
/**
* @brief Multiplies a Q31 vector by a scalar.
* @param *pSrc points to the input vector
* @param scaleFract fractional portion of the scale value
* @param shift number of bits to shift the result by
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The input data <code>*pSrc</code> and <code>scaleFract</code> are in 1.31 format.
* These are multiplied to yield a 2.62 intermediate result and this is shifted with saturation to 1.31 format.
*/

void arm_scale_q31(
q31_t * pSrc,
q31_t scaleFract,
int8_t shift,
q31_t * pDst,
uint32_t blockSize)
{
int8_t kShift = shift + 1; /* Shift to apply after scaling */ (2)
int8_t sign = (kShift & 0x80);
uint32_t blkCnt; /* loop counter */
q31_t in, out;

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */

q31_t in1, in2, in3, in4; /* temporary input variables */
q31_t out1, out2, out3, out4; /* temporary output variables */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

if(sign == 0u) (3)
{
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
 /* read four inputs from source */
 in1 = *pSrc;
 in2 = *(pSrc + 1);
 in3 = *(pSrc + 2);
 in4 = *(pSrc + 3);

 /* multiply input with scaler value */ (4)
 in1 = ((q63_t) in1 * scaleFract) >> 32;
 in2 = ((q63_t) in2 * scaleFract) >> 32;
 in3 = ((q63_t) in3 * scaleFract) >> 32;
 in4 = ((q63_t) in4 * scaleFract) >> 32;

 /* apply shifting */
 out1 = in1 << kShift;
 out2 = in2 << kShift;

 /* saturate the results. */
 if(in1 != (out1 >> kShift)) (5)
 out1 = 0x7FFFFFFF ^ (in1 >> 31);

 if(in2 != (out2 >> kShift))
 out2 = 0x7FFFFFFF ^ (in2 >> 31);

 out3 = in3 << kShift;
 out4 = in4 << kShift;

 *pDst = out1;
 *(pDst + 1) = out2;

 if(in3 != (out3 >> kShift))
 out3 = 0x7FFFFFFF ^ (in3 >> 31);

 if(in4 != (out4 >> kShift))
 out4 = 0x7FFFFFFF ^ (in4 >> 31);

 /* Store result destination */
 *(pDst + 2) = out3;
 *(pDst + 3) = out4;

 /* Update pointers to process next samples */
 pSrc += 4u;
 pDst += 4u;

 /* Decrement the loop counter */
 blkCnt--;
}

}
else {
/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
 /* read four inputs from source */
 in1 = *pSrc;
 in2 = *(pSrc + 1);
 in3 = *(pSrc + 2);
 in4 = *(pSrc + 3);

 /* multiply input with scaler value */
 in1 = ((q63_t) in1 * scaleFract) >> 32;
 in2 = ((q63_t) in2 * scaleFract) >> 32;
 in3 = ((q63_t) in3 * scaleFract) >> 32;
 in4 = ((q63_t) in4 * scaleFract) >> 32;

 /* apply shifting */ (6)
 out1 = in1 >> -kShift;
 out2 = in2 >> -kShift;

 out3 = in3 >> -kShift;
 out4 = in4 >> -kShift;

 /* Store result destination */
 *pDst = out1;
 *(pDst + 1) = out2;

 *(pDst + 2) = out3;
 *(pDst + 3) = out4;

 /* Update pointers to process next samples */
 pSrc += 4u;
 pDst += 4u;

 /* Decrement the loop counter */
 blkCnt--;
}
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

if(sign == 0)
{
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
in = *pSrc++;
in = ((q63_t) in * scaleFract) >> 32;

out = in << kShift;
if(in != (out >> kShift))
out = 0x7FFFFFFF ^ (in >> 31);

*pDst++ = out;

/* Decrement the loop counter */
blkCnt--;
}
}
else
{
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
in = *pSrc++;
in = ((q63_t) in * scaleFract) >> 32;

out = in >> -kShift;

*pDst++ = out;

/* Decrement the loop counter */
blkCnt--;
}

}
}
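1. The source data and the scale factor are both in Q31 format, so their product is in 1.31 * 1.31 = 2.62 format. Because the output must also be Q31, the product is shifted right by 32 bits and the final result is saturated.
2. The original post left the reason for the +1 open. It appears to compensate for the 32-bit right shift: keeping only the upper 32 bits of a 2.62 product leaves a value in 2.30 format, and one extra left shift brings it back to 1.31.
3. If the shift value is non-negative the data is shifted left; otherwise it is shifted right.
4. The product of the source data and the scale factor is shifted right (not left) by 32 bits, so that its upper 32 bits fit into a 32-bit variable.
5. This is the saturation of the result: if shifting the output back does not reproduce the input, an overflow has occurred and the output is replaced by 0x7FFFFFFF or 0x80000000, depending on the sign of the input.
6. A right shift cannot overflow, so no saturation is needed; the shift amount is simply the negated kShift.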
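The role of the +1 can be checked numerically with a reference model of one output sample. The sketch below is intended to mirror the arithmetic of the listing, not to replace it: scale_q31_model is a hypothetical name, the shift range is restricted for simplicity, and right shifts of negative values are assumed to be arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model of one arm_scale_q31 output sample. */
static int32_t scale_q31_model(int32_t in, int32_t scaleFract, int8_t shift)
{
    int kShift = shift + 1;                 /* +1 turns the 2.30 value back into 1.31 */
    int32_t hi;

    assert(kShift > -31 && kShift < 31);    /* range assumed for this sketch */

    /* 1.31 * 1.31 = 2.62; keeping the upper 32 bits leaves a 2.30 value. */
    hi = (int32_t)(((int64_t)in * scaleFract) >> 32);

    if (kShift >= 0) {
        /* Left shift in 64 bits (as a multiplication), then saturate to Q31. */
        int64_t wide = (int64_t)hi * ((int64_t)1 << kShift);
        if (wide > INT32_MAX) return INT32_MAX;
        if (wide < INT32_MIN) return INT32_MIN;
        return (int32_t)wide;
    }

    return hi >> -kShift;                   /* a right shift cannot overflow */
}
```

With in = 0x40000000 (0.5), scaleFract = 0x40000000 (0.5) and shift = 0, the model returns 0x20000000, which is 0.25 in Q31. Without the +1 the result would be 0x10000000, half the expected value.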

9.5.3 arm_scale_q15

This function scales a 16-bit fixed-point (Q15) vector. The annotated source is:
/**
* @brief Multiplies a Q15 vector by a scalar.
* @param *pSrc points to the input vector
* @param scaleFract fractional portion of the scale value
* @param shift number of bits to shift the result by
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The input data <code>*pSrc</code> and <code>scaleFract</code> are in 1.15 format.
* These are multiplied to yield a 2.30 intermediate result and this is shifted with saturation to 1.15 format.
*/

void arm_scale_q15(
q15_t * pSrc,
q15_t scaleFract,
int8_t shift,
q15_t * pDst,
uint32_t blockSize)
{
int8_t kShift = 15 - shift; /* shift to apply after scaling */ (2)
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q15_t in1, in2, in3, in4;
q31_t inA1, inA2; /* Temporary variables */
q31_t out1, out2, out3, out4;

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Reading 2 inputs from memory */
inA1 = *__SIMD32(pSrc)++; (3)
inA2 = *__SIMD32(pSrc)++;

/* C = A * scale */
/* Scale the inputs and then store the 2 results in the destination buffer
* in single cycle by packing the outputs */
out1 = (q31_t) ((q15_t) (inA1 >> 16) * scaleFract); (4)
out2 = (q31_t) ((q15_t) inA1 * scaleFract);
out3 = (q31_t) ((q15_t) (inA2 >> 16) * scaleFract);
out4 = (q31_t) ((q15_t) inA2 * scaleFract);

/* apply shifting */
out1 = out1 >> kShift;
out2 = out2 >> kShift;
out3 = out3 >> kShift;
out4 = out4 >> kShift;

/* saturate the output */
in1 = (q15_t) (__SSAT(out1, 16)); (5)
in2 = (q15_t) (__SSAT(out2, 16));
in3 = (q15_t) (__SSAT(out3, 16));
in4 = (q15_t) (__SSAT(out4, 16));

/* store the result to destination */ (6)
*__SIMD32(pDst)++ = __PKHBT(in2, in1, 16);
*__SIMD32(pDst)++ = __PKHBT(in4, in3, 16);

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q15_t) (__SSAT(((*pSrc++) * scaleFract) >> kShift, 16));

/* Decrement the loop counter */
blkCnt--;
}

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q15_t) (__SSAT(((q31_t) * pSrc++ * scaleFract) >> kShift, 16));

/* Decrement the loop counter */
blkCnt--;
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

}
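1. The source data and the scale factor are both in Q15 format, so their product is in 1.15 * 1.15 = 2.30 format. Because the output must also be Q15, the result is saturated.
2. This variable is a neat trick: because kShift = 15 - shift, positive (left) and negative (right) values of shift are both handled by a single right shift.
3. Each 32-bit read fetches two Q15 values.
4. Multiply the source data by the scale factor and store the product in a Q31 variable.
5. Saturate the output.
6. A single __PKHBT call packs two Q15 values into one 32-bit word, which is then written to the destination.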
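The kShift = 15 - shift trick from note 2 can be tried out with a one-sample model. This is a sketch under stated assumptions: scale_q15_model is a hypothetical name, shift is limited to the range -15 to 15, and right shifts of negative values are assumed to be arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model of one arm_scale_q15 output sample. */
static int16_t scale_q15_model(int16_t in, int16_t scaleFract, int8_t shift)
{
    int kShift = 15 - shift;               /* one right shift covers both directions */
    int32_t v;

    assert(kShift >= 0 && kShift < 31);    /* i.e. -15 <= shift <= 15 in this sketch */

    /* 1.15 * 1.15 = 2.30; shifting right by 15 returns to Q15 when shift is 0. */
    v = ((int32_t)in * scaleFract) >> kShift;

    if (v > INT16_MAX) v = INT16_MAX;      /* saturate to the Q15 range */
    if (v < INT16_MIN) v = INT16_MIN;
    return (int16_t)v;
}
```

With the values used in section 9.5.5 (input 0x6fff, scale factor 0x7000 after the increment, shift 0) the model gives 0x61ff. A shift of 3 applied to 0.5 * 0.5 would give 2.0, which saturates to 0x7fff.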

9.5.4 arm_scale_q7

This function scales an 8-bit fixed-point (Q7) vector. The annotated source is:
/**
* @brief Multiplies a Q7 vector by a scalar.
* @param *pSrc points to the input vector
* @param scaleFract fractional portion of the scale value
* @param shift number of bits to shift the result by
* @param *pDst points to the output vector
* @param blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The input data <code>*pSrc</code> and <code>scaleFract</code> are in 1.7 format.
* These are multiplied to yield a 2.14 intermediate result and this is shifted with saturation to 1.7 format.
*/

void arm_scale_q7(
q7_t * pSrc,
q7_t scaleFract,
int8_t shift,
q7_t * pDst,
uint32_t blockSize)
{
int8_t kShift = 7 - shift; /* shift to apply after scaling */ (2)
uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

/* Run the below code for Cortex-M4 and Cortex-M3 */
q7_t in1, in2, in3, in4, out1, out2, out3, out4; /* Temporary variables to store input & output */

/*loop Unrolling */
blkCnt = blockSize >> 2u;

/* First part of the processing with loop unrolling.Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Reading 4 inputs from memory */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;

/* C = A * scale */
/* Scale the inputs and then store the results in the temporary variables. */
out1 = (q7_t) (__SSAT(((in1) * scaleFract) >> kShift, 8)); (3)
out2 = (q7_t) (__SSAT(((in2) * scaleFract) >> kShift, 8));
out3 = (q7_t) (__SSAT(((in3) * scaleFract) >> kShift, 8));
out4 = (q7_t) (__SSAT(((in4) * scaleFract) >> kShift, 8));

/* Packing the individual outputs into 32bit and storing in
* destination buffer in single write */
*__SIMD32(pDst)++ = __PACKq7(out1, out2, out3, out4); (4)

/* Decrement the loop counter */
blkCnt--;
}

/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;

while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) (__SSAT(((*pSrc++) * scaleFract) >> kShift, 8));

/* Decrement the loop counter */
blkCnt--;
}

#else

/* Run the below code for Cortex-M0 */

/* Initialize blkCnt with number of samples */
blkCnt = blockSize;

while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) (__SSAT((((q15_t) * pSrc++ * scaleFract) >> kShift), 8));

/* Decrement the loop counter */
blkCnt--;
}

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

}
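1. The source data and the scale factor are both in Q7 format, so their product is in 1.7 * 1.7 = 2.14 format. Because the output must also be Q7, the result is saturated.
2. This variable is a neat trick: because kShift = 7 - shift, positive (left) and negative (right) values of shift are both handled by a single right shift.
3. The product of the source data and the scale factor is shifted and then saturated to 8 bits.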
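Marker (4) in the listing has no note of its own; the code comment there says that the four outputs are packed into one 32-bit word. The sketch below models both steps on a PC. The names scale_q7_model and pack_q7_le are hypothetical, and the byte order (first value in the lowest byte) is an assumption that corresponds to a little-endian build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of one arm_scale_q7 output sample. */
static int8_t scale_q7_model(int8_t in, int8_t scaleFract, int8_t shift)
{
    int kShift = 7 - shift;
    int32_t v;

    assert(kShift >= 0 && kShift < 31);    /* range assumed for this sketch */

    v = ((int32_t)in * scaleFract) >> kShift;
    if (v > INT8_MAX) v = INT8_MAX;        /* saturate to the Q7 range */
    if (v < INT8_MIN) v = INT8_MIN;
    return (int8_t)v;
}

/* Pack four Q7 values into one word, v[0] in the lowest byte (little-endian assumption). */
static uint32_t pack_q7_le(const int8_t v[4])
{
    assert(v != NULL);
    return  (uint32_t)(uint8_t)v[0]
         | ((uint32_t)(uint8_t)v[1] << 8)
         | ((uint32_t)(uint8_t)v[2] << 16)
         | ((uint32_t)(uint8_t)v[3] << 24);
}
```

With the values used in section 9.5.5 (input 0x70, scale factor 0x70 after the increment, shift 0) the model gives 0x62.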

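9.5.5 Worked example

Purpose:
1. Scale data of four types.
Procedure:
1. Press the DOWN key; the results are printed over the serial port.
Result:
The printed output can be viewed with the serial terminal program SecureCRT (included on the V5 CD).
Program: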
/*
*********************************************************************************************************
* Function: DSP_Scale
* Description: scaling
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Scale(void)
{
/* The array sizes and the [0] indices were lost in the forum formatting. */
/* They are restored here on the assumption that five-element arrays and element 0 were intended. */
static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
static float32_t scale = 0.0f;
static float32_t pDst[5];
static q31_t pSrcA1[5] = {0x6fffffff,1,1,1,1};
static q31_t scale1 = 0x6fffffff;
static q31_t pDst1[5];

static q15_t pSrcA2[5] = {0x6fff,1,1,1,1};
static q15_t scale2 = 0x6fff;
static q15_t pDst2[5];

static q7_t pSrcA3[5] = {0x70,1,1,1,1};
static q7_t scale3 = 0x6f;
static q7_t pDst3[5];

/* The scale factors change on every call so that each key press prints a new result */
scale += 0.1f;
arm_scale_f32(pSrcA, scale, pDst, 5);
printf("arm_scale_f32 = %f\r\n", pDst[0]);
scale1 += 1;
arm_scale_q31(pSrcA1, scale1, 0, pDst1, 5);
printf("arm_scale_q31 = %x\r\n", pDst1[0]);

scale2 += 1;
arm_scale_q15(pSrcA2, scale2, 0, pDst2, 5);
printf("arm_scale_q15 = %x\r\n", pDst2[0]);

scale3 += 1;
arm_scale_q7(pSrcA3, scale3, 0, pDst3, 5);
printf("arm_scale_q7 = %x\r\n", pDst3[0]);
printf("***********************************\r\n");
}

席萌0209 posted on 2015-3-19 10:58:11

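9.6 Important notes on BasicMathFunctions

This completes the walkthrough of the BasicMathFunctions. You may have noticed that these functions share several traits. Chapter 8 touched on them briefly; they are spelled out in more detail here:
- Almost all of these functions are reentrant.
- Almost every function comes in four data types: F32, Q31, Q15 and Q7.
- Values are generally processed in groups of four. This lets the F32, Q31, Q15 and Q7 versions share one program structure, which is easier to maintain. More importantly, it lets the Q15 and Q7 code make good use of SIMD instructions, because a group of four maps exactly onto SIMD operations that handle two Q15 values or four Q7 values at a time.
- Some functions allow the destination pointer and the source pointer to refer to the same buffer.
Earlier sections gave no example of this last point, so here is a simple one based on the scale functions from section 9.5: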
static void DSP_Scale(void)
{
/* The array sizes and the [0] indices were lost in the forum formatting. */
/* They are restored here on the assumption that five-element arrays and element 0 were intended. */
static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
static float32_t scale = 0.0f;
static q31_t pSrcA1[5] = {0x6fffffff,1,1,1,1};
static q31_t scale1 = 0x6fffffff;

static q15_t pSrcA2[5] = {0x6fff,1,1,1,1};
static q15_t scale2 = 0x6fff;
static q7_t pSrcA3[5] = {0x70,1,1,1,1};
static q7_t scale3 = 0x6f;

scale += 0.1f;
arm_scale_f32(pSrcA, scale, pSrcA, 5);          (1)
printf("arm_scale_f32 = %f\r\n", pSrcA[0]);
scale1 += 1;
arm_scale_q31(pSrcA1, scale1, 0, pSrcA1, 5);    (2)
printf("arm_scale_q31 = %x\r\n", pSrcA1[0]);

scale2 += 1;
arm_scale_q15(pSrcA2, scale2, 0, pSrcA2, 5);    (3)
printf("arm_scale_q15 = %x\r\n", pSrcA2[0]);

scale3 += 1;
arm_scale_q7(pSrcA3, scale3, 0, pSrcA3, 5);       (4)
printf("arm_scale_q7 = %x\r\n", pSrcA3[0]);
printf("***********************************\r\n");
}
In calls (1) to (4) above, the destination pointer and the source pointer refer to the same buffer.
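Why in-place operation is safe for an element-wise function can be seen from a minimal portable model. This is only a sketch: scale_f32_model is a hypothetical name and the loop is not the library code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise scaling; dst may be exactly the same buffer as src. */
static void scale_f32_model(const float *src, float scale, float *dst, uint32_t n)
{
    assert(src != NULL && dst != NULL);
    for (uint32_t i = 0u; i < n; i++) {
        dst[i] = src[i] * scale;    /* element i is read before it is overwritten */
    }
}
```

Each output depends only on the input at the same index, so passing the same pointer for source and destination gives the same result as using two buffers. A partially overlapping destination (for example src + 1) would not be safe with this model, because later inputs would already have been overwritten.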

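9.7 Summary

That is all for the BasicMathFunctions. Beginners are encouraged to practice with them and to use them in their own projects, where they can save a great deal of effort.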
