INTRODUCTION
Probably the largest single challenge facing a software engineer porting an algorithm to a particular platform is the need to optimize the algorithm for that platform. Algorithm descriptions, by themselves, are rarely optimized, in part because the appropriate optimizations depend on the intended platform. For example, an ASIC implementation may benefit from segmenting the algorithm into parallel operations performed by separate execution units, using minimal memory. A PC implementation, on the other hand, is likely to be single threaded yet have access to large amounts of memory, allowing retention of previously calculated partial results and liberal use of inline subroutines. Often the tradeoff comes down to a question of speed versus memory usage, with the latter further subdivided into RAM and ROM usage.
Given the enormous range of available choices, it behooves the system designer to adopt an organized approach to the optimization of an algorithm. In practice, a top-down methodology can provide a useful structure for addressing and organizing the optimization process. This means evaluating the optimization possibilities first by considering the algorithm in its most general form, then through a sequence of successively finer-grain representations, applying appropriate optimizations where possible at each stage. This process can be considered to consist of four general phases: optimization at the algorithmic level, optimization of each major functional block, optimization at the subroutine level, and optimization of actual line-by-line code. The boundaries between some of these phases may not always be well defined, and they may interact or even conflict.
TOP DOWN DESIGN
Optimization of individual functions can often yield significant gains, as the designer potentially has the option of substituting entirely different code, as long as it yields equivalent results. In some cases this can be as simple as employing a lookup table in place of a calculation, or vice versa. However, more extensive alterations are sometimes possible, such as devising a way to replace an existing binary search with a simple feed-forward calculation. Another useful strategy can be to arrange for different functional blocks to share common data or computational resources: two different functions sharing a common lookup table can halve the ROM that table requires.
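As a concrete sketch of the lookup-table substitution described above (the squaring function here is a hypothetical stand-in for whatever per-sample calculation is being replaced):

```c
#include <assert.h>

/* Hypothetical example: replacing a per-sample calculation with a
   lookup table.  The 256-entry table is built once; the per-sample
   cost then drops from a multiply to a single load.  Two different
   functions could share this same table to save ROM. */
static unsigned square_table[256];

static void init_square_table(void)
{
    for (unsigned i = 0; i < 256; i++)
        square_table[i] = i * i;
}

static unsigned square_lut(unsigned char x)   /* table lookup */
{
    return square_table[x];
}

static unsigned square_calc(unsigned char x)  /* direct calculation */
{
    return (unsigned)x * x;
}
```

Whether the table or the calculation wins depends on the platform: the table trades ROM for cycles, exactly the tradeoff discussed above.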
Optimization at the subroutine level involves consideration of just how a function is best divided into subroutines. Obviously, there are likely to be savings in code space if a given subroutine can be used by multiple sections of the program. Optimal subroutine segmentation can also reduce memory requirements by facilitating non-overlapping reuse of array memory. Where advantageous, use of inline code can usefully reduce the overhead of formal subroutine calls.
Much optimization is likely to take place at the level of individual lines of code. This can involve considerable juggling of the order of specific operations to satisfy the pipeline requirements of the intended processor. The required numerical accuracy of each calculation should be considered against the capabilities of the hardware platform. If a calculation can be done using integer arithmetic in place of floating point math, or floating point math in place of double precision, there are likely to be savings in both memory and MIPS. Also, some processors benefit from performing block-oriented operations, since common factors, like filter coefficients, can be held in processor registers, or at least primary cache, during the processing of the block. And some specific calculations lend themselves to numerical simplification. For example, the first and last butterflies of an FFT involve coefficients of only ±1, so they can be done without multiplication, only addition. Integer multiplies by powers of two are best performed with a low cycle count shift-left operation. Performance evaluation tools, like profilers, can indicate especially sensitive, time-critical sections of code. Rewriting these sections from C to assembler, where applicable, can sometimes yield significant speed improvements.
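The power-of-two observation above can be shown in C (a trivial sketch; most compilers make this substitution automatically, but it is worth recognizing when hand-scheduling inner loops):

```c
#include <assert.h>

/* An integer multiply by a power of two reduces to a left shift.
   The factor 8 = 2^3 is chosen only for illustration. */
static int mul_by_8_shift(int x)
{
    return x << 3;        /* x * 8 as a single shift */
}

static int mul_by_8_mult(int x)
{
    return x * 8;         /* reference version */
}
```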
Ultimately, the full optimization of an algorithm requires an almost encyclopedic knowledge of both the concepts and the inner workings of the algorithm. Fortunately, by following an organized top-down mode of analysis, the primary available optimizations of the algorithm, in both general and processor-specific terms, can reliably be identified and implemented.
ALGORITHMS
The first step in optimizing an algorithm is to read the theoretical specification and understand what each algorithm block is actually doing. Can a faster method be found? Numerical Recipes in C is an excellent text for finding C code for numerical analysis functions. Keeping up to date on new fast FFTs, wavelets, searches, hash functions, etc., is an essential skill. A literature search and an Internet hunt can uncover many gems to improve an implementation. And if it turns out that no one has yet come up with a clever way to reduce the computational complexity of a particular DSP module, applying one's own mathematical skill to reduce the equations is guaranteed to create stardom if something new is discovered.
Here is an interesting question: "For this particular data set, can the DSP module be simplified?" Sometimes computational complexity can be reduced by analyzing the application itself. For example, a standard may possess wideband capability, but the particular application will never utilize wideband. Removing the high band manipulations from the code will therefore reduce the computation time considerably without affecting quality. In practice, most source code has already done this for you, but the example suggests the question to keep in mind throughout the optimization quest: will the essential nature of the algorithm be diminished by removing this feature?
Similarly, limiting filter coefficient length often alters the frequency response only inconsequentially. The simplification can save ROM space, multiplier register length, and MIPS.
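A minimal sketch of coefficient truncation, assuming a direct-form FIR with hypothetical tap counts (whether the response change is truly inconsequential must of course be verified against the specification):

```c
#include <assert.h>

/* Direct-form FIR.  Passing a smaller 'taps' value truncates the
   coefficient table, saving ROM and multiplies at the price of a
   (hopefully inconsequential) change in frequency response.
   Tap counts and coefficients here are placeholders. */
static float fir(const float *h, int taps, const float *x, int n)
{
    float acc = 0.0f;
    for (int k = 0; k < taps && k <= n; k++)   /* one MAC per tap */
        acc += h[k] * x[n - k];
    return acc;
}
```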
There are many different ways to implement an FFT (decimation in time or frequency, different radixes, etc.). Rarely is a fully general FFT required. The best choice for a given application is the one with the fewest operations that fits well on the implementation architecture. For example, if the data vectors are of length 4^n, then a radix-4 FFT implementation will be faster than a radix-2 FFT.
Another question is, "If the algorithm uses vector quantization, can the VQ tables be reduced when some are never used by the application?" For example, Twin VQ has different VQ tables for different audio sampling rates and bit rates. If one is never broadcasting at those particular bit rates or sample rates, then there is no reason to include those VQ tables in the data set.
"Can quality reduction be performed to match the capabilities of the I/O hardware?" There is no reason to use a 24-bit audio data type if, at the output, the audio codec will only deliver 50 dB SNR. Product marketing issues aside, it is important to analyze the algorithm for the number of bits of data required to meet a desired audio quality without internal bias or distortion. One can then reduce the implementation's precision accordingly and gain a considerable speedup in execution time.
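A sketch of the precision-reduction idea, assuming hypothetical 24-bit samples carried in an int (at roughly 6 dB per bit, a ~50 dB output path makes the bottom bits of a 24-bit sample inaudible):

```c
#include <assert.h>

/* Reduce a 24-bit sample to 16 bits by discarding the 8 least
   significant bits.  If the output hardware cannot reproduce the
   discarded precision anyway, the narrower data type costs nothing
   in quality and can speed up the whole processing chain. */
static short reduce_24_to_16(int s24)      /* s24 holds a 24-bit value */
{
    return (short)(s24 >> 8);              /* keep the top 16 bits */
}
```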
Here is an algorithm optimization example from the Yamaha FM synthesis patents, as described in Kahrs et al. (1998).
Figure 1. FM synthesis equation (Chowning, 1973)
This is a typical equation one might come across in reading an algorithm specification. In Kahrs et al. (1998), Kahrs analyzed the Yamaha patents (Uchiyama and Suzuki, 1998) for computing Chowning's FM synthesis technique via logarithmic table lookup. The technique implies a processor architecture with large memory, where multiplies and direct computation of sinusoids are MIPS (CPU) intensive and not feasible. The example illustrates equation reduction and algorithmic optimization: the FM synthesis technique is taken to a different, equivalent mathematical domain, and the required calculations are performed there. The steps are shown below.
Figure 2. Logarithmic FM synthesis calculation (Kahrs et al., 1998, p. 224)
Taking the steps analyzed by Kahrs, for every y[n] we have:
Here is an example equation from the MPEG-2 AAC specification (Bosi et al., 1996). A more basic, if redundant, technique is illustrated with the correlation function.
Figure 3. Typical Equation found in an Algorithm Standard
Assuming one wishes to implement correlation directly in the time domain, the equation is analyzed as follows. The next figure shows the same equation, simplified to expose exactly which operations correlation performs. In the top summation, two separate arrays are multiplied element by element and summed. In the bottom, the same array elements are squared and summed. Finally, the two summation results are divided and the quotient is stored in the correlation array for that index.
Figure 4. Equation indices readjusted for easier reading
Again, the next step, assuming that no algorithmic optimization applies, is to count the number of primitive operations.
In this case we have for the outer loop:
    1 store
    1 divide
Counting for the inner loop there are:
    2 loads
    1 copy
    2 multiplies
    2 adds
There are N_CRB1 inner loop iterations and N_SF outer loop executions.
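The operations tallied above can be rendered as a C sketch. The array names and the loop bounds N_SF and N_CRB1 are assumptions for illustration; the exact equation is the one given in the standard:

```c
#include <assert.h>

#define N_SF   3   /* outer loop count: assumed value for illustration */
#define N_CRB1 4   /* inner loop count: assumed value for illustration */

/* Time-domain correlation as described in the text: the numerator
   sums products of the two arrays, the denominator sums squares of
   one of them, and the quotient is stored per index. */
static void correlate(float a[N_SF][N_CRB1],
                      float b[N_SF][N_CRB1],
                      float corr[N_SF])
{
    for (int sf = 0; sf < N_SF; sf++) {
        float num = 0.0f, den = 0.0f;
        for (int i = 0; i < N_CRB1; i++) {   /* 2 loads, 2 mults, 2 adds */
            num += a[sf][i] * b[sf][i];
            den += b[sf][i] * b[sf][i];
        }
        corr[sf] = num / den;                /* 1 divide, 1 store */
    }
}
```

Counting the statements in each loop body recovers the tallies listed above.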
(Incidentally, there is a further algorithmic optimization that performs correlation in the frequency domain (Gardner, 1994), but delving into this technique is beyond the scope of this paper.)
BEAN COUNTING
Now on to the more fundamental "how to" skills of this paper. Most DSP people like challenges and complex problems, but sometimes a simple, good accounting procedure performs optimization better than searching for the ultimate MDCT technique. Using an Excel© spreadsheet, the primitive instructions of the algorithm's signal-processing module are tallied. All signal processing architectures implement loads, stores, multiplies, additions, subtractions, divisions and comparisons in some form or another. By breaking the DSP module into these computational primitives, a specific mapping onto a particular architecture can be performed. For example, a multiply on a Pentium may be a sequential function, yet on a VLIW architecture an array of multiplies is performed in one special DSP instruction called a "vector multiply". This is why an Excel spreadsheet is used: computational primitives can be mapped onto cycle counts per particular processor architecture via the symbol capability of Excel formulas. For Intel Pentium floating point, mapping the correlation primitives to cycle counts should look like this:
Figure 5. Excel Spreadsheet Cycle Counts for Correlation Function
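The same bookkeeping can be sketched in C instead of Excel; the per-primitive cycle costs below are placeholder values for illustration, not measured Pentium numbers:

```c
#include <assert.h>

/* Map per-primitive tallies onto per-architecture cycle costs.
   Swapping in a different cycles_per_op table retargets the estimate
   to another processor, just as the spreadsheet symbols do. */
enum { OP_LOAD, OP_STORE, OP_MUL, OP_ADD, OP_DIV, OP_COUNT };

/* Assumed costs, one entry per primitive above. */
static const int cycles_per_op[OP_COUNT] = { 1, 1, 3, 1, 39 };

static long total_cycles(const long tally[OP_COUNT])
{
    long sum = 0;
    for (int op = 0; op < OP_COUNT; op++)
        sum += tally[op] * cycles_per_op[op];
    return sum;
}
```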
For each function or subroutine, the algorithm specification is examined, equations are reduced, the overhead of C code is accounted for, assumptions are made (such as no cache hits), and the primitive instructions are tallied. This way one can keep track of the overall cycle counts. The calculation for the flt mult block looks like this: =240*2*J3. This formula states that there are two multiplies, executed 240 times (the value of the inner loop, N_CRB1), times J3, the number of calls this routine has per block of data to be encoded. In this case, the outer loop, N_SF, is executed 240 times per block of data, or audio frame, to be encoded.
Another interesting equation is the CPU % total. In this case, the equation reads SUM(C10:I10)*1/266000000*100*240/8000. This equation computes the overall CPU usage per block of audio data at an 8 kHz sampling rate. A block of data is considered to be 240 samples, and the processor is a 266 MHz Pentium. Although this equation is here for illustration purposes only, it does show how to calculate the overall time an algorithm takes to process a frame of data.
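The CPU-percentage bookkeeping can also be written directly in C as time used over time available; this is a sketch using the 266 MHz clock, 240-sample block, and 8 kHz rate quoted above:

```c
#include <assert.h>

/* CPU load for one block: seconds of CPU time spent, divided by the
   real-time duration the block represents, times 100. */
static double cpu_percent(double cycles_per_block,
                          double clock_hz,
                          double samples_per_block,
                          double sample_rate_hz)
{
    double block_seconds = samples_per_block / sample_rate_hz;
    double seconds_used  = cycles_per_block / clock_hz;
    return seconds_used / block_seconds * 100.0;
}
```

For example, 7,980,000 cycles per 240-sample block at 8 kHz on a 266 MHz processor consumes the entire real-time budget (100%).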
THE IMPLEMENTATION
Now that a theoretical minimum can be determined simply by keeping a tally of the primitive operations required to execute the algorithm, the next view is the real world. There are a few initial questions to ask: "What accuracy is required?" "What type of filter is the module implementing?" "What type is the input to this filter?" "Where does the output go?" "What math is involved?" For example, one may not need a 32-bit or 64-bit data type for many calculations, but one might still consider keeping a data type as long (32 bits on Intel Architecture) for the purposes of 32-bit data alignment. On Intel processors, retrieving 16-bit data types (a short on Intel Architecture) from memory causes stalls in instruction execution. Therefore, it is better to leave all data types as 32-bit, if possible. Another example: float data types on the Pentium seem to run faster than double, even though the instruction cycle count is the same. This is because a double on Intel Architecture is 64 bits; memory access takes twice as long, since twice as many bytes are read from memory and loaded into the cache compared to the float data type. Once the data is in the cache, a double does perform similarly to a float on Intel Architecture.
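A small sketch of the data-type advice, with hypothetical struct names; the sizes asserted are those of typical 32-/64-bit platforms, not guarantees of the C standard:

```c
#include <assert.h>

/* 16-bit fields halve the storage but can cause the misaligned,
   stalling loads described above. */
struct sample_state_packed {
    short gain;
    short pitch;
};

/* 32-bit fields keep every access naturally aligned. */
struct sample_state_wide {
    int gain;
    int pitch;
};

/* float moves half as many bytes through the memory system as
   double, which is why it tends to run faster outside the cache. */
```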
OVERALL PERFORMANCE
The question now becomes "what am I trying to do?" In other words, spending six months squeezing 5% out of a working C implementation of AC-3 that takes only 7% of the total CPU anyway might be time better spent reading Dilbert cartoons. Would the boss notice the difference? Or does this 5% optimization determine whether the application works on a specific processor at all? Again, the top-down approach is the best methodology for answering this question. By measuring the code and focusing first on the bottlenecks, one can reduce the CPU time significantly in a few hours and impress one's friends at the same time.
PROFILERS AND TIMERS
Profilers and timers are critical tools for optimizing code. Using timers, one can discover how long the algorithm takes to execute overall. If timer calls are inserted before and after functions, one discovers how much each routine contributes to the overall time, thus targeting certain DSP modules for optimization. Modules known to be computationally intensive, such as the FFT, are often not the places where most cycles are actually being wasted. A good profiler will immediately bring this fact to light.
Profiling code in real time and profiling the subroutines of the algorithm will show where the cycles are being used. For example, suppose the Excel spreadsheet is theoretically correct, yet upon running the code one discovers that the algorithm executes 20% slower than originally anticipated. What went wrong?
Certainly cache misses and out-of-order execution may be responsible. Or the Excel spreadsheet may have been calculated inaccurately. The issue might even be that the processor documentation states an incorrect cycle count.
Here are some wonderful tools for Intel Architecture to assist in measuring and tuning code for speed and memory. VTune is a tool provided by Intel that is reasonably priced. VTune is useful, but always verify its results with at least a rough wall-clock measurement. Further information about VTune is available at: http://developer.intel.com.
Another handy tool is Wintop. Wintop is a nice little applet that reports the CPU percentage of each thread in real time. Again, not to be trusted absolutely, but a handy rough check.
To obtain a realistic rough timing measure, run a large data file through the algorithm and simply use a stopwatch to measure the time. The amount of time the data file represents can then be used to calculate the overall CPU usage.
There are also MSVC timer functions for measuring individual subroutines, loops and overall application performance. See MSVC help for more information.
The most accurate timer code for Intel Architecture is an assembly instruction that reads the timestamp counter. This counter keeps an accurate count of the cycles executed on the processor. The assembly instruction is called rdtsc and there is also a subroutine that models this instruction.
One can build macros and timing reports based on this routine and create a completely customizable profiler for the particular algorithm being optimized. See http://developer.intel.com/drg/pentiumII/appnotes/RDTSCPM1.HTM for more details and the code itself.
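Here is a sketch of such a timer routine, using GCC-style inline assembly for rdtsc (MSVC users would use the __asm block or the routine from the Intel application note instead); the clock() fallback is only there to keep the sketch portable to non-x86 hosts:

```c
#include <assert.h>

/* Read the Pentium time-stamp counter: a 64-bit count of cycles
   executed since reset.  Differencing two readings around a code
   section gives its cost in cycles. */
#if defined(__i386__) || defined(__x86_64__)
static unsigned long long read_cycle_counter(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}
#else
#include <time.h>
static unsigned long long read_cycle_counter(void)
{
    return (unsigned long long)clock();   /* coarse non-x86 fallback */
}
#endif
```

Wrapping this routine in begin/end macros that accumulate and print per-function totals yields the customizable profiler described above.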
C CODE AND COMPILERS
The first thing to do when looking to optimize code is to play with the compiler optimization flags. For Intel Architecture, a nice set of flags to try is /Oxa. Often it is not /O2, the MSVC optimize-for-speed flag, that gives the best performance. One does need to be careful with compiler flags, for they often make certain assumptions about your application and code. For example, one flag is /Oa, assume no aliasing. To use this flag, no memory location may be accessed through two different variable names.
To check whether the C code optimizations written next are actually the best solution, one can either use VTune or generate a .asm file from the Microsoft MSVC build environment and verify for oneself that the code is completely pipelined.
This implies that one knows the architecture of the floating point unit and also how the compiler generates assembly. The floating point units of the Pentium and PII have a stack architecture of depth 8. The PII does have different clock counts than the Pentium (Fog, 1998).
The example below compiles a MAC (multiply accumulate) for blended (that is, all) Intel Architectures. The MSVC compiler flag for all processors is /GB.
In C:
sum = 0;
for (i = 0; i < N; i++)
    sum += a[i]*b[i];
Figure 6. Basic multiply accumulate function used in many FIR filters
The output of the MS VC++ 5.0 compiler for the original MAC:
loop:
    fld    DWORD PTR _a$[esp+eax+400]
    fmul   DWORD PTR _b$[esp+eax+400]
    add    eax, 4
    cmp    eax, 400
    faddp  ST(1), ST(0)
    jl     loop                ; 6 clocks per MAC
    fstp   ST(0)
Figure 7. Basic multiply accumulate function assembly code
This file was generated using these MSVC directives: release build, assembly listing in .asm file, compiler optimization "maximize speed" (/O2).
Here is the MAC C code, rewritten to map better onto Intel Architecture.
for (i = 0; i < N; i += 4)
{
    sum += a[i]*b[i]
         + a[i+1]*b[i+1]
         + a[i+2]*b[i+2]
         + a[i+3]*b[i+3];
}
Figure 8. Basic multiply accumulate function unrolled
Rule: If N%4 == 0 and throughput is 4 then UNROLL loop by 4. 
The resulting MSVC++ 5.0 assembly output for the unrolled MAC described in Figure 8:
loop:
    fld    DWORD PTR _a$[esp+eax+324]
    fld    DWORD PTR _a$[esp+eax+320]
    fmul   DWORD PTR _b$[esp+eax+320]
    fld    DWORD PTR _a$[esp+eax+328]
    fxch   ST(2)
    fmul   DWORD PTR _b$[esp+eax+324]
    fld    DWORD PTR _a$[esp+eax+332]
    fxch   ST(3)
    fmul   DWORD PTR _b$[esp+eax+328]
    fxch   ST(1)
    faddp  ST(2), ST(0)
    fxch   ST(2)
    fmul   DWORD PTR _b$[esp+eax+332]
    fxch   ST(2)
    faddp  ST(1), ST(0)
    fxch   ST(1)
    add    eax, 16
    cmp    eax, 320
    faddp  ST(1), ST(0)
    faddp  ST(1), ST(0)
    jl     loop
    fstp   ST(0)
Figure 9. Basic MAC unrolled MSVC generated assembly
This code is perfectly pipelined at 3 clocks/MAC on a Pentium. No assembly was ever written.
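The unrolling rule above assumed N % 4 == 0; a sketch of the general case adds a remainder loop so the routine is correct for any N:

```c
#include <assert.h>

/* Dot product (a chain of MACs) unrolled by 4, with a scalar
   remainder loop to mop up the N % 4 leftover elements. */
static float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4)          /* unrolled by 4 */
        sum += a[i]     * b[i]
             + a[i + 1] * b[i + 1]
             + a[i + 2] * b[i + 2]
             + a[i + 3] * b[i + 3];
    for (; i < n; i++)                 /* remainder loop */
        sum += a[i] * b[i];
    return sum;
}
```

The remainder loop costs at most three extra iterations per call, which is negligible next to the pipelining gain inside the unrolled body.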
ASSEMBLY
When speed is absolutely critical and all other techniques have been tried, it is time to write assembly code. The MAC, the most common operation in DSP, in C:
for (i = 0; i < N; i++)
    sum += a[i]*b[i];
How fast can this be done on the Pentium? 
Output of MS VC++ 5.0 compiler:
loop:
 1    fld    DWORD PTR [ecx]   ; a[i]
 3    fmul   DWORD PTR [edx]   ; b[i]
      add    edx, 4            ; separate integer,
      add    ecx, 4            ;   concurrent code
      dec    eax               ; loop counter
 1    faddp  ST(1)             ; add product to sum
 1    jne    loop
Figure 10. Non-pipelined multiply accumulate with clock counts
This code takes 6 clocks/MAC. 
FADD needs the result of FMUL, so FADD must wait out the 3-cycle multiplier latency. The loop is also not unrolled, but only 1 clock would be saved by unrolling it. If the routine is pipelined, the FADD/FMUL latency dependency is eliminated.
A MAC computes s = sum(x[i]*y[i]), where i ranges over a finite limit. If we assume limited summing, with the indices reversed, the expansion becomes x[i]*y[i] + x[i-1]*y[i-1] + x[i-2]*y[i-2] + ...
Each x[i-n]*y[i-n] product can be computed independently of the additions. This naturally leads to a grid methodology for optimal scheduling. Each multiply is laid out, noting when its result becomes available to add to the running sum. Any gap in a column implies a wait, or stall, in the pipeline, i.e. a wasted CPU cycle.
           i=0      i=1      i=2      ...
clock 1    fld
clock 2    fmul
clock 3             fld
clock 4             fmul
   .       fadd
   .                         fld
   .                         fmul
   .                fadd
   .
   .
Figure 11. Grid of assembly instructions to assure no pipeline "holes".
Each row of the grid represents one clock. Each column represents the ith iteration of the for loop. The conditions below must be satisfied to optimally pipeline the grid.
Stack Contents

clocks  instruction                      ST(0)    ST(1)    ST(2)
        (loop entry)                     s(n-2)   m(n-1)
1       fld   DWORD PTR [esi+ebx+K]      a(n)     s(n-2)   m(n-1)
1       fmul  DWORD PTR [edi+ebx+K]      m(n)     s(n-2)   m(n-1)
        fxch  ST(2)                      m(n-1)   s(n-2)   m(n)
1       faddp ST(1)                      s(n-1)   m(n)
Figure 12. Showing MAC terms on the floating point unit stack
This shows that the cycle count for a perfectly pipelined multiply accumulate is 3 clocks on a Pentium. The ST(i) columns show the depth of the floating point unit stack and illustrate which elements of the array are on the stack at a given time through the loop. s(n) is the running sum, m(n) is the multiplied result, and a(n) is one of the operands loaded onto the floating point stack.
SUMMARY
A top-down design methodology for algorithm optimization was presented. The general technique was described, along with an algorithm optimization example and a basic DSP module equation. The use of profiling tools, simple C code optimizations, and compiler optimization flags for Intel Architecture was also presented. Writing Pentium floating point assembly was discussed as the last step in the optimization design process.
ACKNOWLEDGEMENTS
The author of this paper would like to thank Mike Keith, Jeff Kidder, John Stewart, the musicdsp@shoko.calarts.edu mailing list, and especially Mark Davis for valuable insights, tidbits of great technique, criticisms, and information.
REFERENCES
Kahrs, M.; Brandenburg, K., editors. 1998. Applications of Digital Signal Processing to Audio and Acoustics. Kluwer Academic Publishers, Norwell, Massachusetts.
Chowning, J. 1973. The Synthesis of Complex Audio Spectra by Means of Frequency Modulation. Journal of the Audio Engineering Society, 21(7): 526-534.
Uchiyama, Y.; Suzuki, H. 1998. Electronic Musical Instrument Forming Tones by Wave Computation. U.S. Patent 4,616,546.
Bosi, M.; Brandenburg, K.; Quackenbush, S.; Dietz, M.; Johnston, J.; Herre, J.; Fuchs, H.; Oikawa, Y.; Akagiri, K.; Coleman, M.; Iwadare, M.; Lueck, C.; Gbur, U.; Teichmann, B. 1996. IS 13818-7 (MPEG-2 Advanced Audio Coding, AAC).
Pentium Family User's Manual, Volume 3: Architecture and Programming Manual.
Optimizations for Intel's 32-Bit Processors, Intel Application Note AP-500.
VTUNE 3.0 Intel Architecture Information Library Documentation.
Fog A. 1998. Assembly Resources, Available via WWW: <URL: http://agner.org/assem/>.
Lee M. E. 1997. Optimization of Computer Programs in C, Available via WWW: <URL: http://www.ontek.com/mikey/optimization.html>.
Gardner, W. G. 1994. Efficient convolution without inputoutput delay. Presented at the 97th convention of the Audio Engineering Society, San Francisco. Preprint 3897.
Hyde, R. 1998. Randall Hyde's Assembly Language Page, Available via WWW: <URL: http://webster.ucr.edu/Page_asm/>.
Abrash, M. 1994. Zen of Code Optimization, Coriolis Group Books, Scottsdale, Arizona.
Eckerdal, J. 1999. The Assembly Gems Page, Available via WWW: <URL: http://www.df.lth.se/~john_e/>.
Hseih, P. 1998. Programming Optimization, Available via WWW: <URL: http://www.geocities.com/SiliconValley/9498/optimize.html>.
Oppenheim, A. V.; Schafer, R. W. 1989. Discrete-Time Signal Processing, Prentice-Hall, Inc., Englewood Cliffs, New Jersey.
Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. 1997. Numerical Recipes in C, Second Edition, Cambridge University Press, New York, NY.