Compiler Optimizations
Compilers and optimization
This program clears N bytes of memory at address data
void memclr(char *data, int N)
{
    for (; N > 0; N--) {
        *data = 0;
        data++;
    }
}
The compiler cannot know whether N is 0 on input
so it tests for this case explicitly before entering the loop
The compiler doesn't know whether the data pointer is four-byte aligned
if it is four-byte aligned, it can clear four bytes at a time using an int store rather than a char store
The compiler doesn't know whether N is a multiple of 4
if it is, the compiler can repeat the loop body four times or store four bytes at a time using an int store.
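A minimal sketch of what becomes possible once both facts are known. The function name memclr_aligned4 and its contract are hypothetical (not from the notes): the caller promises that data is four-byte aligned and that N is a nonzero multiple of 4, so no zero test and no byte stores are needed.

```c
#include <stdint.h>

/* Hypothetical variant, for illustration only.
   Contract (assumed, not checked): data is 4-byte aligned,
   N is a byte count that is a nonzero multiple of 4. */
void memclr_aligned4(uint32_t *data, unsigned int N)
{
    do {
        *data++ = 0;   /* one 32-bit store clears four bytes */
        N -= 4;
    } while (N != 0);  /* no entry test: N == 0 is excluded by the contract */
}
```

Passing the buffer as uint32_t * is how the caller states the alignment guarantee in the type, so the function never has to cast a char pointer.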
The compiler has to be conservative, and developers must be aware of this:
the areas where the compiler will be conservative
the processor architecture that the C compiler is mapping to
the limits of a specific C compiler
Research done using C compilers armcc and arm-elf-gcc
armcc -Otime -c -o test.o test.c
fromelf -text/c test.o > test.txt
GCC compiler
arm-elf-gcc -O2 -fomit-frame-pointer -c -o test.o test.c
arm-elf-objdump -d test.o > test.txt
Most ARM data-processing operations are 32-bit only. This is the case even though ARM processors can efficiently load and store 8-, 16-, and 32-bit data.
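A small sketch of why this matters for local variables. Both helper names are hypothetical. With a narrow type the result has to wrap at 8 bits, so the compiler must add work after the 32-bit add to bring the value back into range; with a register-width type the add is the whole operation.

```c
/* Hypothetical helpers for illustration; neither appears in the notes. */

/* Narrow type: the result must wrap modulo 256, so after the 32-bit add
   the value has to be truncated back to 8 bits. */
unsigned char next_narrow(unsigned char i)
{
    return (unsigned char)(i + 1);
}

/* Register-width type: the 32-bit add is the whole operation. */
unsigned int next_wide(unsigned int i)
{
    return i + 1;
}
```

The two functions only differ in behavior at the wrap point: the narrow version maps 255 to 0, while the wide version keeps counting.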
Avoid implicit or explicit narrowing casts
they typically cost extra cycles
Avoid char and short types for function arguments or return values
Use explicit casts when reading array entries or global variables into local variables, or when writing local variables out to array entries.
this makes it clear that you are taking a narrow-width value from memory and widening it for fast operations
narrow when stored in memory, expanded to 32 bits when held in registers
int example(short* array) { // array elements are short -> narrow type
}
In the above example, the array elements have datatype short, which is a narrow-width data type. However, I explicitly cast each element to an unsigned int, which is 32 bits wide.
This shows that the data is narrow in memory, while in a register it is expanded.
Because the elements are short, the load instruction is LDRH (load halfword); a char array would use LDRB (load byte).
LDRH does not allow a shifted register address offset
therefore, an extra instruction is used to calculate the address of item i in the array
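A sketch of the pattern described above, under stated assumptions: the function name sum_samples and its parameters are hypothetical, and the cast is to int rather than unsigned int so that negative samples keep their sign. The data stays narrow in memory, each element is widened once when it is loaded, and the pointer is stepped instead of indexed so no separate address calculation for item i is needed.

```c
/* Hypothetical example: sums n halfword samples. */
int sum_samples(const short *array, unsigned int n)
{
    int sum = 0;        /* 32-bit accumulator, kept in a register */
    unsigned int i;     /* register-width loop counter */

    for (i = 0; i < n; i++) {
        /* Explicit cast: narrow in memory, widened once on load.
           Stepping the pointer avoids computing array + (i << 1). */
        sum += (int)*array++;
    }
    return sum;
}
```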
Loop optimizations
Variable iteration count loops
Loop Unrolling
avoids loop overhead.
each iteration costs two overhead instructions: one to update the loop counter and one conditional branch
unrolling reduces the number of times this overhead is incurred
take care when the iteration count N is not a multiple of the unroll factor: the leftover iterations must be handled separately
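A minimal sketch of unrolling by four, including the leftover case. The function name memclr_unrolled is hypothetical; the unroll factor of 4 is an arbitrary choice for illustration.

```c
/* Hypothetical example: clears n bytes with the loop body unrolled four times. */
void memclr_unrolled(char *data, unsigned int n)
{
    unsigned int blocks = n / 4;   /* full groups of four bytes */
    unsigned int rest = n % 4;     /* leftover 0..3 bytes */

    while (blocks != 0) {
        /* Four stores share one counter update and one branch. */
        data[0] = 0;
        data[1] = 0;
        data[2] = 0;
        data[3] = 0;
        data += 4;
        blocks--;
    }
    while (rest != 0) {            /* handle what the unrolled loop missed */
        *data++ = 0;
        rest--;
    }
}
```

If the caller can guarantee that n is a multiple of 4, the second loop can be dropped entirely.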
Fixed-length
Strip Mining
determine the maximum vector length (MVL) of the SIMD vectors, then process the data in strips of at most MVL elements
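Strip mining can be sketched in plain C as an outer loop over strips and an inner loop of at most MVL elements. The value MVL = 4 and the function name add_strip_mined are assumptions for illustration; on real hardware the inner loop would correspond to one vector operation.

```c
#define MVL 4u  /* assumed maximum vector length in elements (hypothetical value) */

/* Hypothetical example: dst[i] = a[i] + b[i] for i in [0, n). */
void add_strip_mined(int *dst, const int *a, const int *b, unsigned int n)
{
    unsigned int low = 0;

    while (low < n) {
        /* Length of this strip: MVL, or whatever is left at the end. */
        unsigned int len = (n - low < MVL) ? (n - low) : MVL;
        unsigned int i;

        /* Inner loop never exceeds MVL iterations. */
        for (i = 0; i < len; i++) {
            dst[low + i] = a[low + i] + b[low + i];
        }
        low += len;
    }
}
```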
Function calls
four-register rule
calls to functions with fewer than 5 arguments are significantly more efficient than those with 5 or more.
the compiler can pass the first four arguments in registers; any further arguments are passed on the stack
Compare the Two:
char *queue_bytes_v1(
    char *Q_start,      // inclusive start position
    char *Q_end,        // exclusive end position
    char *Q_ptr,        // current queue pointer
    char *data,
    unsigned int N)     // number of bytes to insert
{
    do {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end) {
            Q_ptr = Q_start;    // wrap around at the end of the buffer
        }
    } while (--N);
    return Q_ptr;
}
compare the ARM assembly to a similar func with only three args
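One way to get down to three arguments is to group the related queue pointers into a struct and pass a pointer to it. This is a sketch under assumptions: the Queue type and the name queue_bytes_v2 are hypothetical, and the wrap-around behavior assumes a circular buffer whose end position is exclusive.

```c
/* Hypothetical grouping of the three queue pointers. */
typedef struct {
    char *Q_start;  /* inclusive start position */
    char *Q_end;    /* exclusive end position */
    char *Q_ptr;    /* current queue pointer */
} Queue;

/* Three arguments: all of them fit in registers. */
void queue_bytes_v2(Queue *queue, const char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;   /* work on local copies inside the loop */
    char *Q_end = queue->Q_end;

    while (N != 0) {
        *Q_ptr++ = *data++;
        if (Q_ptr == Q_end) {
            Q_ptr = queue->Q_start;   /* wrap around */
        }
        N--;
    }
    queue->Q_ptr = Q_ptr;         /* write the updated position back once */
}
```

The caller also gets simpler, since it sets up the struct once instead of loading every pointer for each call.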
Structs
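the sizeof() a struct is at least the sum of its members' sizes: the compiler adds padding so that each member is aligned, and the struct itself is aligned to its most strictly aligned member.
define members in order based on size from smallest to largest
this will optimize space by avoiding more padding than necessary.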
CORRECT                 INCORRECT
struct packet {         struct packet {
    char a;                 char a;
    char b;                 int d;
    short c;                short c;
    int d;                  char b;
};                      };
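The effect of member order can be checked with sizeof. In this sketch the two layouts get distinct hypothetical names (packet_ordered, packet_unordered) so they can live in one file. On a typical ABI with 1-byte char, 2-byte short, and 4-byte int, the ordered layout needs no padding, while the unordered one needs padding after a and after b.

```c
#include <stddef.h>

/* Hypothetical names; member lists match the two layouts above. */
struct packet_ordered   { char a; char b; short c; int d; };
struct packet_unordered { char a; int d; short c; char b; };

/* Padding = total size minus the sum of the member sizes. */
size_t padding_ordered(void)
{
    return sizeof(struct packet_ordered)
         - (2 * sizeof(char) + sizeof(short) + sizeof(int));
}

size_t padding_unordered(void)
{
    return sizeof(struct packet_unordered)
         - (2 * sizeof(char) + sizeof(short) + sizeof(int));
}
```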