Compiler Optimizations

  • Compilers and optimization

  • This program clears N bytes of memory at address data

    void memclr(char *data, int N) {
        for (; N > 0; N--) {
            *data = 0;
            data++;
        }
    }

  • The compiler cannot know whether N will be 0 on entry

    • The compiler will test for this case explicitly

  • Compiler doesn't know if data array pointer is four-byte aligned or not

    • if four-byte aligned, it can clear four bytes at a time using an int store rather than a char

  • Compiler doesn't know if N is a multiple of 4

    • if it is, the compiler can repeat the loop body four times or store four bytes at a time using an int store.

  • The compiler has to be conservative, and developers must be aware of this:

    • areas where compiler will be conservative

    • the processor architecture that the C compiler targets

    • limits of the specific C compiler in use

  • Research done using C compilers armcc and arm-elf-gcc

    • armcc -Otime -c -o test.o test.c

    • fromelf -text/c test.o > test.txt

      GCC compiler

    • arm-elf-gcc -O2 -fomit-frame-pointer -c -o test.o test.c

    • arm-elf-objdump -d test.o > test.txt
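Returning to memclr above: when the caller can guarantee what the compiler cannot prove, those guarantees can be written into the function itself. The following is a minimal sketch, not the output of either compiler. The name memclr_words and the uint32_t pointer type are assumptions made here for illustration. It assumes the buffer is four-byte aligned (expressed through the pointer type) and that the byte count is a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word-at-a-time variant of memclr.
 * Assumptions (the caller's responsibility, not checked here):
 *   - data points to 4-byte aligned storage
 *   - nbytes is a multiple of 4
 */
void memclr_words(uint32_t *data, size_t nbytes)
{
    size_t nwords = nbytes / 4;   /* one 32-bit store clears four bytes */

    while (nwords > 0) {
        *data++ = 0;
        nwords--;
    }
}
```

With these guarantees the loop runs a quarter as many iterations as the byte-wise version, which is the saving the compiler cannot safely take on its own.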

Most ARM data processing operations are 32-bit. This is the case even though ARM processors can efficiently load and store 8-, 16-, and 32-bit data.

- Best practice is to use 32-bit data types such as int or long for local variables
    - avoid char and short there unless 8-bit or 16-bit wraparound arithmetic is actually needed
  • Avoid implicit or explicit narrowing casts

    • typically cost extra cycles

  • Avoid char and short types for function arguments or return values

  • Use explicit casts when reading array entries or global variables into local variables, or when writing local variables out to array entries.

    • makes it clear that you are taking a narrow width for fast ops

    • narrowly stored in memory, expanded and stored in registers

    int example(short* array) { // array elements are short -> narrow type

      unsigned int i;
      int sum = 0;

      for (i = 0; i < 100; i++) {
          sum += *(array++); // each element is widened to int as it is loaded
      }

      return sum;

    }

    • in the above example, the array elements have type short, which is a narrow data type. However, the accumulator sum is an int and the counter i is an unsigned int, both of which are 32 bits wide on ARM.

      • This shows that the data is narrow in memory but widened to 32 bits in a register

      • because the elements are short, the load instruction is LDRH (load halfword).

        • LDRH does not allow a shifted address offset

          • therefore, if the array were indexed as array[i], an extra instruction would be needed to calculate the address of item i; stepping the pointer as in the example avoids this
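The advice on narrow types above can be sketched as follows. This is an illustrative example only: the function name scale_samples and the gain/256 scaling are assumptions, not taken from the notes. It keeps the data as short in memory, widens each element with an explicit cast when it is read into a local, does all arithmetic in int, and narrows with an explicit cast only when writing back.

```c
#include <stddef.h>

/* Hypothetical example: scale 16-bit samples by gain/256.
 * Storage stays narrow (short); all arithmetic happens in 32-bit ints.
 * Assumes each scaled result still fits in a short. */
void scale_samples(short *samples, size_t count, int gain)
{
    size_t i;

    for (i = 0; i < count; i++) {
        int s = (int)samples[i];    /* explicit widening on load */
        s = (s * gain) / 256;       /* 32-bit arithmetic on the local */
        samples[i] = (short)s;      /* explicit narrowing only on store */
    }
}
```

The casts do not change what the compiler must do, but they mark exactly where the narrow width enters and leaves the computation.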

  • Loop optimizations

    • Variable iteration count loops

      • Loop Unrolling

        • avoid loop overhead.

        • each iteration costs two overhead instructions: the conditional branch and the instruction that updates the loop counter

        • unrolling reduces the number of times this overhead is incurred

        • caution: if the iteration count N is not a multiple of the unroll factor, the leftover iterations must be handled separately

    • Fixed-length

      • Strip Mining

        • determine the MVL (maximum vector length) of the SIMD vectors, then process the loop in strips of at most MVL elements
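The two loop techniques above can be sketched in plain C. Both functions are illustrative assumptions rather than compiler output: sum_unrolled4 unrolls by a factor of four and uses a cleanup loop for the leftover elements, and add_strip_mined splits a loop into strips of at most MVL elements, where the value 8 for MVL is a made-up placeholder for a real machine's vector length.

```c
#include <stddef.h>

/* Sum n ints with the loop body unrolled four times. */
int sum_unrolled4(const int *data, size_t n)
{
    int sum = 0;
    size_t i = 0;

    /* Main loop: one branch and one counter update per four elements. */
    for (; i + 4 <= n; i += 4) {
        sum += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    }

    /* Cleanup loop: the 0 to 3 elements left when n is not a multiple of 4. */
    for (; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

#define MVL 8   /* hypothetical maximum vector length, in elements */

/* dst[i] += src[i] for n elements, processed in strips of at most MVL. */
void add_strip_mined(int *dst, const int *src, size_t n)
{
    size_t low, len, j;

    for (low = 0; low < n; low += len) {
        /* Every strip is MVL long except possibly the last one. */
        len = (n - low < MVL) ? (n - low) : MVL;

        /* Inner loop has a bounded trip count, so it maps onto one vector operation. */
        for (j = 0; j < len; j++) {
            dst[low + j] += src[low + j];
        }
    }
}
```

In both cases the extra code exists only to cope with a length that does not divide evenly, which is why knowing the length is a multiple of the unroll factor or of MVL lets that code be dropped.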

Function calls
- four-register rule
    - calls to functions with four or fewer arguments are significantly more efficient than calls with five or more
    - the compiler can pass the first four arguments in registers

    - (ABI dependent) functions with 5 or more arguments often require both callee and caller to access the stack for some arguments.
        - the ABI also defines which registers are caller-saved and which are callee-saved
    
- If a C function has more than 4 args or a C++ method has more than 3 args
    - It is more efficient to group related arguments in a struct and pass a pointer to it
    - C++ methods have an implicit 'this' as the first arg, thus 3 explicit args is the rule of thumb

- two-word args such as long long or double are passed in a pair of consecutive arg registers
    - exploit SIMD vectorization if possible

Compare the Two:

    char *queue_bytes_v1(
        char *Q_start,      // inclusive start position
        char *Q_end,        // exclusive end position
        char *Q_ptr,        // current queue pointer
        char *data,
        unsigned int N)     // number of bytes to insert
    {
        do {
            *(Q_ptr++) = *(data++);

            if (Q_ptr == Q_end)
            {
                Q_ptr = Q_start;
            }
        } while (--N);
        return Q_ptr;
    }

Compare the ARM assembly for this with that of a similar function taking only three args.
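One way to write the three-argument version is to gather the three queue pointers into a struct and pass a pointer to it. The sketch below is an assumption about what that version could look like; the names Queue and queue_bytes_v2 are not from these notes. Like queue_bytes_v1, it requires N to be at least 1 because of the do/while loop.

```c
/* Hypothetical struct grouping the three queue pointers. */
typedef struct {
    char *Q_start;   /* inclusive start position */
    char *Q_end;     /* exclusive end position */
    char *Q_ptr;     /* current queue pointer */
} Queue;

/* Three arguments, so all of them can travel in registers.
 * The updated queue pointer is written back into the struct
 * instead of being returned. N must be at least 1. */
void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;   /* load fields into locals once */
    char *Q_end = queue->Q_end;

    do {
        *(Q_ptr++) = *(data++);

        if (Q_ptr == Q_end) {
            Q_ptr = queue->Q_start;   /* wrap around to the start */
        }
    } while (--N);

    queue->Q_ptr = Q_ptr;
}
```

The caller no longer has to place a fifth argument on the stack, and the callee no longer has to read it back.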

  • Structs

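    • a struct is aligned to its most strictly aligned member, and sizeof() the struct is rounded up to a multiple of that alignment, so it includes any padding between and after members.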

    • define members in order based on size from smallest to largest

      • this saves space by avoiding more padding than necessary.

    CORRECT                 INCORRECT
    struct packet {         struct packet {
        char  a;                char  a;
        char  b;                int   d;
        short c;                short c;
        int   d;                char  b;
    };                      };
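The effect of member order can be checked with sizeof. The sketch below is illustrative: the struct names packet_sorted and packet_unsorted and the two helper functions are assumptions introduced here. On a typical ABI with a 2-byte short and a 4-byte, 4-byte-aligned int, the sorted layout is expected to occupy 8 bytes and the unsorted one 12, but the exact numbers depend on the target.

```c
#include <stddef.h>

/* Members ordered smallest to largest: no internal padding is expected. */
struct packet_sorted {
    char  a;
    char  b;
    short c;
    int   d;
};

/* Same members in a poor order: padding is expected after a and after b. */
struct packet_unsorted {
    char  a;
    int   d;
    short c;
    char  b;
};

/* Bytes of real data in either struct. */
#define PAYLOAD_BYTES (2 * sizeof(char) + sizeof(short) + sizeof(int))

/* Padding the compiler added to each layout on the current target. */
size_t padding_bytes_sorted(void)
{
    return sizeof(struct packet_sorted) - PAYLOAD_BYTES;
}

size_t padding_bytes_unsorted(void)
{
    return sizeof(struct packet_unsorted) - PAYLOAD_BYTES;
}
```

Printing both padding values on the target of interest is a quick way to confirm whether reordering a real struct is worth it.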