Poor GPIO performance can be problematic when evaluating application-specific code on any microcontroller, let alone members of the EFM32 family. However, as noted in the Knowledge Base article Slow GPIO Toggling on EFM32, the source of the poor performance is not the EFM32 microcontroller but the example projects provided in Simplicity Studio defaulting to the Debug build configuration. Together, the absence of any compiler optimizations, the addition of code both to maintain the scope of local variables and provide association between C and assembly language statements, and the compilation of emlib debugging assertions, contribute to substantially slower code execution, especially in the case of GPIO toggling.
Switching to the Release build configuration compiles all code, including the GPIO functions provided by emlib, with full optimizations. Unfortunately, debugging then becomes a challenge because of the resulting loose correspondence between C source code and assembly. Can a happy medium be achieved whereby debugger source correspondence is maintained without the dramatic slowdown of pin set and clear operations?
Switching to a less aggressive level of optimization would seem to be a sensible first step, but as the following screen shows for code compiled with -O1, there's already a loss of source code and debugger correspondence:
Macro substitution performed for GPIO_PinOutSet() and GPIO_PinOutClear() results in efficient str and str.w instructions, but the movs that loads the necessary bit mask is actually performed before the call to WDOGn_Feed(). The problem arises when attempting to set a breakpoint on either of the GPIO functions: the breakpoint instead ends up on the ldr instructions that read the parameters for the WDOGn_Feed() call. Traceability is also lost because any attempt to step over either of the GPIO functions results in the processor resuming execution instead.
While -O1 optimization does not help in this case, optimizing for size with -Os is worth considering. The compiled code looks similar:
Breakpoints intended for the GPIO functions still end up being set on the WDOGn_Feed() call, but attempting to step over WDOGn_Feed() does nothing. Instead, the debugger must be switched to assembly language stepping to execute the ldr and bl instructions. It can then be switched back to C source level debugging, and stepping over GPIO_PinOutSet() and GPIO_PinOutClear() will succeed.
Unfortunately, traceability with -Os optimization is hit or miss. There's no way to know which C source statements can be single-stepped, which must be stepped at the assembly language level, and which will just end up causing the debugger to resume execution.
At this point, stepping away from emlib while still building the project in the Debug configuration might be worthwhile. Why do this? As noted above, having debug features imposes a certain amount of overhead, some of which (like assertions) is attributable to emlib.
Maybe the way to better GPIO performance is to get as close to the hardware as possible, seeing as emlib's job is to provide a certain level of hardware abstraction. This means manipulating GPIO pins directly via their port data, input, set, clear, and toggle registers. Consider what happens when GPIO_PinOutSet() and GPIO_PinOutClear() are replaced with writes to the relevant port DOUTSET and DOUTCLR registers:
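As a rough sketch of what such direct writes look like, the fragment below uses a stand-in register block so that it also compiles on a host PC. The MockGpioPort type, the mock_gpio array, the PORT_x names, and the pin_set_direct()/pin_clear_direct() helpers are hypothetical names introduced only for illustration; on target hardware the device header supplies the real GPIO structure and the write would take the form GPIO->P[gpioPortD].DOUTSET = 0x1.

```c
#include <stdint.h>

/* Stand-in for one port's output registers. Assumption: only the fields
   used here are modeled; the real layout comes from the device header. */
typedef struct {
    volatile uint32_t DOUT;
    volatile uint32_t DOUTSET;
    volatile uint32_t DOUTCLR;
    volatile uint32_t DOUTTGL;
} MockGpioPort;

/* Hypothetical port indices standing in for gpioPortA..gpioPortD. */
enum { PORT_A, PORT_B, PORT_C, PORT_D, PORT_COUNT };

static MockGpioPort mock_gpio[PORT_COUNT];

/* Set one pin: a single store of the bit mask to DOUTSET.
   On real hardware this store drives the pin high; the mock only
   records the value written and does not update DOUT. */
static inline void pin_set_direct(unsigned port, unsigned pin)
{
    mock_gpio[port].DOUTSET = (uint32_t)1 << pin;
}

/* Clear one pin: a single store of the bit mask to DOUTCLR. */
static inline void pin_clear_direct(unsigned port, unsigned pin)
{
    mock_gpio[port].DOUTCLR = (uint32_t)1 << pin;
}
```

Each helper reduces to an address load, a mask load, and a store, which is the pattern to look for in the disassembly.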
Even though built with the Debug configuration, this code shows a very simple and deterministic sequence of assembly language instructions for the set and clear operations. It's also easy to see how the compiler would likely optimize this by reorganizing the instructions such that a single ldr fetches the GPIO base register address and a single movs loads the bit mask for the set and clear registers. The str/str.w instructions that write the bit mask to the set and clear registers would remain.
Is it possible to improve this further, say by using bit-banding? Bit-banding has the benefit of a single alias register address permitting direct access to one peripheral register bit, but a port set or clear operation still ends up being a three-instruction sequence. As in the unoptimized direct register access code above, a pin set (or clear) still requires loading a register address (the bit-band alias), loading the value to write, and executing a store instruction.
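For reference, the alias address is derived from the register address and bit number using the standard Cortex-M3/M4 peripheral bit-band mapping. The sketch below shows only that arithmetic; the macro names and the bitband_alias() helper are hypothetical and chosen for illustration, and the mapping applies only to parts whose core implements bit-banding.

```c
#include <stdint.h>

/* Cortex-M3/M4 peripheral region and its bit-band alias region. */
#define PER_MEM_BASE      0x40000000UL
#define PER_BITBAND_BASE  0x42000000UL

/* Each bit of a peripheral register maps to one 32-bit word in the alias
   region: 32 alias words per register byte, 4 bytes per bit. */
static inline uint32_t bitband_alias(uint32_t reg_addr, unsigned bit)
{
    return (uint32_t)(PER_BITBAND_BASE
                      + (reg_addr - PER_MEM_BASE) * 32UL
                      + bit * 4UL);
}
```

On target, setting a bit would then be a store such as *(volatile uint32_t *)bitband_alias(addr, bit) = 1, which still needs the alias address and the value 1 in registers before the store executes.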
Knowing this, it should also be apparent that multiple set or clear operations using the DOUTSET and DOUTCLR registers would actually be faster than using bit-banding because a single bit mask can set or clear multiple pins while bit-banding would require multiple alias register writes to achieve the same results.
Given the examples discussed above, direct register access provides near parity for GPIO performance in the Debug and Release configurations. While "GPIO->P[gpioPortD].DOUTCLR = 0x1" may not be quite as self-documenting as "GPIO_PinOutClear (gpioPortD, 0)", it will deliver similar performance regardless of the compiler's optimization level (and its readability is easily improved with the use of mnemonic macros like "#define BIT0 0x1").
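The mnemonic macros and the multi-pin advantage of the mask registers can be sketched together as follows. This is a host-side illustration under stated assumptions: mock_doutclr stands in for a port's DOUTCLR register, and mock_store_count, port_clear_mask(), and bitband_writes_needed() are hypothetical names that exist only to count stores.

```c
#include <stdint.h>

/* Mnemonic bit masks in the style suggested above. */
#define BIT0 0x1u
#define BIT1 0x2u
#define BIT2 0x4u

/* Hypothetical stand-in for a port's DOUTCLR register, plus a store counter. */
static volatile uint32_t mock_doutclr;
static unsigned mock_store_count;

/* Clear every pin named in the mask with one register store. */
static inline void port_clear_mask(uint32_t mask)
{
    mock_doutclr = mask;
    mock_store_count++;
}

/* Bit-banding touches one bit per alias write, so the same operation
   needs one store per set bit in the mask. */
static inline unsigned bitband_writes_needed(uint32_t mask)
{
    unsigned count = 0;
    while (mask != 0u) {
        mask &= mask - 1u;   /* drop the lowest set bit */
        count++;
    }
    return count;
}
```

Clearing pins 0 and 2 with port_clear_mask(BIT0 | BIT2) takes one store, whereas the bit-band route would take two alias writes for the same result.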