RTCC Retention Register vs. RAM Access Times

Do the RTCC retention registers have the same access time as regular RAM, or do they reside in a low-frequency clock domain and require multi-cycle accesses?
As the Clock Management Unit (CMU) chapter in the reference manual for each Series 1 EFM32 or EFR32 device shows, the Real Time Counter and Calendar (RTCC) is clocked by the LFECLK, but there's no indication of whether this clock also drives the retention registers. To figure out the actual cycle counts needed to access the retention registers, we'll need to run code that performs reads or writes in sufficiently tight loops over which we can calculate a per-iteration average.
Here's what this code looks like:
// emlib
#include "em_chip.h"
#include "em_cmu.h"
#include "em_emu.h"
#include "em_gpio.h"
#include "em_rtcc.h"
// BSP
#include "bsp.h"
// SDK drivers
#define RETARGET_VCOM
#include "retargetserial.h"
#include <stdio.h>
static void initRTCC(void);
int main(void)
{
  unsigned int msTicks_start, msTicks_end;
  unsigned int loop, retRate, ramRate;
  register volatile unsigned int word0;
  register unsigned int word1;

  // Device initialization
  CHIP_Init();

  // Set the HFCLK frequency
  CMU_HFRCOBandSet(cmuHFRCOFreq_16M0Hz);

  // Enable clock to access LF domains
  CMU_ClockEnable(cmuClock_HFLE, true);

  // Select LFRCO as LFECLK for RTCC
  CMU_ClockSelectSet(cmuClock_LFE, cmuSelect_LFRCO);
  CMU_ClockEnable(cmuClock_RTCC, true);

  // Retarget STDIO to USART1
  RETARGET_SerialInit();
  RETARGET_SerialCrLf(1);

  // Start SysTick at max value and count down
  SysTick->LOAD = 0xffffff;
  SysTick->CTRL |= (SysTick_CTRL_CLKSOURCE_Msk | SysTick_CTRL_ENABLE_Msk);

  // Retention register write loop
  msTicks_start = SysTick->VAL;
  for (loop = 0; loop < 16384; loop++)
  {
    RTCC->RET[0].REG = loop;
    __DSB();
  }
  msTicks_end = SysTick->VAL;
  retRate = msTicks_start - msTicks_end;
  printf("Done: %u retention writes in %u ticks\n\n", 16384, retRate);

  // Retention register read loop
  msTicks_start = SysTick->VAL;
  for (loop = 0; loop < 16384; loop++)
  {
    word1 = RTCC->RET[0].REG;
    __DSB();
  }
  msTicks_end = SysTick->VAL;
  retRate = msTicks_start - msTicks_end;
  printf("Done: %u retention reads in %u ticks\n\n", 16384, retRate);

  // RAM write loop
  msTicks_start = SysTick->VAL;
  for (loop = 0; loop < 16384; loop++)
  {
    word0 = loop;
    __DSB();
  }
  msTicks_end = SysTick->VAL;
  ramRate = msTicks_start - msTicks_end;
  printf("Done: %u RAM writes in %u ticks\n\n", 16384, ramRate);

  // RAM read loop
  msTicks_start = SysTick->VAL;
  for (loop = 0; loop < 16384; loop++)
  {
    word1 = word0;
    __DSB();
  }
  msTicks_end = SysTick->VAL;
  ramRate = msTicks_start - msTicks_end;
  printf("Done: %u RAM reads in %u ticks\n\n", 16384, ramRate);

  // Wait here
  while (1);
}
Running this code on a Series 1 Giant Gecko device at 16 MHz, we get the following output:
Done: 16384 retention writes in 180505 ticks
Done: 16384 retention reads in 196930 ticks
Done: 16384 RAM writes in 114693 ticks
Done: 16384 RAM reads in 98309 ticks
At first glance, we can see that accesses to the retention registers are slower than accesses to RAM. Based on these numbers, each iteration of the retention write loop takes 180505 / 16384 = 11 clock cycles. Let's break this down further by looking at the disassembly:
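Exact compiler output varies with toolchain and optimization settings, so treat the listing below as a representative sketch rather than literal output: the instruction sequence is the one used in the cycle counting that follows, while the register allocation and the assumption that the RET[0].REG address has been preloaded into r4 are illustrative.

ret_wr_loop:
    str.w   r0, [r4]        ; write the loop counter to RTCC->RET[0].REG (address assumed in r4)
    dsb     sy              ; force the core to wait for the write to complete
    adds    r0, #1          ; increment the loop counter
    cmp.w   r0, #16384      ; compare against the iteration count
    bne.n   ret_wr_loop     ; branch back until 16384 writes have been performed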
Of the 11 clock cycles, we know the adds and cmp.w instructions take 1 clock while the bne.n takes 2. The data synchronization barrier (dsb sy), which we insert in order to force the core to wait for completion of the memory write, takes 1 clock plus however many cycles are required for previous memory accesses to complete. Similarly, we know that the str.w takes 2 clocks plus the memory access. Putting this together, we find that the retention register access itself takes 11 - 2 (str.w) - 1 (dsb) - 1 (adds) - 1 (cmp.w) - 2 (bne.n) = 4 clock cycles.
Now, let's compare this to RAM writes. Each iteration of the loop takes 114693 / 16384 = 7 clock cycles. Here's the disassembly:
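Again, this is a representative sketch under the same assumptions; the stack offset used for word0 is illustrative.

ram_wr_loop:
    str     r0, [sp, #4]    ; write the loop counter to word0 on the stack (offset assumed)
    dsb     sy              ; force the core to wait for the write to complete
    adds    r0, #1          ; increment the loop counter
    cmp.w   r0, #16384      ; compare against the iteration count
    bne.n   ram_wr_loop     ; branch back until 16384 writes have been performed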
The resulting assembly is identical to that of the retention register write loop, the only difference being the use of a stack pointer-relative access to RAM. It follows, then, that the RAM access itself takes 7 - 2 (str) - 1 (dsb) - 1 (adds) - 1 (cmp.w) - 2 (bne.n) = 0 clock cycles.
Wait a second! How can a RAM access take 0 clock cycles? The answer is that it can't. A closer inspection reveals that the str instruction takes 2 clock cycles. The first cycle is the decoding/dispatching of the opcode while the second is the actual memory write. So, knowing this, we can say that a zero wait state write to RAM on EFM32/EFR32 takes 1 clock cycle. This also means that the retention register write takes 5 clock cycles, 4 more than a RAM write, which is what we actually calculated above.
Having gone through this exercise for writes, we can now examine the disassembly of the read loops for both RAM and retention registers. Here's the code for RAM:
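As before, the listing below is a representative sketch (register allocation and the stack offset are assumptions). Consistent with the instructions cited in the counting below, the iteration count is assumed to be held in a register and counted down so that no separate compare is needed.

ram_rd_loop:
    ldr     r3, [sp, #4]    ; read the volatile word0 from the stack (offset assumed)
    dsb     sy              ; force the core to wait for the read to complete
    subs    r2, #1          ; decrement the remaining iteration count
    bne.n   ram_rd_loop     ; branch back while iterations remain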
Each iteration of this loop takes 98309 / 16384 = 6 clock cycles such that the memory access is 6 - 2 (ldr) - 1 (dsb) - 1 (subs) - 2 (bne.n) = 0 clock cycles. Of course, just like the str instruction, the ldr instruction consists of a decode/dispatch and a memory read, which we know takes 1 clock cycle for zero wait state RAM. Similarly, we can find that each retention register read takes 196930 / 16384 = 12 clock cycles. Here's the loop disassembly:
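Here, too, the listing is a representative sketch under the same assumptions, with the RET[0].REG address assumed to be preloaded into a register.

ret_rd_loop:
    ldr.w   r3, [r4]        ; read RTCC->RET[0].REG (address assumed in r4)
    dsb     sy              ; force the core to wait for the read to complete
    subs    r2, #1          ; decrement the remaining iteration count
    bne.n   ret_rd_loop     ; branch back while iterations remain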
The cycle counts here are the same as above such that the retention register read is 12 - 2 (ldr.w) - 1 (dsb) - 1 (subs) - 2 (bne.n) = 6 clock cycles longer than a RAM read (7 clocks total). With this data in hand, let's summarize what we know so far:
Access                      Clock Cycles
RAM Read                    1
RAM Write                   1
Retention Register Read     7
Retention Register Write    5
But this is just one data point: 16 MHz operation on Series 1 Giant Gecko. These figures can and do differ with operating frequency and from one Series 1 device to another. The table below summarizes the execution times for each loop iteration across all Series 1 devices.
Note that these are the per-iteration cycle times for each read and write loop and not the actual read and write access times; those must be calculated by examining the assembly code for each loop. On the devices with the Cortex-M4 core, the loops are exactly what is shown above, so why are there differences? For starters, flash wait states extend instruction fetch times and, thus, execution times. The Memory System Controller instruction cache largely counters this, as can be seen for Giant Gecko, where increasing the number of flash wait states from 0 to 1 to 2 has little impact.
Of course, flash wait states are not the only factor affecting performance. On Giant Gecko, wait states are also introduced for peripheral accesses at frequencies above 50 MHz and for RAM at frequencies above 38 MHz. In the case of xG1 and xG13 vs. xG12, the ability of each design to meet specific timing requirements is clearly a factor. Even though the core feature sets are comparable, it shouldn't be a surprise that the superset 1 MB flash/256 KB RAM xG12 devices might not be capable of supporting the same access times as their slimmer xG1 and xG13 siblings.
Finally, it's worth examining why Series 1 Tiny Gecko (EFM32TG11) requires more clock cycles to execute each loop iteration than the other devices when running at comparable frequencies, especially in the case of RAM. The differences here are specifically due to the different microarchitecture of the Cortex-M0+ core. Let's first look at the disassembly of the retention register read loop:
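As with the Cortex-M4 listings, the sketch below is representative rather than literal; the register allocation and the operands of the loop test are assumptions, but the instruction sequence matches the cycle counting that follows.

ret_rd_loop:
    ldr     r3, [r4, #0]    ; read RTCC->RET[0].REG (address assumed in r4)
    dsb     sy              ; data synchronization barrier (3 cycles on the Cortex-M0+)
    subs    r2, r2, #1      ; decrement the remaining iteration count
    cmp     r2, #0          ; loop-exit test
    bne.n   ret_rd_loop     ; branch back while iterations remain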
This alone does not seem substantially different from the assembly generated by the compiler for Cortex-M4 devices. The subs and cmp instructions execute in a single cycle while the ldr is nominally a two clock operation. Where we find a difference is the dsb instruction, which takes 3 clocks on the Cortex-M0+. Factoring this in, we can now see that the retention register read takes an extra 17 - 2 (ldr) - 3 (dsb) - 1 (subs) - 1 (cmp) - 2 (bne.n) = 8 cycles on Tiny Gecko 11, thus making the entire access 9 clock cycles when running at 19 MHz and zero wait states.
Knowing this, we should expect the RAM read loop to require no extra clock cycles beyond those required for the opcodes and the RAM access itself. Here's the disassembly:
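Again a representative sketch, with an assumed stack offset for word0.

ram_rd_loop:
    ldr     r3, [sp, #4]    ; read the volatile word0 from the stack (offset assumed)
    dsb     sy              ; data synchronization barrier (3 cycles on the Cortex-M0+)
    subs    r2, r2, #1      ; decrement the remaining iteration count
    cmp     r2, #0          ; loop-exit test
    bne.n   ram_rd_loop     ; branch back while iterations remain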
From the table above, we know the RAM read loop takes 10 clocks per iteration, so it's a simple matter to see that the RAM access itself takes 10 - 2 (ldr) - 3 (dsb) - 1 (subs) - 1 (cmp) - 2 (bne.n) = 1 extra cycle on Tiny Gecko 11. In other words, the ldr instruction takes 3 clock cycles: one to decode the opcode and two for the read from RAM. As also shown in the table above, each iteration of the RAM write loop takes 10 clock cycles and, given the similarity of the assembly language output from the C compiler, we can simply substitute a str instruction into the listing above and see that the same timing applies. The calculation for the retention register write loop is left as an exercise for the reader.