What is the TRNG (True Random Number Generator) start-up time?
Can you provide some information about how long it takes for the TRNG to start outputting data, particularly after exiting a low energy mode?
As shown in this diagram...
...the TRNG (True Random Number Generator) consists of several sub-blocks that operate together to generate cryptographically secure random numbers. Each of these sub-blocks takes a certain number of clock cycles to do its part of the process.
The "random" part of the TRNG is provided by ring oscillators that make up the entropy source and run asynchronously to the rest of the system. Because the entropy source is asynchronous, it can be thought of much like the analog input to an analog-to-digital converter: its output is "sampled" by the downstream sub-blocks in the TRNG, which are clocked by the HFPERCLK.
So, once power is supplied, the entropy source begins running but naturally needs some time to "warm up" before its output is sufficiently random. This start-up delay is set by the TRNG_INITWAITVAL register, which defaults to the maximum possible 256 clocks and should not be reduced.
How exactly do we know that the output from the entropy source is "sufficiently random"? Obviously, just assuming that 256 clocks is enough because it's the maximum possible value for a register does not sound particularly robust. Instead, the TRNG includes sub-blocks that test whether the output of the entropy source meets standards of cryptographic randomness. Specifically, these sub-blocks execute the National Institute of Standards and Technology (NIST) repetition count and adaptive proportion tests, with window sizes of 64 and 4096 bits respectively, as described in NIST SP 800-90B, as well as the Bundesamt für Sicherheit in der Informationstechnik (BSI) online test described in AIS 31. These tests run in parallel, so the associated delay is the 4096 clocks required by the 4096-bit adaptive proportion test.
Whitening, which is the conditioning of the entropy source output to reduce bias and correlation, occurs after the integrity tests run. This takes 128 clocks and is enabled by default because the TRNG_CONTROL register's CONDBYPASS bit is 0 out of reset.
Filling of the output FIFO is the last step in the start-up process and takes 64 words × 32 clocks (one per bit) per word = 2048 clocks.
Summing up these components, we see that the TRNG takes...
256 (INITWAITVAL) + 4096 (entropy tests) + 128 (whitening) + 2048 (FIFO fill) = 6528 clocks
...which works out to about 344 µs at the default HFRCO frequency of 19 MHz.
This delay is incurred not only when the module is first enabled but also upon wake-up from the EM2 and EM3 low-energy modes. Sampling of the entropy source, integrity testing, output whitening/conditioning, and, of course, the output FIFO all run from the HFPERCLK, which, along with the entropy source itself, is disabled in EM2 and EM3. Upon wake-up, all of these must restart, thus necessitating the delay.
Giant Gecko 12 (EFM32GG12) deviates from the total calculated above. Its TRNG_INITWAITVAL register defaults to 1024 instead of 256, yielding a net delay of 1024 (INITWAITVAL) + 4096 (entropy tests) + 128 (whitening) + 2048 (FIFO fill) = 7296 clocks.
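As a quick sanity check of this arithmetic, here is a minimal, self-contained sketch in C. The function and constant names are illustrative and not part of any Silicon Labs API; only the cycle counts and the 19 MHz default HFRCO frequency come from the discussion above.

#include <stdint.h>
#include <stdio.h>

// TRNG start-up delay in HFPERCLK cycles, per the breakdown above.
static uint32_t trngStartupClocks(uint32_t initWaitVal)
{
  return initWaitVal   // entropy source warm-up (TRNG_INITWAITVAL)
         + 4096U       // start-up entropy tests (4096-bit adaptive proportion window dominates)
         + 128U        // conditioning/whitening
         + 2048U;      // FIFO fill: 64 words x 32 clocks per word
}

int main(void)
{
  const double hfperclkHz = 19.0e6;  // default HFRCO frequency

  // 6528 clocks, ~344 us with the default INITWAITVAL of 256
  printf("Default: %lu clocks, %.0f us\n", (unsigned long)trngStartupClocks(256U),
         trngStartupClocks(256U) * 1.0e6 / hfperclkHz);

  // 7296 clocks, ~384 us on Giant Gecko 12 (INITWAITVAL defaults to 1024)
  printf("GG12:    %lu clocks, %.0f us\n", (unsigned long)trngStartupClocks(1024U),
         trngStartupClocks(1024U) * 1.0e6 / hfperclkHz);

  return 0;
}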
Writing to the VDAC (DAC) COMBDATA Register with LDMA (DMA)
I'm trying to output two sine waves to the two VDAC channels on the EFR32BG13 by reading these values from a table and writing them to the VDAC_COMBDATA register with LDMA (later I'm going to output data in the same fashion as it is received via BLE).
I set up two 16-bit buffers with the data and initialize the LDMA as follows:
#include "em_device.h"
#include "em_cmu.h"
#include "em_ldma.h"

#define BUFFER_SIZE   4
#define LDMA_CHANNEL  0

uint16_t pingBuffer[BUFFER_SIZE];
uint16_t pongBuffer[BUFFER_SIZE];

// Ping-pong transfer count
uint32_t ppCount;

// LDMA transfer configuration
LDMA_TransferCfg_t cfg;

// Descriptor linked list for LDMA transfers
LDMA_Descriptor_t descLink[2];

void initLdmaPingPong(void)
{
  CMU_ClockEnable(cmuClock_LDMA, true);

  // Basic LDMA configuration
  LDMA_Init_t ldmaInit = LDMA_INIT_DEFAULT;
  LDMA_Init(&ldmaInit);

  // Configure LDMA transfer type
  cfg = (LDMA_TransferCfg_t)LDMA_TRANSFER_CFG_PERIPHERAL(ldmaPeripheralSignal_VDAC0_CH0);

  // Use LINK descriptor macros for ping-pong transfers
  LDMA_Descriptor_t xfer[] =
  {
    LDMA_DESCRIPTOR_LINKREL_M2P_BYTE(&pingBuffer,        // source
                                     &(VDAC0->COMBDATA), // destination
                                     BUFFER_SIZE,        // data transfer size
                                     1),                 // link relative offset (links to next)
    LDMA_DESCRIPTOR_LINKREL_M2P_BYTE(&pongBuffer,        // source
                                     &(VDAC0->COMBDATA), // destination
                                     BUFFER_SIZE,        // data transfer size
                                     -1)                 // link relative offset (links to previous)
  };

  descLink[0] = xfer[0];
  descLink[1] = xfer[1];

  descLink[0].xfer.ignoreSrec = true;        // ignores single requests to reduce energy usage
  descLink[0].xfer.size = ldmaCtrlSizeHalf;  // transfers half words instead of bytes
  descLink[1].xfer.ignoreSrec = true;        // ignores single requests to reduce energy usage
  descLink[1].xfer.size = ldmaCtrlSizeHalf;  // transfers half words instead of bytes

  LDMA_IntEnable(LDMA_IF_DONE_DEFAULT);

  // Start ping-pong transfer
  ppCount = 0;
  LDMA_StartTransfer(LDMA_CHANNEL, (void*)&cfg, (void*)&descLink);

  NVIC_ClearPendingIRQ(LDMA_IRQn);
  NVIC_EnableIRQ(LDMA_IRQn);
}
A TIMER triggers the VDAC output via the PRS, and I fill the buffers accordingly when the LDMA triggers its DONE interrupt.
The problem I have is that I only get the correct output on channel 0. If I transfer longwords using the LDMA, then I get a signal from both outputs, but the output is not correct, e.g. every fourth ping-pong transfer fails, and I get 0V output for this period.
What is the correct structure to write to the COMBDATA register? How do I manage to accomplish this with a ping-pong buffer?
Our user's setup of the LDMA here is mostly correct except for one major flaw:
descLink[0].xfer.ignoreSrec = true;        // ignores single requests to reduce energy usage
descLink[0].xfer.size = ldmaCtrlSizeHalf;  // transfers half words instead of bytes
descLink[1].xfer.ignoreSrec = true;        // ignores single requests to reduce energy usage
descLink[1].xfer.size = ldmaCtrlSizeHalf;  // transfers half words instead of bytes
While this looks correct in theory, the problem is the attempt to use halfwords for the transfer size. The LDMA itself has no issue performing a halfword read or write, but it cannot do so when the target is a memory-mapped peripheral register: all write accesses to peripheral registers must be longword writes. In the case of the COMBDATA register, this actually makes perfect sense when you consider that its sole purpose is to update both VDAC output channels simultaneously.
So, in the customer's case, the solution is fairly simple. Instead of maintaining two ping-pong buffer arrays of type uint16_t, the buffered data needs to be stored so that each pair of 16-bit output values can be read as a single 32-bit word and written to the COMBDATA register in one access. This can be done by simply ordering the data so that the 0th entry in the buffer is the first channel 0 output value (VDAC_COMBDATA_CH0DATA, the lower halfword on the little-endian Cortex-M), the 1st entry is the first channel 1 output value (VDAC_COMBDATA_CH1DATA, the upper halfword), the 2nd entry is the second channel 0 output value, the 3rd entry is the second channel 1 output value, and so on. The same thing can be done with a typedef'd struct whose two members hold the channel 0 and channel 1 output values.
Having made this change, the LDMA descriptors would be modified for 32-bit transfers:
descLink[0] = xfer[0];
descLink[1] = xfer[1];

descLink[0].xfer.ignoreSrec = true;        // ignores single requests to reduce energy usage
descLink[0].xfer.size = ldmaCtrlSizeWord;  // transfers words instead of bytes
descLink[1].xfer.ignoreSrec = true;        // ignores single requests to reduce energy usage
descLink[1].xfer.size = ldmaCtrlSizeWord;  // transfers words instead of bytes
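As a concrete illustration of the interleaved layout, here is a minimal sketch. The type and member names are made up for this example; it reuses BUFFER_SIZE from the code above and relies on the little-endian halfword-to-COMBDATA mapping described earlier.

// One 32-bit entry per VDAC update, laid out to match VDAC_COMBDATA.
typedef struct
{
  uint16_t ch0;   // lower halfword -> VDAC_COMBDATA_CH0DATA
  uint16_t ch1;   // upper halfword -> VDAC_COMBDATA_CH1DATA
} VdacSamplePair_t;

// BUFFER_SIZE pairs = BUFFER_SIZE word-sized LDMA transfers per ping or pong
VdacSamplePair_t pingBuffer[BUFFER_SIZE];
VdacSamplePair_t pongBuffer[BUFFER_SIZE];

// Filling one entry in the LDMA DONE interrupt handler might then look like:
//   pingBuffer[i].ch0 = sineTable[index0];
//   pingBuffer[i].ch1 = sineTable[index1];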
If there is a need to generate a waveform on one output and have it remain unchanged while the other channel is updated, this could be done with the original code above, with the proviso that the destination for each descriptor would be &(VDAC0->CH0DATA) or &(VDAC0->CH1DATA) as needed.
NOTE: While this discussion involves the VDAC on Series 1 EFM32 and EFR32 devices, it applies identically to the DAC on Series 0 EFM32 and EZR32 devices. The requirement for aligned longword accesses to peripheral registers is universal across all EFM32, EFR32, and EZR32 devices.
RTCC Retention Register vs. RAM Access Times
Do the RTCC retention registers have the same access time as regular RAM, or do they reside in a low frequency clock domain and require multi-cycle accesses?
As the Clock Management Unit (CMU) chapter in the reference manual for each Series 1 EFM32 or EFR32 device shows, the Real Time Counter and Calendar (RTCC) is clocked by the LFECLK, but there is no indication of whether this clock domain also applies to the retention registers. To find the actual cycle counts needed to access the retention registers, we can run reads and writes in sufficiently tight loops and compute a per-iteration average.
Here's what this code looks like:
// emlib
#include "em_chip.h"
#include "em_cmu.h"
#include "em_emu.h"
#include "em_gpio.h"
#include "em_rtcc.h"

// BSP
#include "bsp.h"

// SDK drivers
#define RETARGET_VCOM
#include "retargetserial.h"

#include <stdint.h>
#include <stdio.h>

static void initRTCC(void);

int main(void)
{
  unsigned int msTicks_start, msTicks_end;
  unsigned int loop, retRate, ramRate;
  register volatile unsigned int word0;
  register unsigned int word1;

  // Device initialization
  CHIP_Init();

  // Set the HFCLK frequency
  CMU_HFRCOBandSet(cmuHFRCOFreq_16M0Hz);

  // Enable clock to access LF domains
  CMU_ClockEnable(cmuClock_HFLE, true);

  // Select LFRCO as LFECLK for RTCC
  CMU_ClockSelectSet(cmuClock_LFE, cmuSelect_LFRCO);
  CMU_ClockEnable(cmuClock_RTCC, true);

  // Retarget STDIO to USART1
  RETARGET_SerialInit();
  RETARGET_SerialCrLf(1);

  // Start SysTick at max value and count down
  SysTick->LOAD = 0xffffff;
  SysTick->CTRL |= (SysTick_CTRL_CLKSOURCE_Msk | SysTick_CTRL_ENABLE_Msk);

  // Retention register write loop
  msTicks_start = SysTick->VAL;
  for (loop = 0; loop < 16384; loop++)
  {
    RTCC->RET[0].REG = loop;
    __DSB();
  }
  msTicks_end = SysTick->VAL;
  retRate = msTicks_start - msTicks_end;
  printf("Done: %u retention writes in %u ticks\n\n", 16384, retRate);

  // Retention register read loop
  msTicks_start = SysTick->VAL;
  for (loop = 0; loop < 16384; loop++)
  {
    word1 = RTCC->RET[0].REG;
    __DSB();
  }
  msTicks_end = SysTick->VAL;
  retRate = msTicks_start - msTicks_end;
  printf("Done: %u retention reads in %u ticks\n\n", 16384, retRate);

  // RAM write loop
  msTicks_start = SysTick->VAL;
  for (loop = 0; loop < 16384; loop++)
  {
    word0 = loop;
    __DSB();
  }
  msTicks_end = SysTick->VAL;
  ramRate = msTicks_start - msTicks_end;
  printf("Done: %u RAM writes in %u ticks\n\n", 16384, ramRate);

  // RAM read loop
  msTicks_start = SysTick->VAL;
  for (loop = 0; loop < 16384; loop++)
  {
    word1 = word0;
    __DSB();
  }
  msTicks_end = SysTick->VAL;
  ramRate = msTicks_start - msTicks_end;
  printf("Done: %u RAM reads in %u ticks\n\n", 16384, ramRate);

  // Wait here
  while (1);
}
Running this code on a Series 1 Giant Gecko device at 16 MHz, we get the following output:
Done: 16384 retention writes in 180505 ticks
Done: 16384 retention reads in 196930 ticks
Done: 16384 RAM writes in 114693 ticks
Done: 16384 RAM reads in 98309 ticks
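To turn these raw SysTick tick counts into the per-iteration figures used in the analysis below, a simple rounded division suffices; a small helper (illustrative only, not part of the test program) is shown here.

#include <stdint.h>

// Average clock cycles per loop iteration, rounded to the nearest integer.
static inline uint32_t cyclesPerIteration(uint32_t ticks, uint32_t iterations)
{
  return (ticks + iterations / 2U) / iterations;
}

// cyclesPerIteration(180505, 16384) == 11   (retention write loop)
// cyclesPerIteration(196930, 16384) == 12   (retention read loop)
// cyclesPerIteration(114693, 16384) == 7    (RAM write loop)
// cyclesPerIteration( 98309, 16384) == 6    (RAM read loop)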
At first glance, we can see that accesses to the retention registers are slower than RAM. Based on these numbers, each iteration of the retention write loop takes 180505 / 16384 = 11 clock cycles. Let's break this down further by looking at the disassembly:
Of the 11 clock cycles, we know the adds and cmp.w instructions take 1 clock while the bne.n takes 2. The data synchronization barrier (dsb sy), which we insert in order to force the core to wait for completion of the memory write, takes 1 clock plus however many cycles are required for previous memory accesses to complete. Similarly, we know that the str.w takes 2 clocks plus the memory access. Putting this together, we find that the retention register access itself takes 11 - 2 (str.w) - 1 (dsb) - 1 (adds) - 1 (cmp.w) - 2 (bne.n) = 4 clock cycles.
Now, let's compare this to RAM writes. Each iteration of the loop takes 114693 / 16384 = 7 clock cycles. Here's the disassembly:
The resulting assembly is identical to that used for the retention register write loop with the only difference being the use of the stack pointer-relative access to RAM. It follows, then, that we can find the RAM access itself takes 7 - 2 (str) - 1 (dsb) - 1 (adds) - 1 (cmp.w) - 2 (bne.n) = 0 clock cycles.
Wait a second! How can a RAM access take 0 clock cycles? The answer is that it can't. A closer inspection reveals that the str instruction takes 2 clock cycles. The first cycle is the decoding/dispatching of the opcode while the second is the actual memory write. So, knowing this, we can say that a zero wait state write to RAM on EFM32/EFR32 takes 1 clock cycle. This also means that the retention register write takes 5 clock cycles, 4 more than a RAM write, which is what we actually calculated above.
Having gone through this exercise for writes, we can now examine the disassembly of the read loops for both RAM and retention registers. Here's the code for RAM:
Each iteration of this loop takes 98309 / 16384 = 6 clock cycles such that the memory access is 6 - 2 (ldr) - 1 (dsb) - 1 (subs) - 2 (bne.n) = 0 clock cycles. Of course, just like the str instruction, the ldr instruction consists of a decode/dispatch and a memory read, which we know takes 1 clock cycle for zero wait state RAM. Similarly, we can find that each retention register read takes 196930 / 16384 = 12 clock cycles. Here's the loop disassembly:
The cycle counts here are the same as above such that the retention register read is 12 - 2 (ldr.w) - 1 (dsb) - 1 (subs) - 2 (bne.n) = 6 clock cycles longer than a RAM read (7 clocks total). With this data in hand, let's summarize what we know so far:
Access Type                 Clock Cycles
RAM Read                    1
RAM Write                   1
Retention Register Read     7
Retention Register Write    5
But this is just one data point: 16 MHz operation on Series 1 Giant Gecko. These figures can and do differ across operating frequency and other Series 1 devices. The table below summarizes the execution times for each loop iteration across all Series 1 devices.
Note that these are the per-iteration cycle times for each read and write loop and not the actual read and write access times; those must be calculated by examining the assembly code for each loop. On the devices with the Cortex-M4 core, the assembly is exactly what is shown above, so why are there differences? For starters, flash wait states extend the instruction fetch times and, thus, the execution times. The Memory System Controller (MSC) instruction cache largely counters this, as can be seen for Giant Gecko, where increasing the number of flash wait states from 0 to 1 to 2 has little impact.
Obviously, flash wait states are not the only factor affecting performance. On Giant Gecko, wait states are also introduced for peripheral accesses at frequencies above 50 MHz and for RAM at frequencies above 38 MHz. In the case of xG1 and xG13 vs. xG12, the ability of each design to meet specific timing requirements is clearly a factor. Despite having comparable core feature sets, it shouldn't be a surprise that the superset 1 MB flash/256 KB RAM xG12 devices might not be capable of supporting the same access times as their slimmer xG1 and xG13 siblings.
Finally, it's worth examining why Series 1 Tiny Gecko (EFM32TG11) requires more clock cycles to execute each loop iteration than the other devices when running at comparable frequencies, especially in the case of RAM. The differences here are specifically due to the different microarchitecture of its Cortex-M0+ core. Let's first look at the disassembly of the retention register read loop:
This alone does not seem substantially different from the assembly generated by the compiler for Cortex-M4 devices. The subs and cmp instructions execute in a single cycle while the ldr is nominally a two clock operation. Where we find a difference is the dsb instruction, which takes 3 clocks on the Cortex-M0+. Factoring this in, we can now see that the retention register read takes an extra 17 - 2 (ldr) - 3 (dsb) - 1 (subs) - 1 (cmp) - 2 (bne.n) = 8 cycles on Tiny Gecko 11, thus making the entire access 9 clock cycles when running at 19 MHz and zero wait states.
Knowing this, we should expect the RAM read loop to require no extra clock cycles beyond those required for the opcodes and the RAM access itself. Here's the disassembly:
From the table above, we know the RAM read loop takes 10 clocks per iteration, so it's a simple matter to see that the RAM access itself takes 10 - 2 (ldr) - 3 (dsb) - 1 (subs) - 1 (cmp) - 2 (bne.n) = 1 extra cycle on Tiny Gecko 11. In other words, the ldr instruction takes 3 clock cycles: one to decode the opcode and two for the read from RAM. As also shown in the table above, each iteration of the RAM write loop takes 10 clock cycles and, given the similarity of the assembly language output from the C compiler, we can simply substitute a str instruction into the listing above and see that the same timing applies. The calculation for the retention register write loop is left as an exercise for the reader.
Read-While-Write and the Boot Loader Flash on EFM32 and EFR32 Series 1
Our application makes use of energy harvesting, so we try to overlap operations in our firmware when possible, e.g. using the LDMA to service a peripheral while the CPU is executing code. So, knowing this, can you tell us if it is possible to run code from the boot loader flash on EFM32TG11 at the same time as we are programming the main flash array?
What a great question! It seems reasonable to assume this would work because the boot loader flash array on Series 1 devices resides at address 0x0FE10000 while the main flash array sits at 0x0. Given the two different base addresses, these would seem to be physically distinct flash arrays, each with its own dedicated charge pump for generating the high voltages used during program and erase operations.
Alas, in this case, different addresses do not mean physically separate arrays with separate charge pumps. Code that programs or erases the flash while running from the boot loader array is subject to the same CPU stall that occurs when running from the main flash array on devices that have a single flash instance (more on this later). Let's investigate this further. Is there a visually satisfying way to show what's actually happening with the CPU when the flash high voltage is applied? Of course there is! We can toggle a GPIO pin.
First, though, we need to consider how to actually get code into the boot loader space. To do that, we'll follow the steps outlined in the Knowledge Base article "Using a Linker Script in GCC to Locate Functions at Specific Memory Regions" and have Simplicity Studio use a custom linker file with some additions. To start, we'll add the BLFLASH region of memory. Note that the first 4 KB of the boot loader flash is excluded specifically to keep the EFM32TG11 factory boot loader intact.
The .blspace section is added next for functions designated to reside in the BLFLASH region of memory:
Lastly, a check is added to make sure functions placed in the .blspace section do not overflow the allotted space:
We'll use the following simple program to directly issue a flash page erase (no use of emlib in order to keep the code compact and as close to assembly as possible). Notice the commented out function declarations. We'll make use of these later to place the erasepage() function in main flash or RAM in order to see how it behaves.
#include "em_device.h"

void erasepage(void);

int main(void)
{
  // Disable MSC prefetch
  MSC->READCTRL &= ~_MSC_READCTRL_PREFETCH_MASK;

  // Disable the MSC cache
  MSC->READCTRL |= MSC_READCTRL_IFCDIS;

  // Configure PC2 as an output
  CMU->HFBUSCLKEN0 |= CMU_HFBUSCLKEN0_GPIO;
  GPIO->P[2].MODEL = GPIO_P_MODEL_MODE2_PUSHPULL;

  while (1)
  {
    erasepage();
  }
}

#define BLSPACE __attribute__((__section__(".blspace")))

void BLSPACE erasepage(void)
//void __attribute__ ((section(".ram"))) erasepage(void)
//void erasepage(void)
{
  // Enable writes
  MSC->WRITECTRL |= MSC_WRITECTRL_WREN;

  // Address to erase
  MSC->ADDRB = 0x8000;

  // Load internal write address register from MSC_ADDRB
  MSC->WRITECMD = MSC_WRITECMD_LADDRIM;

  // Turn on PC2
  GPIO->P[2].DOUT = 0x4;

  // Enable high voltage to start erase operation
  MSC->WRITECMD = MSC_WRITECMD_ERASEPAGE;

  // Flush pipeline after starting erase
  __NOP();
  __ISB();

  // Turn off PC2
  GPIO->P[2].DOUT = 0x0;

  // Make sure erase is complete
  while (MSC->STATUS & MSC_STATUS_BUSY);

  // Halt here
  __BKPT(0);
}
So, with the code downloaded to the Tiny Gecko 11 device and the erasepage() function residing at 0x0FE11000 in the boot loader flash, here's what we see on the scope:
That 26.3 ms pulse is nicely in line with the typical page erase time specified in the datasheet, too:
Naturally, we should see what happens when the erasepage() function is placed in the main array. To do so, we can remove the BLSPACE attribute from the erasepage() function declaration or simply comment out the declaration entirely and uncomment the third declaration which excludes the BLSPACE attribute. Here's what we see now:
Looks good! This matches right up with the result seen when running from the boot loader array, which indicates that, in both cases, the CPU is stalled while the high voltage is applied to the flash during the erase operation. Of course, we also need to see the alternative: what happens when erasepage() runs from RAM. To see this, comment out the current declaration that places the code in flash and uncomment the declaration with the .ram section attribute. Running this code, we see:
The short duration of this GPIO pulse when running erasepage() from RAM clearly indicates that the CPU is still executing code while the high voltage is applied. Series 1 devices default to operation at 19 MHz, so this amounts to just a few instructions as shown below in the debugger disassembly:
Knowing all of this, there is a remaining question: what happens on dual-bank devices like EFM32PG12, which explicitly have read-while-write (RWW) support via the MSC_WRITECTRL_RWWEN bit? Let's run the same code with erasepage() in the boot loader flash but with the erase target addresses being 0x40000 and 0xC0000, which, on Pearl Gecko 12, are in the bottom and top flash banks, respectively. We'll also need to modify the write enabling sequence to permit RWW operation as follows:
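The original article showed this modification as a screenshot. A minimal sketch of the intent, assuming the Series 1 MSC register and bit names mentioned above, would be:

// Enable flash writes and read-while-write so that reads from the other bank
// are not stalled while this bank has high voltage applied.
MSC->WRITECTRL |= MSC_WRITECTRL_WREN | MSC_WRITECTRL_RWWEN;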
In the side-by-side scope captures below, the GPIO pulse spans the datasheet-typical 27 ms page erase when the target address is 0x40000 but lasts only 1.8 µs when the target address is 0xC0000. What this shows is that the boot loader flash is part of the same physical flash block as the bottom flash bank (0x0 to 0x7FFFF) and thus shares the same high voltage circuitry. The top flash bank (0x80000 to 0xFFFFF) is physically distinct from the boot loader array, which is why code can continue executing from the boot loader array even while the high voltage is applied for erasing (or programming) the top bank.
Users of Giant Gecko Series 1 should take note that this behavior is reversed. On these devices, the boot loader array is associated with the upper flash bank, so RWW is possible when the target of flash operations is in the lower flash bank. Of course, as the reference manual notes, RWW is always possible when code is running in one physical flash array and the target of the flash operation is the other (e.g. lower bank and upper bank or vice versa).
Specifications for External Clock Input via CLKIN0
Question:
What signals can be used to drive the CLKIN0 input on the Series 1 MCU (EFR32/EFM32) devices on which it appears?
Answer:
Some EFR32 and EFM32 Series 1 MCU devices support a CLKIN0 input to the CMU module, as depicted below, taken from Figure 10.1 "CMU Overview - High Frequency Portion" of the EFM32PG12 Reference Manual:
The purpose of the CLKIN0 input path is two-fold, depending on what is implemented on a particular MCU device. CLKIN0 supports a low-frequency (<= 1 MHz) square wave clock input which can be used:
to clock the device from an external CMOS source (BLUE path above; see the sketch after this list), or
by the DPLL as a reference clock (GREEN path above)
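As a minimal illustration of the first use case, clock selection with emlib might look like the following. This is only a sketch: it assumes an emlib version whose CMU_ClockSelectSet() accepts a cmuSelect_CLKIN0 enumerator on parts that implement CLKIN0 (check em_cmu.h for your device), and it omits the GPIO and pin-route configuration needed to bring the external clock onto the CLKIN0 pin.

#include "em_cmu.h"

void selectExternalClkin0(void)
{
  // Run HFCLK directly from the external CMOS clock applied to CLKIN0 (BLUE path).
  CMU_ClockSelectSet(cmuClock_HF, cmuSelect_CLKIN0);

  // Alternatively, CLKIN0 can serve as the DPLL reference clock (GREEN path);
  // see the DPLL section of the device reference manual for that configuration.
}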
For guidance on how to implement either of these configurations, see the related Knowledge Base articles.
VREGSW Pin Output Waveform and Voltage Swing
An integrated DC-DC buck converter is an option on most Series 1 EFM32 and EFR32 devices. While the term is used throughout the EFM32/EFR32 documentation, often shortened to "DCDC", it's important to note that the on-chip circuitry still requires an external inductor and capacitor to generate the output voltage to which the DCDC is programmed.
In fact, what is actually seen on the VREGSW pin, which is an output from the MCU, is not the regulated voltage itself but the switching waveform of the DCDC. This waveform is, obviously, very different from what is seen at the actual output of the converter: the regulated voltage after the inductor, where the capacitor connects to ground, i.e. the VDCDC node shown in the figure below:
Programming of the DC-DC converter and device power configurations are covered in Application Note 0948: EFM32 and EFR32 Series 1 Power Configurations and DC-DC, but one item not addressed there is the waveform seen on the VREGSW pin and its voltage swing. While the switching frequency can vary, the output waveform looks like this:
The voltage swing on VREGSW can vary, as well, including over temperature, but can be expected to range from -1V to VREGVDD + 1V, where the under/overshoot is due to the switching current flowing through the ESD/body diode during the DCDC powertrain's dead time.
FIT/SER for EFM32/EFR32 Series 1 RAM
What is the soft error rate (SER) or failure in time (FIT) rate for the RAM on EFM32 or EFR32 Series 1 devices?
The fundamental RAM bit cell design on all EFM32 and EFR32 Series 1 devices has a failure in time (FIT) rate of 510.
How can I enable debug lock (flash read protection) for my EFR32 application?
In particular, we are using Simplicity Commander to generate a hex file that combines our tokens for encryption and signing. We also use Commander to program our target and know that Commander can also lock a device on a target board with...
commander device lock
...but is there a way to enable debug lock via a hex file so that this step just happens as a matter of course when flashing our devices?
Yes, absolutely! In fact, because you are generating a hex file with tokens for use with Gecko Bootloader, it's a simple matter to modify this file to also enable debug lock.
As you're already running Commander with a series of different command line switches in order to generate everything needed for Gecko Bootloader, you just need to add one more call at the end of all of your processing:
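The original article showed the exact command line as an image. As a sketch of the idea only (the file names here are placeholders, and the --patch/--outfile option syntax should be verified against the Simplicity Commander user's guide), patching the debug lock word into the tokens file might look like:

commander convert tokens.hex --patch 0x0FE041FC:0x00000000 --outfile tokens-with-debug-lock.hex

Here 0x0FE041FC is the address of the debug lock word (DLW), as explained in the following article on setting the DLW from C code.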
Note in particular that the input file should be the file you generate with Commander that contains your signing and encryption tokens. These are actually located in the lock bits page, so it's necessary to add the patch that sets the debug lock word (DLW) in this file.
Why is this the case? In theory, the DLW could be set in any hex file. Simplicity Commander or any other tool used to program the hex file into flash would write the required data wherever it needs to be located. However, if you program the tokens hex file after programming the DLW in another hex file, it's likely that your programming software (including Commander) would first erase the lock bits page, thus unlocking the device because debug lock does not take effect until after a reset.
Consequently, regardless of where or when you choose to enable debug lock, make sure it is part of the last flash programming operation performed so that it takes effect with the next reset and cannot otherwise be undone.
How do I set the Debug Lock Word (DLW) or other locations in the lock bits or user data pages in my C program using IAR?
IAR provides a very convenient #pragma location mechanism for placing variables at specific memory addresses. The #pragma is followed by the specific variable declaration, so programming the DLW to lock debug access is as simple as:
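The listing in the original article was an image; reconstructing it from the explanation that follows (LOCKBITS_BASE comes from the device header, and the variable name and value match the text), the declaration is presumably along these lines:

#pragma location = (LOCKBITS_BASE + (127 << 2))   // 0x0FE041FC, the Debug Lock Word (DLW)
__root const uint32_t debug_lock_word = 0x0;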
The first line specifies that the address of the following variable should be 0x0FE04000 + (127 << 2 ) = 0x0FE041FC. LOCKBITS_BASE is #defined in the CMSIS Cortex-M Peripheral Access Layer Header File for the device in question and is 0x0FE04000 for all current EFM32, EFR32, and EZR32 devices.
Naturally, the constant debug_lock_word sets the DLW to 0x0 (at the previously specified address), but what does the __root modifier do? As explained in the following IAR Technical Note...
https://www.iar.com/support/tech-notes/linker/the-linker-removing-functions-and-variables-or-external-not-found/
__root prevents the linker from removing a variable or function if it is not referred to in the main() call-tree or in any interrupt call-tree. Short of perhaps checking the state of the debug_lock_word in firmware to enable/disable debugging features, there would usually be no reason for it to be referenced in main() or elsewhere, thus the need to prevent the linker from removing it.
Obviously, this same technique can be used for any of the other words in the lock bits page. For example, to have pin resets treated as hard instead of soft resets, it's necessary to clear bit 2 of lock bits page word 122, which is configuration word 0 (CLW0). This is done with the declaration below (bit 1 of CLW0, which enables the bootloader, remains set in this particular example):
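The original listing was, again, an image; a sketch consistent with that description (the variable name is arbitrary, and all other bits are left in their erased state of 1) might be:

#pragma location = (LOCKBITS_BASE + (122 << 2))         // CLW0
__root const uint32_t config_lock_word0 = 0xFFFFFFFB;   // bit 2 cleared (hard pin reset); bit 1 (bootloader enable) left set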
Locations in user data space can be programmed with constant data in the same way. For example, a product and manufacturer name could be stored as follows:
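A sketch of what such declarations might look like (the offsets, lengths, and strings are arbitrary placeholders; USERDATA_BASE is defined in the device header):

#pragma location = (USERDATA_BASE + 0x00)
__root const char product_name[16] = "Example Product";

#pragma location = (USERDATA_BASE + 0x10)
__root const char manufacturer_name[16] = "Example Mfr";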
A curious user might ask "How exactly do I know this works? The .hex file format is rather cryptic looking to the untrained eye. Is there a way to parse this .hex file output from the build process and verify the expected constants are there?"
Various binary tools can be used to dump and otherwise convert .hex files, but the hexinfo program (attached to this article as a Windows build, compiled from the sources at https://github.com/bradgrantham/hex2bin) does the job conveniently enough:
Each contiguous region in the hex file can be associated with a section of the compiled binary data. The largest is, of course, the program code itself, while the three smaller regions correspond to the product_name, manufacturer_name, and debug_lock_word constant definitions at the addresses specified by the location #pragmas that precede them.
Gecko Bootloader Fails on EFR32xG1 Devices with 16 KB of RAM (EFR32BG1V, EFR32FG1V, and EFR32MG1V)
I'm seeing a problem with Gecko Bootloader v1.6 and a signed and encrypted GBL file. When the upload reaches 99%, it always fails with error 0x0b03 on the EFR32BG1V132 device, which has 16 KB of RAM. When I trace the code, parser_parse() is returning 0x1006 at one of the uploads (close to the last one), which is BOOTLOADER_ERROR_PARSER_SIGNATURE.
When I run this same code on a device with 32 KB of RAM (e.g. EFR32BG1P232), it uploads successfully. There appears to be something unique to the 16 KB devices that is blocking the signature verification because uploads complete successfully with unsigned images.
Based on this customer description, it would appear that 16 KB RAM flavors of the EFR32 Blue Gecko (and by way of comparison, 16 KB versions of EFR32 Flex and Mighty Gecko) cannot run Gecko Bootloader. Is this really the case?
Fortunately, this problem is actually related to the age of the devices being used. In December 2016, EFR32 Blue, Flex, and Mighty Gecko "V" part numbers (e.g. EFR32BG1V132) were upgraded to have full encryption capabilities instead of AES alone. The full list of affected part numbers can be found in the Product Change Notification (PCN) document announcing the upgrade:
https://www.silabs.com/documents/public/pcns/PB-1612141-EFR32MG1-EFR32BG1-EFR32FG1-DataSheet-Rev-1.1.pdf
Gecko Bootloader requires SHA hashing and ECDSA support to perform signature verification. Because these features were not present on the original EFR32xG1V devices, Gecko Bootloader fails with the parser signature error above when run on this older hardware. Apart from visually inspecting the date code, there is no way to determine whether a suspect device has full encryption capability, so Gecko Bootloader has no means of checking for this and thus fails in this fashion.