The mitigation for the instability of STM8 EEPROM

Read this article if you have a project that uses the EEPROM on an STM8 microcontroller. Let me walk you through the hardware bug I found, along with my proposed mitigation.

The symptom of the bug

You might see no symptom at all on your hardware and software combination. However, it is not that uncommon to stumble on a faulty STM8 chip that occasionally won’t let you write to the EEPROM. In other words, you can’t reliably write data to the EEPROM: sometimes it works, sometimes it doesn’t. If this is precisely why you are here, you have come to the right place.

Encountering the bug

I am not the first one to find the bug; I first read about it in a post on Mark Stevens’ blog.
Mark seems to be a big fan of the STM8. His tutorial articles about it are very detailed, which is why they also cover the strange behavior he found.
Basically, I did the same thing and witnessed the same thing as he did. I was following these instructions from the STM8 reference manual:
Enabling write access to the DATA area
After a device reset, it is possible to disable the DATA area write protection by writing consecutively two values called MASS keys to the FLASH_DUKR register. These programmed keys are then compared to two hardware key values:
First hardware key: 0b1010 1110 (0xAE)
Second hardware key: 0b0101 0110 (0x56)
The following steps are required to disable write protection of the DATA area:
1. Write a first 8-bit key into the FLASH_DUKR register. When this register is written for the first time after a reset, the data bus content is not latched into the register, but compared to the first hardware key value (0xAE).
2. If the key available on the data bus is incorrect, the application can re-enter two MASS keys to try unprotecting the DATA area.
3. If the first hardware key is correct, the FLASH_DUKR register is programmed with the second key. The data bus content is still not latched into the register, but compared to the second hardware key value (0x56).
4. If the key available on the data bus is incorrect, the data EEPROM area remains write protected until the next reset. Any new write command sent to this address is ignored.
5. If the second hardware key is correct, the DATA area is write unprotected and the DUL bit of the FLASH_IAPSR register is set.
Before starting programming, the application must verify that the DATA area is not write protected by checking that the DUL bit is effectively set. The application can choose, at any time, to disable again write access to the DATA area by clearing the DUL bit.
I very carefully made sure I did everything right.
I compared my code with the references I found.
I compared it with Mark’s:

if (FLASH_IAPSR_DUL == 0) {
  FLASH_DUKR = 0xae;
  FLASH_DUKR = 0x56;
}

// write to EEPROM

FLASH_IAPSR_DUL = 0;

I walked through the standard peripheral driver code released by STMicroelectronics themselves:

void FLASH_Unlock(FLASH_MemType_TypeDef FLASH_MemType)
{
  /* Check parameter */
  assert_param(IS_MEMORY_TYPE_OK(FLASH_MemType));
  
  /* Unlock program memory */
  if(FLASH_MemType == FLASH_MEMTYPE_PROG)
  {
    FLASH->PUKR = FLASH_RASS_KEY1;
    FLASH->PUKR = FLASH_RASS_KEY2;
  }
  /* Unlock data memory */
  else
  {
    FLASH->DUKR = FLASH_RASS_KEY2; /* Warning: keys are reversed on data memory !!! */
    FLASH->DUKR = FLASH_RASS_KEY1;
  }
}

Still, I could see nothing wrong with my code.

However, from reading Mark’s article, I knew I was not the only one. After some struggling and a bit of luck, I finally figured out what was going on.

What went wrong?

Our KEY1 is the CPU’s KEY2, and our KEY2 is the CPU’s KEY1.
This is what happened:

Our perspective | CPU perspective
----------------+----------------------------------
Send KEY1       | Receive KEY2, Do nothing
Send KEY2       | Receive KEY1, Do nothing
Write           | Do nothing
Lock            | Do nothing
Send KEY1       | Receive KEY2, Unlock EEPROM write
Send KEY2       | Receive KEY1, Do nothing
Write           | Write to EEPROM
Lock            | Restrict EEPROM write
Send KEY1       | Receive KEY2, Do nothing
Send KEY2       | Receive KEY1, Do nothing
Write           | Do nothing
Lock            | Do nothing
Send KEY1       | Receive KEY2, Unlock EEPROM write
Send KEY2       | Receive KEY1, Do nothing
Write           | Write to EEPROM
Lock            | Restrict EEPROM write
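
The alternating pattern in the table above can be reproduced on a PC with a small model of the unlock logic from the reference manual quote. This is only a sketch under assumptions: the names dukr_model, dukr_reset, dukr_write, dukr_lock and attempt_write are made up for illustration, the rule that clearing DUL does nothing while the area is still locked is my guess, and the model simply assumes that the chip expects the two keys in the opposite order from the one we send.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PC-side model of the DATA unlock logic; not ST code. */
typedef struct {
  uint8_t first_key;   /* key the hardware expects first */
  uint8_t second_key;  /* key the hardware expects second */
  bool first_ok;       /* first key has been accepted */
  bool unlocked;       /* stands in for the DUL bit */
  bool locked_out;     /* wrong second key: protected until reset */
} dukr_model;

/* Model state right after reset, for a given expected key order. */
static dukr_model dukr_reset(uint8_t first_key, uint8_t second_key) {
  dukr_model m = { first_key, second_key, false, false, false };
  return m;
}

/* Models one write to FLASH_DUKR. */
static void dukr_write(dukr_model *m, uint8_t key) {
  if (m->unlocked || m->locked_out) {
    return;  /* further key writes are ignored */
  }
  if (!m->first_ok) {
    /* A wrong first key is harmless: the pair can simply be re-entered. */
    m->first_ok = (key == m->first_key);
  } else if (key == m->second_key) {
    m->unlocked = true;
  } else {
    m->locked_out = true;
  }
}

/* Models clearing DUL. Assumption: it does nothing while still locked. */
static void dukr_lock(dukr_model *m) {
  if (m->unlocked) {
    m->unlocked = false;
    m->first_ok = false;
  }
}

/* One "unlock, write, lock" cycle; returns true if the write would land. */
static bool attempt_write(dukr_model *m, uint8_t key1, uint8_t key2) {
  dukr_write(m, key1);
  dukr_write(m, key2);
  bool written = m->unlocked;
  dukr_lock(m);
  return written;
}
```

With a model created by dukr_reset(0x56, 0xAE), repeated calls to attempt_write(&m, 0xAE, 0x56) return false, true, false, true, which is the "sometimes it works, sometimes it doesn’t" behavior from the table.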

The hardware keys (MASS keys) should always be 0x56 followed by 0xAE, for both EEPROM and FLASH memory. There is no need to reverse the order for the EEPROM.

The document is wrong; the official driver is wrong; everything else is just wrong. As simple as that.

But on second thought, I was wrong

The world is not full of simpletons, and ST’s engineers are hardly among them.

My speculative instinct kicked in. I guess what we see here is a deliberate fix for the same bug, but applied in the opposite direction and under the wrong assumptions about the underlying flaw.

What did ST do?

I guess that they, just like me, concluded that something was wrong and reversed the order of the MASS keys for the EEPROM everywhere: they edited the documents, made changes to the official drivers, and so on. This is probably why we see the reverse-ordered keys for EEPROM unlocking in the reference manual and, ultimately, the following comment in the official driver.

/* Warning: keys are reversed on data memory !!! */

However, at this point, we know that the fix failed. There is hardware that accepts the keys in ascending order (mine, for instance) and hardware that accepts them in reverse order.

The root of everything

At this point, I can only guess that the real underlying flaw is that the internal state of the FLASH_DUKR register does not get reset properly under some circumstances. This can’t be solved merely by reversing the order of the MASS keys; no matter what the order is, there will always be exceptional cases.

The workaround/mitigation

Fortunately, a software mitigation for the bug is possible. We can exploit the fact that the STM8 allows a wrong attempt at the first byte of the MASS keys without locking up the device. As the reference manual puts it:

If the key available on the data bus is incorrect, the application can re-enter two MASS
keys to try unprotecting the DATA area

Even if we don’t know which byte the FLASH_DUKR register will consider as the first byte of the unlocking sequence, we can just brute force the keys.

// with SFR definitions from:
// https://github.com/the-cave/stm8s-header
while (!(FLASH->IAPSR & FLASH_IAPSR_DUL)) {
  // the order here does not matter anymore
  FLASH->DUKR = FLASH_RASS_KEY1;
  FLASH->DUKR = FLASH_RASS_KEY2;
}

Now, there are two possible scenarios with the workaround.
We will be able to unlock the device either by

Our perspective | CPU perspective
----------------+----------------------------------
Send KEY1       | Receive KEY1, Do nothing
Send KEY2       | Receive KEY2, Unlock EEPROM write
Write           | Write to EEPROM
Lock            | Restrict EEPROM write

or

Our perspective | CPU perspective
----------------+----------------------------------
Send KEY1       | Receive KEY2, Do nothing
Send KEY2       | Receive KEY1, Do nothing
Send KEY1       | Receive KEY2, Unlock EEPROM write
Send KEY2       | Receive KEY1, Do nothing
Write           | Write to EEPROM
Lock            | Restrict EEPROM write

If you want to be more cautious, you can also use the watchdog timer to keep the device from locking up.
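
Both scenarios can be checked on a PC against a stand-in for the chip. Treat this as a rough sketch, not as the code from my project: fake_chip, chip_unlocked, chip_write_key and unlock_with_retry are hypothetical names, the stand-in only follows the rules quoted from the reference manual, and the retry limit is something I added here so the loop can give up on its own (on real hardware the watchdog would play that role).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PC-side stand-in for the chip; not ST code. */
typedef struct {
  uint8_t expect[2];  /* key order this particular chip expects */
  int accepted;       /* how many keys have been accepted so far */
  bool dead;          /* wrong second key: locked until reset */
} fake_chip;

/* Stands in for reading the DUL bit. */
static bool chip_unlocked(const fake_chip *c) {
  return c->accepted == 2;
}

/* Models one write to FLASH_DUKR. */
static void chip_write_key(fake_chip *c, uint8_t key) {
  if (c->dead || chip_unlocked(c)) {
    return;  /* ignored */
  }
  if (key == c->expect[c->accepted]) {
    c->accepted++;
  } else if (c->accepted == 1) {
    c->dead = true;  /* wrong second key */
  }
  /* Wrong first key: nothing happens, the pair may be re-entered. */
}

/*
 * The brute-force loop with an upper bound on the number of key pairs.
 * Returns the number of pairs written, or -1 if the chip never unlocked.
 */
static int unlock_with_retry(fake_chip *c, uint8_t key1, uint8_t key2,
                             int max_pairs) {
  int pairs = 0;
  while (!chip_unlocked(c)) {
    if (pairs == max_pairs) {
      return -1;
    }
    chip_write_key(c, key1);
    chip_write_key(c, key2);
    pairs++;
  }
  return pairs;
}
```

A stand-in that expects the keys in the order we send them unlocks after one pair, and one that expects the opposite order unlocks after two pairs, just like the two tables. If the second key is simply wrong (for example 0xAE followed by 0x00 on a chip expecting 0xAE, 0x56), the stand-in stays locked and the function returns -1, which is the case where an unbounded loop would need the watchdog.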

The faulty chip gradually changed

This old hardware bug managed to stay under the radar for at least five years.
Mark’s article was published in 2013, and my discovery and mitigation came in 2018.

After running the mitigated solution for some time, the faulty STM8 gradually changed its behavior to match the reference manual. Now the buggy behavior is gone; the standard driver works just fine on the same faulty hardware that previously needed the mitigation. However, since we do not yet understand why it changed, I recommend using the mitigated solution for EEPROM unlocking in every situation.

The kind of flaw that is hardest to catch is the one that surfaces inconsistently.

See some code in an actual project

This article is a rewrite of my 2018 article on GitHub.
The original article contains some more information about my project back then, along with actual code you can compile.
Head there if you want to learn a few more things about this bug.