# EntropyKey v3 firmware internals
This document describes parts of the EntropyKey v3's firmware internals, with
the intent of aiding design and debugging.
The firmware is written in Rust. It uses mvirkkunen's USB stack for STM32
and it is an RTFM application.
Data flows from the device to the host over a CDC serial link for now, though
this is an internal implementation detail which is not exposed to the rest of
the firmware, so that it can be replaced with neater raw APIs later.
As such, all handling of the "protocol" is kept within the USB layer at this
Data flow within the device is somewhat more complex as there are a number of
stages that data has to be processed in. Firstly the two generators are hooked
up to SPI devices, and as such, we have to process the data from them SPI word
by SPI word. Then we have a number of software mixing stages which are
semi-asynchronous from one another, and then an output framing stage which
flows into the USB layer as described above.
## SPI setup
The SPI setup is critical to the performance of the EntropyKey. The SPI devices
are configured at the hardware level to be slaves, with their clock driven from
TIM3_CH1, allowing the capture bitrate to be varied with moderate ease. The
devices should be set with software NSS permanently active, 16 bit framing, and
receive only (MOSI only enabled, BIDIMODE=0, RXONLY=1).
The clock polarity and phase are not important, other than that we wish both
devices to be configured the same to reduce the chances of not spotting
correlation. Likewise, MSB-first versus LSB-first is irrelevant but must be the
same between the devices.
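As a rough illustration of the settings above, the CR1 value for either SPI block could be assembled along these lines. The bit positions are those of the STM32F1 `SPI_CR1` register; the constant names and the `spi_slave_rx_cr1` helper are made up for this sketch, and the real firmware may well go through a PAC or HAL crate instead.

```rust
// SPI_CR1 bit positions as documented for STM32F1 parts.
const CPHA: u16 = 1 << 0;
const CPOL: u16 = 1 << 1;
const MSTR: u16 = 1 << 2; // left clear: slave mode
const SPE: u16 = 1 << 6; // peripheral enable
const LSBFIRST: u16 = 1 << 7;
const SSI: u16 = 1 << 8; // left clear: internal NSS held low (selected)
const SSM: u16 = 1 << 9; // software NSS management
const RXONLY: u16 = 1 << 10;
const DFF: u16 = 1 << 11; // 16 bit frames
const BIDIMODE: u16 = 1 << 15; // left clear: two-line mode

/// Hypothetical helper: CR1 for a receive-only, 16 bit, software-NSS slave.
/// Both SPI blocks should be given the same arguments.
pub fn spi_slave_rx_cr1(cpol: bool, cpha: bool, lsb_first: bool) -> u16 {
    let mut cr1 = SPE | SSM | RXONLY | DFF;
    if cpol {
        cr1 |= CPOL;
    }
    if cpha {
        cr1 |= CPHA;
    }
    if lsb_first {
        cr1 |= LSBFIRST;
    }
    // Slave mode, NSS active and BIDIMODE=0 all rely on these bits being 0.
    debug_assert_eq!(cr1 & (MSTR | SSI | BIDIMODE), 0);
    cr1
}
```

Using one function for both devices is a cheap way of guaranteeing that polarity, phase and bit order cannot drift apart between them.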
For efficiency, we tie DMA channels to the SPI devices so that the data is read
out of the SPI devices by DMA rather than by a software interrupt handler. This
lets us take fewer interrupts while handling the data flow.
### DMA setup for SPI
Our device only has DMA1 since it's not high density.
We use channel 2 for SPI1_RX and channel 4 for SPI2_RX.
As we're using 16 bit framing, we do a 16 bit Peripheral-to-Memory transfer
setup, arranging the transfers in groups of 64 halfwords. Each buffer is
therefore 128 bytes, consuming 256 bytes of memory for the two DMA buffers in
use.
We trigger from the half-transfer and full-transfer interrupts to consume the
first or second halves of the buffers respectively, and the channels are
arranged in circular mode so that once set up they simply generate IRQs and
are otherwise uninteresting to us.
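A minimal sketch of the channel configuration described above, assuming the STM32F1 `DMA_CCRx` bit layout. The names `DmaBuffer` and `spi_rx_dma_ccr` are invented for illustration, and the enable bit is deliberately left out so that the caller can program the addresses and count (64) first.

```rust
/// One circular DMA buffer: 64 halfwords, i.e. 128 bytes.
pub const HALFWORDS_PER_BUFFER: usize = 64;
pub type DmaBuffer = [u16; HALFWORDS_PER_BUFFER];

// DMA_CCRx bit positions as documented for STM32F1 parts.
const TCIE: u32 = 1 << 1; // full-transfer interrupt
const HTIE: u32 = 1 << 2; // half-transfer interrupt
const CIRC: u32 = 1 << 5; // circular mode
const MINC: u32 = 1 << 7; // increment the memory address
const PSIZE_16: u32 = 0b01 << 8; // 16 bit peripheral accesses
const MSIZE_16: u32 = 0b01 << 10; // 16 bit memory accesses
const PL_SHIFT: u32 = 12; // 2 bit software priority level

/// Hypothetical helper: CCR value for an SPI RX channel. DIR is left at 0
/// (read from peripheral) and EN is not set here.
pub fn spi_rx_dma_ccr(priority: u32, take_irqs: bool) -> u32 {
    let mut ccr = CIRC | MINC | PSIZE_16 | MSIZE_16 | ((priority & 0b11) << PL_SHIFT);
    if take_irqs {
        ccr |= HTIE | TCIE;
    }
    ccr
}
```

The `take_irqs` flag is a parameter because not every channel necessarily needs to raise interrupts of its own.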
We can control the flow of data out of the SPI devices by varying the rate of
the clock coming from the TIM3_CH1 channel. It's worth noting that we ought
to match up CPOL/CPHA and the idle state of that channel so that if we disable
the PWM on that channel, then the SPI blocks aren't unhappy.
Because the two channels are lock-stepped and channel 2 has higher priority than
channel 4, if we only take the interrupt from channel 4 we can be confident that
channel 2's content has already been filled to the same level. As such we take
one interrupt which tells us that the two buffers have reached the same half-fill
(first or second half) and we can process them both without needing to wait
for a second interrupt.
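The single-interrupt scheme could look something like the following sketch. `Half` and `ready_halves` are hypothetical names, and plain array references stand in for the DMA buffers; real firmware would need to take care over how it shares those buffers with the DMA engine.

```rust
/// Which half of the circular buffers has just been completed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Half {
    First,  // signalled by the half-transfer interrupt
    Second, // signalled by the full-transfer interrupt
}

/// Given the event seen on channel 4, return the matching half of both
/// generators' buffers: 32 halfwords (64 bytes) each.
pub fn ready_halves<'a>(
    gen_a: &'a [u16; 64],
    gen_b: &'a [u16; 64],
    half: Half,
) -> (&'a [u16], &'a [u16]) {
    let range = match half {
        Half::First => 0..32,
        Half::Second => 32..64,
    };
    (&gen_a[range.clone()], &gen_b[range])
}
```

The half that is handed back is the one the DMA engine has just finished writing, so it can be read while the other half fills.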
## Data flow from the SPI DMA buffers
On IRQ there will be 64 bytes of generated data on each of the two channels. To
proceed, three entropy estimations have to occur. We must estimate the entropy
in each of the buffers independently, and also in a virtual buffer formed by
XORing the two together. The running average of those estimations provides us
with an idea of whether or not the generators are tending toward an extreme
or correlating with each other.
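To make the three estimations concrete, here is a host-side sketch. The estimator itself is an assumption: the text does not say how entropy is estimated, so a naive byte-frequency Shannon estimate is used, and it is written with `f64` for readability where the firmware would presumably want integer or fixed-point arithmetic. `estimate_shannons` and `xor_buffers` are illustrative names.

```rust
/// Naive byte-frequency Shannon estimate for the whole slice, in shannons.
/// (Assumed estimator, not necessarily what the firmware does.)
pub fn estimate_shannons(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0u32; 256];
    for &byte in data {
        counts[byte as usize] += 1;
    }
    let n = data.len() as f64;
    let per_byte: f64 = counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum();
    per_byte * n
}

/// The "virtual buffer": the two generators' data XORed together.
pub fn xor_buffers(a: &[u8; 64], b: &[u8; 64]) -> [u8; 64] {
    let mut out = [0u8; 64];
    for i in 0..64 {
        out[i] = a[i] ^ b[i];
    }
    out
}
```

If the two generators ever produced identical data, the XORed buffer would be all zeroes and its estimate would drop to nothing even though each generator looks healthy on its own, which is exactly the correlation case the third estimate exists to catch.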
If the total number of shannons estimated by this process is less than the hash
size of the mixing function then we designate that block as failed and skip the
following stages, instead waiting for the next interrupt.
We then mix the full 128 bytes of data from the two generators into our mixing
function, crediting it with the entropy estimates generated from the first
stage processing. The maximum amount of entropy which could be credited is
therefore 1024 shannons (128 bytes of 8 bits each) during this stage of
processing. Provided that the total is at least the hash size, we're OK and we
will claim the hash size of shannons as we move on.
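The crediting rule can be sketched as below. The 128 bit hash size is only an assumption here, and treating "the total" as the sum of the two per-generator estimates is one reading of the text; `credit_block` is a hypothetical name.

```rust
/// Assumed hash size of the mixing function, in bits.
pub const HASH_BITS: u32 = 128;
/// Two generators, 64 bytes each, 8 bits per byte.
pub const MAX_CREDIT_BITS: u32 = 2 * 64 * 8;

/// Returns `None` for a failed block, or the number of shannons claimed.
pub fn credit_block(est_a: u32, est_b: u32) -> Option<u32> {
    // Never credit more than the data could physically hold.
    let total = est_a.saturating_add(est_b).min(MAX_CREDIT_BITS);
    if total < HASH_BITS {
        None // skip mixing and wait for the next interrupt
    } else {
        Some(HASH_BITS) // we only ever claim the hash size
    }
}
```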
## Flowing data into the FIPS checks
Once the data leaves the hashing function attached to the SPI DMA buffers, we
have to gather it together into a FIPS 140-2 sized buffer for validation. This
process requires that we acquire 20,000 bits (2500 bytes) of data which, if we
have a 128 bit hashing function, will effectively be represented by a 2512 byte
buffer (157 hashes) of which we will use 156 and a quarter.
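The sizing arithmetic is easy to get wrong, so here it is as a couple of `const fn`s (the names are illustrative only):

```rust
/// Bits needed for one FIPS 140-2 test run.
pub const FIPS_BITS: usize = 20_000;

/// Whole hash outputs needed to cover FIPS_BITS (rounded up).
pub const fn hashes_needed(hash_bits: usize) -> usize {
    (FIPS_BITS + hash_bits - 1) / hash_bits
}

/// Bytes of buffer needed to hold that many hash outputs.
pub const fn buffer_bytes(hash_bits: usize) -> usize {
    hashes_needed(hash_bits) * hash_bits / 8
}
```

For a 128 bit hash this gives 157 hashes and 2512 bytes, and the same functions show what would happen to the RAM budget if a wider hash were chosen.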
Ideally we'll fit a pair of these buffers into RAM, though we accept that it's
possible we won't be able to. Initial estimates suggest we'll manage it. The
USB stack consumes about 600B and all the above maybe 1 or 2 KiB, so that ought
to leave plenty of room for a pair of FIPS buffers (one accumulating, one being
processed / sent out).
When a FIPS buffer is filled, it has to be validated by means of the three FIPS
140-2 checks. These are known as the monobit, poker, and runs checks. FIPS
defines them with floating point and a number of other hassles, so we have our
own integer-only checks customised for our target.
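As an idea of what integer-only checks can look like, here is a sketch of the monobit and poker tests using the FIPS 140-2 acceptance ranges (9725 < ones < 10275, and 2.16 < X < 46.17 for the poker statistic). The poker bounds are multiplied through by 5000 so that no division or floating point is needed. The runs check is omitted for brevity, the function names are invented, and the firmware's own checks may be arranged quite differently.

```rust
/// 20,000 bits.
pub const FIPS_BYTES: usize = 2500;

/// Monobit: the number of one bits must lie strictly between the bounds.
pub fn monobit_ok(buf: &[u8; FIPS_BYTES]) -> bool {
    let ones: u32 = buf.iter().map(|b| b.count_ones()).sum();
    ones > 9_725 && ones < 10_275
}

/// Poker: X = (16/5000) * sum(f[i]^2) - 5000 over the 5000 nibbles.
/// Multiplying 2.16 < X < 46.17 by 5000 and adding 5000 * 5000 gives
/// 25_010_800 < 16 * sum(f[i]^2) < 25_230_850.
pub fn poker_ok(buf: &[u8; FIPS_BYTES]) -> bool {
    let mut f = [0u32; 16];
    for &byte in buf.iter() {
        f[(byte >> 4) as usize] += 1;
        f[(byte & 0x0f) as usize] += 1;
    }
    // Worst case is 16 * 5000^2 = 400_000_000, which fits in a u32.
    let scaled: u32 = 16 * f.iter().map(|&c| c * c).sum::<u32>();
    scaled > 25_010_800 && scaled < 25_230_850
}
```

A buffer of repeated `0x55` bytes is a handy sanity check: it has exactly 10,000 one bits so monobit passes, but every nibble is identical so poker fails.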
Whether the buffer passes or fails the FIPS tests is irrelevant, it is sent to
## Analogue checks
We're going to be monitoring the 5v, 3.3v, HT line, and the current temperature
of the µC. These are measured on a single ADC in a round-robin fashion, moving
on to the next channel every time we complete a conversion. To do the best job
we can, we run the ADC in its slowest possible mode. We also use SCAN mode and
a DMA channel to ensure that we take interrupts as infrequently as possible.
The channels scanned are always VREF_INT, TEMP_INT, and the three voltages
above.
We set the sampling time to the longest possible, and reduce ADCCLK to as slow
as makes sense. Our goal is to take as few interrupts as we can, while still
usefully monitoring the lines. We have to use ADC1 for this because ADC2 only
has DMA by means of slaving to ADC1 anyway and also VREF_INT and TEMP_INT are
only muxed into ADC1's inputs.
The slowest we can set the ADC clock to is PCLK2/8, which is therefore 9MHz
since we're running APB2 at full speed for the sake of DMA and the like on the
SPI.
The recommended sample time for the temperature sensor is 17.1µs and at 9MHz
each cycle is one ninth of a microsecond. As such we should set the sample
time selection bits to 0b111 which is 239.5 cycles (slowest), or about 26.6µs,
thereby allowing for full precision. Once the DMA complete interrupt is taken,
we can post
updated values for the measured entries to the statistics process. Conversion
is started *ONLY* whenever a FIPS buffer completes.
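The timing arithmetic above can be checked with a few lines of integer maths. This sketch assumes PCLK2 is 72MHz (which is what a 9MHz ADC clock at /8 implies) and the 12.5 cycle conversion overhead documented for the STM32F1 ADC. Sample times are passed in half-cycles so that 239.5 can be written as the integer 479; all names are illustrative.

```rust
/// Assumed APB2 clock.
pub const PCLK2_HZ: u64 = 72_000_000;
/// Largest ADC prescaler.
pub const ADC_PRESCALER: u64 = 8;
/// Fixed conversion overhead of 12.5 ADC cycles, in half-cycles.
pub const CONVERSION_HALF_CYCLES: u64 = 25;

pub const fn adc_clock_hz() -> u64 {
    PCLK2_HZ / ADC_PRESCALER
}

/// Sample time in nanoseconds, rounded down.
pub const fn sample_time_ns(sample_half_cycles: u64) -> u64 {
    sample_half_cycles * 1_000_000_000 / (2 * adc_clock_hz())
}

/// Time for one scan over `channels` inputs, in nanoseconds.
pub const fn scan_time_ns(channels: u64, sample_half_cycles: u64) -> u64 {
    channels * (sample_half_cycles + CONVERSION_HALF_CYCLES) * 1_000_000_000
        / (2 * adc_clock_hz())
}
```

With these assumptions the slowest setting samples for about 26.6µs, comfortably longer than the recommended temperature sensor sample time, and a full scan of all five channels takes 140µs.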
### DMA for ADCs
The DMA for ADC1 comes in only on DMA1 Channel 1. As such that needs to be set
to a very low priority so that requests for DMA don't overrule SPI completions.
The DMA transfers have to be 16 bit on both the peripheral and memory sides,
and must cover all five scanned channels.
## Statistics and what they mean
If the input mixers go "bad" then the system throughput is essentially halted.
Where 'bad' means that the input entropy estimators are dropping too low too
For the FIPS checks to signal a failure there has to be a significant number
of failed blocks within the test period. A reasonable approach is to permit
no more than 6 or 8 failed blocks per 4096 tested blocks. If we reach that
limit then we also lock the system out.
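One way of tracking this is a simple fixed window, sketched below. The limit of 8 is just one of the two values floated above, and a fixed (rather than sliding) window is an assumption made to keep the state tiny; `FipsWindow` is a hypothetical type.

```rust
pub const WINDOW_BLOCKS: u32 = 4096;
pub const MAX_FAILED: u32 = 8; // the text suggests 6 or 8

/// Counts failed FIPS blocks within a fixed window of tested blocks.
#[derive(Debug, Default)]
pub struct FipsWindow {
    tested: u32,
    failed: u32,
}

impl FipsWindow {
    /// Record one block's result. Returns true if the limit has been
    /// reached and the system should lock out.
    pub fn record(&mut self, passed: bool) -> bool {
        if self.tested == WINDOW_BLOCKS {
            // Start a fresh window.
            self.tested = 0;
            self.failed = 0;
        }
        self.tested += 1;
        if !passed {
            self.failed += 1;
        }
        self.failed >= MAX_FAILED
    }
}
```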
With respect to the analogue values, we're looking for outliers. For example
if the 3.3v line goes too low or high, or the 5v line drops, the HT line looks
iffy, etc. Temperature values are also gently poked at to ensure that we're not
being margined by temperature. Any extreme value for too long will result in
the system locking out.
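"Any extreme value for too long" suggests a range check with a streak counter, roughly as follows. `RangeMonitor` is a made-up name, and the bounds and streak length passed to it are placeholders to be chosen from real measurements rather than values taken from the design.

```rust
/// Watches one analogue reading for sustained out-of-range values.
pub struct RangeMonitor {
    min: u16,
    max: u16,
    limit: u32,  // consecutive bad samples tolerated before lock out
    streak: u32, // current run of bad samples
}

impl RangeMonitor {
    pub fn new(min: u16, max: u16, limit: u32) -> Self {
        RangeMonitor { min, max, limit, streak: 0 }
    }

    /// Feed one sample. Returns true once the reading has been out of
    /// range for `limit` consecutive samples.
    pub fn update(&mut self, sample: u16) -> bool {
        if sample < self.min || sample > self.max {
            self.streak = self.streak.saturating_add(1);
        } else {
            self.streak = 0;
        }
        self.streak >= self.limit
    }
}
```

A single glitch resets as soon as a good sample arrives, so only a sustained excursion trips the lock out.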
A locked out system is achieved by simply quiescing the timer which drives the
## Managing the throughput of the device
After measuring the theoretical maximum throughput of the device, we can choose
a slightly slower bitrate which we set as the maximum supported timer clock
rate. After that we can slow the timer down if we detect that the output
buffers are consistently having data thrown away, thereby reducing the power
load. If we're often blocked because there's no data available to send and
there's room to speed up the timer, we can do so.
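A very small controller along those lines might look like this. It is only a sketch: `RateControl` is a hypothetical type, the rate is expressed as a timer period (larger means slower), and deciding what counts as "consistently" dropping or "often" starved is left to the caller, which passes in a summary for each observation window.

```rust
/// Adjusts the SPI clock timer period between two fixed bounds.
pub struct RateControl {
    period: u16,
    fastest: u16, // shortest period allowed (the measured safe maximum rate)
    slowest: u16, // longest period allowed
}

impl RateControl {
    /// Start at the fastest supported rate.
    pub fn new(fastest: u16, slowest: u16) -> Self {
        RateControl { period: fastest, fastest, slowest }
    }

    pub fn period(&self) -> u16 {
        self.period
    }

    /// `dropping`: output data was being thrown away during the window.
    /// `starved`: the host was waiting on us with nothing to send.
    pub fn adjust(&mut self, dropping: bool, starved: bool) {
        if dropping && self.period < self.slowest {
            self.period += 1; // slow down to save power
        } else if starved && self.period > self.fastest {
            self.period -= 1; // speed up, there is still room
        }
    }
}
```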