Department of Computer Science and Technology

Security Group

SHA3-32bit dataset: building templates for recovering SHA-3 inputs

The SHA3-32bit dataset contains recordings of the power-supply current changes of the 32-bit processor STM32F303RCT7, which has one ARM Cortex-M4 core, on a ChipWhisperer-Lite (CW-Lite) board. We recorded the traces with an NI PXIe-5160 10-bit oscilloscope, which can sample at 2.5 GS/s into 2 GB of sampling memory. An NI PXIe-5423 waveform generator served as the external clock source, supplying the target board with a 5 MHz square-wave signal.

More details of the attack are described in the following paper:

Because the datasets are quite large, we provide only part of the data here as an example. If you are interested in access to the full data, please contact the authors by email.

Source code

The source code of the SHA-3 implementation on CW-Lite is available below.

Recording scripts

The scripts to control the recording platform are available below.

The reference trace, which is the average of 1600 recorded raw traces:

A raw trace contains 7500000 samples (Raw[0] to Raw[7499999]), covering 15000 clock cycles, i.e. 500 samples per clock cycle. In the later experiments, we defined the first clock cycle (Clock[0]) as the 500 samples from Raw[75455] to Raw[75954], where Clock[0][0] = Raw[75455], and we used the samples of the 14500 clock cycles starting at Clock[0].
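The indexing convention above can be illustrated with a minimal NumPy sketch. The function name `split_into_cycles` and the synthetic input are assumptions made for illustration only; they are not taken from the published code.

```python
import numpy as np

SAMPLES_PER_CYCLE = 500  # 7500000 samples / 15000 clock cycles
FIRST_SAMPLE = 75455     # Clock[0][0] = Raw[75455]
CYCLES_USED = 14500      # clock cycles used, starting at Clock[0]


def split_into_cycles(raw):
    """Return a (14500, 500) array with clock[x][i] = raw[75455 + 500*x + i].

    Hypothetical helper: it only restates the indexing rule in code.
    """
    raw = np.asarray(raw)
    end = FIRST_SAMPLE + CYCLES_USED * SAMPLES_PER_CYCLE
    if raw.ndim != 1 or raw.shape[0] < end:
        raise ValueError("raw must be a 1-D trace with at least %d samples" % end)
    return raw[FIRST_SAMPLE:end].reshape(CYCLES_USED, SAMPLES_PER_CYCLE)


# Synthetic stand-in for a recorded trace: each sample holds its own index.
raw = np.arange(7500000)
clock = split_into_cycles(raw)
```

With this synthetic input, `clock[x][i]` equals the raw index it was taken from, which makes the mapping easy to check by hand.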

Detection dataset

We recorded traces of 16000 Keccak-f permutations for interesting clock cycle detection. The raw data are stored in 100 sets, such as:

The raw traces were later processed to the processed data (Code_preprocessing_20240328.zip), where each clock cycle contains only one sample (S[x], where S[x] = sum(Clock[x][20], ..., Clock[x][69])), such as:
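As a rough illustration of the per-cycle compression S[x] = sum(Clock[x][20], ..., Clock[x][69]), the following sketch sums 50 samples of every clock cycle. The function name `compress_cycles` and the toy input are hypothetical; the actual preprocessing is in the code archive named above.

```python
import numpy as np


def compress_cycles(clock, first=20, last=69):
    """Reduce each clock cycle to one sample: S[x] = sum(clock[x][first..last]).

    `clock` is a 2-D array with one row per clock cycle (500 samples per row
    for the raw traces). Hypothetical helper, not the published code.
    """
    clock = np.asarray(clock)
    # `last` is inclusive in the text, so the slice end is last + 1.
    return clock[:, first:last + 1].sum(axis=1)


# Toy input: two clock cycles whose samples are 0, 1, ..., 499.
toy_clock = np.tile(np.arange(500), (2, 1))
S = compress_cycles(toy_clock)
```

For the toy input, each S[x] is 20 + 21 + ... + 69 = 2225.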

With the pre-calculated intermediate_values.zip (updated: 2024-03-28) (code: Code_intermediate_values_20240328.zip), the detection results of the interesting clock cycle sets are:

where the related code files are:

The interesting clock cycle sets contain the indices of the interesting clock cycles for each intermediate 32-bit word, where we use the tag "A00" to represent state \(\alpha'_{0}\), "B01" for \(\beta_{1}\), "C02" for \(\mathbf{C}_{2}\), etc. Please check the file "readme.txt" for detailed instructions on using our Python code.

Profiling dataset

We recorded traces of 64000 Keccak-f permutations to profile the templates. The raw data are stored in 400 sets, such as:

The raw traces were later processed by resampling down to 10 samples per clock cycle, such as:
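The text does not state how the resampling was performed. The sketch below assumes the simplest option, averaging consecutive blocks of 50 raw samples so that 500 samples per clock cycle become 10; this is an assumption for illustration, and the published preprocessing code may use a different filter. The name `downsample_cycles` is hypothetical.

```python
import numpy as np


def downsample_cycles(clock, samples_out=10):
    """Average consecutive blocks so each cycle keeps `samples_out` samples.

    ASSUMPTION: block averaging stands in for the unspecified resampling step.
    `clock` is a 2-D array with one row per clock cycle.
    """
    clock = np.asarray(clock, dtype=float)
    n_cycles, samples_in = clock.shape
    if samples_in % samples_out != 0:
        raise ValueError("samples per cycle must be a multiple of samples_out")
    block = samples_in // samples_out  # 50 for 500 -> 10
    return clock.reshape(n_cycles, samples_out, block).mean(axis=2)


# Toy input: two clock cycles whose samples are 0, 1, ..., 499.
toy_clock = np.tile(np.arange(500), (2, 1))
resampled = downsample_cycles(toy_clock)
```

Each output sample is the mean of one 50-sample block, so the first value of every toy cycle is 24.5 and the last is 474.5.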

The processed trace sets were then further processed according to the interesting-clock-cycle sets, by concatenating the samples of the interesting clock cycles of the target 32-bit word into trace fragments, such as:

which are for the first 32-bit word in state \(\alpha'_{0}\).
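The fragment construction described above amounts to selecting the samples of the interesting clock cycles and joining them per trace. A minimal sketch follows; `build_fragments`, the array layout (one row per trace, 10 samples per clock cycle stored consecutively) and the cycle indices are assumptions for illustration and may not match the file format of the published data.

```python
import numpy as np


def build_fragments(traces, cycles, samples_per_cycle=10):
    """Concatenate the samples of the selected clock cycles for every trace.

    traces: array of shape (n_traces, n_cycles * samples_per_cycle)
    cycles: indices of the interesting clock cycles for one target word
    Returns an array of shape (n_traces, len(cycles) * samples_per_cycle).
    Hypothetical helper, not the published code.
    """
    traces = np.asarray(traces)
    cycles = np.asarray(cycles, dtype=int)
    n_traces = traces.shape[0]
    # View every trace as (cycle, sample-within-cycle), then pick the cycles.
    per_cycle = traces.reshape(n_traces, -1, samples_per_cycle)
    return per_cycle[:, cycles, :].reshape(n_traces, -1)


# Toy input: 2 traces of 5 clock cycles each, samples numbered consecutively.
toy_traces = np.arange(2 * 5 * 10).reshape(2, 50)
fragments = build_fragments(toy_traces, cycles=[1, 3])
```

In the toy example, the first fragment consists of samples 10 to 19 followed by samples 30 to 39 of the first trace.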

The intermediate values for target bytes are as follows:

With the trace fragments, the intermediate values, as well as the interesting clock cycle sets, we can build the templates:

Testing dataset (SHA3-512)

In our experiments, we tested all SHA-3 and SHAKE functions with inputs that can be absorbed within one or two invocations of the Keccak-f permutation. Here, as an example, we publish the SHA3-512 testing dataset, with inputs absorbed in a single invocation.

We recorded 1000 traces for this test. The raw data are stored in 10 zip files, such as:

Similar to what we had done on the profiling traces, the raw traces were later processed by resampling down to 10 samples per clock cycle and were also stored in the following 10 zip files:

We also keep all the corresponding input and output strings of these recorded traces, to check if we correctly predicted these values:

With these processed traces and corresponding data, as well as the interesting-clock-cycle sets and the templates, we can finish this test with the following code:

A faster version of the Python code

We have optimized our template building and testing Python code as follows:

This optimized version mainly relies on the parallelization offered by the NumPy library to reduce the computing time. We ran it on a 32-core server with 256 GB of memory.

Note that the new templates and testing results differ slightly from the original version published in our paper, due to some precision issues related to floating-point arithmetic. However, the differences are not statistically significant, and we publish the results of the new version here: