This will arguably be one of the more boring posts on the blog, but I figure some keen eyes might extract something useful out of it, even if not technical.

Note – if you would like to take a closer look at the scratch project, you can find it here.

A couple of months back I started reading an interesting book by the late Richard Hamming, "The Art of Doing Science and Engineering: Learning to Learn".

It's a good book, but we only get one chapter on Hamming code (and *four* on digital filters); the earlier chapters, however, are definitely worth a read. Truth be told, I hardly understood most of it, but I did love the narrative on learning, how to learn, and the story of how the Hamming code was discovered.

Since my interest in digital filters was, and is, somewhat non-existent, I decided Hamming code would be a great place to learn a couple of interesting topics and tools, for a few reasons:

- It's a well-defined problem domain (no room for interpretation);
- Multiple components with *some* shared dependency between components;
- Small, but not too small, of an algorithm to optimise and learn from.

With these characteristics I wanted to explore a few areas around profiling, benchmarking, assembly, concurrency, SIMD and overall bit twiddling all in one place and done gradually.

## Introduction

I will spare you the detailed explanation of Hamming code as there is already very good work out there, as well as different ways of approaching the problem depending on hardware and software. That said, for both of these cases I **highly** recommend two videos by 3Blue1Brown and Ben Eater respectively.

In this particular task I want to be able to do one thing and do it fast: encode, add noise to, and decode binary files (yes, *technically* that's three things). Here are some initial attempts at encoding and decoding a lorem ipsum text file.

Well, at least it's a roughly equal amount of gibberish. Maybe I got the byte slices all wrong?

Seems like I did, but instead of fixing it, I made it worse. Some alignment issues, algorithm issues, and endianness issues later, we managed to get some basic text encoding & decoding working that should translate to any binary file (in theory).

Time to apply some synthetic noise on a per-block basis, with a 66.67% chance of flipping a single bit. In real-world scenarios binary blocks are interleaved to reduce the chance of a single block having more than 1 bit error, which Hamming cannot handle.

The reason for this is that when bits are damaged in transmission, the damage tends to occur in a short burst where a few adjacent bits are all flipped together. Interleaving blocks builds resiliency against such scenarios.
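To make the interleaving idea concrete, here is a minimal sketch (not the project's code): treat a group of 64 encoded 64-bit blocks as a 64×64 bit matrix and transpose it, so that a burst of adjacent flipped bits on the wire lands in different blocks. The `interleave` helper is hypothetical.

```rust
// Sketch: bit-interleave a group of 64 encoded 64-bit blocks.
// Bit `j` of input block `i` becomes bit `i` of output block `j`,
// i.e. a 64x64 bit-matrix transpose.
fn interleave(blocks: &[u64; 64]) -> [u64; 64] {
    let mut out = [0u64; 64];
    for (i, block) in blocks.iter().enumerate() {
        for j in 0..64 {
            out[j] |= ((block >> j) & 1) << i;
        }
    }
    out
}
```

A transpose is its own inverse, so applying `interleave` a second time de-interleaves the blocks again.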

```
Block: 0000000001100001011100100111010001100101011100100110000101101000
Encoded: 0011000010111001001110100011001001011100100110000010110110010111
Decoded: 0000000001100001011100100111010001100101011100100110000101101000
Match? true
Block: 0000000001100001011010010110011101110101011001010110011000100000
Encoded: 0011000010110100101100111011101001011001010110011100010000010011
flipping bit: 12
error at bit: 12
Decoded: 0000000001100001011010010110011101110101011001010110011000100000
Match? true
```

I tried my hand at some images and somehow made a cursed cat look less terrifying. It's a feature not a bug.

As it turns out, my file reader is only able to process up to, and exactly, 7168 bytes? I decided not to worry about this for now and move on; I'd be more than happy to get feedback on alternative ways to write this.
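For what it's worth, one alternative shape for the reader is a loop that keeps calling `read` until a full 8-byte block is filled, since a single `read` call may legally return fewer bytes than requested. This is only a sketch; `process_block` is a hypothetical callback standing in for the encoder.

```rust
use std::io::{self, Read};

// Read a stream in full 8-byte (u64) chunks, looping on `read` so
// short reads don't truncate the input; a trailing partial block is
// zero-padded at EOF.
fn read_blocks<R: Read>(mut reader: R, mut process_block: impl FnMut(u64)) -> io::Result<()> {
    let mut buf = [0u8; 8];
    let mut filled = 0;
    loop {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            // EOF: flush a zero-padded partial block, if any.
            if filled > 0 {
                buf[filled..].fill(0);
                process_block(u64::from_be_bytes(buf));
            }
            return Ok(());
        }
        filled += n;
        if filled == 8 {
            process_block(u64::from_be_bytes(buf));
            filled = 0;
        }
    }
}
```

`read_exact` would also work for the full blocks, at the cost of slightly fiddlier EOF handling for the final partial one.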

## Implementation

It took some back-of-the-napkin dry runs until I got the crux of the problem, and after looking at other similar implementations, we narrowed it down to something that worked. I wanted something simple that could be optimised, so we landed on the following initial implementation.

We will go over each component, (**a**) encoder, (**b**) decoder and (**c**) parity check, independently to avoid context switching too much and to streamline the improvements.

### Encoder

The initial encoder is a little bit all over the place but is arguably the simplest part of this implementation. We need to figure out how many parity bits, `len_power`, we need (`log2{64 bits} = 6`) and from that derive the data width, `len` (`2^6 = 64`); you will notice soon enough that this entire dance is not required.

```
pub fn encode(block: &mut u64) -> u64 {
let len_power = (2..).find(|&r| 2u32.pow(r) - r - 1 >= 32).unwrap();
let len = 2usize.pow(len_power);
// ...
}
```

These are the only parameters the encoder needs to be aware of before we can start encoding our message. Starting with `1001` as our example input message, we go through each bit and check whether the index of that bit is a power of 2 and therefore a parity bit. If it is a power of 2, we shift the input message left (i.e. *skip* the current bit); otherwise we insert the bit into the encoded message as is.

```
pub fn encode(block: &mut u64) -> u64 {
// ...
let mut code = 0u64;
for i in 0..len {
// Check if `i` is not a power of 2
if (i != 0) && (i & (i - 1)) != 0 {
code |= (0b1 << i) & *block as u64;
} else {
*block <<= 1;
}
}
// ...
}
```

Lastly, we go through each parity position and compute the remaining parity bits, inserting them into the final encoded message.

```
pub fn encode(block: &mut u64) -> u64 {
// ...
for i in 0..len_power {
// If the parity check is odd, set the bit to 1 otherwise move on.
if !parity(&code, i) {
code |= 0b1 << (2usize.pow(i));
}
}
encoded
}
```
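The `parity` helper isn't shown above, but based on the classic Hamming layout it might look something like the following sketch (an assumption on my part, not the project's actual code): parity group `i` covers every bit whose 0-based position has bit `i` set in its index, and the function returns `true` when that group already has even parity.

```rust
// Hypothetical sketch of the `parity` helper: count the set bits in
// parity group `i` (positions whose index has bit `i` set) and report
// whether the group's parity is even.
fn parity(code: &u64, i: u32) -> bool {
    let mut ones = 0u32;
    for pos in 0..64u32 {
        if pos & (1 << i) != 0 {
            ones += ((code >> pos) & 1) as u32;
        }
    }
    ones % 2 == 0
}
```

A faster variant would fold the word against precomputed group masks (e.g. `0xAAAA_AAAA_AAAA_AAAA` for group 0) and use `count_ones()`, but the loop makes the group membership explicit.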

If we were to think of this graphically, it might look something like this: (**1**) we get our raw input, (**2**) we map the raw input to the encoded output as its data bits, and (**3**) we calculate the parity bits and map them into the encoded output as parity bits.

Let's test the implementation out; the results match the expected output, but running it through a simple benchmark (criterion) does not yield promising results. Our baseline median execution time is **430.44ns** for the encoder, with a large number of odd outliers. We now have something that we can improve.
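For readers who haven't used criterion: the core idea is just sampling many timed runs and looking at the distribution's median rather than the mean, which is what makes the outliers visible. A rough-and-ready sketch of that idea with nothing but the standard library (criterion adds warm-up, outlier classification and proper statistics on top; `median_ns` is a hypothetical helper, not what the post uses):

```rust
use std::hint::black_box;
use std::time::Instant;

// Time `f` over `iters` runs and return the median duration in
// nanoseconds. The median is robust to the occasional slow outlier
// run, unlike a simple mean.
fn median_ns<F: FnMut()>(mut f: F, iters: usize) -> u128 {
    let mut samples: Vec<u128> = (0..iters)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_nanos()
        })
        .collect();
    samples.sort_unstable();
    samples[samples.len() / 2]
}
```

`black_box` is how you stop the optimiser from deleting the work under test entirely, e.g. `median_ns(|| { black_box(encode_something()); }, 1001)`.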

There are some definite quick wins we can work with here. For starters, since we know we're always going to work with `u64` (or any fixed-width block size), `len_power` and `len` do not need to be computed every time we want to encode a block. Let's also remove a `copy` for the internal `encoded` variable that is not needed.

```
pub fn encode(block: &mut u64) -> u64 {
let len_power = 6;
let len = 64;
let mut code = 0u64;
// ... Remove `encoded` and work directly on `code` only
for i in 0..len_power {
// If the parity check is odd, set the bit to 1 otherwise move on.
if !parity(&code, i) {
code |= 0b1 << (2usize.pow(i) - 1);
}
}
code
}
```

Re-running our benchmarks, we now run in a median time of `361.27ns`, a ~17.5% performance increase. We're still seeing some outliers, but the average execution time is now bundled up better.

At this stage, there are two branching conditions left in our encoder, which I will largely leave as they are to avoid over-optimising early on. Let's profile (via `perf`) and plot (via `flamegraph`) our encoder and take a peek at the overall execution time.

This SVG was hand-edited to save space and keep it interactive. It was not fun, and the sampling frequency here is low, but it translates well enough to larger sample rates (~99Hz).