troy@home:~$

Know Your Encodings: Roll your own Base64 encoder


Introduction

Most of us will recognize Base64 when we see it. But how familiar are you with it really? Do you know how it works? In this post we will take a deep dive into Base64 and how it is used to encode data. I think that one of the best ways to learn a process like Base64 encoding is to roll our own code. So, let’s make our own Base64 encoder/decoder in Python.

From Wikipedia:

“Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) in sequences of 24 bits that can be represented by four 6-bit Base64 digits.

Common to all binary-to-text encoding schemes, Base64 is designed to carry data stored in binary formats across channels that only reliably support text content. Base64 is particularly prevalent on the World Wide Web where one of its uses is the ability to embed image files or other binary assets inside textual assets such as HTML and CSS files.”

Before we write any code, let’s familiarize ourselves with the Base64 encoding and decoding process. We will focus on Base64 encoding as defined by RFC 4648, which is the most common standard. The following table shows the encoding scheme for Base64:

Binary Char Binary Char Binary Char Binary Char
000000 A 010000 Q 100000 g 110000 w
000001 B 010001 R 100001 h 110001 x
000010 C 010010 S 100010 i 110010 y
000011 D 010011 T 100011 j 110011 z
000100 E 010100 U 100100 k 110100 0
000101 F 010101 V 100101 l 110101 1
000110 G 010110 W 100110 m 110110 2
000111 H 010111 X 100111 n 110111 3
001000 I 011000 Y 101000 o 111000 4
001001 J 011001 Z 101001 p 111001 5
001010 K 011010 a 101010 q 111010 6
001011 L 011011 b 101011 r 111011 7
001100 M 011100 c 101100 s 111100 8
001101 N 011101 d 101101 t 111101 9
001110 O 011110 e 101110 u 111110 +
001111 P 011111 f 101111 v 111111 /
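
The table follows a simple pattern: the 26 uppercase letters, the 26 lowercase letters, the 10 digits, and finally ‘+’ and ‘/’. As a minimal sketch (the names ALPHABET and TABLE are my own, chosen for illustration), the same mapping can be generated rather than typed out by hand:

```python
import string

# The RFC 4648 alphabet in index order: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Map each 6-bit binary string to its Base64 character
TABLE = {format(i, "06b"): ch for i, ch in enumerate(ALPHABET)}

print(len(TABLE))        # 64
print(TABLE["000000"])   # A
print(TABLE["111111"])   # /
```

Six bits give 2^6 = 64 possible values, which is exactly why the alphabet has 64 characters.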

Encoding

Let’s start with encoding. Below are the steps necessary to convert a text-based message into Base64. Of course, any binary data, such as a jpg file for instance, can be encoded in Base64. However, in this post we are going to focus on encoding text with Base64 in order to keep things simple. Since Base64, as mentioned above, is a binary-to-text encoding scheme, we will first need to convert our text to a binary representation.

  1. Convert each character to its ASCII code in binary representation.
  2. Join the resulting binary strings into a single string.
  3. Add zeros to the end of the binary string to ensure the number of bits is a multiple of 6.
  4. Break this binary string into 6-bit sextets.
  5. Convert each sextet into its Base64 character according to the table.
  6. Join these base64 characters into a single string, adding any necessary ‘=’ characters for padding.

Once we have each character in our text message encoded as binary data, we join those binary strings into one binary string. Next we will break this string into groups of six bits (sextets), but before we do so, we need to ensure that the number of bits in the string is evenly divisible by six. If it is not, we add just enough zeroes to the end of the string until it is. It turns out that in the case of Base64 there are only three possibilities: adding no zeroes, adding two zeroes, or adding four zeroes. Once we have this new binary string, we break it down into sextets and use the above table to convert these sextets into their respective Base64 characters. Lastly, we append a number of ‘=’ characters to indicate how many zeroes we used as padding in step 3. If we appended no zeroes, we add no ‘=’ characters. If we appended two zeroes, we add a single ‘=’. If we appended four zeroes, we add two ‘=’ characters. A simple way to remember this is to divide the number of zeroes appended by 2 to get the number of ‘=’ characters to add.
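
To see why only three cases occur, it helps to run the arithmetic. The input is always a whole number of bytes, so the bit count is a multiple of 8, and a multiple of 8 can only be 0, 2, or 4 bits short of a multiple of 6. Here is a small sketch of that calculation (padding_for is a hypothetical helper name, used only for this illustration):

```python
def padding_for(n_bytes):
    """Return (zero bits appended, '=' characters added) for n_bytes of input."""
    zero_bits = (-8 * n_bytes) % 6   # bits needed to reach a multiple of 6
    return zero_bits, zero_bits // 2

for n in range(1, 7):
    print(n, padding_for(n))
# 1 (4, 2)
# 2 (2, 1)
# 3 (0, 0)
# 4 (4, 2)
# 5 (2, 1)
# 6 (0, 0)
```

The pattern repeats every three bytes, which matches the 24-bit groups mentioned in the Wikipedia quote above.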

Example 1

The best way to understand this is to look at an example. Say that we wanted to encode the following short text in Base64: “Hi!” How would we do it? First, we need to convert each character to its ASCII code in binary representation. The following table shows the results:

Character ASCII (decimal) ASCII (binary)
‘H’ 72 01001000
‘i’ 105 01101001
‘!’ 33 00100001

Next, we join these binary codes into a single string, which yields:

010010000110100100100001

Now we need to append zeroes to this string until the number of bits is a multiple of 6. However, you will notice in this case that we have 24 bits, which is already a multiple of 6. As a result, we do not need to add any trailing zeroes.

Next, we need to break this string down into groups of 6 bits (sextets). Doing so yields four sextets:

'010010', '000110', '100100', '100001'

Next, we use our Base64 table above to convert these to their corresponding Base64 codes. This yields the following string:

'SGkh'

Lastly, we need to append the ‘=’ character to this string either once or twice depending upon how many zeroes we added as padding in step 3. In this case we did not need to add any zeroes, so we also do not need to add any ‘=’ characters. So we are done. The Base64 encoding of ‘Hi!’ is ‘SGkh’. You can verify this yourself using an online Base64 encoder.
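
If you would rather not do the binary conversion by hand, the intermediate values from this example can be reproduced in a few lines. This is only a sketch of the first steps, with the final answer cross-checked against Python’s standard base64 module instead of an online encoder:

```python
import base64

message = "Hi!"

# Steps 1 and 2: 8-bit binary for each character, joined together
bits = "".join(format(ord(c), "08b") for c in message)
print(bits)      # 010010000110100100100001

# Step 4: split into sextets (24 bits, so no zero padding was needed)
sextets = [bits[i:i + 6] for i in range(0, len(bits), 6)]
print(sextets)   # ['010010', '000110', '100100', '100001']

# Cross-check the final answer against the standard library
reference = base64.b64encode(message.encode("ascii")).decode("ascii")
print(reference)  # SGkh
```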

This was a very simple example due to the fact that it did not require padding. Let’s now look at another example where padding is necessary.

Example 2

Say that we wanted to encode the string ‘Test’.

We start by again converting each character to its ASCII code in binary:

Character ASCII (decimal) ASCII (binary)
‘T’ 84 01010100
‘e’ 101 01100101
‘s’ 115 01110011
‘t’ 116 01110100

Now we join these together into a single string:

01010100011001010111001101110100

Note that there are 4 * 8 = 32 bits in this string, which is not a multiple of 6. The next largest multiple of 6 is 36. Since this is 4 greater than 32, we must add 4 zeroes to the end of our binary string, yielding:

010101000110010101110011011101000000

Now we can break this binary string into sextets:

'010101', '000110', '010101', '110011', '011101', '000000'

Next, we encode these using our Base64 table, which gives the following string:

'VGVzdA'

We’re not quite done. Remember that we had to add four zeroes as padding in step 3 in order to ensure that the number of bits in our binary string was a multiple of 6. Whenever we add four zeroes, we have to add two ‘=’ characters to the end of our Base64 encoded string to signify this. After doing so, we have our Base64 encoded string of:

'VGVzdA=='

Again, go ahead and verify that this is correct using an online Base64 encoder.

We did not cover an example where the number of zeroes required as padding was 2, but just remember that when we add 2 zeroes as padding, we must append a single ‘=’ to the end of our Base64 encoded string.
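
For completeness, here is a short sketch of that remaining case. I picked the two-character string ‘Hi’ as the input because 16 bits is two short of 18; the ALPHABET string is just the table above written out in index order.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

message = "Hi"
bits = "".join(format(ord(c), "08b") for c in message)   # 16 bits
zeros = (-len(bits)) % 6                                  # 2 zeroes needed
bits += "0" * zeros                                       # now 18 bits

sextets = [bits[i:i + 6] for i in range(0, len(bits), 6)]
encoded = "".join(ALPHABET[int(s, 2)] for s in sextets) + "=" * (zeros // 2)

print(sextets)   # ['010010', '000110', '100100']
print(encoded)   # SGk=

# Same result from the standard library
assert encoded == base64.b64encode(message.encode("ascii")).decode("ascii")
```

Two zeroes were appended, so a single ‘=’ ends up on the encoded string.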

Now let’s move on to the Base64 decoding process.

Decoding

Decoding a Base64 string is simply the inverse of the encoding process.

  1. Determine padding by counting the number of ‘=’ characters at the end of the encoded string. This will be zero, one, or two.
  2. Convert each character in the encoded string back to binary using the Base64 table.
  3. Combine these binary strings into a single binary string.
  4. Remove trailing zeroes from this string based on the number of ‘=’ characters in the original encoded string (two zeroes per ‘=’).
  5. Break this binary string down into 8-bit octets.
  6. Convert these 8-bit binary strings back into ASCII characters.

Let’s again look at a simple example.

Example 1

We’ll decode the Base64 encoded string ‘Y3liZXI=’.

First, we see a single ‘=’ character at the end. We know that this indicates padding of two zeroes. Keep that in mind.

Next, we convert each character back to its binary equivalent using our Base64 table. Doing so yields the following 6-bit sextets:

Encoded character Binary
‘Y’ 011000
‘3’ 110111
‘l’ 100101
‘i’ 100010
‘Z’ 011001
‘X’ 010111
‘I’ 001000

Now we combine these into a single string:

011000110111100101100010011001010111001000

Next, we remove the number of trailing zeroes as indicated by the padding. As mentioned in step 1 we had a single ‘=’ character, which indicates padding of two zeroes. Remove them:

0110001101111001011000100110010101110010

Now we break this string into 8-bit octets:

'01100011', '01111001', '01100010', '01100101', '01110010'

Lastly, we convert these back into ASCII characters:

Binary ASCII code ASCII character
01100011 99 ‘c’
01111001 121 ‘y’
01100010 98 ‘b’
01100101 101 ‘e’
01110010 114 ‘r’

The decoded message is ‘cyber’.
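
The same walk-through can be sketched in a few lines of Python. This is a bare-bones illustration that assumes the input is valid, padded Base64 containing ASCII text; the ALPHABET string is the table above written out in index order, so a character’s position in it is its 6-bit value.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

encoded = "Y3liZXI="

# Step 1: count the '=' characters, then set them aside
padding = encoded.count("=")          # 1
chars = encoded.rstrip("=")

# Steps 2 and 3: look up each character's 6-bit value and join the results
bits = "".join(format(ALPHABET.index(c), "06b") for c in chars)

# Step 4: drop two trailing zeroes per '='
if padding:
    bits = bits[:-2 * padding]

# Steps 5 and 6: split into octets and convert back to characters
octets = [bits[i:i + 8] for i in range(0, len(bits), 8)]
decoded = "".join(chr(int(o, 2)) for o in octets)

print(octets)    # ['01100011', '01111001', '01100010', '01100101', '01110010']
print(decoded)   # cyber
```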

Rolling our own Base64 encoder/decoder in Python

Now that we have an understanding of how Base64 works, let’s cement our knowledge by writing a Base64 encoder/decoder in Python.

A few caveats before we start. First, this Python script is for educational purposes only. Although you could use it to encode text-based messages, in practice you would prefer the base64 module in Python or the base64 command-line utility, for example. Second, Base64 is a binary-to-text scheme, so it can encode any binary data, such as a jpg file or an executable. For the sake of example, however, our Base64 encoder/decoder will only accept text-based messages. Third, the code we write will not be as efficient as what you might find in the Python base64 module (or other implementations). Our code will be intentionally expository to aid the learning process.
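
As a quick illustration of the first two caveats, here is a minimal sketch of the standard library’s base64 module working on raw bytes rather than text. The four bytes below are arbitrary values I picked for the example (they happen to match the start of the PNG file signature).

```python
import base64

# Arbitrary bytes, including one value (0x89) outside the ASCII range
raw = bytes([0x89, 0x50, 0x4E, 0x47])

encoded = base64.b64encode(raw)        # bytes in, bytes out
print(encoded.decode("ascii"))         # iVBORw==

# Decoding returns the original bytes unchanged
assert base64.b64decode(encoded) == raw
```

Note that b64encode and b64decode operate on bytes objects, which is what lets them handle any binary data and not only text.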

We’ll encapsulate our code in a Python class. This class will have one attribute and two methods. The attribute will be our Base64 table represented as a dictionary. It will have a method to encode and one to decode. I’ve gone ahead and imported the ceil function from the math module as we’ll need that for our encoding method. I’ve also imported the base64 module. We’ll use that for testing to ensure that our home-brewed encoder/decoder is working properly.

Below is the boilerplate script:

#!/usr/bin/python3

from math import ceil
import base64


class Base64e:

    def __init__(self):
        pass

    def encode(self, input):
        pass

    def decode(self, input):
        pass


if __name__ == "__main__":
    pass

Here I name our class ‘Base64e’, where our __init__() and encode/decode methods are not yet defined. Let’s begin by coding our __init__() method. Here we will define the dictionary to represent our Base64 encoding table. Any new instance of our Base64e class will automatically contain this dictionary. This is straightforward:

class Base64e:

    def __init__(self):
        self.base64_encoding = {
            '000000': 'A',  '010000': 'Q',  '100000': 'g',  '110000': 'w',
            '000001': 'B',  '010001': 'R',  '100001': 'h',  '110001': 'x',
            '000010': 'C',  '010010': 'S',  '100010': 'i',  '110010': 'y',
            '000011': 'D',  '010011': 'T',  '100011': 'j',  '110011': 'z',
            '000100': 'E',  '010100': 'U',  '100100': 'k',  '110100': '0',
            '000101': 'F',  '010101': 'V',  '100101': 'l',  '110101': '1',
            '000110': 'G',  '010110': 'W',  '100110': 'm',  '110110': '2',
            '000111': 'H',  '010111': 'X',  '100111': 'n',  '110111': '3',
            '001000': 'I',  '011000': 'Y',  '101000': 'o',  '111000': '4',
            '001001': 'J',  '011001': 'Z',  '101001': 'p',  '111001': '5',
            '001010': 'K',  '011010': 'a',  '101010': 'q',  '111010': '6',
            '001011': 'L',  '011011': 'b',  '101011': 'r',  '111011': '7',
            '001100': 'M',  '011100': 'c',  '101100': 's',  '111100': '8',
            '001101': 'N',  '011101': 'd',  '101101': 't',  '111101': '9',
            '001110': 'O',  '011110': 'e',  '101110': 'u',  '111110': '+',
            '001111': 'P',  '011111': 'f',  '101111': 'v',  '111111': '/'
        }

Next, we will write the encode method. This will take one argument called input, which is our data to encode. I won’t go into a lot of detail here as the code, coupled with the comments, should be fairly self-explanatory.

    def encode(self, input):

        # convert ASCII characters to binary representation
        octets = [format(ord(c), '08b') for c in input]

        # combine octets into single binary string
        binary = ''.join(octets)

        # determine number of sextets needed 
        n_sextets = ceil(len(binary) / 6)

        # determine # of zeroes for padding
        remainder = 6 * n_sextets - len(binary)

        # append zeroes to binary string
        binary = binary + ('0' * remainder)

        # split binary string into 6-bit sextets
        sextets = [binary[i:i+6] for i in range(0, len(binary), 6)]

        # convert 6-bit sextets to Base64 encoding, add '=' padding
        output = ''.join([self.base64_encoding[b] for b in sextets]) + "=" * (remainder // 2)

        return output 

Then we will write our decode method. It also takes one argument input, which is a Base64 encoded string. In order to maintain our focus, we are not doing any input validation here and assume the user will input a valid Base64 encoded string. However, it would be best practice to perform such validation. I leave this as an exercise for the reader.

    def decode(self, input):

        # determine padding
        padding = input.count('=')

        # remove padding
        input = input.replace("=", "")
        
        # convert characters to binary sextets using Base64 table
        binary = [key for char in input for key, value in self.base64_encoding.items() if char == value]

        # join sextets into single binary string
        binary = ''.join(binary)

        # remove number of trailing zeroes as determined by padding
        binary = binary[:-(padding*2)] if padding else binary

        # split binary string into 8-bit binary octets
        octets = [binary[i:i+8] for i in range(0, len(binary), 8)]

        # convert octets back to ASCII characters
        output = [chr(int(octet, 2)) for octet in octets]

        # combine ASCII characters to string
        output = ''.join(output)
        
        return output
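
As a starting point for that exercise, here is one possible sketch of a validity check. The helper name looks_like_base64 is hypothetical and is not part of the Base64e class. It only checks the shape of the string (allowed characters, length, and placement of ‘=’), and it assumes the padded form of Base64 used throughout this post.

```python
import re

# Groups of four alphabet characters, optionally ending with a padded group:
# two characters plus '==' or three characters plus '='.
_B64_PATTERN = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)

def looks_like_base64(s):
    """Return True if s has the shape of padded RFC 4648 Base64."""
    return _B64_PATTERN.fullmatch(s) is not None

print(looks_like_base64("VGVzdA=="))   # True
print(looks_like_base64("VGVzdA="))    # False (wrong amount of padding)
print(looks_like_base64("SG kh"))      # False (space is not in the alphabet)
```

A check like this could run at the top of decode and raise a ValueError on failure. It does not verify that the discarded padding bits are actually zero, so treat it as a first filter rather than a complete validator.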

Finally, let’s write some code to test out our Base64e class. Here we just prompt the user for a string to encode, initialize an instance of our Base64e class called encoder, and encode the inputted message. Then we print the results of our encoding, along with the results of encoding the same message with the Python base64 module. We also plug the encoded message into the decode method to ensure that we have the correct decoded message.

if __name__ == "__main__":
    # use a name other than `input` so the built-in function is not shadowed
    message = input("Enter text to encode: ")
    encoder = Base64e()
    encoded = encoder.encode(message)
    decoded = encoder.decode(encoded)

    # reference results from the standard library for comparison
    module_encoded = base64.b64encode(message.encode()).decode('utf-8')
    module_decoded = base64.b64decode(module_encoded).decode()

    print("Encoded values: ", f"Our encoder:           {encoded}", f"base64 module encoder: {module_encoded}", sep="\n")
    print("Decoded values: ", f"Our decoder:           {decoded}", f"base64 module decoder: {module_decoded}", sep="\n")

Now let’s try it out. Below are the results of running the script, entering the text “Hi!” when prompted.

Enter text to encode: Hi!
Encoded values: 
Our encoder:           SGkh
base64 module encoder: SGkh
Decoded values: 
Our decoder:           Hi!
base64 module decoder: Hi!

We have indeed obtained the correct encoded and decoded message. I’d encourage you to try the code yourself and test out various inputs. You can get the complete script from my GitHub here.

Hopefully you’ve gotten a much better understanding of how Base64 works. I certainly have after writing this Python script. Again, I often find the best way to gain an in-depth understanding of something like Base64 encoding is to write my own code. I hope to do more posts like this in the future where we can take a deep dive into some other encoding schemes.