Intro to representing data

0. Learning objectives

Explore how computers represent data under the hood
Explore how computers perform calculations under the hood
Convert between decimal, binary and hexadecimal representations
Perform arithmetic operations in binary
Start looking at how other type of data is represented
Get a taste of computer architecture

We are going to shift gears a bit. We'll start talking about how computers

represent and
perform calculations on

the data in your programs.

What do think you should get if you evaluate these expressions?

5 * 0.2 == 1
3 * 0.1 == 0.3

solution

True: this is what I would expect, since $5 \times 0.2$ is equal to $1$.
False: WHAT? Python! Com'on! $3 \times 0.1$ is equal to $0.3$! Well, actually the number $0.1$ cannot be represented exactly on a computer, so if you try to compute 3 * 0.1, you should see 0.30000000000000004. Strictly speaking, this is not equal to 0.3 so that's why the result of the expression is False.
So what's going on here? Python (and computers in general) represents numbers using a format that we will explore below, and it turns out that $0.1$ cannot be represented exactly using this format. So when you multiply it with $3$, the result will also be inexact. Let's investigate why.

Here is a post I saw on Instagram related to why 1/3 has infinitely repeating digits in a decimal number system. This is exactly what happens when we try to represent 0.1... where the knife limits the numbers of 1s and 0s that we can use in the representation.

1. How does a computer work?

At some point, all the statements and data that we have been working with in Python have to be converted into a bunch of 0s and 1s in your computer. Why? Well, the most fundamental building block for your computer is a transistor. There are actually billions of transistors in your computer. Transistors behave like switches (think of the light switch in your room). Simply speaking, transistor can be in one of two states: either ON or OFF. There are many types of transistors. On the right you can see one of the symbol used to represent one of these types transistor is electronic circuits. Discussing the details of how a transistor works is outside the scope of this course. You can think of it as a device that can only represent a single piece of information that is either 0 or 1 (ON or OFF). Not super interesting, but what if have two transistors? What about three transistors? How many possible states can we represent?

solution

Remember that each transistor can take on two possible states. Let's consider the case of two transistors and we'll call these transistor A and transistor B.

Suppose transistor A is ON. Then transistor B can be either ON or OFF. So there are two possibilities there.

Now suppose transistor A is OFF. The transistor B can still be either ON or OFF.Another two possibilities!

This means there are a total of 4 possible states when we have two transistors. This generalizes to $s = 2^n$ where $s$ is the number of possible states, and $n$ is the number transistors. So with three transistors, there are 8 possible states.

So our fundamental unit for storing information is something that is either 0 or 1. We call this a BInary digiT, or a bit for short.

Let's explore an example of all unique combination that we can have if we use 4 bits.

bit 1	bit 2	bit 3	bit 4

Each set of bits (row) represents a unique combination and we can count these combinations (right column). The larger the number of bits that we use, the more combinations we can have. Just like we said the number of possible combinations with $n$ bits is equal to $2^n$. So, what if we decide to associate an integer to each combination like we have done in the right column? Then we could represent any number from $0$ to $2^n - 1$ with $n$ bits.

There are some common sizes of bits that are usually used to represent numbers: a nibble consists of 4 bits, a byte consists of 8 bits and a word consists of 32 bits (though 64 bits is becoming more common). What is the largest positive integer you can represent with 64 bits?

solution

With 64 bits, we can represent any number between 0 and $2^{64}-1$, which is 18,446,744,073,709,551,615.

So the question now is: which combination of bits should we assign to which number? Spoiler alert: I didn't pick the combination above and the corresponding value in the right column randomly!

2. Representing information: positive integers

If you look at the table above, we decided that 1001₂ (the subscript 2 is to clarify that this number is written in base-2 - more about it below - and distinguish it from one-thousand-and-one) is a representation of number 9₁₀, and 1111₂ is a binary representation for the number 15₁₀. But how do we go from the 0s and 1s to a number we're familiar with? We'll look at this shortly, but first we need to understand the different ways numbers can be represented. In fact, the numbers 9₁₀ and 15₁₀ are represented in the "usual" decimal system we are familiar with, i.e. we use base-10.

Let's explore some common ways of representing numbers, and then we'll look at how we can convert between one number representation and another.

The main concept to remember when we look at different representations is the idea of the base, which represents how many options we have for each digit. You're definitely familiar with the base-10 numbering system (any idea why 10 is so popular?). That is, the numbering system in which we have 10 digits (0-9) to represent a number. What numbers can you represent with two base-10 digits? Well we can represent the numbers 0₁₀-99₁₀, or up to $10^2 - 1$. With $3$ base-10 digits? We can represent the numbers 0₁₀-999₁₀, or $10^3-1$. Hmm, this sounds familiar. Therefore, with a base $b$ numbering system and $n$ digits, we can represent numbers between 0 and $b^n-1$.

2.1. Binary numbers (base-2)

The goal we have is to represent different types of numbers as a sequence of 1s and 0s. For example, 1101₂ is a number represented in binary, and 1001111010110110₂ is also a binary number. We can also forget about leading 0s: in other words, 1101₂ is the same number as 0000000000001101₂.

To convert from a binary (base-2) number to a base-10 number, you should read the bits from right to left. Since we work with positional numbering system, where the position of a digit represents how "significant" that digit is within the number, we can write a number as:

$$d_{3}d_{2}d_{1}d_{0} = d_{3}\cdot b^{3}+d_{2}\cdot b^{2}+d_{1}\cdot b^{1}+d_{0}\cdot b^{0}$$

Where the $d$ are the digits, $b$ is the base, and the subscripts represent the location. This means that 5183 can be thought as:

$$5183=5\cdot10^{3}+1\cdot10^{2}+8\cdot10^{1}+3\cdot10^{0}$$

If we apply this idea to binary numbers, then the rightmost bits represent the lowest powers of two (starting with $2^0$ at the right). For example, let's look at the number 1101₂:

$$ 1101_{2} = 1\cdot2^{3}+1\cdot2^{2}+0\cdot2^{1}+1\cdot2^{0} = 8 + 4 + 1 = 13.$$

This mean that we can use this process to find a "natural" mapping between integers in base 10 and binary numbers. So, the binary number 1101₂ represents the number 13₁₀ in the computer! Perfect since it is only 1s and 0s.

Let's do another example: try to see what decimal integer is represented by the 16-bit number 11001001₂

Solution

This is a long one. Remember, we start counting positions from the very right (position zero) and each of the digit counts as current base (in this case 2) to the power of that position. So, the rightmost digit is worth $2^{0}$, the next one to its left $2^{1}$, the next one over $2^{2}$ and so on:

$$ 1\cdot2^{7} + 1\cdot2^{6} + 0\cdot2^{5} + 0\cdot2^{4} + 1\cdot2^{3} + 0\cdot2^{2} + 0\cdot2^{1} + 1\cdot2^{0} = 128 + 64 + 8 + 1 = 201. $$

Great, right? You can convert any integer number from base-2 (binary) to base-10 (decimal). Here is a few more for you to check out:

1101₂
10010001₂
11100000₂
11111100110₂

...to base-10!

1101₂ => 13₁₀
10010001₂ => 145₁₀
11100000₂ => 224₁₀
11111100110₂ => 2022₁₀

So we can convert numbers from binary to decimal. But how do we convert decimal to binary?

Here is one way"

Find the highest power of 2 that is less than or equal to $x$. For example, for the number 13, the highest power is $3$ (i.e. $2^3 = 8$ is the highest power of two less than 9 since the next power of two, $2^4$ is equal to 16). Save the exponent of the highest power of 2 and call it $p$ (in this example $p = 3$). Put a 1 at that position (3) in the binary number that you are building: 1xxx₂ (for our example with 13).
Subtract this highest power of 2 from $x$, i.e. compute the difference $y = x - 2^p$. In our case $y=13-2^{3}=5$
Go back to step 1 and, this time, use $y$ as starting number, i.e. find the highest power of $2$ that is less than or equal to $y$. In this case 2 since $2^{2}=4$ and $2^{3}=8$ and $y=5$. Mark a 1 at position 2 in the binary number you are building (11xx₂). If you have positions in your binary number that you have not filled between this round and the previous round, then you should fill them up with 0s. (Try, for example, to follow this process with the number 9₁₀
Keep going until $y$ is either 0 or 1, at which point you can write that in position 0, since $2^{0}=1$.

Sounds recursive, doesn't it? Maybe you can challenge yourself to write a function that does this conversion.

Using the steps above, convert 233₁₀ into binary.

solution

Starting with 233₁₀, the highest power of 2 is 128 (i.e. $2^7$). So we will have a 1 in the 7-th position of the binary number 1xxxxxxx₂. Now we compute 233-128 = 105. The highest power of 2 less than 105 is 64 (i.e. $2^6$), so we can write a 1 in the 6-th position: 11xxxxxx₂. We then have 105-64 = 41. The highest power of 2 less than 41 is 32 (i.e. $2^5$), so put a 1 in the 5-th position: 111xxxxx₂. We then have 41-32 = 9. The highest power of 2 less than 9 is 8 (i.e. $2^3$), so we put a 1 in the 3-rd position 111x1xxx₂, but, since we have left an empty spot (i.e., $2^4$), we have to fill it with 0: 11101xxx₂. Then we compute 9-8 = 1, which is equal to $2^0$, so we have a 1 in the 0-th position: 11101xx1. This means that we have to fill the empty position for $2^1$ and $2^2$ with zeros. Thus 233 in binary is equal to 11101001₂. It's always a good idea to check your result by using our previous algorithm for converting from binary back to decimal, i.e. computing $ 1\cdot2^{7} + 1\cdot2^{6} + 1\cdot2^{5} + 1\cdot2^{3} + 1\cdot2^{1} = 128 + 64 + 32 + 8 + 1 = 233_{2}$.

2.2. Hexadecimal numbers (base-16)

In a hexadecimal system, each digit can hold one of 16 values. So, what can we use for digits? We can start with our friends 0-9 to cover the first 10, then we need something else for the following 6. One of the most convenient choices is letters since they have a sort of natural ordering. So we are going to use A-F to represent the digits with value 10-15. This might sound weird, but it's very convenient when handling larger binary numbers. Remember that a cluster of 4 bits (a nibble) can represent a number between 0-15 which conveniently map to hexadecimal numbers between 0-F. So, in a large binary number, you can read off clusters of 4 bits starting from the right, convert each individual cluster into an hexadecimal digit, and then, if you need to, convert the hexadecimal number into a decimal to figure out what original binary number represented.

Using 4 bits, the conversion from binary (base-2), decimal (base-10) and hexadecimal (base-16) systems is summarized in the following table.

bit 1	bit 2	bit 3	bit 4	decimal	hexadecimal

Converting to/from a hexadecimal system follows the same algorithms that we described above to convert to/from a binary system. The only difference is that we work with powers of 16 instead of powers of 2.

For example, 2A₁₆ = $2 \cdot 16^1 + 10 \cdot 16^0 = 42_{10}$ (remember that A₁₆ represents 10₁₀).

Similarly to what we did for binary, converting the decimal number 981₁₀ to hexadecimal requires finding the highest power of 16 that is less than 981₁₀, which in this case is $16^2 = 256$. The trick, now, is to find how many 256₁₀ fit in 981₁₀. In binary, we only had to consider two digits either 0 or 1. In hexadecimal we might be able to fit up to 15 of our powers into the number! In this example, we can fit $3\cdot16^2$ into 981₁₀, since $3 \times 16^2 = 768_{10}$. So, the digit that we want to write down at position 2 is 3: 3xx₁₆. Next, we compute 981₁₀-768₁₀ = 213₁₀. The next highest power of 16 is $16^1$, and we can fit 13 of those into 213₁₀, since $13\cdot16^1 = 208_{10}$. Now, the hexadecimal digit that corresponds to 13₁₀ is D₁₆. So we write that at position 1: 3Dx₁₆. We then have 5 left over, so the $16^0$ slot is 5. Therefore, 981₁₀ can be represented as 3D5₁₆ in hexadecimal.

Alternatively, you could convert the decimal number to binary, and then read off chunks of 4 to convert to hexadecimal. Following the procedure from the previous section gives

$$ 981_{10} = \underbrace{\texttt{0011}}_{3_{10}}\underbrace{\texttt{1101}}_{13_{10}} \underbrace{\texttt{0101}_{2}}_{5_{10}} $$

which is again 3D5₁₆. Note that you'll need to write out the above table (or remember it) to do the conversion.

2.3. Positive integer representation summary

In every base, the representation of a number is positional just like the decimal representation is.

The following table shows the two representations for the first 7 positions.

Digit position	7	6	5	4	3	2	1	0
Positional representation in an arbitrary base b	b⁷	b⁶	b⁵	b⁴	b³	b²	b¹	b⁰
Representation for b = 10	10⁷	10⁶	10⁵	10⁴	10³	10²	10¹	10⁰
Values for b = 10	10000000	1000000	100000	10000	1000	100	10	1
Representation for b = 2	2⁷	2⁶	2⁵	2⁴	2³	2²	2¹	2⁰
Values for b = 2	128	64	32	16	8	4	2	1
Representation for b = 16	16⁷	16⁶	16⁵	16⁴	16³	16²	16¹	16⁰
Values for b = 16	268435456	16777216	1048576	65536	4096	256	16	1

Ex:                      1034     = (value in decimal - base 10 - representation)
    1*1000 + 0*100 + 3*10 + 4*1   = (in decimal representation)
    1*10³  + 0*10² + 3*10¹ + 4*10⁰ =
                         1034       (value in decimal representation)


                         1011     = (value in binary - base 2 - representation)
        1*8  + 0*4 + 1*2  + 1*1   = 
        1*2³ + 0*2²+ 3*2¹+ 4*2⁰  =
                           11       (value in decimal representation)


                           2A    =  (value in hexadecimal - base 16 - representation)
                     2*16 + A*1  = 
                     2*16¹+ A*16⁰ =
                           42       (value in decimal representation)

In bases larger that 10 - for example the hexadecimal (b = 16) - we need to encode digit values larger than 9 with unique symbols. The usual choice is to use letters since they have an underlying sorting (A < B < C < ...). For hexadecimal, the mapping is in the following table:

Digit in hexadecimal	Decimal representation	Binary representation
0	0	0000
1	1	0001
2	2	0010
3	3	0011
4	4	0100
5	5	0101
6	6	0110
7	7	0111
8	8	1000
9	9	1001
A	10	1010
B	11	1011
C	12	1100
D	13	1101
E	14	1110
F	15	1111

So, for each digit of hexadecimal (hex), you have 4 uniquely corresponding digits of binary. This makes the conversion from binary to hex (and vice-versa) straightforward.

Ex: ABCD = 1010101111001101
    104F = 0001000001001111

2.4 Doing math with binary numbers

We can actually do arithmetic with binary numbers the same way we do arithmetic with decimal numbers. The main difference is the "carry" principle. When doing arithmetic with decimal numbers, you "carry a one" if your current digit is greater than 9. In binary arithmetic, we "carry a one" if the current digit is greater than 1. For example if we want to add the two binary numbers 111₂ + 110₂: $$ \begin{array}{cccc} \textcolor{red}{1} & \textcolor{red}{1} & & \\ & 1 & 1 & 1 \\ + & 1 & 1 & 0 \\ \hline \textcolor{blue}{1} & \textcolor{blue}{1} & \textcolor{blue}{0} & \textcolor{blue}{1} \end{array} $$ which gives 1101₂ which is equal to 13₁₀. Note that 111₂ is 7₁₀ and 110₂ is 6₁₀, so $7 + 6 = 13$). As an exercise, try to add 1101₂ with 0011₂.

solution

1101₂ represents 13₁₀, and 0011₂ represents 3₁₀, so we whould get 16₁₀: $$ \begin{array}{ccccc} \textcolor{red}{1} & \textcolor{red}{1} & \textcolor{red}{1} & \textcolor{red}{1} & \\ & 1 & 1 & 0 & 1 \\ + & 0 & 0 & 1 & 1 \\ \hline \textcolor{blue}{1} & \textcolor{blue}{0} & \textcolor{blue}{0} & \textcolor{blue}{0} & \textcolor{blue}{0} \end{array} $$ which is 16₁₀!

What about multiplication? Multiplication follows the same idea as in decimal arithmetic. Actually, multiplication in binary arithmetic is nicer because the maximum you can get with any multiplication of a 0 and 1 is 1, so you never need to "carry" anything in the multiplication phase. You only need to carry a 1 during the addition phase. For example, multiplying 10₂ (2₁₀) with 11₂ (3₁₀) should give 6₁₀:

$$ \begin{array}{cccc} & & 1 & 0 \\ & \times & 1 & 1 \\ \hline & & 1 & 0 \\ + & 1 & 0 \\ \hline & \textcolor{blue}{1} & \textcolor{blue}{1} & \textcolor{blue}{0} & \end{array} $$

For more practice, try to multiply 1101₂ with 1001₂.

solution

1101₂ represents 13₁₀, and 1001₂ represents 9₁₀, so we whould get 117₁₀: $$ \begin{array}{ccccccc} & & & & 1 & 1 & 0 & 1 \\ & & & \times & 1 & 0 & 0 & 1 \\ \hline & & & & 1 & 1 & 0 & 1 \\ & & & 0 & 0 & 0 & 0 & \\ & & 0 & 0 & 0 & 0 & & \\ + & 1 & 1 & 0 & 1 & & & \\ \hline & 1 & 1 & 1 & 0 & 1 & 0 & 1 & \end{array} $$ which is 117₁₀!

3. Representing information: real numbers

The systems we have looked at so far are good for representing integers. But what about real numbers?! For example, how would you represent the number $0.5$? Well, let's think about how we can write 0.5 if we think about our base-10 representation: $0.5 = 0 \cdot 10^{0} + 5 \cdot 10^{-1}$. This seems to indicate that if we have position from 0 up to the left of the decimal point, we have position from -1 down to its right. What if we extend our base-2 system to include negative positions as well? In our positional representation, they will result in negative powers of 2. This means that, if we agree on a spot in the middle of our binary number where the "decimal point" should go, 0.5 could be represented with a single bit to the right of this location that represents $2^{-1}$:

$$ 0.5_{10} \rightarrow 0.1_{2} = 0 \cdot 2^0 + 1 \cdot 2^{-1} $$

The same idea can be used to represent numbers like $0.25_{10}$, which is $0.01_{2} = 0 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2}$, and $0.75_{10}$ which is $0.11_{2} = 0 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2}$. But now what about 3.25₁₀? 6.5₁₀? 13₁₀? The issue we encounter is that all of these numbers have the same pattern of 0s and 1s:

$$\begin{eqnarray} 3.25_{10} &= 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2} &= 11.01_{2} \\ 6.5_{10} &= 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0 + 1 \times 2^{-1} &= 110.1_{2} \\ 13_{10} &= 1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 &= 1101_{2} \end{eqnarray} $$

I added a decimal point in the examples above, which highlights that somehow we'll need to keep track of where the decimal point should go, with the understanding that, in binary representation, there is actually NO decimal point and what we do is somehow subdivide the bits in the binary number and decide which one represent the integer part and which one are used to represent the decimal part of the original real number. On top of that, we also need a way to keep track of the sign of the number. Is it positive or negative? One way to do this is to use a bit to identify the sign: 0 if positive and 1 if negative.

All these details are captured in a standard way that is used to represent real numbers in computers. This is called the floating-point numbers notation and it has a format similar to the scientific notation:

$$ -1^s \times F \times 2^e $$

where the first bit to the left is used to store the sign of the number ($s$), a certain number of bits for the mantissa $F$, and another bunch of bits for the exponent $e$. This representation is like a base-2 form of scientific notation.

We are not going into the details about how real numbers convert back and forth from this notation, you will see a little more about this in computer architecture. What we are going to do here, is to get a feeling about why certain numbers cannot be exactly represented using this notation, giving us weird answers like 3 * 0.1 -> 0.30000000000000004.

Remember that we represented integers with a fixed number of bits, so there was some largest integer that we could represent. For example we could represent the numbers 0-255 with 8 bits, but cannot represent 260 unless we use an extra bit. The floating-point representation can only use a fixed number of bits as well that we can use to store $F$ and $e$. However, since we are restricted to use powers of 2, not every number can be exactly represented. For example, the number $0.00675$ is representable exactly because it is $2^{-4}$ and $0.125$ is representable exactly because it is $2^{-3}$. Turns out that $0.1$ is not representable exactly! Let's try converting $0.1_{10}$ into a binary representation using an extension of the process we used above.

Starting with $0.1_{10}$, the highest power of 2 bigger than $0.1$ is $-4$ ($2^{-4} = 0.0675$), so in binary we have (with the decimal point only placed for clarity): $0.0001\dots_{2}$ where we still need to figure out what comes after position -4. Following our conversion procedure, we subtract: $0.1 - 2^{-4} = 0.0375$. Now we ask ourselves: what is the highest power of two that is less than $0.0375$? It is $2^{-5} = 0.03125$. So we can extend our binary representation: $0.00011\dots_{2}$. We now subtract $0.0375 - 2^{-5} = 0.00625$. If we keep going, we find a $1$ in the $2^{-8}$ slot, with a remainder of $0.0023475$, giving us $0.00011001\dots_{2}$ We then have a $1$ in the $2^{-9}$ slot, with a remainder of $0.000394375$, givin us $0.000110011\dots_{2}$. If you continue with this process, you might notice that the next digit is a 0, then another 0, then a 1, then another 1, then a 0.... and so on. It turns out that the representation of 0.1₁₀ in binary is periodic (just like 1/3 in decimal)!!! But we said that we only have a limited number of digits that we can use in our binary representation so, at some point, we will have to truncate our periodic binary number. This is what leads to loss of precision in our computations!

4. Representing information: other stuff

In order to represent floating-point numbers (as in the previous section), we had to come up with a standard (a convention we all agree on) for representing these numbers using bits. In fact, were using the IEEE 754 standard. We need these standards because, otherwise, a number might be represented in a certain way on computer A, but represented differently on computer B, which would be a big problem when trying to exchange information between those computers. There are many different standards that were developed over the years to make sure everybody builds computer that can easily exchange data (using 1s and 0s since that's all computers deal with).

Let's see an example considering characters. Imagine you sent a text message to your friend, where the characters on your phone are represented in a different way than they are on your friend's phone. This might cause some pretty bad communication problems! In fact, each character on your keyboard is associated with a specific sequence of bits. This is known as the American Standard Code for Information Interchange (ASCII), which is a standard for encoding characters. See the table below to the left. On the right we also show the table where, instead of the binary number associated with each character, we show the corresponding decimal and hexadecimal representation (You can verify that the binary values correspond to the decimal and hexadecimal values for a given character as an exercise!).

But what about emojis?? As you can see, the ASCII character set does not include emojis. Furthermore, ASCII is a very anglocentric encoding. What about all the beautiful languages in the word that do not use the Latin alphabet? In order to be able to include all these characters (and emojis) in your text message to your friend, then your message has to be encoded in a sequence of 1s and 0s and to do that, we need a mapping from each one of the characters and emojis into a sequence of bits. This is accomplished by the Unicode Standard (UTF-8).

Now, we could spend days and days here to talk about encoding (there are entire fields of engineering and computer science that study the best way to encode information), but the main message here is that in order to share information with other people, we need agree upon some encoding standard.

The same principle applies for all sort of other things that you use on your computer every day: sounds, images, videos, etc... Once we get familiar with few more data structures in Python, we are going to revisit this idea and explore how some of these things can be encoded and manipulated.